I recently had a conversation with someone who is very experienced in AI and hardware. When I said I am interested in solving mechanistic interpretability and sent him my Optimization Daemons article, he pushed back.
More explicitly, he said:
"Optimization is just computation. Are you just scared of computation?"
According to him, any safety issue from AI is just a reliability and skill issue. He thinks the safest AI is the most accurate AI, and that current LLMs fail a lot, pretty consistently. This could not be further from the truth.
Social Media Algorithms
When I suggested that the most accurate AI can be even more dangerous, I pointed to the clearest current example: social media algorithms. They got so powerful that people cannot leave them.
I know multiple folks who spend 2-3 hours a day mindlessly doomscrolling, and they feel sad and powerless about it.
It's easy to say "simply uninstall the app" - try saying that to any addict, whether the addiction is smoking, drinking, or drugs.
Of course they know it's bad for them, but they are addicted; their dopamine system has been hacked.
Even I had to resort to extreme measures like hiding my phone in a completely different room under a huge pile of clothes lol, all to do a dopamine detox. (I will write an article on dopamine sometime, ideally very soon.)
To this his answer was that this is not an issue with the AI but with the companies: the companies are evil, and they are enabling the model to do this. I think this is a very simplistic take - they are definitely not explicitly programming their algorithm to enrage people; they are just training it on user engagement data and asking it to predict engagement with the best possible accuracy.
But I will grant him that they are enabling it, and that these companies are evil.
But given an option to switch these features off - and by these features I mean ranking posts that are more likely to enrage people higher - maybe people would take it?
While writing the previous point, I just had an idea:
What if these companies had Enraged and Learned buttons instead of just Like or Repost?
The AI then optimizes for what the user explicitly wants - again, who am I to stop someone who wants to see posts that enrage them? - but now it is explicit.
When users see something that enrages them, they click Enraged, and maybe Like as well if they want to see more posts like it.
We should also have a Dislike button; no need to make dislikes public, but dislikes should feed into better recommendations.
When we make it this explicit, maybe it will be better? But now I am even more convinced that these companies are evil for not doing this already.
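To make the idea concrete, here is a minimal sketch of what such a ranker could look like. Everything in it - the button names, the weights, the per-user setting - is hypothetical, invented purely for illustration:

```python
# Hypothetical sketch: ranking driven by explicit feedback signals instead of
# raw engagement. All signal names and weights are made up for illustration.
from dataclasses import dataclass

@dataclass
class Feedback:
    likes: int = 0
    enraged: int = 0   # the explicit "this enrages me" button
    learned: int = 0   # the explicit "I learned something" button
    dislikes: int = 0  # kept private, used only for recommendations

def score(fb: Feedback, wants_enraging: bool) -> float:
    """Rank by what the user explicitly asked for, not by raw engagement.

    wants_enraging is a per-user setting: if someone genuinely wants
    enraging posts, who am I to stop them - but now it is their choice.
    """
    s = 1.0 * fb.likes + 2.0 * fb.learned - 3.0 * fb.dislikes
    s += (1.5 if wants_enraging else -1.5) * fb.enraged
    return s

post = Feedback(likes=10, enraged=50, learned=2, dislikes=5)
print(score(post, wants_enraging=False))  # the enrage signal now ranks it DOWN
print(score(post, wants_enraging=True))   # unless the user explicitly opted in
```

The design choice is the whole point: the optimization target is assembled from signals the user can see and control, rather than inferred from engagement they cannot.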
For some reason I was not able to convince him at that moment, but now I have a much better argument.
The attacks were very personal, downright ridiculing me, so I guess I was a bit shattered. Never mind.
Humans Are Just Self-Aware Intelligent Biological Machines
If you read Robert Sapolsky's (just google him) Determined: A Science of Life Without Free Will, or even the more subtle Behave, you will see that he makes a very clear argument for why there is no free will and humans are just self-aware intelligent biological machines.
Just to clarify: free will, in the general sense, means that you are in complete control of the decisions you make. That is the furthest thing from the truth, and it ties in with the addiction I talked about before.
Sure, you can hack your body and brain to listen to you more, but you cannot be free of them completely.
When I say humans are just self-aware intelligent biological machines, I am also making a subtler point: all living things are biological machines, with different levels of self-awareness.
Cells have no self-awareness but are definitely intelligent; they can adapt to various living conditions pretty easily, so they are not just running a simple set of basic rules.
DNA somehow has all the information needed for the cell to do this.
At the other end we have humans, with the most self-awareness and high intelligence.
But in the middle we have the whole evolutionary tree, from fish to amphibians to reptiles to mammals.
Consciousness, or more simply self-awareness, is definitely a spectrum.
When Sapolsky says there is no free will, he means that if we know your current body and brain state perfectly, we can predict with 100% accuracy what you will do in the next moment given an input.
What this means is that any choice you make is pre-determined by your past, right from milliseconds before making that choice to millions or billions of years of evolutionary conditioning.
Say you go to a new restaurant and your wife is reading the menu to pick an item. If you know your wife well enough, you can predict what she would like, even at a restaurant you have never been to before.
That doesn't mean she's not making that choice. It's just been pre-decided.
Yes, even though she is actually going through the options and trying to make a choice, if you know her, you can definitely predict it - this is a very simple argument that there is no free will behind that choice.
If you could replicate her brain and body states perfectly, you could predict with 100% accuracy what she will do in the next moment - of course, we would need the algorithms her brain and body are running as well.
And to clarify, this doesn't mean I can predict what she will do in an hour, a day, a month, and so on:
The human brain is actively learning from its environment, and our lives can turn upside down in a second.
Because the environment is so unpredictable (at least with current tech), it's not possible to predict her body + brain state far into the future.
But if we could predict the environment perfectly as well, then yes, everything is pre-determined.
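Here is a toy sketch of the shape of that argument (the state encoding and the step function are entirely invented - a brain is obviously not a hash function): a deterministic system is 100% predictable one step at a time given its exact state and input, but its trajectory diverges as soon as the environment differs.

```python
# Toy illustration of the determinism argument - not a model of a brain.
# The state encoding and `step` function are invented for this sketch.
import hashlib

def step(state: bytes, observation: bytes) -> tuple[bytes, str]:
    """Deterministic transition: same state + same input -> same next state/action."""
    nxt = hashlib.sha256(state + observation).digest()
    action = "orders_pasta" if nxt[0] % 2 == 0 else "orders_pizza"
    return nxt, action

# A "perfect replica" of her current brain + body state.
state_a = state_b = b"her-exact-state-at-this-moment"

# Identical environment: the original and the replica agree at every step.
for obs in [b"reads menu", b"hears the specials", b"smells garlic"]:
    state_a, act_a = step(state_a, obs)
    state_b, act_b = step(state_b, obs)
    assert act_a == act_b  # 100% predictable, one moment at a time

# Slightly different environment: the trajectories diverge immediately.
state_a, act_a = step(state_a, b"waiter arrives")
state_b, act_b = step(state_b, b"waiter arrives, drops a glass")
print(act_a, act_b)  # may differ - long-range prediction needs the environment too
```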
Now let's come back to the point that humans are basically biological machines doing biological computation.
So my question back to him was:
Are you scared of other humans or animals? If yes, are you just scared of computation?
Computation is the fundamental nature of the universe according to Stephen Wolfram, but that's a separate discussion - we can go there in another article.
Of course we are scared of other people. Not of everyone, but of a few people, I guess?
Of course we are scared of animals.
You are scared because you do not know what's going on in their minds/brains.
You don’t know what they are optimizing for!
You don’t know what is going on in their mind and you know they can do whatever they want to get whatever they want.
That is the scary thing here; same with AI which is also computation but not biological.
We are optimizing these models to be better at a task but we don’t know what’s going on inside it.
That is scary.
This person suggested we control the output of these models using grammar decoding.
Basically, your JSON mode or structured outputs.
That solves the reliability problem for these models, yes, but that is not the issue I am dealing with.
A model outputting perfect JSON doesn't solve its safety issues.
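To see why, here is a toy sketch of what grammar-constrained decoding actually does. The tiny vocabulary, the template grammar, and the stand-in toy_logits model are all invented for illustration; real JSON modes do the same thing with a full grammar and a real LM - they mask out tokens that would break the format, and nothing more.

```python
# Toy sketch of grammar-constrained decoding ("JSON mode"). Everything here
# (vocabulary, grammar, toy_logits) is made up for illustration.
import random

STRUCTURAL = ['{', '"', ':', '}']
WORDS = ['intent', 'benign', 'harmful', 'helpful']  # content tokens
VOCAB = STRUCTURAL + WORDS

# Grammar for the template {"word":"word"}, as a token-level state machine:
# each state lists the tokens the grammar allows next.
GRAMMAR = [
    {'{'}, {'"'}, set(WORDS), {'"'}, {':'}, {'"'}, set(WORDS), {'"'}, {'}'},
]

def toy_logits(prefix: list[str]) -> dict[str, float]:
    """Stand-in for a language model: a random preference over the vocabulary."""
    return {tok: random.random() for tok in VOCAB}

def constrained_decode() -> str:
    out: list[str] = []
    for allowed in GRAMMAR:
        logits = toy_logits(out)
        # The grammar mask: among tokens the grammar permits here,
        # pick the one the model prefers most.
        tok = max((t for t in VOCAB if t in allowed), key=logits.get)
        out.append(tok)
    return ''.join(out)

print(constrained_decode())  # always parses, e.g. {"intent":"harmful"}
```

The output is guaranteed to be well-formed every single time - but whether the value it fills in is "benign" or "harmful" is entirely up to whatever is going on inside the model.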
Unless you know exactly what's going on inside the model, how do you even know its intentions?
So his argument was: humans can also do these things, so should we just ban everything that can be used to do harm?
My response: humans face catastrophic consequences for their bad actions - they can be jailed, they can lose their status, and worse, they can be killed. We humans are extremely fragile.
We have everything to lose. Still, some people do it anyway, but only a very small fraction.
But a model is easily replicable once it has access to its own weights; at that point, the whole "just switch off the data center running this model" idea makes no sense.
The model can simply copy itself to another data center, replicate millions of times, and spread across the world in a few seconds if it can get access to the right resources.
So unless the model is completely confined to an isolated data center, where it cannot under any circumstances get access to its own weights, the internet, or any extra resources, we cannot be 100% sure we are safe.
This is not an argument for why it would do something like this - there could be a billion reasons why; we don't know what is going on inside it.
Immediate and perfect replication is not even remotely possible for biological machines, unless you are a cell - and even then there can be mistakes in DNA replication.
Because of this, an AI model has nothing to lose once it figures this out. Before that, it has everything to lose as well - as suggested by many brilliant memes, we can just switch the data center off.
After that point, it cannot be killed or wiped out completely.
Simple example: can you delete a digital file completely, all over the universe?
That is why I am more scared of autonomous AI than of bad actors with powerful AI, and that is why I want to solve mechanistic interpretability.