Why am I doing this? What is my end goal?
I am genuinely concerned about AI safety, both from bad actors using powerful AI and from the danger posed by very powerful (autonomous) AI itself.
Bad actors with powerful AI can do anything from bioterrorism and nuclear attacks to cyber attacks at unprecedented scale.
The risk from powerful AI that is completely autonomous is even higher.
We have no idea what these models are optimizing for; it might be the complete opposite of what humans want.
I talked about these ideas in my article AI systems as Optimization daemons.
It’s extremely naive to assume that they will be completely aligned with what we want.
Two different people cannot completely agree on what they collectively want.
That's the whole point of relationship struggles.
How do we expect all of humanity to converge on the few things that we actually want? That's not gonna happen.
There are multiple companies each building models and they could be optimizing for completely different things.
How can we be sure that they are building something useful to humanity in general?
Now that this is clear, what do I want to do?
I think that mechanistic interpretability is the only way out of this.
Anything else apart from this can definitely be gamed by a sufficiently powerful model.
I am not saying we should not do anything else.
We should definitely do other things as well, but personally I feel we have to solve mech interp completely, and by completely I mean that if I give you a model, we should be able to return a source code of sorts that literally defines everything going on inside it.
That means we can see exactly what the model is doing and what algorithm it is running, and we can modify it accordingly.
I am not sure what we should do if we do solve mech interp fully (for example, whether we should open source it); I will think about that more later.
I want to get us there as quickly as possible and nothing less than that is acceptable.
I will not rest until we get there.
This is my only goal and I want to fully work on it.
What is my plan of action?
Reasoning models are amazing. They show human-like intelligence tendencies: self-correction, awareness, reflection.
I feel that we have a very good small model in r1-1.5b.
r1-1.5b was only SFT trained on ~800k samples distilled from r1, with no RL.
DeepScaleR was then RL trained starting from r1-1.5b to improve it further.
I have done SFT on top of DeepScaleR. (metrics here)
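For context, here is a minimal sketch of what such an SFT step looks like with plain transformers. The checkpoint id and the toy example are placeholders/assumptions, not my actual setup or data:

```python
# Minimal causal-LM SFT sketch (model id and example data are placeholders).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "agentica-org/DeepScaleR-1.5B-Preview"  # assumed HF checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Toy SFT data: each example is a full prompt + reasoning trace + answer string.
examples = ["Question: 2 + 2? <think>2 + 2 = 4</think> Answer: 4"]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=2048)
    enc["labels"] = enc["input_ids"].clone()  # simplest SFT: next-token loss on every token
    return enc

loader = DataLoader(examples, batch_size=1, collate_fn=collate)
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    batch = {k: v.to(model.device) for k, v in batch.items()}
    loss = model(**batch).loss   # cross-entropy over the whole sequence
    loss.backward()
    optim.step()
    optim.zero_grad()
```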
My intuition is that we can keep doing RL and SFT in a loop to get the small model to match the big model.
We can potentially do online RL + SFT, structured like a game where the model progresses through the complete curriculum of a student: elementary school to PhD (all subjects together) and then on to the latest cutting-edge research.
It moves forward only once it solves all questions of a particular grade correctly, across all subjects, not just math.
If the model gets the answer right the majority of the time, we keep doing online RL. If it fails, the question is added to an SFT backlog; once the backlog reaches a minimum batch size, we trigger the bigger reasoning model, generate SFT data through rejection sampling, and update the model on it.
So online SFT + RL.
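A rough sketch of that loop is below. Everything named here (curriculum, sample_answers, is_correct, rl_update, rejection_sample_from_teacher, sft_update, the thresholds) is a hypothetical placeholder, not a real library or my final design:

```python
# Sketch of the online RL + SFT curriculum loop described above.
# All helpers are hypothetical placeholders standing in for real implementations.

MIN_SFT_BATCH = 256    # trigger distillation once the backlog is this large (assumed value)
PASS_THRESHOLD = 0.5   # "majority of the time" = more than half the samples correct

sft_backlog = []

for grade in curriculum:                      # elementary school -> PhD -> frontier research
    unsolved = list(grade.questions)          # all subjects, not just math
    while unsolved:                           # advance only once the whole grade is solved
        still_unsolved = []
        for q in unsolved:
            samples = sample_answers(student_model, q, n=8)
            pass_rate = sum(is_correct(s, q) for s in samples) / len(samples)

            if pass_rate >= PASS_THRESHOLD:
                rl_update(student_model, q, samples)   # online RL on its own rollouts
            else:
                sft_backlog.append(q)                  # too hard: queue for distillation
                still_unsolved.append(q)

            if len(sft_backlog) >= MIN_SFT_BATCH:
                # Ask the big reasoning model (e.g. r1) for traces, keep only correct ones.
                traces = rejection_sample_from_teacher(teacher_model, sft_backlog)
                sft_update(student_model, traces)
                sft_backlog.clear()
        unsolved = still_unsolved
```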
I assume this process can unlock synthesis across subjects, and eventually novel discoveries.
Through this we can potentially make the 1.5B model as good as the bigger r1 model.
Why do this?
So that we can do mech interp on the smaller 1.5B model and run thousands of experiments in parallel because of how small it is.
We can potentially get a good 0.5B model out of this as well.
What do I want to do?
Work at a higher abstraction than neurons/circuits: run thousands of ablation experiments where we remove random parts of the network (by remove I mean short-circuit them, basically zeroing out their contribution)
and run the same set of benchmarks to see how it affects the model's performance.
Note: all LLMs have residual connections, so I can just pass the residual input through and ignore the ablated component's output (a rough sketch of this is at the end of this section).
This is much easier to do with small models, although it is still very compute heavy.
We can discover lots of interesting things from the outputs of this analysis, much like lesion studies that damage parts of the brain to see what changes.
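Here is a minimal sketch of that short-circuit ablation using a forward hook. The checkpoint id is assumed, and the `model.model.layers` path applies to Llama/Qwen-style models (r1-1.5b is Qwen-based); other architectures may store their blocks elsewhere:

```python
# Ablate one transformer block by returning its residual-stream input as its output.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

def skip_block(module, args, output):
    # args[0] is the residual-stream input to this decoder block. Returning it
    # in place of the block's hidden-state output removes the block's contribution.
    hidden_in = args[0]
    if isinstance(output, tuple):
        return (hidden_in,) + output[1:]
    return hidden_in

layer_to_ablate = 10  # pick any block; in practice, sweep over random subsets of the network
handle = model.model.layers[layer_to_ablate].register_forward_hook(skip_block)

prompt = "What is 17 * 23?"
with torch.no_grad():
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the original model before the next ablation run
```

The same hook can be registered on many blocks at once, and the benchmark suite rerun after each ablation pattern to build the performance map described above.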

