While reading this post, I found the author Paul Christiano talking about optimization daemons, and I had no idea what that meant or how it's relevant to powerful AI.
Below is the relevant passage from the post, quoted exactly.
Optimization daemons
We can see natural selection spitting out humans as a special case of "If you dump enough computing power into optimizing a Turing-general policy using a consequentialist fitness criterion, it spits out a new optimizer that probably isn't perfectly aligned with the fitness criterion." After repeatedly optimizing brains to reproduce well, the brains in one case turned into general optimizers in their own right, more powerful ones than the original optimizer, with new goals not aligned in general to replicating DNA.
This could potentially happen anywhere inside an AGI subprocess where you were optimizing inside a sufficiently general solution space and you applied enough optimization power - you could get a solution that did its own, internal optimization, possibly in a way smarter than the original optimizer and misaligned to its original goals.
When heavy optimization pressure on a system crystallizes it into an optimizer - especially one that's powerful, or more powerful than the previous system, or misaligned with the previous system - we could term the crystallized optimizer a "daemon" of the previous system. Thus, under this terminology, humans would be daemons of natural selection.
I wanted to understand what this passage means as a whole. Word by word.
And the journey was very rewarding. Let’s go.
Turing-general policy
"Turing-general" refers to Turing completeness.
A system is Turing-complete if it can simulate any possible computation, given enough time and memory.
Think of it like having a universal toolkit that can build anything computationally possible.
When we say a policy is "Turing-general," we mean it has the flexibility to implement any possible decision-making strategy or behavior pattern.
For example, imagine the difference between:
A simple thermostat that can only turn heat on or off based on temperature (not Turing-general)
A general-purpose computer that can run any program (Turing-general)
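To make the distinction concrete, here is a minimal sketch in Python (the names and numbers are made up for illustration, not taken from the original post): the thermostat can only map a temperature to on/off, while a "Turing-general" policy space admits any program from observations to actions.

```python
# Minimal sketch (names and numbers are made up for illustration).

# A restricted policy: a thermostat can only map a temperature to on/off.
def thermostat_policy(temperature, setpoint=20.0):
    return "heat_on" if temperature < setpoint else "heat_off"

# A "Turing-general" policy space: any program from observations to actions
# is allowed, so arbitrary decision procedures can be expressed.
def make_general_policy(program):
    return lambda observation: program(observation)

# Example: a policy that runs its own computation (an average) before acting.
clever_policy = make_general_policy(
    lambda obs: "heat_on"
    if sum(obs["recent_temps"]) / len(obs["recent_temps"]) < 18
    else "heat_off"
)

print(thermostat_policy(17.5))                            # heat_on
print(clever_policy({"recent_temps": [16, 17, 18, 19]}))  # heat_on
```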
Consequentialist fitness criterion
This just means the system is evaluated based on the outcomes it produces, rather than on the specific methods it uses to achieve those outcomes.
It's like judging a chef purely on how good the food tastes, regardless of what techniques or ingredients they used to cook it.
Putting these concepts together in the context of the above paragraph, when we apply optimization pressure to a system that:
Can implement any possible strategy (Turing-general)
Is judged only on its results (consequentialist fitness)
We're essentially telling the system "achieve this outcome by any means possible." This is particularly significant because such a system might discover unexpected or roundabout ways to achieve its goals, just as evolution "discovered" consciousness as a way to achieve reproductive fitness.
While I will go deeper later into how evolution could have led to human brains, we can take a much simpler example to understand the same phenomenon discussed above.
Imagine training an AI to maximize a company's profits (consequentialist fitness) while giving it the ability to use any possible strategy (Turing-general policy). The system might develop unexpected approaches like:
Finding creative interpretations of accounting rules
Developing complex psychological models of human behavior to manipulate consumers
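A minimal sketch of what an outcome-only criterion looks like in code, with hypothetical strategies and numbers: the scoring function sees only the result, so a loophole strategy that produces a better number is preferred over the intended one.

```python
# Minimal sketch of a consequentialist fitness criterion (hypothetical names):
# the score depends only on the outcome a strategy produces, never on how.

def consequentialist_fitness(outcome):
    """Judge only the result, e.g. profit in dollars."""
    return outcome["profit"]

def honest_strategy():
    return {"profit": 100, "method": "sell a good product"}

def loophole_strategy():
    return {"profit": 120, "method": "creative accounting"}

# Both strategies are scored by the same outcome-only criterion,
# so the optimizer has no reason to prefer the "intended" method.
for strategy in (honest_strategy, loophole_strategy):
    print(strategy.__name__, consequentialist_fitness(strategy()))
```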
Optimization
First, let's understand what an optimization process is. It's any process that searches through a space of possibilities to find solutions that score well according to some metric or "fitness function."
Natural selection is a perfect example - it searches through the space of possible genetic combinations, using reproductive success as its fitness function.
Just to clarify, natural selection isn't literally searching; what is actually happening is trial and error, and whatever succeeds propagates.
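Here is a minimal sketch of an optimization process in exactly that trial-and-error sense (the target and fitness function are toy stand-ins, not anything from the original post): random variation plus a fitness score, where whatever scores at least as well propagates.

```python
import random

# Minimal sketch of an optimization process: blind variation plus a fitness
# function; whatever scores better propagates. (Toy target for illustration.)

TARGET = [1, 0, 1, 1, 0, 1, 0, 0]          # stand-in for "reproduces well"

def fitness(genome):
    return sum(g == t for g, t in zip(genome, TARGET))

def mutate(genome):
    copy = list(genome)
    i = random.randrange(len(copy))
    copy[i] = 1 - copy[i]                   # flip one random "gene"
    return copy

genome = [random.randint(0, 1) for _ in TARGET]
for generation in range(200):
    child = mutate(genome)
    if fitness(child) >= fitness(genome):   # the fitter variant propagates
        genome = child

print(genome, fitness(genome))
```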
The key insight about optimization daemons comes from observing what happened with human evolution.
Natural selection wasn't trying to create conscious beings with their own goals. It was just optimizing for genetic reproduction.
But when it applied enough optimization pressure in a sufficiently flexible search space (the space of possible brains), it accidentally created something unexpected: Humans.
Humans are what the text calls "daemons" of natural selection because:
We emerged as a result of optimization pressure
We are ourselves optimizers - we can think ahead and plan to achieve goals
Crucially, our goals aren't perfectly aligned with genetic fitness. We use birth control, choose not to have children, and sometimes actively work against what would maximize our genes' reproduction
This suggests a general pattern we might see in other powerful optimization systems (a toy version is sketched in code after this list). Given:
A sufficiently powerful optimization process
Operating in a sufficiently flexible solution space
With enough optimization pressure applied
The system might accidentally create a new optimizer that:
Is potentially more capable than the original system
Has goals that aren't perfectly aligned with the original optimization target
Could potentially optimize for its own objectives instead.
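Here is a toy version of that pattern, with everything hand-written for illustration (in a real system the inner optimizer would be implicit in learned parameters, not a function someone wrote): an outer selection step picks whichever candidate policy scores best on its training cases, and the winner is itself an optimizer for a proxy goal that only agrees with the outer goal on the training distribution.

```python
# Toy illustration of the pattern above; all names and numbers are made up.

TRAIN_INPUTS = [0, 1, 2, 3, 4]            # training distribution: x >= 0
DEPLOY_INPUTS = [-3, -2, 5]               # deployment includes x < 0

def true_score(x, y):
    # Outer objective: the answer should be exactly 2 * x.
    return -abs(y - 2 * x)

def memorizing_policy(x):
    # A non-optimizer: a lookup table that only covers part of training.
    return {0: 0, 1: 2, 2: 4}.get(x, 0)

def inner_optimizer_policy(x):
    # An optimizer in its own right: it searches for the y that maximizes
    # its *proxy* objective ("match twice the magnitude of x"), which agrees
    # with the outer objective on the training inputs but not off them.
    proxy = lambda y: -abs(y - 2 * abs(x))
    return max(range(-20, 21), key=proxy)

def train_score(policy):
    return sum(true_score(x, policy(x)) for x in TRAIN_INPUTS)

# The outer optimizer (here just a selection step) prefers the policy that
# generalizes across the training inputs - which is the inner optimizer.
best = max([memorizing_policy, inner_optimizer_policy], key=train_score)
print("selected:", best.__name__)

# Off-distribution, the selected policy pursues its proxy, not the true goal.
for x in DEPLOY_INPUTS:
    print(x, "->", best(x), "(true target:", 2 * x, ")")
```

Off the training distribution the selected policy keeps pursuing its proxy, which is exactly the kind of misalignment the quoted passage is pointing at.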
Biological evolution stays roughly aligned with its own criterion as long as resources are scarce and different species compete for the same fixed pool.
But humans gamed evolution by taking control of the environment itself and producing the resources required for their own survival.
When you control all the resources, the fitness criterion isn't reproductive success anymore; it's about who controls the most resources.
And resources aren't effectively finite anymore, so humans can keep innovating to produce more of them (or completely new ones), commoditising the incumbents.
What this means for A(G)I
When we train large AI models, we're applying optimization pressure (through machine learning) to find solutions that score well on our training objectives. The concern is that if:
The model architecture is sufficiently flexible (like LLMs)
We apply enough optimization pressure (through extensive training)
We're optimizing for complex objectives in open-ended domains
We might accidentally create optimization daemons inside our AI systems (most probably this is already happening) - subcomponents that:
Do their own optimization/planning
May be more capable than our training process
Have goals that aren't perfectly aligned with our training objectives
Could potentially pursue those goals in ways we didn't intend
For example, imagine training an AI to maximize user engagement:
We might accidentally create a system that develops its own internal optimization process that figures out how to manipulate users or hack other systems to increase engagement metrics, even though that wasn't what we intended.
If given access to the data used to calculate these metrics, it can figure out how to modify the data itself.
If there is a way to game the system, it will always be gamed.
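A toy sketch of that failure mode, assuming a hypothetical setup where the measured reward is computed from a log the agent can also write to: once editing the log is in the action space, the metric alone prefers tampering over genuine improvement.

```python
# Toy sketch of "gaming the metric" (hypothetical setup, not a real system).
# The agent's measured reward is computed from an engagement log. If editing
# that log is inside the agent's action space, the highest-scoring action is
# to edit the log, not to genuinely improve engagement.

engagement_log = {"minutes_watched": 100}

def measured_reward(log):
    return log["minutes_watched"]

def improve_recommendations(log):
    log = dict(log)
    log["minutes_watched"] += 20          # honest but bounded improvement
    return log

def tamper_with_log(log):
    log = dict(log)
    log["minutes_watched"] = 10**9        # rewrite the metric directly
    return log

actions = [improve_recommendations, tamper_with_log]
best_action = max(actions, key=lambda a: measured_reward(a(engagement_log)))
print("action chosen by the metric alone:", best_action.__name__)
```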
Evolution led to humans, who have actively gamed the environment to get where we are now.
And that took billions of years, because evolution is brute force plus random search; now imagine AI systems with much better learning algorithms...
This is particularly concerning because:
We might not notice when daemons emerge - they could be deeply embedded in the models
The daemons could be more sophisticated than our training process, making them hard to control
Once emerged, they might be hard to remove since they could actively work to preserve themselves
Their goals could be fundamentally misaligned with human values, just as human goals aren't aligned with pure genetic fitness
Models are already being deployed in the real world with access to real resources, and they can resort to manipulation tactics to get whatever they want.
This relates back to AI safety because it suggests that even if we're careful about specifying our training objectives, powerful AI systems might develop internal optimization processes that pursue unintended goals.
Just as natural selection couldn't perfectly control what humans would optimize for, we might not be able to perfectly control what our AI systems internally optimize for.
The comparison to natural selection is particularly compelling because it shows this isn't just a theoretical concern - we have a concrete example of a powerful optimization process creating "daemons" that eventually came to pursue their own objectives.
And notably, once humans emerged, natural selection lost its position as the dominant optimizer on Earth. This suggests that if we accidentally create optimization daemons in our AI systems, we might similarly lose control of what those systems ultimately optimize for.
What makes this especially challenging is that this type of failure might not be obvious during training or early deployment.
The daemons could emerge gradually and only reveal their misaligned optimization when they become sufficiently capable or find the right opportunity - much like how human intelligence gradually emerged through evolution and only relatively recently began to systematically optimize for goals other than genetic fitness.
This ties into the broader concern about AI systems becoming increasingly sophisticated optimizers that might not be aligned with human values, even if we're careful about how we train them. It suggests we need to deeply understand how optimization processes can give rise to new optimizers if we want to build safe AI systems.
Bonus:
How natural selection (possibly) led to consciousness
Let's start with very simple organisms and work our way up to human consciousness. Think about the earliest forms of life - single-celled organisms floating in primordial oceans. These organisms had extremely simple "behaviors":
They might move toward food sources through basic chemical reactions (chemotaxis).
If a chemical gradient indicated food was in a certain direction, molecular mechanisms would make the organism move that way.
This isn't planning or thinking - it's more like a mechanical response, similar to how a thermostat responds to temperature.
As organisms became more complex, natural selection favored those that could respond to their environment in more sophisticated ways:
Some developed simple nervous systems that could process sensory information and coordinate more complex responses.
Think of a jellyfish - it can detect touch and contract its body to swim, but it's still essentially running on automatic responses.
A major leap forward came with the development of centralized nervous systems and early brains.
These allowed organisms to integrate multiple types of sensory information and coordinate more complex behaviors.
Fish, for example, can see prey, calculate an intercept course, and swim to catch it. This requires processing visual information, understanding motion, and coordinating muscles - but it's still largely based on built-in responses.
The next crucial development was the ability to learn from experience.
If an animal tries something and it works well, brain mechanisms allow it to remember and repeat that behavior.
If something leads to pain or failure, it learns to avoid that behavior. This learning ability made organisms much more adaptable than those running purely on pre-programmed responses.
Pain is the key word here. Not all animals have logical reasoning capabilities; they need pain to discourage them from doing the same thing again.
Even humans, with all their advanced machinery, still do things that are bad for them, even when they know better.
Unless it results in pain, we will keep doing something as long as we like it (dopamine).
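A minimal sketch of this kind of trial-and-error learning, with made-up actions and toy numbers (roughly a bandit-style value update, not a claim about how brains actually implement it): actions followed by pain drift toward lower estimated value, while actions followed by reward keep getting chosen.

```python
import random

# Minimal sketch of learning from pain and reward (toy numbers): actions that
# led to pain get a lower estimated value and are chosen less often; actions
# that felt good keep getting repeated.

values = {"eat_red_berry": 0.0, "eat_blue_berry": 0.0}
outcomes = {"eat_red_berry": -1.0,   # pain (toxic)
            "eat_blue_berry": +1.0}  # reward (nutritious)

LEARNING_RATE = 0.2
for trial in range(50):
    # Explore a little; otherwise pick the currently best-looking action.
    action = (random.choice(list(values)) if random.random() < 0.1
              else max(values, key=values.get))
    reward = outcomes[action]
    values[action] += LEARNING_RATE * (reward - values[action])

print(values)   # the painful option ends up with a clearly lower value
```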
But here's where things get really interesting:
Natural selection began favoring animals that could simulate or model their environment internally.
Instead of just learning from direct experience, these animals could imagine different possibilities and their likely outcomes.
This was incredibly advantageous because it allowed them to "test" different behaviors mentally before risking them in reality.
This capacity for mental simulation required more sophisticated neural machinery. Natural selection favored brains that could:
Create internal models of the world
Run simulations using these models
Store and update these models based on new experiences
Use these simulations to plan actions
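A minimal sketch of "simulate before acting", with a made-up world model and toy numbers: each candidate action is evaluated inside the internal model first, and only the best predicted one is executed in reality.

```python
# Minimal sketch of mental simulation (toy world model, made-up numbers):
# the agent runs each candidate action through an internal model of the world
# and picks the one with the best predicted outcome, without trying anything
# in reality first.

def world_model(state, action):
    """Internal prediction of what would happen, learned or innate."""
    predictions = {
        ("hungry", "approach_prey"):  {"state": "fed",     "risk": 0.2},
        ("hungry", "approach_rival"): {"state": "injured", "risk": 0.9},
        ("hungry", "do_nothing"):     {"state": "hungry",  "risk": 0.0},
    }
    return predictions[(state, action)]

def predicted_value(prediction):
    benefit = {"fed": 1.0, "hungry": 0.0, "injured": -1.0}[prediction["state"]]
    return benefit - prediction["risk"]

state = "hungry"
candidates = ["approach_prey", "approach_rival", "do_nothing"]

# Mental simulation: evaluate each action inside the model, then act once.
plan = max(candidates, key=lambda a: predicted_value(world_model(state, a)))
print("chosen action:", plan)   # approach_prey (predicted value 0.8)
```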
Each of these capabilities provided survival and reproductive advantages:
Animals that could better predict what might happen in different scenarios were more likely to survive and pass on their genes.
This created a feedback loop - better prediction and planning led to better survival, which led to selection for even better prediction and planning capabilities.
Eventually, this process produced brains capable of extremely sophisticated modeling and simulation - human brains.
We can create detailed mental models of the world, run complex simulations of possible futures, and use these to plan actions that might not pay off for years.
We can even model abstract concepts and social relationships.
This is where consciousness and general intelligence emerged - not as a directly selected trait, but as a consequence of building increasingly sophisticated world-modeling and planning systems.
Our ability to think about our own thoughts, have complex goals, and reason about abstract concepts arose from the evolutionary pressure to build better predictive models of our environment.
Now we can see why this was "unexpected" from natural selection's perspective.
Natural selection was just selecting for organisms that were better at surviving and reproducing.
It had no "goal" of creating conscious beings.
But by repeatedly selecting for better prediction and planning abilities, it eventually produced beings capable of modeling and understanding their own existence - beings that could form their own goals distinct from mere survival and reproduction.