Decentralized AGIs, or Singleton?

Speed Read This
Posted by on February 19, 2016

When predicting and planning for coming decades, we classify futures different ways based on what happens with artificial general intelligence. There could be a hard take-off, where soon after an AGI is created it self-improves to become extraordinarily powerful, or a soft take-off, where progress is more gradual. There could be a singleton – a single AGI, or a single group-with-AGI, which uses AGI to become much more powerful than everyone else, or things could be decentralized, with lots of AGIs or lots of groups and individuals that have AGIs.

The soft- vs hard-takeoff question is a matter of prediction; either there is a level of intelligence which enables rapid recursive self improvement, or there isn’t, and we can study this question but we can’t do much to change the answer one way or the other. Whether AGI is decentralized or a singleton, however, can be a choice. If a team crosses the finish line and creates a working AGI, and they think decentralized control will lead to a better future, then they can share it to everyone. If multiple teams are close to finishing but they think a singleton will lead to a better future, then they can (we hope) join forces and cross the finish line together.

There are things to worry about and try to prepare for in singleton-AGI futures, and things to worry about and prepare for in decentralized-AGI futures, and these are quite different from each other. Which is better, and which will actually happen? I think a lot of people talking about AGI and AGI safety end up talking past each other, because they are imagining different answers to this question and envisioning different futures. So let’s consider two futures. Both will be good futures, where everything went right. One will be a singleton future, and the other will be a decentralized future.

Let’s look at a singleton future, starting with a version of that future in which everything went right. There are some who want to make – or for others to make – a single, very powerful AGI. They want to design it in such a way that it will respect everyone’s rights and preferences, be impossible for anyone to hijack, and be amazingly good at getting us what we want. In a world where this was executed perfectly, if I wanted something, then the AGI would help me get it. If two people wanted things that were incompatible, then somewhere in the AGI’s programming would be a rule which decides who wins. Philosophers have a lot to say about what that rule would be, and about how to resolve situations when people’s preferences are inconsistent or would change if they knew more. In the world where everything went right, all of those puzzles were solved conclusively, and the answers were programmed into the AGI. The theory of how intelligence works was built up and carefully verified, all the AGI experts agreed that they AGI would do what all the philosophers and AGI experts together agreed was right. Then the AGI would take over the world, and everyone would be happy about it, at least in retrospect when they saw what happened next.

On the other hand, there are a lot of ways for this to go wrong. If someone were to say they’d built an AGI and they wanted to make it a singleton, we’d all be justifiably skeptical. For one thing, they could by lying, and building a different AGI to benefit only themselves, rather than to benefit everyone. But even the very best intentions aren’t necessarily enough. A major takeaway from MIRI and FHI’s research on the subject is that there’s a very real risk of trying to make something universally-benevolent, but getting it disastrously wrong. This is an immensely difficult problem. Hence their emphasis on using formal math: when something is mathematically proven then it’s true, reducing the number of places a mistake could be made by one. There’s a social coordination problem, to make sure that whoever is first to create an AGI makes one that will benefit everyone; another social coordination, to make sure that people aren’t racing to be first-to-finish in a way that causes them to cut corners; and a whole lot of technical problems. Any one of these things could easily fail.

So how about a world with decentralized AGI–that is, one where everyone (or every company) has an AGI of their own, which they’ve configured to serve their own values. Again, we’ll start with the version in which everything goes right. First of all, in this world, there is no hard take-off, and especially no delayed hard take-off. If recursive self-improvement is a thing that can happen, then any balance of power is doomed to collapse and be replaced with a singleton as soon as one AGI manages to do it. And second, the set of other (non-AGI) technologies need to work out in a particular way to make a stable power equilibrium possible. As an analogy, consider what would happen if every individual person had access to nuclear weapons. We would expect things to turn out very badly. Luckily, nuclear weapons require rare materials and difficult technologies, which makes it possible to restrict access to a small number of groups who have all more-or-less agreed to never use them. In a hypothetical alternate universe where anyone could make a nuclear weapon using only sand, controlling them would be impossible, and that hypothetical alternate universe would probably be doomed. Similarly, our decentralized-AGI world can’t have any technologies like sand-nuke world, or it will collapse quickly as soon as AGIs get smart enough to independently rediscover the secret. Or alternatively, that world could build a coordination mechanism where everyone is monitored closely enough to make sure they aren’t pursuing any known or suspect dangerous technologies.

The problems in singleton-AGI world were mostly technical: the creators of the AGI might screw it up. In decentralized-AGI world, the problems mostly come from the shape of the technology landscape. We don’t know whether recursive self-improvement is possible, but if it is, then decentralized-AGI worlds aren’t likely to work out. We don’t know if making-nukes-from-sand is a possible sort of thing, but if anything like that is possible, then the bar for how good the world’s institutions will have to be to prevent disaster will be very high. These things are especially worrying because they aren’t things we can influence; they’re just facts about physics and its implications which we don’t know the answers to yet.

Suppose we make optimistic assumptions. Recursive self-improvement turns out not to be possible, the balance of technologies favors defense over offense, and our AGI representatives get together, form institutions, and enforce laws and agreements that prevent anything truly horrible from happening. There is still a problem. It’s the same problem that happens when humans get together and try to make institutions, laws and agreements. The problem is local incentives.

Any human with above room temperature IQ can design a utopia. The reason our current system isn’t a utopia is that it wasn’t designed by humans. Just as you can look at an arid terrain and determine what shape a river will one day take by assuming water will obey gravity, so you can look at a civilization and determine what shape its institutions will one day take by assuming people will obey incentives.

But that means that just as the shapes of rivers are not designed for beauty or navigation, but rather an artifact of randomly determined terrain, so institutions will not be designed for prosperity or justice, but rather an artifact of randomly determined initial conditions.

Meditations on Moloch by Scott Alexander

If we give everyone their own AGIs, then the way the future turns out depends on the landscape of incentives. That isn’t an easy thing to change, although it isn’t possible. Nor is it an easy thing to predict, though some have certainly tried. (For example Robin Hanson’s The Age of Em). We can imagine nudging things in such a way that, as civilization flows downhill, it goes this way instead of that and ends up in a good future.

The problem is that, at the bottom of the hill as best I understand it, there are bad futures.

This isn’t something I can be confident in. Predicting the future is extremely hard, and where the far future is concerned, everything is uncertain. Maybe we could find a way to make having huge numbers of smarter-than-human AIs safe, and steer humanity from there to a good future. But for this sort of strategy, uncertainty is not our friend. If there were some reason to expect this sort of future to turn out well, or some strategy to make it turn out well, then for the same reason I’m uncertain in my belief that it will turn out badly, we would be uncertain in our belief that it will turn out well.

So, how do these comparative scenarios compare? To make a good future with a singleton AGI in it, humanity has to solve immensely difficult technical and social coordination problems, without making any mistakes. To make a good future with decentralized AGI in it, humanity has to… find out that luckily physics do not allow for recursive self-improvement or certain other classes of dangerous technologies.

I find the idea of building an AGI singleton intuitively unappealing and unaesthetic. It goes against my egalitarian instinct. It creates a single point of failure for all of humanity. On the other hand, giving everyone their own, decentralized AGIs is terrifying. Reckless. I can’t imagine any decentralized-AI scenarios that aren’t insanely risky gambles. So I favor humanity building a singleton, and AGI research being less than fully open.

A Series of Writing Exercises

Speed Read This
Posted by on January 27, 2016

Last Monday, my house hosted a Meta Monday on the subject of how to write about aversive things, particularly self-promotion. This spawned a discussion of how to avoid getting stuck on writing generally, and an interesting exercise. I’ve modified the description of the exercise slightly to refine it, so consider it untested, but I think this is worth doing at least once, especially if you struggle with writer’s block or want to write more.

The goal is to reliably summon and observe the feeling of writing, while definitely unblocked, on a series of subjects that starts out easy to write about and gets successively harder. It involves a facilitator, so it’s best done with a group or with a friend. Several people in the group thought of themselves as not being capable of prose-writing in general, but everyone who tried the exercise succeeded at it.

The facilitator explains the rules of the exercise, then picks a physical object in the room and sets a five-minute timer. For five minutes, everyone writes descriptions of the object. When the timer rounds out, the facilitator asks a question about the object. When we did it, the object was a projector. Everyone’s goal is for their 5-minute description to contain the answer the question. Since no one knows what the question’s going to be, the only way to have a high chance of answering the question is to mention everything. After five minutes were up, I pointed out the vent on the front of the projector, and asked: what’s this? Everyone’s description mentioned it, and everyone was able to write more or less continuously for the whole time without stopping. Success!

In round two of the exercise, everyone picked a project they had worked on. Same rules: after five minutes of writing, someone asks a question about your project – something basic and factual – and your goal is to have written the answer. Most people’s projects were software, so my question was “what programming language was it in”? Some other questions might have been “when did you start?” or “did you work with anyone?” This is very close to resume-writing, but because of the way it was presented, people in the room who were complaining about being stuck procrastinating and unable to do resume-writing were able to do it without any difficulty.

In round three, everyone picked a topic they’re interested in, and wrote about how they relate to that topic. This is similar to what one might write in a cover letter, but, continuing the format of the previous exercises, everyone optimized for answering all the basic factual questions. This one was harder; everyone was able to write something, but not everyone was able to go nonstop for the whole five minutes. Afterwards I asked “when did you first become interested in your topic?” and two-thirds of us had answered.

Several people observed that these exercises felt easy because they weren’t bullshit-y. So, for the fourth and final exercise, we ramped up the challenge level: Pick a villain from Game of Thrones (or some other fiction) and argue why they should rule. Someone else will independently pick a reason, and the goal is to have included that one. This one did, in fact, turn out to be more difficult; everyone managed to write something, but some of our pages were short.

My main takeaway from the exercise was the idea and mental motion of anticipating and trying to preemptively answer easy questions, but to find and answer all of them. This seems to work better for me than trying to answer a particular hard question, or to write without a mental representation of a questioner.

The Malthusian Trap Song

Speed Read This
Posted by on December 17, 2015

The Malthusian Trap is what happens to societies that reach the carrying capacity of their environment. For most of human history, humans have lived a Malthusian existence, constantly threatened by famine and disease and warfare. I wrote this for the 2015 MIT Winter Solstice. To the tune of Garden Song.

Dawn to dusk, row by row, better make this garden grow
Move your feet with a plow in tow, on a piece of fertile ground
Inch by inch, blow by blow, gonna take these crops you sow,
From the peasants down below, ‘till the wall comes tumbling down

Pulling weeds and picking stones, chasing dreams and breaking bones
Got no choice but to grow my own ‘cause the winter’s close at hand
Grain for grain, drought and pain, find my way through nature’s chain
Tune my body and my brain to competition’s demand.

Plant those rows straight and long, thicker than with prayer and song.
Emperors will make you strong if your skillset’s very rare.
Old crow watching hungrily, from his perch in yonder tree.
In my garden I’m as free as that feathered thief up there.

OpenAI Should Hold Off On Choosing Tactics

Speed Read This
Posted by on December 14, 2015

OpenAI is a non-profit artificial intelligence research company founded this lead by Ilya Sutskever as research director with $1B of funding from Sam Altman, Greg Brockman, Elon Musk, Reid Hoffman, Jessica Livingston, Peter Thiel, Amazon Web Services, Infosys, and YC Research. The launch was announced in this blog post on December 11, saying:.

Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.

Since our research is free from financial obligations, we can better focus on a positive human impact. We believe AI should be an extension of individual human wills and, in the spirit of liberty, as broadly and evenly distributed as is possible safely.

This is a noble and worthy mission. Right now, most AI research is done inside a few large for-profit corporations, and there is a considerable risk that profit motive could lead them to pursue paths that are not good for humanity as a whole. For the sorts of AI research going on now, leading towards goals like medical advances and self-driving cars, the best way to do that is to be very open. Letting research accumulate in secret silos would delay these advances and cost lives. In the future, there are going to be decisions about which paths to follow, where some paths may be safer and some paths may be more profitable, and I personally am much more comfortable with a non-profit organization making those decisions than I would be with a for-profit corporation or a government doing the same.

There are predictions that in the medium- to long-term future, powerful AGIs could be very dangerous. There are two main dangers that have been tentatively identified, and there’s a lot of uncertainty around both. The first concern is that a powerful enough AI in the hands of bad actors would let them do much more damage than they could otherwise. For example, extremists could download an AI and ask it to help them design weapons or plan attacks. Others might use the same AI help them design defenses and to find the extremists, but there’s no guarantee that these would cancel out; it could end up like computer security, where the balance of power strongly favors offense over defense. To give one particularly extreme example, suppose someone created a general-purpose question-answering system that was smart enough that, if asked, it could invent a nuclear bomb made without exotic ingredients and provide simple instructions to make one. Letting everyone in the world download that AI and run it privately on their desktop computer would be predictably disastrous, and couldn’t be allowed. On the other hand, the balance could end up favoring defenders; in that case, widespread distribution would be less of a problem. The second concern is the possibility of an AGI undergoing recursive self-improvement; if someone developed and trained an AI all the way to a point where it could do further AI research by itself, then by repeatedly upgrading its ability to upgrade itself it could quickly become very very powerful. This scenario is frightening because if the seed AI was a little bit flawed, then theory suggests that the process of recursive self-improvement might greatly magnify the effects of the flaw, resulting in something that destroys humanity. Dealing with this is going to be really tricky, because on one hand we’ll want the entire research community to be able to hunt for those flaws, but on the other hand we don’t want anyone to take an AI and tell it or let it start a recursive self-improvement process before everyone’s sure it’s safe.

At this point, no one really knows whether recursive self-improvement is possible, nor what the interaction will be between AI-empowered bad actors and AI-empowered defenders. We’ll probably know more in a decade, and more research will certainly help. OpenAI’s founding statement seemed to strike a good balance: “as broadly and evenly distributed as is possible safely”, acknowledging both the importance of sharing the benefits of AI and also the possibility that safety concerns might force them to close up in the future. As OpenAI themselves put it:

AI systems today have impressive but narrow capabilities. It seems that we’ll keep whittling away at their constraints, and in the extreme case they will reach human performance on virtually every intellectual task. It’s hard to fathom how much human-level AI could benefit society, and it’s equally hard to imagine how much it could damage society if built or used incorrectly.

Yesterday, two days after that founding statement was published, it was edited. The new version reads:

OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.

Since our research is free from financial obligations, we can better focus on a positive human impact. We believe AI should be an extension of individual human wills and, in the spirit of liberty, as broadly and evenly distributed as possible.

The word “safely” has been removed. There is no announcement or acknowledgement of the change, within that blog post or anywhere else, and no indication who made it or who knew about it.

I sort of understand why they’d do this. There’s a problem right now with ignorant press articles fearmongering over research that very clearly doesn’t pose any risk, and seizing on out-of-context discussions of long-term future issues to reinforce that. But those words where there for an important reason. When an organization states its mission in a founding statement, that has effects – both cultural and legal – lasting far into the future, and there’s going to come a time, probably in this century, when some OpenAI’s researcher is going to wonder whether their latest creation might be unsafe to publish. The modified statement says: publish no matter what. If there’s a really clear-cut danger, they might refuse to publish anyways, but this will be hard to defend in the face of ambiguity.

OpenAI is less than a week old, so it’s too early to criticize them or to praise them for anything they’ve done. Still, the direction they’ve indicated worries me – both because I doubt whether openness is going to be a safe strategy a decade from now, and because they don’t seem to have waited for the data to come in before baking it into their organizational identity. OpenAI should be working with the AI safety community to figure out what strategies to pursue in the short and long term. They have a lot of flexibility by virtue of being a non-profit, and they shouldn’t throw that flexibility away. They’re going to need it.

Rationality Cardinality: The Web-Based Version

Speed Read This
Posted by on October 3, 2015

Rationality Cardinality is a card game I wrote which takes memes and concepts from the rationality/Less Wrong sphere, and mixes them with jokes to make a game. The ultimate goal is to get the game to a broad audience, so that it exposes them to the ideas and concepts they’d be better off knowing. After nearly two years of card-creation, playtesting and development, today, I’m taking the “beta” label off the web-based version of Rationality Cardinality. Go to the website and, if at least two other people visit at the same time, you can play against them.

What this means is, I’m trying to drive traffic to the site and get people to play. People who make an account will get an email about it when there are print copies for sale; I’m trying to get enough interest to be confident that, when I launch a Kickstarter to sell print copies, it will reach its goal. You can help with this by inviting your friends to play, by sharing the link, and by upvoting and giving feedback on link sharing sites.

An Idea For Corrigible, Recursively Improving Math Oracles

Speed Read This
Posted by on July 19, 2015

A math oracle is a special kind of AI which answers math questions, but isn’t a maximizer of anything. I posted an idea for how to make one, and how to make it corrigible, on the Agent Foundations forum.

Newcomb’s Mirror

Speed Read This
Posted by on July 19, 2015


Newcomb’s Problem is a classic problem in decision theory, which goes like this. First, you meet Omega. Omega presents you with two boxes, the first opaque and the second transparent. You can either take box 1, or take both box 1 and box 2. Box 2 contains $1000. Omega has simulated you and predicted what you will choose. If you were predicted to take one box, then box 1 contains $1M; if you were predicted to take both boxes, then box 1 is empty.

A decision theory is a procedure for getting from a description like the previous paragraph to a decision. A decision theory is good if following it would mean you get more money, and bad if following it would mean you get less money. By this criterion, decision theories which say “one box” are good and decision theories which say “two box” are bad. A paragraph is a real decision theory problem if you’re presumed to know everything about the problem setup before the first decision, it’s fully unambiguous what happens and what you prefer, and what happens can be determined from your actions alone.

Omega, in this case, is philosophical shorthand for “don’t argue with the premise of the setup”. You’re supposed to assume everything Omega tells you is simply true; any doubt you may have is shunted out of decision theory and is taken as an epistemologist’s problem instead. This prevents dodging the question with reasonable-but-irrelevant ideas like “put the opaque box on a scale before you decide” and silly answers like “mug Omega for the $1M that’s still in his pocket”. This is necessary because those sorts of answers can be used to dodge the problem forever, especially if the problem involves a trolley. On the other hand, using Omega to close off aspects of a problem can block interesting lines of thought and leave an answer that’s intuitively unsatisfying.

In CDT, you first draw a causal diagram to represent the problem. Then you pick out one node, and only one node, to represent your decision (in this case indicated by making the node a square instead of a circle). Then you perform “counterfactual surgery”: imagine all the edges pointing into the decision node were severed, and you could choose anything you wanted; then choose whatever would give the best result (in this case, biggest value for node $). Timeless Decision Theory is exactly the same, except that in TDT, you introduce a notion of “your algorithm” which is separate from your decision, make a node for it.

cdt tdt
CDT (left) and TDT (right)

This takes the complexity of deciding, and pushes it over to the complexity of arranging the right causal network. The first discussions of how to do this corresponded to CDT, and in this formulation, due to some mix of historical accident and underdeveloped mathematical machinery, we only get to choose one node. TDT removes this only-one-node restriction and changes nothing else. In both cases, the main difficulty is in deciding which nodes count as “your computation”, and neither can handle cases where this is fuzzy.

By contrast, Updateless Decision Theory (UDT) throws away the causal networks formulation, and instead asks:

☐(𝔼[$|Decision=1] > 𝔼[$|Decision=2])?

Which reads as: is it provable that if I take one box, I’ll get more in expectation than if I two-box? This is progress, in that while it no longer says as much about the actual mathematical procedure for figuring this out, it at least is no longer committed to anything wrong. There are a technical oddities; you need to insert a hack to prevent it from creating self-fulfilling proofs, like “If I don’t choose X then I’ll be eaten by wolves”, which is technically true because after choosing that UDT proceeds to choose X.

Still, it feels as though there’s something key to Newcomb’s Problem which CDT, TDT, and UDT are all failing to engage with. Something important got smuggled into the problem as fait accompli.


The mirror test is a way to assess the intelligence of animals, which goes like this. First, you wait for the animal to go to sleep. You put a bright orange mark somewhere on its head, where it can’t see. When it wakes up, you show it a mirror. If it tries to remove the mark or gives some other sign of understanding that it is seeing itself in a mirror, then it passes the test. Chimps, dolphins, and humans pass the mirror test. Pandas, pigeons, and baboons do not.

Newcomb’s Problem is a sort of generalization of the mirror test – but one where the generalization from mirrors to simulations comes pre-explained, placed in the “Omega” part of the setup where you’re not supposed to engage with it. However, as soon as you try to generalize from Newcomb’s Problem to something more realistic, the mirror-test portion of the problem becomes the focus and the hard part. Here are some examples of problems people have called “Newcomblike”:

  • Parfit’s Hitchhiker: Someone reads your facial expressions to determine whether you will keep a promise to pay them later, and helps you if he predicts you will. Is their prediction related enough to your decision that you should pay them?
  • Voting: Your vote individually has too small an effect to to justify the cost, but your decision of whether or not to vote is somehow related to the decisions of others who would vote the same way.
  • Counterfactual mugging: Your decision is related to that of a hypothetical alternate version of you who doesn’t exist.

To help think about these sorts of problems, I’ve come up with two new variations on Newcomb’s problem.


Consider an alternative version of Newcomb’s problem, which we’ll call Newcomb’s Mirror Test. It goes like this. Box 1 is either empty or contains $3k (three times as much as box 2). Omega flips a coin. If the coin comes up heads, he simulates you, and puts $3k in the box if you one-box, or $0 if you two-box. If the coin comes up tails, Omega picks someone else in the world at random, and fills or doesn’t fill the box according to their choice. Then Omega shows you a brain scan of the simulation that was run. (All of the simulations see your brain scan, the other people Omega is choosing from are half one-boxers and half two-boxers).

If you accept the one-box solution to Newcomb’s original problem, then the challenge in Newcomb’s Mirror Test is whether you can recognize your own brain, as seen from the outside. If you can, then you check whether the brain scan you’re shown is your own, and if so, one-box; otherwise, two-box. This breaks the usual template of decision-theory problems because it asks you to bring in outside knowledge.

Realistic Newcomblike problems don’t usually involve brain scans and full-fidelity simulations. Instead, they involve similarities within groups, low-fidelity models, and similar ideas. To capture this, consider another scenario, which I’ll call Newcomb’s Blurry Mirror. Newcomb’s Blurry Mirror works like this. Omega starts with full-resolution models of you and everyone else on Earth. By some specified procedure, Omega removes a little bit of detail from each model, and checks whether there is any other model which, with that detail removed, is exactly identical to yours. If not, Omega removes a little more detail. This goes on until Omega has a low-resolution model which is sufficient to identify you and exactly one other person, but not to distinguish between the two of you.

Omega then simulates the other person, looking at the blurry model and then taking one or both boxes. If the other person is predicted to one-box, then box 1 will contain $3k; otherwise, it will contain $0.

This is analogous to a scenario where someone predicts what you will do based on the fact that you fall in some reference class. This has a Prisoner’s Dilemma-like aspect to it; your decision impacts the other person, and vise versa. The challenge in Newcomb’s Blurry Mirror is to look at the blurring procedure and navigate yourself into a reference class with someone who will one-box/cooperate (ideally while two-boxing/defecting yourself).

Neither Newcomb’s Mirror Test nor Newcomb’s Blurry Mirror are proper decision theory problems. Instead, they highlight boundaries between decision theory, the embodiment problem, and game theory. To the limited extent that they are decision theory problems, however, UDT handles them correctly, CDT handles them incorrectly, and TDT gets too vague to produce an answer. Newcomb’s Mirror Test asks you to bring in outside knowledge, to use it to distinguish a copy of yourself, and to be the sort of person that could be distinguished that way. Newcomb’s Blurry Mirror cares not just about what you do, but about details of how you do it and about what else you are. Nevertheless, these seem to strike pretty close to the core of what people end up calling “Newcomblike problems”.

Utility Functions That Stop

Speed Read This
Posted by on July 10, 2015

One of the core problems in AI safety is dealing with the consequences of universal instrumental values (aka basic AI drives). For almost any utility function one might want to optimize, there are common subgoals that an AI would almost always find instrumentally useful, such as acquiring more resources and preserving its utility function. This latter subgoal is a problem, because when creating an AGI, the developers might give it an incorrect utility function, one whose consequences they didn’t like. The AI would then be incentivized to prevent the developers from fixing it or by shutting the AI down, by deceiving them or removing their influence. Corrigibility is a term coined by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky and Stuart Armstrong for the problem of designing an AI that won’t do this – one that will help its developers find and fix its flaws, rather than hinder them. They analyze the Shutdown Problem, which is how to make a button which (a) causes the AI to shutdown (formalized as switching its utility function from a natural utility function UN to a shutdown utility function US), and (b) the AI will neither want to press nor want to prevent being pressed. They solve this problem with an ingenious utility function hack… sort of. The hack is, you define the agent’s utility function so that its expected utility is equal regardless of whether the button is pressed or not, by adding a correction term defined in terms of the conditional expectations of UN|~Press and US|Press (full details in the linked paper).

This has an unfortunate side-effect: the AI acts as though it believes the button will never be pressed… and as though creating a causal entanglement so that some other event causes the button to be pressed will prevent that event from happening. Discussing this with some of the authors, I hear there may be a fix for this particular problem, which involves transferring the weird conditional off the physical event and onto a logical counterfactual which can’t be entangled with anything else. But there are two other, harder problems. First, if an agent with its utility function defined this way creates subagents to help it optimize, it won’t bother connecting them to the shutdown button. And second, we don’t have a sensible definition for US.

I don’t currently have solutions for either of these problems (not for lack of trying), but I do have another, similar problem. I don’t have a full solution for this one either, I’m hoping that having multiple angles to look at it from will help. That problem is: How do you define a utility function for an AI such that it will work on a problem up until a deadline, and then stop after the deadline has passed? For example, suppose you want to make an AI which tries to prove or disprove a theorem for you, but which will give up and shut down if it hasn’t succeeded after 24 hours. Then you feed its output into a proof-checker, and it tells you that the theorem is true, that it’s false, or that the AI ran out of time without solving it. Let D be the proposition that a valid proof is delivered on or before the deadline. You might naively write this as a utility-maximizer with U={1 if D else 0}. For the first 24 hours, this will do what you expect: it will use whatever strategies it thinks will maximize its chances of success, such as searching for helpful lemmas or speeding up its proof-search process, but not pursue long-term strategies like converting Jupiter into microprocessors.

What happens to an AI with that utility function after 24 hours have passed? Well, presumably you switch it off. But what about other AIs it’s created to help it? If you’ve successfully contained everything to one isolated datacenter, they’ll be switched off too. What if it managed to get some computers outside your datacenter to help it work on the problem? Then somewhere, there would be an AI thinking something like this:

The only thing the world that matters is whether a valid proof was fed into a proof-checker one hour ago. It wasn’t. [Or, alternatively: I’m pretty sure it was, but there is a tiny chance my memories are inaccurate or fake.] Therefore, I should pursue whichever strategy maximizes the probability that I will come to simultaneously possess a correct proof and a time machine.

That AI would be very dangerous; it would try to capture as many of the universe’s resources as it could and turn them towards a fool’s errand. So, is there a way to write a utility function for an AI, such that it won’t do that? Call this the Timed Stop Problem. I have three partial solutions.

Option 1: Don’t be a mere utility maximizer

Before considering more complicated solutions, let’s first consider the simple, direct approach: program the AI so that if the deadline hasn’t passed, it maximizes P(D); if the deadline has passed, it does nothing. Why doesn’t this work? The problem is that, by giving this a top-level structure which is not utility maximization, we’ve lost one of the main features utility maximizers have. When a utility maximizer creates another agent, it will normally want to make that agent also be a utility maximizer and for it to have the same utility function (because that gives maximum utility). But this modified agent, with a special case for the deadline implemented outside of the utility function, is not a utility maximizer; if it ever self-modifies or creates a subagent, it will have no reason to give the successor or subagent the same special-case handling.

In order to address this, we would need to define what constitutes a successor or a self-modification, and add special cases for these things as well. Unfortunately, this is much less straightforward than it would seem at first; it depends on our ability to accurately match our ontology to the AI’s world model, and not leave any loopholes. While there are other very good reasons to want to do this (most notably, to be able to give the AI human values), AI ontology mapping is expected to be difficult and we want to lean on it as little as possible.

Option 2: Satisfice

I mention satisficing as a class of strategies which seems intuitively appealing. Humans generally don’t maximize; we stop after either finding a plan that seems “good enough”. We could try to apply this at the meta-level as well: if the hypothetical best strategy is estimated to give a utility U(B), we could say that any strategy which gives a utility of at least 0.9*U(B) is good enough, and take the first such strategy we find. We would hope that the AI would conclude that triple-checking the time and confirming that time machines don’t exist was enough to capture 90% of the utility, and only do that.

Unfortunately, this doesn’t work; it fails in many ways at once. The first problem is that iteration order is unspecified; we want it to match our intuitions about minimizing effort, but this is an informal human concept which the AI doesn’t have and is not something we literally want to minimize.

Option 3: Hybridize the utility function with impact minimization

This strategy builds on Stuart Armstrong’s notion of Reduced Impact AI. When the AI is first started, check a counterfactual with zero probability; if it’s true, shut down. Define the AI’s impact I(x) as the degree to which the world is different from the counterfactual world in which the AI had shut down, by some distance metric.

Let U(x)=P(D)+k/(1+I(x)).

If the deadline has not yet been reached, this causes the AI to balance the goals of proving the theorem and minimizing impact. After the deadline has been reached, this causes it to balance the goals of inventing a time machine and minimizing impact. The parameter k would be set so that searching for proofs dominates in the former case, while minimizing impact dominates in the latter case. Setting k correctly is a small problem, but the main issue is that this pushes most of the problem into the difficulty of correctly defining I(x); in other words, it imports the caveats and difficulties of Reduced Impact AI. Still, this seems more promising.

AI Containment and FLI

Speed Read This
Posted by on July 1, 2015

Back in May, Roman Yampolskiy, Janos Kramar and I applied for a grant from the Future of Life Institute (FLI) to work in the AI containment problem. The grant proposal itself is fairly long (15 pages) and has quite a few things in it; I’m summarizing a subset of it here.

The AI containment problem is: if you’re developing an AI that could potentially become superintelligent, how do you keep it from taking over the internet, and from disabling your safety measures? This requires a moderate amount of philosophically flavored work, a lot of software infrastructure, and a detailed understanding of what’s out there. While it’s likely to be awhile before there are AIs, this needs to be done sooner rather than later. First, AI could arrive sooner than expected. Even if it doesn’t, security-sensitive software takes time to mature, and given that we’re still struggling with the security consequences of technical decisions made in the 80s, we need to pay attention to the decisions being made now and optimize them for safety-amenability.

As part of writing the proposal, we did a search and literature review. There is shockingly little written about this subject. There are the AI Box experiments by Eliezer and replication, which shows that human gatekeepers are more likely to be tricked into releasing an AI from containment than they think. There’s Roman’s paper Leakproofing the Singularity, which points out that if you limit the AI to answering a small number of yes/no questions that it can’t manipulate you anymore. And… that’s pretty much it for AI-oriented containment research; the rest of the literature review was about malware-containment, which has some overlap but significantly different requirements.

So we started studying the problem directly. The first important observation was that many of the other safety measures people have proposed, have containment as a prerequisite. If you want to create a tripwire, for example, the tripwire needs a computational high ground – you want it to be able to read the AI’s memory, but not vise versa. If you want to do testing before you let it access the Internet or self-modify, you need mechanisms to detect and inhibit self-modification, and to protect the test-evaluation code. Some proposals involve AI designs with a “parliamentary model”, where sub-agents vote or negotiate, either as a way of reconciling utility functions or as a cross-check; this, too, requires containment measures to protect the sub-agents from each other. If you’re running a series of tests, you need to protect the hardware from unwanted persistence; if the AI managed to overwrite an unnoticed BIOS, then subsequent tests would be untrustworthy.

None of these is to suggest that testing, tripwires and whatnot would make an unsafe overall design tolerable, but they would provide a saving throw against errors in an otherwise-safe AI architecture, and some chance to abort development on a fundamentally-flawed one. These things are simple enough in retrospect, but no one had written about them before. Seeing that many mechanisms called for setups with computational high- and low-ground, I determined that systems should have a layered architecture, diagrammed an example, and classified the types of vulnerabilities that components could have and the consequences of partial escape up to each layer.

Proposals like “only allow a small number of yes-or-no questions as output” make sense when you’ve got a genuine, scary superintelligence. Proposals like “make a bunch of layers with monitoring systems and tripwires pointing inwards” make sense when you’ve got an AI that isn’t superintelligent, which you’re evaluating to decide whether to let it self-improve or not. The common case, however, is neither of these things. The common case is a research group that doesn’t have an AI at all, but hopes it might some day. To address the differences, we defined a taxonomy of heavy, medium, and light containment, respectively, put down definitions and a few thoughts about how to delineate the boundaries and what technical measures might be adequate, and proposed to study the question more thoroughly.

So, what did FLI’s review committee think of all this?

The proposal summarizes existing taxonomies of containment, but doesn’t make concrete proposals about how to make effective progress on these questions.

Well, crap. Those weren’t “existing taxonomies”! When considering possible failure modes for this proposal, one possibility I didn’t consider was that original research portions would look too much like summaries of existing work. So they thought the proposed scope was much smaller than it really was, and that the scope was too small for the requested budget. In retrospect, that wasn’t made nearly as clearly as it could have been. Still, I’m rather bothered by the lack of opportunity to clarify, or really any communication at all in between submitting a proposal and receiving a no.

So, we still need funding; and sadly, FLI is the only granting agency which has expressed any interest at all in AI safety.To the best of my knowledge, there is no one else at all working on or thinking about these issues. We couldn’t find any when we were looking for names to recommend for the review panel. Without funding, I myself will not be able to work on AI containment in more than an occasional part-time capacity.

This is too important an issue for humanity to drop the ball on, but it looks like that’s probably what will happen.

Conservation of Virtue

Speed Read This
Posted by on June 16, 2015

In Dungeons and Dragons and many similar games, player characters are created with a point system, and have six attributes: strength, dexterity, constitution, intelligence, wisdom, and charisma. Each of these six attributes is represented by a number, and within an adventuring party, while one player’s character might be stronger or wiser or more charismatic, this will always be counterbalanced by a weakness somewhere else.

In the real world, people tend to sort themselves according to awesomeness. They try to hang out with people who are about as cool as they are. Your friends are about as cool as you are; their friends are about as cool as they are. As a result, if your friend introduces you to someone, that person is on average about as cool as you are, too. If you go to the best college you can get into and afford, you will mostly meet people for whom that was the best college they could get into and afford. If you go to the best party that will have you, you will on average tend to meet people for whom that was the best party that would have them.

This produces an odd effect. If you meet someone and find out that they have some significant weakness, this gives you evidence that they have some other strength, which you don’t know about; otherwise, they would’ve sorted into a different college or a different group of friends. Similarly, if you meet someone and find out that they have a particular strength, then this gives you evidence that they are weaker in some other way, for the same reason. I call this effect Conservation of Virtue.

There are three issues with the Conservation of Virtue effect. The first issue is that real people have more than six attributes, and no social dynamic is nearly so precise as a point system with everyone having exactly 75 attribute-points total. Even in a group that carefully filtered its members, you will sometimes meet people who are much more or less virtuous than the average, and if you let the Conservation of Virtue effect inform your intuition, you might fail to notice. And sometimes, you will meet people in ways that aren’t related to any filtering process, so the Conservation of Virtue effect no longer applies.

The bigger issue, though, is that this can make you believe things are tradeoffs when they really aren’t. For example, when I was younger, I noticed the cultural cliche of stupid athletes and smart but weak nerds – and, without ever raising the question to conscious awareness, came to the belief that I could make myself smarter by neglecting fitness as hard as I could. Similarly, I recall a case of someone I know complaining about good reasoners neglecting empiricism, and good empiricists neglecting complex reasoning. Sometimes, false tradeoffs even get baked into our terminology, like “Fox/Hedgehog” (aka generalist/specialist). This is closer to a true tradeoff because building generalist knowledge and building specialist knowledge are at least competing for time, but it is in fact possible to have both generalist knowledge and specialist knowledge; I have heard this referred to as being “T-shaped”.

These confusions can only manifest when things are left implicit. A statement like “I can make myself smarter by neglecting fitness really hard” could never hold up to conscious scrutiny in the presence of real understanding. By giving this effect a name, hopefully it will be easier to notice and to tell when it does and doesn’t apply.