Category: AI Safety

Decentralized AGIs, or Singleton?

Posted on February 19, 2016

When predicting and planning for coming decades, we classify futures different ways based on what happens with artificial general intelligence. There could be a hard take-off, where soon after an AGI is created it self-improves to become extraordinarily powerful, or a soft take-off, where progress is more gradual. There could be a singleton – a single AGI, or a single group-with-AGI, which uses AGI to become much more powerful than everyone else, or things could be decentralized, with lots of AGIs or lots of groups and individuals that have AGIs.

The soft- vs hard-takeoff question is a matter of prediction; either there is a level of intelligence which enables rapid recursive self-improvement, or there isn’t, and we can study this question but we can’t do much to change the answer one way or the other. Whether AGI is decentralized or a singleton, however, can be a choice. If a team crosses the finish line and creates a working AGI, and they think decentralized control will lead to a better future, then they can share it with everyone. If multiple teams are close to finishing but they think a singleton will lead to a better future, then they can (we hope) join forces and cross the finish line together.

There are things to worry about and try to prepare for in singleton-AGI futures, and things to worry about and prepare for in decentralized-AGI futures, and these are quite different from each other. Which is better, and which will actually happen? I think a lot of people talking about AGI and AGI safety end up talking past each other, because they are imagining different answers to this question and envisioning different futures. So let’s consider two futures. Both will be good futures, where everything went right. One will be a singleton future, and the other will be a decentralized future.

Let’s look at a singleton future, starting with a version of that future in which everything went right. There are some who want to make – or for others to make – a single, very powerful AGI. They want to design it in such a way that it will respect everyone’s rights and preferences, be impossible for anyone to hijack, and be amazingly good at getting us what we want. In a world where this was executed perfectly, if I wanted something, then the AGI would help me get it. If two people wanted things that were incompatible, then somewhere in the AGI’s programming would be a rule which decides who wins. Philosophers have a lot to say about what that rule would be, and about how to resolve situations when people’s preferences are inconsistent or would change if they knew more. In the world where everything went right, all of those puzzles were solved conclusively, and the answers were programmed into the AGI. The theory of how intelligence works was built up and carefully verified, and all the AGI experts agreed that the AGI would do what all the philosophers and AGI experts together agreed was right. Then the AGI would take over the world, and everyone would be happy about it, at least in retrospect when they saw what happened next.

On the other hand, there are a lot of ways for this to go wrong. If someone were to say they’d built an AGI and they wanted to make it a singleton, we’d all be justifiably skeptical. For one thing, they could be lying, and building a different AGI to benefit only themselves, rather than to benefit everyone. But even the very best intentions aren’t necessarily enough. A major takeaway from MIRI and FHI’s research on the subject is that there’s a very real risk of trying to make something universally benevolent, but getting it disastrously wrong. This is an immensely difficult problem. Hence their emphasis on using formal math: when something is mathematically proven, it’s true, which removes one more place where a mistake could slip in. There’s a social coordination problem, to make sure that whoever is first to create an AGI makes one that will benefit everyone; another social coordination problem, to make sure that people aren’t racing to be first-to-finish in a way that causes them to cut corners; and a whole lot of technical problems. Any one of these things could easily fail.

So how about a world with decentralized AGI – that is, one where everyone (or every company) has an AGI of their own, which they’ve configured to serve their own values? Again, we’ll start with the version in which everything goes right. First of all, in this world, there is no hard take-off, and especially no delayed hard take-off. If recursive self-improvement is a thing that can happen, then any balance of power is doomed to collapse and be replaced with a singleton as soon as one AGI manages to do it. And second, the set of other (non-AGI) technologies needs to work out in a particular way to make a stable power equilibrium possible. As an analogy, consider what would happen if every individual person had access to nuclear weapons. We would expect things to turn out very badly. Luckily, nuclear weapons require rare materials and difficult technologies, which makes it possible to restrict access to a small number of groups who have all more-or-less agreed to never use them. In a hypothetical alternate universe where anyone could make a nuclear weapon using only sand, controlling them would be impossible, and that hypothetical alternate universe would probably be doomed. Similarly, our decentralized-AGI world can’t have any technologies like sand-nuke world’s, or it will collapse as soon as AGIs get smart enough to independently rediscover the secret. Alternatively, that world could build a coordination mechanism where everyone is monitored closely enough to make sure they aren’t pursuing any known or suspected dangerous technologies.

The problems in singleton-AGI world were mostly technical: the creators of the AGI might screw it up. In decentralized-AGI world, the problems mostly come from the shape of the technology landscape. We don’t know whether recursive self-improvement is possible, but if it is, then decentralized-AGI worlds aren’t likely to work out. We don’t know if making-nukes-from-sand is a possible sort of thing, but if anything like that is possible, then the bar for how good the world’s institutions will have to be to prevent disaster will be very high. These things are especially worrying because they aren’t things we can influence; they’re just facts about physics and its implications which we don’t know the answers to yet.

Suppose we make optimistic assumptions. Recursive self-improvement turns out not to be possible, the balance of technologies favors defense over offense, and our AGI representatives get together, form institutions, and enforce laws and agreements that prevent anything truly horrible from happening. There is still a problem. It’s the same problem that happens when humans get together and try to make institutions, laws and agreements. The problem is local incentives.

Any human with above room temperature IQ can design a utopia. The reason our current system isn’t a utopia is that it wasn’t designed by humans. Just as you can look at an arid terrain and determine what shape a river will one day take by assuming water will obey gravity, so you can look at a civilization and determine what shape its institutions will one day take by assuming people will obey incentives.

But that means that just as the shapes of rivers are not designed for beauty or navigation, but rather an artifact of randomly determined terrain, so institutions will not be designed for prosperity or justice, but rather an artifact of randomly determined initial conditions.

– Scott Alexander, Meditations on Moloch

If we give everyone their own AGIs, then the way the future turns out depends on the landscape of incentives. That isn’t an easy thing to change, although it isn’t impossible. Nor is it an easy thing to predict, though some have certainly tried (for example, Robin Hanson’s The Age of Em). We can imagine nudging things in such a way that, as civilization flows downhill, it goes this way instead of that and ends up in a good future.

The problem is that, at the bottom of the hill as best I understand it, there are bad futures.

This isn’t something I can be confident in. Predicting the future is extremely hard, and where the far future is concerned, everything is uncertain. Maybe we could find a way to make having huge numbers of smarter-than-human AIs safe, and steer humanity from there to a good future. But for this sort of strategy, uncertainty is not our friend. If there were some reason to expect this sort of future to turn out well, or some strategy to make it turn out well, then the same uncertainty that makes me doubt my prediction that it will turn out badly would also make us doubt our belief that it will turn out well.

So how do these two scenarios compare? To make a good future with a singleton AGI in it, humanity has to solve immensely difficult technical and social coordination problems, without making any mistakes. To make a good future with decentralized AGI in it, humanity has to… get lucky, and find out that physics does not allow for recursive self-improvement or certain other classes of dangerous technologies.

I find the idea of building an AGI singleton intuitively unappealing and unaesthetic. It goes against my egalitarian instinct. It creates a single point of failure for all of humanity. On the other hand, giving everyone their own, decentralized AGIs is terrifying. Reckless. I can’t imagine any decentralized-AI scenarios that aren’t insanely risky gambles. So I favor humanity building a singleton, and AGI research being less than fully open.

An Idea For Corrigible, Recursively Improving Math Oracles

Posted on July 19, 2015

A math oracle is a special kind of AI which answers math questions, but isn’t a maximizer of anything. I posted an idea for how to make one, and how to make it corrigible, on the Agent Foundations forum.

Utility Functions That Stop

Posted on July 10, 2015

One of the core problems in AI safety is dealing with the consequences of universal instrumental values (aka basic AI drives). For almost any utility function one might want to optimize, there are common subgoals that an AI would almost always find instrumentally useful, such as acquiring more resources and preserving its utility function. This latter subgoal is a problem, because when creating an AGI, the developers might give it an incorrect utility function, one whose consequences they didn’t like. The AI would then be incentivized to prevent the developers from fixing it or shutting it down, whether by deceiving them or by removing their influence. Corrigibility is a term coined by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky and Stuart Armstrong for the problem of designing an AI that won’t do this – one that will help its developers find and fix its flaws, rather than hinder them. They analyze the Shutdown Problem, which is how to make a button which (a) causes the AI to shut down (formalized as switching its utility function from a natural utility function U_N to a shutdown utility function U_S), and (b) the AI will neither want to press nor want to prevent being pressed. They solve this problem with an ingenious utility function hack… sort of. The hack is to define the agent’s utility function so that its expected utility is equal regardless of whether the button is pressed or not, by adding a correction term defined in terms of the conditional expectations E[U_N | ~Press] and E[U_S | Press] (full details in the linked paper).
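Roughly, the corrected utility function has this shape (my paraphrase of the utility-indifference construction; see the paper for the exact formalism):

    U = U_N                                      if the button is not pressed
    U = U_S + E[U_N | ~Press] - E[U_S | Press]   if the button is pressed

The correction term is a constant from the agent’s point of view, chosen so that its expected utility comes out the same whether or not the button gets pressed; that equality is what removes its incentive either to press the button or to protect it.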

This has an unfortunate side-effect: the AI acts as though it believes the button will never be pressed… and as though creating a causal entanglement so that some other event causes the button to be pressed will prevent that event from happening. Discussing this with some of the authors, I hear there may be a fix for this particular problem, which involves transferring the weird conditional off the physical event and onto a logical counterfactual which can’t be entangled with anything else. But there are two other, harder problems. First, if an agent with its utility function defined this way creates subagents to help it optimize, it won’t bother connecting them to the shutdown button. And second, we don’t have a sensible definition for U_S.

I don’t currently have solutions for either of these problems (not for lack of trying), but I do have another, similar problem. I don’t have a full solution for this one either, but I’m hoping that having multiple angles to look at it from will help. That problem is: How do you define a utility function for an AI such that it will work on a problem up until a deadline, and then stop after the deadline has passed? For example, suppose you want to make an AI which tries to prove or disprove a theorem for you, but which will give up and shut down if it hasn’t succeeded after 24 hours. Then you feed its output into a proof-checker, and it tells you that the theorem is true, that it’s false, or that the AI ran out of time without solving it. Let D be the proposition that a valid proof is delivered on or before the deadline. You might naively write this as a utility-maximizer with U = {1 if D else 0}. For the first 24 hours, this will do what you expect: it will use whatever strategies it thinks will maximize its chances of success, such as searching for helpful lemmas or speeding up its proof-search process, but not pursue long-term strategies like converting Jupiter into microprocessors.
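For concreteness, here is a minimal sketch of that naive utility in Python. The Outcome type, the deadline value, and the stub proof checker are stand-ins I’m inventing for illustration; they aren’t part of the post.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    DEADLINE = datetime(2015, 7, 11, 12, 0)  # illustrative: 24 hours after launch

    @dataclass
    class Outcome:
        proof: Optional[str]              # whatever the AI delivered, or None
        delivered_at: Optional[datetime]  # when it was delivered, or None

    def proof_checker_accepts(proof: str) -> bool:
        # Stand-in for a real proof checker.
        return proof.startswith("QED:")

    def naive_utility(outcome: Outcome) -> float:
        # U = 1 if D else 0, where D = "a valid proof was delivered on or
        # before the deadline".
        d = (outcome.proof is not None
             and outcome.delivered_at is not None
             and outcome.delivered_at <= DEADLINE
             and proof_checker_accepts(outcome.proof))
        return 1.0 if d else 0.0

Note that nothing in this function says anything about what the agent should do, or refrain from doing, once the deadline is in the past; that gap is what the rest of this post is about.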

What happens to an AI with that utility function after 24 hours have passed? Well, presumably you switch it off. But what about other AIs it’s created to help it? If you’ve successfully contained everything to one isolated datacenter, they’ll be switched off too. What if it managed to get some computers outside your datacenter to help it work on the problem? Then somewhere, there would be an AI thinking something like this:

The only thing in the world that matters is whether a valid proof was fed into a proof-checker one hour ago. It wasn’t. [Or, alternatively: I’m pretty sure it was, but there is a tiny chance my memories are inaccurate or fake.] Therefore, I should pursue whichever strategy maximizes the probability that I will come to simultaneously possess a correct proof and a time machine.

That AI would be very dangerous; it would try to capture as many of the universe’s resources as it could and turn them towards a fool’s errand. So, is there a way to write a utility function for an AI, such that it won’t do that? Call this the Timed Stop Problem. I have three partial solutions.

Option 1: Don’t be a mere utility maximizer

Before considering more complicated solutions, let’s first consider the simple, direct approach: program the AI so that if the deadline hasn’t passed, it maximizes P(D); if the deadline has passed, it does nothing. Why doesn’t this work? The problem is that, by giving this a top-level structure which is not utility maximization, we’ve lost one of the main features utility maximizers have. When a utility maximizer creates another agent, it will normally want to make that agent also be a utility maximizer and for it to have the same utility function (because that gives maximum utility). But this modified agent, with a special case for the deadline implemented outside of the utility function, is not a utility maximizer; if it ever self-modifies or creates a subagent, it will have no reason to give the successor or subagent the same special-case handling.
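Here is a minimal sketch of that top-level structure, with the planner stubbed out (all names are illustrative, not from the post):

    from datetime import datetime

    DEADLINE = datetime(2015, 7, 11, 12, 0)  # illustrative

    def best_action_maximizing_p_of_d(world_model):
        # Stand-in for the actual planner; in reality this is the whole AI.
        return "search for lemmas"

    def act(world_model, now: datetime):
        # The deadline check lives OUTSIDE the utility function: it is a
        # special case bolted onto the agent's top-level loop, not something
        # the agent itself values.
        if now > DEADLINE:
            return None  # shut down / do nothing
        return best_action_maximizing_p_of_d(world_model)

The failure mode is visible in the comments: nothing in P(D) rewards copying the if-deadline-has-passed wrapper into any successor or subagent the planner builds, so successors come out as plain P(D)-maximizers.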

In order to address this, we would need to define what constitutes a successor or a self-modification, and add special cases for these things as well. Unfortunately, this is much less straightforward than it would seem at first; it depends on our ability to accurately match our ontology to the AI’s world model, and not leave any loopholes. While there are other very good reasons to want to do this (most notably, to be able to give the AI human values), AI ontology mapping is expected to be difficult and we want to lean on it as little as possible.

Option 2: Satisfice

I mention satisficing as a class of strategies which seems intuitively appealing. Humans generally don’t maximize; we stop after finding a plan that seems “good enough”. We could try to apply this at the meta-level as well: if the hypothetical best strategy is estimated to give a utility U(B), we could say that any strategy which gives a utility of at least 0.9*U(B) is good enough, and take the first such strategy we find. We would hope that the AI would conclude that triple-checking the time and confirming that time machines don’t exist was enough to capture 90% of the utility, and only do that.
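As a sketch, the meta-level satisficing rule might look like this (illustrative names; best_estimate plays the role of U(B)):

    def satisficing_choice(candidate_plans, estimated_utility, best_estimate,
                           threshold=0.9):
        # Take the first plan whose estimated utility clears 90% of the best
        # achievable estimate -- "first" meaning first in whatever order the
        # plans happen to be generated, which is exactly the underspecified
        # part discussed below.
        for plan in candidate_plans:
            if estimated_utility(plan) >= threshold * best_estimate:
                return plan
        return None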

Unfortunately, this doesn’t work; it fails in many ways at once. The first problem is that the iteration order is unspecified; we would want it to match our intuitions about minimizing effort, but “effort” is an informal human concept which the AI doesn’t have, and not something we literally want it to minimize anyway.

Option 3: Hybridize the utility function with impact minimization

This strategy builds on Stuart Armstrong’s notion of Reduced Impact AI. When the AI is first started, check a counterfactual with zero probability; if it’s true, shut down. Define the AI’s impact I(x) as the degree to which the world is different from the counterfactual world in which the AI had shut down, by some distance metric.

Let U(x)=P(D)+k/(1+I(x)).

If the deadline has not yet been reached, this causes the AI to balance the goals of proving the theorem and minimizing impact. After the deadline has been reached, this causes it to balance the goals of inventing a time machine and minimizing impact. The parameter k would be set so that searching for proofs dominates in the former case, while minimizing impact dominates in the latter case. Setting k correctly is a small problem, but the main issue is that this pushes most of the problem into the difficulty of correctly defining I(x); in other words, it imports the caveats and difficulties of Reduced Impact AI. Still, this seems more promising.
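In code the hybrid utility itself is a one-liner; as the text says, all of the real difficulty hides inside I(x). Here p_proof_delivered and impact are stand-ins for estimators we don’t know how to build:

    def hybrid_utility(x, k, p_proof_delivered, impact):
        # U(x) = P(D) + k / (1 + I(x))
        # p_proof_delivered(x): estimated probability that a valid proof is
        #   delivered by the deadline, in the world described by x.
        # impact(x): distance between x and the counterfactual world in which
        #   the AI shut down at startup (the hard part; see Reduced Impact AI).
        return p_proof_delivered(x) + k / (1.0 + impact(x))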

Shovel-Ready AGI Safety Work for Programmers

Posted on May 29, 2015

The AI Containment (or Boxing) problem is, given an artificial general intelligence or a prospective artificial general intelligence, how do you keep it from influencing the world? Containment is vital to safety, because without it, other safety strategies fail to work. If you have an AGI that you’re still testing, you need to keep it securely contained, or else when you find a dangerous bug, it will be too late. If you have a tripwire that’s supposed to warn you if the AGI is planning something dangerous, you need to keep it securely contained, or else it will disable the tripwire or intercept the warning. If you have a multi-part system where your two AGIs have to agree before they proceed with a plan, you have to keep them contained, or else one of the AGIs may decide to remove the other’s veto.

Others have observed that AGI containment isn’t a sufficient safety measure, because an AGI could convince its gatekeepers to let it out. While this is true, what it means is that containment is necessary, but not sufficient. And containment has one major advantage over most other subproblems of AGI safety: we know what needs to be done.

I recently applied for a grant to work on the AGI containment problem. If it’s granted, I’ll be working on the problem full time starting in September, with two other collaborators. In the meantime, I’ll be spending as much of my spare time on it as I can, because despite the problem being massively important, to the best of my knowledge the three of us are the only people in the world working on it. In the process of writing that grant application, I spent some time exploring the space of possible containment strategies and technologies, and discovered a massive amount of important work to be done.

There should be more people working on this. If you have programming skill, you can get started right now. Here, I present some software projects that need doing.

  1. Write a Linux kernel module for a tripwired network interface: a device which looks like an inactive network interface from userspace, but which, rather than being a real interface, triggers a tripwire/alarm when anything tries to use it.
  2. Automated seccomp configuration: Write a tool that takes output from strace (a list of syscalls) and uses it to generate a BPF filter for Linux’s seccomp mode which allows those syscalls and no others. (A rough sketch follows this list.)
  3. Fuzz test the Torch7 library interfaces for buffer overflows.
  4. Write a library that uses the Intel CPU performance counters to detect suspicious instruction mixes (e.g. RowHammer).
  5. GPU API filtering. Look through a GPGPU library such as CUDA, identify a relatively-safe subset which preserves most of the important functionality, and write a wrapper which runs in a separate process and provides only that subset via an IPC interface.
  6. Work on anti-steganography techniques for images, audio and video. That is, make a tool such that if an AGI draws a picture, the gatekeeper can run it through the tool, publish the picture, and be confident that no one will find a hidden message inside it later.
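To make project 2 concrete, here is a rough sketch of what such a tool might start as. Assumptions on my part: the input is default-format strace output, and the emitted allowlist uses the Docker/OCI-style seccomp profile JSON layout (double-check the exact schema against whatever consumer you target); generating the raw BPF program itself would be a further step.

    #!/usr/bin/env python3
    """Sketch: turn strace output into a seccomp allowlist (project 2 above)."""
    import json
    import re
    import sys

    # A default strace line starts with the syscall name followed by "(", e.g.
    #   openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
    # With -f, lines may be prefixed by a PID; unfinished/resumed lines and
    # signal lines are skipped by this sketch but matter for a real tool.
    SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")

    def syscalls_from_strace(lines):
        names = set()
        for line in lines:
            m = SYSCALL_RE.match(line.strip())
            if m:
                names.add(m.group(1))
        return sorted(names)

    def seccomp_profile(names):
        # Deny-by-default profile that allows only the observed syscalls.
        return {
            "defaultAction": "SCMP_ACT_ERRNO",
            "syscalls": [{"names": names, "action": "SCMP_ACT_ALLOW"}],
        }

    if __name__ == "__main__":
        json.dump(seccomp_profile(syscalls_from_strace(sys.stdin)),
                  sys.stdout, indent=2)
        sys.stdout.write("\n")

Usage would be something like: strace -f -o trace.log ./target, then feed trace.log to the script on stdin and save the JSON it prints. A real version would also need to exercise the target thoroughly enough that the trace covers all of its legitimate behavior, or the resulting filter will kill it in normal operation.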

If you’re convinced that AGI safety is important but have had trouble finding an affordance to actually work on it, hopefully this will help you find a project. These are things that I am not planning to do myself, because I already have a long list of things I’m doing that are this good or better. If no one has jumped in here to say they’re doing it, it probably isn’t getting done.