Utility Functions That Stop

Posted on July 10, 2015

One of the core problems in AI safety is dealing with the consequences of universal instrumental values (aka basic AI drives). For almost any utility function one might want to optimize, there are common subgoals that an AI would almost always find instrumentally useful, such as acquiring more resources and preserving its utility function. This latter subgoal is a problem, because when creating an AGI, the developers might give it an incorrect utility function, one whose consequences they wouldn't like. The AI would then be incentivized to prevent the developers from fixing its utility function or shutting it down, for example by deceiving them or by removing their influence.

Corrigibility is a term coined by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky and Stuart Armstrong for the problem of designing an AI that won't do this – one that will help its developers find and fix its flaws, rather than hinder them. They analyze the Shutdown Problem: how to make a button which (a) causes the AI to shut down (formalized as switching its utility function from a natural utility function U_N to a shutdown utility function U_S), and (b) which the AI will neither want to press nor want to prevent from being pressed. They solve this problem with an ingenious utility function hack… sort of. The hack is, you define the agent's utility function so that its expected utility is equal regardless of whether the button is pressed, by adding a correction term defined in terms of the conditional expectations of U_N|~Press and U_S|Press (full details in the linked paper).
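Roughly, the shape of that hack looks like the following sketch. This is my paraphrase of the description above, not the paper's exact construction (the paper is considerably more careful about how the conditional expectations are taken), and all the names here are mine:

```python
def corrected_utility(outcome, button_pressed, U_N, U_S,
                      E_UN_given_no_press, E_US_given_press):
    """Utility-indifference hack, roughly: the agent gets U_N if the button
    is never pressed, and U_S plus a correction term if it is pressed, so
    that its expected utility is the same either way and it has no incentive
    to press the button or to prevent it from being pressed."""
    if not button_pressed:
        return U_N(outcome)
    correction = E_UN_given_no_press - E_US_given_press
    return U_S(outcome) + correction
```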

This has an unfortunate side-effect: the AI acts as though it believes the button will never be pressed… and as though setting up a causal entanglement, so that some other event causes the button to be pressed, would prevent that other event from happening. From discussing this with some of the authors, I gather there may be a fix for this particular problem, which involves transferring the weird conditional off the physical event and onto a logical counterfactual which can't be entangled with anything else. But there are two other, harder problems. First, if an agent with its utility function defined this way creates subagents to help it optimize, it won't bother connecting them to the shutdown button. And second, we don't have a sensible definition for U_S.

I don’t currently have solutions for either of these problems (not for lack of trying), but I do have another, similar problem. I don’t have a full solution for this one either, but I’m hoping that having multiple angles to look at it from will help. That problem is: How do you define a utility function for an AI such that it will work on a problem up until a deadline, and then stop after the deadline has passed? For example, suppose you want to make an AI which tries to prove or disprove a theorem for you, but which will give up and shut down if it hasn’t succeeded after 24 hours. Then you feed its output into a proof-checker, and it tells you that the theorem is true, that it’s false, or that the AI ran out of time without solving it. Let D be the proposition that a valid proof is delivered on or before the deadline. You might naively write this as a utility maximizer with U = {1 if D else 0}. For the first 24 hours, this will do what you expect: it will use whatever strategies it thinks will maximize its chances of success, such as searching for helpful lemmas or speeding up its proof-search process, but it won’t pursue long-term strategies like converting Jupiter into microprocessors.
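Written out as code, the naive version is something like this minimal sketch (the world-model hook `proof_delivered_by` is my own stand-in for however the agent evaluates D, not anything from an existing system):

```python
import time

DEADLINE = time.time() + 24 * 60 * 60  # 24 hours from startup

def utility(world_history):
    """U = 1 if a valid proof reached the proof-checker on or before the
    deadline (the event D), else 0."""
    return 1.0 if world_history.proof_delivered_by(DEADLINE) else 0.0

def choose_strategy(strategies, expected_utility):
    # The agent picks whichever available strategy maximizes E[U].
    return max(strategies, key=expected_utility)
```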

What happens to an AI with that utility function after 24 hours have passed? Well, presumably you switch it off. But what about other AIs it’s created to help it? If you’ve successfully contained everything to one isolated datacenter, they’ll be switched off too. What if it managed to get some computers outside your datacenter to help it work on the problem? Then somewhere, there would be an AI thinking something like this:

The only thing in the world that matters is whether a valid proof was fed into a proof-checker one hour ago. It wasn’t. [Or, alternatively: I’m pretty sure it was, but there is a tiny chance my memories are inaccurate or fake.] Therefore, I should pursue whichever strategy maximizes the probability that I will come to simultaneously possess a correct proof and a time machine.

That AI would be very dangerous; it would try to capture as many of the universe’s resources as it could and turn them towards a fool’s errand. So, is there a way to write a utility function for an AI, such that it won’t do that? Call this the Timed Stop Problem. I have three partial solutions.

Option 1: Don’t be a mere utility maximizer

Before considering more complicated solutions, let’s first consider the simple, direct approach: program the AI so that if the deadline hasn’t passed, it maximizes P(D); if the deadline has passed, it does nothing. Why doesn’t this work? The problem is that, by giving this a top-level structure which is not utility maximization, we’ve lost one of the main features utility maximizers have. When a utility maximizer creates another agent, it will normally want that agent to also be a utility maximizer with the same utility function (because that is what maximizes its own expected utility). But this modified agent, with a special case for the deadline implemented outside of the utility function, is not a utility maximizer; if it ever self-modifies or creates a subagent, it will have no reason to give the successor or subagent the same special-case handling.
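To make the failure mode concrete, here is a minimal sketch of that top-level structure. The `choose_best_action` and `execute` hooks are hypothetical stand-ins for the agent's actual machinery:

```python
import time

def run_agent(deadline, choose_best_action, execute):
    """Option 1: the deadline check lives in the top-level loop, outside the
    utility function. Before the deadline, act to maximize P(D); after it,
    do nothing. Nothing here obliges a subagent the AI builds to inherit
    this special case: the subagent only needs to serve P(D), which is all
    the utility function itself talks about."""
    while time.time() < deadline:
        execute(choose_best_action())  # choose_best_action maximizes P(D)
    # Deadline passed: do nothing, forever.
```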

In order to address this, we would need to define what constitutes a successor or a self-modification, and add special cases for these things as well. Unfortunately, this is much less straightforward than it would seem at first; it depends on our ability to accurately match our ontology to the AI’s world model, and not leave any loopholes. While there are other very good reasons to want to do this (most notably, to be able to give the AI human values), AI ontology mapping is expected to be difficult and we want to lean on it as little as possible.

Option 2: Satisfice

I mention satisficing as a class of strategies which seems intuitively appealing. Humans generally don’t maximize; we stop after finding a plan that seems “good enough”. We could try to apply this at the meta-level as well: if the hypothetical best strategy B is estimated to give a utility U(B), we could say that any strategy which gives a utility of at least 0.9*U(B) is good enough, and take the first such strategy we find. We would hope that the AI would conclude that triple-checking the time and confirming that time machines don’t exist was enough to capture 90% of the utility, and only do that.
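As a sketch, that selection rule looks something like the following (the names and the enumeration are mine; as discussed next, the enumeration order is precisely what is underspecified):

```python
def satisfice(candidate_strategies, estimated_utility, threshold=0.9):
    """Take the first strategy whose estimated utility is at least
    `threshold` times that of the best strategy found. The order in which
    `candidate_strategies` is enumerated is doing a lot of unstated work
    here, which is the problem discussed below."""
    candidates = list(candidate_strategies)
    best = max(estimated_utility(s) for s in candidates)
    for s in candidates:
        if estimated_utility(s) >= threshold * best:
            return s
```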

Unfortunately, this doesn’t work; it fails in several ways at once. The first problem is that the order in which the AI enumerates candidate strategies is unspecified; we want it to match our intuitions about minimizing effort, but “effort” is an informal human concept which the AI doesn’t have, and not something we literally want to minimize.

Option 3: Hybridize the utility function with impact minimization

This strategy builds on Stuart Armstrong’s notion of Reduced Impact AI. When the AI is first started, it checks a counterfactual condition which in fact has probability zero; if the condition held, it would shut down immediately. Define the AI’s impact I(x) as the degree to which the world differs, by some distance metric, from the counterfactual world in which the AI shut down.

Let U(x) = P(D) + k/(1 + I(x)).

If the deadline has not yet been reached, this causes the AI to balance the goals of proving the theorem and minimizing impact. After the deadline has been reached, this causes it to balance the goals of inventing a time machine and minimizing impact. The parameter k would be set so that searching for proofs dominates in the former case, while minimizing impact dominates in the latter case. Setting k correctly is a small problem, but the main issue is that this pushes most of the problem into the difficulty of correctly defining I(x); in other words, it imports the caveats and difficulties of Reduced Impact AI. Still, this seems more promising.
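Here is that hybrid as a sketch, with the hard parts left as black-box hooks. `prob_proof_delivered` and `impact` are my stand-ins; defining the latter well is where the real difficulty lives:

```python
def hybrid_utility(world, k, prob_proof_delivered, impact):
    """Option 3: U(x) = P(D) + k / (1 + I(x)).
    `prob_proof_delivered` estimates P(D) in the given world; `impact`
    measures distance from the counterfactual world in which the AI shut
    down at startup. Both are stand-ins, and the impact term inherits the
    caveats and difficulties of Reduced Impact AI."""
    return prob_proof_delivered(world) + k / (1.0 + impact(world))
```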
