Back in May, Roman Yampolskiy, Janos Kramar and I applied for a grant from the Future of Life Institute (FLI) to work on the AI containment problem. The grant proposal itself is fairly long (15 pages) and has quite a few things in it; I’m summarizing a subset of it here.
The AI containment problem is: if you’re developing an AI that could potentially become superintelligent, how do you keep it from taking over the internet and from disabling your safety measures? This requires a moderate amount of philosophically flavored work, a lot of software infrastructure, and a detailed understanding of what’s out there. While it’s likely to be a while before any AI is capable of this, the work needs to be done sooner rather than later. First, AI could arrive sooner than expected. Even if it doesn’t, security-sensitive software takes time to mature, and given that we’re still struggling with the security consequences of technical decisions made in the 80s, we need to pay attention to the decisions being made now and optimize them for safety-amenability.
As part of writing the proposal, we did a search and literature review. There is shockingly little written about this subject. There are the AI Box experiments by Eliezer and their replications, which show that human gatekeepers are more likely to be tricked into releasing an AI from containment than they think. There’s Roman’s paper Leakproofing the Singularity, which points out that if you limit the AI to answering a small number of yes/no questions, it can’t manipulate you anymore. And… that’s pretty much it for AI-oriented containment research; the rest of the literature review was about malware containment, which has some overlap but significantly different requirements.
So we started studying the problem directly. The first important observation was that many of the other safety measures people have proposed have containment as a prerequisite. If you want to create a tripwire, for example, the tripwire needs a computational high ground – you want it to be able to read the AI’s memory, but not vice versa. If you want to do testing before you let it access the Internet or self-modify, you need mechanisms to detect and inhibit self-modification, and to protect the test-evaluation code. Some proposals involve AI designs with a “parliamentary model”, where sub-agents vote or negotiate, either as a way of reconciling utility functions or as a cross-check; this, too, requires containment measures to protect the sub-agents from each other. And if you’re running a series of tests, you need to protect the hardware from unwanted persistence; if the AI managed to overwrite the BIOS without anyone noticing, then subsequent tests would be untrustworthy.
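To make the computational-high-ground idea concrete, here’s a minimal sketch (my illustration only, not something from the proposal): a monitor process that watches an untrusted workload from the outside using psutil. The memory threshold, the tripwire conditions, and the untrusted_workload.py filename are all placeholders, and a real containment setup would need far stronger isolation than same-host process monitoring.

```python
# Minimal tripwire sketch (illustrative): the monitor runs outside the
# contained workload, can observe it, and kills it when a condition trips.
import subprocess
import time
import psutil

MEMORY_LIMIT_BYTES = 512 * 1024 * 1024   # illustrative threshold
CHECK_INTERVAL_SECONDS = 0.5

def run_contained(command):
    """Run an untrusted workload and watch it from outside."""
    child = subprocess.Popen(command)     # the contained workload
    watched = psutil.Process(child.pid)   # handle for observing its state
    try:
        while child.poll() is None:
            # Tripwire conditions: excessive memory use, or spawning
            # unexpected subprocesses. The intended asymmetry is that the
            # monitor can see the workload, but not the other way around.
            if watched.memory_info().rss > MEMORY_LIMIT_BYTES:
                child.kill()
                return "tripped: memory limit exceeded"
            if watched.children():
                child.kill()
                return "tripped: unexpected subprocess"
            time.sleep(CHECK_INTERVAL_SECONDS)
    except psutil.NoSuchProcess:
        pass  # the workload exited between checks
    return "exited normally"

if __name__ == "__main__":
    print(run_contained(["python", "untrusted_workload.py"]))
```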
None of this is to suggest that testing, tripwires and whatnot would make an unsafe overall design tolerable, but they would provide a saving throw against errors in an otherwise-safe AI architecture, and some chance to abort development on a fundamentally flawed one. These things are simple enough in retrospect, but no one had written about them before. Seeing that many of these mechanisms called for setups with computational high- and low-ground, I concluded that containment systems should have a layered architecture, diagrammed an example, and classified the types of vulnerabilities that components could have and the consequences of partial escape up to each layer.
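Roughly, the layered view can be sketched like this (the layers and consequences here are illustrative examples of mine, not the actual diagram from the proposal): each layer a partially escaping AI reaches grants it a different set of capabilities, which is what the inward-pointing monitoring at the next layer out is there to catch.

```python
# Illustrative containment layers, innermost first. Escaping out to a given
# layer grants the listed capability; monitors at each layer watch inward.
CONTAINMENT_LAYERS = [
    ("AI process sandbox",     "tamper with its own test harness"),
    ("virtual machine",        "read or alter other processes on the guest"),
    ("host operating system",  "persist in firmware or on disk outside the VM"),
    ("air-gapped lab network", "reach the other machines used for evaluation"),
    ("the Internet",           "containment has failed entirely"),
]

def escape_consequences(reached_layer: str) -> list:
    """Everything a partial escape out to `reached_layer` would allow."""
    gained = []
    for name, capability in CONTAINMENT_LAYERS:
        gained.append(capability)
        if name == reached_layer:
            return gained
    raise ValueError(f"unknown layer: {reached_layer}")

print(escape_consequences("host operating system"))
```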
Proposals like “only allow a small number of yes-or-no questions as output” make sense when you’ve got a genuine, scary superintelligence. Proposals like “make a bunch of layers with monitoring systems and tripwires pointing inwards” make sense when you’ve got an AI that isn’t superintelligent yet, which you’re evaluating to decide whether to let it self-improve. The common case, however, is neither of these: it’s a research group that doesn’t have an AI at all, but hopes it might some day. To address these differences, we defined a taxonomy of heavy, medium, and light containment, corresponding to those three cases; we put down definitions and a few thoughts about how to delineate the boundaries and what technical measures might be adequate for each, and proposed to study the question more thoroughly.
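As a rough sketch of that three-way split (the level names come from the proposal; the decision rule and the measures listed in the comments are just illustrative guesses of what might be adequate, not the proposal’s definitions):

```python
# Hedged sketch of the heavy/medium/light taxonomy; thresholds and measures
# in the comments are illustrative only.
from enum import Enum

class ContainmentLevel(Enum):
    LIGHT = "light"    # research code that isn't yet an AI: ordinary
                       # sandboxing, audit logs, reproducible environments
    MEDIUM = "medium"  # a possibly-capable AI under evaluation: layered
                       # isolation with inward-pointing monitors and tripwires
    HEAVY = "heavy"    # a genuinely dangerous system: air gap, with output
                       # restricted to a handful of yes/no answers

def required_level(might_self_improve: bool, shows_general_capability: bool) -> ContainmentLevel:
    """Toy decision rule for which regime applies (illustrative only)."""
    if might_self_improve:
        return ContainmentLevel.HEAVY
    if shows_general_capability:
        return ContainmentLevel.MEDIUM
    return ContainmentLevel.LIGHT

print(required_level(False, True))   # ContainmentLevel.MEDIUM
```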
So, what did FLI’s review committee think of all this?
“The proposal summarizes existing taxonomies of containment, but doesn’t make concrete proposals about how to make effective progress on these questions.”
Well, crap. Those weren’t “existing taxonomies”! When considering possible failure modes for this proposal, one I didn’t consider was that the original-research portions would look too much like summaries of existing work. So the reviewers thought the proposed scope was much smaller than it really was, and that it was too small for the requested budget. In retrospect, the proposal didn’t make this nearly as clear as it could have. Still, I’m rather bothered by the lack of opportunity to clarify, or really any communication at all, between submitting the proposal and receiving a no.
So we still need funding, and sadly, FLI is the only granting agency that has expressed any interest at all in AI safety. To the best of my knowledge, there is no one else at all working on or thinking about these issues; we couldn’t find anyone when we were looking for names to recommend for the review panel. Without funding, I myself will not be able to work on AI containment in more than an occasional part-time capacity.
This is too important an issue for humanity to drop the ball on, but it looks like that’s probably what will happen.
It seems like a lot of the work here is the same as for handling mundane code written by well-informed authors. That seems like a plausible scenario in a world of advanced persistent threats and government-grade espionage. Perhaps you could get funded by regular info security people?
That seems bad. Have you talked to FHI? I was under the impression they occasionally funded interesting safety research. Wasn’t AI Impacts briefly funded through FHI?