
The Alignment Newsletter is a weekly publication with recent content relevant to AI alignment.
This podcast is an audio version, recorded by Robert Miles.

More information about the newsletter at:

Jan 23, 2022

YouTube Channel:



Alignment difficulty (Richard Ngo and Eliezer Yudkowsky) (summarized by Rohin): Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His argument in this dialogue is roughly as follows:

1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to perform some pivotal act before some careless actor destroys the world.

2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are “consequentialist” in that the simplest way to get them (and thus, the one we will first build) is if you are forecasting what might happen, thinking about the expected consequences, considering possible obstacles, searching for routes around the obstacles, etc. If you don’t do this sort of reasoning, your plan goes off the rails very quickly - it is highly unlikely to lead to high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans.

3. We’re producing AI systems by selecting for systems that can do impressive stuff, which will eventually produce AI systems that can accomplish high-impact plans using a general underlying “consequentialist”-style reasoning process (because that’s the only way to keep doing more impressive stuff). However, this selection process does not constrain the goals towards which those plans are aimed. In addition, most goals seem to have convergent instrumental subgoals like survival and power-seeking that would lead to extinction. This suggests that we should expect an existential catastrophe by default.

4. None of the methods people have suggested for avoiding this outcome seem like they actually avert this story.

Richard responds to this with a few distinct points:

1. It might be possible to build AI systems which are not of world-destroying intelligence and agency, that humans use to save the world. For example, we could make AI systems that do better alignment research. Such AI systems do not seem to require the property of making long-term plans in the real world in point (3) above, and so could plausibly be safe.

2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing that plan.

3. It seems possible to create consequentialist systems with constraints upon their reasoning that lead to reduced risk.

4. It also seems possible to create systems with the primary aim of producing plans with certain properties (that aren't just about outcomes in the world) -- think for example of corrigibility (AN #35) or deference to a human user.

5. (Richard is also more bullish on coordinating not to use powerful and/or risky AI systems, though the debate did not discuss this much.)

Eliezer’s responses:

1. AI systems that help with alignment research to such a degree that it actually makes a difference are almost certainly already dangerous.

2. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous.

3 and 4. It is certainly possible to do such things; the space of minds that could be designed is very large. However, it is difficult to do such things, as they tend to make consequentialist reasoning weaker, and on our current trajectory the first AGI that we build will probably not look like that.

This post has also been summarized by others here, though with different emphases than in my summary.


Rohin's opinion: I first want to note my violent agreement with the notion that a major scary thing is “consequentialist reasoning”, and that high-impact plans require such reasoning, and that we will end up building AI systems that produce high-impact plans. Nonetheless, I am still optimistic about AI safety relative to Eliezer, which I suspect comes down to three main disagreements:

1. There are many approaches that don’t solve the problem, but do increase the level of intelligence required before the problem leads to extinction. Examples include Richard’s points 1-4 above. For example, if we build a system that states plans without executing them, then for the plans to cause extinction they need to be complicated enough that the humans executing those plans don’t realize that they are leading to an outcome that was not what they wanted. It seems non-trivially probable to me that such approaches are sufficient to prevent extinction up to the level of AI intelligence needed before we can execute a pivotal act.

2. The consequentialist reasoning is only scary to the extent that it is “aimed” at a bad goal. It seems non-trivially probable to me that it will be “aimed” at a goal sufficiently good to not lead to existential catastrophe, without putting in much alignment effort.

3. I do expect some coordination to avoid the riskiest actions.

I wish the debate had focused more on the claim that non-scary AI can’t e.g. do better alignment research, as it seems like a major crux. (For example, I think that sort of intuition drives my disagreement #1.) I expect AI progress looks a lot like “the heuristics get less and less shallow in a gradual / smooth / continuous manner” which eventually leads to the sorts of plans Eliezer calls “consequentialist”, whereas I think Eliezer expects a sharper qualitative change between “lots of heuristics” and that-which-implements-consequentialist-planning.


Discussion of "Takeoff Speeds" (Eliezer Yudkowsky and Paul Christiano) (summarized by Rohin): This post focuses on the question of whether we should expect AI progress to look discontinuous or not. It seemed to me that the two participants were mostly talking past each other, and so I’ll summarize their views separately and not discuss the parts where they were attempting to address each other’s views.

Some ideas behind the “discontinuous” view:

1. When things are made up of a bunch of parts, you only get impact once all of the parts are working. So, if you have, say, 19 out of 20 parts done, there still won’t be much impact; once you get the 20th part, there is a huge impact, which looks like a discontinuity.

2. A continuous change in inputs can lead to a discontinuous change in outputs or impact. Continuously increasing the amount of fissile material leads to a discontinuous change from “inert-looking lump” to “nuclear explosion”. Continuously scaling up a language model from GPT-2 to GPT-3 leads to many new capabilities, such as few-shot learning. A misaligned AI that is only capable of concealing 95% of its deceptive activities will not perform any such activities; it will only strike once it is scaled up to be capable of concealing 100% of its activities.

3. Fundamentally new approaches to a problem will often have prototypes which didn’t have much impact. The difference is that they will scale much better, and so once they start having an impact this will look like a discontinuity in the rate of improvement on the problem.

4. The evolution from chimps to humans tells us that there is, within the space of possible mind designs, an area in which you can get from shallow, non-widely-generalizing cognition to deep, much-more-generalizing cognition, with only relatively small changes.

5. Our civilization tends to prevent people from doing things via bureaucracy and regulatory constraints, so even if there are productivity gains to be had from applications of non-scary AI, we probably won’t see them; as a result, we probably will not see growth in gross world product (GWP) before the point where an AI can ignore bureaucracy and regulatory constraints, which makes the eventual growth look discontinuous.
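
As a toy illustration of idea 1 above (this sketch is not from the post, and the function names and the 20-part setup are assumptions made purely for illustration), compare a system whose impact grows with each finished part against one that only works once every part is in place:

```python
def additive_impact(parts_done, parts_needed=20):
    """Hypothetical smooth case: each finished part adds an equal share of impact."""
    return parts_done / parts_needed


def all_or_nothing_impact(parts_done, parts_needed=20):
    """Hypothetical case from idea 1: no impact until every part works."""
    return 1.0 if parts_done >= parts_needed else 0.0


def largest_jump(impact_fn, parts_needed=20):
    """Largest change in impact caused by finishing a single additional part."""
    values = [impact_fn(k, parts_needed) for k in range(parts_needed + 1)]
    return max(after - before for before, after in zip(values, values[1:]))


# Progress on the inputs is identical in both cases (one part at a time);
# only the mapping from parts to impact differs.
smooth_jump = largest_jump(additive_impact)        # about 0.05 per part
sudden_jump = largest_jump(all_or_nothing_impact)  # the whole impact arrives at once
```

In the all-or-nothing case, the entire impact arrives with the 20th part, even though work on the parts proceeded at a steady pace.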

Some ideas behind the “continuous” view:

1. When people are optimizing hard in pursuit of a metric, the metric tends to grow smoothly. While individual groups may find new ideas that improve the metric, those new ideas are unlikely to change the metric drastically more than previously observed changes in the metric.

2. A good heuristic for forecasting is to estimate (1) the returns to performance from additional effort, using historical data, and (2) the amount of effort currently being applied. These can then be combined to give a forecast.

3. How smooth and predictable the improvement is depends on how much effort is being put in. In terms of effort put in currently, coding assistants < machine translation < semiconductors; as a result, we should expect semiconductor improvement to be smoother than machine translation improvement, which in turn will be smoother than coding assistant improvement.

4. In AI we will probably have crappy versions of economically useful systems before we have good versions of those systems. By the time we have good versions, people will be throwing lots of effort at the problem. For example, Codex is a crappy version of a coding assistant; such assistants will now improve over time in a somewhat smooth way.
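
The forecasting heuristic in idea 2 above can be sketched roughly as follows. This is only a minimal illustration, not a method from the post: it assumes that performance is linear in the logarithm of effort, and the effort and performance numbers are made up for the example.

```python
import math


def fit_returns_to_effort(efforts, performances):
    """Least-squares fit of performance against log10(effort).

    Assumption (not from the post): returns to effort are log-linear.
    Returns (slope, intercept), where slope is the gain per 10x effort.
    """
    xs = [math.log10(e) for e in efforts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(performances) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, performances))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast(slope, intercept, projected_effort):
    """Combine estimated returns with the amount of effort being applied."""
    return intercept + slope * math.log10(projected_effort)


# Hypothetical historical data, invented for illustration only.
historical_effort = [1, 10, 100, 1000]
historical_performance = [10.0, 20.0, 30.0, 40.0]

slope, intercept = fit_returns_to_effort(historical_effort, historical_performance)
prediction = forecast(slope, intercept, projected_effort=10_000)
```

Step (1) of the heuristic corresponds to the fit, and step (2) corresponds to the effort level passed to `forecast`; a different functional form for the returns could be swapped in without changing that structure.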

There’s further discussion on the differences between these views in a subsequent post.


Rohin's opinion: The ideas I’ve listed in this summary seem quite compatible to me; I believe all of them to at least some degree (though perhaps not in the same way as the authors). I am not sure if either author would strongly disagree with any of the claims on this list. (Of course, this does not mean that they agree -- presumably there are some other claims that have not yet been made explicit on which they disagree.)






AGI Safety Fundamentals curriculum and application (Richard Ngo) (summarized by Rohin): This post presents the curriculum used in the AGI safety fundamentals course, which is meant to serve as an effective introduction to the field of AGI safety.





Visible Thoughts Project and Bounty Announcement (Nate Soares) (summarized by Rohin): MIRI would like to test whether language models can be made more understandable by training them to produce visible thoughts. As part of this project, they need a dataset of thought-annotated dungeon runs. They are offering $200,000 in prizes for building the first fragments of the dataset, plus an additional $1M prize/budget for anyone who demonstrates the ability to build a larger dataset at scale.

Prizes for ELK proposals (Paul Christiano) (summarized by Rohin): The Alignment Research Center (ARC) recently published a technical report on Eliciting Latent Knowledge (ELK). They are offering prizes of $5,000 to $50,000 for proposed strategies that tackle ELK. The deadline is the end of January.


Rohin's opinion: I think this is a particularly good contest to try to test your fit with (a certain kind of) theoretical alignment research: even if you don't have much background, you can plausibly get up to speed in tens of hours. I will also try to summarize ELK next week, but no promises.


Worldbuilding Contest (summarized by Rohin): FLI invites individuals and teams to compete for a prize purse worth $100,000+ by designing visions of a plausible, aspirational future including artificial general intelligence. The deadline for submissions is April 15.

Read more: FLI launches Worldbuilding Contest with $100,000 in prizes

New Seminar Series and Call For Proposals On Cooperative AI (summarized by Rohin): The Cooperative AI Foundation (CAIF) will be hosting a new fortnightly seminar series in which leading thinkers offer their vision for research on Cooperative AI. The first talk, ‘AI Agents May Cooperate Better If They Don’t Resemble Us’, was given on Thursday (Jan 20) by Vincent Conitzer (Duke University, University of Oxford). You can find more details and submit a proposal for the seminar series here.

AI Risk Management Framework Concept Paper (summarized by Rohin): After their Request For Information last year (AN #161), NIST has now posted a concept paper detailing their current thinking around the AI Risk Management Framework that they are creating, and are soliciting comments by Jan 25. As before, if you're interested in helping with a response, email Tony Barrett at

Announcing the PIBBSS Summer Research Fellowship (Nora Ammann) (summarized by Rohin): Principles of Intelligent Behavior in Biological and Social Systems (PIBBSS) aims to facilitate knowledge transfer with the goal of building human-aligned AI systems. This summer research fellowship will bring together researchers from fields studying complex and intelligent behavior in natural and social systems, such as evolutionary biology, neuroscience, linguistics, sociology, and more. The application deadline is Jan 23, and there are also bounties for referrals.

Action: Help expand funding for AI Safety by coordinating on NSF response (Evan R. Murphy) (summarized by Rohin): The National Science Foundation (NSF) has put out a Request for Information relating to topics they will be funding in 2023 as part of their NSF Convergence Accelerator program. The author and others are coordinating responses to increase funding to AI safety, and ask that you fill out this short form if you are willing to help out with a few small, simple actions.