Sep 8, 2021

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

This newsletter is a combined summary + opinion for the Finite Factored Sets sequence by Scott Garrabrant. I (Rohin) have taken a lot more liberty than I usually do with the interpretation of the results; Scott may or may not agree with these interpretations.

One view on the importance of deep learning is that it allows
you to automatically *learn* the features that are
relevant for some task of interest. Instead of having to handcraft
features using domain knowledge, we simply point a neural net at an
appropriate dataset, and it figures out the right features.
Arguably this is the *majority* of what makes up
intelligent cognition; in humans it seems very analogous
to System
1, which we use for most decisions and actions. We are also
able to infer causal relations between the resulting features.

Unfortunately, existing models of causal inference don’t model these learned features -- they instead assume that the features are already given to you. Finite Factored Sets (FFS) provide a theory which can talk directly about different possible ways to featurize the space of outcomes, and still allows you to perform causal inference. This sequence develops this underlying theory, and demonstrates a few examples of using finite factored sets to perform causal inference given only observational data.

Another application is to embedded agency (AN #31): we would like to think of “agency” as a way to featurize the world into an “agent” feature and an “environment” feature, that together interact to determine the world. In Cartesian Frames (AN #127), we worked with a function A × E → W, where pairs of (agent, environment) together determined the world. In the finite factored set regime, we’ll think of A and E as features, the space S = A × E as the set of possible feature vectors, and S → W as the mapping from feature vectors to actual world states.

Generalizing this idea to apply more broadly, we will assume
that there is a set of possible worlds Ω, a set S of arbitrary
elements (which we will eventually interpret as feature vectors),
and a function f : S → Ω that maps feature vectors to world states.
Our goal is to have some notion of “features” of elements of S.
Normally, when working with sets, we identify a feature value with
the set of elements that have that value. For example, we can
identify “red” as the set of all red objects, and in some
versions of mathematics, we define “2” to be the set of all
sets that have exactly two elements. So, we define a feature to be
a *partition* of S into subsets, where each subset
corresponds to one of the possible feature values. We can also
interpret a feature as a *question* about items in
S, and the values as possible *answers* to that
question; I’ll be using that terminology going forward.

A finite factored set is then given by (S, B), where B is a set
of **factors** (questions), such that if you
choose a particular answer to every question, that uniquely
determines an element in S (and vice versa). We’ll put aside the
set of possible worlds Ω; for now we’re just going to focus on the
theory of these (S, B) pairs.

Let’s look at a contrived example. Consider S = {chai, caesar salad, lasagna, lava cake, sprite, strawberry sorbet}. Here are some possible questions for this S:

- **FoodType**: Possible answers are Drink =
{chai, sprite}, Dessert = {lava cake, strawberry sorbet}, Savory =
{caesar salad, lasagna}

- **Temperature**: Possible answers are Hot =
{chai, lava cake, lasagna} and Cold = {sprite, strawberry sorbet,
caesar salad}.

- **StartingLetter**: Possible answers are “C”
= {chai, caesar salad}, “L” = {lasagna, lava cake}, and “S” =
{sprite, strawberry sorbet}.

- **NumberOfWords**: Possible answers are “1”
= {chai, lasagna, sprite} and “2” = {caesar salad, lava cake,
strawberry sorbet}.

Given these questions, we could factor S into {FoodType,
Temperature}, or {StartingLetter, NumberOfWords}.
We *cannot* factor it into, say, {StartingLetter,
Temperature}, because if we set StartingLetter = L and Temperature
= Hot, that does not uniquely determine an element in S (it could
be either lava cake or lasagna).

Which of the two factorizations should we use? We’re not going to delve too deeply into this question, but you could imagine that if you were interested in questions like “does this need to be put in a glass” you might be more interested in the {FoodType, Temperature} factorization.

Just to appreciate the castle of abstractions we’ve built, here’s the finite factored set F with the factorization {FoodType, Temperature}:

F = ({chai, caesar salad, lasagna, lava cake, sprite, strawberry sorbet}, {{{chai, sprite}, {lava cake, strawberry sorbet}, {caesar salad, lasagna}}, {{chai, lava cake, lasagna}, {sprite, strawberry sorbet, caesar salad}}})

To keep it all straight, just remember:
a **factorization** B is a set
of **questions** (factors, partitions) each
of which is a set of **possible
answers** (parts), each of which is a set of elements
in S.

Some objections you might have about stuff we’ve talked about so far:

**Q.** Why do we bother with the set S --
couldn’t we just have the set of questions B, and then talk about
answer vectors of the form (a1, a2, … aN)?

**A.** You could in theory do this, as there
is a bijection between S and the Cartesian product of the sets in
B. However, the problem with this framing is that it is hard to
talk about other derived features. For example, the question “what
is the value of B1+B2” has no easy description in this framing.
When we instead directly work with S, the B1+B2 question is just
another partition of S, just like B1 or B2 individually.

**Q.** Why does f map S to Ω? Doesn’t this
mean that a feature vector uniquely determines a world state,
whereas it’s usually the opposite in machine learning?

**A.** This is true, but here the idea is that
the set of features together captures *all* the
information within the setting we are considering. You could think
of feature vectors in deep learning as only capturing an important
subset of all of the features (which we’d have to do in practice
since we only have bounded computation), and those features are not
enough to determine world states.

We’re eventually going to use finite factored sets similarly to Pearlian causal models: to infer which questions (random variables) are conditionally independent of each other. However, our analysis will apply to arbitrary questions, unlike Pearlian models, which can only talk about independence between the predefined variables from which the causal model is built.

Just like Pearl, we will talk about *conditioning on
evidence*: given evidence e, a subset of S, we can “observe”
that we are within e. In the formal setup, this looks like erasing
all elements that are not in e from all questions, answers,
factors, etc.

*Unlike* Pearl, we’re going to assume that all of
our factors are *independent* from each other. In
Pearlian causal models, the random variables are typically not
independent from each other. For example, you might have a model
with two binary variables, e.g. “Variable Rain causes Variable Wet
Sidewalk”; these are obviously not independent. An analogous finite
factored set would have *three* factors: “did it
rain?”, “if it rained did the sidewalk get wet?” and “if it didn’t
rain did the sidewalk get wet?” This way all three factors can be
independent of each other. We will still be able to ask whether Wet
Sidewalk is independent of Rain, since Wet Sidewalk is just another
question about the set S -- it just isn’t one of the underlying
factors any more.

The point of this independence is to allow us to reason
about *counterfactuals*: it should be possible to say
“imagine the element s, except with underlying factor b2 changed to
have value v”. As a result, our definitions will include clauses
that say “and make sure we can still take counterfactuals”. For
example, let’s talk about the “history” of a question X, which for
now you can think of as the “factors relevant to X”.
The *history* of X given e is the smallest set of
factors such that:

1) if you know the answers to these factors, then you can infer the answer to X, and

2) any factors that are *not* in the history
are independent of X. As suggested above, we can think of this as
being about counterfactuals -- we’re saying that for any such
factor, we can counterfactually change its answer, and this will
remain consistent with the evidence e.

(A technicality on the second point: we’ll never be able to counterfactually change a factor to a value that is never found in the evidence; this is fine and doesn’t prevent things from being independent.)

Time for an example! Consider the set S = {000, 001, 010, 011, 100, 101, 110, 111}, and the factorization {X, Y, Z}, where X is the question “what is the first bit”, Y is the question “what is the second bit”, and Z is the question “what is the third bit”. Consider the question Q = “when interpreted as a binary number, is the number >= 2?” In this case, the history of Q given no evidence is {X, Y}, because you can determine the answer to Q with the combination of X and Y. (You can still counterfact on anything, since there is no evidence to be inconsistent with.)

Let’s consider an example with evidence. Suppose we observe that
all the bits are equal, that is, e = {000, 111}. Now, what is the
history of X? If there weren’t any evidence, the history would just
be {X}; you only need to know X in order to determine the value of
X. However, suppose we learned that X = 0, implying that our
element is 000. We can’t counterfact on Y or Z, since that would
produce 010 or 001, both of which are inconsistent with the
evidence. So given this evidence, the history of X is actually {X,
Y, Z}, i.e. the entire set of factors! If we’d only observed that
the first two bits were equal, so e = {000, 001, 110, 111}, then
we *could* counterfact on Z, and the history of X
would be {X, Y}.

(Should you want more examples, here are two relevant posts.)

Given this notion of “history”, it is easy to define
orthogonality: X is orthogonal to Y given evidence e if the history
of X given e has no overlap with the history of Y given e.
Intuitively, this means that the factors relevant to X are
completely separate from those relevant to Y, and so there cannot
be any entanglement between X and Y. For
a *question* Z, we say that X is orthogonal to Y
given Z if we have that X is orthogonal to Y given z, for every
possible answer z in Z.

Now that we have defined orthogonality, we can state
the *Fundamental Theorem of Finite Factored Sets*.
Given some questions X, Y and Z about a finite factored set F, X is
orthogonal to Y given Z if and only if in every probability
distribution on F, X is conditionally independent of Y given Z,
that is, P(X, Y | Z) = P(X | Z) * P(Y | Z).

(I haven’t told you how you put a probability distribution on F. It’s exactly what you would think -- you assign a probability to every possible answer in every factor, and then the probability of an individual element is defined to be the product of the probabilities of its answers across all the factors.)

(I also haven’t given you any intuition about why this theorem
holds. Unfortunately I don’t have great intuition for this; the
proof has multiple non-trivial steps each of which I locally
understand and have intuition for... but globally it’s just a
sequence of non-trivial steps to me. Here’s an attempt, which isn’t
very good: we specifically defined orthogonality to
capture *all* the relevant information for a
question, in particular by having that second condition requiring
that we be able to counterfact on other factors, and so it
intuitively makes sense that if the relevant information doesn’t
overlap then there can’t be a way for the probability distribution
to have interactions between the variables.)

The fundamental theorem is in some sense
a *justification* for calling the property
“orthogonality” -- if we determine just by studying the structure
of the finite factored set that X is orthogonal to Y given Z, then
we know that this implies conditional independence in the “true”
probability distribution, whatever it ends up being. Pearlian
models have a similar theorem, where the graphical property of
d-separation implies conditional independence.

You might be wondering why we have been calling the minimal set of relevant factors “history”. The core philosophical idea is that, if you have the right factorization, then “time” or “causality” can be thought of as flowing in the direction of larger histories. Specifically, we say that X is “before” Y if the history of X is a subset of the history of Y. (We then call it “history” because every factor in the history of X will be “before” X by this definition.)

One intuition pump for this is that in physics, if an event A causes an event B, then the past light cone of A is a subset of the past light cone of B, and A happens before B in every possible reference frame.

But perhaps the best argument for thinking of this as causality is that we can actually use this notion of “time” or “causality” to perform causal inference. Before I talk about that, let’s see what this looks like in Pearlian models.

Strictly speaking, in Pearlian models, the edges do
not *have* to correspond to causality: formally
they only represent conditional independence assumptions on a
probability distribution. However, consider the following Cool
Fact: for some Pearlian models, if you have observational data that
is generated from that model, you can recover the exact graphical
structure of the generating model just by looking at the
observational data. In this case, you really are inferring
cause-and-effect relationships from observational data! (In the
general case where the data is generated by an arbitrary model, you
can recover a lot of the structure of the model, but be uncertain
about the direction of some of the edges, so you are still
doing *some* causal inference from observational
data.)

We will do something similar: we’ll use our notion of “before” to perform causal inference given observational data.

You are given statistical (i.e. observational) data for three
bits: X, Y and Z. You quickly notice that it is always the case
that Z = X xor Y (which implies that X = Y xor Z, and Y = Z xor X).
Clearly, there are only two independent bits here, and the other
bit is derived as the xor of the two independent bits. From the raw
statistical data, can you tell which bits are the independent ones,
and which one is the derived one, thus inferring which one
was *caused* by the other two? It turns out that
you can!

Specifically, you want to look for which two bits
are *orthogonal* to each other, that is, you want
to check whether we approximately have P(X, Y) = P(X) P(Y) (and
similarly for other possible pairings). In the world where two of
the bits were generated by a biased coin, you will find exactly one
pair that is orthogonal in this way. (The case where the bits are
generated by a fair coin is a special case; the argument won’t work
there, but it’s in some sense “accidental” and happens because the
probability of 0.5 is very special.)

Let’s suppose that the orthogonal pair was (X, Z). In this case,
we can *prove* that
in *every* finite factored set that models this
situation, X and Z come “before” Y, i.e. their histories are strict
subsets of Y’s history. Thus, we’ve inferred causality using only
observational data! (And unlike with Pearlian models, we did this
in a case where one “variable” was a deterministic function of two
other “variables”, which is a type of situation that Pearlian
models struggle to handle.)

Remember that motivation section, a couple thousand words ago? We talked about how we can do causal inference with learned featurizations, and apply it to embedded agency. Well, we actually haven’t done that yet, beyond a few examples of causal inference (as in the example above). There is a lot of future work to be done in applying it to the case that motivated it in the first place. The author wrote up potential future work here, which has categories for both causal inference and embedded agency, and also adds a third one: generalizing the theory to infinite sets. If you are interested in this framework, there are many avenues for pushing it forward.