# Assessing significance in a Markov chain without mixing

1. aDepartment of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213;
2. bDepartment of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA 15213
1. Edited by Kenneth W. Wachter, University of California, Berkeley, CA, and approved January 24, 2017 (received for review October 21, 2016)

## Significance

Markov chains are simple mathematical objects that can be used to generate random samples from a probability space by taking a random walk on elements of the space. Unfortunately, in applications, it is often unknown how long a chain must be run to generate good samples, and in practice, the time required is often simply too long. This difficulty can preclude the possibility of using Markov chains to make rigorous statistical claims in many cases. We develop a rigorous statistical test for Markov chains which can avoid this problem, and apply it to the problem of detecting bias in Congressional districting.

## Abstract

We present a statistical test to detect that a presented state of a reversible Markov chain was not chosen from a stationary distribution. In particular, given a value function for the states of the Markov chain, we would like to show rigorously that the presented state is an outlier with respect to the values, by establishing a <mml:math><mml:mi>p</mml:mi></mml:math>p value under the null hypothesis that it was chosen from a stationary distribution of the chain. A simple heuristic used in practice is to sample ranks of states from long random trajectories on the Markov chain and compare these with the rank of the presented state; if the presented state is a <mml:math><mml:mrow><mml:mn>0.1</mml:mn><mml:mo>%</mml:mo></mml:mrow></mml:math>0.1% outlier compared with the sampled ranks (its rank is in the bottom <mml:math><mml:mrow><mml:mn>0.1</mml:mn><mml:mo>%</mml:mo></mml:mrow></mml:math>0.1% of sampled ranks), then this observation should correspond to a <mml:math><mml:mi>p</mml:mi></mml:math>p value of <mml:math><mml:mn>0.001</mml:mn></mml:math>0.001. This significance is not rigorous, however, without good bounds on the mixing time of the Markov chain. Our test is the following: Given the presented state in the Markov chain, take a random walk from the presented state for any number of steps. We prove that observing that the presented state is an <mml:math><mml:mi>ε</mml:mi></mml:math>ε-outlier on the walk is significant at <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>p</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mn>2</mml:mn><mml:mi>ε</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:math>p=2ε under the null hypothesis that the state was chosen from a stationary distribution. We assume nothing about the Markov chain beyond reversibility and show that significance at <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>p</mml:mi></mml:mpadded><mml:mo>≈</mml:mo><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:mrow></mml:math>p≈ε is best possible in general. We illustrate the use of our test with a potential application to the rigorous detection of gerrymandering in Congressional districting.

The essential problem in statistics is to bound the probability of a surprising observation under a null hypothesis that observations are being drawn from some unbiased probability distribution. This calculation can fail to be straightforward for a number of reasons. On the one hand, defining the way in which the outcome is surprising requires care; for example, intricate techniques have been developed to allow sophisticated analysis of cases where multiple hypotheses are being tested. On the other hand, the correct choice of the unbiased distribution implied by the null hypothesis is often not immediately clear; classical tools like the <mml:math><mml:mi>t</mml:mi></mml:math>t test are often applied by making simplifying assumptions about the distribution in such cases. If the distribution is well-defined but is not be amenable to mathematical analysis, a <mml:math><mml:mi>p</mml:mi></mml:math>p value can still be calculated using bootstrapping if test samples can be drawn from the distribution.

A third way for <mml:math><mml:mi>p</mml:mi></mml:math>p value calculations to be nontrivial occurs when the observation is surprising in a simple way and the null hypothesis distribution is known but where there is no simple algorithm to draw samples from this distribution. In these cases, the best candidate method to sample from the null hypothesis is often through a Markov chain, which essentially takes a long random walk on the possible values of the distribution. Under suitable conditions, theorems are available that guarantee that the chain converges to its stationary distribution, allowing a random sample to be drawn from a distribution quantifiably close to the target distribution. This principle has given rise to diverse applications of Markov chains, including to simulations of chemical reactions, Markov chain Monte Carlo statistical methods, protein folding, and statistical physics models.

A persistent problem in applications of Markov chains is the often unknown rate at which the chain converges with the stationary distribution (1, 2). It is rare to have rigorous results on the mixing time of a real-world Markov chain, which means that, in practice, sampling is performed by running a Markov chain for a “long time” and hoping that sufficient mixing has occurred. In some applications, such as in simulations of the Potts model from statistical physics, practitioners have developed modified Markov chains in the hopes of achieving faster convergence (3), but such algorithms have still been shown to have exponential mixing times in many settings (4?6).

In this article, we are concerned with the problem of assessing statistical significance in a Markov chain without requiring results on the mixing time of the chain or indeed, any special structure at all in the chain beyond reversibility. Formally, we consider a reversible Markov chain <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M on a state space <mml:math><mml:mi mathvariant="normal">Σ</mml:mi></mml:math>Σ, which has an associated label function <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>ω</mml:mi></mml:mpadded><mml:mo>:</mml:mo><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi mathvariant="normal">Σ</mml:mi></mml:mpadded><mml:mo>→</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow></mml:mrow></mml:math>ω:Σ→?. (The definition of Markov chain is recalled at the end of this section.) The labels constitute auxiliary information and are not assumed to have any relationship to the transition probabilities of <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M. We would like to show that a presented state <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0 is unusual for states drawn from a stationary distribution <mml:math><mml:mi>π</mml:mi></mml:math>π. If we have good bounds on the mixing time of <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M, then we can simply sample from a distribution of <mml:math><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>π</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ω(π) and use bootstrapping to obtain a rigorous <mml:math><mml:mi>p</mml:mi></mml:math>p value for the significance of the smallness of the label of <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0. However, such bounds are rarely available.

We propose the following simple and rigorous test to detect that <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0 is unusual relative to states chosen randomly according to <mml:math><mml:mi>π</mml:mi></mml:math>π, which does not require bounds on the mixing rate of <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M.

The <mml:math><mml:msqrt><mml:mi>??</mml:mi></mml:msqrt></mml:math>?? test. Observe a trajectory <mml:math><mml:mrow><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>σ</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>…</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>σ0,σ1,σ2…,σk from the state <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0 for any fixed <mml:math><mml:mi>k</mml:mi></mml:math>k. The event that <mml:math><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ω(σ0) is an <mml:math><mml:mi>ε</mml:mi></mml:math>ε-outlier among <mml:math><mml:mrow><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math>ω(σ0),…,ω(σk) is significant at <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>p</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:msqrt><mml:mrow><mml:mn>2</mml:mn><mml:mi>ε</mml:mi></mml:mrow></mml:msqrt></mml:mrow></mml:math>p=2ε under the null hypothesis that <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>σ0～π.

Here, we say that a real number <mml:math><mml:msub><mml:mi>α</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>α0 is an <mml:math><mml:mi>ε</mml:mi></mml:math>ε-outlier among <mml:math><mml:mrow><mml:msub><mml:mi>α</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>α</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>α</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>α0,α2,…,αk if there are, at most, <mml:math><mml:mrow><mml:mi>ε</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ε(k+1) indices <mml:math><mml:mi>i</mml:mi></mml:math>i for which <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>α</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded><mml:mo>≤</mml:mo><mml:msub><mml:mi>α</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math>αi≤α0. In particular, note for the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test that the only relevant feature of the label function is the ranking that it imposes on the elements of <mml:math><mml:mi mathvariant="normal">Σ</mml:mi></mml:math>Σ. In SI Text, we consider the statistical power of the test and show that the relationship <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>p</mml:mi></mml:mpadded><mml:mo>≈</mml:mo><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:mrow></mml:math>p≈ε is best possible. We leave as an open question whether the constant <mml:math><mml:msqrt><mml:mn>2</mml:mn></mml:msqrt></mml:math>2 can be improved.

Roughly speaking, this kind of test is possible, because a reversible Markov chain cannot have many local outliers (Fig. 1). Rigorously, the validity of the test is a consequence of the following theorem.

Theorem 1.1. Let <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi mathvariant="script">M</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo></mml:mrow></mml:mrow></mml:math>M=X0,X1,… be a reversible Markov chain with a stationary distribution <mml:math><mml:mi>π</mml:mi></mml:math>π, and suppose the states of <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M have real-valued labels. If <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>X0～π, then for any fixed <mml:math><mml:mi>k</mml:mi></mml:math>k, the probability that the label of <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 is an <mml:math><mml:mi>ε</mml:mi></mml:math>ε-outlier from among the list of labels observed in the trajectory <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,X1,X2,…,Xk is, at most, <mml:math><mml:msqrt><mml:mrow><mml:mn>2</mml:mn><mml:mi>ε</mml:mi></mml:mrow></mml:msqrt></mml:math>.

We emphasize that Theorem 1.1 makes no assumptions on the structure of the Markov chain beyond reversibility. In particular, it applies even if the chain is not irreducible (in other words, even if the state space is not connected), although in this case, the chain will never mix.

In Detecting Bias in Political Districting, we apply the test to Markov chains generating random political districting for which no results on rapid mixing exist. In particular, we show that, for various simple choices of constraints on what constitutes a “valid” Congressional districting (e.g., that the districts are contiguous and satisfy certain geometric constraints), the current Congressional districting of Pennsylvania is significantly biased under the null hypothesis of a districting chosen at random from the set of valid districting. (We obtain <mml:math><mml:mi>p</mml:mi></mml:math>p values between <mml:math><mml:mrow><mml:mo>≈</mml:mo><mml:mrow><mml:mn>?2.5</mml:mn><mml:mo>?</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>?</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:math>≈?2.5?10?4 and <mml:math><mml:mrow><mml:mo>≈</mml:mo><mml:mrow><mml:mn>?8.1</mml:mn><mml:mo>?</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>?</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:math>≈?8.1?10?7 for the constraints that we considered.)

One hypothetical application of the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test is the possibility of rigorously showing that a chain is not mixed. In particular, suppose that Research Group 1 has run a reversible Markov chain for <mml:math><mml:msub><mml:mi>n</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math>n1 steps and believes that this was sufficient to mix the chain. Research Group 2 runs the chain for another <mml:math><mml:msub><mml:mi>n</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math>n2 steps, producing a trajectory of total length <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>n</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mpadded><mml:mo>+</mml:mo><mml:msub><mml:mi>n</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math>n1+n2, and notices that a property of interest changes in these <mml:math><mml:msub><mml:mi>n</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math>n2 additional steps. Heuristically, this observation suggests that <mml:math><mml:msub><mml:mi>n</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math>n1 steps were not sufficient to mix the chain, and the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test quantifies this reasoning rigorously. For this application, however, we must allow <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 to be distributed not exactly as the stationary distribution <mml:math><mml:mi>π</mml:mi></mml:math>π but as some distribution <mml:math><mml:mi>π</mml:mi><mml:mo>′</mml:mo></mml:math>π′ with total variation distance to <mml:math><mml:mi>π</mml:mi></mml:math>π that is small, as is the scenario for a “mixed” Markov chain. In SI Text, we give a version of Theorem 1.1, which applies in this scenario.

One area of research related to this manuscript concerns methods for perfect sampling from Markov chains. Beginning with the Coupling from the Past (CFTP) algorithm of Propp and Wilson (7, 8) and several extensions (9, 10), these techniques are designed to allow sampling of states exactly from the stationary distribution <mml:math><mml:mi>π</mml:mi></mml:math>π without having rigorous bounds on the mixing time of the chain. Compared with the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test, perfect sampling techniques have the disadvantages that they require the Markov chain to possess a certain structure for the method to be implementable and that the time that it takes to generate each perfect sample is unbounded. Moreover, although perfect sampling methods do not require rigorous bounds on mixing times to work, they will not run efficiently on a slowly mixing chain. The point is that for a chain that has the right structure and that actually mixes quickly (despite an absence of a rigorous bound on the mixing time), algorithms like CFTP can be used to rigorously generate perfect samples. However, the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test applies to any reversible Markov chain, regardless of the structure, and has running time <mml:math><mml:mi>k</mml:mi></mml:math>k chosen by the user. Importantly, it is quite possible that the test can detect bias in a sample even when <mml:math><mml:mi>k</mml:mi></mml:math>k is much smaller than the mixing time of the chain, which seems to be the case in the districting example discussed in Detecting Bias in Political Districting. Of course, unlike perfect sampling methods, the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test can only be used to show that a given sample is not chosen from <mml:math><mml:mi>π</mml:mi></mml:math>π; it does not give a way for generating samples from <mml:math><mml:mi>π</mml:mi></mml:math>π.

## SI Text

### S1. Precinct Data.

Precinct-level voting data and associated shape files were obtained from the Harvard Election Data Archive (projects.iq.harvard.edu/eda/home) (20). The data for Pennsylvania contain 9,256 precincts. The data were altered in two ways: 258 precincts that were contained within another precinct were merged and 79 precincts that were not contiguous were split into continuous areas, with voting and population data distributed proportional to the area. The result was a set of 9,060 precincts. All geometry calculations and manipulations were accomplished in R with “maptools,” “rgeos,” and “BARD” R packages. The final input to the Markov chain is a set of precincts with corresponding areas, neighbor precincts, the length of the perimeter shared with each neighbor, voting data from 2012, and the current Congressional district to which the precinct belongs.

### S2. Valid Districting.

We restrict our attention to districtings satisfying four restrictions, each of which we describe here.

#### S2.1. Contiguous districts.

A valid districting must have the property that each of its districts is contiguous. In particular, two precincts are considered adjacent if the length of their shared perimeter is nonzero (in particular, precincts meeting only at a point are not adjacent), and a district is contiguous if any pair of precincts is joined by a sequence of consecutively adjacent pairs.

#### S2.2. Simply connected districts.

A valid districting must have the property that each of its districts is simply connected. Roughly speaking, this constraint means the district cannot have a “hole.” Precisely, a district is simply connected if, for any circuit of precincts in the district, all precincts in the region bounded by the circuit also are in the district.

Apart from aesthetic reasons for insisting that districtings satisfy this condition, there is also a practical reason: it is easier to have a fast local check for contiguity when one is also maintaining that districtings are simply connected.

#### S2.3. Small population difference.

According to the “one person, one vote” doctrine, Congressional districts for a state are required to be roughly equal in population. In the current districting of Pennsylvania, for example, the maximum difference in district population from the average population is roughly <mml:math><mml:mrow><mml:mn>1</mml:mn><mml:mo>%</mml:mo></mml:mrow></mml:math>1%. Our chain can use different tolerances for population difference between districts and the average, and the tolerances used in the runs below are indicated.

#### S2.4. Compactness.

If districtings were drawn randomly with only the first three requirements, the result would be districtings in which districts have very complicated, fractal-like structure (because most districtings have this property). The final requirement on valid districtings prevents this by ensuring that the districts in the districting have a reasonably nice shape. This requirement on district shape is commonly termed “compactness” and explicitly required of Congressional districts by the Pennsylvania Constitution.

Although compactness of districts does not have a precise legal definition, various precise metrics have been proposed to quantify the compactness of a given district mathematically. One of the simplest and most commonly used metrics is the Polsby–Popper metric, which defines the compactness of a district as<mml:math display="block"><mml:mrow><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>C</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>4</mml:mn><mml:mi>π</mml:mi><mml:msub><mml:mi>A</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:mrow><mml:msubsup><mml:mi>P</mml:mi><mml:mi>D</mml:mi><mml:mn>2</mml:mn></mml:msubsup></mml:mfrac></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>CD=4πADPD2,where <mml:math><mml:msub><mml:mi>A</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:math>AD and <mml:math><mml:msub><mml:mi>P</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:math>PD are the area and perimeter of the district <mml:math><mml:mi>D</mml:mi></mml:math>D, respectively. Note that the maximum value of this measure is one, which is achieved only by the disk as a result of the isoperimetric inequality. All other shapes have compactness between zero and one, and smaller values indicate more “contorted” shapes.

Perhaps the most straightforward use of this compactness measure is to enforce some threshold on compactness and require valid districtings to have districts with compactness that is above that lower bound. (For consistency with our other metrics, we actually impose an upper bound on the reciprocal <mml:math><mml:mrow><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:msub><mml:mi>C</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:mrow></mml:math>1/CD of the Polsby–Popper compactness <mml:math><mml:msub><mml:mi>C</mml:mi><mml:mi>D</mml:mi></mml:msub></mml:math>CD of each district <mml:math><mml:mi>D</mml:mi></mml:math>D.) In Table S1, this metric is the <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mi mathvariant="normal">∞</mml:mi></mml:msup></mml:math>L∞ compactness metric.

One drawback of using this method is that the current districting of Pennsylvania has a few districts that have very low compactness values (they are much stranger looking than the other districts). Applying this restriction will allow all 18 districts to be as bad as the threshold chosen, so that, in particular, we will be sampling districtings from space in which all 18 districts may be as bad as the worst district in the current plan. In fact, because there are more noncompact regions than compact ones, one expects that, in typical such districting, all 18 districts would be nearly as bad as the currently worst example.

To address this issue and also, to show the robustness of our finding for the districting question, we also consider some alternate restrictions on the districting, which measure how reasonable the districting as a whole is with regard to compactness. For example, one simple measure of this is to have a threshold for the maximum allowable sum<mml:math display="block"><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>C</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mfrac><mml:mo>+</mml:mo><mml:mo>?</mml:mo><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>C</mml:mi><mml:mn>18</mml:mn></mml:msub></mml:mfrac></mml:mrow></mml:math>1C1+?+1C18of the inverse compactness values of 18 districts. This metric is the <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msup></mml:math>L1 metric in Table S1. Similarly, we could have a threshold for the maximum allowable sum of squares<mml:math display="block"><mml:mrow><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:msubsup><mml:mi>C</mml:mi><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:msubsup></mml:mfrac><mml:mo>+</mml:mo><mml:mo>?</mml:mo><mml:mo>+</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msubsup><mml:mi>C</mml:mi><mml:mn>18</mml:mn><mml:mn>2</mml:mn></mml:msubsup></mml:mfrac></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>1C12+?+1C182.This metric is the <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math>L2 metric in Table S1. Finally, we can have a simple condition that simply ensures that the total perimeter<mml:math display="block"><mml:mrow><mml:msub><mml:mi>P</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mo>?</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>P</mml:mi><mml:mn>18</mml:mn></mml:msub></mml:mrow></mml:math>P1+?+P18is less than some threshold.

#### S2.5. Other possible constraints.

It is possible to imagine many other reasonable constraints on valid districtings. For example, the Pennsylvania Constitution currently requires of districtings for the Pennsylvania Senate and Pennsylvania House of Representatives that, unless absolutely necessary, no county, city, incorporated town, borough, township, or ward shall be divided in forming either a senatorial or representative district.

There is no similar requirement for US Congressional districts in Pennsylvania, which is what we consider here, but it is still a reasonable constraint to consider.

There are also interesting legal questions about the extent to which majority–minority districts (in which an ethnic minority is an electoral majority) are either required to be intentionally created or forbidden to be intentionally created. On the one hand, the US Supreme Court ruled in Thornburg v. Gingles (1986) that, in certain cases, a geographically concentrated minority population is entitled to a Congressional district in which it constitutes a majority. On the other hand, in several US Supreme Court cases [Shaw v. Reno (1993), Miller v. Johnson (1995), and Bush v. Vera (1996)], Congressional districtings were thrown out, because they contained intentionally drawn majority–minority districts that were deemed to be a “racial gerrymander.” In any case, we have not attempted to answer the question of whether or how the existence of majority–minority districts should be quantified. (We suspect that the unbiased procedure of drawing a random districting is probably acceptable under current majority–minority district requirements, but in any case, our main intent is to show the application of the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test.)

Importantly, we emphasize that any constraint on districtings that can be precisely defined (i.e., by giving an algorithm that can identify whether a districting satisfies the constraint) can be used in the Markov chain setting in principle.

### S3. The Markov Chain.

The Markov chain <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M that we use has as its state space <mml:math><mml:mi mathvariant="normal">Σ</mml:mi></mml:math>Σ, the space of all valid districtings (with 18 districts) of Pennsylvania. Note that there is no simple way to enumerate these, and there is an enormous number of them.

A simple way to define a Markov chain on this state space is to transition as follows.

• i) From the current state, determine the set <mml:math><mml:mi>S</mml:mi></mml:math>S of all pairs <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>ρ</mml:mi><mml:mo>,</mml:mo><mml:mi>D</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(ρ,D), where <mml:math><mml:mi>ρ</mml:mi></mml:math>ρ is a precinct in some district <mml:math><mml:msub><mml:mi>D</mml:mi><mml:mi>ρ</mml:mi></mml:msub></mml:math>, and <mml:math><mml:mrow><mml:mi>D</mml:mi><mml:mo>≠</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mi>ρ</mml:mi></mml:msub></mml:mrow></mml:math>D≠Dρ is a district that is adjacent to <mml:math><mml:mi>ρ</mml:mi></mml:math>ρ. Let <mml:math><mml:msub><mml:mi>N</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:math>NS denote the size of this set.

• ii) From <mml:math><mml:mi>S</mml:mi></mml:math>S, choose one pair <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(ρ0,D0) uniformly at random.

• iii) Change the district membership of <mml:math><mml:msub><mml:mi>ρ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>ρ0 from <mml:math><mml:msub><mml:mi>D</mml:mi><mml:msub><mml:mi>ρ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:msub></mml:math>Dρ0 to <mml:math><mml:msub><mml:mi>D</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>D0 if the resulting district is still valid.

Let the Markov chain with these transition rules be denoted by <mml:math><mml:msub><mml:mi mathvariant="script">M</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>M0. This chain is a perfectly fine reversible Markov chain to which our theorem applies, but the uniform distribution on valid districtings is not stationary for <mml:math><mml:msub><mml:mi mathvariant="script">M</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>M0; therefore, we cannot use <mml:math><mml:msub><mml:mi mathvariant="script">M</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>M0 to make comparisons between a presented districting and a uniformly random valid districting.

A simple way to make the uniform distribution stationary is to “regularize” the chain (that is, to modify the chain so that the number of legal steps from any state is equal). (<mml:math><mml:msub><mml:mi mathvariant="script">M</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>M0 is not already regular, because the number of precincts on the boundaries of districts will vary from districting to districting.) We do this by adding loops to each possible state. In particular, using a theoretical maximum <mml:math><mml:msub><mml:mi>N</mml:mi><mml:mtext>max</mml:mtext></mml:msub></mml:math>Nmax on the possible size of <mml:math><mml:msub><mml:mi>N</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:math>NS for any district, we modify the transition rules as follows.

• i) From the current state, determine the set <mml:math><mml:mi>S</mml:mi></mml:math>S of all pairs <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>ρ</mml:mi><mml:mo>,</mml:mo><mml:mi>D</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(ρ,D), where <mml:math><mml:mi>ρ</mml:mi></mml:math>ρ is a precinct in some district <mml:math><mml:msub><mml:mi>D</mml:mi><mml:mi>ρ</mml:mi></mml:msub></mml:math>, and <mml:math><mml:mrow><mml:mi>D</mml:mi><mml:mo>≠</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mi>ρ</mml:mi></mml:msub></mml:mrow></mml:math>D≠Dρ is a district that is adjacent to <mml:math><mml:mi>ρ</mml:mi></mml:math>ρ. Let <mml:math><mml:msub><mml:mi>N</mml:mi><mml:mi>S</mml:mi></mml:msub></mml:math>NS denote the size of this set.

• ii) With probability <mml:math><mml:mrow><mml:mn>1</mml:mn><mml:mo>?</mml:mo><mml:mfrac><mml:msub><mml:mi>N</mml:mi><mml:mi>S</mml:mi></mml:msub><mml:msub><mml:mi>N</mml:mi><mml:mtext>max</mml:mtext></mml:msub></mml:mfrac></mml:mrow></mml:math>1?NSNmax, remain in the current state for this step. With probability <mml:math><mml:mfrac><mml:msub><mml:mi>N</mml:mi><mml:mi>S</mml:mi></mml:msub><mml:msub><mml:mi>N</mml:mi><mml:mtext>max</mml:mtext></mml:msub></mml:mfrac></mml:math>NSNmax, continue as follows.

• iii) From <mml:math><mml:mi>S</mml:mi></mml:math>S, choose one pair <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>D</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(ρ0,D0) uniformly at random.

• iv) Change the district membership of <mml:math><mml:msub><mml:mi>ρ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>ρ0 from <mml:math><mml:msub><mml:mi>D</mml:mi><mml:msub><mml:mi>ρ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:msub></mml:math>Dρ0 to <mml:math><mml:msub><mml:mi>D</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>D0 if the resulting district is still valid. If it is not, remain in the current district for this set.

In particular, with this modification, each state has exactly <mml:math><mml:msub><mml:mi>N</mml:mi><mml:mtext>max</mml:mtext></mml:msub></mml:math>Nmax possible transitions, which are each equally likely; many of these transitions are loops back to the same state. (Some of these loops arise from step ii, but some also arise when the if condition in step iv fails.)

### S4. The Label Function.

In principle, any label function <mml:math><mml:mi>ω</mml:mi></mml:math>ω could be used in the application of the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test; note that Theorem 1.1 places no restrictions on <mml:math><mml:mi>ω</mml:mi></mml:math>ω. Thus, when we choose which label function to use, we are making a choice based on what is likely to achieve good significance rather than what is valid statistical reasoning (subject to the caveat discussed below). To choose a label function that was likely to allow good statistical power, we want to have a function that is

• i) likely very different for a gerrymandered districting compared with a typical districting and

• ii) sensitive enough that small changes in the districting might be detected in the label function.

Although the role of the first condition in achieving outlier status is immediately obvious, the second property is also crucial to detecting significance with our test, which makes use of trajectories that may be quite small compared with the mixing time of the chain. For the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test to succeed at showing significance, it is not enough for the presented state <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0 to actually be an outlier against <mml:math><mml:mi>π</mml:mi></mml:math>π with respect to <mml:math><mml:mi>ω</mml:mi></mml:math>ω; this outlier status must also be detectable on trajectories of the fixed length <mml:math><mml:mi>k</mml:mi></mml:math>k, which may well be too small to mix the chain. This second property discourages the use of “coarse-grained” label functions, such as the number of seats of 18 that the Democrats would hold with the districting in question, because many swaps would be needed to shift a representative from one party to another.

We considered two label functions for our experiments (each selected with the above desired properties in mind) to show the robustness of our framework. The first label function <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar that we used is simply the negative of the variance in the proportions of Democrat voters among the districts. Thus, given a districting <mml:math><mml:mi>σ</mml:mi></mml:math>σ, <mml:math><mml:mrow><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ωvar(σ) is calculated as<mml:math display="block"><mml:mrow><mml:mrow><mml:mrow><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo>?</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mfrac><mml:mrow><mml:msubsup><mml:mi>δ</mml:mi><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:msubsup><mml:mi>δ</mml:mi><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup><mml:mo>+</mml:mo><mml:mo>?</mml:mo><mml:mo>+</mml:mo><mml:msubsup><mml:mi>δ</mml:mi><mml:mn>18</mml:mn><mml:mn>2</mml:mn></mml:msubsup></mml:mrow><mml:mn>18</mml:mn></mml:mfrac><mml:mo>?</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mi>δ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>δ</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mo>?</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>δ</mml:mi><mml:mn>18</mml:mn></mml:msub></mml:mrow><mml:mn>18</mml:mn></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>ωvar(σ)=?(δ12+δ22+?+δ18218?(δ1+δ2+?+δ1818)2),where for each <mml:math><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:mn>18</mml:mn></mml:mrow></mml:mrow></mml:math>i=1,…,18, <mml:math><mml:msub><mml:mi>δ</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math>δi is the fraction of voters in that district that voted for the Democrat presidential candidate in 2012. We suspect that the variance is a good label function from the standpoint of the first characteristic listed above but a great label function from the standpoint of the second characteristic. Recall that, in practice, gerrymandering is accomplished by packing the voters of one party into a few districts, in which they make up an overwhelming majority. This technique, naturally, results in a high-variance vector of party proportions in the districts. However, high-variance districtings can exist that do not favor either party (note, for example, that the variance is symmetric with respect to Democrats and Republicans, ignoring third-party affiliations). Thus, for a districting that is biased against <mml:math><mml:mi>π</mml:mi></mml:math>π because of a partisan gerrymander to “stand out” as an outlier, it must have especially high variance. In particular, statistical significance will be weaker than it might be for a label function that is more strongly correlated with partisan gerrymandering. However, <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar can detect very small changes in the districting, because essentially, every swap will either increase or decrease the variance. Indeed, for the run of the chain corresponding to the <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mi mathvariant="normal">∞</mml:mi></mml:msup></mml:math>L∞ constraint (SI Text, Runs of the Chain), <mml:math><mml:mrow><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ωvar(X0) was strictly greater than <mml:math><mml:mrow><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ωvar(Xi) for the entire trajectory <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>≤</mml:mo><mml:mi>i</mml:mi><mml:mo>≤</mml:mo><mml:msup><mml:mn>2</mml:mn><mml:mn>40</mml:mn></mml:msup></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(1≤i≤240). That is, for the <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mi mathvariant="normal">∞</mml:mi></mml:msup></mml:math>L∞ constraint, the current districting of Pennsylvania was the absolute worst districting seen according to <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar among the more than 1 trillion districtings on the trajectory.

The second label function that we considered is calculated simply as the difference between the median and the mean of the ratios <mml:math><mml:mrow><mml:msub><mml:mi>δ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>δ</mml:mi><mml:mn>18</mml:mn></mml:msub></mml:mrow></mml:math>δ1,…,δ18. This simple metric, called the “Symmetry Vote Bias” by McDonald and Best (13) and denoted as <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM by us, is closely tied to the goal of partisan gerrymandering. As a simple illustration of the connection, we consider the case where the median of the ratios <mml:math><mml:mrow><mml:msub><mml:mi>δ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>δ</mml:mi><mml:mn>18</mml:mn></mml:msub></mml:mrow></mml:math>δ1,…,δ18 is close to <mml:math><mml:mfrac><mml:mn>1</mml:mn><mml:mn>2</mml:mn></mml:mfrac></mml:math>12. In this case, the mean of the <mml:math><mml:msub><mml:mi>δ</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math>δi tracks the fraction of votes that the reference party wins to win one-half of the seats. Thus, a positive Symmetry Vote Bias corresponds to an advantage for the reference party, whereas a negative Symmetry Vote Bias corresponds to a disadvantage. The use of the Symmetry Vote Bias in evaluating districtings dates at least to the 19th century work of Edgeworth (21). These features make it an excellent candidate from the standpoint of our first criterion: gerrymandering is very likely to be reflected in outlier values of <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM.

However, <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM is a rather slow-changing function compared with <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar. Indeed, observe that, in the calculation<mml:math display="block"><mml:mrow><mml:mrow><mml:mtext>Symmetry Vote Bias</mml:mtext><mml:mo>=</mml:mo><mml:mrow><mml:mtext>median</mml:mtext><mml:mo>?</mml:mo><mml:mtext>mean</mml:mtext></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>Symmetry Vote Bias=median?mean,the mean is essentially fixed, so that changes in <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM depend on changes in the median. In initial changes to the districting, only changes to the 2 districtings giving rise to the median (2 because 18 is even) can have a significant impact on <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM. (However, changes to any district directly affect <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar.)

It is likely possible to make better choices for the label function <mml:math><mml:mi>ω</mml:mi></mml:math>ω to achieve better significance. For example, the metric <mml:math><mml:msub><mml:mi>B</mml:mi><mml:mi>G</mml:mi></mml:msub></mml:math>BG described by Nagle (12) seems likely to be excellent from the standpoints of conditions i and ii simultaneously. However, we have restricted ourselves to the simple choices of <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar and <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM to clearly show our method and make it obvious that we have not tried many labeling functions before finding some that worked (in which case, a multiple hypothesis test would be required).

One point to keep in mind is that, often when applying the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test—including in this application to gerrymandering—we will wish to apply the test and thus, need to define a label function after the presented state <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0 is already known. In these cases, care must be taken to choose a “canonical” label function <mml:math><mml:mi>ω</mml:mi></mml:math>ω, so that there is no concern that <mml:math><mml:mi>ω</mml:mi></mml:math>ω was carefully crafted in response to <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0 (in this case, a multiple hypothesis correction would be required for the various possible <mml:math><mml:mi>ω</mml:mi></mml:math>ω values that could have been crafted depending on <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0); <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar and <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM are excellent choices from this perspective: the variance is a common and natural function on vectors, and the Symmetry Vote Bias has an established history in the evaluation of gerrymandering (and in particular, a history that predates the present districting of Pennsylvania).

### S5. Runs of the Chain.

In Table S1, we give the results of eight runs of the chain under various conditions. Each run was for <mml:math><mml:mrow><mml:mi>k</mml:mi><mml:mo>=</mml:mo><mml:msup><mml:mn>2</mml:mn><mml:mn>40</mml:mn></mml:msup></mml:mrow></mml:math>k=240 steps. Code and input data for our Markov chain are available at the website of W.P. (math.cmu.edu/～wes).

Generally, after an initial “burn-in” period, we expect the chain to (almost) never again see states as unusual as the current districting of Pennsylvania, which means that we expect the test to show significance proportional to the inverse of the square root of the number of steps (i.e., the <mml:math><mml:mi>p</mml:mi></mml:math>p value at <mml:math><mml:msup><mml:mn>2</mml:mn><mml:mn>42</mml:mn></mml:msup></mml:math>242 steps should be one-half the <mml:math><mml:mi>p</mml:mi></mml:math>p value at <mml:math><mml:msup><mml:mn>2</mml:mn><mml:mn>40</mml:mn></mml:msup></mml:math>240 steps). In particular, for the <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msup></mml:math>L1, <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math>L2, and <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mi mathvariant="normal">∞</mml:mi></mml:msup></mml:math>L∞ constraints, these runs never revisited states as bad as the initial state after <mml:math><mml:msup><mml:mn>2</mml:mn><mml:mn>21</mml:mn></mml:msup></mml:math>221 steps for the <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM label and after <mml:math><mml:msup><mml:mn>2</mml:mn><mml:mn>6</mml:mn></mml:msup></mml:math>26 steps for the <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar label. Note that this agrees with our guess that <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar had the potential to change more quickly than <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM. The perimeter constraint did revisit enough states as bad as the given state with respect to the <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar label to adversely affect its <mml:math><mml:mi>p</mml:mi></mml:math>p value with respect to the <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar label. This observation may reflect our guess that the <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar label is worse than the <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM label in terms of how easily it can distinguish gerrymandered districtings from random ones.

The parameters for the first row were used for Fig. 2.

One quick point is that, although we have experimented here with different compactness measures, there is no problem of multiple hypothesis correction to worry about, because every run that we encountered produces strong significance for the bias of the initial districting. The point of experimenting with the notion of compactness is to show that this is a robust framework and that the finding is unlikely to be sensitive to minor disagreements over the proper definition of the set of valid districtings.

### S6. An Example Where <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mtext>??</mml:mtext></mml:mpadded><mml:mo mathvariant="bold">≈</mml:mo><mml:msqrt><mml:mi>??</mml:mi></mml:msqrt></mml:mrow></mml:math>??≈?? Is Best Possible.

It might be natural to suspect that observing <mml:math><mml:mi>ε</mml:mi></mml:math>ε-outlier status for <mml:math><mml:mi>σ</mml:mi></mml:math>σ on a random trajectory from <mml:math><mml:mi>σ</mml:mi></mml:math>σ is significant at something like <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>p</mml:mi></mml:mpadded><mml:mo>≈</mml:mo><mml:mi>ε</mml:mi></mml:mrow></mml:math>p≈ε instead of the significance <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>p</mml:mi></mml:mpadded><mml:mo>≈</mml:mo><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:mrow></mml:math>p≈ε established by Theorem 1.1. However, because Theorem 1.1 places no demand on the mixing rate of <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M, it should instead seem remarkable that any significance can be shown in general, and indeed, we show by example in this section that significance at <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>p</mml:mi></mml:mpadded><mml:mo>≈</mml:mo><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:mrow></mml:math>p≈ε is essentially best possible.

Let <mml:math><mml:mi>N</mml:mi></mml:math>N be some large integer. We let <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M be the Markov chain where <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 is distributed uniformly in <mml:math><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>0,1,2</mml:mn><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mi>N</mml:mi><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:math>[0,1,2,…,N?1], and for any <mml:math><mml:mrow><mml:mi>i</mml:mi><mml:mo>≥</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math>i≥1, <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math>Xi is equal to <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>ζ</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math>Xi?1+ζi computed modulo <mml:math><mml:mi>N</mml:mi></mml:math>N, where <mml:math><mml:msub><mml:mi>ζ</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math>ζi is <mml:math><mml:mn>1</mml:mn></mml:math>1 or <mml:math><mml:mrow><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math>?1 with probability <mml:math><mml:mrow><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:math>1/2. Note that the chain is stationary and reversible.

If <mml:math><mml:mi>N</mml:mi></mml:math>N is chosen large relative to <mml:math><mml:mi>k</mml:mi></mml:math>k, then with probability arbitrarily close to one the value of <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 is at distance greater than <mml:math><mml:mi>k</mml:mi></mml:math>k from zero (in both directions). Conditioning on this event, we have that <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 is minimum among <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xk if and only if all of the partial sums <mml:math><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">∑</mml:mo><mml:mrow><mml:mi>i</mml:mi><mml:mo>=</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mi>j</mml:mi></mml:msubsup><mml:msub><mml:mi>ζ</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math>∑i=1jζi are positive. The probability of this event is just the probability that a <mml:math><mml:mi>k</mml:mi></mml:math>k-step 1D random walk from the origin takes a first step to the right and does not return to the origin. The calculation of this probability is a classical problem in random walks, which can be solved using the reflection principle.

In particular, for <mml:math><mml:mi>k</mml:mi></mml:math>k even, the probability is given by<mml:math display="block"><mml:mrow><mml:mrow><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:msup><mml:mn>2</mml:mn><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msup></mml:mfrac><mml:mrow><mml:mo>(</mml:mo><mml:mfrac linethickness="0.0pt"><mml:mi>k</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>～</mml:mo><mml:mfrac><mml:mn>1</mml:mn><mml:msqrt><mml:mrow><mml:mn>2</mml:mn><mml:mi>π</mml:mi><mml:mi>k</mml:mi></mml:mrow></mml:msqrt></mml:mfrac></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>12k+1(kk/2)～12πk.

Because being the minimum of <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xk corresponds to being an <mml:math><mml:mi>ε</mml:mi></mml:math>ε-outlier for <mml:math><mml:mrow><mml:mi>ε</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mpadded width="+1.7pt"><mml:mi>k</mml:mi></mml:mpadded></mml:mrow><mml:mo>+</mml:mo><mml:mn>?1</mml:mn></mml:mrow></mml:mrow></mml:math>ε=1/k+?1, this example shows that the probability of being an <mml:math><mml:mi>ε</mml:mi></mml:math>ε-outlier can be as high as <mml:math><mml:msqrt><mml:mrow><mml:mrow><mml:mi>ε</mml:mi><mml:mo>/</mml:mo><mml:mn>2</mml:mn></mml:mrow><mml:mi>π</mml:mi></mml:mrow></mml:msqrt></mml:math>ε/2π.

The best possible value of the constant in the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test seems to be an interesting problem for future work.

### S7. Notes on Statistical Power.

The effectiveness of the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test depends on the availability of a good choice for <mml:math><mml:mi>ω</mml:mi></mml:math>ω and the ability to run the test for long enough (in other words, choose <mml:math><mml:mi>k</mml:mi></mml:math>k large enough) to detect that the presented state is a local outlier.

It is possible, however, to make a general statement about the power of the test when <mml:math><mml:mi>k</mml:mi></mml:math>k is chosen large relative to the actual mixing time of the chain. Recall that one potential application of the test is in situations where the mixing time of the chain is actually accessible through reasonable computational resources, although this fact cannot be proved rigorously, because theoretical bounds on the mixing time are not available. In particular, we do know that the test is very likely to succeed when <mml:math><mml:mi>k</mml:mi></mml:math>k is sufficiently large and <mml:math><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ω(σ0) is atypical.

Theorem S1. Let <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M be a reversible Markov chain on <mml:math><mml:mi mathvariant="normal">Σ</mml:mi></mml:math>Σ, and let <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>ω</mml:mi></mml:mpadded><mml:mo>:</mml:mo><mml:mrow><mml:mi mathvariant="normal">Σ</mml:mi><mml:mo>→</mml:mo><mml:mi>?</mml:mi></mml:mrow></mml:mrow></mml:math>ω:Σ→? be arbitrary. Given <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0, suppose that, for a random state <mml:math><mml:mrow><mml:mi>σ</mml:mi><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>σ～π, <mml:math><mml:mrow><mml:mi>????</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>≤</mml:mo><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>≤</mml:mo><mml:mi>ε</mml:mi></mml:mrow></mml:math>????(ω(σ)≤ω(σ0))≤ε. Then, with probability at least<mml:math display="block"><mml:mrow><mml:mrow><mml:mi>ρ</mml:mi><mml:mo>≥</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>?</mml:mo><mml:mrow><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi>ε</mml:mi><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>10</mml:mn><mml:msub><mml:mi>τ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:msqrt><mml:msub><mml:mi>π</mml:mi><mml:mi>min</mml:mi></mml:msub></mml:msqrt></mml:mfrac><mml:mrow><mml:mi>exp</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mo>?</mml:mo><mml:mrow><mml:msup><mml:mi>ε</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>k</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>20</mml:mn><mml:msub><mml:mi>τ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>ρ≥1?(1+εk10τ2)1πminexp(?ε2k20τ2),

<mml:math><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ω(σ) is an <mml:math><mml:mrow><mml:mn>2</mml:mn><mml:mi>ε</mml:mi></mml:mrow></mml:math>-outlier among <mml:math><mml:mrow><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math>ω(σ0),ω(σ1),…,ω(σk), where <mml:math><mml:mrow><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo></mml:mrow></mml:math>σ0,σ1,… is a random trajectory starting from <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0.

Here, <mml:math><mml:msub><mml:mi>τ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math>τ2 is the relaxation time for <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M defined as <mml:math><mml:mrow><mml:mn>1</mml:mn><mml:mo>/</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>1</mml:mn><mml:mo>?</mml:mo><mml:msub><mml:mi>λ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>1/(1?λ2), where <mml:math><mml:msub><mml:mi>λ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math>λ2 is the second eigenvalue of <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M. <mml:math><mml:msub><mml:mi>τ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math>τ2 is thus the inverse of the spectral gap for <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M and intimately related to the mixing time of <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M (22?24). The probability <mml:math><mml:mi>ρ</mml:mi></mml:math>ρ in Theorem S1 converges exponentially quickly to <mml:math><mml:mn>1</mml:mn></mml:math>1 and moreover, is very close to one after <mml:math><mml:mi>k</mml:mi></mml:math>k is large relative to <mml:math><mml:msub><mml:mi>τ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:math>τ2. In particular, Theorem S1 shows that the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test will work when the test is run for long enough. Of course, one strength of the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test is that it can sometimes show bias, even when <mml:math><mml:mi>k</mml:mi></mml:math>k is far too small to mix the chain, which is almost certainly the case for our application to gerrymandering. When these short-<mml:math><mml:mi>k</mml:mi></mml:math>k runs are successful at detecting bias is, of course, dependent on the relationship of the presented state <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0 and its local neighborhood in the chain.

Theorem S1 is an application of the following theorem of Gillman.

Theorem S2. Let <mml:math><mml:mrow><mml:mi mathvariant="script">M</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo></mml:mrow></mml:mrow></mml:math>M=X0,X1,… be a reversible Markov chain on <mml:math><mml:mi mathvariant="normal">Σ</mml:mi></mml:math>Σ, let <mml:math><mml:mrow><mml:mi>A</mml:mi><mml:mo>?</mml:mo><mml:mi mathvariant="normal">Σ</mml:mi></mml:mrow></mml:math>A?Σ, and let <mml:math><mml:mrow><mml:msub><mml:mi>N</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>Nk(A) denote the number of visits to <mml:math><mml:mi>A</mml:mi></mml:math>A among <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xk. Then, for any <mml:math><mml:mrow><mml:mi>γ</mml:mi><mml:mo>></mml:mo><mml:mn>0</mml:mn></mml:mrow></mml:math>γ>0,<mml:math display="block"><mml:mrow><mml:mtext>????</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>n</mml:mi><mml:mo>?</mml:mo><mml:mi>π</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>></mml:mo><mml:mi>γ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>≤</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi>γ</mml:mi><mml:mi>n</mml:mi></mml:mrow><mml:mrow><mml:mn>10</mml:mn><mml:msub><mml:mi>τ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:msqrt><mml:mrow><mml:munder><mml:mo largeop="true" movablelimits="false" symmetric="true">∑</mml:mo><mml:mi>σ</mml:mi></mml:munder><mml:mfrac><mml:mrow><mml:mtext>????</mml:mtext><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mrow><mml:mi>π</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mfrac></mml:mrow></mml:msqrt><mml:mo>×</mml:mo><mml:mi>exp</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mo>?</mml:mo><mml:mrow><mml:msup><mml:mi>γ</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>n</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>20</mml:mn><mml:msub><mml:mi>τ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>????(Nk(A)/n?π(A)>γ)≤(1+γn10τ2)∑σ????(X0=σ)2π(σ)×exp(?γ2n20τ2).Proof of Theorem S1. We apply Theorem S2, with <mml:math><mml:mi>A</mml:mi></mml:math>A as the set of states <mml:math><mml:mrow><mml:mi>σ</mml:mi><mml:mo>∈</mml:mo><mml:mi mathvariant="normal">Σ</mml:mi></mml:mrow></mml:math>σ∈Σ, such that <mml:math><mml:mrow><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>≤</mml:mo><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math>ω(σ)≤ω(σ0), <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow></mml:math>X0=σ0, and <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>γ</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:mi>ε</mml:mi></mml:mrow></mml:math>γ=ε. By assumption, <mml:math><mml:mrow><mml:mrow><mml:mi>π</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>≤</mml:mo><mml:mi>ε</mml:mi></mml:mrow></mml:math>π(A)≤ε, and Theorem S2 gives that<mml:math display="block"><mml:mrow><mml:mtext>????</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>N</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>A</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>/</mml:mo><mml:mi>k</mml:mi><mml:mo>></mml:mo><mml:mn>2</mml:mn><mml:mi>ε</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>≤</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mn>1</mml:mn><mml:mo>+</mml:mo><mml:mfrac><mml:mrow><mml:mi>ε</mml:mi><mml:mi>k</mml:mi></mml:mrow><mml:mrow><mml:mn>10</mml:mn><mml:msub><mml:mi>τ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:msqrt><mml:mfrac><mml:mn>1</mml:mn><mml:msub><mml:mi>π</mml:mi><mml:mi>min</mml:mi></mml:msub></mml:mfrac></mml:msqrt><mml:mi>exp</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mfrac><mml:mrow><mml:mo>?</mml:mo><mml:mrow><mml:msup><mml:mi>ε</mml:mi><mml:mn>2</mml:mn></mml:msup><mml:mi>k</mml:mi></mml:mrow></mml:mrow><mml:mrow><mml:mn>20</mml:mn><mml:msub><mml:mi>τ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:mfrac><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>????(Nk(A)/k>2ε)≤(1+εk10τ2)1πminexp(?ε2k20τ2).

<mml:math><mml:mi mathvariant="normal">□</mml:mi></mml:math>

### S8. A Result for Small Variation Distance.

In this section, we give a corollary of Theorem 1.1 that applies to the setting where <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 is not distributed as a stationary distribution <mml:math><mml:mi>π</mml:mi></mml:math>π but instead, is distributed with small total variation distance to <mml:math><mml:mi>π</mml:mi></mml:math>π.

The total variation distance <mml:math><mml:msub><mml:mrow><mml:mo fence="true">||</mml:mo><mml:mrow><mml:msub><mml:mi>ρ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>?</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo fence="true">||</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:math>||ρ1?ρ2||TV between probability distributions <mml:math><mml:mrow><mml:msub><mml:mi>ρ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow></mml:math>ρ1,ρ2 on a probability space <mml:math><mml:mi mathvariant="normal">Ω</mml:mi></mml:math>Ω is defined to be<mml:math display="block"><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo fence="true">||</mml:mo><mml:mrow><mml:msub><mml:mi>ρ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>?</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo fence="true">||</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>:=</mml:mo><mml:mrow><mml:munder><mml:mtext>sup</mml:mtext><mml:mrow><mml:mi>E</mml:mi><mml:mo>?</mml:mo><mml:mi mathvariant="normal">Ω</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:mo stretchy="false">|</mml:mo><mml:mrow><mml:mrow><mml:msub><mml:mi>ρ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>E</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>?</mml:mo><mml:mrow><mml:msub><mml:mi>ρ</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>E</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo stretchy="false">|</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>||ρ1?ρ2||TV:=supE?Ω|ρ1(E)?ρ2(E)|.[S1]

Corollary S1. Let <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi mathvariant="script">M</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo></mml:mrow></mml:mrow></mml:math>M=X0,X1,… be a reversible Markov chain with a stationary distribution <mml:math><mml:mi>π</mml:mi></mml:math>π, and suppose that the states of <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M have real-valued labels. If <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mrow><mml:mo fence="true">||</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>?</mml:mo><mml:mi>π</mml:mi></mml:mrow><mml:mo fence="true">||</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:mpadded><mml:mo>≤</mml:mo><mml:msub><mml:mi>ε</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math>||X0?π||TV≤ε1, then for any fixed <mml:math><mml:mi>k</mml:mi></mml:math>k, the probability that the label of <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 is an <mml:math><mml:mi>ε</mml:mi></mml:math>ε-outlier from among the list of labels observed in the trajectory <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,X1,X2,…,Xk is, at most, <mml:math><mml:mrow><mml:msqrt><mml:mrow><mml:mn>2</mml:mn><mml:mi>ε</mml:mi></mml:mrow></mml:msqrt><mml:mo>+</mml:mo><mml:msub><mml:mi>ε</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math>2ε+ε1.

The theorem is intuitively clear; we provide a formal proof below for completeness.

#### Proof.

If <mml:math><mml:mrow><mml:mrow><mml:msub><mml:mi>ρ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>ρ1,ρ2, and <mml:math><mml:mi>τ</mml:mi></mml:math>τ are probability distributions, then we have that the product distributions <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>τ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(ρ1,τ) and <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>τ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(ρ2,τ) satisfy<mml:math display="block"><mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mo fence="true">||</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>τ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>?</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mi>τ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo fence="true">||</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mo fence="true">||</mml:mo><mml:mrow><mml:msub><mml:mi>ρ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>?</mml:mo><mml:msub><mml:mi>ρ</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo fence="true">||</mml:mo></mml:mrow><mml:mtext>TV</mml:mtext></mml:msub></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>||(ρ1,τ)?(ρ2,τ)||TV=||ρ1?ρ2||TV.[S2]

Our plan now is to split the randomness in the trajectory <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xk of the Markov chain into two independent sources: the initial distribution is <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>～</mml:mo><mml:mi>ρ</mml:mi></mml:mrow></mml:math>X0～ρ, and <mml:math><mml:mi>τ</mml:mi></mml:math>τ is the uniform distribution on sequences of length <mml:math><mml:mi>k</mml:mi></mml:math>k of real numbers <mml:math><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>r1,r2,…,rk in <mml:math><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mn>0,1</mml:mn><mml:mo stretchy="false">]</mml:mo></mml:mrow></mml:math>[0,1]. We can view the distribution of the trajectory <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,X1,…,Xk as the product <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>ρ</mml:mi><mml:mo>,</mml:mo><mml:mi>τ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(ρ,τ) by using sequences of reals <mml:math><mml:mrow><mml:msub><mml:mi>r</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>r</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>r1,…,rk to choose transitions in the chain; from <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mrow></mml:math>Xi=σi, if there are <mml:math><mml:mi>L</mml:mi></mml:math>L transitions possible, with probabilities <mml:math><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>L</mml:mi></mml:msub></mml:mrow></mml:math>p1,…,pL. Then, we make the <mml:math><mml:mi>t</mml:mi></mml:math>tth possible transition if <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>r</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded><mml:mo>∈</mml:mo><mml:mrow><mml:mo stretchy="false">[</mml:mo><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mo>?</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mrow><mml:msub><mml:mi>p</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>+</mml:mo><mml:mo>?</mml:mo><mml:mo>+</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mrow><mml:mi>t</mml:mi><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>+</mml:mo><mml:msub><mml:mi>p</mml:mi><mml:mi>t</mml:mi></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ri∈[p1+?+pt?1,p1+?+pt?1+pt).

Now we have from Eq. S2 that, if <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mrow><mml:mo fence="true">||</mml:mo><mml:mrow><mml:mi>ρ</mml:mi><mml:mo>?</mml:mo><mml:mi>π</mml:mi></mml:mrow><mml:mo fence="true">||</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:mpadded><mml:mo>≤</mml:mo><mml:msub><mml:mi>ε</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math>||ρ?π||TV≤ε1, then <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mrow><mml:mo fence="true">||</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>ρ</mml:mi><mml:mo>,</mml:mo><mml:mi>τ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>?</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>π</mml:mi><mml:mo>,</mml:mo><mml:mi>τ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo fence="true">||</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:mpadded><mml:mo>≤</mml:mo><mml:msub><mml:mi>ε</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math>||(ρ,τ)?(π,τ)||TV≤ε1. Therefore, any event that would happen with probability at most <mml:math><mml:mi>p</mml:mi></mml:math>p for the sequence <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xk when <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>X0～π must happen with probability at most <mml:math><mml:mrow><mml:mi>p</mml:mi><mml:mo>+</mml:mo><mml:msub><mml:mi>ε</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math>p+ε1 when <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>ρ</mml:mi></mml:mrow></mml:math>X0～ρ, where <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mrow><mml:mo fence="true">||</mml:mo><mml:mrow><mml:mi>ρ</mml:mi><mml:mo>?</mml:mo><mml:mi>π</mml:mi></mml:mrow><mml:mo fence="true">||</mml:mo></mml:mrow><mml:mrow><mml:mi>T</mml:mi><mml:mi>V</mml:mi></mml:mrow></mml:msub></mml:mpadded><mml:mo>≤</mml:mo><mml:msub><mml:mi>ε</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math>||ρ?π||TV≤ε1. The corollary follows. <mml:math><mml:mi mathvariant="normal">□</mml:mi></mml:math>

## Definitions

We remind the reader that a Markov chain is a discrete time random process; at each step, the chain jumps to a new state, which only depends on the previous state. Formally, a Markov chain <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M on a state space <mml:math><mml:mi mathvariant="normal">Σ</mml:mi></mml:math>Σ is a sequence <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi mathvariant="script">M</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>2</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo></mml:mrow></mml:mrow></mml:math>M=X0,X1,X2,… of random variables taking values in <mml:math><mml:mi mathvariant="normal">Σ</mml:mi></mml:math>Σ (which correspond to states that may be occupied at each step), such that, for any <mml:math><mml:mrow><mml:mrow><mml:mi>σ</mml:mi><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo>∈</mml:mo><mml:mi mathvariant="normal">Σ</mml:mi></mml:mrow></mml:math>σ,σ0,…,σn?1∈Σ,<mml:math display="block"><mml:mrow><mml:mtext>????</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>n</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>????</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">|</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mrow><mml:mi>n</mml:mi><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>????(Xn=σ|X0=σ0,X1=σ1,…,Xn?1=σn?1)=????(X1=σ|X0=σn?1).Note that a Markov chain is completely described by the distribution of <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 and the transition probabilities <mml:math><mml:mrow><mml:mi>????</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>????(X1=σ1|X0=σ0) for all pairs <mml:math><mml:mrow><mml:mrow><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow><mml:mo>∈</mml:mo><mml:mi mathvariant="normal">Σ</mml:mi></mml:mrow></mml:math>σ0,σ1∈Σ. Terminology is often abused, so that the Markov chain refers only to the ensemble of transition probabilities, regardless of the choice of distribution for <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0.

With this abuse of terminology, a stationary distribution for the Markov chain is a distribution <mml:math><mml:mi>π</mml:mi></mml:math>π, such that <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>X0～π implies that <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>X1～π and therefore, that <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>Xi～π for all <mml:math><mml:mi>i</mml:mi></mml:math>i. When the distribution of <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 is a stationary distribution, the Markov chain <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo></mml:mrow></mml:math>X0,X1,… is said to be stationary. A stationary chain is said to be reversible if, for all <mml:math><mml:mrow><mml:mi>i</mml:mi><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:math>i,k, the sequence of random variables <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(Xi,Xi+1,…,Xi+k) is identical in distribution to the sequence <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mrow><mml:mi>i</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(Xi+k,Xi+k?1,…,Xi). Finally, a chain is reducible if there is a pair of states <mml:math><mml:mrow><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mrow></mml:math>σ0,σ1, such that <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math>σ1 is inaccessible from <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0 via legal transitions and irreducible otherwise.

A simple example of a Markov chain is a random walk on a directed graph beginning from an initial vertex <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 chosen from some distribution. Here, <mml:math><mml:mi mathvariant="normal">Σ</mml:mi></mml:math>Σ is the vertex set of the directed graph. If we are allowed to label the directed edges with positive reals and if the probability of traveling along an arc is proportional to the label of the arc (among those leaving the present vertex), then any Markov chain has such a representation, because the transition probability <mml:math><mml:mrow><mml:mtext>????</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo stretchy="false">|</mml:mo><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>????(X1=σ1|X0=σ0) can be taken as the label of the arc from <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>σ0 to <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub></mml:math>σ1. Finally, if the graph is undirected, the corresponding Markov chain is reversible.

## Detecting Bias in Political Districting

A central feature of American democracy is the selection of Congressional districts in which local elections are held to directly elect national representatives. Because a separate election is held in each district, the proportions of party affiliations of the slate of representatives elected in a state do not always match the proportions of statewide votes cast for each party. In practice, large deviations from this seemingly desirable target do occur.

Various tests have been proposed to detect “gerrymandering” of districting, in which a district is drawn in such a way as to bias the resulting slate of representatives toward one party, which can be accomplished by concentrating voters of the unfavored party in a few districts. One class of methods to detect gerrymandering concerns heuristic “smell tests,” which judge whether districting seems generally reasonable in its statistical properties (11, 12). For example, such tests may frown on districting in which difference between the mean and median votes on district by district basis is unusually large (13).

The simplest statistical smell test, of course, is whether the party affiliation of the elected slate of representatives is close in proportion to the party affiliations of votes for representatives. Many states have failed this simple test spectacularly, such as in Pennsylvania; in 2012, 48.77% of votes were cast for Republican representatives and 50.20% of votes were cast for Democrat representatives in an election that resulted in a slate of 13 Republican representatives and 5 Democrat representatives.

Heuristic statistical tests such as these all suffer from lack of rigor, however, because of the fact that the statistical properties of “typical” districting are not rigorously characterized. For example, it has been shown (14) that Democrats may be at a natural disadvantage when drawing electoral maps, even when no bias is at play, because Democrat voters are often highly geographically concentrated in urban areas. Particularly problematic is that the degree of geographic clustering of partisans is highly variable from state to state: what looks like gerrymandered districting in one state may be a natural consequence of geography in another.

Some work has been done in which the properties of valid districting are defined (which may be required to have roughly equal populations among districts, districts with reasonable boundaries, etc.), so that the characteristics of a given districting can be compared with what would be typical for valid districting of the state in question, by using computers to generate random districting (15, 16); there is discussion of this in ref. 13. However, much of this work has relied on heuristic sampling procedures, which do not have the property of selecting districting with equal probability (and more generally, distributions that are not well-characterized), undermining rigorous statistical claims about the properties of typical districts.

In an attempt to establish a rigorous framework for this kind of approach, several groups (17?19) have used Markov chains to sample random valid districting for the purpose of such comparisons. Like many other applications of real-world Markov chains, however, these methods suffer from the completely unknown mixing time of the chains in question. Indeed, no work has even established that the Markov chains are irreducible (in the case of districting, irreducibility means that any valid districting can be reached from any other by a legal sequence of steps), even if valid districting was only required to consist of contiguous districts of roughly equal populations. Additionally, indeed, for very restrictive notions of what constitutes valid districting, irreducibility certainly fails.

As a straightforward application of the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test, we can achieve rigorous <mml:math><mml:mi>p</mml:mi></mml:math>p values in Markov models of political districting, despite the lack of bounds on mixing times of the chains. In particular, for all choices of the constraints on valid districting that we tested, the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test showed that the current Congressional districting of Pennsylvania is an outlier at significance thresholds ranging from <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>p</mml:mi></mml:mpadded><mml:mo>≈</mml:mo><mml:mrow><mml:mn>?2.5</mml:mn><mml:mo>?</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>?</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:math>p≈?2.5?10?4 to <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>p</mml:mi></mml:mpadded><mml:mo>≈</mml:mo><mml:mrow><mml:mn>?8.1</mml:mn><mml:mo>?</mml:mo><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>?</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mrow></mml:math>p≈?8.1?10?7. Detailed results of these runs are in SI Text.

A key advantage of the Markov chain approach to gerrymandering is that it rests on a rigorous framework, namely comparing the actual districting of a state with typical (i.e., random) districting from a well-defined set of valid districting. The rigor of the approach thus depends on the availability of a precise definition of what constitutes valid districting; in principle and in practice, the best choice of definition is a legal question. Although some work on Markov chains for redistricting (in particular, ref. 19) has aimed to account for complex constraints on valid districting, our main goal in this manuscript is to illustrate the application of the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test. In particular, we have erred on the side of using relatively simple sets of constraints on valid districting in our Markov chains, while checking that our significance results are not highly sensitive to the parameters that we use. However, our test immediately gives a way of putting the work, such as that in ref. 19, on a rigorous statistical footing.

The full description of the Markov chain that we use in this work is given in SI Text, but its basic structure is as follows: Pennsylvania is divided into roughly 9,000 census blocks. (These blocks can be seen on close inspection of Fig. 2.) We define a division of these blocks into 18 districts to be a valid districting of Pennsylvania if districts differ in population by less than <mml:math><mml:mrow><mml:mn>2</mml:mn><mml:mo>%</mml:mo></mml:mrow></mml:math>2%, are contiguous, are simply connected (districts do not contain holes), and are “compact” in ways that we discuss in SI Text; roughly, this final condition prohibits districts with extremely contorted structure. The state space of the Markov chain is the set of valid districting of the state, and one step of the Markov chain consists of randomly swapping a precinct on the boundary of a district to a neighboring district if the result is still a valid districting. As we discuss in SI Text, the chain is adjusted slightly to ensure that the uniform distribution on valid districting is indeed a stationary distribution for the chain. Observe that this Markov chain has a potentially huge state space; if the only constraint on valid districting was that the districts have roughly equal population, there would be <mml:math><mml:msup><mml:mn>10</mml:mn><mml:mn>10000</mml:mn></mml:msup></mml:math>1010000 or so valid districtings. Although contiguity and especially, compactness are severe restrictions that will decrease this number substantially, it seems difficult to compute effective upper bounds on the number of resulting valid districtings, and certainly, it is still enormous. Impressively, these considerations are all immaterial to our very general method.

View larger version:
Fig. 2.

(Left) The current districting of Pennsylvania. (Right) Districting produced by the Markov chain after <mml:math><mml:msup><mml:mn>2</mml:mn><mml:mn>40</mml:mn></mml:msup></mml:math>240 steps. (Detailed parameters for this run are given in SI Text.)

Applying the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test involves the choice of a label function <mml:math><mml:mrow><mml:mi>ω</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ω(σ), which assigns a real number to each districting. We have conducted runs using two label functions: <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar is the (negative) variance of the proportion of Democrats in each district of the districting (as measured by 2012 presidential votes), and <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM is the difference between the median and mean of the proportions of Democrats in each district; <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM is motivated by the fact that this metric has a long history of use in gerrymandering and is directly tied to the goals of gerrymandering, whereas the use of the variance is motivated by the fact that it can change quickly with small changes in districtings. These two choices are discussed further in SI Text, but an important point is that our use of these label functions is not based on an assumption that small values of <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar or <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM directly imply gerrymandering. Instead, because Theorem 1.1 is valid for any fixed label function, these labels are tools used to show significance, which are chosen because they are simple and natural functions on vectors that can be quickly computed, seem likely to be different for typical versus gerrymandered districtings, and have the potential to change relatively quickly with small changes in districtings. For the various notions of valid districtings that we considered, the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test showed significance at <mml:math><mml:mi>p</mml:mi></mml:math>p values in the range from <mml:math><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>?</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math>10?4 to <mml:math><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>?</mml:mo><mml:mn>5</mml:mn></mml:mrow></mml:msup></mml:math>10?5 for the <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>MM</mml:mtext></mml:msub></mml:math>ωMM label function and the range from <mml:math><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>?</mml:mo><mml:mn>4</mml:mn></mml:mrow></mml:msup></mml:math>10?4 to <mml:math><mml:msup><mml:mn>10</mml:mn><mml:mrow><mml:mo>?</mml:mo><mml:mn>7</mml:mn></mml:mrow></mml:msup></mml:math>10?7 for the <mml:math><mml:msub><mml:mi>ω</mml:mi><mml:mtext>var</mml:mtext></mml:msub></mml:math>ωvar label function (see Fig. S1 and Table S1).

View larger version:
Fig. S1.

The last state from each of the above runs of the chain (perimeter, <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msup></mml:math>L1, <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math>L2, and <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mi mathvariant="normal">∞</mml:mi></mml:msup></mml:math>L∞, respectively). Note that the <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mi mathvariant="normal">∞</mml:mi></mml:msup></mml:math>L∞ districting is quite ugly; with this notion of validity, every district among the 18 is allowed to be as noncompact as the worst district in the current Pennsylvania districting. The perimeter constraint produces a districting that appears clean at a large scale but allows rather messy city districts, because they contribute only moderately to the perimeter anyway. The <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mn>1</mml:mn></mml:msup></mml:math>L1 and <mml:math><mml:msup><mml:mi>L</mml:mi><mml:mn>2</mml:mn></mml:msup></mml:math>L2 constraints are more balanced. Note that none of these districtings should be expected to be geometrically “nicer” than the current districting of Pennsylvania. Indeed, the point of our Markov chain framework is to compare the present districting of Pennsylvania with other “just as bad” districtings to observe that, even among this set, the present districting is atypical.

View this table:
Table S1.

Runs of the redistricting Markov chain with results of the <mml:math><mml:msqrt><mml:mi>??</mml:mi></mml:msqrt></mml:math>?? test

As noted earlier, the <mml:math><mml:msqrt><mml:mi>ε</mml:mi></mml:msqrt></mml:math>ε test can easily be used with more complicated Markov chains that capture more intricate definitions of the set of valid districtings. For example, the current districting of Pennsylvania splits fewer rural counties than the districting in Fig. 2, Right, and the number of county splits is one of many metrics for valid districtings considered by the Markov chains developed in ref. 19. Indeed, our test will be of particular value in cases where complex notions of what constitute valid districting slow the chain to make the heuristic mixing assumption particularly questionable. Regarding mixing time, even our chain with relatively weak constraints on the districtings (and very fast running time in implementation) seems to mix too slowly to sample <mml:math><mml:mi>π</mml:mi></mml:math>π, even heuristically; in Fig. 2, we see that several districts still seem to have not left their general position from the initial districting, even after <mml:math><mml:msup><mml:mn>2</mml:mn><mml:mn>40</mml:mn></mml:msup></mml:math>240 steps.

On the same note, it should also be kept in mind that, although our result gives a method to rigorously disprove that a given districting is unbiased—e.g., to show that the districting is unusual among districtings <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 distributed according to the stationary distribution <mml:math><mml:mi>π</mml:mi></mml:math>π—it does so without giving a method to sample from the stationary distribution. In particular, our method cannot answer the question of how many seats Republicans and Democrats should have in a typical districting of Pennsylvania, because we are still not mixing the chain. Instead, Theorem 1.1 has given us a way to disprove <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>X0～π without sampling <mml:math><mml:mi>π</mml:mi></mml:math>π.

## Proof of Theorem 1.1

We let <mml:math><mml:mi>π</mml:mi></mml:math>π denote any stationary distribution for <mml:math><mml:mi mathvariant="script">M</mml:mi></mml:math>M and suppose that the initial state <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:math>X0 is distributed as <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>X0～π, so that in fact, <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>Xi～π for all <mml:math><mml:mi>i</mml:mi></mml:math>i. We say <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math>σj is <mml:math><mml:mi mathvariant="normal">?</mml:mi></mml:math>?-small among <mml:math><mml:mrow><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>σ0,…,σk if there are, at most, <mml:math><mml:mi mathvariant="normal">?</mml:mi></mml:math>? indices <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>i</mml:mi></mml:mpadded><mml:mo>≠</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:math>i≠j among <mml:math><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:math>0,…,k, such that the label of <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math>σi is, at most, the label of <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math>σj. In particular, <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math>σj is <mml:math><mml:mn>0</mml:mn></mml:math>0-small among <mml:math><mml:mrow><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>σ0,σ1,…,σk when its label is the unique minimum label, and we encourage readers to focus on this <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi mathvariant="normal">?</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:mn>?0</mml:mn></mml:mrow></mml:math>?=?0 case in their first reading of the proof.

For <mml:math><mml:mrow><mml:mn>0</mml:mn><mml:mo>≤</mml:mo><mml:mi>j</mml:mi><mml:mo>≤</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:math>0≤j≤k, we define<mml:math display="block"><mml:mrow><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup></mml:mrow><mml:mo>:=</mml:mo><mml:mrow><mml:mtext>????</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mrow><mml:mo>?</mml:mo><mml:mtext>is</mml:mtext><mml:mo>?</mml:mo><mml:mi mathvariant="normal">?</mml:mi><mml:mo>-</mml:mo><mml:mtext>small among</mml:mtext></mml:mrow><mml:mo>?</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub></mml:mrow><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math>ρj,?k:=????(Xj?is??-small among?X0,…,Xk)<mml:math display="block"><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>:=</mml:mo><mml:mtext>????</mml:mtext><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mrow><mml:mo>?</mml:mo><mml:mtext>is</mml:mtext><mml:mo>?</mml:mo><mml:mi mathvariant="normal">?</mml:mi><mml:mo>-</mml:mo><mml:mtext>small among</mml:mtext></mml:mrow><mml:mo>?</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>∣</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>=</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>ρj,?k(σ):=????(Xj?is??-small among?X0,…,Xk∣Xj=σ).

Observe that, because <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mi>s</mml:mi></mml:msub></mml:mpadded><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>Xs～π for all <mml:math><mml:mi>s</mml:mi></mml:math>s, we also have that<mml:math display="block"><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mtext>????</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo>?</mml:mo><mml:mtext>is</mml:mtext><mml:mo>?</mml:mo><mml:mi mathvariant="normal">?</mml:mi><mml:mo>-</mml:mo><mml:mtext>small among</mml:mtext></mml:mrow><mml:mo>?</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>s</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>+</mml:mo><mml:mi>k</mml:mi></mml:mrow></mml:msub><mml:mo>∣</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mrow><mml:mi>s</mml:mi><mml:mo>+</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:mi>σ</mml:mi><mml:mo>)</mml:mo></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>ρj,?k(σ)=????(Xs+j?is??-small among?Xs,…,Xs+k∣Xs+j=σ).[1]

We begin by noting two easy facts.

### Observation 4.1.

<mml:math display="block"><mml:mrow><mml:mrow><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>?</mml:mo><mml:mi>j</mml:mi></mml:mrow><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>ρj,?k(σ)=ρk?j,?k(σ).

Proof. Because <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi mathvariant="script">M</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo></mml:mrow></mml:mrow></mml:math>M=X0,X1,… is stationary and reversible, the probability that <mml:math><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>(X0,…,Xk)=(σ0,…,σk) is equal to the probability that <mml:math><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>(X0,…,Xk)=(σk,…,σ0) for any fixed sequence <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(σ0,…,σk). Thus, any sequence <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(σ0,…,σk) for which <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>σ</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:mi>σ</mml:mi></mml:mrow></mml:math>σj=σ and <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math>σj is a <mml:math><mml:mi mathvariant="normal">?</mml:mi></mml:math>?-small corresponds to an equiprobable sequence <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mi>k</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>σ</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(σk,…,σ0), for which <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>σ</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>?</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:mi>σ</mml:mi></mml:mrow></mml:math>σk?j=σ and <mml:math><mml:msub><mml:mi>σ</mml:mi><mml:mrow><mml:mi>k</mml:mi><mml:mo>?</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msub></mml:math>σk?j is <mml:math><mml:mi mathvariant="normal">?</mml:mi></mml:math>?-small. <mml:math><mml:mi mathvariant="normal">□</mml:mi></mml:math>

### Observation 4.2.

<mml:math display="block"><mml:mrow><mml:mrow><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>≥</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>j</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>?</mml:mo><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>?</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>ρj,2?k(σ)≥ρj,?j(σ)?ρ0,?k?j(σ).

Proof. Consider the events that <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math>Xj is an <mml:math><mml:mi mathvariant="normal">?</mml:mi></mml:math>?-small among <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xj and among <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>Xj,…,Xk. These events are conditionally independent when conditioning on the value of <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mpadded><mml:mo>=</mml:mo><mml:mi>σ</mml:mi></mml:mrow></mml:math>Xj=σ, and <mml:math><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>j</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ρj,?j(σ) gives the probability of the first of these events, whereas applying Eq. 1 with <mml:math><mml:mrow><mml:mpadded width="+1.7pt"><mml:mi>s</mml:mi></mml:mpadded><mml:mo>=</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:math>s=j gives that <mml:math><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>?</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ρ0,?k?j(σ) gives the probability of the second event.

Finally, when both of these events happen, we have that <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math>Xj is <mml:math><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow></mml:math>2?-small among <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xk. <mml:math><mml:mi mathvariant="normal">□</mml:mi></mml:math>

We can now deduce that<mml:math display="block"><mml:mrow><mml:mrow><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>≥</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>j</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>?</mml:mo><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>?</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>=</mml:mo><mml:mrow><mml:mrow><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>j</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>?</mml:mo><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>?</mml:mo><mml:mi>j</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>≥</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>ρj,2?k(σ)≥ρj,?j(σ)?ρ0,?k?j(σ)=ρ0,?j(σ)?ρ0,?k?j(σ)≥(ρ0,?k(σ))2.[2]Indeed, the first inequality follows from Observation 4.2, the equality follows from Observation 4.1, and the final inequality follows from the fact that <mml:math><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:math>ρj,?k(σ) is monotone nonincreasing in <mml:math><mml:mi>k</mml:mi></mml:math>k for fixed <mml:math><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi><mml:mo>,</mml:mo><mml:mi>σ</mml:mi></mml:mrow></mml:math>j,?,σ.

Observe now that <mml:math><mml:mrow><mml:mrow><mml:mpadded width="+1.7pt"><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup></mml:mpadded><mml:mo>=</mml:mo><mml:mrow><mml:mo>??</mml:mo><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>ρj,?k=??ρj,?k(Xj), where the expectation is taken over the random choice of <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub><mml:mo>～</mml:mo><mml:mi>π</mml:mi></mml:mrow></mml:math>Xj～π.

Thus, taking expectations in Eq. 2, we find that<mml:math display="block"><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mo>=</mml:mo><mml:mrow><mml:mtext>??</mml:mtext><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>≥</mml:mo><mml:mrow><mml:mtext>??</mml:mtext><mml:mrow><mml:mo>(</mml:mo><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup><mml:mo>)</mml:mo></mml:mrow></mml:mrow></mml:mrow><mml:mrow><mml:mrow><mml:mo>≥</mml:mo><mml:mpadded width="+1.7pt"><mml:msup><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:mtext>??</mml:mtext><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mi>σ</mml:mi><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mpadded><mml:mo>=</mml:mo><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>ρj,2?k=??ρj,2?k(σ)≥??((ρ0,?k(σ))2)≥(??ρ0,?k(σ))2=(ρ0,?k)2,[3]where the second of the two inequalities is the Cauchy–Schwartz inequality.

For the final step in the proof, we sum the left- and right-hand sides of Eq. 3 to obtain<mml:math display="block"><mml:mrow><mml:mrow><mml:mrow><mml:munderover><mml:mo largeop="true" movablelimits="false" symmetric="true">∑</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>k</mml:mi></mml:munderover><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mi>j</mml:mi><mml:mo>,</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow></mml:mrow><mml:mi>k</mml:mi></mml:msubsup></mml:mrow><mml:mo>≥</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>∑j=0kρj,2?k≥(k+1)(ρ0,?k)2.If we let <mml:math><mml:msub><mml:mi>ξ</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math>ξj <mml:math><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mpadded width="+1.7pt"><mml:mn>0</mml:mn></mml:mpadded><mml:mo>≤</mml:mo><mml:mpadded width="+1.7pt"><mml:mi>i</mml:mi></mml:mpadded><mml:mo>≤</mml:mo><mml:mi>k</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math>(0≤i≤k) be the indicator variable that is one whenever <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:math>Xj is <mml:math><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow></mml:math>2?-small among <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xk, then <mml:math><mml:mrow><mml:msubsup><mml:mo largeop="true" symmetric="true">∑</mml:mo><mml:mrow><mml:mi>j</mml:mi><mml:mo>=</mml:mo><mml:mn>0</mml:mn></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:msub><mml:mi>ξ</mml:mi><mml:mi>j</mml:mi></mml:msub></mml:mrow></mml:math>∑j=0kξj is the number of <mml:math><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow></mml:math>2?-small terms, which is always, at most, <mml:math><mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:math>2?+1. Therefore, linearity of expectation gives that<mml:math display="block"><mml:mrow><mml:mrow><mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>≥</mml:mo><mml:mrow><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mn>2</mml:mn></mml:msup></mml:mrow></mml:mrow><mml:mo>,</mml:mo></mml:mrow></mml:math>2?+1≥(k+1)(ρ0,?k)2,[4]giving that<mml:math display="block"><mml:mrow><mml:mrow><mml:msubsup><mml:mi>ρ</mml:mi><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mi>k</mml:mi></mml:msubsup><mml:mo>≤</mml:mo><mml:msqrt><mml:mfrac><mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mfrac></mml:msqrt></mml:mrow><mml:mo>.</mml:mo></mml:mrow></mml:math>ρ0,?k≤2?+1k+1.[5]Theorem 1.1 follows, because if <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math>Xi is an <mml:math><mml:mi>ε</mml:mi></mml:math>ε-outlier among <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xk, then <mml:math><mml:msub><mml:mi>X</mml:mi><mml:mi>i</mml:mi></mml:msub></mml:math>Xi is necessarily <mml:math><mml:mi mathvariant="normal">?</mml:mi></mml:math>?-small among <mml:math><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>0</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:mo>…</mml:mo><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mi>k</mml:mi></mml:msub></mml:mrow></mml:math>X0,…,Xk for <mml:math><mml:mrow><mml:mi mathvariant="normal">?</mml:mi><mml:mo>=</mml:mo><mml:mrow><mml:mo stretchy="false">?</mml:mo><mml:mrow><mml:mrow><mml:mi>ε</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">?</mml:mo></mml:mrow><mml:mo>≤</mml:mo><mml:mrow><mml:mrow><mml:mi>ε</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow></mml:mrow></mml:math>?=?ε(k+1)?1?≤ε(k+1)?1, and then, we have <mml:math><mml:mrow><mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi mathvariant="normal">?</mml:mi></mml:mrow><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>≤</mml:mo><mml:mrow><mml:mrow><mml:mn>2</mml:mn><mml:mi>ε</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow><mml:mo>?</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo>≤</mml:mo><mml:mrow><mml:mn>2</mml:mn><mml:mi>ε</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>k</mml:mi><mml:mo>+</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mrow></mml:mrow></mml:math>2?+1≤2ε(k+1)?1≤2ε(k+1). <mml:math><mml:mi mathvariant="normal">□</mml:mi></mml:math>

## Acknowledgments

We thank John Nagle, Danny Sleator, and Dan Zuckerman for helpful conversations. M.C. was supported, in part, by NIH Grants 1R03MH10900901A1 and U545U54HG00854003. A.F. was supported, in part, by National Science Foundation (NSF) Grants DMS1362785 and CCF1522984 and Simons Foundation Grant 333329. W.P. was supported, in part, by NSF Grant DMS-1363136 and the Sloan Foundation.

## Footnotes

• ?1To whom correspondence should be addressed. Email: wes{at}math.cmu.edu.

## References

1. ?
.
2. ?
.
3. ?
.
4. ?
.
5. ?
.
6. ?
.
7. ?
.
8. ?
.
9. ?
.
10. ?
.
11. ?
.
12. ?
.
13. ?
.
14. ?
.
15. ?
.
16. ?
.
17. ?
.
18. ?
.
19. ?
.
20. ?
.
21. ?
.
22. ?
.
23. ?
.
24. ?
.

#### Online Impact

• 595505886 2018-01-24
• 26570885 2018-01-24
• 724358884 2018-01-24
• 649756883 2018-01-24
• 550841882 2018-01-24
• 436793881 2018-01-24
• 99132880 2018-01-23
• 802899879 2018-01-23
• 295573878 2018-01-23
• 352668877 2018-01-23
• 984633876 2018-01-23
• 545928875 2018-01-23
• 976569874 2018-01-23
• 871324873 2018-01-23
• 263462872 2018-01-23
• 577161871 2018-01-23
• 255603870 2018-01-23
• 117346869 2018-01-23
• 90982868 2018-01-23
• 663415867 2018-01-23