Survey experiments: know your estimand!
1 Overview
Survey experiments can be used for either causal inference (estimating treatment effects) or descriptive inference (measuring properties or preferences). But it can sometimes be unclear which of these two worthy goals a given experiment is pursuing. The confusion likely arises in part because very similar designs can be used for either purpose, and in part because survey experimentalists often describe estimands as effects of one kind or another, even when the ultimate inferential target is not an effect.
The distinction matters because the two goals have different implications for how you should set things up and how you should interpret results. If you are using a survey experiment for descriptive inference, there might be simpler and less noisy strategies available. If you are using it for causal inference, you need to be clear on what is being manipulated: it might not be what you think is being manipulated.
Table 1 summarizes different uses for different types of design, highlighting:
- how the same design might have distinct purposes
- traps to avoid for those drawn to each type of survey experiment
| Survey Experiment Type | Causal Inference Use Case | Descriptive Inference Use Case | Traps |
|---|---|---|---|
| Priming experiments | Estimate effect of prime on behavior/attitudes (typical) | Use prime as diagnostic to infer knowledge/beliefs (rarer) | Confusing the effect of the prime with the effect of the thing being primed. For example thinking you are finding the effects of exposure to violence by reminding people about past exposure. |
| List experiments | Estimate effect of list length or content on response patterns (rare) | Infer prevalence of sensitive beliefs/behaviors (typical) | Using an experiment for a descriptive quantity might mean accepting too much error in order to reduce bias. |
| Conjoints | Estimate effect of feature on choices, given a distribution of other fixed features (rare?) | Make inferences about preferences, classification rules, or ideal points (typical?) | Confusing the effects of a controlled change in question wording with the effects of intervening on the thing itself. For example thinking you are finding the effects of regime type on willingness to go to war or a candidate’s gender on their vote share. |
I use question marks in the last row because I am unsure what some of these designs are trying to do (see examples below).
So the evergreen advice (Lundberg, Johnson, and Stewart 2021) to “know your estimand!” seems especially important when using a survey experiment.
The rest of this note unpacks these ideas, many of which are also developed in Blair, Coppock, and Humphreys (2023) (see, for example, the discussion of the Conjoint design).
The next section clarifies concepts; the section after walks through different survey experiment types; the final section examines the difficulty of moving from estimated effects in a conjoint survey experiment to (expected) effects of actual attributes on actual vote shares in an election, even when the experimental design closely mimics the features of target elections.
2 Concepts
Let’s first get clear on the distinction between causal estimands and descriptive estimands and between measurement and inference. (For a bit more, see Ch 14 in Blair, Coppock, and Humphreys (2023).)
2.1 Inference or measurement?
There’s a useful distinction often made between measurement and inference. Measurement is about directly observing a quantity that exists in the world; inference is about estimating a quantity that is at least partly unobserved. So you measure your pulse to make inferences about the state of your heart.
You can have causal inference or descriptive inference, so the measurement / inference distinction is not itself about causality. In the same way, identification — roughly whether you can nail the quantity of interest if you have enough data — is a problem for inference, but it is not a concern unique to experiments [slides]. You can have identification problems for causal estimands or descriptive estimands. And of course it bears repeating: even if a quantity is not identified, you can still learn about it!
Recognizing that you are doing inference rather than measurement in turn helps clarify the need for estimates of uncertainty. If you have data from a sample and you are interested in the sample average, and your quantity is measurable, then just measure; no need for standard errors or similar. If you have sample data and you are interested in the population average, and your quantity is measurable, then do inference, and also report your standard errors.
2.2 Causal or descriptive estimands?
A key difference between causal and descriptive estimands is that we generally think of descriptive estimands as, in principle, measurable: they exist, though they may be very hard to measure. Causal estimands, however, involve counterfactual quantities and cannot be measured, even in principle.
This builds on a key idea from the counterfactual model of causation: the causal effect is the difference between two ‘potential outcomes’ [slides]; that is, between two things that don’t in fact both exist, things that “could have” happened. When we talk about description, however, we are usually talking about describing properties that we think things actually have, like knowledge, beliefs, and values.
Why does this matter? Because if you are interested in causal estimands, you have to do inference; if you are interested in descriptive estimands, you may or may not have to. You may be able to simply measure. That matters because if your interest is description, maybe you can get away without doing inference. Worth checking. Maybe you can ask everyone in your sample whether they like coffee and also whether they like tea. You don’t have to randomly ask half about coffee and half about tea and infer the values for the full sample from the ‘effect’ of the question on the answer! You may find that running the exercise as an experiment rather than as measurement adds no value, or that the possible gains on some fronts, such as reduced reporting bias, don’t make up for the cost of having to do inference.
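A minimal sketch of this point, using simulated (hypothetical) data: asking everyone yields the sample average directly, while the split-sample “experimental” version forces you to do inference and carry sampling error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
likes_coffee = rng.random(n) < 0.6  # hypothetical answers everyone would give

# Measurement: ask everyone; the sample average is observed exactly.
sample_mean = likes_coffee.mean()

# Inference: ask a random half and infer the full-sample rate;
# the estimate now comes with sampling error.
half = rng.choice(n, size=n // 2, replace=False)
estimate = likes_coffee[half].mean()
se = np.sqrt(estimate * (1 - estimate) / len(half))

print(f"measured sample mean:  {sample_mean:.3f}")
print(f"split-sample estimate: {estimate:.3f} (se {se:.3f})")
```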
The distinction between descriptive and causal estimands is not always so sharp though. You might question whether all the kinds of properties we might want to describe (preferences, loyalties, and so on) really exist and can be described in this sense, even in principle. And you might think that some seemingly describable features are themselves causal quantities in disguise. For example, you might think of preferences as a summary of the effects of options on choices. So you might think of a quantity such as “being a racist” both as a property that someone has and as a summary of how they react as a function of features of the people they encounter.
I think two arguments can often tip towards a property rather than an effect interpretation. First, in many experiments we are not actually observing effects other than changes in responses: choices are not actually made and utility is not actually realized. Rather, individuals make statements about hypotheticals, and so are just giving answers that respond to questions. This is in contrast to audit experiments, which can otherwise look like conjoints, where a subject may indeed believe an applicant is real and make a decision (in which case the effect of beliefs on behavior is clearly a causal effect). Second, in many experiments there is no distinction between the unit and the bundle of treatment conditions: the treatments (features) in a conjoint are constitutive of the unit rather than acting upon it. Thus by “changing” a property you are asking for evaluations of two different units (e.g. evaluation of a corrupt politician versus a clean politician) and not altering a property of a pre-defined unit (e.g. informing subjects that Jack is in fact corrupt).
Regardless, which way you think about the estimand can have implications for your design.1
Let’s now use these ideas to think through different uses of survey experiments.
3 Survey experiments
The term ‘survey experiment’ is used to cover a large class of experiments. Some are much like any experiment, aiming to estimate a causal effect of a treatment by manipulating that treatment; others use a manipulation, often of survey wording or procedures, to make it possible to measure – or at least make inferences about – a descriptive estimand. See e.g. Cyrus Samii’s post which makes this distinction, or discussions in Blair, Coppock, and Humphreys (2023).
Sometimes people use the term “survey experiment” specifically for experiments in surveys that use changes in wording or survey protocols to aid descriptive inference, and otherwise say something clunkier like “an experiment embedded in a survey” or “an experiment delivered through a survey.”
In practice though, it is not always obvious whether an experiment is conducted to aid descriptive inference or causal inference.
To fix ideas Table 2 describes two ideal cases.
| Example of a survey experiment for causal inference | Example of a survey experiment for descriptive inference |
|---|---|
| Information experiments are typically used for causal inference, not descriptive inference, whether or not they are delivered through a survey. In some cases survey-delivered information experiments are almost indistinguishable from field experiments — for instance if information is delivered in a way similar to treatments of interest and if outcomes are measured outside of the survey, through measures of subsequent behaviors. The key difficulty with embedding an information experiment in a survey is with respect to external validity—whether the effects of information delivered in this way are similar to effects of information delivered in the wild, and so lots of good work in this vein tries to address that head-on. | Randomized response surveys, in which people are randomly assigned to answer either a sensitive question or a non-sensitive question are typically used for descriptive inference (Blair, Imai, and Zhou 2015). The goal is to estimate the prevalence of some property of subjects, such as whether people have engaged in illegal behavior. The randomization makes it possible to make inferences about the prevalence of the sensitive behavior while protecting individual privacy. Here the randomization is a tool to make measurement possible, not the focus of interest itself. There is a causal effect of the procedure on the answer, but the purpose is to make descriptive inferences about something else. |
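As a minimal sketch of the logic in the right-hand column, assuming the unrelated-question variant in which the prevalence of “yes” answers to the innocuous question is known by design (all numbers here are hypothetical): the known randomization probabilities let us back out the prevalence of the sensitive trait from aggregate answers alone.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
p = 0.7      # chance of drawing the sensitive question (known by design)
q = 0.5      # known rate of "yes" on the innocuous question
truth = 0.2  # true (unknown) prevalence of the sensitive trait

# Each respondent privately answers whichever question they drew.
gets_sensitive = rng.random(n) < p
sensitive_answer = rng.random(n) < truth
innocuous_answer = rng.random(n) < q
answer = np.where(gets_sensitive, sensitive_answer, innocuous_answer)

# Inference: E[answer] = p * prevalence + (1 - p) * q; solve for prevalence.
estimate = (answer.mean() - (1 - p) * q) / p
print(f"estimated prevalence: {estimate:.3f} (truth {truth})")
```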
I think these two cases show a sharp difference between the two goals. The purpose is not always so clear, however. The next sections illustrate different purposes for common types of survey experiment.
3.1 Priming experiments
Priming experiments can be used for making inferences about both descriptive and causal estimands.
3.1.1 A priming experiment conducted for descriptive inference
Say I am interested in whether you know (\(K\)) that a weapon was used in a crime. Your knowledge is something I think you either have or do not have, and I want to know about it. I would love to just measure that knowledge directly, but it is hard.
So I show you a picture (\(X\)) of the weapon and I measure your reaction (\(Y\)). I make inferences about the effect of the prime on your reaction (\(X\) on \(Y\)) in order to make inferences about your knowledge (\(K\)). The effect estimate is a diagnostic tool: I make a causal inference in order to do descriptive inference. But I am clear that my interest is in your knowledge, not in the effect of seeing a weapon on your stress levels.
One implication is that I would be unhappy with this study if I found no evidence for a causal effect when in fact \(K = 1\), or if I did find evidence when \(K = 0\), for the simple reason that my interest in the causal effect is just instrumental here.
3.1.2 A priming experiment conducted for causal inference
But I might well be interested in a priming experiment specifically to make causal inferences. Say I am interested in whether being reminded of corruption by a politician makes you more likely to support the opposing party. I am interested in this because I think politicians or the media do this before elections and I want to understand these effects. If the focus is on the effect of the prime itself, this is a standard causal estimand inferred using an experiment that may or may not happen to be delivered via a survey.
That makes lots of sense in principle. In practice, I think people can sometimes trip up and mix up the effect of the prime (e.g. of being reminded that there is corruption) with the effect of the thing being primed (e.g. the effect of corruption itself), or not be clear on whether in fact new information is being provided.
3.2 Conjoints
De la Cuesta, Egami, and Imai (2022) describe conjoints as “a factorial survey experiment that is designed to measure multidimensional preferences”. Note the emphasis on measurement. In a similar way, Bansak et al. (2023) describe the (AMCE) estimand as a “summary of voters’ multidimensional preferences” (emphasis added). Arguably, the remit of conjoints for descriptive inference is a little broader: they might also be used to study how people make classifications or understand concepts. But conjoints might sometimes also be used when the estimand really is causal.
3.2.1 Conjoints for descriptive inference
In the many cases in which the goal is to measure preferences, interpretations, or classification rules, conjoint experiments may be best thought of as focused on descriptive inference and using causal inference to make those descriptive inferences.
For example, in Hartmann et al. (2024), we use a conjoint to measure policy preferences, combining the conjoint results with a choice model to estimate ideal points. Although we use the language of effects throughout, we are ultimately trying to measure something, resorting to the conjoint to make inferences about it.
Another example: say a bank uses a rule to decide whether or not to give loans. You want to figure out the rule. You do so using a conjoint to assess which profiles are more likely to get loans given different attributes. The estimand of interest is not a set of causal effects; it is a rule. But you try to figure it out by seeing whether notional features “affect” the classification. By analogy, when you observe stated preferences for different profiles, you can use these to figure out the underlying function (the rule) that evaluates profiles; you are not trying to figure out preferences over those particular profiles themselves.
Two implications follow from recognizing that the goal here is in fact descriptive inference:
**Opportunity.** You might find that a more effective strategy would be to figure out the rule from archival sources, such as regulations or instructions to staff. Maybe it is measurable, in which case measure it.
**Risk.** You might fall into the trap of thinking that the relation between feature values and outcomes corresponds to the causal effects of changing the features (or confuse the direct/controlled effect within the experimental regime with the average effect). This is a little trickier, but to think through a simple example: say in truth we have \(X_1 \rightarrow X_2 \rightarrow Y\), so that \(X_1\) affects \(Y\) via \(X_2\) but not conditional on \(X_2\). Then a conjoint might pick up that \(X_1\) is not part of the classification rule for \(Y\) and \(X_2\) is. But it would be wrong to infer from this that actually changing \(X_1\) will not affect classifications (it might, via changes in \(X_2\)). The problem here is confusing “how the rule determines outcomes given features” with “the effect of changing features, given the rule.” A sketch of this case follows.
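A minimal illustration of this risk, with hypothetical variable names (the rule here, like the bank’s, depends on \(X_2\) only):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

def rule(x1, x2):
    # The rule actually in use: outcomes depend on x2 only.
    return x2

# Conjoint: attributes randomized independently, so we recover the
# controlled effect of x1 holding x2 fixed (here, zero).
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
y = rule(x1, x2)
controlled_effect = y[x1 == 1].mean() - y[x1 == 0].mean()

# In the world, suppose x1 causes x2 (so X1 -> X2 -> Y), with x2 = x1
# for all units. Then intervening on x1 changes outcomes via x2.
total_effect = rule(1, 1) - rule(0, 0)

print(f"controlled effect of x1 (conjoint): {controlled_effect:.2f}")  # ~0
print(f"total effect of intervening on x1: {total_effect}")            # 1
```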
I think when Schwarz and Coppock (2022) talk about learning about discrimination, they are focused on uncovering preferences in this way; but the language describing “the average effect of being a woman” (emphasis added) could be misread to suggest an interest in the effect of the attribute itself, that is, an effect of an intervention on a candidate.
3.2.2 Conjoints for causal inference
Still, conjoints can also be used when the primary target is a causal estimand. Say you really are interested in whether the presence of a given feature on a list of features makes it more likely that a profile will be selected from the list.
You might have an application where people are electing candidates and know nothing about them other than what they see in a flyer. You might have a pool of potential candidates with a set of features from which you would draw a pair of candidates. You want to know how the presence of a given feature on the flyer affects the choice, conditional on all other features, averaged over their distribution. Your interest is then in the effect of the features on choices, not the underlying preferences, and you are pretty close to the conjoint. You have to worry about external validity, but such worries are common to any experiment.
Note that one bonus of your interest being in the choice rather than in preferences is that you might not be concerned if you found that people didn’t take the exercise too seriously, or didn’t read options carefully, as that is just part of what creates the mapping from features to choices.
I think this is close to the sort of setting Bansak et al. (2023) have in mind (though note that this means interpreting their language of “the effect of a change in an attribute on a candidate’s or party’s expected vote share” as meaning, as they clarify elsewhere, the effect of a listed feature within a controlled list of features, and not the effect of an intervention on a single feature of a candidate while allowing, or preventing, other endogenous changes).
The risk above remains, however: the effect you are getting is the effect of the attribute on the list, not the average (total) effect of the attribute itself on the outcomes. For example you might find that a powerful candidate does well given different values of corruption (even for different distributions of corruption), but this does not give you the effect of power itself, since, after all, power corrupts. You might of course really be interested in effects like this: the causal effect of the candidate’s attribute itself, in the sense of imagining the effect of an intervention on a candidate (e.g. the effect of the candidate’s wealth on their performance), rather than on the listing of a particular attribute value in a list of attributes. That’s a quantity that the conjoint would struggle to identify (see below).
3.3 List experiments
List experiments might also be done for either reason, but the typical use is for descriptive inference.
3.3.1 A list experiment conducted for descriptive inference
You are interested in whether people think there is corruption (\(K\)) or not. In principle this is measurable, but it is hard to measure. You vary whether respondents see a long list or a short list (\(X\)) and infer from the effects on the reported counts (\(Y\)) whether people think there is corruption or not. You are primarily interested in \(K\); there is no independent interest here in how list length affects answers, except for its role in descriptive inference.
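A minimal sketch of this logic, with simulated data and assuming (as in standard treatments) no design effects and no liars: the difference in mean counts between the long-list and short-list groups recovers the prevalence of the sensitive item.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000
prevalence = 0.3  # true (unknown) prevalence of the sensitive item

# Counts over three innocuous items; the treated half also counts
# the sensitive item.
innocuous = rng.integers(0, 2, (n, 3)).sum(axis=1)
holds_item = rng.random(n) < prevalence
long_list = rng.random(n) < 0.5  # random assignment to the long list

count = innocuous + np.where(long_list, holds_item, 0)

# The 'effect' of list length on counts identifies the descriptive quantity.
estimate = count[long_list].mean() - count[~long_list].mean()
print(f"estimated prevalence: {estimate:.3f} (truth {prevalence})")
```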
3.3.2 A list experiment conducted for causal inference
I think this is not so common, but you could imagine being interested in the effect of a long versus short list on whether people exhibit social desirability bias. Here you are interested in the effect of the list length itself, or of the mention of the sensitive item itself. Blair and Imai (2012), when describing conditions for valid inference of the descriptive estimand, describe a “no design effects” condition that rules out various causal effects. One could in principle be interested in just these effects (and estimate them well if one independently had knowledge of the descriptive estimand)!
There is a good literature comparing experimental and direct approaches to asking sensitive questions. The fact that the estimand is the same in both cases highlights that the focus is typically descriptive. The gain from using an experiment is (hopefully) the unbiasedness that comes from giving subjects the plausible deniability they require. But the fact that it is an experiment itself implies a cost: you get error from the need to do inference (as well as from complexity; Kuhn and Vivyan (2022)) and so need to determine whether the added error is worth it.
4 Mapping from survey experiment estimates to real-world estimands
Imagine a setting in which we have an actual election with a set of candidates who are defined by a finite set of attributes, and we also have a conjoint experiment that closely parallels the setting. In particular the same set of attributes is randomized and presented to the same voters, and voters react to the attributes in the survey in the same way as they do to the attributes of candidates in the election.
The question is whether the AMCE of an attribute (for example, gender) in the survey corresponds to the ATE of that attribute in the election. (To avoid confusion, Bansak et al. (2023) are interested in a setting with the “election matching the specifications of the conjoint” but they do not have an external estimand in mind – that is, they are not attempting to map to the effect of a real candidate’s actual attributes on actual vote shares.)
This discussion provides a negative answer to the question even under very strong assumptions.
The key idea is that these estimands diverge because the AMCE does not depend on causal relations between attributes, but the ATE does. The issue is not about the joint distribution of attributes, but about the underlying causal structure relating attributes to one another and to voter utilities.
4.1 Causal structure
Imagine two 3-node worlds, with all nodes binary:
Causal World 1: \(X_1 \rightarrow X_2 \rightarrow U\)

Causal World 2: \(X_1 \leftarrow X_2 \rightarrow U\)
These worlds describe both the observed and counterfactual distributions of candidate attributes \((X_1, X_2)\) and voter utilities \(U\).
4.2 Attributes
These two worlds can have exactly the same observed joint distribution of variables (in particular, of attributes \(X_1, X_2\)). For example, say in both worlds the correlation structure is:
\[ \Pr(X_1=1) = \Pr(X_2=1) = 0.5 \]
\[ \Pr(X_1=1 \mid X_2=1) = \Pr(X_1=0 \mid X_2=0) = 0.8 \]
Together these imply the joint distribution \(\Pr(0,0) = \Pr(1,1) = 0.4\) and \(\Pr(0,1) = \Pr(1,0) = 0.1\), which generates the profile weights used in the calculations below.
However, in World 1, intervening on \(X_1\) induces changes in \(X_2\); in World 2, it does not.
(For contrast: in a conjoint experiment, when we alter \(X_1\) in a profile description, we do not induce changes in \(X_2\), even if \(X_1\) and \(X_2\) are correlated in the population.)
4.3 Voters
Imagine that everyone has very simple preferences across candidates:
\[ U(X_1, X_2) = X_2. \]
Voters ignore \(X_1\) entirely. Given two sets of attributes \((X_1, X_2)\) and \((X_1', X_2')\), the choice
\[ Y\big((X_1, X_2), (X_1', X_2')\big) \]
is determined solely by which profile has the larger value of \(X_2\). Ties are broken at random.
4.4 Parallelism assumptions
We will want to compare the AMCE and the ATE.
- The AMCE captures a particular weighting of the effects of attribute values, as presented in a profile description, on expressed preferences, holding all other attributes fixed. The weighting depends on a stipulated joint distribution of attributes.
- The ATE of interest is the average effect in an election of changing an attribute on the share of votes received by a candidate, including all downstream causal effects of that change.
The question is whether these two quantities coincide. I give more formal definitions below, but to make things simple I will make the following (very strong) assumptions:
- There is a finite and known set of candidate attributes capturing everything relevant about candidates in an election. More specifically there are two candidates with two binary attributes each, both observed (this helps clarify that the issue is not the same as the important concern raised elsewhere by Scott Abramson and Morgan Gillespie).
- The distribution of possible candidate attributes in the election is known and can be recreated in the experiment.
- Survey respondents are fully representative of voters.
- Voters behave identically in the survey and in the election, responding only to attributes. In particular, they always make a choice between candidates (regardless of the candidates’ attributes).
- The ATE is defined with respect to a distribution of possible rival candidates, not a particular realized rival (this is to move the estimand closer to an AMCE-type estimand, but it might be already far from what researchers think they are shooting at).
- Intervening on one candidate’s attributes does not affect the attributes of the rival candidate (the problems raised here would be more significant absent this assumption).
- Intervening on a candidate’s attributes does not affect whether they appear in an election.
Thus, we imagine a conjoint that mimics the election extremely well, except that attributes are randomized in the conjoint experiment but arise endogenously in the election.
For simplicity I will also assume:
- All voters have identical preferences (this simplifies the estimand and makes it clearer what variation is doing work).
- We know not only the joint distribution of attributes and the average effects of all attributes on each other, but also the underlying response functions giving rise to these average effects (this simplifies calculation of the ATE):

World 1:

- For 0.6 of units, \(X_2(X_1) = X_1\).
- For 0.2, \(X_2 = 0\) regardless of \(X_1\).
- For 0.2, \(X_2 = 1\) regardless of \(X_1\).

World 2:

- For 0.6 of units, \(X_1(X_2) = X_2\).
- For 0.2, \(X_1 = 0\) regardless of \(X_2\).
- For 0.2, \(X_1 = 1\) regardless of \(X_2\).

In World 2, \(X_2\) is not affected by changes in \(X_1\).
4.5 Formalizing the estimands
Focusing on \(X_1\), define the AMCE as
\[ \text{AMCE}_{X_1} = \sum_{x_2, x_1', x_2'} \Big[ Y\big((1, x_2), (x_1', x_2')\big) - Y\big((0, x_2), (x_1', x_2')\big) \Big] \, p(x_2, x_1', x_2'). \]
where \(p(x_2, x_1', x_2')\) is the joint distribution of the attributes of the candidate of interest (\(A\)) and the rival candidate (\(B\)), other than \(A\)’s \(x_1\).
The ATE is
\[ \text{ATE}_{X_1} = \mathbb{E}\left[ \sum_{x_1', x_2'} \Big[ Y\big((1, x_2(1)), (x_1', x_2')\big) - Y\big((0, x_2(0)), (x_1', x_2')\big) \Big] \, p(x_1', x_2')\right]. \]
where the expectation is taken over responses of \(X_2\) to \(X_1\)—that is, across candidates, not across voters. In the case of a single candidate of interest there is no need for this expectation.
There is no expectation across voters in either expression because I have simplified things by assuming all voters are identical.
4.6 Calculations
4.6.1 Calculation of the AMCE for \(X_1\)
We imagine identical voters, two candidates (A and B), two attributes, and two levels per attribute. The AMCE is computed by comparing changes in \(X_1\) for candidate A, holding \(X_2\) fixed, across all possible profiles of candidate B.
Table A1 reports the relevant comparisons and differences (assuming a \(0.5\) probability of winning under indifference).
Table A1: AMCE comparisons for \(X_1\)
| \(X_2\) (A) | B | \(p\) | Comparison | Difference |
|---|---|---|---|---|
| 0 | (0,0) | 0.20 | \(\Pr(U(1,0) > U(0,0)) - \Pr(U(0,0) > U(0,0))\) | \(0.5 - 0.5 = 0\) |
| 0 | (0,1) | 0.05 | \(\Pr(U(1,0) > U(0,1)) - \Pr(U(0,0) > U(0,1))\) | \(0 - 0 = 0\) |
| 0 | (1,0) | 0.05 | \(\Pr(U(1,0) > U(1,0)) - \Pr(U(0,0) > U(1,0))\) | \(0.5 - 0.5 = 0\) |
| 0 | (1,1) | 0.20 | \(\Pr(U(1,0) > U(1,1)) - \Pr(U(0,0) > U(1,1))\) | \(0 - 0 = 0\) |
| 1 | (0,0) | 0.20 | \(\Pr(U(1,1) > U(0,0)) - \Pr(U(0,1) > U(0,0))\) | \(0 - 0 = 0\) |
| 1 | (0,1) | 0.05 | \(\Pr(U(1,1) > U(0,1)) - \Pr(U(0,1) > U(0,1))\) | \(0.5 - 0.5 = 0\) |
| 1 | (1,0) | 0.05 | \(\Pr(U(1,1) > U(1,0)) - \Pr(U(0,1) > U(1,0))\) | \(0 - 0 = 0\) |
| 1 | (1,1) | 0.20 | \(\Pr(U(1,1) > U(1,1)) - \Pr(U(0,1) > U(1,1))\) | \(0.5 - 0.5 = 0\) |
As Table A1 shows, all comparisons yield zero, so
\[ \text{AMCE}_{X_1} = 0. \] The \(p\) weights don’t matter for this calculation.
A similar calculation (not shown) yields
\[ \text{AMCE}_{X_2} = 0.5. \]
This arises because, depending on the other candidate’s attributes, a change in \(X_2\) either takes you from a loss to a toss-up or from a toss-up to a win.
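As a check, a minimal sketch that enumerates these comparisons directly (function names are my own; the weights follow the joint distribution given above):

```python
from itertools import product

def U(x1, x2):
    # Everyone's utility: only x2 matters.
    return x2

def win(a, b):
    # Probability profile a is chosen over profile b; ties broken at random.
    ua, ub = U(*a), U(*b)
    return 1.0 if ua > ub else 0.5 if ua == ub else 0.0

# Joint distribution of (X1, X2) implied by the text.
p = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

def amce(attr):
    # Average effect of switching A's attribute 0 -> 1, holding the other
    # attribute fixed, weighted over A's other attribute (marginal 0.5)
    # and over B's profile (joint p).
    total = 0.0
    for other, (b, w_b) in product((0, 1), p.items()):
        hi = (1, other) if attr == 1 else (other, 1)
        lo = (0, other) if attr == 1 else (other, 0)
        total += 0.5 * w_b * (win(hi, b) - win(lo, b))
    return total

print(f"AMCE_X1 = {amce(1):.2f}")  # 0.00
print(f"AMCE_X2 = {amce(2):.2f}")  # 0.50
```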
4.6.2 ATE of \(X_1\): World 1
In World 1, changing \(X_1\) may induce changes in \(X_2\). Taking the distribution of profiles for candidate B as given, Table A2 reports the expected effect of changing \(X_1\) for candidate A.
Table A2: ATE comparisons for \(X_1\) in World 1
| B | \(p\) | Comparison | Difference |
|---|---|---|---|
| (0,0) | 0.4 | \(0.6\,[\Pr(U(1,1) > U(0,0)) - \Pr(U(0,0) > U(0,0))] + 0.2\,[\Pr(U(1,1) > U(0,0)) - \Pr(U(0,1) > U(0,0))] + 0.2\,[\Pr(U(1,0) > U(0,0)) - \Pr(U(0,0) > U(0,0))]\) | 0.3 |
| (0,1) | 0.1 | \(0.6\,[\Pr(U(1,1) > U(0,1)) - \Pr(U(0,0) > U(0,1))] + 0.2\,[\Pr(U(1,1) > U(0,1)) - \Pr(U(0,1) > U(0,1))] + 0.2\,[\Pr(U(1,0) > U(0,1)) - \Pr(U(0,0) > U(0,1))]\) | 0.3 |
| (1,0) | 0.1 | \(0.6\,[\Pr(U(1,1) > U(1,0)) - \Pr(U(0,0) > U(1,0))] + 0.2\,[\Pr(U(1,1) > U(1,0)) - \Pr(U(0,1) > U(1,0))] + 0.2\,[\Pr(U(1,0) > U(1,0)) - \Pr(U(0,0) > U(1,0))]\) | 0.3 |
| (1,1) | 0.4 | \(0.6\,[\Pr(U(1,1) > U(1,1)) - \Pr(U(0,0) > U(1,1))] + 0.2\,[\Pr(U(1,1) > U(1,1)) - \Pr(U(0,1) > U(1,1))] + 0.2\,[\Pr(U(1,0) > U(1,1)) - \Pr(U(0,0) > U(1,1))]\) | 0.3 |
So \(\text{ATE}_{X_1}^{\text{World 1}} = 0.3\), again independent of the weights. This is the effect of \(X_1\) on \(X_2\) (0.6) times the effect of \(X_2\) on the choice (0.5).
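A matching self-contained sketch of this calculation, with the response types and weights as stipulated above:

```python
from itertools import product

def U(x1, x2):
    # Everyone's utility: only x2 matters.
    return x2

def win(a, b):
    # Probability profile a is chosen over profile b; ties broken at random.
    ua, ub = U(*a), U(*b)
    return 1.0 if ua > ub else 0.5 if ua == ub else 0.0

# B's profile distribution (the correlated joint used throughout).
p_b = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# World 1 response types for A: (weight, x2 as a function of x1).
types = [(0.6, lambda x1: x1),  # X2 = X1
         (0.2, lambda x1: 0),   # X2 = 0 regardless
         (0.2, lambda x1: 1)]   # X2 = 1 regardless

ate = sum(
    w_t * w_b * (win((1, f(1)), b) - win((0, f(0)), b))
    for (w_t, f), (b, w_b) in product(types, p_b.items())
)
print(f"ATE_X1 (World 1) = {ate:.2f}")  # 0.30
```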
4.6.3 ATE of \(X_1\): World 2
In World 2, changing \(X_1\) does not affect \(X_2\). Table A3 reports the resulting comparisons.
Table A3: ATE comparisons for \(X_1\) in World 2
| B | \(p\) | Comparison | Difference |
|---|---|---|---|
| (0,0) | 0.4 | \(0.5\,[\Pr(U(1,1) > U(0,0)) - \Pr(U(0,1) > U(0,0))] + 0.5\,[\Pr(U(1,0) > U(0,0)) - \Pr(U(0,0) > U(0,0))]\) | 0 |
| (0,1) | 0.1 | \(0.5\,[\Pr(U(1,1) > U(0,1)) - \Pr(U(0,1) > U(0,1))] + 0.5\,[\Pr(U(1,0) > U(0,1)) - \Pr(U(0,0) > U(0,1))]\) | 0 |
| (1,0) | 0.1 | \(0.5\,[\Pr(U(1,1) > U(1,0)) - \Pr(U(0,1) > U(1,0))] + 0.5\,[\Pr(U(1,0) > U(1,0)) - \Pr(U(0,0) > U(1,0))]\) | 0 |
| (1,1) | 0.4 | \(0.5\,[\Pr(U(1,1) > U(1,1)) - \Pr(U(0,1) > U(1,1))] + 0.5\,[\Pr(U(1,0) > U(1,1)) - \Pr(U(0,0) > U(1,1))]\) | 0 |
Hence,
\[ \text{ATE}_{X_1}^{\text{World 2}} = 0. \]
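And the analogous sketch for World 2, where an intervention on \(X_1\) leaves \(X_2\) at its natural value (0 or 1 with probability 0.5 each):

```python
from itertools import product

def U(x1, x2):
    # Everyone's utility: only x2 matters.
    return x2

def win(a, b):
    # Probability profile a is chosen over profile b; ties broken at random.
    ua, ub = U(*a), U(*b)
    return 1.0 if ua > ub else 0.5 if ua == ub else 0.0

# B's profile distribution (the correlated joint used throughout).
p_b = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# World 2: x2 is unaffected by the intervention on x1.
ate = sum(
    0.5 * w_b * (win((1, x2), b) - win((0, x2), b))
    for x2, (b, w_b) in product((0, 1), p_b.items())
)
print(f"ATE_X1 (World 2) = {ate:.2f}")  # 0.00
```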
4.7 Interpretation
The AMCE coincides with the ATE in World 2 but not in World 1. The difference arises because, in World 1, changing one attribute causally affects another attribute that voters care about.
This disconnect arises even under very generous conditions that let us attempt the comparison.
There may be conditions under which the AMCEs correspond to the ATEs, but they don’t seem to be easy ones:
- all relevant attributes are mutually exogenous
- the attribute change of interest occurs after all other attributes are fixed
- interest is confined to an intervention on the ballot (keeping other candidates fixed) rather than a real-world intervention on candidates (and voters respond only to the ballot)
References
Footnotes
We discuss this a bit here. As an analogy, we might think of immunity to a disease as a property that someone has, and want to figure out how many have this immunity. We might even be able to measure features that indicate the property (e.g. sickle cell disease for immunity to malaria). In that case we might want to think of this as a descriptive exercise, and even measure the property in a person. But we might also think of immunity as fundamentally about causal relations: that is, we are really asking about how a person would behave in different conditions. A bit more formally one might imagine a world in which \(X \rightarrow Y \leftarrow A\) (all binary nodes) and functional equation \(f : Y = XA\). Then \(X\) causes \(Y\) if and only if \(A\) is present. One way of thinking of the problem is to learn about the causal relations captured by \(f\), the other is to learn about \(A\) — the value of a node that captures a property that implies a reaction given a background model.↩︎