Taking the con out of conjoints and other musings
2026-05-15
Dia dhuit! [dee ah gwitch]
Check out our new DeclareDesign Shiny app: https://shiny2.wzb.eu/ipi/declaredesign/
Check out our new CausalQueries Shiny app: https://shiny2.wzb.eu/ipi/process_tracing/
Working on a ReplicateEverything package
Survey experiments are now the most common research design in political science (Torreblanca et al. 2025).
The promise is compelling:
“side-stepping the endogeneity and collinearity concerns that threaten our ability to draw causal inferences using observational data”
— Kertzer, Renshon, and Yarhi-Milo (2021)
There is a lot to like. But there are also systematic risks of confusion about what survey experiments can and cannot do.
Risk 1 — Estimand confusion
Is the goal causal or descriptive?
→ Different goals need different designs
Risk 2 — Controls confusion
What do controls actually do? (Focus on the presence, not the value, of controls.) → They define estimands; they do not reduce bias or error
Risk 3 — Extrapolation confusion
Can we go from survey to world?
→ Only under very strong conditions (either stop making these claims or seek to establish them)
Evergreen advice:
Know your estimand!
— Lundberg, Johnson, and Stewart (2021)
This is a grumpy talk
I am sorry
I know that not everyone, and maybe not even most, makes these errors.
I am confused about this: when these errors are made, are we looking at errors in comprehension, or errors (or just norms?) in communication?
Causal or descriptive estimand — the distinction shapes everything
Descriptive estimand
A property something actually has:

- Knowledge, beliefs, preferences
- Measurable “in principle”
- You may or may not need an experiment
Example: How many voters prefer female candidates?
Causal estimand
A difference between potential outcomes:

- Counterfactual; cannot be measured directly, even in principle
- You must do inference
Example: Does a candidate being female cause vote losses? (we’ll return to this)
Why the confusion arises: very similar designs can serve both purposes, and researchers often describe descriptive estimands using causal language (“the effect of X on Y”).
Key insight
If your estimand is descriptive, maybe you don’t need an experiment. Worth checking — the experiment may add error without adding value.
Figure 1: Preferences and features combine to determine choices. Manipulating features lets us infer preferences.
But still a kind of formal equivalence:
Figure 2: Preferences and features combine to determine choices. Manipulating features lets us infer preferences.
Clarifies that \(\theta\) is a property in the Pearlean world, not an event or a difference in time-stamped potential outcomes as in Rubin.
Figure 3: Preferences and features combine to determine choices. Manipulating features lets us infer preferences.
Three arguments for treating preferences as properties (not causal effects):
You are interested in the average age of people in a population
You divide the subjects into two groups:
You estimate: \(\widehat{\overline{\text{Age}}} = \mathbb{E}_{i \in A}[Y_i(A)] - \mathbb{E}_{i \in B}[Y_i(B)]\)
You have a causal estimand and have done causal inference. This is valid: the estimate is unbiased thanks to the randomization.
You use Neyman standard errors, which are valid under randomization and needed because you have an incomplete schedule of potential outcomes.
Congratulations.
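To make the logic concrete, here is a minimal base-R sketch of this design. One detail is elided above, so I fill it with an assumption: group A reports their age and group B reports zero, so the difference in means targets the average age and Neyman standard errors apply.

```r
# Sketch: estimating average age "causally" via random assignment.
# Assumption (not spelled out above): group A reports age, group B reports 0.
set.seed(1)
N   <- 1000
age <- sample(18:80, N, replace = TRUE)    # true ages
Z   <- sample(rep(c("A", "B"), N / 2))     # random assignment to two groups
Y   <- ifelse(Z == "A", age, 0)            # observed outcomes

estimate <- mean(Y[Z == "A"]) - mean(Y[Z == "B"])

# Neyman (conservative) standard error for a difference in means
se <- sqrt(var(Y[Z == "A"]) / sum(Z == "A") + var(Y[Z == "B"]) / sum(Z == "B"))

c(truth = mean(age), estimate = estimate, se = se)
```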
Scenario A — for descriptive inference
A lawyer shows subjects a picture of a weapon and measures their stress response — to infer whether they already knew a weapon was used.
→ The causal inference is just a tool. The estimand is the subject’s prior knowledge \(K\).
Scenario B — for causal inference
A political scientist asks: does being reminded of corruption increase support for the opposition?
→ The estimand is the effect of the prime itself. Fully causal.
Risks with priming experiments
The classic trap: Confusing the effect of the prime (being reminded of violence) with the effect of the thing being primed (actual exposure to violence).
Claiming you have estimated the effect of past violence when you have only estimated the effect of being reminded of it.
Design: Vary question wording to assess whether form affects substance. The idea of an equivalence frame is that the same question is asked in different ways.
Descriptive use: Purge framing effects to recover “true” underlying preferences
(Goldin and Reck 2015)
Causal use: How does a loss vs. gain frame change how people think about a policy?
(Druckman 2001)
Contrast with conjoints: In a conjoint, different treatments ask different questions in the same way; framing experiments ask the same question in different ways.
“A factorial survey experiment designed to measure multidimensional preferences”
— De la Cuesta, Egami, and Imai (2022)
Conjoints have two genuinely distinct use cases — rarely recognized:
| | Causal Use | Descriptive Use |
|---|---|---|
| Goal | Effect of signal on stated choices | Measure preferences / classification rules / ideal points |
| Think of it as | Mimicking field experiment on choices | Elicitation of a preference function |
| Frequency | Rare? | Typical? |
| Trap | Confuse signal effect with attribute effect | Confuse classification rule with causal effect |
The dual confusion
Researchers often intend descriptive inference but describe results in causal language — or vice versa. The distinction matters for design, analysis, and interpretation.
Descriptive case 1: You ask: how would you react under each of these imagined situations?
Descriptive case 2: You want to learn about an algorithm: what rule does a bank use when deciding whether to give credit? What rule does an AI model use to assess the validity of a statement?
In both cases you might consider alternatives:
Classic candidate experiment: in a world where a given set of facts about attributes is available (and only these), how do choices depend on the elements of that set?
Hainmueller et al.’s external-validity study is a two-edged sword supporting this view: it might work! But the actual target applications are quite unusual.
Go in peace
Controls define estimands — they do not reduce confounding
Conjoint experiments are celebrated for the ability to control many features simultaneously.
Researchers describe this as addressing confounding:
“varying other features lets them distinguish the effect of democracy from potential confounders”
— Tomz and Weeks (2013)
“holding the military power of the target constant, we reduce the possibility of the respondents drawing inferences about the target’s level of military power from the democracy treatment, which is perhaps the most obvious potential confounder”
— Bell and Quek (2018)
What is wrong here?
Random assignment already addresses confounding. Controls in conjoints do something different — they change what you are estimating.
Three distinct purposes controls serve (but often confused):
The critical insight
Choosing which controls to include — not just their values — directly determines the estimand. This means:
A counterintuitive result?:
Adding controls at the intervention stage can increase variance in conjoint experiments.
Why? Controlling \(A_2\) at the intervention stage introduces variation that must then be removed in the analysis to recover the baseline effect. The net result can be an increase in variance.
Say: \(Y = A_1 \cdot A_2\) where \(A_2 \sim \text{Bernoulli}(0.5)\), and this is known to all subjects.
Then: if no information on \(A_2\) is provided, the variance of estimates of the effect of \(A_1\) is lower than when \(A_2\) is also controlled and randomized.
Note
This undermines the “precision argument” for adding controls in conjoints.
The standard rationale (“controls reduce variance”) does not straightforwardly apply in this context.
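The claim is easy to check by simulation. Below is a sketch under the setup above, assuming subjects report the expected quality given whatever information they are shown.

```r
# Y = A1 * A2 with A2 ~ Bernoulli(0.5), known to subjects. Compare the
# sampling variance of the estimated A1 effect when A2 is omitted
# (subjects integrate over it) versus controlled and randomized.
set.seed(2)
N    <- 200
reps <- 2000

estimate_once <- function(show_A2) {
  A1 <- rbinom(N, 1, 0.5)
  if (show_A2) {
    A2 <- rbinom(N, 1, 0.5)
    Y  <- A1 * A2            # A2 shown and randomized: subjects report A1 * A2
  } else {
    Y  <- A1 * 0.5           # no A2 info: subjects report E[A1 * A2 | A1]
  }
  mean(Y[A1 == 1]) - mean(Y[A1 == 0])   # difference-in-means estimate
}

c(var_no_A2_info    = var(replicate(reps, estimate_once(FALSE))),
  var_A2_controlled = var(replicate(reps, estimate_once(TRUE))))
# Both designs target an effect of 0.5; only the controlled design is noisy here.
```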
Focus not on the distribution of control levels but on their presence.
Setup: \(Y = A_2 \times A_3\), so \(A_1\) has no causal role in quality. Yet whether \(A_1\) “matters” for preferences depends entirely on what else is known.
Joint distribution of \((A_1, A_2, A_3)\):
| | \(A_1 = 0\) | \(A_1 = 1\) |
|---|---|---|
| \(A_2=0,\; A_3=0\) | \(4/16\) | \(0\) |
| \(A_2=0,\; A_3=1\) | \(4/16\) | \(0\) |
| \(A_2=1,\; A_3=0\) | \(1/16\) | \(2/16\) |
| \(A_2=1,\; A_3=1\) | \(3/16\) | \(2/16\) |
| Total | \(12/16\) | \(4/16\) |
Key feature: \(A_1=1\) is a strong signal that \(A_2=1\) (and thus informative about \(Y\), via \(A_2\)).
“When IE [information equivalence] is violated, the effect of the manipulation need not correspond to the quantity of interest — the effect of beliefs about the focal attribute”
— Dafoe, Zhang, and Caughey (2018)
Results: since \(Y = A_2 \times A_3\), we have \(\Pr(Y=1) = \Pr(A_2=1 \,\&\, A_3=1)\):
| Controls | \(\Pr(Y{=}1\mid A_1{=}1)\) | \(\Pr(Y{=}1\mid A_1{=}0)\) | Effect |
|---|---|---|---|
| None | \(1/2\) | \(1/4\) | +0.25 |
| \(A_2\) controlled | \(1/2\) | \(3/4\) | −0.25 |
| \(A_2\) and \(A_3\) controlled | \(1\) | \(1\) | 0 |
Warning
All three estimates are correct for their own estimand. None reveals that \(A_1\) is causally irrelevant.
The sign and existence of an apparent preference for \(A_1\) depends entirely on which controls are included — not on \(A_1\)’s role in the world.
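These numbers can be verified directly by encoding the joint distribution above, reading “controlled” as the control being shown at value 1 (my reading of the table):

```r
# Encode the joint distribution of (A1, A2, A3) from the table above.
joint <- expand.grid(A1 = 0:1, A2 = 0:1, A3 = 0:1)
joint$p <- c(4, 0, 1, 2, 4, 0, 3, 2) / 16   # cell probabilities from the table
joint$Y <- joint$A2 * joint$A3              # Y = A2 * A3

# Pr(Y = 1) within the subset of profiles satisfying `keep`
pY <- function(keep) sum(joint$p[keep] * joint$Y[keep]) / sum(joint$p[keep])

c(pY(joint$A1 == 1), pY(joint$A1 == 0))                  # no controls: 1/2, 1/4
c(pY(joint$A1 == 1 & joint$A2 == 1),
  pY(joint$A1 == 0 & joint$A2 == 1))                     # A2 controlled: 1/2, 3/4
c(pY(joint$A1 == 1 & joint$A2 == 1 & joint$A3 == 1),
  pY(joint$A1 == 0 & joint$A2 == 1 & joint$A3 == 1))     # A2, A3 controlled: 1, 1
```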
Privilege, wealth, and ability each contribute directly to quality.
Figure 4: Left: respondent’s mental model. Right: what the experimenter can observe when only signals for privilege and wealth are provided — ability is unobserved.
In this world, the experiment cleanly recovers the mental model: signals map directly to beliefs about the attributes they describe.
Wealth is produced by privilege and ability — so signalling privilege and wealth is informative about ability.
Figure 5: Left: respondent’s mental model — privilege and ability jointly produce wealth; ability drives quality. Right: what signals about privilege and wealth reveal about beliefs.
Adding the wealth control creates an apparent effect of privilege via collider bias: conditioning on wealth renders privilege informative about quality.
In Mental Model 2 (right column): subjects believe only ability drives quality, and that wealth is produced by privilege and ability. Privilege has no direct effect on quality.
But when both privilege and wealth are controlled, the implied “preferences” are identical in both worlds:
| Control set | World 1 | World 2 |
|---|---|---|
| \(P\) and \(W\) | \(Y = \tfrac{1}{3} - \tfrac{1}{3}P + \tfrac{2}{3}W\) | \(Y = \tfrac{1}{3} - \tfrac{1}{3}P + \tfrac{2}{3}W\) |
| \(P\) only | \(Y = \tfrac{2}{3} - \tfrac{1}{3}P\) | \(Y = 0.25\) |
| \(W\) only | \(Y = \tfrac{1}{6} + \tfrac{2}{3}W\) | \(Y = \tfrac{1}{8} + \tfrac{1}{2}W\) |
You cannot recover respondents’ mental models from conjoint data
The same pattern of responses can arise from very different underlying causal beliefs.
Showing that a signal increases an evaluation does not show that the subject believes the attribute causes quality.
Setup: Researchers randomize a candidate’s race and control for “criminality,” claiming this isolates taste-based discrimination:
“controlling for other features lets them assess taste-based discrimination”
— Ono and Burden (2019); Boittin, Fisher, and Mo (2024); Olinger et al. (2024)
Or: Controlling for content of speech, do observers assess actions differently for Muslim and Christian speakers?
Two risks:
Risk 1 — Controlled ≠ Natural direct effect
Fixing downstream beliefs exogenously estimates possible discrimination, not actual discrimination. If an employer never in fact encounters low-skill candidates, the controlled effect can be non-zero while the natural direct effect is zero.
Risk 2 — Over-control and thinning
If discrimination works through beliefs about skills, controlling skills removes the channel. You can make any direct effect disappear:
In a line of dominoes each with unit effects, only the second-to-last domino has a non-zero controlled direct effect.
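A tiny sketch of the domino logic, with a hypothetical chain of five dominoes feeding an outcome:

```r
# Chain X1 -> X2 -> ... -> X5 -> Y with unit effects: Y equals the state of
# the last domino. The controlled direct effect (CDE) of domino j (toggling
# it while holding every other domino fixed at 1) is zero except for the
# domino immediately before the outcome.
k <- 5
outcome <- function(x) x[k]

cde <- function(j, others = rep(1, k)) {
  x1 <- others; x1[j] <- 1
  x0 <- others; x0[j] <- 0
  outcome(x1) - outcome(x0)
}

sapply(seq_len(k), cde)   # 0 0 0 0 1: only the last domino in the chain matters
```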
The prescription
Claims like “Americans do not select doctors based on race” should specify exactly which downstream features were controlled — the conclusion depends on this.
In a conjoint, there is a causal hierarchy from features to outcomes:
\[\underbrace{A_1, A_2}_{\text{Attributes}} \;\rightarrow\; \underbrace{I_1, I_2}_{\text{Information provided}} \;\rightarrow\; \underbrace{B_1, B_2}_{\text{Beliefs formed}} \;\rightarrow\; \underbrace{Y}_{\text{Evaluation}}\]
Figure 6
The experiment randomizes information (\(I\)). Subjects are not told attributes are random. They may infer \(A_2\) from \(I_1\) — just as in the real world.
| | Estimand | Definition | Identified? |
|---|---|---|---|
| 1 | Attribute effect | \(Y(A_1=1) - Y(A_1=0)\) | No |
| 2 | Information effect | \(Y(I_1=1) - Y(I_1=0)\) | **Yes** ✓ |
| 3 | Belief effect | \(Y(B_1=1) - Y(B_1=0)\) | No |
| 4 | Conditional info effect | \(Y(I_1=1, I_2) - Y(I_1=0, I_2)\) | **Yes** ✓ |
| 5 | Conditional belief effect | \(Y(B_1=1, I_2) - Y(B_1=0, I_2)\) | No |
| 6 | Controlled info effect | \(Y(I_1=1, B_2) - Y(I_1=0, B_2)\) | No |
| 7 | Controlled belief effect | \(Y(B_1=1, B_2) - Y(B_1=0, B_2)\) | No |
The conjoint identifies information effects (rows 2, 4) only.
Survey experiments manipulate signals, not features of the world
Survey experiments are routinely used to make claims like these:
“allies who stood firm in the past indeed gain a reputation for resolve and are seen as more likely to stand firm in the current crisis”
— Kertzer, Renshon, and Yarhi-Milo (2021)
“shared democracy pacifies the public primarily by changing perceptions of threat and morality”
— Tomz and Weeks (2013)
“the promise to renovate schools increases the probability of support by only four percentage points over candidates who do not make these promises”
— Mares and Visconti (2020)
“support for elected governance is not contingent on the state’s providing economic benefits”
— Ridge (2024)
“movement towards the other party improves vote shares when party positions are unpopular”
— Broockman and Kalla (2026)
“the causal effect of candidate extremity on citizens’ preferences”
— Amsalem and Zoizner (2024)
These sound like claims about features of the world. They are based on effects of words in a controlled survey.
Two threats:
Before asking how to export results, ask: is the target estimand even well-defined?
Conjoints estimate effects of signals cleanly. The corresponding real-world estimands — effects of gender, regime type, corruption — may not be.
Three threats to estimand existence:
| | Threat | In a nutshell |
|---|---|---|
| 1 | Attributes as causes | Changing the attribute may change the unit itself |
| 2 | SUTVA violations | Many versions of the treatment, each with different effects |
| 3 | Exclusion restriction | No lever to change the attribute without side effects |
Important
The effect of statements about a feature \(\neq\) the effect of the feature itself
Holland’s challenge: “attributes of units are never causes”
If you change the attribute, you may no longer be talking about the same unit.
Hard cases — features constitutive of the unit:
Softer reading: at minimum, the effect of “being female” requires specifying which female version of the candidate — with the same upbringing? after a transition at age 30? born female 50 years ago?
These are different interventions with different potential outcomes.
Note
A conjoint randomizes a gender label in a survey. This is well-defined. The corresponding real-world intervention is not — unless it is specified precisely (e.g. “the effect of describing the candidate as female in campaign materials”).
The Stable Unit Treatment Value Assumption requires no hidden versions of treatment.
For states rather than interventions, this is almost always violated.
Example: “The effect of being a democracy”
Each version has different potential outcomes. Averaging over them produces an estimand that corresponds to no specific real-world scenario.
Other examples: the effect of “being a migrant,” “being employed,” “being wealthy” — all underspecified without a temporal stamp and a specific pathway.
Note
This is not unique to conjoints. Observational causal inference faces the same challenge whenever treatment is defined by a state rather than an intervention. Survey experiments just make it more visible because the treatment is so easily specified in the survey.
If every lever we can imagine to change an attribute inevitably induces other effects on outcomes, the clean causal estimand does not exist.
Candidate gender:
Can we imagine a Trump victory scenario where he was female — without imagining a gender transition at some point in his life? If a transition is required, then the “gender effect” includes the effect of transitioning, which is surely not the intended estimand.
Regime type:
Can we imagine a democracy without imagining the process that produced it (elections, a revolution, foreign pressure)? Those processes have their own effects on conflict and cooperation.
Warning
The logic: if we can only change \(A_1\) by also changing \(Z\) (the lever), and \(Z\) affects outcomes directly, then the effect of \(A_1\) is entangled with the effect of \(Z\).
In audit experiments the same issue arises: sending a “Black-sounding” name CV versus a “White-sounding” name CV changes more than just the perceived race.
For AMCE = ATE (survey estimand = real-world causal effect):
| | Condition | The challenge |
|---|---|---|
| A1 | Causal autonomy of attributes | Attributes often cause each other (power corrupts) |
| A2 | Sovereignty: votes → vote shares | Abstention, misreporting, mobilization |
| A3 | Sincere voting | Strategic behavior, bandwagon effects |
| A4 | Context irrelevance given attributes | Culture, institutions shape preferences |
| A5 | Signals are complete mediators | Attributes may affect behavior beyond beliefs |
| A6 | Aligned distributions | Survey profiles ≠ real-world distribution |
Figure 7: Left: a structure under which AMCE can recover the effect of attributes on votes. Right: three numbered threats to this inference. Arrow 1 = causal relation between attributes (violates A1). Arrow 2 = cross-attribute updating (violates A5). Arrow 3 = direct behavioral effect (violates A5).
① Endogenous attributes (violates A1)
In the real world \(A_1\) may cause \(A_2\), even if we randomize signals independently.
Example: Power may beget corruption. A conjoint randomizes these independently; the real world does not.
Consequence: the AMCE for \(A_1\) is \(0\), but the ATE is \(0.3\) in a worked example with identical marginal distributions.
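A minimal sketch of how such numbers can arise (the functional forms here are my own illustration, not the talk’s worked example): evaluations respond only to \(A_2\), and in the world \(A_1\) produces \(A_2\), so marginal distributions match across settings.

```r
# Evaluations depend only on A2; in the world, A1 causes A2 ("power begets
# corruption"). Illustrative numbers chosen so AMCE = 0 and ATE = 0.3.
set.seed(3)
N <- 10000
Y <- function(A1, A2) 0.3 * A2        # A1 plays no role given A2

# Conjoint: A1 and A2 signals randomized independently
A2_signal <- rbinom(N, 1, 0.5)
amce_A1 <- mean(Y(1, A2_signal)) - mean(Y(0, A2_signal))   # = 0

# World: setting A1 also sets A2 (A2 := A1), so A1's effect runs through A2
ate_A1 <- Y(1, A2 = 1) - Y(0, A2 = 0)                      # = 0.3

c(AMCE = amce_A1, ATE = ate_A1)
```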
② Cross-attribute updating (violates A5)
Subjects update beliefs about unobserved features based on observed signals.
Example: Learning a candidate’s profession, subjects update on their gender.
Consequence: Signal of one attribute affects beliefs about another — even without information on the second attribute.
③ Direct behavioral effects (violates A5)
Attributes may affect behavior through channels beyond preferences and beliefs.
Example: A powerful candidate can intimidate voters. A wealthy candidate can buy votes.
Consequence: The conjoint captures stated preferences; real voting also responds to power and material incentives.
Setup: A researcher randomizes candidate gender and reports:
“Being a woman costs a candidate approximately X% of vote share in competitive elections”
What is wrong here?
1. Maps to nowhere: “the effect of being a woman” on vote shares is underspecified (SUTVA: which version of being female? which transition? over what period?)
2. Conditions not met: The six conditions — especially A1 (gender may be causally related to other features), A5 (voters may respond to stereotypes via channels beyond stated preferences), and A6 (survey profiles ≠ real campaign information) — are unlikely to hold jointly.
3. Wrong level of description: Subjects were treated (given a gender signal), not candidates. Stated hypothetical preferences were measured, not actual votes.
Better framing
“A gender signal, in this collection of controlled information environments, shifts stated candidate preferences by X percentage points on average.”
| Risk | The confusion | The fix |
|---|---|---|
| 1. Estimand | Is the goal descriptive or causal? | Be explicit; ask if an experiment is even necessary |
| 2. Controls | Controls address confounding | Controls define estimands; different sets → incomparable results |
| 3. Extrapolation | AMCE = effect in the world | Requires A1–A6; these are extremely demanding |
The credibility revolution revealed how hard causal inference is in observational data. Survey experiments seemed to offer a solution: randomize what cannot be randomized in the field. But survey experiments do not “solve” these problems. Rather, they point our attention to *different* problems that can be more readily solved.
1. Know your estimand — before you design
If your goal is descriptive, check whether an experiment is necessary. Direct measurement may be simpler and less noisy. Choose the estimand first; choose the design to fit it.
2. Be explicit about what controls do
Report which attributes are included and why. Do not pool studies with different control sets — they answer different questions. Claims about “direct effects” (e.g., taste-based discrimination) must specify what is controlled downstream.
3. Keep claims in line with design
There is genuine power here, when used right:
Conjoint experiments are excellent for:
The honest statement of an AMCE:
The average effect of a signal about attribute \(X\), holding other controlled information fixed, on stated preferences — averaged over the distribution of information conditions in the experiment.
This is well-defined, identifiable, and useful.
It is just not the same as the effect of \(X\) in the world.
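For concreteness, a sketch of estimating exactly this quantity from simulated conjoint data (all attribute names and coefficients are invented). With independently randomized signals, the difference in means, or equivalently the OLS coefficient on the focal signal, estimates its AMCE:

```r
# Simulated conjoint: independently randomized signals, stated binary choice.
set.seed(4)
N  <- 5000
X  <- rbinom(N, 1, 0.5)                        # focal signal
Z1 <- rbinom(N, 1, 0.5)                        # other randomized signals
Z2 <- sample(c("low", "mid", "high"), N, replace = TRUE)

# Stated preference (invented response model)
choice <- rbinom(N, 1, plogis(-0.2 + 0.3 * X + 0.4 * Z1 + 0.2 * (Z2 == "high")))

# AMCE of the X signal: averaged over the experimental distribution of Z1, Z2
coef(lm(choice ~ X))["X"]
```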
Key
Don’t let the method determine the estimand
Three risks in conjoint experiments:
Estimand confusion — Causal vs. descriptive: different goals need different designs, and the experiment may not be necessary for descriptive goals
Controls confusion — Controls define estimands, not purify them: which controls you include changes what question you are answering; different controls → different, incomparable estimands
Extrapolation confusion — Surveys manipulate signals, not world features: translating AMCE to real-world causal effects requires six strong assumptions, each potentially violated
The AMCE: classic, well-defined, identifiable. Use it carefully and describe it accurately.
Know your estimand!
Slán
[Salve, Sláinte]