Abstract
This article is the first empirical application of the choice matching (CMa) method in discrete choice experiments (DCEs). An artifactual field experiment was conducted to test whether CMa applied to a DCE survey improves the validity and reliability of estimated preferences relative to standard hypothetical DCEs. Two experimental treatments were developed. In the first, subjects were exposed to a CMa-based DCE; in the second, subjects were exposed to a standard hypothetical DCE survey. Results suggest that although a CMa-based DCE does not improve validity, it can increase the reliability of estimated preferences.
1. Introduction
Stated preference (SP) methods are widely used in many branches of applied economics, ranging from environmental to agri-food economics and from health to transportation economics. SP can evaluate consumer preferences and willingness to pay (WTP) for innovative products that are not offered on the market yet and evaluate ex ante the economic value of welfare benefits generated by public policies that are not yet implemented. These evaluations can be very useful for businesses and policy makers and therefore need to be accurately estimated.
The accuracy of SP results, however, has often been questioned in the literature (e.g., Harrison and Rutström 2008; Harrison 2014). This article contributes to the literature by providing the first empirical application of the choice matching (CMa) method to discrete choice experiments (DCEs), arguably the most widely used SP technique. The CMa method was recently developed by Cvitanić et al. (2019) to elicit honest responses to any type of discrete choice question and could improve the accuracy of preferences elicited using DCE surveys by bridging the gap between stated and revealed preference methods. CMa can be considered a refinement of Prelec’s (2004) Bayesian Truth Serum (BTS). In our study, an artifactual field experiment was used to compare the accuracy,1 which is measured in terms of validity and reliability, of CMa applied to a DCE survey and a standard hypothetical DCE survey.
The accuracy of SP methods is often criticized for two reasons. First, rational choice theory, which is the theoretical foundation of these methods, has been challenged by empirical evidence from psychology and behavioral economics (e.g., Camerer 1995, 1999). Second, the hypothetical nature of the setting where respondents are asked to make decisions undermines the incentive compatibility of most SP approaches. In such hypothetical settings, truthful responses to the survey questions may no longer be the optimal strategy for respondents (Carson and Groves 2007) and could generate hypothetical bias (HB), which is the discrepancy between behavior observed in hypothetical and real choice settings (Harrison, Harstad, and Rutström 2004). HB often leads to an overestimation of WTP compared with market settings where real transactions occur (List and Gallet 2001; Murphy et al. 2005; Penn and Hu 2018).
Experimental methods are often used to explore these issues and can contribute to the SP literature in at least two ways (Harrison and Rutström 2008; Harrison 2014). First, carefully designed experiments conducted in controlled environments can be used to test whether rational choice theory is supported when people face valuation exercises and to identify the conditions that facilitate the satisfaction of its assumptions (Shogren 2005, 2006). Second, experimental methods can be used to assess the extent of HB, provide best practices to mitigate HB, and test the efficacy of approaches developed to minimize HB (Harrison and Rutström 2008; Harrison 2014).
A few caveats must be considered when using economic experiments for the latter purpose. Although experimental procedures used to elicit values and preferences are theoretically incentive-compatible mechanisms, they may not be fully demand revealing in practice (e.g., Cerroni et al. 2019). This raises the issue of whether economic experiments can elicit respondents’ true values and preferences (e.g., Harrison, Harstad, and Rutström 2004). Furthermore, experimental methods cannot be easily implemented in many nonmarket valuation contexts (e.g., environmental and health economics) because the goods and services under valuation cannot be exchanged for money at the end of the experiment, either because they are not available or because they are too costly. Hence, experimental methods can rarely be used to elicit values for public goods, which are often the main focus of SP applications (e.g., Adamowicz 2004; Shogren 2005). Finally, while controlled laboratory experiments using standard subject pools (i.e., students) are characterized by a high degree of internal validity, their ability to draw generalizable conclusions that explain behavior in real-life situations is often questioned (e.g., Gneezy and Imas 2017).2 To overcome this limitation, some experimental research has tried to move away from the laboratory to the field and add context to experimental designs (e.g., Harrison and List 2004). This move comes at a cost: the introduction of possible confounding factors and a lower degree of internal validity (e.g., Smith 1976). Economic experiments in the area of nonmarket valuation are always prone to the tension between laboratory control (internal validity) and natural context (external validity) (Shogren 2010).
Despite these caveats, experimental methods to investigate HB in SP surveys have stimulated the development of several methods to mitigate the problem. These are categorized into ex ante methods, which aim to reduce HB by survey design, and ex post methods, which aim to correct potentially biased preferences using calibration approaches (see the review in Loomis 2014). Several ex ante approaches to reducing HB in DCEs exist: (1) cheap talk (Cummings and Taylor 1999; Silva et al. 2011), (2) consequentiality (Vossler, Doyon, and Rondeau 2012), (3) honesty priming (de-Magistris, Gracia, and Nayga 2013), (4) oaths (Jacquemet et al. 2013), (5) virtual reality (Fang et al. 2021), (6) indirect questioning (IQ) (Lusk and Norwood 2009a), and (7) BTS (Prelec 2004). Results on the efficacy of these methods in reducing HB are generally mixed.
Given the importance of finding new ways to reduce HB in DCEs, this article tests whether the CMa method applied to a DCE survey improves the accuracy (in terms of validity and reliability) of elicited preferences when compared with a standard hypothetical DCE. Our empirical application focuses on consumer preference for a ready-meal product that varies in price, saturated fat, salt, and whether the beef was produced with antibiotics. Our experimental design consists of two treatments. In the CMa treatment, subjects were exposed to the CMa method applied to a DCE survey (CMa-based DCE). In the DCE treatment, subjects were exposed to a standard hypothetical DCE survey.
The CMa method consists of two tasks. The first task (preference task) is equivalent to a standard hypothetical DCE. In the second task (belief task), respondents are asked to predict the choices that all other respondents in their session made in each choice situation in the first task. These predictions are elicited using an incentivized proper scoring rule (i.e., quadratic scoring rule), and respondents receive a payment depending on the accuracy of their predictions. This induces respondents to reveal their own true beliefs about other respondents’ choices. The key mechanism that makes a hypothetical DCE incentive compatible under CMa, and therefore able to elicit truthful choices, is the following: respondents are informed that one choice situation will be selected at random at the end of the experiment (i.e., binding choice situation), and there is a chance that their predictions in the binding choice situation will be replaced by the average predictions of the other respondents who made the same choice in the binding choice situation in the preference task. These predictions are used to calculate respondents’ payoffs from the belief elicitation task; hence, respondents have strict incentives to make honest choices in the preference task. The incentive compatibility of this mechanism relies on a key assumption—impersonal updating—which requires that respondents who make the same choices have similar beliefs about other respondents’ choices. This key assumption is discussed in Section 3 and tested in Section 6. When the impersonal updating assumption is satisfied, the application of CMa to the DCE allows the elicitation of truthful choices related to any type of goods and services, including public goods that cannot be marketed. This implies that the CMa-based DCE could be used in all fields of applied economics that make use of DCEs.
To assess the relative validity and reliability of preferences elicited via the CMa-based DCE, random parameter logit (RPL) models (mixed logit) were estimated in WTP space. Validity and reliability are two concepts widely used in the SP literature to assess the accuracy of results (Mitchell and Carson 1989). Reliability is related to the variance of estimated preferences and can be assessed using the standard errors of the estimated coefficients or error variance (i.e., scale parameter in the DCE). Validity is about the truthfulness of estimated preferences. Since true values of goods under examination are often unobservable, indirect approaches must be used to assess validity, and several criteria are available for this purpose: content validity, construct validity, convergent validity, and criterion validity (Johnston et al. 2017; Bishop and Boyle 2019; Mariel et al. 2021).
We hypothesize that the means of estimated marginal WTP distributions obtained via the CMa-based DCE will be lower than those from a standard hypothetical DCE. This hypothesis is based on previous empirical evidence suggesting that hypothetical DCEs are generally affected by HB. Regarding reliability, we expect that standard deviations of the marginal WTP distributions and standard errors of estimated coefficients via the CMa-based DCE will be lower than those from the hypothetical DCE. In addition, we hypothesize that the scale parameter estimated using data from the CMa-based DCE is higher than that estimated using data collected via the hypothetical DCE. This would signal that the CMa-based DCE produces lower error variance. Our results generally suggest that although the CMa method does not necessarily improve validity, it has the potential to increase the reliability of estimated preferences and WTP.
2. Background
Reliability and Validity in DCEs
Reliability and validity have been extensively tested in many fields of applied economics where the use of SP methods is popular, such as environmental, health, energy, transportation, and agri-food economics (Table 1). The two most recent and exhaustive review studies on the reliability and validity of SP research were developed by Bishop and Boyle (2019) for the contingent valuation method (CVM) and Mariel et al. (2021) for DCEs. Johnston et al.’s (2017) contemporary guidance for SP studies is another important source of information. Other studies have reviewed the evidence on these two key concepts for specific disciplines; for example, Rakotonarivo, Schaafsma, and Hockley (2016) focus on environmental valuation, and Janssen et al. (2017) on health economics.
Synthesis of Contributions to the Discrete Choice Experiment Literature by Discipline
In the SP literature, reliability has been interpreted in terms of consistency of values, choices, and preferences. Most of this literature has explored intertemporal reliability using test-retest experiments, where the same survey is conducted at a time t and replicated at a time t + 1. A preference-elicitation method is deemed to be reliable when values and preferences elicited at time t and t + 1 are not statistically different. Many studies have used this approach in the DCE literature. All disciplines where the use of a DCE is widespread are represented: health economics (e.g., Bryan et al. 2000; Ryan et al. 2006; Skjoldborg, Lauridsen, and Junker 2009; Price, Dupont, and Adamowicz 2017), environmental economics (e.g., Bliem, Getzner, and Rodiga-Laßnig 2012; Schaafsma et al. 2014; Matthews, Scarpa, and Marsh 2017; Brouwer, Logar, and Sheremet 2017), energy economics (e.g., Liebe, Meyerhoff, and Hartje 2012), agri-food economics (e.g., Mørkbak and Olsen 2015; Rigby, Burton, and Pluske 2016), and transportation economics (e.g., Börjesson 2014).
This approach, however, departs from Mitchell and Carson’s (1989) original interpretation of reliability. Mitchell and Carson (1989) relate the reliability of SP studies to the variance of elicited contingent values. Following this interpretation, reliability has been measured in several ways in the literature. Bishop and Boyle (2019) argue that the reliability of the CVM can be assessed by considering the estimated standard errors of the elicited values. Specifically, larger standard errors signal lower reliability (and vice versa). Liebe, Meyerhoff, and Hartje (2012) explore the reliability of the DCE by considering the magnitude of the scale parameter, which is the inverse of the error variance. The larger the scale parameter, the lower the error variance and the higher the reliability of elicited preferences. The idea that error variance measures reliability has been widely used in the DCE literature (e.g., Day et al. 2012; Hess, Hensher, and Daly 2012; Campbell et al. 2015). Kealy, Montgomery, and Dovidio (1990) suggest that the variance of WTP distributions elicited via the CVM can be used as a proxy of reliability. The same criteria are used by Czajkowski, Barczak, and Budziński (2016) to test the reliability of DCEs. In this article, we used these approaches to explore the reliability of the CMa-based DCE compared with the standard hypothetical DCE.
The validity of SP surveys is related to the truthfulness of elicited values, choices, and preferences. Validity has been described in four possible ways: content validity, construct validity, convergent validity, and criterion validity (Johnston et al. 2017; Bishop and Boyle 2019; Mariel et al. 2021). Content validity focuses on the appropriateness of procedures to design and conduct the valuation study, analyze data, and report results. Generally, content validity can be assessed in terms of adherence to best practices highlighted in the literature (e.g., Holmes, Adamowicz, and Carlsson 2017; Johnston et al. 2017).
Construct validity focuses on prior knowledge regarding the relationship between values/preferences and other variables. This prior knowledge comes from economic theory and previous empirical studies. For example, in DCE studies, it is expected that the price coefficient is negative and statistically significant and that lower-income respondents are generally more sensitive to price changes (Mariel et al. 2021). A particular subcategory of construct validity is convergent validity, which focuses on comparisons of values/preferences estimated by different value/preference-elicitation mechanisms. An example would be comparing WTPs elicited via the CVM and a DCE (Hanley et al. 1998; Lloyd-Smith, Zawojska, and Adamowicz 2021). The array of examples is wide and covers several disciplines: health economics (e.g., Van der Pol et al. 2008; Ryan and Watson 2009), environmental economics (e.g., Boyle et al. 2001; Caparros, Oviedo, and Campos 2008; Christie and Azevedo 2009), energy economics (e.g., McNair, Bennett, and Hensher 2011), agri-food economics (e.g., Asioli et al. 2016; Yangui et al. 2019), and transportation economics (e.g., Raffaelli et al. 2021).
Criterion validity focuses on comparisons of results from SP studies with those from alternative methods that are deemed to elicit true preferences, for example, simulated markets in experimental settings or real choice/market settings (Mariel et al. 2021). Criterion validity is strictly related to the vast literature addressing HB (see Table 1).
Hypothetical Bias and Ex Ante Corrections in DCEs
There is a vast SP literature exploring the impact that HB has on elicited preferences and WTP. A few meta-analyses showed that hypothetical surveys tend to overestimate WTP compared with real market settings (List and Gallet 2001; Murphy et al. 2005; Penn and Hu 2018).
HB in SP studies could be driven by the lack of incentive compatibility of most hypothetical surveys or by behavioral drivers that may affect participant responses to the survey question (see the discussion proposed by Vossler and Zawojska [2020] on elicitation effects). Several behavioral drivers have been explored. First, respondents may be more inclined to state preferences they think the experimenter wants to hear (i.e., experimenter demand effect) (Zizzo 2010). Second, respondents may report preferences they perceive to be more socially acceptable (i.e., social desirability bias) (Norwood and Lusk 2011). Third, respondents may not perceive their choices as having consequences (i.e., lack of consequentiality), either in terms of the influence of their choices on policy makers’ decisions (Carson and Groves 2007) or in terms of actually paying the amount they declared themselves willing to pay (Mitani and Flores 2014). The latter could lead to strategic behavior and free riding. Finally, respondents may also be uncertain about their responses to a discrete choice situation (i.e., preference uncertainty) and make erroneous judgments (Champ et al. 1997).
This list is far from exhaustive, and we refer interested readers to Loomis (2011, 2014) and Carson, Groves, and List (2014). Carson et al. (1996) reported the lack of a general theory that explains HB and, despite some attempts to develop such a theory (Ajzen, Brown, and Carvajal 2004), their claim is still valid today.
There are several ex ante methods to reduce HB, with each trying to address one or more of the determinants of HB described above. The implementation of these methods in DCEs is cross-disciplinary and spans from environmental to health economics and from energy to agri-food economics (Table 1).
Cheap talk is a script informing participants about the existence of the HB problem and asking them to respond to the SP survey as if they were facing a real and binding decision (Cummings and Taylor 1999). This ex ante method aims to reduce HB generally, regardless of its determinants. Evidence on the efficacy of cheap talk is mixed (Carlsson, Frykblom, and Lagerkvist 2005; Özdemir, Johnson, and Hauber 2009; Silva et al. 2011; Fifer, Rose, and Greaves 2014; Howard et al. 2017; Penn and Hu 2018; Wuepper, Clemm, and Wree 2019).
Consequentiality is the construction of a survey design that respondents perceive to be consequential in terms of the payment or policy implications. The idea is that if respondents perceive their choices to affect their budget constraints or policy makers’ decisions, they will make more reliable decisions. This approach was developed for CVM surveys implementing advisory referenda by Carson and Groves (2007) and tested by Carson, Chilton, and Hutchinson (2009) and Vossler and Evans (2009). It was extended to DCEs by Vossler, Doyon, and Rondeau (2012). Although this approach provided encouraging results in many fields of applied economics (e.g., Czajkowski et al. 2017; Lewis, Grebitus, and Nayga 2017; Oehlmann and Meyerhoff 2017; Zawojska, Bartczak, and Czajkowski 2019; Carson et al. 2020), its implementation is most appropriate for public goods.
Honesty priming consists of presenting subjects with a task that indirectly emphasizes the value of honesty among respondents. The honesty priming task is presented to respondents before the DCE. This approach was developed by de-Magistris, Gracia, and Nayga (2013), and empirical evidence suggests it can mitigate HB in DCE studies (e.g., Bello and Abdulai 2016; Howard et al. 2017). Other studies have asked respondents to read and sign oath scripts in which they swear to tell the truth and provide honest answers during the survey. This approach, proposed by Jacquemet et al. (2013), has been shown empirically to reduce HB in SP studies (e.g., Jacquemet et al. 2017). It has recently been implemented in DCE studies (e.g., de-Magistris and Pascucci 2014; Kemper, Popp, and Nayga 2017; Mamkhezri et al. 2020). Oath scripts and cheap talk seem to provide the largest reductions of HB when combined (e.g., Jacquemet et al. 2013). Another approach is the use of virtual reality (VR) in DCEs. Fang et al. (2020) found that VR can reduce HB, particularly for those who do not significantly experience VR discomfort. VR has also been shown to reduce variability in preferences (Haghani and Sarvi 2019) as well as the asymmetry between WTP and WTA (Bateman et al. 2009). Although cheap talk, honesty priming, consequentiality, and oaths have been extensively tested and appear to mitigate (to some extent) HB in SP surveys, IQ and BTS have not yet received much attention. We discuss these in more detail in the following sections.
Indirect Questioning and Novel Truth Serums
The IQ method goes back to Haire (1950) and involves asking respondents to predict the choice behavior of a third party, which is indicated in the indirect question (Fischer and Tellis 1998). In SP studies, IQ (sometimes referred to as inferred valuation; Lusk and Norwood 2009b) involves asking respondents to predict the choices of other respondents or the population of interest instead of reporting their own private choices. In the context of a DCE, this involves not choosing one’s most preferred option among several alternatives but estimating the distribution of choices that the population of interest would make in each choice situation. This can reduce social desirability bias, which has been identified as a driver of HB. IQ is particularly relevant for eliciting preferences associated with public goods or attributes. In theory, IQ could reduce HB in SP studies (Norwood and Lusk 2011), and empirical evidence suggests that IQ can partially reduce HB in DCE studies (Lusk and Norwood 2009b; Carlsson, Daruvala, and Jaldell 2010; Yadav, van Rensburg, and Kelley 2013; Menapace and Raffaelli 2020; Raffaelli et al. 2021). Implementing IQ is relatively straightforward, and an IQ DCE survey can be conducted the same way as a standard DCE survey. Nonetheless, while representing an improvement over classic DCEs, IQ is not without problems. First, IQ is still not incentive compatible, as respondents have no incentives to provide truthful beliefs about choices at the population level. Second, IQ methods are only able to elicit population-level choices (i.e., distributions of population choices), and it is not clear whether respondents’ beliefs about others’ choices correlate with their own preferences (Fisher 1993). Thus, while IQ may offer some advantages over standard DCEs, it is certainly not a panacea.
More recently, new methods to elicit truthful choices were applied to DCEs. The BTS (Prelec 2004) method asks respondents to make their personal choices and predict the choice behavior of other participants, just as in the IQ method. BTS uses a scoring mechanism that is associated with monetary payoffs and induces respondents to provide truthful personal choices. This is because truthful revelation is a Bayesian Nash equilibrium. The BTS scoring rule consists of two components: (1) a “prediction score,” which rewards respondents’ predictions based on their accuracy regarding others’ choice behavior; and (2) an “information score,” which rewards respondents’ personal choices based on whether these are surprisingly common (i.e., these choices are more frequent than predicted).3 Menapace and Raffaelli (2020) applied BTS in the context of a DCE by exploring consumers’ preferences for more sustainably produced pasta. Respondents were asked to make choices in a standard hypothetical DCE and guess the percentage of respondents choosing each option in the presented choice situations. The BTS score was associated with a payment rule that should incentivize respondents to make truthful personal choices: the top 30% of respondents in terms of this BTS score received a €30 gift voucher.
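To make the two components of the score concrete, the following sketch (in Python) computes BTS scores for a small group, assuming the standard Prelec (2004) formulation with equal weights on the information and prediction scores; the choices and predictions are purely illustrative and are not taken from any study discussed here.

import numpy as np

# Hypothetical data: 3 options, 4 respondents.
# choices[r] is the option chosen by respondent r.
choices = np.array([0, 0, 1, 2])
# predictions[r, k]: frequency respondent r predicts for option k (rows sum to 1).
predictions = np.array([
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

x_bar = np.bincount(choices, minlength=3) / len(choices)  # observed choice shares
y_bar = np.exp(np.log(predictions).mean(axis=0))          # geometric mean of predictions

def bts_score(r):
    k = choices[r]
    information = np.log(x_bar[k] / y_bar[k])  # rewards "surprisingly common" choices
    prediction = np.sum(x_bar * np.log(predictions[r] / x_bar))  # rewards accurate predictions
    return information + prediction

print([round(float(bts_score(r)), 3) for r in range(len(choices))])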
BTS has limitations. First, the method requires a large sample whose minimum size is a function of the (unknown) prior and therefore cannot be determined a priori (i.e., it is impossible to know in advance how large a sample is required). Second, the implementation of BTS in DCE surveys requires paying monetary rewards to respondents, which makes this approach slightly more difficult to implement than a standard DCE survey. Third, BTS is more cognitively demanding than a standard DCE survey, and this may create fatigue effects that undermine the accuracy of elicited preferences. Further refinements of BTS either cannot be applied to DCEs, such as the robust BTS by Witkowski and Parkes (2012), or have other undesirable properties, such as the divergent BTS by Radanovic and Faltings (2014), which allows for dishonest equilibria where lying is a payoff-dominant strategy (Cvitanić et al. 2019). In this article, we focus on the ability of the CMa mechanism to overcome some of the limitations of IQ and BTS discussed here, especially when applying the methods in DCE applications.
3. The CMa Method Applied to the DCE Survey
This study is the first application of the CMa method to a DCE. The CMa method consists of two stages that can be operationalized into two tasks. Task 1 (preference task) is equivalent to a standard hypothetical DCE, where respondents are asked to select the most preferred alternative in several choice situations. In task 2 (belief task), each respondent i is asked to predict the choices that all other respondents in their session made in each choice situation k presented in task 1. These predictions are elicited using an incentivized proper scoring rule (i.e., quadratic scoring rule [QSR]), and respondents are rewarded depending on how accurate their predictions are at the end of the experiment. This induces respondents to reveal, in task 2, their true beliefs about the choices that other respondents made in task 1.
The key mechanism that makes the DCE presented in task 1 incentive compatible is that at the beginning of the experiment, respondents are informed that (1) one choice situation k will be selected at random (i.e., the binding choice situation) at the end of the experiment, and (2) there is a probability p ∈ [0,1] that the prediction that each respondent i made in task 2 regarding the choices that other respondents made in the binding choice situation k in task 1 will be replaced with the average prediction of all other respondents who made the same choice as respondent i in task 1. Therefore, respondents may receive an experimental payoff according to the average predictions of the other respondents who made the same choice.
The key assumption that needs to be satisfied to make the DCE in task 1 incentive compatible is referred to as impersonal updating. This assumption implies that respondents who made the same choice in choice situation k of task 1 report the same beliefs regarding the choices that other respondents made in choice situation k of task 1. The logic is that respondents prefer that their beliefs expressed in task 2 be replaced with beliefs as similar as possible to their own, simply because this maximizes their expected experimental payoff. Therefore, to maximize their expected experimental payoff from task 2, each respondent has an incentive to report truthful choices in task 1. This holds provided that respondents’ beliefs and choices are highly correlated (impersonal updating). If a respondent makes untruthful choices in task 1, there is a chance that their payoff from task 2 will be determined by beliefs that differ from their true beliefs. In this case, the respondent will not maximize their expected experimental payoff. Thus, respondents have strict incentives to report their truthful choices in task 1, despite the preference elicitation not being directly incentivized in task 1.
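This incentive can be illustrated numerically. Under a QSR of the form used later in this article, a respondent whose true belief over the options is p and whose payoff is computed from a (possibly substituted) probability vector q has expected payoff a + b(2Σj pj qj − Σj qj²), which is maximized at q = p. The minimal sketch below, with purely illustrative numbers, shows that being matched with same-choice respondents who share one's beliefs is harmless, whereas being matched, through an untruthful choice, with a group holding different beliefs lowers the expected payoff.

import numpy as np

a, b = 5.0, 5.0  # QSR parameters; a = b = 5 is the parameterization used in this article

def expected_qsr(p_true, q_scored):
    # Expected payoff when outcomes realize with probabilities p_true
    # but the payoff is computed from the probability vector q_scored.
    p, q = np.asarray(p_true), np.asarray(q_scored)
    return a + b * (2 * (p @ q) - (q @ q))

p_own = np.array([0.5, 0.4, 0.1])    # respondent's true belief about others' choices
q_same = np.array([0.5, 0.4, 0.1])   # average prediction of same-choice respondents
                                     # (identical to p_own under impersonal updating)
q_other = np.array([0.2, 0.2, 0.6])  # hypothetical average prediction of a group
                                     # that made a different choice

print(expected_qsr(p_own, q_same))   # 7.10: replacement is harmless under a truthful choice
print(expected_qsr(p_own, q_other))  # 5.20: an untruthful choice risks a lower expected payoff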
In theory, the CMa mechanism should fully remove HB and induce demand revelation and hence should represent a marked improvement over the IQ method discussed already. Compared with BTS, CMa has the advantages of (1) not being based on any kind of equilibrium concept or requiring cognitively difficult Bayesian updating; (2) using a payoff-generating rule that is easier to explain; and (3) being implementable in small groups. Similar to BTS, implementing CMa in DCE surveys requires the payment of monetary payoffs to respondents. This represents a complication compared with the design of a standard DCE survey.
4. Methods
Discrete Choice Experiment
In the DCE, respondents were presented with a series of 12 choice situations that each featured two alternative cottage pies (options A and B) and an “I prefer neither” opt-out alternative (option C). Each cottage pie was the same size (400 g), representing a typical individual portion size (i.e., a “ready meal for one”). A cottage pie is made of a layer of mashed potato on top of minced beef in gravy sauce with some vegetables (onions, carrots, etc.) included. Options A and B were described using four attributes.
The first attribute is related to the presence of traces of antibiotic in food. Each cottage pie either had a label indicating that the cattle-derived ingredients (i.e., beef and dairy) were “raised without antibiotics” or had no such label (implicitly indicating the possible presence of antibiotic residues in the meat and dairy ingredients). This attribute had two levels.
The cottage pies were also described according to the saturated fat and salt content using a traffic light system (TLS). The UK Food Standards Agency (FSA) has implemented a (voluntary) TLS that rates food products as either low, medium, or high (represented as green, amber, or red, respectively) for their quantities of calories, fat, saturated fat, sugar, and salt (FSA 2016). These attributes had three levels each. The level of saturated fat is low (or green: 1.2 g per 100 g), medium (or amber: 2.3 g per 100 g), or high (or red: 6.2 g per 100 g). These values are within the appropriate range for each TLS level per the FSA guidelines and are therefore consistent with existing food labeling with which consumers are likely to be familiar. The level of salt was similarly varied. Low (or green) salt content corresponds to 0.2 g per 100 g, medium (or amber) to 1.1 g per 100 g, and high (or red) to 2.3 g per 100 g. Finally, each cottage pie was associated with one of four prices (attribute levels): £1.50, £2.00, £3.00, or £4.50. This range of prices represents the extent of market prices from, at the low end, a supermarket-brand version to, at the high end, a gourmet version from an upmarket supermarket.
The 12 choice situations were generated using a D-efficient design, which was created using data from a pilot study (ChoiceMetrics 2018).4 An example choice situation shown to respondents during the instructions is presented in Figure 1. Respondents were asked to select their most preferred alternative in each choice situation. The order of the choice situations was randomized across respondents. Because this DCE was hypothetical, respondents did not have the chance to receive the cottage pie or any additional payment beyond the £15 participation fee.
An Example Choice Situation Shown to Respondents in the Experimental Instructions
Note: Light gray corresponds to yellow in the instructions; medium gray corresponds to green in the instructions; dark gray corresponds to red in the instructions.
Choice-Matching Discrete Choice Experiment
As noted, the CMa mechanism consisted of two tasks. In task 1 (preference task), respondents were exposed to the same procedure used for the classic hypothetical DCE, as described already. In task 2 (belief task), they were asked to predict the frequency of other participants (in their session) choosing each option (A, B, or the opt-out C) in task 1. This was asked for all 12 choice situations presented in task 1.
These frequencies were elicited using a QSR (Brier 1950; Murphy and Winkler 1970). Each possible prediction corresponded to a payoff vector of three possible payoffs (Figure 2). These payoffs were derived using a QSR for eliciting subjective probability distributions recently developed by Harrison et al. (2017). A QSR rewards respondents for the accuracy of their predictions and penalizes them depending on how frequencies are distributed across the available intervals (Harrison et al. 2017). In each choice situation k, the exact payoff associated with each option j (A, B, or C) is calculated as

\pi_{i,j,k} = a + b\left[2p_{i,j,k} - \sum_{m \in \{A,B,C\}} p_{i,m,k}^{2}\right],

where p_{i,j,k} is the frequency assigned by respondent i to a given option j in choice situation k. In our parameterization, a = b = 5, giving a minimum payoff of £0 and a maximum of £10. Respondents in the CMa treatment therefore received a payment ranging from £15 to £25.5 For example, assume that a respondent i in a session with 10 other respondents predicted that 5 people chose option A, 4 option B, and 1 option C in a given choice situation k (see Figure 2). The payoff obtained by respondent i for each of the options A, B, and C is then given by π_{i,A,k} = 5 + 5[(2 × 0.5) − (0.5² + 0.4² + 0.1²)] = £7.90 if option A realizes, π_{i,B,k} = 5 + 5[(2 × 0.4) − (0.5² + 0.4² + 0.1²)] = £6.90 if option B realizes, and π_{i,C,k} = 5 + 5[(2 × 0.1) − (0.5² + 0.4² + 0.1²)] = £3.90 if option C realizes.
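This worked example can be verified with a few lines of code (a minimal sketch using the parameterization above):

# QSR payoffs for the worked example: a = b = 5, predictions (0.5, 0.4, 0.1)
a, b = 5.0, 5.0
preds = {"A": 0.5, "B": 0.4, "C": 0.1}
sum_sq = sum(p ** 2 for p in preds.values())  # 0.5^2 + 0.4^2 + 0.1^2 = 0.42
for option, p in preds.items():
    payoff = a + b * (2 * p - sum_sq)
    print(f"Payoff if option {option} realizes: £{payoff:.2f}")
# Prints £7.90, £6.90, and £3.90, matching the calculations in the text.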
An Example Choice Situation Shown to Respondents in the Experimental Instructions
Note: Light gray corresponds to yellow in the instructions; medium gray corresponds to green in the instructions; dark gray corresponds to red in the instructions.
Respondents were provided with the following information before taking part in the experiment. They were told they would take part in two tasks in the following order: the preference task (task 1) and the belief task (task 2) (information 1).

They were told that one choice situation was to be drawn at random to be payoff-relevant, referred to as the binding choice situation. This was illustrated to respondents as a numbered ball (from 1 to 12) drawn from a bucket (information 2).

They were informed that earnings depended on (1) the reported frequency of respondents choosing each option j (A, B, or C) in the binding choice situation k in task 1, and (2) the observed frequency of respondents choosing each option j (A, B, or C) in the binding choice situation k in task 1. In particular, respondents’ payoffs depended on a random draw from a bucket containing balls labeled A, B, and C. The proportion of A-, B-, and C-labeled balls was equivalent to the observed frequency of respondents choosing option A, B, or C in the binding choice situation k in task 1. The final earnings for each respondent were equal to the payoff calculated using the QSR for the randomly drawn letter. Consider the example in Figure 2: respondent i would earn £7.90 if an A-labeled ball was randomly drawn from the bucket, £6.90 if a B-labeled ball was drawn, and £3.90 if a C-labeled ball was drawn (information 3).

Respondents were also told that there was a chance (70% in our experiment6) that their payoffs would be calculated based on the average predictions of the other respondents who preferred the same option as themselves in the binding situation in task 1, rather than based on their own predictions. This might affect their payoff. For example, suppose respondent i preferred option A in the binding choice situation in task 1. There is a 70% chance that respondent i’s predictions, reported in task 2, regarding the frequency of respondents choosing option A, B, or C in the binding choice situation will be replaced with the average predictions of all the other respondents who also preferred option A in task 1.7 Whether respondent i’s prediction was to be replaced was decided by an independent random draw (illustrated to respondents as the roll of a 10-sided die). If the outcome of the roll was between 1 and 3, respondent i’s own prediction was used to calculate the payoff from task 2 (QSR). If the outcome was between 4 and 10, the average predictions of the other respondents who preferred the same option as respondent i in the binding situation in task 1 were used instead (information 4).
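Taken together, the payoff procedure can be summarized as a short simulation. The sketch below encodes our reading of information 1-4; the session choices and predictions are hypothetical, used only to make the sequence of draws concrete.

import numpy as np

rng = np.random.default_rng(1)
a, b = 5.0, 5.0

def qsr_payoff(pred, realized):
    # Payoff vector entry for the realized option, per the QSR above.
    return a + b * (2 * pred[realized] - np.sum(pred ** 2))

# Hypothetical session: choices[r] in {0, 1, 2} for options A, B, C in the
# binding choice situation; predictions[r] is respondent r's reported
# frequency of others choosing (A, B, C).
choices = np.array([0, 0, 0, 1, 1, 2])
predictions = np.array([[0.5, 0.4, 0.1], [0.6, 0.3, 0.1], [0.5, 0.3, 0.2],
                        [0.2, 0.6, 0.2], [0.3, 0.5, 0.2], [0.2, 0.3, 0.5]])

binding = rng.integers(12) + 1  # numbered ball drawn from 1-12 (information 2)

i = 0  # respondent whose payoff we compute
pred_i = predictions[i]
if rng.random() < 0.7:  # 70% chance of replacement (information 4)
    same = choices == choices[i]
    same[i] = False  # average over the *other* respondents with the same choice
    pred_i = predictions[same].mean(axis=0)

freq = np.bincount(choices, minlength=3) / len(choices)  # observed choice frequencies
ball = rng.choice(3, p=freq)  # A-, B-, or C-labeled ball (information 3)
print(f"Binding situation {binding}: payoff £{qsr_payoff(pred_i, ball):.2f}")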
Sample and Experimental Design
Our sample consists of 130 consumers living in or around Belfast (Northern Ireland, United Kingdom).8 The study was advertised in several locations (including digital channels) and described simply as a food choice study. Ages ranged from 19 to 74 (the average age was 33), and 65% of respondents were female.
As noted, our sample was randomly split between two treatment conditions: 66 respondents took part in the treatment featuring the DCE supplemented by the CMa mechanism (“CMa treatment”), and the remaining 64 respondents took part in the DCE baseline control treatment (“DCE treatment”). Sessions in the CMa treatment ranged in size from 10 to 14 respondents. Sessions in the DCE treatment ranged in size from 6 to 14 respondents. Participants were randomly assigned to sessions. All sessions took place at the Institute for Global Food Security at Queen’s University Belfast in April 2019. All sessions were programmed and run using z-Tree (Fischbacher 2007). The study was granted full ethical approval by the Faculty of Medicine, Health and Life Sciences Ethical Review Board at Queen’s University Belfast.
In both treatments, respondents began by answering two questions relating to how hungry or full they felt at that moment, rated on a seven-point Likert scale. Subsequently, respondents in both treatments made their choices in the 12 choice situations of the standard DCE survey, generated using a D-efficient design. In the CMa treatment only, respondents then expressed their beliefs regarding other respondents’ choices in the standard DCE, incentivized via the QSR. To familiarize themselves with the CMa procedures, respondents were exposed to a practice round involving both the preference task and the belief task. All respondents in both treatments completed a questionnaire that asked about demographics, behavior, and preferences related to cottage pies (including expected taste for every combination of saturated fat and salt content), broader shopping habits, and knowledge of antibiotics and antimicrobial resistance.
All participants received a £15 show-up fee before making any of their choices (or hearing any of the instructions) in the experiment. An overview of each of the treatments is shown in Table 2.9
Steps in Each Experimental Treatment
5. Testing Choice Matching’s Ability to Reduce Hypothetical Bias
Econometric Models and Testable Hypotheses
In this study, we test the validity of CMa using a criterion validity paradigm. Specifically, we compare marginal WTP (mWTP) for each attribute characterizing the cottage pie across treatments. We expect that mWTPs elicited via CMa will be lower than those elicited via the hypothetical DCE, which can potentially suffer from HB. The reliability of mWTPs elicited via CMa is also compared with that of mWTPs elicited via the hypothetical DCE. Specifically, we compared the following:
■ Standard deviations of estimated mWTPs across treatments as a measure of reliability, as suggested by Kealy, Montgomery, and Dovidio (1990). We expect that standard deviations associated with CMa will be lower than those associated with the DCE. We acknowledge that using standard deviations as an indicator of the reliability of estimated mWTPs is not the norm in the choice modeling literature, where they are usually interpreted as an indicator of preference heterogeneity (Hensher and Greene 2003).
■ Standard errors of estimated coefficients in our choice models, following Bishop and Boyle (2019). Because larger standard errors signal lower reliability (and vice versa), we expect that standard errors will be lower in the CMa treatment than in the hypothetical DCE.
■ Scale parameter of estimated choice models, as suggested by Liebe, Meyerhoff, and Hartje (2012). We expect that the scale parameter (error variance) will be higher (lower) in the CMa treatment than in the hypothetical DCE.
To this end, we used a two-step procedure. First, we estimate the mean and standard deviation of mWTPs by estimating two RPL models (or mixed logit) in WTP space (Train and Weeks 2005): model 1 is estimated using data from the CMa treatment, and model 2 uses data from the DCE treatment. Second, we compare the mean and standard deviations of mWTPs across treatments using Poe, Giraud, and Loomis’s (2005) convolution approach.10
In the first step, random utility models (RUMs) were used to model choice data (McFadden 1973). RUMs assume that the utility that participant i attaches to each alternative j in each choice situation k is split into two parts: Vi,j,k, the part of the utility observed by the researcher, and εi,j,k, which cannot be observed by the researcher, so that Ui,j,k = Vi,j,k + εi,j,k. RPL models were estimated in WTP space because this estimation procedure provides several advantages compared with standard estimation in preference space. First, it allows direct estimation of mWTP for nonprice attributes. In WTP space models, the utility is rearranged such that estimated coefficients related to nonprice attributes represent mWTP for such attributes. Second, estimation in WTP space mitigates the confounding of variation in scale (i.e., the standard deviation of the unobserved part of the utility) and WTP (Train and Weeks 2005), which is instead an issue in models estimated in preference space. Third, many studies have shown that models in WTP space fit data better than those in preference space (e.g., Thiene and Scarpa 2008; Hole and Kolstad 2012). This estimation approach was recently adopted in studies investigating consumers’ preferences for food products (e.g., Lin, Ortega, and Caputo 2019; Macdiarmid et al. 2021).
The general specification of the indirect utility function of the RPL models estimated in WTP space is specified as Vi,j,k = −γiPRICEi,j,k + (γiωi)xi,j,k. In the equation, γi = αi/μi, where αi indicates participants’ preferences for the price of the cottage pie PRICEi,j,k, and μi is the scale parameter (the standard deviation of the unobserved part of the utility). The coefficient vector ωi = βi/αi is the ratio of the vector of coefficients βi, which are associated with the vector of nonprice attributes xi,j,k and the coefficient αi. The vector of coefficients βi indicates preferences for the vector of nonprice attributes xi,j,k, while the vector of coefficients ωi indicates the vector of mWTPs associated with the vector of nonprice attributes xi,j,k.
The vector ωi is composed of the following coefficients: ωFAT_A,i and ωFAT_G,i indicate subjects’ mWTP for pies that are amber and green in saturated fat (FAT_A and FAT_G, respectively) compared with pies that are red in saturated fat (FAT_R). The coefficients ωSALT_A,i and ωSALT_G,i indicate subjects’ mWTP for pies that are amber and green in salt (SALT_A and SALT_G, respectively) compared with pies that are red in salt (SALT_R). The coefficient ωANT,i refers to pies made of beef and dairy products from animals that were raised without antibiotics. The coefficient ωOPT-OUT,i indicates subjects’ preferences for the opt-out alternative. The coefficients ωFAT_A,i, ωFAT_G,i, ωSALT_A,i, ωSALT_G,i, ωANT,i, and ωOPT-OUT,i are all assumed to be normally distributed, with means and standard deviations to be estimated. The coefficient αi indicates subjects’ preferences for the price of pies (PRICE) and is modeled as a random parameter following a log-normal distribution with mean and standard deviation to be estimated.
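To illustrate this specification, the sketch below computes the deterministic utilities Vi,j,k = γi(−PRICEi,j,k + ωi′xi,j,k) and the implied logit choice probabilities for one choice situation. The parameter draws and attribute coding are purely illustrative, not our estimates.

import numpy as np

rng = np.random.default_rng(0)

# One illustrative draw of individual-level parameters (not estimated values):
gamma = np.exp(rng.normal(-1.0, 0.5))  # gamma_i = alpha_i / mu_i, kept positive via log-normal
omega = rng.normal(0.5, 0.2, size=6)   # mWTPs for FAT_A, FAT_G, SALT_A, SALT_G, ANT, OPT-OUT

# Illustrative attribute coding for options A, B, and the opt-out C in one
# choice situation; columns follow the order of omega, prices in £.
x = np.array([[1, 0, 0, 1, 1, 0],   # A: amber fat, green salt, raised without antibiotics
              [0, 1, 1, 0, 0, 0],   # B: green fat, amber salt
              [0, 0, 0, 0, 0, 1]])  # C: opt-out
price = np.array([3.00, 2.00, 0.00])

V = gamma * (-price + x @ omega)     # WTP-space indirect utility
probs = np.exp(V) / np.exp(V).sum()  # conditional logit choice probabilities
print(probs.round(3))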
In the second step, we used Poe, Giraud, and Loomis’s (2005) convolution approach to test differences in the distribution of estimated coefficients between model 1 (CMa) and model 2 (DCE). Specifically, we used parametric bootstrapping techniques (Krinsky and Robb 1986) to generate 1,000 bootstrapped values for each pair of coefficient distributions and calculate 1,000,000 differences between the two bootstrapped distributions. The full set of hypotheses tested is shown in Table 3.
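Once the Krinsky-Robb draws are available, the convolutions test reduces to a simple computation. A minimal sketch follows, with simulated normal draws standing in for the bootstrapped coefficient distributions.

import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for 1,000 Krinsky-Robb draws of one coefficient from each model:
draws_cma = rng.normal(loc=1.0, scale=0.2, size=1000)  # model 1 (CMa treatment)
draws_dce = rng.normal(loc=1.3, scale=0.4, size=1000)  # model 2 (DCE treatment)

# Complete combinatorial approach of Poe, Giraud, and Loomis (2005):
# all 1,000 x 1,000 = 1,000,000 pairwise differences between the distributions.
diffs = draws_cma[:, None] - draws_dce[None, :]
share_positive = (diffs > 0).mean()  # basis of the one-sided significance level
print(f"Share of positive differences: {share_positive:.4f}")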
Description and Interpretation of Testable Hypotheses
Results
Results from the estimation of models 1 and 2 are shown in Table 4. Summary statistics of the choices made in the two treatment groups are provided in Appendix B. Potential differences between the two subsamples were investigated using a logit sample selection model, nonparametric Kolmogorov-Smirnov tests, and parametric t-tests. We did not find any substantial differences in a set of key variables (e.g., gender, age, hunger level, income, taste expectation for the pies).11
Random Parameter Logit Models Estimated in Willingness-to-Pay Space
Results reported in Table 5 suggest that the distributional means of our coefficients are not statistically different across groups, meaning there is no evidence that mWTPs elicited via CMa are lower than those elicited via the DCE. It may be that mWTPs elicited via hypothetical DCEs already have an acceptable level of validity. It is well documented that SP studies valuing private goods are less affected by HB than SP studies valuing public goods (List and Gallet 2001; Murphy et al. 2005; McFadden and Train 2017). In a more recent meta-analysis, Penn and Hu (2018) found that DCE and referendum formats generate significantly lower HB than open-ended, payment card, and dichotomous choice CVM studies. Therefore, we conclude that the mWTPs estimated from the CMa method are as valid as those estimated from the DCE.
Comparison of Distributional Means and Standard Deviation of Estimated Coefficients across Treatments
Concerning reliability, we found that CMa provides less dispersed mWTP distributions for five (out of seven) attributes in our empirical application (i.e., OPT-OUT, FAT_A, FAT_G, SALT_G, and ANT), which suggests that mWTPs elicited via CMa may be more reliable than those elicited using the hypothetical DCE survey.12 This finding is corroborated by the fact that standard errors of estimated coefficients are consistently smaller in the CMa treatment than in the DCE treatment (see Table 4). However, we found that the scale parameter τ and error variance are not statistically significantly different between the two groups (Table 5). Two indicators of reliability (out of three) therefore suggest that CMa provides more reliable results than the hypothetical DCE survey, indicating that CMa has the potential to improve the reliability of estimated welfare measures.
6. Testing Impersonal Updating and Result Robustness
The key assumption behind the CMa method is impersonal updating: respondents with similar choices should have similar beliefs about other respondents’ choices. If this assumption does not hold, the incentive compatibility of the CMa method may be weakened.
We test whether impersonal updating is satisfied empirically using Pearson’s χ2 tests. Specifically, in each session t, for each choice situation k, we tested the null hypothesis (H0) that subjects who chose the same alternative j (A, B, or C) in the preference task (task 1) reported equivalent beliefs regarding the number of subjects choosing options A, B, and C in the preference task. These beliefs were reported by respondents in the belief task (task 2). Rejection of this null hypothesis suggests that impersonal updating was not satisfied. We conducted 18 tests per choice situation k: 3 for each session t, one per alternative (we had 6 sessions). Overall, we conducted 216 tests across the 12 choice situations. Results are reported in Appendix E. Approximately 54% of tests did not reject the null hypothesis, suggesting that impersonal updating is satisfied in just over half of the cases.13 We discuss the possible ramifications of this result in the next section.14
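A sketch of one such test follows, assuming it is implemented as a Pearson χ2 test of homogeneity on a contingency table whose rows are the same-choice respondents and whose columns are their predicted counts of others choosing A, B, and C; the counts are purely illustrative.

import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical predicted counts (A, B, C) reported in task 2 by four
# respondents who all chose option A in choice situation k of task 1:
beliefs = np.array([[6, 3, 1],
                    [5, 4, 1],
                    [6, 2, 2],
                    [4, 4, 2]])

chi2, p, dof, expected = chi2_contingency(beliefs)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# A large p-value fails to reject H0 of equivalent beliefs,
# consistent with impersonal updating for this alternative.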
7. Conclusions
DCEs are arguably the most popular SP method used by applied economists, and finding ways to elicit reliable preferences is very important. This article represents the first empirical application of the CMa method and the first empirical test of its validity and reliability. The CMa method was recently developed by Cvitanić et al. (2019) to elicit honest responses using any type of discrete choice question and represents a refinement of BTS.
We conducted an artifactual field experiment involving two treatments. Part of the sample was exposed to a standard hypothetical DCE survey, and the other part to the CMa method applied to a hypothetical DCE survey. The DCE was designed to elicit consumers’ preferences for antibiotic residue presence as well as salt and saturated fat content in cottage pies, a popular British dish available as a ready meal on many supermarket shelves.
RPL models (mixed logit) were estimated in WTP space. Our results show that the means of estimated mWTPs for cottage pie attributes do not differ across treatments, indicating that CMa is as valid as the hypothetical DCE survey. Standard deviations of the estimated mWTPs and the standard errors of the estimated coefficients are significantly lower in the CMa treatment than in the DCE treatment. In contrast, the error variance is not statistically significantly different in the two groups. Although these results indicate that the DCE with CMa can be more reliable than the conventional DCE, further research is needed on the topic, considering that only two reliability criteria (out of three) support this argument, and the standard deviations of the mWTPs may simply signal unobserved preference heterogeneity. We acknowledge that results on validity may be driven by the private nature of the good under investigation. Previous studies have shown that HB is more of a problem when the good under valuation is public (e.g., McFadden and Train 2017; Penn and Hu 2018). Future research could explore the performances of the CMa approach in terms of validity and reliability when the nature of the good under valuation is public or quasi-public. This investigation would be particularly beneficial for disciplines such as environmental and health economics that often focus on valuing these types of goods.
A test of the impersonal updating assumption was conducted to explore whether this crucial assumption for the functioning of the CMa mechanism was satisfied. We found that this assumption is only partially satisfied in our sample. This may cast some doubts on the applicability of the CMa method to DCEs, and further research is needed to test whether the benefits of using CMa in SP research outweigh the higher cognitive burden to which respondents are exposed in the CMa mechanism. For example, it is possible that fatigue effects may undermine the empirical applicability of CMa for SP research. These trade-offs could also be explored by comparing the performance of CMa in controlled laboratory experiments with more standard subject pools (i.e., students) and field experiments involving the general public.
The potential implementation of CMa in studies that are not conducted in the laboratory, including those with larger and more representative samples of the population of interest, could be another important aspect to consider in future research. High levels of internet access (even in remote and rural areas), the availability of many software packages and platforms for conducting online experiments, and the existence of many companies offering consumer panels and sampling services at a large scale have increased the potential to conduct economic experiments online. The COVID-19 pandemic has given more impetus to the use of such online experiments. Recent studies have shown that data quality from online economic experiments is adequate and reliable (Arechar, Gächter, and Molleman 2018). These considerations allow us to be optimistic about the use of the CMa approach applied to choice-based SP methods outside the laboratory with larger and more representative samples. Future research should also explore the validity and reliability of the CMa method in such settings to test the robustness of our findings.
Finally, a limitation of the study is our test of validity: we do not have any real market data with which to compare our results. Unfortunately, no cottage pies with the same range of attributes as those used in the survey were available on the U.K. market when the study was conducted. However, future research could easily perform such a test of validity by using the CMa method to value a good currently available on the market. Nevertheless, our empirical application shows that the CMa method can provide more reliable estimates than hypothetical DCE surveys; hence, we conclude that CMa could be a promising method that should be tested further, not just in DCEs but also in other SP elicitation formats, such as the dichotomous choice CVM, payment card formats, and multiple price list formats. We hope this article will encourage other researchers to further test the merits of the CMa method in SP studies.
Acknowledgments
We thank Chloe McCallum for her help in organizing sessions and running the experiment at Queen’s University Belfast. We thank two anonymous reviewers and the editor for comments on an earlier draft of this article.
Footnotes
Appendix materials are freely available at http://le.uwpress.org and via the links in the electronic version of this article.
↵1 In this article, we use Harrison and List’s (2004) categorization of field experiments.
↵2 Internal validity is the ability to draw robust causal conclusions (Loewenstein 1999).
↵3 The formula for the BTS score for respondent r, who chooses among k = 1, …, m options, is

u_r = \sum_{k=1}^{m} x_r^k \ln\left(\frac{\bar{x}^k}{\bar{y}^k}\right) + \sum_{k=1}^{m} \bar{x}^k \ln\left(\frac{y_r^k}{\bar{x}^k}\right),

where x_r^k is a dummy indicating whether respondent r chose option k (x_r^k = 1) or not (x_r^k = 0), \bar{x}^k is the proportion of respondents choosing option k, y_r^k is the frequency of respondents choosing option k as predicted by respondent r, and \bar{y}^k is the (geometric) average predicted frequency of choosing option k (Prelec 2004).

↵4 The pilot was conducted with 25 respondents in March 2019. These were randomly recruited among academic and nonacademic staff members at Queen’s University Belfast.
↵5 Both treatments took less than an hour to complete on average. The minimum wage per hour in the United Kingdom in 2019 was £8.21 (U.K. Government 2019). Therefore, compensation of between £15 and £25 represents a reasonable monetary incentive to motivate participants in their decision-making.
↵6 We fixed this probability at 70% because we wanted to avoid using a 50% chance, since participants may then perceive everything as being random (i.e., a coin flip), and we did not want the chance of beliefs being replaced to be so high that participants viewed their own beliefs as inconsequential in practice. We acknowledge that this parameter may influence results. Future research could investigate whether varying this chance influences the properties of the choice matching approach.
↵7 If a participant was the only one to choose a particular option, and therefore no other respondents’ beliefs existed with which to replace the participant’s own, the participant automatically received an additional payment of £0, per Cvitanić et al. (2019).
↵8 Summary statistics of the S-error were calculated using the software Ngene (ChoiceMetrics 2018): mean = 18.325, standard deviation = 4.022, median = 17.720, minimum = 11.104, maximum = 40.949. These values suggest that our sample is large enough for the design used in the study.
↵9 The full set of experimental instructions is available in Appendix A.
↵10 We used parametric bootstrapping techniques (Krinsky and Robb 1986) to generate 1,000 bootstrapped values for each estimated coefficient and calculate 1,000,000 differences between the two bootstrapped distributions.
↵11 Results from these analyses are presented in Appendix C. Only age was statistically different at the 5% significance level between the two groups.
↵12 This is evident in Appendix Figure D.1. To test the robustness of our results, we estimated models 1 and 2 in preference space and used Poe, Giraud, and Loomis’s (2005) convolution approach to test differences in coefficients across treatment groups (CMa and the DCE). Results from these analyses show the robustness of our results and are provided in Appendix D.
↵13 A limitation of the strategy used to test the impersonal updating assumption is that it relies on the fact that CMa elicits truthful choices. A proper test of whether elicited choices are truthful (and therefore impersonal updating is satisfied) is only possible if revealed choices are available for the goods under investigation. We thank an anonymous referee for highlighting the issue.
↵14 As pointed out by an anonymous referee, the key to whether impersonal updating is sufficient for incentive compatibility is the “closeness” of the beliefs. Broadly, it would be sufficient for the beliefs of those who made similar choices to be closer than the beliefs of those who made different choices. Hence, an alternative approach to explore whether the impersonal updating assumption is satisfied would be testing whether the beliefs of respondents who made similar choices are more homogeneous than the beliefs of respondents who made different choices. This would represent a less strict approach to test the impersonal updating assumption.