Abstract
A method for incorporating unobserved heterogeneity into aggregate count data frameworks is presented and used to control for endogenous spatial sorting in zonal recreation models. The method is based on latent class analysis, which has become a popular tool for analyzing heterogeneous preferences with individual data but has not yet been applied to aggregate count data. The method is tested using data on backcountry hikers for a southern California study site and performs well for relatively small numbers of classes. The latent class model produces substantially smaller welfare estimates compared to a constrained version that assumes homogeneity throughout the population. (JEL Q51)
I. Introduction
The basic premise of a standard travel cost model is that the cost to access a recreation site is a function of the distance an individual must travel from home to the site. All else being equal, individuals who have higher travel costs would be expected to make fewer trips than individuals with lower travel costs. With observations on individuals who live close to the site and further away from the site, a Marshallian demand curve for trips can be derived, as in Figure 1, and the value of a trip can be estimated from the curve (either as consumer surplus as shown in the figure, or as compensating or equivalent variation if the integrability conditions are satisfied).
Welfare Measurement in a Travel Demand Model With Homogeneous Preferences
As noted by Parsons (1991), a crucial assumption in such a derivation is that the trip-taking behavior of any two observationally equivalent individuals can be estimated using the same demand curve as in Figure 1. That is, if the distance individual A must travel to reach the site were changed to match that of individual B, the demands for individuals A and B would be the same (except perhaps for an empirical disturbance term). This assumption is unlikely to be valid if there is substantial unobserved heterogeneity in the population. For example, Parsons (1991) argues that individuals with relatively stronger preferences for recreation might choose to live closer to the recreation sites they frequent specifically to reduce their travel costs. This endogenous spatial sorting means that if such an individual A were moved to individual B’s location, their demands would differ systematically due to unobserved heterogeneity in preferences: individual A would still demand more trips than B. Using an instrumental variables approach to control for endogeneity, Parsons (1991) finds support for this hypothesis in a dataset of visitors to Wisconsin lakes in 1978.
Figure 2 illustrates the implications for welfare estimation (using consumer surplus as the welfare measure). In Figure 2, individual A demands strictly more trips than individual B at all prices due to differing preferences for recreation; their observed behaviors do not lie on a common demand curve. Both individual and aggregate consumer surplus in this case are strictly greater than in the preceding case. If Figure 2 represents the “true” model, then Figure 1 represents a constrained version that overestimates the price-elasticity of demand and underestimates the welfare effects of recreation site closures. Parsons (1991) finds this to be the case in his dataset: the consumer surplus estimate from the instrumental variables model is more than twice the measure from the standard model.
Welfare Measurement in a Travel Demand Model With Heterogeneous Preferences
Another possible sorting outcome is that some individuals may choose to live closer to recreation sites not for the recreation opportunities, but rather for the other amenity (passive use) values that open space typically provides. In this case the outcome depicted in Figure 2 would be reversed: individual A would demand strictly fewer trips than individual B at all prices. The constrained model would underestimate the price-elasticity of demand and overestimate the welfare effects of recreation site closures. Which of these (or other) candidate explanations holds true is an empirical question that this analysis seeks to answer for a southern California application, using a latent class analysis of an aggregated dataset.
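The direction of the bias under each sorting outcome can be checked with a small numerical sketch. All numbers below are hypothetical and are not drawn from the study data; the sketch assumes a semilogarithmic demand x = exp(a + βp) with a common true slope and uses −x/β as consumer surplus. The function name `pooled_vs_true` is introduced here for illustration only.

```python
import math

BETA_TRUE = -0.05   # common true price slope (hypothetical value)

def pooled_vs_true(a_near, a_far, p_near=10.0, p_far=40.0, beta=BETA_TRUE):
    """Return (pooled slope, true total CS, pooled total CS) for two individuals.

    a_near, a_far: demand intercepts of the person living near / far from the site.
    p_near, p_far: their travel costs (hypothetical dollars per trip).
    """
    x_near = math.exp(a_near + beta * p_near)
    x_far = math.exp(a_far + beta * p_far)
    # A one-curve model passes a single semilog demand through both observations.
    beta_pooled = (math.log(x_near) - math.log(x_far)) / (p_near - p_far)
    # Consumer surplus under semilog demand is -x / slope.
    cs_true = -(x_near + x_far) / beta
    cs_pooled = -(x_near + x_far) / beta_pooled
    return beta_pooled, cs_true, cs_pooled

# Recreation-driven sorting: the person living near the site has stronger tastes.
recreation_sorting = pooled_vs_true(a_near=2.0, a_far=1.0)
# Amenity-driven sorting: the person living near the site has weaker tastes.
amenity_sorting = pooled_vs_true(a_near=1.0, a_far=2.0)
print(recreation_sorting)
print(amenity_sorting)
```

In the first case the pooled slope is steeper than the true slope and consumer surplus is understated; in the second the pooled slope is flatter and consumer surplus is overstated, matching the two outcomes described above.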
II. Unobserved Heterogeneity and Aggregation
Despite the current popularity of random utility modeling, there remain good reasons to model recreation demand using an aggregate count data specification, also known as a “zonal travel cost model.” First, the discrete and nonnegative nature of the recreation decision lends itself to well-known count data frameworks such as the Poisson and negative binomial. Second, there is an abundance of readily available recreation permit data and aggregated census data that can substitute for individual surveys; the latter can be costly to collect and prone to selection bias (Weber and Berrens 2006). Third, compared to other approaches that rely on individual data, aggregate count data models are less prone to model specification bias and have performed well in Monte Carlo tests (Hellerstein 1995). And fourth, it is straightforward to obtain estimates of seasonal welfare changes including access value.
However, there are also potential problems associated with these models. First, it is not feasible to model the consumer’s complete utility maximization problem. Although the same is true of other modeling frameworks, this can lead to biased welfare estimates unless certain assumptions and restrictions are deemed valid. Second, reliance on aggregate rather than individual data can introduce aggregation bias into the analysis (Stoker 1993). And third, large sample populations likely exhibit substantial amounts of unobserved heterogeneity, which if neglected also can be a source of bias.
LaFrance (1990) notes three approaches to address the first problem: (1) aggregate across goods, (2) assume preferences are appropriately separable, or (3) estimate a theoretically consistent incomplete demand system. Shonkwiler (1999, 256) argues that aggregation is not desirable when the focus is on demand for specific goods, and LaFrance (1990) and Shonkwiler (1999, 256) outline several shortcomings of invoking arbitrary separability. Incomplete demand systems thus have emerged as an attractive option for addressing this problem, particularly when modeling demand for multiple goods that may be substitutes.
Stoker (1993) provides an excellent overview of aggregation issues. To be consistent with theory when working with aggregate data, the analyst should either aggregate individual demands exactly by integrating over the distribution of characteristics in the population (Hellerstein 1995; Moeltner 2003; Blundell and Stoker 2005, 352), or assume utility functions consistent with the Gorman form and estimate derived demand functions using the aggregate data (LaFrance, Beatty, and Pope 2008). Moeltner (2003) demonstrates how information about the distribution of characteristics in ZIP-code-level census data can be incorporated into a zonal recreation demand model, but finds only a 5% change in welfare estimates after doing so.
Individual heterogeneity can be addressed in a variety of ways, depending on its nature. As described by Moeltner (2003), it is straightforward to incorporate observable heterogeneity into the demand function by including the appropriate moments of regressors thought to influence demand (e.g., age, income, gender, etc.). Unobserved heterogeneity is often addressed with either fixed or random effects, or with latent class (Provencher, Baerenklau, and Bishop 2002; Boxall and Adamowicz 2002) or random parameters models (Train 1998). All of these approaches permit different parameter estimates to be associated with different individuals in the population, thus potentially improving measures of model fit and accuracy and avoiding certain types of bias.
Issues of heterogeneity and aggregation are thus related. When there is little heterogeneity or when the nature of it is well known in a population, aggregation is easier and the scope for aggregation bias is small. Moeltner’s (2003) analysis provides some evidence that neglecting to account for observable heterogeneity at the ZIP-code level does not substantially bias welfare estimates in an aggregate count data framework.1 However, to date, there has been no examination of the consequences of neglecting to account for unobserved heterogeneity. This article addresses this gap and focuses on the relationships between unobserved heterogeneity, endogenous spatial sorting, and welfare estimation in zonal recreation demand models.
III. Modeling Unobserved Heterogeneity in an Aggregate Count Data Framework
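Following Moeltner (2003), Englin, Holmes, and Niell (2006), and others, this analysis posits a semilogarithmic incomplete demand system at the individual level as the basis for the model (von Haefen 2002):
$$x_{ij} = \alpha_j(q_{ij}) \exp\Big(\sum_{k=1}^{J} \beta_{jk} p_{ik} + \gamma_j y_i\Big), \quad j = 1, \ldots, J. \qquad [1]$$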
Here, xij is individual i’s demand for trips to site j∈{1,...,J}; αj is a demand shifter that is a function of observable individual- and site-specific variables qij; pik is individual i's cost to access site k; each β is an estimable parameter; yi is individual i’s income; and γj is an estimable parameter. One set of parameter restrictions that guarantees integrability of the demand system described by equation [1] is (LaFrance 1990; von Haefen 2002):
$$\alpha_j(q_{ij}) > 0, \quad \beta_{jk} = 0 \;\; \forall k \neq j, \quad \gamma_j = \gamma, \quad \forall j. \qquad [2]$$
Imposing these restrictions on equation [1] gives
$$x_{ij} = \exp\big(a_j(q_{ij}) + \beta_j p_{ij} + \gamma y_i\big), \quad \forall j, \qquad [3]$$
where aj(qij) ≡ ln(αj(qij)), βj ≡ βjj, and each βj is restricted to be negative. As in previous studies, this analysis assumes aj(qij) is linear in q: aj(qij) ≡ δ′qij, thus αj(qij) ≡ exp(δ′qij). LaFrance (1990) and von Haefen (2002) derive compensated welfare measures for this demand system, which are used later in the analysis to estimate recreation access values.
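A statistical framework is needed to estimate the parameters β, γ, and δ. For reasons discussed below, this analysis specifies that individual demand for each site follows an independent Poisson distribution (Cameron and Trivedi 1986):
$$f(x_{ij}) = \frac{\exp(-\lambda_{ij}) \, \lambda_{ij}^{x_{ij}}}{x_{ij}!}, \quad x_{ij} = 0, 1, 2, \ldots, \qquad [4]$$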
with mean and variance both equal to λij, and λij ≡ E(xij) = exp(δ′qij + βjpij + γyi), ∀j. Because zonal models use aggregate rather than individual data, this analysis and others before it assume qij = qIj, pij = pIj, yi = yI, ∀i ∈ I for consistent aggregation, where I indicates a specific zone and subscript I indicates an aggregated (typically per capita) variable for that zone. Furthermore, because the sum of N independent Poisson distributions is also Poisson with parameter equal to the sum of the N individual parameters, it follows that the density for the aggregate zonal demand, $x_{Ij} \equiv \sum_{i \in I} x_{ij}$, is given by
$$f(x_{Ij}) = \frac{\exp(-n_I \lambda_{Ij}) \, (n_I \lambda_{Ij})^{x_{Ij}}}{x_{Ij}!}, \qquad [5]$$
where λIj ≡ E(xij) = exp(δ′qIj + βjpIj + γyI), ∀i ∈ I, ∀j, and nI is the population of zone I.
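The aggregation step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names and all numerical inputs are hypothetical. It evaluates the per capita expected demand, the log density of total zonal trips, and checks on a tiny case that the distribution of the sum of two independent Poisson draws matches a single Poisson with the summed parameter.

```python
import math

def expected_trips(delta, q, beta_j, p, gamma, y):
    """Per capita expected demand: exp(delta'q + beta_j * p + gamma * y)."""
    index = sum(d * v for d, v in zip(delta, q)) + beta_j * p + gamma * y
    return math.exp(index)

def poisson_log_pmf(x, mean):
    """ln of the Poisson probability of count x; lgamma(x + 1) equals ln(x!)."""
    return -mean + x * math.log(mean) - math.lgamma(x + 1)

def zonal_log_density(x_zone, n_zone, lam):
    """Log density of total zonal trips: Poisson with mean n_zone * lam."""
    return poisson_log_pmf(x_zone, n_zone * lam)

# Hypothetical inputs for one zone and one site.
lam = expected_trips(delta=[-3.0, 0.5], q=[1.0, 0.4], beta_j=-0.04, p=25.0,
                     gamma=0.01, y=30.0)

# Convolve two independent Poisson(lam) distributions at a total of 3 trips
# and compare with a single Poisson(2 * lam) evaluated at the same total.
total = 3
convolved = sum(
    math.exp(poisson_log_pmf(k, lam) + poisson_log_pmf(total - k, lam))
    for k in range(total + 1)
)
direct = math.exp(zonal_log_density(total, 2, lam))
print(convolved, direct)
```

Working with the log density avoids forming the factorial directly, which matters when zonal trip counts are large.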
As mentioned previously, there are multiple ways to incorporate unobserved heterogeneity into equation [5]. The approach used here is to assume the population of agents within any zone can be characterized as a mixture of distinct but unobservable homogeneous groups. This is known as a latent class or finite mixture model, and it has been used in several previous applications with individual data in the context of random utility (Swait 1994; Morduch and Stern 1997; Provencher, Baerenklau, and Bishop 2002; Scarpa and Thiene 2005). Wedel et al. (1993) and Bockenholt (1993) appear to provide the first instances of latent class count data models in the literature. These and subsequent applications, mainly in the marketing and health literatures, use individual data. The approach used here is similar, but the model structure and interpretation are somewhat different, and the aggregate nature of the data imposes some additional constraints that are discussed below. Although there is no theoretical reason to prefer a latent class model to a random parameters model, the latent class approach is intuitive and computationally easier than the random parameters approach. It also can be useful for identifying policy-relevant subpopulations of individuals.
In previous applications using individual data, the probability that an individual i belongs to group g ∈{1,...,G} often is assumed to be logistic:
$$\pi_{i,g} = \frac{\exp(\phi_g' z_i)}{\sum_{h=1}^{G} \exp(\phi_h' z_i)}, \qquad [6]$$
where each ϕg is an estimable parameter vector (ϕ1≡0 for identification), and zi are individual characteristics thought to influence group membership.2 However, when working with aggregate data it is not possible to allocate individuals to groups, even probabilistically, because zi is not observed. Rather, the analogous concept is to interpret equation [6] as the expected proportion of individuals in zone I belonging to group g. To implement this it is again necessary to assume zi = zI, ∀i ∈ I; then the expected proportion is given by
$$\pi_{I,g} = \frac{\exp(\phi_g' z_I)}{\sum_{h=1}^{G} \exp(\phi_h' z_I)}. \qquad [7]$$
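As a minimal sketch of how the expected group proportions might be computed, the function below evaluates a multinomial logit in zone-level characteristics. The name `class_shares` and the coefficient values are hypothetical; subtracting the largest score before exponentiating is a numerical convenience added here and does not change the shares.

```python
import math

def class_shares(phi, z):
    """Expected share of each latent group in a zone (multinomial logit).

    phi: one coefficient list per group; phi[0] should be all zeros
         (the identification restriction).
    z:   zone-level characteristics, same length as each phi[g].
    """
    scores = [sum(c * v for c, v in zip(phi_g, z)) for phi_g in phi]
    top = max(scores)                      # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    return [w / denom for w in weights]

# Hypothetical two-group example: a constant and one zone characteristic.
phi = [[0.0, 0.0], [1.2, -0.8]]
shares = class_shares(phi, z=[1.0, 0.5])
print(shares)
```

The shares sum to one in every zone by construction, so only G − 1 coefficient vectors are free.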
Because the aggregate zonal demand remains a sum of independent Poisson distributions in the presence of multiple latent groups, the latent-class analog for equation [5] is parameterized by the expected aggregate demand across all classes of individuals in a given zone:
$$E(x_{Ij}) = n_I \sum_{g=1}^{G} \pi_{I,g} \lambda_{Ij,g}, \qquad [8]$$
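where subscript g refers to group-specific versions of previously defined variables. Incorporating [8] into [5] gives
$$f(x_{Ij}) = \frac{\exp\big(-n_I \sum_{g} \pi_{I,g} \lambda_{Ij,g}\big) \, \big(n_I \sum_{g} \pi_{I,g} \lambda_{Ij,g}\big)^{x_{Ij}}}{x_{Ij}!}, \qquad [9]$$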
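with associated sample log-likelihood3
$$\ln L = \sum_{I} \sum_{j=1}^{J} \Big[ -n_I \sum_{g} \pi_{I,g} \lambda_{Ij,g} + x_{Ij} \ln\Big(n_I \sum_{g} \pi_{I,g} \lambda_{Ij,g}\Big) - \ln(x_{Ij}!) \Big]. \qquad [10]$$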
Note that because group-specific demands are not observed, the regression still relies on the observed aggregate zonal demands $x_{Ij}$, which are generated by the mixture of groups in each zone. However, each group is now characterized by an expected individual demand:
$$\lambda_{Ij,g} = \exp\big(\delta_g' q_{Ij} + \beta_{j,g} p_{Ij} + \gamma_g y_I\big), \qquad [11]$$
which can be aggregated up to give the expected demands by group and by zone: nI πI,g λIj,g.4
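The pieces above can be assembled into a sample log-likelihood. The following is a sketch under stated assumptions, not the estimation code used in the study: the data layout (dictionaries with keys `n`, `z`, `q`, `y`, `p`, `x`), the function names, and all parameter values are hypothetical, and only the Python standard library is used.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def softmax(scores):
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def sample_loglik(params, zones):
    """Latent class Poisson log-likelihood summed over zones and sites."""
    loglik = 0.0
    for zone in zones:
        # Expected group shares in this zone (logit in zone characteristics z).
        shares = softmax([dot(phi_g, zone["z"]) for phi_g in params["phi"]])
        for j, x_j in enumerate(zone["x"]):
            # Group-specific per capita expected demand for site j.
            lams = [
                math.exp(dot(params["delta"][g], zone["q"][j])
                         + params["beta"][g][j] * zone["p"][j]
                         + params["gamma"][g] * zone["y"])
                for g in range(len(shares))
            ]
            mean = zone["n"] * dot(shares, lams)   # expected total zonal trips
            loglik += -mean + x_j * math.log(mean) - math.lgamma(x_j + 1)
    return loglik

# Hypothetical zone with two sites; q holds site dummies only.
zone = {"n": 20000, "z": [1.0, 0.3], "y": 2.5,
        "q": [[1.0, 0.0], [0.0, 1.0]], "p": [30.0, 45.0], "x": [14, 6]}
two_group = {"phi": [[0.0, 0.0], [1.0, -0.5]],
             "delta": [[-5.0, -5.5], [-8.0, -8.5]],
             "beta": [[-0.03, -0.03], [-0.06, -0.05]],
             "gamma": [0.1, 0.05]}
ll_two = sample_loglik(two_group, [zone])

# With a single group the share equals one and the model is the basic Poisson.
one_group = {"phi": [[0.0, 0.0]], "delta": [[-5.0, -5.5]],
             "beta": [[-0.03, -0.03]], "gamma": [0.1]}
ll_one = sample_loglik(one_group, [zone])
print(ll_two, ll_one)
```

In practice a function like this would be passed (negated) to a numerical optimizer; that step is omitted here because the choice of optimizer and starting values is not described in the text.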
These group-specific expected demands provide the basis for group-specific welfare estimates that can inform the distributional effects of policies affecting recreation demand. In this application, equivalent variation (EV) is used as the welfare measure. Conditional on group membership, EV for an individual in a particular zone is given by (Englin, Boxall, and Watson 1998)
$$v_{I,g} = -\frac{1}{\gamma_g} \ln\left[1 + \gamma_g \sum_{j=1}^{J} \frac{\lambda_{Ij,g}(p_I^1) - \lambda_{Ij,g}(p_I^0)}{\beta_{j,g}}\right], \qquad [12]$$
where $p_I^0$ is the prepolicy set of access costs and $p_I^1$ the postpolicy set. It follows that aggregate EV for a group in a particular zone is nI,gvI,g. However, because group-specific populations cannot be known with certainty, the expectation of this quantity must be used: E(nI,gvI,g) = nI πI,gvI,g. These expectations may be aggregated across zones to provide estimates of group-specific welfare effects in the population.
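A short sketch of the welfare calculation follows. It assumes the closed form that can be derived from a semilogarithmic demand system with own-price effects only and a common income coefficient within a group, namely EV = −(1/γ) ln[1 + γ Σj (λj(post) − λj(pre))/βj]; whether this matches the exact expression in Englin, Boxall, and Watson (1998) is an assumption. Function names and numbers are hypothetical.

```python
import math

def equivalent_variation(lam_pre, lam_post, beta, gamma):
    """EV for one individual in a given group and zone.

    lam_pre, lam_post: expected trips per site before and after the policy.
    beta: own-price coefficients per site (all negative).
    gamma: income coefficient (must be nonzero in this form).
    """
    shift = sum((post - pre) / b for pre, post, b in zip(lam_pre, lam_post, beta))
    inside = 1.0 + gamma * shift
    if inside <= 0.0:
        raise ValueError("log argument must be positive")
    return -math.log(inside) / gamma

def expected_group_ev(n_zone, share, ev_individual):
    """Expected aggregate EV for a group in a zone: n_I * pi_Ig * v_Ig."""
    return n_zone * share * ev_individual

# Hypothetical closure of both sites: post-policy expected demand is zero.
v = equivalent_variation(lam_pre=[0.02, 0.01], lam_post=[0.0, 0.0],
                         beta=[-0.05, -0.04], gamma=0.01)
group_total = expected_group_ev(20000, 0.25, v)
print(v, group_total)
```

The result is negative for a closure (a welfare loss), and as the income coefficient approaches zero it approaches the familiar consumer surplus loss of Σj λj/|βj|.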
Some further comments on the preceding model are warranted, particularly with regard to fixed and random effects and negative binomial (NB) specifications. A well-known criticism of the basic Poisson model is that the mean and variance are constrained to be equal. Because the Poisson is a member of the linear exponential family (LEF) of distributions, this constraint potentially affects the efficiency but not the consistency of the Poisson estimates under a distributional misspecification (i.e., if the true data-generating mechanism is not Poisson), provided the conditional mean function is specified correctly (Cameron and Trivedi 1986).5 However, typically it is possible to relax this constraint, obtain more efficient estimates, and incorporate a limited amount of unobserved heterogeneity by introducing random effects into the regression (Greene 1997, 939). When the random effects are assumed to follow a gamma distribution, this produces the NB model. Depending on the specific parameterization of the gamma function, different NB models are obtained including Cameron and Trivedi’s (1986) NB1 and NB2 specifications. A major advantage of using the NB2 model is that it is a LEF distribution; however, using the NB2 model with aggregate data is problematic because the sum of independent NB2 draws is not NB2. The sum of independent NB1 draws is NB1; however, the NB1 is not a LEF distribution so it is not robust to misspecification. Furthermore, Cameron and Trivedi’s (1986) quasi-generalized pseudo-maximum-likelihood estimator cannot be used to estimate either model with latent classes because group-specific demands are not observed, so consistent estimates of the nuisance parameters cannot be obtained.
Other random effects estimation methods (quadrature, Monte Carlo, and quasi-Monte Carlo methods) all require evaluating the factorial of the total number of trips in equation [9]; but with aggregate data the value of the factorial can be quite large, resulting in numerical errors during estimation.
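The scale of the factorial problem is easy to illustrate. The snippet below uses a hypothetical zonal trip count and shows that the factorial cannot be held in double precision, while its logarithm is well behaved. It is only an illustration of the magnitude involved; whether a log-scale reformulation resolves the difficulty for the random effects estimators mentioned above is not examined here.

```python
import math

x_total = 5000   # hypothetical aggregate trip count for one zone

# The exact integer factorial exists, but it cannot be represented as a double.
try:
    float(math.factorial(x_total))
    overflowed = False
except OverflowError:
    overflowed = True

# The logarithm of the factorial is well behaved: lgamma(x + 1) = ln(x!).
log_factorial = math.lgamma(x_total + 1)
print(overflowed, log_factorial)
```

The basic Poisson log-likelihood needs only the logarithm of the factorial, which is one practical reason it remains tractable with aggregate counts.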
Another approach would be to incorporate fixed effects into the Poisson model, which normally is straightforward (Greene 1997, 940). However, introducing fixed effects into a latent class Poisson model is problematic, again because the group-specific demands are not observed. This prevents substituting sufficient statistics for the fixed effects to obtain a concentrated likelihood function that can be optimized over the slope parameters only. Greene (2001) proposes a “brute force” method that may be adaptable to a latent class Poisson model and avoids inverting a potentially large Hessian, but this approach has been left for future work.
The basic Poisson model thus provides a robust and theoretically sound framework for calculating consistent (though potentially inefficient) parameter estimates from aggregate data.6 Furthermore, in a related application using this dataset (Baerenklau et al. 2010), ML estimation of a NB1 model without latent classes predicted 3.4 times the actual number of trips and increased welfare estimates by 24% relative to a Poisson model, which correctly predicted the total number of trips. Other authors (e.g., von Haefen and Phaneuf 2003a; Englin, Holmes, and Niell 2006) have found similar results with NB models. Thus von Haefen and Phaneuf (2003b) have suggested the Poisson is preferable for policy purposes. For these reasons, this application relies on a Poisson framework with robust Eicker-White standard errors for hypothesis testing.
IV. Study Site and Data
The preceding framework is applied to backcountry recreation during 2005 in the San Jacinto Wilderness in the San Bernardino National Forest in southern California.7 The wilderness covers over 13,000 hectares and is located within a 2.5-hour drive of most of the Los Angeles, San Diego, and Palm Springs metropolitan areas (population: ~18 million) and attracts roughly 60,000 backcountry visitors each year. An additional 350,000 visitors ride the Palm Springs Aerial Tramway into the Mt. San Jacinto State Park but do not enter the backcountry. The Pacific Crest Trail traverses the wilderness from north to south, and elevations range from 1,800 to 3,300 meters.
Backcountry access is regulated by two U.S. Forest Service ranger stations and one state park office. Horses are allowed but bikes and motorized vehicles are prohibited. Day hiking is by far the most popular activity in the backcountry. Day hikers enter the backcountry via several vehicle-accessible trailheads located on the north, west, and south sides of the wilderness (regulated by a ranger station and the state park office, both located in Idyllwild), or by riding the tram and then hiking in from the east side (regulated by a ranger station located in Long Valley). Table 1 presents some statistics for the 10 trailheads that account for nearly all day-use visitors.
Summary Statistics for Trailheads
Backcountry visitors are required to obtain a permit in either Idyllwild or Long Valley, but the Forest Service estimates8 the compliance level is around 75%.9 Data needed to perform a standard count data travel cost analysis is available from permit receipts maintained by the Forest Service and state park offices. Each permit lists the date of the trip, the number of people in the group, the entry and exit points, and the home address of the group leader. Most wilderness areas maintain similar records, which helps explain the popularity and usefulness of travel cost models for estimating recreation value (Moeltner 2003). Because the focus of this article is the latent class approach, rather than estimating trail-specific access values or marginal values of trail characteristics, the estimation is simplified by combining all trails accessed through Idyllwild into one destination and all trails accessed through Long Valley into another. This is a reasonable simplification because all permitted hikers first visit one of these two destinations before proceeding to a trailhead, and only a relatively small additional cost is incurred to reach each trailhead after obtaining a permit.
Following conventional methods (e.g., Moeltner 2003; Weber and Berrens 2006), the hiking permit data is combined with the most recent census data (U.S. Department of Commerce 2000) to construct a dataset containing the number of backcountry trips taken from each of the 586 ZIP codes within a 2.5 hour drive of the wilderness and certain population characteristics of each ZIP code that are likely to help explain variation in recreation demand across ZIP codes (e.g., race, gender, age, education level, income).10 The price of a trip from each ZIP code is estimated to be the sum of driving costs and time costs. Driving costs are a function of distance (measured from the home address of the group leader using Google Maps11), the average per mile cost of operating a typical car ($0.561/mile; AAA 2005), and the average number of passengers per vehicle (1.5; author’s dataset). Time costs are a function of travel time (also derived from Google Maps) and the opportunity cost of time, which is evaluated at one-third of the average hourly per capita income for each ZIP code (Hagerty and Moeltner 2005). For tramway users, the ticket price and one hour of wait and ride time are added to these amounts. When necessary, costs are adjusted to 2005 dollars using the U.S. Consumer Price Index.12 The dataset also includes voting records on an environmental initiative from the 2000 election (California Secretary of State 2000) to help control for variation in environmental attitudes across ZIP codes.13 Table 2 summarizes the variables used in the statistical analysis; the following section discusses how these variables enter the group membership equation [7] and the demand equation [11].
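The price construction described above can be sketched as follows. The per mile cost, passengers per vehicle, and one-third time value are taken from the text; everything else is an assumption made for illustration, including the function name, the division of driving cost by passengers, the doubling for a round trip, and the example inputs (the tram ticket price in particular is hypothetical).

```python
COST_PER_MILE = 0.561           # dollars per mile (AAA 2005, from the text)
PASSENGERS_PER_VEHICLE = 1.5    # average passengers per vehicle (from the text)
TIME_VALUE_SHARE = 1.0 / 3.0    # share of hourly income used as the value of time

def trip_price(miles_one_way, hours_one_way, hourly_income,
               tram_ticket=0.0, tram_hours=0.0, round_trip=True):
    """Per person price of a trip: driving cost plus time cost plus tram costs."""
    legs = 2 if round_trip else 1   # ASSUMPTION: costs are counted both ways
    # ASSUMPTION: vehicle operating cost is shared equally among passengers.
    driving = legs * miles_one_way * COST_PER_MILE / PASSENGERS_PER_VEHICLE
    time_value = TIME_VALUE_SHARE * hourly_income
    time_cost = (legs * hours_one_way + tram_hours) * time_value
    return driving + time_cost + tram_ticket

# Hypothetical ZIP code: 60 miles and 1.25 hours from the permit office,
# with an average hourly per capita income of $12.
price_idyllwild = trip_price(60.0, 1.25, 12.0)
# Same origin via the tramway, with a hypothetical $20 ticket and one extra hour.
price_long_valley = trip_price(60.0, 1.25, 12.0, tram_ticket=20.0, tram_hours=1.0)
print(round(price_idyllwild, 2), round(price_long_valley, 2))
```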
Definitions and Summary Statistics for Variables used in the Statistical Analysis
V. Model Specification
In any latent class model, the analyst must decide which variables influence behavior within a group (the vector q in equation [11]) and which influence group membership (the vector z in equation [7]). In the context of random utility models with individual data, it seems intuitive to include characteristics of the good in q and characteristics of the individual in z; information on attitudinal variables seems particularly appropriate for inclusion in z, as shown by Boxall and Adamowicz (2002) and Morey, Thacher, and Breffle (2006). However, this is certainly not the only acceptable approach. Provencher, Baerenklau, and Bishop (2002) include some individual characteristics (employment status, senior citizen) as variables in the utility function, while others (age, experience) determine group membership. Morduch and Stern (1997) use the same set of variables in q and z. Gupta and Chintagunta (1994) and Bockenholt (1999) judge whether certain demographic characteristics are likely to influence group membership and whether they would be useful for identifying subpopulations, and include these in z. After finding no significant effects of any available socioeconomic variables on class membership, Scarpa and Thiene (2005) include only a constant in z and mostly site characteristics in q.
In the context of aggregate count data models of recreation demand, the situation is somewhat different. First, there is generally less available data. Characteristics of the good often can be reduced to a site-specific constant when, as is typical, demand is considered over a single season and access value is the primary focus. Demographic information tends to be limited to census data, and attitudes, except what can be gleaned from serendipitous voting data, are not observable. Second, parameters in the demand function, not the utility function, are being estimated, and theory tells us that both individual preferences and site characteristics affect demand. Price and income must be in each group-specific demand, but which other variables ought to shift demand within a group versus between groups is reasonably debatable.
Provencher and Moore (2006) observe that the delineation between q and z in previous applications has been driven by the specific goals of the analyst, for example, understanding how different segments of customers might be influenced by advertising. This article is concerned with endogenous spatial sorting of individuals relative to recreation sites. Therefore the previous discussion about candidate explanations for endogenous sorting is used to motivate the choice of q and z. Each demand equation includes eight variables: a site-specific constant, a site-specific travel cost variable, prop12, white, adultmale, elderly, college, and income (see Table 2).14 The group membership equation includes a constant and the same demographic variables as well as three additional variables thought to explain differing proportions of groups across ZIP codes assuming some type of sorting behavior is taking place in the population. The variable milesID measures the distance to Idyllwild, which is not only the west-side access point for the San Jacinto Wilderness, but also a mountain community that provides residents with nonrecreation amenity benefits.15 The variable milesSIM measures the distance to the nearest similar mountain community that provides comparable recreation opportunities and amenity benefits.16 Positive (negative) coefficients on these variables would indicate that the corresponding group tends to live further from (closer to) these communities. The variable Popdens is included to help further identify whether there exists an amenity value-seeking group in the population. A positive (negative) coefficient on this variable would indicate that the corresponding group tends to live in more (less) densely populated areas. These locational preferences of the latent classes can be matched with the corresponding recreational preferences (defined by the group-specific demand and welfare estimates) to inform the question of spatial sorting.
VI. Results and Discussion
Results are summarized in Tables 3 and 4. Table 3 provides the estimation results for one-, two-, and three-group versions of the model; Table 4 provides some measures of fit and welfare calculations. Model 1 assumes there is only one group in the population and estimates a standard Poisson model. Table 3 shows that 9 of the 10 estimates for this model are significant at the 15% level and below. The travel cost coefficients are negative and highly significant. The signs of the significant demographic coefficients show there is greater per capita demand from ZIP codes with larger percentages of proenvironment voters, whites, adult males, people under age 60, and college graduates. Generally these results are consistent with intuition and with previous studies of recreation demand.
Estimation Results and Significance Levels for One-, Two-, and Three-Group Models
Goodness-of-Fit Measures and Welfare Calculations
Table 4 shows that the regression average deviation (expressed as a percentage of the average number of observed trips per ZIP code) for Model 1 is relatively high at 64%, even though the Poisson model necessarily predicts the correct aggregate number of trips. The estimated per trip equivalent variation (EV) is nearly $16 for the Idyllwild trails and nearly $18 for the Long Valley trails.17 The estimated annual aggregate EV for the entire wilderness is just above $573,000.
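A fit measure of this kind could be computed as sketched below. The text does not give the exact formula, so the sketch assumes the "average deviation" is the mean absolute difference between observed and predicted trips per ZIP code, divided by mean observed trips; the function name and the data are hypothetical.

```python
def average_deviation_pct(observed, predicted):
    """Mean absolute deviation as a percentage of mean observed trips.

    ASSUMPTION: the definition used here is a guess at the measure in Table 4.
    """
    n = len(observed)
    mean_abs_dev = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    mean_obs = sum(observed) / n
    return 100.0 * mean_abs_dev / mean_obs

# Hypothetical trips from four ZIP codes. Predictions match the observed total
# (60 trips) but still miss individual ZIP codes.
observed = [10, 0, 30, 20]
predicted = [14.0, 4.0, 22.0, 20.0]
fit = average_deviation_pct(observed, predicted)
print(round(fit, 1))
```

The example shows why matching the aggregate number of trips does not imply a small average deviation: errors across ZIP codes can offset one another in the total.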
Model 2 introduces a second latent class into the analysis. In terms of fit, this model is a noticeable improvement upon Model 1, particularly with regard to the average deviation measure shown in Table 4. Twenty-two of the 30 parameter estimates in Table 3 are significant at the 15% level and below, including the coefficients on the travel cost variables, milesID, milesSIM, and Popdens. Considering the results displayed in both tables, Model 2 reveals some interesting insights about the latent subpopulations of wilderness users. Model 2 suggests that there exists a group of “hiking enthusiasts” (Group 1) and a distinctly different group of “casual users” (Group 2). Table 4 shows that the hiking enthusiasts make up a relatively small portion of the total population (24%) but tend to hike roughly 10 to 15 times as frequently as the casual users. The hiking enthusiasts value each trip more highly than do the casual users and generate 86% of the recreation value of the wilderness. The significant group membership coefficients in Table 3 show that the hiking enthusiasts tend to live in ZIP codes containing relatively large proportions of people who are nonwhite, relatively young, and do not have college degrees. To be clear, this does not mean that people with these characteristics tend to be hiking enthusiasts; rather it means that hiking enthusiasts tend to live in areas that exhibit these population characteristics. More germane to the purpose of this analysis, these ZIP codes also tend to be more densely populated (i.e., urban) and located further from Idyllwild and similar places, whereas the opposite implicitly holds for the casual users.18 This result is consistent with the second possible sorting outcome mentioned previously: that people may choose to live closer to outdoor recreation sites not for the recreation opportunities, but rather for the other amenity values that open space typically provides.
However, labeling this entire group of “casual users” as “amenity seekers” would be inappropriate, given the large size of the group. Nonetheless, according to the previous analysis of Figures 1 and 2, this result means a one-group model will overestimate the recreation value of the wilderness. Table 4 confirms this: the aggregate value of the wilderness is 40% higher when estimated using Model 1 instead of Model 2.
Interestingly this result is contrary to that of Parsons (1991), who uses an instrumental variables approach to control for endogeneity in the travel cost variable due to sorting behavior. To investigate this discrepancy further, Parsons’ approach is applied to this dataset. The calculated travel cost to each site (travcostID and travcostLV) is regressed on variables similar to those used by Parsons, and the fitted values are substituted for the calculated values to estimate the demand equations for a one-group model.19 Consistent with the preceding results, the “corrected” value of the wilderness (estimated to be $426,712) again is substantially less than that estimated by Model 1 (and notably similar to that of Model 2). This lends further support to the latent class results and, in light of Parsons’ opposing results, also suggests that even the qualitative properties of sorting behavior may be highly case dependent.
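The two-step substitution described above can be sketched generically. The instruments used by Parsons are not listed here, so the example uses a single hypothetical instrument and hypothetical data, with a one-regressor least squares fit written out using only the standard library. The fitted travel costs would then replace the calculated travel costs in the demand equations.

```python
def ols_fit(x, y):
    """One-regressor least squares; returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    sxy = sum((v - mean_x) * (w - mean_y) for v, w in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical instrument and calculated travel costs for four ZIP codes.
instrument = [1.0, 2.0, 3.0, 4.0]
travel_cost = [12.0, 19.0, 31.0, 38.0]

# First stage: regress the calculated travel cost on the instrument.
intercept, slope = ols_fit(instrument, travel_cost)

# Second stage input: fitted values stand in for the calculated travel cost.
fitted_cost = [intercept + slope * v for v in instrument]
print(intercept, slope, fitted_cost)
```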
Figure 3 illustrates the spatial sorting pattern implied by Model 2 and provides some additional insights into the interpretation of the results. It can be seen in the figure that ZIP codes with the highest proportions of hiking enthusiasts tend to be located in the northern and eastern inland regions of southern California (and western Arizona), and in the areas surrounding the cities of San Fernando, Los Angeles, Santa Ana, and San Diego. The lowest proportions of hiking enthusiasts tend to be located in the mountainous regions surrounding the wilderness areas, and in northern and coastal San Diego County, southern and coastal Orange County, and coastal Los Angeles County from the Palos Verdes peninsula north toward Santa Monica and west toward Malibu. These areas of San Diego, Orange, and Los Angeles counties are generally thought to be some of the most desirable places to live in southern California. A plausible, though admittedly speculative, explanation of these results is that proximity to natural resources for purposes of recreation is not an important determinant of residential location choice in southern California. Rather, those who live near such resources primarily benefit from them in more passive ways that typically are not captured by recreation surveys and, due to this relatively steady stream of amenity benefits, feel less compelled to participate in more formal recreation activities. Those who live further from such resources and who do not have access to the same stream of amenity benefits associated with living on the coast or in the mountains feel more compelled to actively participate in recreation activities in these areas to compensate for the lack of amenity benefit flows. 
By this logic, these same people would be more willing to spend an entire day at the beach, for example, and thus would be more likely to end up in a survey of recreational beach users compared to local residents who may frequently be near the beach but for shorter periods of time and in less structured recreational activities (e.g., dining outdoors or jogging near the beach).
Figure 3. Proportion of “Hiking Enthusiasts” by ZIP Code
Returning to the latent class results, Table 3 also elucidates how demand varies within each group according to demographic characteristics. The significant parameter estimates in the group-specific demand equations show that, among the hiking enthusiasts, there is higher per capita demand from ZIP codes with greater proportions of college-educated white males who are relatively wealthy. The reader may note that these characteristics are exactly opposite from those characterizing membership in this group. However, this is not problematic; as mentioned previously, the group membership equation elucidates the types of areas in which hiking enthusiasts tend to live, whereas the demand equation reveals the types of areas from which relatively more or fewer trips are taken by hiking enthusiasts (i.e., the latter is a conditional relationship). Among casual users, there is higher per capita demand from ZIP codes with greater proportions of college-educated white males (similar to the hiking enthusiasts) who are relatively less wealthy (in contrast to the hiking enthusiasts). Comparison of these and other demographic demand coefficients in Models 1 and 2 shows that Model 1 (which can be interpreted as a constrained version of Model 2) is obscuring important heterogeneity in the population. However, care must be exercised in interpreting the results in Table 3 further because there is no information on the demographic characteristics of the actual hikers in the sample, rather only on their ZIP codes of origin.
Model 3 introduces a third latent class into the analysis and provides a small additional improvement in terms of the log likelihood and regression standard deviation, but also exhibits some signs of the limits of this modeling approach. Only 18 of 50 estimates are significant at the 15% level or below. In terms of group-specific welfare and per capita demand, Table 4 shows that the characteristics of Groups 1 and 2 remain relatively unchanged from Model 2. The additional third group is quite small (around 7,400 people) and exhibits very low per capita demand for Idyllwild and very high per capita demand for Long Valley. Borrowing terminology from Morduch and Stern (1995), it appears Model 3 has isolated a random clustering in the data rather than true underlying structure.20 Both the Akaike and Bayesian information criteria in Table 4 provide support for Model 3 (lower values of these criteria are preferable), but as discussed by Provencher, Baerenklau, and Bishop (2002), these and other statistical tests for the number of groups are problematic for various reasons.21 However, the Akaike and Bayesian values are reported here due to their frequent use in latent class modeling for assessing the number of groups in a population.22 With regard to aggregate welfare, Model 3 estimates similar but slightly higher values than Model 2. In contrast to the comparison of Model 2 with Model 1, this result is consistent with the first possible sorting outcome mentioned previously (i.e., living closer to recreation sites for recreation opportunities rather than amenity values). It appears that both sorting mechanisms may be occurring in the population and that the second mechanism is dominating the first in the aggregate.
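For readers who wish to reproduce the information criteria, the standard definitions are AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of estimated parameters and n is the number of observations. The short sketch below applies these definitions. The log-likelihood values and the sample size are placeholders chosen purely for illustration and are not the values in Table 4; the parameter counts follow the pattern implied by the text (50 estimates in Model 3 and 20 additional coefficients per group). The helper `required_gain` is a hypothetical name and shows how large a log-likelihood improvement an additional group must deliver before either criterion falls.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is preferable)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is preferable)."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def required_gain(extra_params, n_obs):
    """Log-likelihood gain above which adding extra_params lowers each criterion."""
    return {"AIC": float(extra_params),
            "BIC": 0.5 * extra_params * math.log(n_obs)}

# Placeholder inputs for illustration only (NOT the values in Table 4).
n_obs = 1000
loglik = {1: -5000.0, 2: -4900.0, 3: -4870.0}
# 10, 30, and 50 parameters for one, two, and three groups.
n_params = {g: 10 + 20 * (g - 1) for g in loglik}

for g in sorted(loglik):
    print(g, n_params[g],
          round(aic(loglik[g], n_params[g]), 1),
          round(bic(loglik[g], n_params[g], n_obs), 1))

print(required_gain(20, n_obs))
```

Because each additional group adds 20 coefficients, the BIC threshold grows with the logarithm of the sample size while the AIC threshold stays fixed at 20 log-likelihood points, which is one reason the two criteria can disagree about the number of groups.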
VII. Conclusions
This article demonstrates how a latent class framework can be applied to aggregate count data models, in particular to address endogenous spatial sorting resulting from unobserved heterogeneity that may impact welfare estimates derived from zonal recreation demand models. The approach does not require ex ante specification of the type of sorting behavior that is occurring in the population, but rather enables the analyst to ascertain the nature of the sorting behavior by interpreting the coefficient estimates. The estimation algorithm works well for a relatively small number of groups, but because increasing the number of groups can greatly increase the number of estimable coefficients (20 additional coefficients per group in this application) and because the data do not contain an explicit identifier of group membership, diminishing returns arise relatively quickly as the number of groups is increased. For the present application, a model with only two or three groups appears optimal. Furthermore, characterization of the groups becomes increasingly difficult as the number of groups increases, which tends to negate the policy relevance of this modeling approach. For this application, both the two- and three-group models provide compelling evidence for the existence of two broad but distinctly different groups in the population. Although the group characterized as “hiking enthusiasts” comprises only 25% of the population, it accounts for 85% of the recreation value associated with the study site. Furthermore, and arguably most importantly, because this group apparently tends to live further from the study site, a zonal demand model that assumes a homogeneous population overestimates the recreation value of the study site by 35% to 40% (assuming a two- or three-group model accurately characterizes the population).
It is evident that neglecting to control for unobserved heterogeneity in aggregate count data models may be a more serious problem for welfare estimation than neglecting to control for observable heterogeneity (Moeltner 2003).
This estimation method and the conclusions it supports clearly invite further testing in a variety of contexts, particularly in light of the contrasting results of Parsons (1991). Applying the method to other existing aggregate datasets would be straightforward and would reveal both the robustness of the method and to what extent the sorting behavior described here appears to exist elsewhere. A broad pattern of substantial over- or underestimation of welfare measures by homogeneous count data models would suggest that the standard modeling practice should be modified. This also would provide a substantial body of knowledge for policy makers who may be interested not only in better recreation value estimates, but also in the distributional implications of their decisions. The modeling framework used here could be extended to incorporate elements of spatial econometrics, or Moeltner’s (2003) consistent aggregation method to determine whether the associated aggregation bias remains relatively small even after controlling for unobserved heterogeneity. Other estimation techniques such as the EM algorithm might be used to determine if larger numbers of broadly relevant groups can be identified, or if idiosyncratic clusters regularly appear in models with relatively few classes. Applying the method to individual rather than aggregate recreation data would help to identify the characteristics of the actual users across groups rather than just the characteristics of their zones of origin. This would seem very useful for better understanding sorting behavior and how it impacts welfare estimates in recreation demand models.
Footnotes
The author is associate professor, Department of Environmental Sciences, University of California, Riverside. Generous research funding was provided by the United States Forest Service Pacific Southwest Research Station through Cooperative Agreement No. 05-JV-11272165-094, and by the Giannini Foundation of Agricultural Economics through a 2007–2008 Minigrant. The views expressed in this article are solely those of the author and not of the Forest Service. Able research assistance was provided by Yeneochia Nsor, Sunil Patel, Victoria Voss, Edgar Chavez, Catrina Paez, and Jose Sanchez. The author also thanks Daniel Hellerstein of the USDA ERS, Armando González-Cabán of the U.S. Forest Service Pacific Southwest Research Station, Melinda Lyon, Jerry Frates, and Roman Rodriguez of the U.S. Forest Service Idyllwild and Long Valley Ranger Stations, Eddie Guaracha of the Mt. San Jacinto State Park Office in Idyllwild, participants at the 2009 AERE sessions in Milwaukee, Dan Bromley, and an anonymous reviewer who provided particularly helpful comments.
↵1 Moeltner (2003) shows that the bias can become more substantial as the amount of heterogeneity increases, but overall the bias remains small for his sample population.
↵2 Wedel et al. (1993) and Bokenholt (1993) effectively assume that zi includes only a constant term and thus estimate the unconditional probability that any individual belongs to each group. This approach generalizes that framework.
↵3 Constant terms that enter the log likelihood additively and thus do not affect optimization have been removed.
↵4 To ensure theoretically consistent aggregation in this model, it must be assumed that individuals within a ZIP code differ only according to their demand parameters d, b, and c. Therefore individuals in the same group and ZIP code are assumed to be homogeneous. Moeltner’s (2003) approach could be used to relax this assumption; but because the welfare discrepancy appears to be small (~5%), this correction is forgone in the present analysis.
↵5 The data for this application were tested for mean-variance equivalence but found to exhibit a substantial amount of overdispersion. Therefore the true data-generating mechanism does not appear to be Poisson. However, assuming the conditional mean function is correct, the Poisson estimates remain consistent.
↵6 Problems other than inefficiency can arise from distributional misspecification, even when using a LEF model, if the analyst intends to rely on properties of the modeled probability other than the conditional mean. However, in this application only the estimated means are utilized.
↵7 This description is adapted from Baerenklau et al. (2010).
↵8 M. Lyon, personal communication, 2007.
↵9 Noncompliance will not bias the estimation results if it is purely random.
↵10 To avoid sampling bias problems, ZIP codes from which no trips were taken are included in the analysis.
↵12 U.S. Department of Labor, Bureau of Labor Statistics (www.bls.gov/data/inflation_calculator.htm).
↵13 The initiative was the Safe Neighborhood Parks, Clean Water, Clean Air, and Coastal Protection Bond Act of 2000 (Proposition 12). It provided $2.1 billion “to protect land around lakes, rivers, and streams and the coast to improve water quality and ensure clean drinking water; to protect forests and plant trees to improve air quality; to preserve open space and farmland threatened by unplanned development; to protect wildlife habitats; and to repair and improve the safety of state and neighborhood parks” (California State Attorney General; League of Women Voters 2000).
↵14 Englin, Holmes, and Niell (2006) proceed similarly, allowing site-specific coefficients for the intercept and travel cost variables while estimating a single coefficient for their demographic variables.
↵15 Distance from the east-side access point, Long Valley, is not included because it is highly correlated with distance from Idyllwild and because the east side of the San Jacinto range is markedly different from Idyllwild and the identified similar sites due to significantly lower precipitation.
↵16 Two such communities were identified: Forest Falls, which provides access to the San Gorgonio Wilderness, and Lytle Creek, which provides access to the Cucamonga and Sheep Mountain Wilderness areas. Each of these areas is similar to the San Jacinto in terms of elevation, vegetation, and wildlife. Each is also located within the San Bernardino National Forest.
↵17 In all instances in Table 4, EV and consumer surplus (not reported) are very similar because income effects in this analysis are small.
↵18 For the mean ZIP code characteristics, increasing milesID and milesSIM by 10 miles each produces a 1% increase in the estimated proportion of hiking enthusiasts.
↵19 Parsons uses a semilog model with individual data to regress calculated travel cost on occupation dummies (used as proxies for the opportunity cost of time), an urban/rural dummy, years of education, and a marriage dummy. Here, a semilog model with ZIP code level data is used to regress calculated travel cost on per capita income (used as a proxy for the opportunity cost of time), percentage of population living in urbanized areas, percentage of population with a bachelor’s degree, and percentage of adult population that is married.
↵20 “One chief concern is that, lacking an explicit identifier of group identity, the mixture model may yield groups simply by chance in the data. For example, when the estimation procedure divides the sample into one large nebulous group and one precisely-estimated group with a small number of observations, it could reflect random clustering rather than true underlying structure” (Morduch and Stern 1995, 269).
↵21 Yang and Yang (2007) provide a thorough evaluation of these and other criteria using Monte Carlo techniques. The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) perform well under certain conditions but are dominated by other criteria under different conditions. AIC and BIC are reported here due to their frequent use in the literature; other criteria may be calculated by the reader using the data in Table 4. For an alternative approach to evaluating the number of latent classes that utilizes the Pearson statistic, see Owen and Videras (2007).
↵22 Though not shown in the tables, a four-group version of the model with 70 parameters also was estimated. The log-likelihood improvement was 0.002%, and both the AIC and BIC were higher than for Model 3.