Abstract
Omitted, misspecified, or mismeasured spatially varying characteristics are a cause for concern in hedonic house price models. Spatial econometrics or spatial fixed effects have become popular ways of addressing these concerns. We discuss the limitations of standard spatial approaches to hedonic modeling and demonstrate the spatial generalized additive model as an alternative. Parameter estimates for several spatially varying regressors are shown to be sensitive to the scale of the fixed effects and bandwidth dimension used to control for omitted variables. This sensitivity reflects the uncertainty associated with the estimates when the appropriate spatial scale of the controls is unknown. (JEL Q51, Q52)
I. INTRODUCTION
The hedonic model as described by Rosen (1974) remains popular in the environmental valuation literature for valuation of amenities through housing markets. The major concern in hedonic estimation is the issue of omitted variable bias. A symptom that the model does not accurately capture the spatial variation in prices is spatial correlation in the error term. This residual correlation can be caused by misspecification of spatially delineated variables, systematic mismeasurement of the spatial regressors, or spatial covariates omitted from the model. Omitted spatial processes can result in biased parameter estimates and biased standard errors (Anselin 2010).
Recent years have seen many improvements in the way such omitted variables are addressed. Parametric models based on the spatial weight matrix and spatial fixed effects of urban subdivisions have emerged and are increasingly common in hedonic house price modeling. However, these methods have been criticized. In a special issue of the Journal of Regional Science, Gibbons and Overman (2012) and McMillen (2012) criticize the “automatic” use of spatial lag and spatial error components in econometric modeling. Brady and Irwin (2011) discuss developments in land use modeling in the light of this recent critique. We focus our attention on hedonic house price models and the challenge of handling the spatial dimension in these models. In addition we present an alternative approach to modeling the spatial dimension of the housing market in the form of a spatial generalized additive model (GAM).
To the best of our knowledge there are no surveys that explicitly contain a review of the use of spatial methods in hedonic house price modeling. Kuminoff, Parmeter, and Pope (2010) discuss the choice of functional form for hedonic estimation in the light of the increased use of fixed effects or spatial econometrics, which aims to reduce omitted variable bias. Looking at more than 60 published papers, they find that more than half of the hedonic studies apply either spatial fixed effects or spatial econometrics models to address omitted spatial variables. We looked at 21 hedonic studies published between 2010 and 2012 in the Journal of Environmental Economics and Management, Land Economics, Ecological Economics, and Environmental and Resource Economics. Approximately half of these studies used either a spatial error term or a spatial lag term to control for spatial correlation, while the other half used fixed effects and differences-in-differences. As such it is clear that the use of these methods is extensive in current hedonic research. The specification of spatial econometric models varies across studies both with regard to the chosen model (spatial lag, spatial error, or both) and the design of the spatial weight matrix (inverse distance weighting, contiguity, etc.). Similarly, fixed effects are generally based on available spatial entities such as provinces, census blocks, and so forth. Only two of the papers (Heintzelman and Tuttle 2012; Chamblee et al. 2011) contain in-depth discussion of the choice of spatial control and explicitly discuss sensitivity analysis of different spatial specifications.
The aim of this paper is to address the strengths and weaknesses of the standard econometric strategies to address spatial omitted variables in the literature on hedonic house price valuation. On the basis of this discussion, we suggest an alternative spatial model, a spatial GAM, which handles omitted spatial processes nonparametrically. To our knowledge, the use of a GAM for explicit spatial modeling is novel to the peer-reviewed hedonic house price literature. The only similar application of a GAM that we are aware of is in a book chapter by Geniaux and Napoleone (2008). However, their emphasis is different, as they focus on comparison with a geographically weighted regression. They do not consider parametric spatial models or fixed effects, both of which are more commonly used in the hedonic literature. The present paper is divided into two parts. In the first part we provide a general discussion of the standard econometric approaches and introduce the GAM model as an alternative. In the second part we illustrate the discussion with an empirical example. We estimate the hedonic price function using a simple linear model with no spatial corrections, a spatial fixed effects model, and a GAM model. We then vary the spatial unit used for fixed effects and the number of basis functions in the GAM to evaluate the sensitivity of our results to the extent of the spatial control.
Our main critique of the fixed effect and spatial econometric approaches is that the nature of the omitted spatial processes is assumed to be known a priori. In practice there are likely to be several omitted spatial processes at different spatial scales in the housing market. In the fixed effects model a geographical entity such as a school district or another spatial subdivision is assumed to capture the unknown omitted spatial processes. In the standard spatial error and spatial lag models, the spatial processes are assumed to be captured by a spatial weight matrix, which is often based on the 10 or 20 nearest neighbors. We argue that it is inappropriate to place such restrictions on the omitted spatial processes. In contrast, the GAM does not require any assumptions about the structure of the omitted spatial processes in the housing market. The spatial processes are handled by letting the geographical coordinates of each property enter into the model through a smoothing function based on thin plate splines. In practice, the spatial GAM can be understood as a smoothed fixed effects model as opposed to the standard discrete version of the fixed effects model.
The most commonly used parametric spatial econometric approaches have some additional drawbacks. The spatial error model is based on the assumption that the correlation in the error term is a result of processes that are not correlated with the variables in the model. In that setting, the correlation in residuals would lead only to inefficient estimates and biased standard errors, which the spatial error model corrects for. In contrast, the GAM and the fixed effects model treat the omitted spatial processes as an additional regressor. The model specification in a spatial lag model implies the existence of a spatial multiplier effect of the marginal values of the model (LeSage and Pace 2009). Such an effect seems unjustified in hedonic house price models when interpreting the marginal change in prices as a measure of an individual household’s willingness to pay.
When comparing the spatial GAM with alternative approaches and varying the dimension of spatial modeling, we find that estimated coefficients of spatially varying regressors in our data set can be quite sensitive to the way in which the spatial dimension is modeled. We hypothesize that this sensitivity of parameters to different approaches to handling omitted spatial processes confirms that omitted variable bias in the hedonic model remains an issue despite the use of spatial models. Sensitivity analysis with different spatial models and varying choice of fixed effects or basis functions can shed light on how sensitive the parameter estimates are to untestable assumptions about the underlying spatial structure. We conclude in the spirit of Leamer (1983) that spatial sensitivity analysis should be a part of every hedonic study with spatially varying regressors.
II. MODELING THE VALUE OF A RESIDENCE
The point of departure for the hedonic model is that the price of a house reflects its attributes as a composite good. A house in a more attractive location with many amenities tends to be more expensive than a house with fewer amenities, all else equal. For this reason, the transactions in the housing market can be used to place a value on many amenities not traded independently in the market (see Palmquist [2005] for an introduction to the literature). The hedonic method is based on analyzing the equilibrium price in the housing market under the assumption that a continuum of housing types exists. The equilibrium price schedule is determined by the structure of preferences and technologies in the market and can be expressed as a function of the attributes of the house:1
$$P = P(X). \qquad [1]$$
Housing attributes, X, usually include structural variables such as the number of rooms, the size of the living area, and the time of construction, as well as accessibility measures such as distance to the central business district or distance to train stations or highway access. The housing good further has different environmental attributes such as traffic noise exposure (Day, Bateman, and Lake 2007), air quality (Chay and Greenstone 2005), and green space (Abbott and Klaiber 2010a), as well as neighborhood characteristics such as school quality, crime levels, and so forth. Household utility is a function of consumption of housing, H(X), and other goods, C:
$$U = U(H(X), C). \qquad [2]$$
The household chooses a quantity of housing attribute, X, to maximize utility subject to a budget constraint: M = P(X) + C, where nonhousing consumption, C, is the numeraire. Maximizing utility delivers the following first-order condition:
$$\frac{\partial P(X)}{\partial X_j} = \frac{\partial U / \partial X_j}{\partial U / \partial C}. \qquad [3]$$
Utility maximization implies that the change in utility resulting from a marginal increase in Xj exactly equals the change in the house price following the same increase in Xj, all else equal.2 Based on equation [1], the marginal price of a housing attribute is defined as the partial derivative of the house price with respect to the characteristic. The main object of most published hedonic analyses is the recovery of (average) marginal prices, sometimes also referred to as implicit prices. Only a few studies proceed to recover household preferences, as this is subject to several econometric challenges (for further discussion see Epple 1987; Kahn and Lang 1988; Ekeland, Heckman, and Nesheim 2004). The marginal prices can be used to value marginal changes in, for example, environmental amenity levels. For larger changes in amenities, the marginal price can be a poor estimate of the change in welfare.
Omitted Spatial Processes
The unit of analysis in most hedonic studies is the individual transacted dwelling. With the use of geographical information systems, researchers have access to ample data on the location and surroundings of dwellings. However, it remains close to impossible to measure every characteristic of a home and a neighborhood. Location is often described using proximity measures or similar proxies for accessibility to amenities. Such proxies may not always be an accurate reflection of the household’s perception of the amenity. Further, it is not clear how different attributes should enter the hedonic price function. Theory gives little guidance, as the shape of the hedonic price schedule is determined by the preference parameters and technology parameters together (for an excellent discussion see Ekeland, Heckman, and Nesheim 2004). Environmental amenity access, general accessibility, and neighborhood characteristics all vary with location. Misspecification or mismeasurement of such spatially delineated variables can result in spatial autocorrelation in the model residuals and biased parameter estimates (Anselin and Lozano-Gracia 2008). The same occurs if a spatial regressor is omitted from the analysis. Take the hedonic model written below, where, for simplicity, two spatially delineated attributes X1 and X2 determine the price of the house, and e is a random independent and identically distributed error term:
$$P_i = \beta_0 + X_{1i}\beta_1 + X_{2i}\beta_2 + e_i. \qquad [4]$$
Suppose that X1 and X2 are uncorrelated and linearly related to Pi, and X2 is wrongly specified with a log transformation. When estimating the hedonic equation, we will find $P_i = \hat{\beta}_0 + X_{1i}\hat{\beta}_1 + \ln(X_{2i})\hat{\beta}_2 + \hat{u}_i$, where $\hat{u}_i = X_{2i}\beta_2 - \ln(X_{2i})\hat{\beta}_2 + e_i$, which will vary across space as X2 varies across space. If X2 is omitted from the analysis, the error will consist of $\hat{u}_i = X_{2i}\beta_2 + e_i$. Finally, if X2 has been mismeasured, so the model is estimated with $\tilde{X}_{2i} = X_{2i} + v_i$, then the error term consists of $\hat{u}_i = e_i - v_i\beta_2$. The estimated coefficient will be biased toward zero if the measurement error is uncorrelated with the true X2. In that case, there may be no resulting spatial autocorrelation. However, if the measurement error is also correlated with X2, the bias will depend on the sign and size of the correlation, and in addition, the error term û will be spatially correlated. Additionally, measurement error can be inherently spatial if it arises from interpolation of variables only measured at discrete points (e.g., air pollution data). If the data are interpolated without attention to, for example, wind direction or barriers in the landscape, the result would be a spatially correlated measurement error. We refer to all three causes of spatial correlation as “omitted spatial processes.”
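The attenuation result for classical measurement error can be illustrated with a small simulation. The sketch below is not part of the empirical analysis; the sample size, the coefficient value, and the unit variances are hypothetical choices made only for illustration. With equal variances for the true attribute and the measurement error, the slope estimated on the mismeasured attribute should be roughly half the true coefficient.

```python
import random


def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


random.seed(0)
n = 20000          # hypothetical sample size
beta2 = 2.0        # hypothetical true coefficient on X2

x2 = [random.gauss(0, 1) for _ in range(n)]      # true attribute
e = [random.gauss(0, 1) for _ in range(n)]       # i.i.d. error term
price = [beta2 * x + err for x, err in zip(x2, e)]

v = [random.gauss(0, 1) for _ in range(n)]       # measurement error, independent of X2
x2_obs = [x + m for x, m in zip(x2, v)]          # mismeasured attribute

slope_true = ols_slope(x2, price)      # close to beta2
slope_obs = ols_slope(x2_obs, price)   # attenuated toward zero
print(round(slope_true, 2), round(slope_obs, 2))
```

The attenuation factor in this setup is var(X2) / (var(X2) + var(v)), which equals one half under the assumed unit variances.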
The preceding discussion assumes that X1 and X2 are uncorrelated. However, spatially delineated regressors are often correlated (Panduro and Thorsen 2014). Homes near the central business district tend to be far from natural areas or agricultural fields; industry is often located near waterways or other infrastructure. When this is the case, it is not unlikely that mismeasurement, misspecification, or omission of a spatially delineated variable can bias parameter estimates of other spatially varying regressors in the model, as well as the parameter estimate of the affected variable. In practice, there are likely to be several omitted spatial processes in a hedonic data set that vary at different scales. As environmental amenities tend to be inherently spatial in nature, omitted spatial processes are particularly troubling in the hedonic valuation literature. Our main focus here lies on methods intended to aid in the recovery of robust marginal prices for spatially delineated attributes. In the following, we discuss the techniques most commonly used in the literature.
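How an omitted attribute contaminates the coefficient of a correlated regressor can be shown with a noise-free toy example. The numbers below are hypothetical and chosen only so that the arithmetic is transparent: the slope from the regression that omits X2 equals the true coefficient on X1 plus the coefficient on X2 times the slope from regressing X2 on X1.

```python
def ols_slope(x, y):
    """Slope from a bivariate OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


beta1, beta2 = 1.0, 2.0                  # hypothetical true coefficients
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]           # included attribute
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]           # omitted attribute, correlated with x1
price = [beta1 * a + beta2 * b for a, b in zip(x1, x2)]   # no noise, for clarity

short_slope = ols_slope(x1, price)       # regression that omits x2
delta = ols_slope(x1, x2)                # how x2 moves with x1
print(short_slope, beta1 + beta2 * delta)
```

The two printed values coincide, so the gap between the estimated and the true coefficient on X1 is entirely the omitted variable term.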
Parametric Spatial Econometrics
In spatial econometrics, spatial relationships are modeled parametrically through the use of weight matrices identifying the relevant neighboring observations. There are several different varieties, with the most common being the spatial error model, the spatial lag model, and the combined model containing both spatial error and spatial lag processes. These spatial econometric models impose strong assumptions about the structure of spatial correlation in the data. A general spatial autoregressive model with spatial autoregressive errors for the price of a house, Pi, is given by
$$P_i = \rho\, w_{pi}' P + X_i\beta + u_i, \qquad [5]$$
where
$$u_i = \lambda\, w_{ui}' u + \varepsilon_i, \qquad [6]$$
with $w_{pi}'$ and $w_{ui}'$ denoting the ith rows of the matrices $W_p$ and $W_u$, and $\varepsilon_i$ an independent and identically distributed innovation.
The matrices Wp and Wu are usually referred to as the spatial weight matrices and often coincide in applications. The diagonal elements are all zero, and off-diagonal elements can be ones (contiguity indicators) or a function of distance between observations. In practice, the weight matrices are row standardized, so the term $\rho\, w_{pi}' P$ corresponds to a weighted average price of neighboring observations, scaled by ρ. The parameters λ and ρ are commonly known as the autocorrelation coefficients. Intuitively, the autocorrelation coefficients will be positive in most house price analyses, reflecting clustering in high- and low-value (residential) areas. In the spatial error model ρ = 0, and in the spatial lag model, λ = 0, so that the errors, u, are independent and identically distributed. The researcher usually chooses the relevant dimension of the weight matrix, that is, how many neighbors to include or which distance boundary to set for neighborhood effects. This choice is made either based on tests, for example, Moran’s I test for spatial correlation, or justified by referring to the existing literature. A notable exception is the spatial error model of Hoshino and Kuriyama (2010), where the relevant distance is estimated.
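To make the mechanics of the weight matrix concrete, the following sketch builds a distance-band weight matrix for four hypothetical homes on a line, row standardizes it, computes the spatially lagged price, and evaluates Moran's I. The coordinates, prices, cutoff, and function names are all illustrative assumptions and do not come from the data used in this paper.

```python
import math


def distance_band_weights(coords, cutoff):
    """Binary weights: 1 if two distinct points lie within `cutoff`, else 0."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= cutoff:
                w[i][j] = 1.0
    return w


def row_standardize(w):
    """Scale each row to sum to one (rows without neighbors stay zero)."""
    out = []
    for row in w:
        total = sum(row)
        out.append([v / total if total > 0 else 0.0 for v in row])
    return out


def spatial_lag(w, values):
    """Weighted average of neighboring values for each observation."""
    return [sum(wij * v for wij, v in zip(row, values)) for row in w]


def morans_i(w, values):
    """Moran's I statistic for `values` under the weight matrix `w`."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in w)
    num = sum(w[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    den = sum(zi * zi for zi in z)
    return (n / s0) * num / den


coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]   # hypothetical locations
prices = [100.0, 110.0, 200.0, 210.0]                        # hypothetical prices

w = row_standardize(distance_band_weights(coords, cutoff=1.0))
lagged_price = spatial_lag(w, prices)
moran = morans_i(w, prices)
print(lagged_price, round(moran, 3))
```

Changing `cutoff` changes which homes count as neighbors, which is precisely the choice of weight matrix dimension that the researcher has to make.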
Autocorrelation in the error term can affect inference through incorrect standard errors. When errors are positively correlated across space, failure to account for this correlation can lead to overestimating significance levels. If spatial correlation derives from omitted spatial components correlated with regressors included in the model, the estimated parameters will additionally be biased. Spatially delineated amenities tend to be correlated with each other and, hence, also with omitted or misspecified spatial characteristics. Therefore, the conditions under which the spatial error model is valid are unlikely to be satisfied in most housing market applications. Essentially, the spatial error model is similar to the use of random effects in panel data estimation or feasible generalized least squares. McMillen (2012) emphasizes that the spatial error model is a form of spatial smoother, where the number of neighbors plays a role similar to the choice of bandwidth in terms of kernel smoothing or basis dimension in terms of splines.3
The spatial lag model in turn implies that there exist direct spillover effects between house prices of neighboring properties. LeSage and Pace (2009) give some interpretations of the spatial lag model, not all of which are consistent with the equilibrium assumption underlying the hedonic approach. In particular, such spillovers could describe the process of neighborhood gentrification, in which wealthier households move in and in doing so change the composition of the neighborhood, which leads to higher prices and so on.4 It seems unlikely that the hedonic price function should remain the same in a new equilibrium if the composition of a neighborhood changes. Alternatively, the spillover can be interpreted as an information effect. If sellers and buyers are unsure of the appropriate value of a property given its characteristics, they may infer the appropriate price from looking at nearby properties with similar characteristics that have been sold recently. The information contained in previous transactions in the same area may also allow the household to form expectations about the future evolution of the prices in the area. For each of these interpretations, however, it is clear that there should be a time subscript t indicating that the spillover effect occurs from recently sold properties to future sales and not vice versa. In most applications of the spatial lag model, that distinction is not made. The lagged dependent variables are not fixed regressors and are likely to fall under the definition of “bad controls” of Angrist and Pischke (2009). As a thought experiment, consider the expansion of a park in an area. An increase in access to the park will raise the price of not just one home but also the neighboring properties. The prices of surrounding properties are themselves outcome variables and as such affected by changes in the attractiveness of the location.5
Gibbons and Overman (2012) emphasize the need to think of the theoretical context before specifying the model rather than choosing a model based on statistical tests. This becomes especially important as the spatial lag model implies the existence of a spatial multiplier on marginal effects that leads directly to higher marginal prices. LeSage and Pace (2009) distinguish between average direct, indirect, and total impacts, depending on whether one looks solely at the estimated coefficient or accounts for neighboring observations. A similar interpretation is given by Won Kim, Phipps, and Anselin (2003), where the marginal price of a housing characteristic (total impact) becomes
$$\frac{\partial P}{\partial X_j} = \frac{\beta_j}{1 - \rho}. \qquad [7]$$
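The size of the spatial multiplier can be checked numerically. The sketch below assumes a row-standardized weight matrix for four hypothetical homes and hypothetical values of β and ρ. It accumulates the series I + ρW + ρ²W² + ..., which is the expansion of the inverse of (I − ρW), and compares the resulting average total impact with β / (1 − ρ).

```python
def mat_vec(w, v):
    """Multiply the matrix w (list of rows) by the vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]


def average_total_impact(w, rho, beta, terms=200):
    """Average effect on price of a unit increase in X at every location,
    accumulated from the series I + rho*W + rho^2*W^2 + ..."""
    n = len(w)
    effect = [beta] * n          # direct effect (the rho^0 term)
    total = effect[:]
    for _ in range(terms):
        effect = [rho * e for e in mat_vec(w, effect)]   # next-order spillover
        total = [t + e for t, e in zip(total, effect)]
    return sum(total) / n


# Hypothetical row-standardized weights for four homes on a line.
w = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
beta, rho = 0.1, 0.5             # hypothetical coefficient and autocorrelation

total = average_total_impact(w, rho, beta)
print(total, beta / (1 - rho))
```

With ρ = 0.5 the total impact is twice the estimated coefficient, which shows how strongly the interpretation of the spatial lag model affects the implied marginal price.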
When spatial correlation is positive and large, this multiplier can be quite significant. Small and Steimetz (2006) argue that the multiplier should be applied only when the spillover is technological, but not when it is purely informational. It is very hard empirically to distinguish between these interpretations, as the model does not identify the source of the spillover. It seems unintuitive that the addition of, for example, an additional square meter of living area to a house should have value for all the neighbors. Nor does it seem intuitive that this value including spillovers should correspond to the individual household’s willingness to pay for the improvement, which is essentially what the model implies in a hedonic context. Several alternatives are available to address spatial correlation, which are more in tune with the theory underlying the hedonic model.
A reason for the common use of the spatial lag model is that model estimations are likely to find the autoregressive parameter to be significantly different from zero. This finding may just as well be due to omitted variable bias and in any case does not imply that the chosen model correctly captures the spatial processes (see also McMillen 2012). A neighboring home has similar spatial characteristics and the price of that home will proxy for any omitted spatial characteristics if it is included in the regression. From equation [5], it is clear that the elements in WpP and u will be correlated. As a result, an instrumental variable approach is used for consistent estimation (e.g., Kelejian and Prucha 2010). The instruments are constructed based on spatially lagged characteristics. For housing market applications, the instruments would be a weighted combination of the characteristics of nearby properties. If prices are high in an area, the households living there tend to be wealthy, and the average wealth of a neighborhood will be correlated with crime rates, school quality, and the neighborhood’s general appearance, as well as the size and style of a home. The lagged dependent variable is likely to be a proxy for these often unobserved characteristics. In some cases, the instruments used to identify the spatial lag model may not be redundant in the model in the first place. In other words, estimating a significant spatial correlation coefficient does not imply that the model is a true representation of reality.
We find the spatial error model and the spatial lag model to be unsuited for hedonic analysis and concentrate on the fixed effects model and the GAM in what follows.
Spatial Fixed Effects
Spatial fixed effects are quite popular as a control for omitted variables. Basically, fixed effect estimation corresponds to including a dummy variable for belonging to a geographical entity in the data. In that sense, fixed effects are similar to the use of spatial weight matrices with contiguity indicators, except the matrix is not centered on each observation but rather identifies observations belonging to the same spatial entities. With fixed effects, each observation belongs to only one neighborhood entity, whereas an observation can belong to several neighborhoods, as these are defined by the weight matrix in the spatial econometric models.
Fixed effects imply discrete shifts in the level of house prices as one moves across the border of the entity used to establish the panel structure. The effect of omitted variables is constrained to be constant within the entity and vary only between entities. Just as the specification of the weight matrix determines the relevant neighborhood size, the fixed effect should coincide with the level at which the omitted variables vary in order to be effective. In the framework of equation [4] if X2 varies discretely, as with school quality across school attendance zone boundaries, a fixed effect at the school attendance zone level is able to control for school quality. It does not matter how exactly school quality enters the hedonic price function as long as it varies only between school attendance zones and not within them. The fixed effect can handle mismeasurement, misspecification, and omission of a variable if the fixed effect is specified at the correct scale.
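The importance of matching the fixed effect to the scale of the omitted variable can be seen in a stylized, noise-free example. Everything below is hypothetical: 60 homes along a line, an omitted attribute that is constant within zones of 10 homes, and an observed regressor correlated with it. The within (fixed effects) estimator is computed for entities of 60, 30, and 10 homes.

```python
def within_slope(x, y, groups):
    """Fixed effects (within) estimator for one regressor: demean x and y
    inside each group, then run OLS on the demeaned data."""
    sums = {}
    for xi, yi, g in zip(x, y, groups):
        sx, sy, count = sums.get(g, (0.0, 0.0, 0))
        sums[g] = (sx + xi, sy + yi, count + 1)
    sxy = sxx = 0.0
    for xi, yi, g in zip(x, y, groups):
        sx, sy, count = sums[g]
        dx, dy = xi - sx / count, yi - sy / count
        sxy += dx * dy
        sxx += dx * dx
    return sxy / sxx


n = 60
pos = list(range(n))                              # position of each home on a line
z = [float(s // 10) for s in pos]                 # omitted attribute, constant within zones of 10
x = [zi + (s % 2) for zi, s in zip(z, pos)]       # observed regressor, correlated with z
price = [1.0 * xi + 3.0 * zi for xi, zi in zip(x, z)]   # true coefficient on x is 1

# Entity size 60 is a single group, which is equivalent to no fixed effects.
slopes = {size: within_slope(x, price, [s // size for s in pos])
          for size in (60, 30, 10)}
print({size: round(b, 3) for size, b in slopes.items()})
```

Only the entities that coincide with the zones of the omitted attribute recover the true coefficient of 1; the coarser entities leave part of the omitted variation in the estimate.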
Unfortunately, the nature of most omitted processes is unobservable to the econometrician. In existing studies, fixed effects are usually created based on availability of, for example, administrative units such as provinces (Brounen and Kok 2011), counties (Deaton and Vyn 2010), or municipalities (Cavailhes et al. 2009), or they are created from the object of interest, for example, beaches (Gopalakrishnan et al. 2011) or lakes (Walsh, Milon, and Scrogin 2011). If the researcher has a clear idea about what the omitted variables are and at which scale they vary, the fixed effect unit can be chosen appropriately. If there is some uncertainty, the use of small entities is preferable to the use of larger ones with respect to controlling appropriately for omitted variable bias.
However, fixed effects based on small entities demand a lot of variation in the data. There must be sufficient spatial or temporal variation in the data within the spatial entity to distinguish the variable of interest from the fixed effect. In the extreme case of a repeat sales model, only the effect of variables that change over time can be identified. In cross sections, using census block or school district fixed effects can make it very difficult to recover impacts of amenities such as air pollution or airport noise, which vary little across space. Any effect they might have on housing prices is likely to be “sucked up” in the fixed effect.6 For proximity measures as well, the use of fixed effects can make estimation of parameters for such amenities as park access difficult. The variation in proximity to the nearest park within a spatial entity declines as the entity becomes smaller. Proximity measures of spatially delineated amenities often focus on the nearest amenity. This implies that access to amenities that are scarce in the landscape is more likely to be confounded with omitted neighborhood variables than proximity to spatially delineated amenities found at several locations. As such it is more difficult to identify any effect these scarce amenities might have on housing prices.
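The loss of identifying variation under fine fixed effects can be quantified as the share of a regressor's total variation that remains within the entities. The sketch below uses a hypothetical proximity measure, the distance to a single park located at one end of a line of 60 homes, and computes that share for progressively smaller entities.

```python
def within_share(x, groups):
    """Share of the total variation in x that remains inside the groups."""
    n = len(x)
    grand_mean = sum(x) / n
    total = sum((xi - grand_mean) ** 2 for xi in x)
    sums = {}
    for xi, g in zip(x, groups):
        s, count = sums.get(g, (0.0, 0))
        sums[g] = (s + xi, count + 1)
    within = sum((xi - sums[g][0] / sums[g][1]) ** 2 for xi, g in zip(x, groups))
    return within / total


pos = list(range(60))                       # hypothetical homes along a line
dist_to_park = [float(s) for s in pos]      # park assumed to sit at position 0

shares = {size: within_share(dist_to_park, [s // size for s in pos])
          for size in (60, 30, 10, 2)}
print({size: round(v, 4) for size, v in shares.items()})
```

In this example less than 3% of the variation in the proximity measure is left once the entities shrink to 10 homes, so the coefficient has to be identified from very little information.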
Remaining spatial correlation in the error term can be accounted for by clustering errors within entities to avoid overestimating significance levels. Fixed effects are easy to implement and can be more flexible than the parametric models in capturing the unknown urban structure, depending on the size of the unit used for the fixed effects. The flexibility comes at the cost of a loss of degrees of freedom when many fixed effects are included. Essentially, where the standard spatial econometric model estimates one or two spatial lag parameters, the fixed effects model estimates additional parameters corresponding to the number of entities in the data set. As data availability increases, sample sizes have also grown to make this constraint less binding.
Generalized Additive Modeling: A “Flexible” Fixed Effect
Several nonparametric and semiparametric approaches exist to account for spatially varying data. All of these methods are based on the recognition that the researchers have limited knowledge of the spatial structure and processes in the data. Rather than impose structure on the data, the data are allowed to speak. One such alternative is the locally weighted regression (see, e.g., McMillen 2012). The locally weighted regression is characterized by locally estimating parameters leading to variation in parameters across space. This variation will reflect any omitted spatial processes insofar as they correlate locally with the variables included in the model.7 The GAM discussed below allows variation in the overall level of prices across space through a flexible fixed effect, but keeps parameters constant.8 A general introduction to GAMs is given by Wood (2006). Essentially the model can be written as
$$E(P_i) = g^{-1}\big(X_i\beta + f(x_{lon,i}, y_{lat,i}, k; \alpha)\big), \qquad [8]$$
where $g^{-1}(\cdot)$ is the inverse of the link function, and $f(x_{lon}, y_{lat}, k; \alpha)$ is a smoothing function of the spatial coordinates capturing the exact location of the property. The smoothing function is made up of the sum of k thin plate regression spline bases $b_m(x_{lon}, y_{lat})$, each multiplied by its coefficient, $\alpha_m$: $f(x_{lon}, y_{lat}, k; \alpha) = \sum_{m=1}^{k} \alpha_m b_m(x_{lon}, y_{lat})$.
The thin plate spline bases are a series of known polynomials of increasing complexity in the two variables. As such the bases enter the model in a way similar to the other regressors. The coefficient vector, α, in the smoothing function is estimated jointly with the other parameters in the hedonic price function. The researcher must choose the flexibility of the model by setting the number of basis functions, k. This is a balancing act between accurately capturing the locational attribute and overfitting the model (bias versus variance). The risk of overfitting is also addressed directly by including a penalty on “wiggliness” in the estimation procedure. This penalty, θ, is determined from the data using generalized cross-validation or similar techniques. The penalty enters the objective function directly through an additional term capturing wiggliness in the smoothing function. The coefficient vector B, containing both the parameters from the smoothing function, α, and the remaining parameters, β, is estimated based on the expression
$$\hat{B} = \arg\min_{B} \; 2\,[l_{max} - l(B)] + \theta \iint \left[ \left(\frac{\partial^2 f}{\partial x_{lon}^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial x_{lon}\,\partial y_{lat}}\right)^2 + \left(\frac{\partial^2 f}{\partial y_{lat}^2}\right)^2 \right] dx_{lon}\, dy_{lat}. \qquad [9]$$
Here, the first term in the expression is the deviance measuring the difference between the saturated model likelihood, lmax, and the likelihood of the reduced model that we are estimating, l(B). The second term captures the penalty on variance using the second derivatives of the smoothing function to describe its wiggliness. The objective function thereby explicitly contains the trade-off between bias and variance. In practice the model is estimated using penalized iteratively reweighted least squares.9
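The role of the penalty can be illustrated with a deliberately simplified, one-dimensional analogue. The sketch below is not the thin plate regression spline estimator and not the mgcv implementation: it fits one value per observation along a line and penalizes squared second differences in place of the integrated squared second derivatives, under a Gaussian likelihood so that the deviance reduces to a sum of squared residuals. The data and the values of θ are hypothetical.

```python
def second_diff_penalty(n):
    """Return D'D, where D takes second differences of a length-n vector."""
    p = [[0.0] * n for _ in range(n)]
    for r in range(n - 2):
        d = {r: 1.0, r + 1: -2.0, r + 2: 1.0}
        for i, di in d.items():
            for j, dj in d.items():
                p[i][j] += di * dj
    return p


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def penalized_fit(y, theta):
    """Minimize sum (y_i - f_i)^2 + theta * sum (second differences of f)^2,
    whose solution satisfies (I + theta * D'D) f = y."""
    n = len(y)
    p = second_diff_penalty(n)
    a = [[(1.0 if i == j else 0.0) + theta * p[i][j] for j in range(n)]
         for i in range(n)]
    return solve(a, y)


def wiggliness(f):
    """Sum of squared second differences of the fitted values."""
    return sum((f[i] - 2 * f[i + 1] + f[i + 2]) ** 2 for i in range(len(f) - 2))


y = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 5.0, 8.0]      # hypothetical price levels along a line
fits = {theta: penalized_fit(y, theta) for theta in (0.0, 1.0, 100.0)}
print({theta: round(wiggliness(f), 4) for theta, f in fits.items()})
```

With θ = 0 the fit reproduces the data exactly, and as θ grows the fitted surface approaches a straight line. In the GAM this trade-off is chosen from the data, for example by generalized cross-validation, rather than set by hand as here.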
The higher the choice of k, the less spatial variation remains in the data to be explained by other variables. In this way, there is a clear parallel between the choice of k and the scale of the spatial fixed effect. Essentially, it is difficult to separate the influence of included covariates from that of omitted processes when both vary on a spatial scale. We require the included spatial covariates to vary on a finer spatial scale than the omitted spatial processes in order to identify them in the model. In comparison with the fixed effects estimator discussed above, there will be no discrete changes in the level of house prices across space in the GAM. The location component instead acts as a sort of “flexible” fixed effect describing the landscape. Sensitivity to the scale of the fixed effect can then be carried out easily by varying the choice of k, as we demonstrate below.
In the empirical example in this paper, the focus is on the flexible fixed effect, but it should be noted that the GAM has several other properties desirable for hedonic analysis. In addition to smoothing the spatial coordinates, the GAM can also include smoothing terms for other regressors, for example, to determine an appropriate functional form.
III. EMPIRICAL APPLICATION
The purpose of the following empirical exercise is to illustrate the sensitivity of the results from hedonic modeling to different spatial specifications and modeling principles. To that end, we estimated the hedonic price function using three different models: a simple linear model with no attempt to correct for omitted spatial variables, a spatial fixed effects model, and a GAM. For the fixed effects model we then vary the unit (the spatial scale) of the fixed effect. For the GAM, we vary the number of basis functions to evaluate the sensitivity of our results to the level of the spatial correction. We refrain from applying spatial autoregressive models in the following empirical application of the hedonic house price model, given the considerable theoretical shortcomings outlined in the previous section.
In each of the models, we model spatial processes in the data in two ways. To capture the finer structure at the neighborhood level, we include a vector of variables Z that describes the average visible characteristics of homes in the neighborhood of dwelling i, measured at street level. These average characteristics are calculated based on all houses (including those not traded within our time frame) in the same street as house i and are intended to proxy for unobservable neighborhood characteristics in close proximity to the individual dwelling. On a large spatial scale, we include a linear measure of the distance to the central business district. The fixed effects model additionally has fixed effects at the level of school districts. For the GAM, we additionally model the location of the property through a smoothing function of the spatial coordinates to capture the smoothed fixed effect. These approaches account for the spatial structure of the housing market at an aggregate level in our models and may be thought of as capturing the land rent gradient together with the measure of distance to the town center.
To facilitate comparison across models, we have made a number of common assumptions: all models are estimated with maximum likelihood estimators and assume a Gaussian distribution. The GAM requires specification of an exponential family distribution for the dependent variable, and we did not want differences in results to rest on different distributional assumptions.¹⁰ For each of the models, the dependent variable was log-transformed before estimation.
We estimate the GAM: [10] where $f(x_{lon}, y_{lat}, k)$ is a smoothing function of the spatial coordinates capturing the exact location of the property. Finally, the fixed effect specification is given by [11] where $a_j$ is the fixed effect for school district j. For the fixed effects model, errors were clustered at the school district level to account for residual spatial correlation.
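Because the displayed equations are not reproduced in this section, the following is only a sketch of the general form that the verbal description implies. Only Z, the smoothing function f of the coordinates with basis dimension k, and the school district effect a_j are named in the text; the symbols P_i (sales price), X_i (structural and proximity regressors), d_i^{CBD} (distance to the central business district), the coefficient vectors, and the error term are assumptions introduced here for illustration.

```latex
\begin{align*}
\text{GAM (cf. [10]):}\quad
  & \ln P_i = \beta_0 + X_i \beta + Z_i \gamma + \delta\, d_i^{CBD}
    + f(x_{lon,i},\, y_{lat,i},\, k) + \varepsilon_i \\
\text{Fixed effects (cf. [11]):}\quad
  & \ln P_i = X_i \beta + Z_i \gamma + \delta\, d_i^{CBD}
    + a_{j(i)} + \varepsilon_i
\end{align*}
```

Here j(i) denotes the school district containing dwelling i. The two forms differ only in how location enters: as a smooth surface in the GAM and as a district-specific intercept in the fixed effects model.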
All models are estimated in R (R Core Team 2012). The nonspatial linear model and the fixed effects model are estimated using software for generalized linear models (GLMs). The GAM is estimated with the mgcv package developed for R by Simon Wood (see, e.g., Wood and Augustin 2002).
Data
The data set covers the transactions of single-family houses in the city of Aalborg, Denmark, over the period 2000–2007. The study area is depicted in Figure 1, which shows the distribution of transacted properties on a map of the buildings in Aalborg. Aalborg is the fourth-largest town in Denmark, with approximately 125,000 inhabitants (2010). In terms of owner-occupied dwellings, approximately half of the available housing units consist of houses. In total, 6,313 transactions were included in the analysis.
The data set contains information about each transaction in terms of price, date, and type of sale. The data also contain information on the structural characteristics of the property, such as the number of rooms and size of the living area. A summary of the control variables in the data set is found in Table 1. The information was extracted from the Danish Registry of Buildings and Housing database, which contains information on all dwellings in Denmark (Ministry of Housing 2012). The data are a “snapshot” of the housing characteristics and are continuously updated. Our data therefore reflect the characteristics of a house in August 2011, when the data were collected. The register contains information on the date of the latest renovation, so it is possible to control for postsale renovations. The exact coordinates of the location of each dwelling are also available. Based on this information and maps from the Danish Geodata Agency (2010), a number of measures of proximity have been calculated using ArcGIS Desktop 10.1 (ESRI 2012), for example, proximity to large roads, industrial sites, and different types of green space.
The neighborhood variables contained in our vector z (see equations [10] and [11]) hold information about the appearance of surrounding properties in terms of the average age, average of dummies for renovation in the years preceding the sale, and the style of the building as captured by roof type and wall type. Finally, the average size of gardens for houses in the same street was included, as this gives an idea of the development’s density. A description and a set of descriptive statistics of the data are found in Appendix A.
Spatial fixed effects are usually intended to capture elements that do not vary within the neighborhood but are unobservable to the researcher. The definition of the fixed effect entity can be thought of as what constitutes a neighborhood in the data. The spatial entities used for spatial fixed effects in the literature are most often based on availability, for example, municipalities, school districts, or zip code areas. To analyze the effect of reducing the size of the fixed effect, we looked at five different groupings in the data, which are of varying size and frequency but can be argued to capture natural groupings in the urban landscape. The crudest definitions are postal code areas (7 units) and school districts (23 units). We also constructed groupings based on barriers in the landscape using geographical information systems software. Such barriers are typically large roads and railway tracks, which cut through the urban landscape and effectively break up neighborhoods. This resulted in 36 units. We then intersected the school districts with the barriers to divide school districts into 56 smaller units. Finally, looking at the clustering of homes on a map, we cut up these 56 units into a total of 87 units for the finest spatial aggregation unit. The divisions can be seen on the map in Figure 2, along with a count of the number of observations within each unit. The finer divisions are nested within the cruder units, which makes it easier to compare across models with different spatial fixed effects.
The number of units and the observations within the units can be found in Table 2. Clearly, as the number of units increases, the average number of observations within a unit declines. Units with fewer than 10 observations were joined with a neighboring unit.
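The joining of sparse units can be sketched as a simple relabeling step. The sketch below is in Python for illustration only (the paper's estimation was done in R); the function name and the `neighbor_of` mapping are hypothetical, since the text does not state how the neighboring unit was chosen.

```python
from collections import Counter

MIN_OBS = 10  # threshold stated in the text


def merge_small_units(unit_of_obs, neighbor_of, min_obs=MIN_OBS):
    """Relabel observations so that sparse units are joined with a neighbor.

    unit_of_obs: list of unit labels, one per transaction.
    neighbor_of: dict mapping a unit to the adjacent unit it should join
                 (hypothetical input; the choice of neighbor is an assumption).
    """
    labels = list(unit_of_obs)
    while True:
        counts = Counter(labels)
        # A unit is mergeable if it is too small and its neighbor still exists.
        small = [
            u for u, n in counts.items()
            if n < min_obs and neighbor_of.get(u) in counts and neighbor_of[u] != u
        ]
        if not small:
            return labels
        unit = min(small, key=lambda u: counts[u])  # merge the smallest unit first
        target = neighbor_of[unit]
        labels = [target if u == unit else u for u in labels]
```

Each pass removes one unit label, so the loop terminates. A unit that is too small but has no usable neighbor in the mapping is left unchanged.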
As we have several control variables describing each house, we limit our discussion of results to a few variables. Access to green space is inherently a spatial variable and has long been a topic for hedonic analysis. Recent surveys include those by McConnell and Walls (2005) and Waltert and Schlaepfer (2010). As green space has been so extensively studied (and such variation in results has been found), it is a useful example for our purposes, namely, to demonstrate sensitivity to spatial modeling choices. We have identified several different types of green space that differ both in the services provided and in terms of their prevalence in the urban landscape (Panduro and Veie 2013). Our discussion of results will mainly focus on parks, natural areas, residential common areas, lakes, and scraplands for brevity. We also discuss the robustness of a few of the so-called structural characteristics of the houses for comparison. These variables are the size of the living area, lot size, and type of wall covering. Variation in these characteristics is not primarily spatial, and parameter estimates should therefore be less sensitive to the choice of spatial model for these variables.
Modeling Spatial Variables
We describe accessibility to green space using proximity to the nearest property in a straight line, with the exception of the common area category. Common area green space is attached to specific residential areas, which means that distance to the nearest common area is generally small. However, the size of common areas varies and is included as our regressor. We work with two different proximity cutoffs ($c_{cutoff}$) for different types of green space to capture different scales of capitalization (Abbott and Klaiber 2010b). Some types of public green space are used for outings, and people would be willing to travel farther to enjoy a stay in such a green space, whereas other types of green space are de facto a club good, for example, because they are small and located out of the way in the middle of a residential area. This should be reflected by capitalization of the latter types at a more local scale. We set the high cutoff at 600 m, reflecting an 8- to 10-minute walking time to parks and natural areas. The lower cutoff for club goods was set at 300 m for the remaining types of green space. The scale of proximity is calculated by $X_{prox} = c_{cutoff} - X_{dist}$, where $X_{dist}$ is distance in a straight line from the house. Further, for homes beyond the cutoff distance the measure of amenity access is set to zero, $\{X_{prox} \mid X_{prox} < 0\} = 0$. The coefficients on the proximity variables are easy to interpret, as amenities are expected to have positive coefficients and disamenities to have negative coefficient estimates. A quadratic specification has been applied to all green space proximity variables, as an earlier study using the same data finds nonconstant marginal effects (Panduro and Veie 2013). The common space variable has not been transformed.
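The proximity transformation described above can be written in a few lines. This is a minimal sketch in Python for illustration only (the paper's variables were built in ArcGIS and the models estimated in R); the function and constant names are assumptions, while the 600 m and 300 m cutoffs come from the text.

```python
def proximity(distance_m: float, cutoff_m: float) -> float:
    """Cutoff minus straight-line distance, set to zero beyond the cutoff."""
    return max(cutoff_m - distance_m, 0.0)


def proximity_terms(distance_m: float, cutoff_m: float) -> tuple[float, float]:
    """Linear and squared proximity terms for a quadratic specification."""
    x = proximity(distance_m, cutoff_m)
    return x, x * x


# Cutoffs from the text: 600 m for parks and natural areas, 300 m otherwise.
HIGH_CUTOFF_M = 600.0
LOW_CUTOFF_M = 300.0

if __name__ == "__main__":
    for d in (100.0, 200.0, 500.0, 600.0, 750.0):
        print(d, proximity_terms(d, HIGH_CUTOFF_M))
```

Because proximity increases as distance falls, a positive coefficient indicates an amenity and a negative coefficient a disamenity, which is the interpretation given in the text.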
IV. RESULTS
Model Estimates
Table 4 includes estimates for the GLM, the fixed effects model, and the GAM. The table contains parameter estimates of the selected regressors from Table 3. The performance of each model is described by R², log-likelihood, and the Akaike information criterion. Parameter estimates of the full models can be found in Appendix B.¹¹
The estimates of the structural variables across all three models are highly significant and vary only marginally. Among the spatial variables, proximity to parks is associated with significantly higher prices, while proximity to scraplands and the size of the nearest common area are associated with significantly lower prices. Proximity to natural areas does not have a significant impact on the house price in the fixed effects model. For the GLM without fixed effects, an effect is significant at the 5% level. The GAM provides a highly significant estimate of the effect of proximity to natural areas and an estimate for proximity to lakes that is significant at the 10% level. It should be noted, however, that the standard errors for the GAM and the GLM do not take account of the clustering of residuals in space. Hence, significance levels for these two models are likely to be overestimated. The estimated coefficients for the spatial variables have the expected signs for all models, except for the common area variable, which has a negative effect. There is little variation in the size of the estimated coefficients, with most confidence intervals overlapping across the models. Spatial externalities such as proximity to the nearest park are modeled as a quadratic function in order to allow for nonconstant effects of increased proximity. The specification implies that there is less value added from reducing the distance to a park by 100 m when at the outset the distance is 600 m than there is when the distance is just 200 m; in other words, the sales price would increase only 0.5% if the distance to a park were reduced from 600 m to 500 m, while the sales price would increase 4.1% if the distance to a park were reduced from 200 m to 100 m.
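The logic of this comparison can be made explicit with a short derivation. The symbols β₁ and β₂ for the linear and squared park proximity coefficients are assumptions introduced here, and proximity is expressed in meters with the 600 m cutoff; the paper may scale the variable differently, so the expressions illustrate the mechanism rather than reproduce the reported percentages.

```latex
\begin{align*}
\ln P &= \dots + \beta_1 X_{prox} + \beta_2 X_{prox}^2,
  \qquad X_{prox} = \max(600 - X_{dist},\, 0) \\
\Delta \ln P \,(600 \to 500\ \text{m})
  &= \beta_1 (100 - 0) + \beta_2 (100^2 - 0^2)
   = 100\,\beta_1 + 10{,}000\,\beta_2 \\
\Delta \ln P \,(200 \to 100\ \text{m})
  &= \beta_1 (500 - 400) + \beta_2 (500^2 - 400^2)
   = 100\,\beta_1 + 90{,}000\,\beta_2
\end{align*}
```

With β₂ > 0, the second change exceeds the first by 80,000 β₂, so the same 100 m reduction is worth more close to the park. The implied percentage change in price is 100(exp(Δ ln P) − 1), which is approximately 100 Δ ln P for small changes.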
The spatial smoothing term in the GAM is highly significant and can be understood as the land rent gradient of the housing market. The spatial smoothing term is mapped in Figure 3. While the two spatial models cannot be compared directly as they are not nested, the fixed effects model is directly comparable to the GLM and performs better on all the displayed model criteria in Table 4. We present no statistic for residual spatial correlation. Residual spatial correlation is often tested through Moran’s I statistic. However, computation of Moran’s I requires specifying a weight matrix to pick out neighbors for which the correlation is calculated. As such it is subject to the same criticism as the spatial parametric models discussed above. An alternative test can be constructed based on the GAM by fitting a spatial smoothing function to the residuals of the model to see if significant systematic variation is present. Below we focus on varying the spatial structure of our model.
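To make the dependence of Moran's I on the weight matrix concrete, the following minimal sketch computes the statistic under a binary distance-band weight matrix. It is written in Python for illustration only (the paper's models were estimated in R); the function names, the distance-band rule, and the toy residuals are assumptions, not the authors' procedure.

```python
import math


def distance_band_weights(coords, bandwidth):
    """Binary weights: w[i][j] = 1 if points i and j lie within the bandwidth."""
    n = len(coords)
    return [
        [
            1.0 if i != j and math.dist(coords[i], coords[j]) <= bandwidth else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def morans_i(values, weights):
    """Moran's I = (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2,
    where z are deviations from the mean and S0 is the sum of all weights."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(zi * zi for zi in z)


if __name__ == "__main__":
    # Toy residuals at four points on a line (made-up values, not from the paper).
    coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
    residuals = [1.0, 2.0, 3.0, 4.0]
    for bandwidth in (1.5, 2.5):
        w = distance_band_weights(coords, bandwidth)
        print(bandwidth, morans_i(residuals, w))
```

In the toy example the statistic changes sign when the bandwidth is widened, which illustrates why a test built on an arbitrary weight matrix inherits the criticism raised against the parametric spatial models. The sketch assumes at least one pair of neighbors and nonconstant values; otherwise the ratio is undefined.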
Spatial Smoothing and Spatial Fixed Effects
The estimated spatial structure with fixed effects and the smoothing spatial function are depicted in the contour plots in Figure 3. The plots show the spatial pattern of the log of the transactions price at the median of the covariates for properties built in the period between 1955 and 1975. There are some differences in the variation in prices across space, depending on the model. The fixed effects model based on school districts recovers much less variation in prices than that found in the GAM. In the GAM, the spatial price trend in Aalborg seems to conform to a wave of high prices rising in the southern part of Aalborg and falling near the industrial area near the harbor. A local depression in prices is found northeast of the city. Note that the rate of decline in prices depends on the direction of movement away from the high-price areas.
According to Wood (2006) the choice of basis dimensions is a part of model specification, and the researcher should aim to ensure that sufficient flexibility is available for the individual application. The results presented in Table 4 are for a basis dimension of k = 40 for the geographical coordinates, which was chosen based on a rule of thumb: k = min{n/4, 40} (see Ruppert 2002). The penalty term, θ, was determined through generalized cross-validation.
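The rule of thumb is simple enough to state as a one-line function. The sketch below is in Python for illustration only; the function name is hypothetical, and rounding n/4 down to an integer is an assumption, since the text does not say how a fractional value is handled.

```python
def rule_of_thumb_k(n_obs: int, cap: int = 40) -> int:
    """Basis dimension from the rule of thumb k = min(n/4, 40).

    Integer division is an assumption made here so that k is a whole number.
    """
    return min(n_obs // 4, cap)


if __name__ == "__main__":
    # 6,313 transactions are reported for the Aalborg data set.
    print(rule_of_thumb_k(6313))
```

With a sample of this size the cap of 40 binds, so the rule fixes only a starting point; the sensitivity analysis then varies k around it.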
Sensitivity Analysis: Variable Robustness
Given that we do not know the appropriate scale at which omitted spatial processes vary, it is important to ask how sensitive the estimates are to the choices made about the fixed effect units or basis dimension. The coefficient estimates for the selected regressors from Table 3 are displayed in Tables 5 and 6 for varying spatial scales of the fixed effects. The estimated parameters for structural (non-spatial) characteristics are robust across the spatial dimensions, while the results for the spatially varying regressors are sensitive to the aggregation level in the fixed effects model and the number of spline basis functions. In the fixed effects model, parameter estimates for proximity to parks remain relatively stable and retain significance levels in most cases, but not with the finest spatial scale. For proximity to scraplands, the estimated parameter remains significant in all cases except the barrier fixed effect, but fluctuates around −0.006. In the GAM, proximities to parks, natural areas, and scraplands are significant up to 60 basis functions for the spatial smoothing. For larger k, the parameter estimates of proximity to parks become insignificant, while those for proximity to nature and scraplands remain significant. The parameter estimates in the GAM for lakes and common areas are not robust. The lake variable is significant only at the 10% level for k = 40 and k = 60, and the size of the nearest common area is associated with significantly different house prices only when there are low values of k. While the coefficient estimates of the selected regressors seem to vary only a little, the level of variation is economically significant. In the fixed effects model the parks estimate varies from 0.004 to 0.001, which corresponds to a difference of 7.2% to 1.8% of the sales price if the distance to a park is reduced from 200 m to 100 m. In the GAM the corresponding estimate varies from 0.002 to 0.001, which indicates a capitalization of between 4.1% and 1.8% of the sales price with similar proximity increases. This example of model interpretation highlights that model estimates in the fixed effects model and the GAM are sensitive to the level of spatial correction of omitted spatial processes.
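A sensitivity check of this kind amounts to collecting the estimate for one regressor across specifications and summarizing its range, sign, and significance. The sketch below is in Python for illustration only; the function name, the input layout, and the 1.96 critical value are assumptions, and the numbers in the usage example are placeholders rather than the estimates in Tables 5 and 6.

```python
def robustness_summary(estimates, z_crit=1.96):
    """Summarize one regressor across specifications.

    estimates: dict mapping a specification label to (coefficient, standard_error).
    Returns the coefficient range, whether the sign is stable, and the
    specifications in which |coefficient / standard_error| exceeds z_crit.
    """
    coefs = [b for b, _ in estimates.values()]
    significant = [
        name for name, (b, se) in estimates.items() if abs(b / se) > z_crit
    ]
    return {
        "min": min(coefs),
        "max": max(coefs),
        "sign_stable": all(b > 0 for b in coefs) or all(b < 0 for b in coefs),
        "significant_in": significant,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only; not taken from the paper's tables.
    example = {"coarse units": (0.004, 0.001), "fine units": (0.001, 0.001)}
    print(robustness_summary(example))
```

A regressor whose sign is stable but whose significance disappears at finer scales is the pattern the text describes for parks: the effect is not refuted, but it can no longer be separated from the spatial controls.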
Scrapland is the only spatial variable that remains robust over both the different choices of fixed effect units and varying levels of k basis functions. The sensitivity of the estimates does not imply that access to, for example, parks has no positive effect on house prices; it simply implies that these effects are hard to distinguish from omitted spatial processes using smaller fixed effect units or a high dimensionality of smoothing splines. Essentially, the fixed effects or smoothing splines compete with the spatial variables in explaining variations in the price levels across space. The spatial variation in the variable of interest must be greater than the variation in the modeled spatial processes, that is, the flexibility in the smoothing function or the size of the fixed effect units. In our case, the access to the (dis)amenity of interest is measured by proximity to the nearest object. For objects that are scarce in the urban landscape, there will be little “within” variation in this measure across space. There are only 13 parks in the Aalborg area, whereas there are more than 200 scraplands spread out across the urban landscape.
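The notion of “within” variation can be quantified as the share of a regressor's total sum of squares that remains after unit means are removed. The sketch below is in Python for illustration only; the function name is hypothetical, and the measure is a simple diagnostic suggested here rather than one reported in the paper.

```python
from collections import defaultdict


def within_share(values, units):
    """Share of a regressor's total sum of squares that lies within units.

    A share near zero means fixed effects for these units absorb almost all
    of the regressor's variation, leaving little to identify its coefficient.
    """
    groups = defaultdict(list)
    for v, u in zip(values, units):
        groups[u].append(v)
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    within_ss = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    return within_ss / total_ss
```

Computing this share for each proximity variable under each of the fixed effect definitions would show how quickly the identifying variation shrinks as units become finer. The function assumes the regressor is not constant, since the total sum of squares would otherwise be zero.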
As the scale of the fixed effects becomes smaller and the basis dimensions increase, the landscape as shown in the maps in Figure 4 becomes more diverse. High-priced areas clearly emerge. While the overall tendencies are quite similar with the two approaches, they do not predict exactly the same levels of prices in the same locations. Also the changes in prices are discrete at the border of the fixed effects but continuous in the GAM. The biggest difference can be found in the fact that the GAM predicts increasing prices as we move south in Aalborg, whereas the fixed effects models predict high prices in the center and declining prices when moving away from the center in almost all directions. The trends in the GAM may be affected by border effects, as the smoothing function also smooths over areas where no transactions are observed in the data.
V. CONCLUDING DISCUSSION
This paper builds on a criticism of the existing spatial (parametric) models, where the spatial structure is specified and treated as “known” either in terms of spatial weight matrices or through the use of spatial fixed effects. We first discussed the hedonic model and how the different spatial econometric approaches relate to the model’s theoretical framework. In particular, the parametric models such as the spatial lag or the spatial error model seem unsuited for hedonic analysis. The spatial lag model is unsuited because it implies spillovers between prices and therefore a spatial multiplier in the marginal prices that households pay for an attribute. Only the interpretation of the spillover as a purely informational effect is consistent with the interpretation of the hedonic function describing a market in equilibrium. The spatial error model assumes that omitted spatial processes causing correlated residuals are uncorrelated with the regressors included in the model. This is unlikely to hold if the regressors include spatially varying characteristics. Because location varies only in two dimensions, spatial variables tend to be correlated with each other.
Aside from these issues specific to the interpretation of the model, the parametric approaches also require the researcher to specify the spatial structure of omitted spatial processes a priori. It can be a rather restrictive assumption that spatial autocorrelation can be corrected by a single spatial entity, as in the fixed effects model. Omitted spatial processes are likely to be present on more than one spatial scale, which implies that the fixed effects model should use entities that are as small as possible to ensure that omitted spatial processes are accurately captured. Ideally a repeat sales model is capable of doing this, but it requires time-series variation in the amenity of interest. We propose an approach to account for omitted spatial processes, which to our knowledge is novel to the literature on hedonic regressions. We model location as a semiparametric function of the geographic coordinates, which allows us to capture a large part of the spatial variation in the data using a GAM. The GAM has less restrictive assumptions about the omitted spatial processes. Essentially, the GAM is data driven in the sense that the specification of the omitted spatial processes is determined by model fit using generalized cross-validation. The choice of basis function dimension for the GAM remains a judgment call for the researcher, however, and sensitivity analysis should be carried out to check robustness of the model estimates.
In the empirical application of the hedonic house price model we estimated a GLM, a spatial fixed effects model, and a GAM. We vary the choice of the spatial scale of the fixed effects and dimensionality of the spatial smoothing splines in the GAM to check sensitivity of the estimated parameters to these choices. The spatial variables, represented by different types of green space, are quite sensitive to the dimension of the spatial model. This is not surprising given that the controls for omitted spatial processes will to some extent compete with the spatial covariates in explaining the data. For identification there must be sufficient variation in the variable of interest independent of the variation in omitted spatial processes, which the models attempt to correct for. In practice it is often difficult to say if the variable of interest varies on a sufficiently fine scale. The estimates for other spatial variables in the fixed effects model and the GAM are equally sensitive to increases in the spatial dimension.¹²
The standard approaches rely on the researcher to specify how the omitted spatial processes vary in order to control for them in the model. Our concern is not with the use of spatial fixed effects in general, but rather with the cases where they are applied “automatically” without much thought for the specific case under study. In some cases, the researcher can choose appropriate fixed effects suited to the exact analysis. However, the most obvious property of most omitted spatial processes is that the researcher does not know at which scale misspecification, mismeasurement, and omitted variables operate. We suggest, therefore, that sensitivity analysis should be conducted using different levels of spatial corrections to determine which results are robust across models. In the existing literature, when spatial econometrics is used, it is rarely the case that results are shown for different choices of weight matrices or fixed effect units. Our findings suggest that omitted spatial processes are likely to play an important role in explaining the varying findings in hedonic models concerned with spatially delineated amenities.
Although the potential for omitted variables bias is reduced through the increased use of geographical information systems to generate data, the inclusion of geographical covariates does not solve the omitted variable problem. Rather, when spatially varying covariates are the main focal point of the analysis, extra care should be taken to ensure that results are robust to different spatial models as long as the true data-generating process is unknown. In some cases spatial variation in the environmental variable in question can be increased through careful modeling of the services or sources of annoyance. Careful attention to the nature of the environmental amenity and the way in which it is perceived by households improves the model’s ability to measure household willingness to pay through reduction of measurement errors. However, for some environmental amenities, cross-sectional hedonic analysis is unlikely to deliver reliable identification. Other methods are needed, such as instrumental variables with exogenous shifts in amenity levels (see Bayer, Keohane, and Timmins 2009) or quasi-experiments (see e.g., Pope 2008), though the latter are harder to interpret in terms of willingness to pay (see Kuminoff and Pope 2014). Alternatively, the use of sorting models as recently described by Kuminoff, Smith, and Timmins (2013) may aid in recovering preferences for neighborhood amenities with the use of instruments derived from the properties of the sorting equilibrium. These new methods provide many exciting opportunities for the study of revealed preferences in the housing market, including valuation of nonmarginal changes and quantifying general equilibrium effects.
Acknowledgments
This paper is based on a chapter from each author’s Ph.D. thesis at the Department of Food and Resource Economics, University of Copenhagen. The authors are grateful for suggestions and comments from Joshua K. Abbott, Elena Irwin, Nicolai V. Kuminoff, Richard H. Spady, Bo J. Thorsen, and Manuel Wiesenfarth, as well as two anonymous referees.
APPENDIX A: DESCRIPTION OF THE DATA SET
APPENDIX B: FULL MODEL ESTIMATIONS
Here we present a table that provides the parameter estimates of all variables included in the three hedonic models discussed in the paper (Table 4), namely, a naive GLM, a spatial fixed effects model based on school districts, and a GAM based on 40 spatial smoothing splines. In all cases a log-link function is used and the errors are assumed normally distributed.
Footnotes
The authors are, respectively, postdoctoral researcher, Zentrum für Europäische Wirtschaftsforschung (ZEW), Mannheim, Germany, and Department of Food and Resource Economics, University of Copenhagen, Frederiksberg, Denmark; and postdoctoral fellow, Department of Food and Resource Economics, University of Copenhagen, Frederiksberg, Denmark.
1. This is the first stage of the hedonic method, where an equilibrium house price schedule is estimated to reveal a household’s marginal willingness to pay for a characteristic. The second stage of the hedonic analysis, where household preferences are recovered, is not discussed in this paper.
2. The assumption of a continuum of houses is crucial to the interpretation of the first derivative of the hedonic price function, but not to its existence. Without a continuum of housing bundles, the equality may not hold, as households may have preferred to have more or less of an attribute at the given “marginal price” than was available.
3. A spline is a combination of a series of basis functions over covariate space. Basis functions can consist of, for example, polynomials of increasing order. A higher number of basis functions translates into more flexibility in the functional form.
4. “We need to keep in mind that the scalar summary measures of impact reflect how these changes would work through the simultaneous dependence system over time to culminate in a new steady state equilibrium” (LeSage and Pace 2009, 37).
5. We thank an anonymous reviewer for bringing Angrist and Pischke’s concept of “bad controls” to our attention.
6. Abbott and Klaiber (2010a) have shown in the context of green space that a spatial Hausman-Taylor model can recover components that do not vary within fixed effect entities. However, that solution requires that good instruments be available for the variable of interest.
7. Geniaux and Napoleone (2008) compare the locally weighted regression with a GAM similar to the one discussed here.
8. It is possible to have spatially varying parameters in a GAM by specifying, for example, a trivariate function of space and the regressor of interest.
9. More information on thin-plate regression splines and the use of generalized cross-validation and alternative methods for fitting a GAM can be found in Wood (2006) and in the vignette for the mgcv package in R.
10. We also compared the model specifications to a spatial error model, which was estimated with assumed normal errors. These results are available from the authors upon request.
11. Given that the data set comprised 8 years of sales it was necessary to adjust for inflation in the house prices. We did this by fitting a fourth-degree polynomial in the date of sale. The models are estimated on the detrended data.
12. Additional results are available from the authors upon request.