December 7, 2014

There is a problem with the way houses are traded. Since houses are so heterogeneous, it is hard to estimate what they are worth according to any consistent methodology, and it is hard to place them into a trend. A stock, for example, is traded frequently, so I know that the price I would buy at reflects the market value. The history of its price growth is available, which gives us some clues as to whether this value reflects an underconfident or an overinflated market (an overinflated one might currently be growing at 20% per year, for example). This is not the case for those looking to buy a house: the asking price is the single price point available, and whether it matches market value can normally be determined only by the level of interest, by ad hoc comparisons, or by an expensive valuation process. We need a method that can value houses on a local basis whilst taking into account their heterogeneous nature.


The Repeat Sales Regression method is one very productive approach. It is a widely used methodology for calculating house price indexes: it is used for the Case-Shiller Index, by the Federal Housing Finance Agency, for the UK Official HPI, and for many others. Essentially it is a technical tool for aggregating values for highly heterogeneous and rarely traded assets, and houses are clearly one such set of assets. Over time, house prices in an area, a city or even a country follow a general trend. But houses, even at the most localised level, are very different from each other. A decrepit, cramped studio flat may be next to a luxuriously refurbished 5-bed house, so an average price for an area does not tell us very much about the price we should expect to pay for any particular property.

Likewise, over a given period, the change in the average price paid per sale does not indicate an aggregate change in values in the market, since it depends on precisely which houses have been sold. If lower value houses start to be sold more frequently, this could be misinterpreted as a fall in aggregate market values (and unless we correct for this effect, we cannot know whether a fall in the average reflects a real trend or not).
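A toy illustration of this composition effect, using made-up numbers (the property names and values below are hypothetical): every property is worth exactly the same in both periods, yet the average sale price falls because more low-value properties happen to sell in the second period.

```python
from statistics import mean

# Hypothetical property values (in thousands of pounds).
# These do NOT change between the two periods.
values = {"studio_a": 200, "studio_b": 210, "house_c": 900, "house_d": 950}

# Which properties happened to sell in each period (illustrative only).
sold_period_1 = ["studio_a", "house_c", "house_d"]
sold_period_2 = ["studio_a", "studio_b", "house_c"]

avg_1 = mean(values[p] for p in sold_period_1)
avg_2 = mean(values[p] for p in sold_period_2)
change = avg_2 / avg_1 - 1

print(f"Average sale price, period 1: {avg_1:.1f}k")
print(f"Average sale price, period 2: {avg_2:.1f}k")
print(f"Apparent change in 'the market': {change:.1%}")
```

The average drops by roughly a third even though no individual property has changed in value, which is exactly the distortion a repeat-sales comparison is meant to avoid.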

The Repeat Sales Regression solves these problems. An index is constructed by comparing the price of each sale with the price paid for the same property at an earlier point in time. Hence, if we assume that the nature of the housing stock is, in aggregate, unchanged, we can construct an index that uses all the available transactional data to give an average growth factor for every period covered by the data. This index can furthermore be used to predict values for properties that have already been transacted in the dataset.
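As a minimal sketch of the pairing step, assume transactions are available as `(property_id, period, price)` tuples. This layout and the function name `repeat_sale_pairs` are hypothetical choices for illustration, not the format of any particular dataset. Each sale is matched with the previous sale of the same property:

```python
from collections import defaultdict

def repeat_sale_pairs(transactions):
    """Pair each sale with the previous sale of the same property.

    transactions: iterable of (property_id, period, price) tuples
                  (a hypothetical layout chosen for this sketch).
    Returns a list of (property_id, first_period, second_period, price_ratio).
    """
    by_property = defaultdict(list)
    for prop, period, price in transactions:
        by_property[prop].append((period, price))

    pairs = []
    for prop, sales in by_property.items():
        sales.sort()  # chronological order by period
        # Consecutive sales of one property define one ownership period.
        for (t1, p1), (t2, p2) in zip(sales, sales[1:]):
            if t2 > t1:  # skip two sales recorded in the same period
                pairs.append((prop, t1, t2, p2 / p1))
    return pairs

# Illustrative data: property A sells three times, B twice, C only once.
sales = [
    ("A", 0, 200_000), ("B", 1, 500_000), ("A", 3, 250_000),
    ("C", 2, 300_000), ("B", 4, 600_000), ("A", 5, 275_000),
]
pairs = repeat_sale_pairs(sales)
for prop, t1, t2, ratio in pairs:
    print(f"{prop}: period {t1} -> {t2}, price ratio {ratio:.3f}")
```

Property C contributes nothing here because it was sold only once, while property A contributes two pairs, one for each ownership period.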

Since the Land Registry has released transactional data for all house sales since 1995, the data set available to inform such an endeavour is by now quite large and comprehensive. Indexes based on this data are publicly available, but they are geographically too broad in scope to account for heterogeneous trends arising at the neighbourhood level.


Let's now look at the construction of the regression:

  1. We format the initial dataset as a sparsely populated matrix with one column for each of T time periods of equal duration and one row for each of N ownership periods, where each entry holds the price paid for that residence in that time period (N.B. since each property may have more than one ownership in the dataset, to ensure all pricing information is captured we must capture each ownership with a separate matrix row; this will become clearer as we go on). This gives us a sensible and intuitive mathematical representation of the data set.

    N.B. since we specify that the time periods are equally spaced, we can for simplicity index them with consecutive integers.

  2. We introduce the quantity we want to calculate: the index vector. This holds the house price index (a scalar value) produced for our sampled population at each point in time t. The geometric aggregate growth up to any time t is then given by a ratio of index values, namely the index at t divided by the index at an earlier reference period.
  3. With the terms already set up, hopefully it is clear that we can set up an equation representing our model. In words, the model asserts: "The ratio of the price of a property at time of sale 1 to the price at time of sale 2 is equal to the ratio of the calculated house price index at those two times, plus an individual idiosyncratic error term." The error term represents the deviation of house n from the overall trend in the dataset.
  4. By means of a mathematical derivation that we can omit here, the model equation is transformed into a matrix equation: the Log Price Vector is equal to the Transaction Matrix multiplied by the index vector (expressed in logarithms), plus a vector of the error terms.
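For concreteness, the construction can be written out in the standard repeat-sales form of Bailey, Muth and Nourse (1963). The notation below is an assumption made for this sketch rather than a fixed convention: the symbols P, I, β, f_n, s_n, y and X are chosen here for illustration, the ratio is taken as later sale over earlier sale, and the error term is assumed to enter additively after taking logarithms.

```latex
% Assumed notation (chosen for this sketch):
%   P_{n,t}        price paid in ownership period n at time t
%   f_n, s_n       time periods of the first and second sale, f_n < s_n
%   I_t            index value at time t, normalised so that I_0 = 1
%   \beta_t        \ln I_t, so \beta_0 = 0 and the base-period column is dropped
%   \varepsilon_n  idiosyncratic error of ownership period n
\begin{align}
  \ln\frac{P_{n,s_n}}{P_{n,f_n}}
    &= \ln\frac{I_{s_n}}{I_{f_n}} + \varepsilon_n
     = \beta_{s_n} - \beta_{f_n} + \varepsilon_n
  \\[4pt]
  y &= X\beta + \varepsilon,
  \qquad
  y_n = \ln P_{n,s_n} - \ln P_{n,f_n},
  \qquad
  X_{n,t} =
  \begin{cases}
    +1 & \text{if } t = s_n \\
    -1 & \text{if } t = f_n \\
    \phantom{+}0 & \text{otherwise}
  \end{cases}
  \\[4pt]
  \hat{\beta} &= (X^{\top}X)^{-1}X^{\top}y,
  \qquad
  \hat{I}_t = e^{\hat{\beta}_t}
\end{align}
```

In this reading, y plays the role of the Log Price Vector and X the role of the Transaction Matrix: each row of X has a single -1 at the purchase period and a single +1 at the resale period, which is why the matrix is so sparse.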

Once we have set up these variables, we apply a standard Ordinary Least Squares regression to find the index vector from the Transaction Matrix and the Log Price Vector. This is the output from our model, and it shows the general trend for the houses in our sample. See some preliminary examples of the results:

Hammersmith (W6)

Holland Park (W11)

Shepherd's Bush (W12)

Plotting them together, it is (a little) easier to understand how the trends relate to each other.

A few notes on these graphs:
  • There is a high degree of correlation between the three price trends. The areas neighbour each other (although with slightly different demographics), so we expect the same basic trend. The only difference that is visually evident is that growth in Holland Park becomes slightly faster than growth in Hammersmith.
  • Although we plot overall prices, we might be more interested in the rate of return. In that case, it is worth noting that Shepherd's Bush started from a significantly lower price point, so it has not necessarily underperformed here (although we can also see that it has not overperformed).
  • There is a lot of volatility in the measured indexes. This has two sources:
    1. There is large variance due to the short time intervals used (one month). We only have a limited number of sales in each period (maybe 6-10 on average), so we do not have enough information in any individual period to capture an accurate average price.
    2. There is volatility in the market itself. Volatility in a market is often associated with speculation, so this would be useful information. To measure it, though, we first need to eliminate the confounding variables.

Upcoming Work

On the basis of this information we can also calculate some more concrete (and perhaps more directly useful) metrics. The most important is the predicted price of a house that has already been transacted in our dataset: to obtain it, we simply input the calculated index values, along with the previous transaction price, into the model equation (step 3 in the construction). We also need to investigate the variance, by lengthening the index periods slightly to reduce model volatility and then calculating the remaining variance.
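The estimation and prediction steps can be sketched together as follows. This is a minimal illustration under stated assumptions, not the implementation behind the graphs above: the function names `fit_index` and `predict_price` are hypothetical, period 0 is taken as the base with index 1.0, the error term is ignored when predicting, and the data are synthetic and noise-free so that the fit recovers the index exactly. The least-squares problem is solved through the normal equations in plain Python to keep the example dependency-free.

```python
import math

def fit_index(pairs, n_periods):
    """OLS repeat-sales index from (t_first, t_second, price_first, price_second).

    Period 0 is the base (index 1.0), so its column is dropped. Every other
    period must be tied to the base through the sales for a unique solution.
    """
    k = n_periods - 1
    X, y = [], []
    for t1, t2, p1, p2 in pairs:
        row = [0.0] * k
        if t1 > 0:
            row[t1 - 1] = -1.0  # earlier sale
        if t2 > 0:
            row[t2 - 1] = 1.0   # later sale
        X.append(row)
        y.append(math.log(p2 / p1))  # entry of the log price vector

    # Normal equations: (X'X) beta = X'y
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]

    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("index not identified from these sales")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= factor * A[col][c]
            b[r] -= factor * b[col]

    # Back substitution.
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(A[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (b[i] - tail) / A[i][i]
    return [1.0] + [math.exp(v) for v in beta]

def predict_price(prev_price, t_prev, t_now, index):
    """Scale a previous transaction price by the index growth since then."""
    return prev_price * index[t_now] / index[t_prev]

# Synthetic, noise-free repeat sales generated from a known index.
true_index = [1.0, 1.1, 1.21, 1.15]
raw = [(0, 2, 200_000), (1, 3, 300_000), (0, 1, 150_000), (2, 3, 400_000)]
pairs = [(t1, t2, p, p * true_index[t2] / true_index[t1]) for t1, t2, p in raw]

index = fit_index(pairs, n_periods=len(true_index))
estimate = predict_price(250_000, t_prev=1, t_now=3, index=index)
print([round(v, 4) for v in index])
print(round(estimate, 2))
```

With real data the recovered index would of course not match any "true" index exactly, and the residuals from this fit are the natural starting point for the variance calculation mentioned above.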


References

  • Bailey, M., R. Muth, and H. Nourse. 1963. A Regression Method for Real Estate Price Index Construction. Journal of the American Statistical Association 58: 933-942.
  • Case, K., and R. Shiller. 1987. Prices of Single Family Homes Since 1970: New Indexes for Four Cities. New England Economic Review September/October: 45-56.
  • Goetzmann, W. 1992. The Accuracy of Real Estate Indices: Repeat Sale Estimators. Journal of Real Estate Finance and Economics 5: 5-53.
  • Goetzmann, W. 1993. Accounting for Taste: Art and Financial Markets over Three Centuries. American Economic Review 83: 1370-1376.
  • Kuo, C.L. 1997. A Bayesian Approach to the Construction and Comparison of Alternative House Price Indices. Journal of Real Estate Finance and Economics 14: 113-132.
  • Graddy, K., J. Hamilton, and R. Pownall. 2012. Repeat Sales Indexes: Estimation without Assuming that Errors in Asset Returns Are Independently Distributed. Real Estate Economics 40: 131-166.