Modeling Real Estate Sale Prices

When putting your home up for sale on the real estate market, researching sale prices is always necessary. You can rely on a realtor’s experience in a geographic area, use Zillow’s Zestimate, or troll through listings on various real estate web sites to make somewhat educated guesses. Your best bet is probably to hire a professional. For me, it was of interest to build a model for predicting a price based on other homes for sale in the area.

The goal was to:

  • collect geographically local property data on other properties for sale
  • build a mathematical model with sample data representative of the property attributes
  • plug in the corresponding attributes of my house to predict a sale price

The number of houses on the market changes on a daily basis, as can the prices during the sale period. I wanted the ability to quickly pull current data, build a model and predict a price at any time, so I implemented the following:

  • a script to scrape data from realtor.com
  • code to build a model with property attributes using multivariate linear regression
  • the ability to input the corresponding property attributes of my house into the model to predict a sale price

Scraping Data

The data is scraped from realtor.com. This web site allows browsing properties for sale in a localized geographic region. Data could be scraped for many geographic regions to create a large sample, or the data can be scoped to an area where fewer property attributes become more relevant. Many people reference the Zestimate for home price estimates. For my neighborhood, for example, the estimates are too low; other neighborhoods have estimates that are much too high. I suspect the algorithm used is very general and doesn’t account for geographic uniqueness. An attribute unique to my geographic area is gorgeous mountain views: some properties have exceptional mountain views, others do not. I chose to scope the area of data used for the training set to roughly a 5 mile radius (the actual boundaries are set by realtor.com). The number of properties for sale in the area currently averages about 140. That is a small sample for high accuracy, but it turned out to be big enough to get an estimate comparable to professional advice.

Scraping data for the training set is done using Python and the BeautifulSoup package to help parse the page content. The flow of execution is roughly:

open a file for output
load the realtor.com local region page
while there are more properties to read:
    read property data block
    parse the desired attribute values
    write attributes to output file
    if at end of listing page:
        if there are more pages:
            load next page
close output file
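
In Python with BeautifulSoup, that flow can be sketched roughly as follows. The search URL, CSS selectors and pagination scheme here are placeholders; realtor.com’s actual markup differs and changes over time, so treat this as an outline rather than working scraper code.

# Sketch only: the search URL, CSS selectors and pagination scheme below are
# placeholders, not realtor.com's real markup.
import csv
import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.realtor.com/realestateandhomes-search/SOME-AREA"   # placeholder

def scrape_listings(max_pages=10, out_path="listings.txt"):
    with open(out_path, "w", newline="") as out:       # open a file for output
        writer = csv.writer(out)
        for page in range(1, max_pages + 1):           # load the next page while pages remain
            html = requests.get(f"{SEARCH_URL}/pg-{page}", timeout=30).text
            soup = BeautifulSoup(html, "html.parser")
            cards = soup.select("li.listing")          # one block per property (hypothetical selector)
            if not cards:                              # no more properties to read
                break
            for card in cards:
                row = []
                for sel in (".address", ".broker", ".price",
                            ".beds", ".baths", ".sqft", ".lot-size"):
                    node = card.select_one(sel)        # parse the desired attribute values
                    row.append(node.get_text(strip=True) if node else "")
                writer.writerow(row)                   # write attributes to the output file

if __name__ == "__main__":
    scrape_listings()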

Executing the script produces a file with content looking like:

5845 Wisteria Dr city st zip Homestar Properties, 234900, 4, 3, 1225, 88
165 Timber Dr city st zip, The Platinum Group, 1200000, 7, 5, 7694, 520
6380 Winter Haven Dr city st zip Pink Realty Inc., 544686, 4, 4, 5422, 25
...

The columns correspond to address, realtor name, price, # of bedrooms, # of bathrooms, square footage and acreage. Additional values not shown here are year built, whether or not the property has a view, whether the kitchen is updated and whether the bathroom is updated. The acreage value is multiplied by 100 in the output file to produce an integer value for easier matrix arithmetic. The address and realtor name are not used in building the model; however, they are useful when comparing page scrapes to determine which properties are new to the list and which ones have sold.
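
As an illustrative sketch using only the columns shown in the sample above, one way to turn a scraped line into a price plus a numeric feature row is:

# Illustrative only: convert one scraped line into a price plus a numeric
# feature row, dropping the address and realtor name.
def parse_line(line):
    fields = [f.strip() for f in line.split(",")]
    # the last five fields are numeric: price, bedrooms, bathrooms, square feet, acreage*100
    price, beds, baths, sqft, acres_x100 = (float(f) for f in fields[-5:])
    return price, [beds, baths, sqft, acres_x100]

price, features = parse_line(
    "165 Timber Dr city st zip, The Platinum Group, 1200000, 7, 5, 7694, 520")
print(price, features)   # 1200000.0 [7.0, 5.0, 7694.0, 520.0]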

Building the Model

The output file from the web page scrape becomes the training set used to build the model. I used multivariate linear regression to build the model. It is a reasonable approach for supervised learning where there are many attribute inputs and one calculated output value; in this case, the predicted price is the output.

The hypothesis defining the function used to predict the price is

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4

where

  • h is the hypothesis
  • the right side of the equation is the linear function used to compute the hypothesis
  • x’s represent the attributes of the property and
  • θ’s are the model parameters, which are learned with a gradient descent algorithm

The gradient descent algorithm minimizes over all θ‘s, meaning the θ parameters are iteratively adjusted to find the values that, on average, minimize the deviation of the predicted price hθ(x) from the actual price y across the training set. The update performed at each iteration, and the cost (or deviation) J that it minimizes, look like:

θj := θj − α (1/m) Σ i=1..m (hθ(x(i)) − y(i)) xj(i)      (simultaneously for all j)

J(θ) = (1/(2m)) Σ i=1..m (hθ(x(i)) − y(i))²

where J is the cost function. Fortunately, this can be solved computationally with linear algebra in a straightforward manner. Creating a feature vector of all attributes for each training example and stacking those vectors into a design matrix X, θ can also be solved for directly using the normal equation:

θ = (XᵀX)⁻¹Xᵀy
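
As an illustration (not the Octave code used here), the normal equation is a one-liner in NumPy; the toy design matrix below uses an intercept column plus the square footage and bedroom counts from the sample rows above:

import numpy as np

# Toy design matrix: an intercept column of ones plus two features
# (square footage and bedroom count) taken from the sample rows above.
X = np.array([[1.0, 1225, 4],
              [1.0, 7694, 7],
              [1.0, 5422, 4]])
y = np.array([234900.0, 1200000.0, 544686.0])

# Normal equation: theta = (X'X)^-1 X'y  (pinv guards against a singular X'X)
theta = np.linalg.pinv(X.T @ X) @ X.T @ y
print(theta)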

Currently I’m learning the Octave language. With much learning and experimentation (on my part), the code is relatively straightforward. The code first reads in the training set data file and normalizes each feature dimension. Normalization is done by subtracting the mean of the feature’s values from each element and then dividing each element by the standard deviation of the feature’s values. The important parts of the rest of the gradient descent code look like:

J_history = zeros(num_iters, 1);                 % cost value recorded at each iteration
for iter = 1:num_iters
    % simultaneously update all theta parameters by stepping down the gradient
    theta = theta - alpha * (1/m) * (((X*theta) - y)' * X)';
    % record the cost J(theta) after this update
    J_history(iter) = (1/(2*m)) * (X*theta-y)' * (X*theta-y);
end

where X is the matrix of normalized feature vectors and J_history holds the cost function value at each iteration. The cost function value is plotted to validate that the function is converging to a minimized solution. The plot looks like the following over 400 iterations:

[Figure: cost function convergence over 400 iterations]

Also, for a simple visualization, a two dimensional plot of square footage vs. price gives an idea of how the model performs using one feature: basically a simple linear regression fit using gradient descent.

[Figure: linear regression fit, square footage vs. price (NW Colorado Springs listings)]
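
For reference, the normalization step described above can be sketched outside of Octave as well; here is a rough NumPy equivalent (the mu and sigma it returns are the same quantities reused to normalize new inputs when predicting a price below):

import numpy as np

def feature_normalize(X):
    # scale each feature column to zero mean and unit standard deviation
    mu = X.mean(axis=0)          # per-feature mean
    sigma = X.std(axis=0)        # per-feature standard deviation
    return (X - mu) / sigma, mu, sigma

# toy feature matrix built from the scraped sample rows:
# [bedrooms, bathrooms, square feet, acreage*100]
X = np.array([[4.0, 3, 1225,  88],
              [7.0, 5, 7694, 520],
              [4.0, 4, 5422,  25]])
X_norm, mu, sigma = feature_normalize(X)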

Retrospective

The cost function appears to converge nicely on the training set and a simple visual plot indicates a reasonable fit. To predict prices for a given feature set, the Octave code looks like:

v=[4793, 5, 4, 66, 1986];            % attribute values for the property being priced
price = [1, (v-mu)./sigma] * theta;  % normalize with the training mu/sigma, prepend the bias term, apply theta

where

  • v is the vector of feature values representing the property and
  • price is calculated by normalizing that vector with the same mu and sigma values used to normalize the training data, prepending a 1 for the intercept term, and multiplying by the vector of theta values representing the hypothesis function.

Getting predictive price results for a set of property attributes has produced values comparable to professional opinion, so the model turns out to be fairly accurate. Many more attributes could be added to the model, theoretically making it more accurate, although it can become time consuming to acquire all the attributes if they are not readily available. I believe scoping training data sets to bounded geographic areas increases accuracy, especially if you can pick unique attributes relevant to the area. Who knows, maybe a tool like this could be used to make accurate price predictions available to most everyone.

Usefulness of Confusion Matrices

A Confusion Matrix is a visual performance assessment of a classification algorithm in the form of a table layout, or matrix. Each column of the matrix represents predicted classifications and each row represents actual classifications. This representation is a useful way to help evaluate a classifier model. A well behaved model should produce a balanced matrix and have consistent correctness numbers for accuracy, recall, precision and the F measure. If it does not, there is cause to further evaluate the data used to build the model and the data used to test it. If you are building your own classification models, this is a helpful way to evaluate them. If you are buying a product for something like Sentiment Analysis, which uses a classification model, you should ask for the data associated with the Confusion Matrix to help evaluate the tool.

If a classification algorithm distinguishes between positive, negative and neutral text statements, for example, a confusion matrix summarizing the results of a classification algorithm might look like the following.

                Predicted
Actual     Pos    Neg   Neutral
Pos         15     10      100
Neg         10     15       10
Neutral     10    100     1000
  • In this matrix, the total actual number of statements classified in each category is the sum of the row. 125 Positive, 35 Negative and 1110 Neutral.
  • Columns represent predictions made by the algorithm. In the first column, 15 Positive statements were classified correctly as Positive. 10 Negative statements were incorrectly classified as Positive and 10 Neutral statements were incorrectly classified as Positive. The 20 statements incorrectly classified are considered false positives.
  • Reading left to right across the top row, of the total Positive statements, 15 were classified as Positive, 10 were classified as Negative and 100 were classified as Neutral. The 110 Positive statements that were missed are considered false negatives.
  • Values on the diagonal (15, 15, 1000) are correct classifications. All other cells are incorrect classifications.

Accuracy

The simplest and most intuitive assessment is Accuracy. It is the number of correct classifications divided by all classifications. In this example, divide the sum of the diagonal values by the sum of the values in all cells: 1030/1270 = 0.811.

How to Cheat:

If the categories are highly imbalanced, high accuracy can be obtained by always guessing the category with the largest number of elements. Accuracy can be checked by randomly classifying the elements and comparing the percent guessed correctly with the accuracy of the algorithm. Accuracy should also be validated with Precision and Recall to detect how the data may be set up to cheat.
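
Using the example matrix above, that cheat is easy to quantify: always predicting Neutral would already beat the model's 81% accuracy.

# Majority-class baseline for the example matrix: always predict "Neutral".
total = 125 + 35 + 1110                 # all statements (actual Pos + Neg + Neutral)
always_neutral_correct = 1110           # only the actual Neutral statements are counted correct
print(always_neutral_correct / total)   # ~0.874, higher than the model's 0.811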

Precision

Precision is the number of correct classifications penalized by the number of incorrect classifications: true positives / (true positives + false positives). In this example, the precision is:

(reading down each predicted column)

positive statements is 15 / (15 + 20) = .43

negative statements is 15 / (15 + 110) = .12

neutral statements is 1000 / (1000 + 110) = .90

How to Cheat:

You can get high precision by rarely classifying in a given category but this ruins Recall.

Recall

Recall is the number of correct classifications penalized by the number of missed items: true positives / (true positives + false negatives). In this example, the recall is:

(reading across each Actual row)

positive statements is 15 / (15 + 110) = .12

negative statements is 15 / (15 + 20) = .43

neutral statements is 1000 / (1000 + 110) = .90

F Measure (F1)

The F1 measure is a derived effectiveness measurement: the harmonic mean of the precision and recall. The best value is 1 and the worst is 0.

2((precision*recall) / (precision+recall))

positive statements:  2*((0.43 * 0.12) / (0.43 + 0.12)) = 0.19

negative statements: 2*((0.12 * 0.43) / (0.12 + 0.43)) = 0.19

neutral statements: 2*((0.9 * 0.9) / (0.9 + 0.9)) = 0.9
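
All of these numbers can be reproduced mechanically from the matrix; as a cross-check, here is a short NumPy sketch of the same calculations:

import numpy as np

# Confusion matrix from the example above (rows = actual, columns = predicted),
# class order: Positive, Negative, Neutral.
cm = np.array([[  15,  10,  100],
               [  10,  15,   10],
               [  10, 100, 1000]])

tp = np.diag(cm)                        # correct classifications per class
accuracy  = tp.sum() / cm.sum()         # 1030 / 1270
precision = tp / cm.sum(axis=0)         # down each predicted column
recall    = tp / cm.sum(axis=1)         # across each actual row
f1        = 2 * precision * recall / (precision + recall)

print(accuracy)    # ≈ 0.811
print(precision)   # ≈ [0.43, 0.12, 0.90]
print(recall)      # ≈ [0.12, 0.43, 0.90]
print(f1)          # ≈ [0.19, 0.19, 0.90]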

Observations

In this example, although the Accuracy at 81% seems good, the precision and recall indicate that the numbers are skewed by having an imbalance in the number of values tested across the categories. This is clearly visible in the F measure numbers. It could be that the model in this example was built with many more neutral statements than positive and negative and the test data contains mostly neutral statements. Accuracy alone can be misleading.

Ideally you want to see a matrix more like the following:

                Predicted
Actual     Pos    Neg   Neutral
Pos        105     10       10
Neg          6     90        8
Neutral     12     15      150

Accuracy = 0.85

Precision

Positive = 0.85

Negative = 0.78

Neutral = 0.89

Recall

Positive = 0.84

Negative = 0.87

Neutral = 0.85

F Measure

Positive = 0.84

Negative = 0.82

Neutral = 0.87

All the numbers are in the same ballpark and are reasonably balanced across the matrix, with the bulk of the values on the diagonal showing a majority of correct classifications. This also suggests that the model was constructed with a balanced data set and that the test data evaluated by the model is balanced as well.