Bilkent University EEE 485 Term Project Report
Members of the Group:
Ceyhun Emre Öztürk / Dep.: EE
Ömer Musa Battal / Dep.: EE
Project Phase:
3 (Last Phase)
Introduction:
In this project, we implemented a house price predictor using several machine learning approaches in the Python programming environment. Our dataset was constructed by extracting data from zingat.com and sahibinden.com. We used three different learning methods that were shown in class to predict house prices: Linear Regression, K-means Clustering, and an SNN (Shallow Neural Network). We used K-means Clustering to divide the house notices into groups according to their features; our aim in doing so was to reduce the nonlinearity of our dataset. Then, we applied Linear Regression separately in each cluster.
Changes Made After Phase 2 Report: We decided to delete some of the features. We changed the imputation code so that if the percentage of empty cells in a feature column exceeds 50%, the feature is dropped. We also realized that we had made some mistakes while applying vectorization on the dataset (vectorization means assigning vectors in place of non-numerical feature values). We noticed that the creators of the notices left trailing white spaces at the end of some strings and numbers, so we added a few lines to strip them.
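The whitespace fix amounts to the following (a rough pandas sketch rather than the exact lines of our script):

import pandas as pd

df = pd.read_csv('whole_dataset.csv', delimiter=';')
# strip leading/trailing whitespace from every string-valued column
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.strip()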
Detailed Description of the Methods Used:
1) Feature Extraction:
1.1) Imputation Methods:
Imputation methods are statistical methods used to replace missing values with assigned values. We needed imputation methods because our datasets had missing cells. For example, the owner of a notice may not know whether there is an airport near the house on sale; therefore, they have the option of not giving information about whether there is an airport nearby or not. Another possibility is that the owner of the notice may want to hide some features of the house on sale. We used different types of imputation and combined the datasets with imputed values. This method is known as multiple imputation. Multiple imputation is the most appropriate imputation method for our project because it replaces missing values with values whose variance is larger than the simple residual variance. We wanted large noise in the imputed values because the values that are missing in the datasets are likely actually known by the owners of the notices of the houses on sale; they hide information to make their houses look more valuable.
Researchers who use this method can obtain approximately unbiased estimates of all the parameters. The steps of multiple imputation are given below in order:
- Impute the missing values by using an appropriate model, i.e. a model with random variation.
- Repeat the first step several times; let us assume that we repeat it m times.
- Apply the m models to replace the missing values, obtaining m different datasets.
- Average the m results to get the final values that replace the missing values (Multiple Imputation for Missing Data, Web).
In our project we chose m = 2 (we used at most 2 different imputation methods for each feature). There are 156 features in the original dataset.
Figure 1: Imputation Methods Used for Some of the Features

Feature          | Imputation Method 1 | Imputation Method 2
Age of Building  | Mod                 | Average
District         | Mod                 | Mod
Doorman          | Mod                 | Mod
Dry Cleaning     | Mod                 | Mod
Drywall          | Mod                 | Mod
Dues (TL)        | Mod                 | Average
Elevators        | Mod                 | Mod
Fire Alarm       | Mod                 | Mod
Fitness Center   | False               | False

Mod: assigning the mode of the values of the column (if there are non-numerical values) to the empty cells.
Average: assigning the mean of the values of the column (if all values are numerical) to the empty cells.
False: assigning “False” to the empty cells when it is known that having “True” for that feature is rare.
There were some features in our dataset whose columns were mostly empty. As part of imputation, we deleted any such feature that sounded irrelevant to house prices. We also deleted any feature whose column had more than 50% empty cells. We consulted a civil engineer (Ceyhun Emre Öztürk’s father) beforehand so that we could delete the right features. Consequently, our dataset had 107 features and 2086 notices in total.
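A minimal pandas sketch of the three imputation rules and the 50% drop criterion (the column names are examples from Figure 1; our actual implementation is dataset_imputation.py in the Appendix):

import pandas as pd

df = pd.read_csv('whole_dataset.csv', delimiter=';')

# drop any feature whose column is more than about 50% empty
df = df.dropna(axis=1, thresh=len(df) // 2)

# Mod: fill with the most frequent value of the column
df['District'] = df['District'].fillna(df['District'].mode().iloc[0])
# Average: fill with the mean of a numerical column
df['Dues (TL)'] = df['Dues (TL)'].fillna(df['Dues (TL)'].mean())
# False: fill with False when a True value is known to be rare
df['Fitness Center'] = df['Fitness Center'].fillna(False)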
1.2) Vectorization and Normalization of the Dataset:
We examined most of the values that a notice can have for each feature and thought over what each feature means. The values of some features were suitable for ordering. For example, we knew beforehand that the price of a house decreases as the age of the building increases; therefore, we did not change the values of features like “Age of Building”. However, for some features, different values clearly could not be compared. For example, it is not clear that the price of an apartment flat increases as the “Floor” of the flat increases. Therefore, we created n features to replace the “Floor” feature, where n is the number of different values observed in the “Floor” column of the dataset. For example, the value Floor = 0 got its own feature: if a house was on the ground floor, the cell indexed by that house and the “Floor=0” feature was 1; otherwise (if that house was on another floor), 0 was assigned to that cell. This process was applied to all similar features, and also to features with non-numeric values. For example, the original dataset had a feature called “District” which stores the name of the district of the corresponding house. After vectorization, the resulting dataset had one feature for each possible district, whose cells were either 1 or 0. This process is called “vectorization” since the information of every “district” is stored in an element of a vector instead of a text string or a scalar number. After applying vectorization, we had 444 features in total.
We applied normalization to the dataset so that applying PCA would give us the right results.
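A minimal pandas sketch of this one-hot vectorization and of the normalization step (the column names are examples; the full versions are dataset_vectorization.py and normalization_of_dataset.py in the Appendix):

import pandas as pd

df = pd.read_csv('whole_dataset_imputation.csv', delimiter=';')

# one 0/1 feature per observed value of each non-numeric (or unordered) column
df = pd.get_dummies(df, columns=['Floor', 'District'], prefix_sep='-')

# normalization: zero mean and unit variance for every numeric column
numeric = df.select_dtypes('number')
df[numeric.columns] = (numeric - numeric.mean()) / numeric.std()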
1.3) Principal Component Analysis (PCA):
PCA is a method that can be used to find the most relevant elements of the feature vector and eliminate unnecessary features. The algorithm of PCA is as follows: first, the covariance matrix of the dataset is found; then, the eigenvalues of this matrix are calculated. If there are eigenvalues that are much greater than the other eigenvalues, their corresponding eigenvectors are used to construct the new dataset with reduced dimensions. These eigenvectors are called principal eigenvectors. The formula for this process is given below:

X_new = X * E,

where X is the normalized dataset, E is the matrix consisting of the a principal eigenvectors (we ignored n - a eigenvectors, since the covariance matrix has size n x n), and X_new is the new dataset with the reduced number of features. It should be noted that the features found by PCA are completely different from the initial features. Therefore, they do not directly reflect the categories given in the real estate trade websites; for example, there is no feature representing “the number of rooms” directly anymore. We picked the eigenvectors for the E matrix such that the PVE (proportion of variance explained) by the selected eigenvectors was more than 0.7. After applying PCA, we were left with 170 features (principal components) in total.
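A minimal numpy sketch of this procedure, assuming the normalized dataset is stored in a matrix X with one row per notice (our full implementation is pca_on_normalized_dataset.py in the Appendix; the PVE threshold is a parameter):

import numpy as np

def pca_reduce(X, pve_threshold=0.7):
    # covariance matrix of the normalized (zero-mean) data
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric matrix -> real eigenpairs
    order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # keep enough principal eigenvectors to explain the desired proportion of variance
    pve = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(pve, pve_threshold)) + 1
    E = eigvecs[:, :k]
    return X @ E                                # X_new = X * E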
2) Learning
For this phase, we worked with three different learning algorithms to obtain two different predictors.
2.1) Clustering and Linear Regression:
2.1.1) Clustering:
Clustering is an unsupervised algorithm that is used to group data points. Our project is a problem that requires supervised regression; however, we found a way of using Clustering to complete our machine learning task. Our purpose was to divide the data into meaningful groups using Clustering and then to apply linear regression separately on these groups to predict the prices of the houses.
The loss function minimized by Clustering is

J = sum over all data points i and all clusters k of r_ik * d(x_i, mu_k),

where d(x_i, mu_k) is a general dissimilarity measure between x_i (the location of the ith data point) and mu_k (the location of the centroid of the kth cluster), and r_ik is the indicator of the relationship between the ith data point and the kth cluster. This indicator is 1 only when the ith data point is assigned to the kth cluster; otherwise, it is 0. The loss function is minimized by applying two steps iteratively. At first, initial values are assigned to the locations of the cluster centroids. Then, the two steps, expectation and maximization, are applied. In the expectation step, the loss function is minimized with respect to the r_ik's. In the maximization step, the loss function is minimized with respect to the mu_k's.
This algorithm is guaranteed to reach a local minimum. To reach the local minimum with the least loss, we ran the algorithm 10 times, selecting a different random initial location for each centroid at each run. After the 10 runs, the run with the least loss was kept. We wrote a loop such that if the loss J exceeds 400000, 10 new runs are performed. The threshold for the loss was derived experimentally (before adding the loop) by looking at the runtime results of the Clustering script.
Before the Phase 2 Demo, we validated the Clustering methods (K-means Clustering and K-medoids Clustering with weighted Euclidean distance) with 5 different random initial locations for each centroid. We evaluated RSS/TSS for each implementation to compare the coefficients of determination of the K-means Clustering model and the K-medoids Clustering model.
2.1.1.1) k-means Clustering:
This is the first type of Clustering that we implemented. For this type of Clustering, the dissimilarity measure is the squared Euclidean distance:

d(x_i, mu_k) = ||x_i - mu_k||^2.

For this equality, we have the following solution for the expectation step: r_ik = 1 if k = argmin_j ||x_i - mu_j||^2, and r_ik = 0 otherwise. In the maximization step, each centroid mu_k is set to the mean of the data points currently assigned to cluster k.
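A minimal numpy sketch of k-means with the two alternating steps and random restarts, written as an illustration of the definitions above rather than as a copy of the clustering.py script in the Appendix:

import numpy as np

def k_means(X, k, n_restarts=10, rng=np.random.default_rng(0)):
    best_loss, best_assign, best_centroids = np.inf, None, None
    for _ in range(n_restarts):
        centroids = X[rng.choice(X.shape[0], k, replace=False)]  # random initial centroids
        while True:
            # expectation: assign each point to its nearest centroid
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            # maximization: move each centroid to the mean of its assigned points
            new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        loss = dists[np.arange(X.shape[0]), assign].sum()
        if loss < best_loss:
            best_loss, best_assign, best_centroids = loss, assign, centroids
    return best_assign, best_centroids, best_loss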
2.1.2) Linear Regression:
We applied linear regression to the clusters that were found using K-means Clustering. In reality, the properties of houses may have a nonlinear effect on their prices; however, linear regression can still be applied to nonlinear data. The advantage of using linear regression instead of nonlinear regression is that fitting a linear curve is simpler than fitting a nonlinear one. Also, the calculation of loss functions is easier with linear regression.
When using linear regression with nonlinear data, we need to use the following formula:

y = Beta_0 + Beta_1 * x_1^(m_1) + Beta_2 * x_2^(m_2) + ... + Beta_n * x_n^(m_n),

where y is the output, Beta_0 is the linear regression bias, x_i is the ith element of the feature vector (which has n elements in total), Beta_i is the weight for the ith element of the feature vector, and m_i is the power of the ith element of the feature vector. After choosing the m_i's, we are left with linear regression. The drawback of this procedure is that the terms depending on the x_i's may be insufficient in terms of complexity. We applied Ridge regularization to our loss function to avoid overfitting.
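The closed-form Ridge solution has the form w = (X^T X + lambda*I)^(-1) X^T y; a minimal numpy sketch is given below (it mirrors the ridge_regression routine of linear_regression.py in the Appendix; the cluster matrices and the lambda value in the usage comment are placeholders):

import numpy as np

def ridge_regression(X, y, lambda_reg=0.0):
    # closed-form ridge solution: w = (X^T X + lambda I)^(-1) X^T y
    XtX = X.T @ X
    w = np.linalg.inv(XtX + lambda_reg * np.identity(XtX.shape[0])) @ X.T @ y
    return w

# usage: fit on one cluster, then predict on its test half
# w = ridge_regression(X_cluster_train, y_cluster_train, lambda_reg=190)
# y_pred = X_cluster_test @ w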
2.2) Shallow Neural Network:
We used a shallow neural network as another learning algorithm for our purpose. Shallow neural networks are simply neural networks with one hidden layer, unlike deep neural networks, which may have many hidden layers. A feed forward neural network consists of layers of perceptrons, with the perceptrons of consecutive layers interconnected. All connections have associated weights, and each perceptron has its own bias. A perceptron applies a nonlinear activation function to the linear combination of the outputs of the previous layer plus its bias, and passes its output to the perceptrons of the next layer. This way, it somewhat simulates a neuron, and the overall network may be able to predict the output better than regular regression. The following figure illustrates the general structure of a feed forward neural network. The activations are computed layer by layer as

a' = σ(w * a + b),

where a' denotes the result of the current activation layer, a is the previous layer's activations, σ is the activation function, w is the weights of the previous layer, and b is the biases of the previous layer. The choice of the activation function is important for reasons that will be apparent later.
The neural network is trained by repeated applications of feed forward and back propagation. The input is given to the network, and the predetermined loss function is computed based on the output. Then, all the weights of the connections between the layers are updated through back propagation to minimize the loss function. The loss function, also called the cost function, has the following form:

C = sum over all input data x_i of l(a(x_i), y_i),

where l(.,.) is a measure of the error of the network output a(x_i) for input data x_i with true value y_i. The loss function is the sum of all such errors. The loss function can be chosen arbitrarily in this form; however, some established cost functions exist, which possibly provide better learning. One such loss function is the quadratic loss function, which is suitable for regression. It is given below:

l(a(x_i), y_i) = (1/2) * ||y_i - a(x_i)||^2.
To minimize the loss, gradient descent or stochastic gradient descent methods are used. Repetition of this process minimizes the error, hopefully at a global minimum. To ensure that the network is not stuck at a local minimum, stochastic methods may be used or the initial weights may be chosen differently.
Stochastic gradient descent (SGD) applies the gradient descent algorithm to randomly selected mini batches of the input data. It thus reduces the likelihood of getting stuck at a local minimum, while still representing the dataset well in a probabilistic sense. It has another favorable consequence: it enables the network to train faster, since the entire input is not used for a single mini batch back propagation. These factors make stochastic gradient descent a useful method of training.
Moreover, neural networks may be prone to overfitting to the training data, in which case the results cannot be generalized well to real life situations. To avoid this, regularization methods such as weight decay may be used. The update equations for gradient descent with weight decay in SGD are as follows:

w ← (1 - η*λ/n) * w - (η/m) * sum over the mini batch of ∂C_x/∂w,
b ← b - (η/m) * sum over the mini batch of ∂C_x/∂b.

In the above equations, η is the learning rate, λ is the regularization parameter, n is the size of the training set, m is the mini batch size, and C_x is the cost for a single training example x.
Although neural networks should theoretically be very capable, they also have some drawbacks. One such drawback is the vanishing gradient problem, which results from the gradient of the activation functions vanishing during back propagation computations. In that case, the network gets stuck and cannot learn further. Moreover, the activation function should allow small changes in the output when the weights are changed slightly, since this is the underlying principle of the gradient descent algorithm. The choice of the activation function is therefore essential: it should be a smooth function and preferably should not have a vanishing gradient. One such function is the sigmoid function, which is commonly used in neural networks, and we used it as well. Its derivative is also easy to compute, which is convenient for back propagation computations. The sigmoid function and its derivative are given below:

σ(z) = 1 / (1 + e^(-z)),  σ'(z) = σ(z) * (1 - σ(z)).
There is also some terminology regarding neural networks, particularly the SGD algorithm. A mini batch is a randomly selected subset of the training data that is used in one update of the SGD algorithm. When the entire training dataset has been covered by these mini batches, one epoch of training is said to be completed.
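The following is a minimal numpy sketch of one SGD update on a single training example for a network with one hidden layer, using the sigmoid activation, the quadratic cost and the weight decay rule described above (the layer shapes and hyperparameter values are placeholders, not those of our final model):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def sgd_step(x, y, W1, b1, W2, b2, eta, lam, n_train):
    # feed forward
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    # back propagation for the quadratic cost (1/2)*||y - a2||^2
    delta2 = (a2 - y) * sigmoid_prime(z2)
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
    # gradient descent step with weight decay
    W2 = (1 - eta * lam / n_train) * W2 - eta * np.outer(delta2, a1)
    b2 = b2 - eta * delta2
    W1 = (1 - eta * lam / n_train) * W1 - eta * np.outer(delta1, x)
    b1 = b1 - eta * delta1
    return W1, b1, W2, b2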
3) Validation
3.1) k-fold Cross Validation:
This technique is robust for providing randomness in validation. The algorithm of k-fold cross validation is given as follows (a short sketch follows the list):
- Randomly split the dataset into k equal sized folds.
- Train the ith model on all folds except fold i.
- Validate the ith model using fold i.
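A minimal numpy sketch of the k-fold procedure (the train and validate functions are hypothetical placeholders):

import numpy as np

def k_fold_indices(n_samples, k, rng=np.random.default_rng(0)):
    indices = rng.permutation(n_samples)       # random split
    return np.array_split(indices, k)          # k (almost) equal sized folds

# usage sketch:
# folds = k_fold_indices(len(X), k=5)
# for i, val_idx in enumerate(folds):
#     train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
#     model_i = train(X[train_idx], y[train_idx])            # hypothetical
#     score_i = validate(model_i, X[val_idx], y[val_idx])    # hypothetical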
Description of the Dataset:
Our dataset was constructed by extracting data from zingat.com and sahibinden.com. The features were created using the features listed in the user interfaces of those sites. There were many empty cells in our dataset because the two sites have different standards for the features of the houses. For imputation, we used the mode and average values of the filled cells. Also, we deleted some of the features, since most of the houses did not have any information about them and they sounded irrelevant to the civil engineer whom we consulted. Then, we assigned numerical values to each cell that originally stored non-numerical values. After that, we normalized the data and applied PCA on the dataset. The resulting dataset was used as the input to all of our learning methods.
Simulation and Performance of the Training Methods:
1) Clustering:
We did not actually use Clustering for training in our final implementation. We used Clustering without making use of the prices of the houses, and we did not make price predictions with Clustering. Instead, we used Clustering to find similar house notices and to group them. Then, we applied Linear Regression on these groups to predict the house prices.
Before the Phase 2 Demo, we validated the Clustering methods (K-means Clustering and K-medoids Clustering with weighted Euclidean distance) with 5 different random initial locations for each centroid. We evaluated RSS/TSS for each implementation to compare the coefficients of determination of the K-means Clustering model and the K-medoids Clustering model. We reported in our Phase 2 Report that we used 10-fold cross validation to validate these models; however, we made some changes in the code before the Phase 2 Demo to create better testing conditions for the accuracy of the Clustering models. The RSS/TSS ratios for the cross-validation models are shown in the figure below:
The coefficient of determination is equal to 1 - RSS/TSS. Therefore, we observed from the figure above that the K-means Clustering model is more accurate. As a result, we used only the K-means Clustering method as our Clustering method in the final implementation. After deciding on the model to use, we ran the algorithm 10 times (instead of 5 times), selecting a different random location for each centroid at each run. We displayed the training time for each run in the console output. The average training time was found to be 2.851 seconds. This “training time” is actually the execution time of the algorithm; we kept calling it the training time because, until the Phase 2 Demo, we had used the Clustering model as a learning method to directly find house price predictions. Before the demo, we were not making use of Linear Regression on the clusters.
2) Linear Regression:
The value that we optimized was λ (the coefficient of Ridge regularization). For this phase, we applied linear regression to our data without applying any feature transformation. Therefore, the following version of the formula given in Section 2.1.2, with all m_i equal to 1, was used:

y = Beta_0 + Beta_1 * x_1 + Beta_2 * x_2 + ... + Beta_n * x_n.

We found by trial and error that using 3 clusters gives the best accuracy for the predictions made by linear regression. Linear Regression with Ridge Regularization was applied on the 3 clusters separately, and therefore 3 different optimal λ values were found. We used 5-fold cross validation to find the optimal λ value for each cluster. The results are given below:
Average training time for cluster 0: 0.0506 s. Optimal lambda value for cluster 0: 190.
Average training time for cluster 1: 0.079 s. Optimal lambda value for cluster 1: 320.
Average training time for cluster 2: 0.0534 s. Optimal lambda value for cluster 2: 240.
The coefficient of determination was defined as the average of the coefficient of determination values of the cross validation models. The coefficient of determination for Regularized Linear Regression was found to be 0.715. After finding the optimal λ values, we shuffled the house notices and then separated each cluster into two groups with the same number of members. One of the groups was used for training with the corresponding optimal λ value, and the other group was used for testing. We used the test set to calculate the probability of predicting the price within the given accuracy ranges. Our results are given below:
Probability of making a prediction in the range [0.9 × (real price), 1.1 × (real price)] = 0.217.
Probability of making a prediction in the range [0.8 × (real price), 1.2 × (real price)] = 0.432.
Probability of making a prediction in the range [0.7 × (real price), 1.3 × (real price)] = 0.592.
We thought that we could increase the accuracy of the algorithm by using the Ada-Boost method. To implement the Ada-Boost algorithm, we needed to create subsets of our clusters to be used in training instead of the clusters themselves. However, we did not have a sufficient number of house notices to work with. Our algorithm allowed clusters with at least 300 house notices, and we had 170 features, which means that the ratio of the number of data points to the number of features was already small, which may lead to inaccuracies. Therefore, we had no tolerance for working with a smaller dataset. That is the reason why we did not use the Ada-Boost algorithm in our first predictor model (the Clustering + Regularized Linear Regression model).
3) Shallow Neural Network:
In our implementation, we first defined various useful activation functions, together with their derivatives for the back propagation calculations. We also defined the quadratic and cross entropy cost functions. Then we defined a neural network class, which holds the perceptron weights and biases. The implementation of this class was inspired by an online source. It has a prominent SGD function, which utilizes a mini batch update function, which in turn uses the back propagation function. The SGD function can be configured to monitor the accuracy and the total cost of the neural network on a given test dataset, which is not used in training. The algorithm can be terminated prematurely if the accuracy on the test dataset does not change for a set number of epochs of training. At the end of the training, the performance of the network can be evaluated on another validation dataset, which is separate from the test dataset. As a performance measure, we used a method resembling a confidence interval: we compared the results of the neural network with the actual values and computed the statistic of the prediction being within a certain percentage interval of the true value. The driver program is configured to save the state of the neural network object and load this object in the next run, enabling incremental training from one run to the next.
The activation function, the number of layers and the perceptron count at each layer, the number of epochs, the mini batch size, the learning rate, the regularization method and the regularization coefficient were the hyperparameters of our learning algorithm. We adjusted these hyperparameters based on the obtained accuracy; that is, we tried to find the hyperparameters which maximized our prediction accuracy. We applied a heuristic approach for this purpose and tried various combinations until we settled on the current model. It may be argued that our model may not be optimal after this procedure, but it was the result of our best effort. The driver program could be modified to exhaustively search through all hyperparameter combinations, training the network from scratch for each combination. However, the amount of computation required would be infeasible, so we did not use such an approach.
One implementation detail is that our neural network does regression, but the output layer does not have a linear activation function. Instead, the real targets were scaled down when fed to the network, and the network outputs were scaled back up to obtain the real predictions. We resorted to this method because the linear output layer was causing numerical instability in the back propagation computations. This was not a major issue, since house prices are always positive and have a reasonable upper bound. Even if we had negative values in our regression problem, we could circumvent this by using a bias at the output, eliminating the problem.
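A minimal sketch of this scaling idea, assuming a sigmoid output unit in (0, 1) and a hypothetical upper bound on house prices:

PRICE_UPPER_BOUND = 10_000_000  # hypothetical upper bound in TL

def scale_price(price):
    # map a positive price into (0, 1) so a sigmoid output layer can represent it
    return price / PRICE_UPPER_BOUND

def unscale_prediction(network_output):
    # map the network output back to a price in TL
    return network_output * PRICE_UPPER_BOUND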
The total execution time of the SNN algorithm was 1023.20 seconds. The coefficient of determination for the test set was found to be 0.647, and the coefficient of determination for the training set was found to be 0.956. We used the test set to calculate the probability of predicting the price within the given accuracy ranges. Our results are given below:
Probability of making a prediction in the range [0.9 × (real price), 1.1 × (real price)] = 0.313.
Probability of making a prediction in the range [0.8 × (real price), 1.2 × (real price)] = 0.582.
Probability of making a prediction in the range [0.7 × (real price), 1.3 × (real price)] = 0.779.
As a result, we can say that for predicting house prices in Ankara, the SNN is a more effective learning method than the method we created by combining Clustering and Ridge Regularized Linear Regression.
Conclusion:
In the Expected Challenges section of our Phase 1 report, most of the problems which we stated we would have to counter were about data extraction and data cleaning. Our prediction was correct. We faced the following problems in the first phase of the project:
1) Our web crawler program crashed because the websites' protection tools denied our requests for HTML after the program had been extracting HTML from those websites for a certain amount of time.
2) We had to correct the names of the features, because house notices from zingat.com had Turkish labels whereas house notices from sahibinden.com had English labels. We converted all label names to English. Also, there were some features irrelevant to the houses of interest. For example, one of the extracted features was "closeness to Bosphorus". However, the Bosphorus is a strait in İstanbul, whereas we were interested only in houses from Ankara. The sad fact about this situation is that there were some notices where the house was located in Ankara but the owner of the notice claimed that the house was close to the Bosphorus by filling the corresponding cell with the "True/Yes" value. This made us question the reliability of the features of house notices.
Because of these problems, creating the dataset that we desired took several weeks. Therefore, we were able to create a dataset with only 2086 house notices. If we had more house notices, we could apply Ada-Boost learning to both of our learning models to improve their accuracy. We were not able to implement Ada-Boosting since we did not have a sufficient number of house notices to work with.
Ada-Boost is the abbreviation of Adaptive Boosting. In Adaptive Boosting, some data points are selected randomly from the training set for training. Then, the whole training set is used for testing. Then, some data points are again selected randomly from the training set for training; this time, each data point that has been predicted poorly has a higher probability of being selected than the other data points. The number of iterations of this training-testing sequence is fixed and chosen by the programmer.
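A minimal sketch of this resampling idea for regression (the subset size and the weight update rule are simple illustrative choices rather than the exact Ada-Boost equations; base_learner stands for a hypothetical fit routine, such as our ridge regression, that returns a model with a predict method):

import numpy as np

def boosted_resampling(X, y, base_learner, n_rounds=10, rng=np.random.default_rng(0)):
    n = len(y)
    weights = np.full(n, 1.0 / n)            # start with uniform selection probabilities
    models = []
    for _ in range(n_rounds):
        # sample a training subset according to the current selection probabilities
        idx = rng.choice(n, size=n // 2, replace=True, p=weights)
        model = base_learner(X[idx], y[idx])
        models.append(model)
        # increase the weight of poorly predicted points (illustrative rule)
        errors = np.abs(y - model.predict(X))
        weights = weights * (1.0 + errors / (errors.max() + 1e-12))
        weights = weights / weights.sum()
    return models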
One other problem is that we found the optimal number of clusters for implementing “Clustering + Ridge Regularized Linear Regression” on our fixed dataset to be 3 by trial and error. However, for a different dataset consisting of other house notices, the optimal number of clusters may differ. We need a way to find the optimal number of clusters automatically so that our Machine Learning approach can be used in a commercial application.
Statistics for the accuracy of our learning models are given below for convenience:
For Clustering + Ridge Regularized Linear Regression:
Coefficient of Determination = 0.715.
Probability of making a prediction in the range [0.9 × (real price), 1.1 × (real price)] = 0.217.
Probability of making a prediction in the range [0.8 × (real price), 1.2 × (real price)] = 0.432.
Probability of making a prediction in the range [0.7 × (real price), 1.3 × (real price)] = 0.592.
For Shallow Neural Network:
Coefficient of Determination = 0.647.
Probability of making a prediction in the range [0.9 × (real price), 1.1 × (real price)] = 0.313.
Probability of making a prediction in the range [0.8 × (real price), 1.2 × (real price)] = 0.582.
Probability of making a prediction in the range [0.7 × (real price), 1.3 × (real price)] = 0.779.
We thought about ways to increase the accuracy of our Machine Learning approach and to make it more suitable for commercial applications (in the form of a website or a mobile app). We propose the following ways which, in our opinion, would be useful:
1) We can get permission from the Real Estate Trade Websites to fetch their HTML content faster. They could fund us in order to profit from the commercial application.
2) If the owners of a Real Estate Trade Website make a deal with us as proposed in the first clause, they could adjust their user interface so that the labels we extract from that Website are more relevant to the 107 features that were left after imputation. We believe that those 107 features are sufficient to predict house prices in Ankara, since we found those features using two different Websites. We think that we can generalize our model by making use of these fixed features in the commercial application. One exception might be adding the listing date and the inflation rate as features. By adding these features, we can incorporate into our learning models the effect of price changes over time.
3) There were conflicts between labels coming from different Websites, and sometimes duplicate labels occurred. To solve this problem, our commercial application would make several predictions for the price of a house, such that each prediction makes use of a dataset constructed using only one Website as its source. This way, we can avoid conflicts between labels. The user of the application will have several predictions at hand to make use of.
4) Natural language processing approaches can be utilized to map the feature names of house notices to the desired names (to one of the names of the fixed 107 features). A feature will be discarded if it cannot be accurately represented as one of the fixed 107 features.
5) To find the optimal number of clusters for our Clustering algorithm, cross-validation will be used, and several values will be tested as the number of clusters. This way, the optimal number of clusters will be found automatically.
Works Cited:
“Multiple Imputation for Missing Data.” Statistics Solutions, www.statisticssolutions.com/multiple-imputation-for-missing-data/.
Appendix:
Python Scripts: crawler.py is responsible for extracting data from a targeted website. dataset_imputation.py is responsible for imputing the empty values of the dataset. dataset_vectorization.py is responsible for assigning numbers in place of non-numeric data. normalization_of_dataset.py centers and scales the data. pca_on_normalized_dataset.py applies PCA on the normalized data. clustering.py applies K-means Clustering on our dataset to divide it into separate groups based on similarity; our aim in doing so was to reduce the nonlinearity of our dataset. linear_regression.py is responsible for training and testing a linear regression model with Ridge regularization. neural_network.py is responsible for training and testing a Shallow Neural Network model. neural_network_driver.py is the script that contains the Neural Network model. miscellaneous.py is responsible for measuring how much time the algorithms inside the neural_network.py script consume.
crawler.py:
# installed libraries
from bs4 import BeautifulSoup
import pandas as pd
import requests
# built in libraries
import json
import logging
import random
import platform
import re
import subprocess
import time
# self written code
import data_extractors as data_ex


# give beep sound
def os_beep():
    if 'Linux' in platform.system():
        os_sound_name = 'alarm-clock-elapsed'
        subprocess.call(['/usr/bin/canberra-gtk-play', '--id', os_sound_name])
    elif 'Windows' in platform.system():
        print('to beep, or not to beep... That is the question.')
        time.sleep(2)


# get some free proxies from given url
def get_proxies(proxy_url='https://free-proxy-list.net/') -> list:
    # list to keep all proxies
    proxies = []
    response = persistent_request(proxy_url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # extract proxies table
    proxies_table_tag = soup.find('div', {'class': 'table-responsive'})
    proxy_tag_list = proxies_table_tag.find_all('tr')
    # determine proxy info key names
    keynames = []
    for keyname_tag in proxy_tag_list[0].find_all('th'):
        keynames.append(keyname_tag.text.strip())
    # add proxy information to proxies list
    for proxy_tag in proxy_tag_list:
        # get only elite proxies
        if proxy_tag.find(string='elite proxy'):
            proxy_info_list = proxy_tag.find_all('td')
            proxy = {}
            key_index = 0
            for proxy_info in proxy_info_list:
                proxy[keynames[key_index]] = proxy_info.text.strip()
                key_index = key_index + 1
            proxies.append(proxy)
    # return num_of_proxies_to_return proxies at maximum
    num_of_proxies_to_return = 20
    if len(proxies) >= num_of_proxies_to_return:
        proxy_list = random.sample(proxies, num_of_proxies_to_return)
        # proxy_list = proxies[:num_of_proxies_to_return]
        return proxy_list
    else:
        return proxies


# try a persistent http connection
def persistent_request(url: str, headers={}, proxies={}, timeout=0):
    while True:
        try:
            http_response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
            # check if http response status code is not 200
            if http_response.status_code != 200:
                print('http connection not successful.')
                print('url: ' + http_response.url)
                print('http status code: ' + str(http_response.status_code))
                # act based on the http status code itself
                if http_response.status_code == 404:
                    print('Page not found errors are to be skipped.')
                    break
                else:
                    raise Exception('http connection error.')
        except Exception as ex:
            print('An exception occurred: ')
            print(ex)
            print('Retrying...')
            continue
        else:
            break
    return http_response


# try a persistent http connection with proxy changes
def persistent_proxy_request(url: str, headers={}, proxy_list=[], proxy_index=0, timeout=0):
    while True:
        try:
            print('Using proxy from the proxy set: ' + proxy_list[proxy_index])
            requests_proxies = {
                'http': proxy_list[proxy_index],
                'https': proxy_list[proxy_index],
            }
            http_response = requests.get(url, headers=headers, proxies=requests_proxies, timeout=timeout)
            # check if http response status code is not 200
            if http_response.status_code != 200:
                print('http connection not successful.')
                print('url: ' + http_response.url)
                print('http status code: ' + str(http_response.status_code))
                # act based on the http status code itself
                if http_response.status_code == 404:
                    print('Page not found errors are to be skipped.')
                    break
                else:
                    raise Exception('http connection error.')
        except Exception as ex:
            print('An exception occurred: ')
            print(ex)
            proxy_index += 1
            if proxy_index < 0 or proxy_index >= len(proxy_list):
                print('Proxy index is out of range.')
                proxy_index = 0
                print('Retrieving another set of proxies...')
                proxy_list = []
                for proxy_dict in get_proxies():
                    proxy = proxy_dict.get('IP Address') + ':' + proxy_dict.get('Port')
                    proxy_list.append(proxy)
                print('New proxy list: ')
                print(proxy_list)
            print('Retrying http connection...')
            continue
        else:
            break
    return {'http_response': http_response, 'proxy_list': proxy_list, 'proxy_index': proxy_index}


# main crawler body for all websites
def crawler_main(webpage_name: str, main_notices_page_no_interval: list):
    # time delay intervals in seconds to avoid being banned by website
    # TODO how long to wait???
    delay_between_notices_interval = [0, 0]
    delay_between_notice_sets_interval = [0, 0]
    delay_between_main_pages_interval = [0, 0]
    # http timeout
    requests_timeout = 5
    # user agents used in http requests
    requests_user_agents_list = ['Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36']
    # headers for http requests, needed to get response from some sites
    requests_headers = {'user-agent': requests_user_agents_list[0]}
    # websites to check visible ip address
    # ip_test_url = ['https://httpbin.org/ip', 'http://ipecho.net/plain']
    # initialize a list of usable free proxies
    proxies = []
    for proxy_dict in get_proxies():
        proxy = proxy_dict.get('IP Address') + ':' + proxy_dict.get('Port')
        proxies.append(proxy)
    print('proxy list: ')
    print(proxies)
    proxy_index = 0
    # check page interval bounds
    if main_notices_page_no_interval[0] < 1:
        print('Invalid start page number.')
        return False
    if main_notices_page_no_interval[1] < 1:
        print('Invalid end page number.')
        return False
    # list to contain all house dictionaries
    # uses too much RAM???
    houses_info_list = []
    # initialize the obtained data range
    # this information is used for storage file naming
    # decoded as follows: (start_page_no)-(start_notice_no)_(end_page_no)-(end_notice_no)
    retrieved_data_start_index = str(main_notices_page_no_interval[0]) + '-1'
    retrieved_data_end_index = str(main_notices_page_no_interval[0]) + '-0'
    retrieved_data_index_range = '[' + retrieved_data_start_index + '_' + retrieved_data_end_index + ']'
    # start main try block
    try:
        # list to contain all house notice links over all pages
        house_notice_url_list = []
        # loop over notices main pages
        for main_notices_page_no in range(main_notices_page_no_interval[0], main_notices_page_no_interval[1] + 1):
            # get website info list
            website_info_list = data_ex.get_webpage_info(webpage_name, main_notices_page_no)
            # get the webpage main url
            webpage_main = website_info_list[0]
            # get website customized main notices page url for given page number
            main_notices_page_url = website_info_list[1]
            # get the search term to use for finding notice links
            notice_links_search_term = website_info_list[2]
            # get the data extractor to be used
            data_extractor = website_info_list[3]
            # the http response object from main notices page
            main_notices_page_response_list = persistent_proxy_request(
                main_notices_page_url, headers=requests_headers,
                proxy_list=proxies, proxy_index=proxy_index, timeout=requests_timeout)
            main_notices_page_response = main_notices_page_response_list['http_response']
            proxies = main_notices_page_response_list['proxy_list']
            proxy_index = main_notices_page_response_list['proxy_index']
            # get beautifulsoup object
            main_notices_page_soup = BeautifulSoup(main_notices_page_response.text, 'html.parser')
            # wait a random amount of time so that the website does not ban us
            main_pages_time_delay = random.uniform(delay_between_main_pages_interval[0],
                                                   delay_between_main_pages_interval[1])
            print('Waiting time before new notice main page request: ' + f"{main_pages_time_delay:.3f}" + ' s.')
            time.sleep(main_pages_time_delay)
            # loop over each house notice link
            for notice_page_link in main_notices_page_soup.find_all('a', notice_links_search_term):
                # get the notice page
                notice_page_url = webpage_main + notice_page_link['href']
                # add it to the complete list
                house_notice_url_list.append(notice_page_url)
            # inform the user about current status
            print('Adding retrieved notice urls to the list...')
            print('Current number of total notices: ' + str(len(house_notice_url_list)))
        # index to keep number of notices processed in one page
        page_notice_no = 0
        # index to keep main notices page count
        main_notices_page_no = 1
        # process each house notice url
        for notice_page_url in house_notice_url_list:
            # the http response object for one notice page
            notice_page_response_list = persistent_proxy_request(
                notice_page_url, headers=requests_headers,
                proxy_list=proxies, proxy_index=proxy_index, timeout=requests_timeout)
            notice_page_response = notice_page_response_list['http_response']
            proxies = notice_page_response_list['proxy_list']
            proxy_index = notice_page_response_list['proxy_index']
            # wait a random amount of time so that the website does not ban us
            if page_notice_no % 10 == 0:
                notice_sets_time_delay = random.uniform(delay_between_notice_sets_interval[0],
                                                        delay_between_notice_sets_interval[1])
                print('Extra waiting time before new notice page request: '
                      + f"{notice_sets_time_delay:.3f}" + ' s.')
                time.sleep(notice_sets_time_delay)
            # get beautifulsoup object
            notice_page_soup = BeautifulSoup(notice_page_response.text, 'html.parser')
            #### Code to extract info for one house from notice page
            # dictionary to hold information of one house
            house_info_dict = {}
            # add notice url to the house info dictionary
            keyname = 'URL'
            valname = notice_page_url
            house_info_dict[keyname] = valname
            ## Custom extraction code for one notice webpage
            data_extractor(notice_page_soup, house_info_dict)
            # append the house info dictionary to the complete list
            houses_info_list.append(house_info_dict)
            # notify progress on terminal
            page_notice_no += 1
            print('Retrieved data: Page ' + str(main_notices_page_no) +
                  ', Notice ' + str(page_notice_no) + ', url=' + notice_page_url)
            # wait a random amount of time so that the website does not ban us
            notices_time_delay = random.uniform(delay_between_notices_interval[0],
                                                delay_between_notices_interval[1])
            print('Waiting time before new notice page request: ' + f"{notices_time_delay:.3f}" + ' s.')
            time.sleep(notices_time_delay)
            # determine the obtained data range
            # this information is used for storage file naming
            # decoded as follows: (start_page_no)-(start_notice_no)_(end_page_no)-(end_notice_no)
            retrieved_data_end_index = str(main_notices_page_no) + '-' + str(page_notice_no)
            retrieved_data_index_range = '[' + retrieved_data_start_index + '_' + retrieved_data_end_index + ']'
            # keep the index of main notices page
            notices_per_main_page = 50
            if page_notice_no % notices_per_main_page == 0:
                main_notices_page_no += 1
                page_notice_no = 0
    except:
        print('Exception caught, writing rescued data to file...')
        # rescue the dataset file
        # create the dataset
        house_dataset = pd.DataFrame(houses_info_list)
        # save rescued dataset to file
        house_dataset.to_csv(webpage_name + '_dataset_' + retrieved_data_index_range + '_rescue.csv', sep=';')
        # rescue the obtained house links
        # create the links dataset
        house_links_dataset = pd.DataFrame(house_notice_url_list)
        # save rescued dataset to file
        house_links_dataset.to_csv(webpage_name + '_links_dataset_'
                                   + retrieved_data_index_range + '_rescue.csv', sep=';')
        # display exception logs
        logging.getLogger().error('Some exception occurred during crawling.', exc_info=True)
    else:
        print('No exception occurred during crawling, writing collected data to file...')
        # create the dataset
        house_dataset = pd.DataFrame(houses_info_list)
        # save dataset to file
        house_dataset.to_csv(webpage_name + '_dataset_' + retrieved_data_index_range + '.csv', sep=';')
        # create the links dataset
        house_links_dataset = pd.DataFrame(house_notice_url_list)
        # save links dataset to file
        house_links_dataset.to_csv(webpage_name + '_links_dataset_'
                                   + retrieved_data_index_range + '.csv', sep=';')
    finally:
        # make sound when code finishes execution, due to any reason
        try:
            print('Use CTRL+C to stop alert sound.')
            for _1 in range(2):
                os_beep()
        except KeyboardInterrupt:
            print('Keyboard interrupt.')
        finally:
            print('Finished execution.')


# main program start
crawler_main('sahibinden', [1, 101])
# crawler_main('zingat', [1, 3])
dataset_imputation.py:
import numpy as np
import pandas as pd

print('Program start.')
dataset_file = 'whole_dataset.csv'
imputation_method_file = 'applied_imputation_methods.csv'
dataset_frame = pd.read_csv(dataset_file, delimiter=';')
imputation_method_frame = pd.read_csv(imputation_method_file, delimiter=';')
dataset_frame = dataset_frame.iloc[:, 1:]
dataset_frame = dataset_frame.replace('Yes', True)
dataset_frame = dataset_frame.replace('No', False)
dataset_frame = dataset_frame.replace('-', None)
dataset_frame = dataset_frame.replace('Not Specified', None)
num_of_nulls = dataset_frame.isnull().sum(axis=0)
for imputation_method_colname in imputation_method_frame.columns[1:]:
    # Drop column if data in column is too empty
    if num_of_nulls[imputation_method_colname] > dataset_frame.shape[0] / 2:
        print('Dropping column due to lack of data...')
        dataset_frame = dataset_frame.drop(columns=[imputation_method_colname])
        imputation_method_frame = imputation_method_frame.drop(columns=[imputation_method_colname])
mods_table = dataset_frame.mode()
mods = mods_table.iloc[0]
medians = dataset_frame.median()
# print(medians)
for imputation_method_colname in imputation_method_frame.columns[1:]:
    # Strip trailing whitespaces from the data
    print(dataset_frame[imputation_method_colname].dtype)
    if not np.isreal(dataset_frame[imputation_method_colname][0]):
        print('Stripping trailing whitespace...')
        dataset_frame[imputation_method_colname] = dataset_frame[imputation_method_colname].str.strip()
    imputation_method_col = imputation_method_frame[imputation_method_colname]
    dataset_col_mod_index = dataset_frame.columns.get_loc(imputation_method_colname)
    imputation_method = imputation_method_col[0]
    # Debugging
    print(imputation_method_colname)
    print(dataset_frame.columns[dataset_col_mod_index])
    print(mods[dataset_col_mod_index])
    print(dataset_col_mod_index)
    fillna_value = None
    if imputation_method == 'Mod':
        fillna_value = mods[dataset_col_mod_index]
    elif imputation_method == 'Average':
        if imputation_method_colname in medians.keys():
            fillna_value = medians[imputation_method_colname]
        else:
            fillna_value = mods[dataset_col_mod_index]
    elif imputation_method == False:
        fillna_value = False
    if imputation_method_col[1] == 'Average':
        if imputation_method_colname in medians.keys():
            # Debugging
            print(medians[imputation_method_colname])
            print(type(medians[imputation_method_colname]))
            print('########')
            print(mods[dataset_col_mod_index])
            print(type(mods[dataset_col_mod_index]))
            # average the median with the first mode value that can be cast to int
            for mod_index in range(len(mods_table.iloc[0])):
                try:
                    mod_to_use = mods_table.iloc[mod_index][dataset_col_mod_index]
                    fillna_value = (int(medians[imputation_method_colname]) + int(mod_to_use)) / 2
                    break
                except ValueError as valueErr:
                    continue
    dataset_frame[imputation_method_colname] = dataset_frame[imputation_method_colname].fillna(fillna_value)
for dataset_colname in dataset_frame.columns:
    # print(dataset_frame[dataset_colname])
    if dataset_frame[dataset_colname].dtype == 'float64':
        dataset_frame[dataset_colname] = dataset_frame[dataset_colname].astype(int)
dataset_frame = dataset_frame.fillna(dataset_frame.mode().iloc[0])
dataset_frame.to_csv('whole_dataset_imputation.csv', sep=';')
print('Program end.')
dataset_vectorization.py:
import numpy as np
import pandas as pd

print('Program start.')
dataset_file = 'whole_dataset_imputation.csv'
dataset_frame = pd.read_csv(dataset_file, delimiter=';')
dataset_frame = dataset_frame.iloc[:, 1:]
# print(dataset_frame)
# print(dataset_frame.applymap(np.isreal))
# the following method is better for determining cols to vectorize
# vectorize the false entries
cols_isreal = dataset_frame.iloc[0].map(np.isreal)
# print(cols_isreal)
cols_to_vectorize = list()
for col_is_real_colname in cols_isreal.keys():
    col_is_real = cols_isreal[col_is_real_colname]
    if not col_is_real:
        cols_to_vectorize.append(col_is_real_colname)
print(cols_to_vectorize)
cols_to_vectorize.remove('URL')
for col_to_vectorize in cols_to_vectorize:
    # print(dataset_frame[col_to_vectorize])
    for i in range(len(dataset_frame[col_to_vectorize])):
        element = dataset_frame.loc[i, col_to_vectorize]
        if element is None or element == '':
            continue
        # one new True/False column per observed value of the original column
        col_to_add = str(col_to_vectorize) + '-' + str(element)
        if col_to_add not in dataset_frame.columns:
            dataset_frame[col_to_add] = False
        dataset_frame.loc[i, col_to_vectorize] = ''
        dataset_frame.loc[i, col_to_add] = True
    dataset_frame = dataset_frame.drop(col_to_vectorize, axis=1)
    # old version below (removed one level of indentation)
    '''
    try:
        element = dataset_frame.loc[i, col_to_vectorize]
        if element is None or element == '':
            continue
        element = int(element)
    except ValueError as valueErr:
        col_to_add = str(col_to_vectorize) + '-' + str(element)
        if col_to_add not in dataset_frame.columns:
            dataset_frame[col_to_add] = False
        dataset_frame.loc[i, col_to_vectorize] = ''
        dataset_frame.loc[i, col_to_add] = True
        continue
    '''
dataset_frame = dataset_frame.reindex(sorted(dataset_frame.columns), axis=1)
dataset_frame = dataset_frame.drop('URL', axis=1)
print(dataset_frame.iloc[0].map(np.isreal))
'''
for element in dataset_frame[cols_to_vectorize[col_index]]:
    print('####')
    print(element)
    print(type(element))
'''
# new_col_name = str(cols_to_vectorize[col_index]) + '-' + str(element)
dataset_frame.to_csv('whole_dataset_vectorized.csv', sep=';')
print('Program end.')
normalization_of_dataset.py:
import pandas as pd
import numpy as np

file = "whole_dataset_vectorized_use_this.csv"
original_dataset = pd.read_csv(file, delimiter=';')
# compute the mean of every feature column
mean_vector = np.zeros((1, len(original_dataset.columns[1:])))
for i in range(0, len(original_dataset.index)):
    for j in range(0, len(original_dataset.columns[1:])):
        mean_vector[0, j] = mean_vector[0, j] + original_dataset.iat[i, j + 1] / len(original_dataset.index)
centered_dataset = np.zeros((len(original_dataset.index), len(original_dataset.columns[1:])))
df_centered_dataset = pd.DataFrame(centered_dataset)
df_scaled_dataset = pd.DataFrame(np.zeros(centered_dataset.shape))
df_scaled_dataset.columns = original_dataset.columns[1:]
# center the data by subtracting the column means
for i in range(0, len(original_dataset.index)):
    for j in range(0, len(original_dataset.columns[1:])):
        df_centered_dataset.iat[i, j] = original_dataset.iat[i, j + 1] - mean_vector[0, j]
# compute the variance of every centered column
variance_vector = np.zeros((1, len(original_dataset.columns[1:])))
for i in range(0, len(original_dataset.index)):
    for j in range(0, len(original_dataset.columns[1:])):
        variance_vector[0, j] = variance_vector[0, j] + (df_centered_dataset.iat[i, j]) ** 2 / len(original_dataset.index)
# scale the centered data by the standard deviation of each column
for i in range(0, len(original_dataset.index)):
    for j in range(0, len(original_dataset.columns[1:])):
        df_scaled_dataset.iat[i, j] = df_centered_dataset.iat[i, j] / (variance_vector[0, j]) ** (1 / 2)
df_scaled_dataset.to_csv('normalized_dataset.csv', sep=';', index=True)
pca_on_normalized_dataset.py:
import pandas as pd
import numpy as np
from numpy import linalg

file = "normalized_dataset.csv"
df_normalized_dataset = pd.read_csv(file, delimiter=';')
normalized_dataset = df_normalized_dataset.iloc[:, 1:].values
# Compute the covariance matrix of the normalized data.
covariance_matrix = (1 / len(df_normalized_dataset.index)) * \
    np.dot(np.transpose(normalized_dataset), normalized_dataset)
eigenvalues, eigenvectors = linalg.eig(covariance_matrix)
# Organize eigenvalues and eigenvectors as a dictionary.
eigenvectors_dict = {}
for i in range(0, len(eigenvalues)):
    keyname = eigenvalues[i]
    valname = eigenvectors[:, i]
    eigenvectors_dict[keyname] = valname
# Sort eigenvectors based on eigenvalues.
eigenvectors_list = []
for i in sorted(eigenvectors_dict, reverse=True):
    eigenvectors_list.append(eigenvectors_dict[i])
# Normalize eigenvectors.
sorted_eigenvectors = np.transpose(np.asarray(eigenvectors_list))
for i in range(len(eigenvalues)):
    sorted_eigenvectors[:, i] = sorted_eigenvectors[:, i] / \
        (np.sum(sorted_eigenvectors[:, i] ** 2)) ** (1 / 2)
# PREALLOCATION
# Here k is the number of principal components that we plan to use
PVE_by_first_k_eigvectors = 0
total_variance_times_n = 0  # Here n means number of observations
# Compute total variance times n, for PVE computation later.
for i in range(len(normalized_dataset[:, 0])):
    for j in range(len(normalized_dataset[0, :])):
        total_variance_times_n += normalized_dataset[i, j] ** 2
projected_dataset = np.dot(normalized_dataset, sorted_eigenvectors)
projected_dataset_squared = np.square(projected_dataset)
for j in range(projected_dataset.shape[1]):
    variance_explained_by_jth_eigvector_times_n = 0
    for i in range(projected_dataset.shape[0]):
        variance_explained_by_jth_eigvector_times_n += projected_dataset_squared[i, j]
    PVE_by_jth_eigvector = variance_explained_by_jth_eigvector_times_n / total_variance_times_n
    PVE_by_first_k_eigvectors = PVE_by_first_k_eigvectors + PVE_by_jth_eigvector
    if PVE_by_first_k_eigvectors > 0.6:
        break
number_of_prime_components_to_use = j + 1
reduced_projected_dataset = np.dot(
    normalized_dataset, sorted_eigenvectors[:, 0:number_of_prime_components_to_use])
data_frame_to_write = pd.DataFrame(np.real(reduced_projected_dataset))
data_frame_to_write.to_csv('result_of_PCA.csv', sep=';', index=True)
clustering.py:
import pandas as pd
import numpy as np
from numpy import linalg
import matplotlib.pyplot as plt
import time

file = 'result_of_PCA.csv'
file2 = 'price.csv'
df_dataset = pd.read_csv(file, delimiter=';')
df_price = pd.read_csv(file2, delimiter=';')
number_of_clusters = 5
# Evaluation of Total Sum of Squares
price_vector = np.squeeze(df_price.values)
dataset = df_dataset.iloc[:, 1:].values
shuffled_index = np.arange(price_vector.shape[0])
np.random.shuffle(shuffled_index)
dataset = dataset[shuffled_index]
price_vector = price_vector[shuffled_index]
number_of_trials_for_centroids = 10
training_time_for_k_means_at_each_trial = np.zeros((number_of_trials_for_centroids))
initial_centroid_locations_array = np.zeros((number_of_clusters, df_dataset.shape[1] - 1, number_of_trials_for_centroids))
centroid_locations_array = np.zeros((number_of_clusters, df_dataset.shape[1] - 1, number_of_trials_for_centroids))
total_loss_array = np.full((number_of_trials_for_centroids), 501000)
while np.amin(total_loss_array) > 500000:
    print('New iteration of 10 trials for initial locations of centroids')
    # Loss according to the loss functions of Clustering
    total_loss_array = np.zeros((number_of_trials_for_centroids))
    for trial_number in range(number_of_trials_for_centroids):
        # Initialization of locations of centroids
        initial_centroid_locations_array[0:int(number_of_clusters / 2), :, trial_number] = \
            10 * np.random.rand(int(number_of_clusters / 2), df_dataset.shape[1] - 1)
        initial_centroid_locations_array[int(number_of_clusters / 2):number_of_clusters, :, trial_number] = \
            -10 * np.random.rand(number_of_clusters - int(number_of_clusters / 2), df_dataset.shape[1] - 1)
        centroid_locations_array[:, :, trial_number] = initial_centroid_locations_array[:, :, trial_number]
        check_to_break = False
        # K-means Clustering Method (uses Euclidean Distance)
        # Initialization
        centroid_locations_array[:, :, trial_number] = initial_centroid_locations_array[:, :, trial_number]
        # Training Phase
        start_time = time.time()
        threshold_for_change_of_loss = 0.00001
        # Finding Initial Loss
        r_matrix = np.zeros((dataset.shape[0], number_of_clusters))
        # The vector below stores squares of smallest Euclidean norms
        smallest_euclidean_norm_squ_for_each_notice = np.zeros(dataset.shape[0])
        for i in range(0, dataset.shape[0]):
            candidates_for_smallest_euclidean_norm_squ = np.zeros(number_of_clusters)
            for j in range(0, number_of_clusters):
                candidates_for_smallest_euclidean_norm_squ[j] = (np.linalg.norm(
                    dataset[i, :] - centroid_locations_array[j, :, trial_number])) ** (2)
            smallest_euclidean_norm_squ_for_each_notice[i] = np.amin(candidates_for_smallest_euclidean_norm_squ)
            r_matrix[i, np.argmin(candidates_for_smallest_euclidean_norm_squ)] = 1
        initial_loss = np.sum(smallest_euclidean_norm_squ_for_each_notice)
        previous_loss = 0
        old_r_matrix = np.zeros((dataset.shape[0], number_of_clusters))
        while (True):
            # Expectation Step
            r_matrix = np.zeros((dataset.shape[0], number_of_clusters))
            # The vector below stores squares of smallest Euclidean norms
            smallest_euclidean_norm_squ_for_each_notice = np.zeros(dataset.shape[0])
            for i in range(0, dataset.shape[0]):
                candidates_for_smallest_euclidean_norm_squ = np.zeros(number_of_clusters)
                for j in range(0, number_of_clusters):
                    candidates_for_smallest_euclidean_norm_squ[j] = (np.linalg.norm(
                        dataset[i, :] - centroid_locations_array[j, :, trial_number])) ** (2)
                smallest_euclidean_norm_squ_for_each_notice[i] = np.amin(candidates_for_smallest_euclidean_norm_squ)
                r_matrix[i, :] = np.zeros(r_matrix[i, :].shape)
                r_matrix[i, np.argmin(candidates_for_smallest_euclidean_norm_squ)] = 1
            total_loss = np.sum(smallest_euclidean_norm_squ_for_each_notice)
            total_loss_array[trial_number] = total_loss
            previous_loss = total_loss
            # Check for Convergence
            total_loss_array[trial_number] = total_loss
            if np.array_equal(old_r_matrix, r_matrix):
                break
            old_r_matrix = r_matrix
            # Maximization Step
            for j in range(0, number_of_clusters):
                if np.sum(r_matrix[:, j]) > 0:  # If is used to avoid division by zero
                    weighted_sum_vector = np.zeros((1, dataset.shape[1]))
                    sum_of_r = 0
                    for i in range(0, dataset.shape[0]):
                        weighted_sum_vector += r_matrix[i, j] * dataset[i, :]
                        sum_of_r += r_matrix[i, j]
                    centroid_locations_array[j, :, trial_number] = weighted_sum_vector / sum_of_r
            total_loss = 0
            for i in range(0, dataset.shape[0]):
                for j in range(0, number_of_clusters):
                    total_loss += r_matrix[i, j] * (np.linalg.norm(
                        dataset[i, :] - centroid_locations_array[j, :, trial_number])) ** (2)
            total_loss_array[trial_number] = total_loss
            previous_loss = total_loss
        total_loss_array[trial_number] = total_loss
        training_time_for_k_means_at_each_trial[trial_number] = time.time() - start_time
# Comparisons According to Total Loss
k_means_best_locations_for_centroids_2 = centroid_locations_array[:, :, np.argmin(total_loss_array)]
df_k_means_best_locations_for_centroids_2 = pd.DataFrame(k_means_best_locations_for_centroids_2)
df_k_means_best_locations_for_centroids_2.to_csv('best_cluster_centroids_for_k_means_clusters.csv', sep=';', index=True)
cluster_indices_of_notices = np.zeros((dataset.shape[0]), dtype=int)
for i in range(0, dataset.shape[0]):
    norm_vector = np.zeros((number_of_clusters))
    for j in range(0, number_of_clusters):
        norm_vector[j] = np.linalg.norm(dataset[i, :] - centroid_locations_array[j, :, np.argmin(total_loss_array)])
    cluster_indices_of_notices[i] = np.argmin(norm_vector)
df_cluster_indices_of_notices = pd.DataFrame(cluster_indices_of_notices)
df_cluster_indices_of_notices.columns = ['Cluster Number']
df_cluster_indices_of_notices.to_csv('cluster_indices_of_notices.csv', sep=';', index=True)
print('Total loss for each trial:', total_loss_array)
print('Training time at each trial:', training_time_for_k_means_at_each_trial)
df_dataset = pd.DataFrame(dataset)
df_price = pd.DataFrame(price_vector)
df_dataset.to_csv('dataset_after_clustering.csv', sep=';', index=True)
df_price.to_csv('prices_after_clustering.csv', sep=';', index=True)
# Counting number of notices in each cluster
number_of_notices_in_each_cluster = np.zeros((number_of_clusters))
for cluster in range(number_of_clusters):
    for notice in range(cluster_indices_of_notices.shape[0]):
        if cluster_indices_of_notices[notice] == cluster:
            number_of_notices_in_each_cluster[cluster] += 1
# Finding clusters with at least 300 notices:
sufficient_clusters = np.empty((0), dtype=int)
clusters_to_erase = np.empty((0), dtype=int)
for cluster in range(number_of_clusters):
    if number_of_notices_in_each_cluster[cluster] >= 300:
        sufficient_clusters = np.append(sufficient_clusters, [cluster])
    else:
        clusters_to_erase = np.append(clusters_to_erase, [cluster])
# Reassigning a notice to the nearest sufficient cluster if its own cluster has too few notices
distance_to_each_centroid = np.zeros(sufficient_clusters.shape)
cluster_indices_of_notices_v2 = np.zeros((dataset.shape[0]))
for notice in range(cluster_indices_of_notices.shape[0]):
    for cluster in clusters_to_erase:
        if cluster_indices_of_notices[notice] == cluster:
            for cluster_candidate_index in range(sufficient_clusters.shape[0]):
                distance_to_each_centroid[cluster_candidate_index] = (np.linalg.norm(
                    dataset[notice, :] - centroid_locations_array[sufficient_clusters[cluster_candidate_index],
                                                                  :, np.argmin(total_loss_array)])) ** (2)
            cluster_indices_of_notices_v2[notice] = sufficient_clusters[np.argmin(distance_to_each_centroid)]
            break
    else:
        cluster_indices_of_notices_v2[notice] = cluster_indices_of_notices[notice]
df_cluster_indices_of_notices_v2 = pd.DataFrame(cluster_indices_of_notices_v2)
df_cluster_indices_of_notices_v2.columns = ['Cluster Number']
df_cluster_indices_of_notices_v2.to_csv('cluster_indices_of_notices.csv', sep=';', index=True)
linear_regression.py:
import pandas as pd
import numpy as np
from numpy import linalg
import matplotlib.pyplot as plt
import time


# Ridge regression algorithm.
# If no regularization coefficient lambda_reg is given, does simple linear regression.
def ridge_regression(X, y, lambda_reg=0):
    X_T = np.transpose(X)
    X_T_X = np.dot(X_T, X)
    reg_matrix = lambda_reg * np.identity(len(X_T_X))
    inverted_matrix = linalg.inv(X_T_X + reg_matrix)
    transformation_matrix = np.dot(inverted_matrix, X_T)
    weights = np.dot(transformation_matrix, y)
    return weights


def apply_linear_regression(X, weights):
    y_predict = np.dot(X, weights)
    return y_predict


def loss_rss(weights, X, y):
    y_predict = apply_linear_regression(X, weights)
    # clip prices from below
    y_predict = np.clip(y_predict, 10 * 10 ** 3, None)
    rss = sum((y - y_predict) ** 2)
    return rss


def k_fold_x_validation(X, y, learning_alg, hyperparameters, loss_fcn, k):
    coeff_of_det_list = []
    for i in range(k):
        subset_size = int(len(X) / k)
        validation_set_index = [i * subset_size, (i + 1) * subset_size]
        X_test = X[validation_set_index[0]:validation_set_index[1], :]
        X_train = np.append(X[:validation_set_index[0]],
                            X[validation_set_index[1]:, :], axis=0)
        y_test = y[validation_set_index[0]:validation_set_index[1]]
        y_train = np.append(y[:validation_set_index[0]],
                            y[validation_set_index[1]:], axis=0)
        learned_function = learning_alg(X_train, y_train, hyperparameters)
        loss = loss_fcn(learned_function, X_test, y_test)
        tss = sum((y_test - np.mean(y_test)) ** 2)
        coeff_of_det = 1 - loss / tss
        coeff_of_det_list.append(coeff_of_det)
    return np.mean(coeff_of_det_list)


if __name__ == "__main__":
    print('Program start.')
    # real data files
    dataset_file = 'dataset_after_clustering.csv'
    results_file = 'prices_after_clustering.csv'
    # read from data files
    dataset_frame = pd.read_csv(dataset_file, delimiter=';')
    dataset_frame = dataset_frame.iloc[:, 1:]
    results_frame = pd.read_csv(results_file, delimiter=';')
    results_frame = results_frame.iloc[:, 1:]
    cluster_indices_of_notices_frame = pd.read_csv('cluster_indices_of_notices.csv', delimiter=';')
    cluster_indices_of_notices_frame = cluster_indices_of_notices_frame.iloc[:, 1:]
    # convert to numpy arrays
    dataset = dataset_frame.values
    prices = results_frame.values
    cluster_indices_of_notices = cluster_indices_of_notices_frame.values
    # Feature transformation.
    # Add the squares of the most important features.
    num_of_features_to_square = 10
    dataset = np.append(dataset, dataset[:, :num_of_features_to_square] ** 2, axis=1)
    cluster_names = np.unique(cluster_indices_of_notices)
    number_of_clusters = len(cluster_names)
    y_combined_total = np.empty((0, 2))
    for cluster_index in range(number_of_clusters):
        X = np.empty((0, dataset.shape[1]))
        y = np.empty((0))
        for notice in range(dataset.shape[0]):
            if cluster_names[cluster_index] == cluster_indices_of_notices[notice]:
                X = np.append(X, np.reshape(dataset[notice, :], (1, dataset[notice, :].shape[0])), axis=0)
                y = np.append(y, prices[notice], axis=0)
        # append bias entries to data matrix
        X = np.append(np.ones((len(X), 1)), X, axis=1)
        print(y.shape)
        # lists to hold errors and lambdas
        coeff_of_det_list = []
        tss_mean_list = []
        lambda_reg_list = []
        elapsed_time_list = []
        # the interval to check lambda values
        lambda_reg_interval = np.arange(0, 1500, 10)
        # number of folds for cross validation
        cross_validation_k = 5
        # cross validate for all lambdas
        for lambda_reg in lambda_reg_interval:
            start_time = time.time()
            coeff_of_det = k_fold_x_validation(
                X, y, ridge_regression, lambda_reg, loss_rss, cross_validation_k)
            #print(loss_mean)
            end_time = time.time()
            elapsed_time = end_time - start_time
            elapsed_time_list.append(elapsed_time)
            coeff_of_det_list.append(coeff_of_det)
        # print average elapsed time
        average_elapsed_time = sum(elapsed_time_list) / len(elapsed_time_list)
        print('Average training time for cluster', cluster_index, ':', average_elapsed_time)
        optimal_lambda_reg = lambda_reg_interval[np.where(
            coeff_of_det_list == np.amax(coeff_of_det_list))[0][0]]
        print('Optimal lambda value for cluster', cluster_index, ':', optimal_lambda_reg)
        shuffled_index = np.arange(y.shape[0])
        np.random.shuffle(shuffled_index)
        X = X[shuffled_index]
        y = y[shuffled_index]
        # apply learned function to half of the input and write results to external file
        weights = ridge_regression(X[:round(X.shape[0]/2), :], y[:round(X.shape[0]/2)], optimal_lambda_reg)
        y_predict = apply_linear_regression(X[round(X.shape[0]/2):, :], weights)
        y_predict = np.clip(y_predict, 10*10**3, None)
        y_combined = np.column_stack((y[round(X.shape[0]/2):], y_predict))
        y_combined_total = np.concatenate((y_combined_total, y_combined), axis=0)
    data_frame_to_write = pd.DataFrame(np.real(y_combined_total),
                                       columns=['Real Prices (TL)', 'Predicted Prices (TL)'])
    data_frame_to_write.to_csv('result_of_ridge_regression_after_clustering.csv', sep=';', index=True)
    coeff_of_det = np.amax(coeff_of_det_list)
    print('Average coeff of det:')
    print(coeff_of_det)
    probability_of_accurate_prediction = 0
    for notice_index in range(y_combined_total.shape[0]):
        if y_combined_total[notice_index, 1] >= 0.9 * y_combined_total[notice_index, 0] and \
                y_combined_total[notice_index, 1] <= 1.1 * y_combined_total[notice_index, 0]:
            probability_of_accurate_prediction += 1.0 / float(y_combined_total.shape[0])
    print('Probability of making a prediction in the range [0.9*(real price), 1.1*(real price)]:',
          probability_of_accurate_prediction)
    second_probability_of_accurate_prediction = 0
    for notice_index in range(y_combined_total.shape[0]):
        if y_combined_total[notice_index, 1] >= 0.8 * y_combined_total[notice_index, 0] and \
                y_combined_total[notice_index, 1] <= 1.2 * y_combined_total[notice_index, 0]:
            second_probability_of_accurate_prediction += 1.0 / float(y_combined_total.shape[0])
    print('Probability of making a prediction in the range [0.8*(real price), 1.2*(real price)]:',
          second_probability_of_accurate_prediction)
    third_probability_of_accurate_prediction = 0
    for notice_index in range(y_combined_total.shape[0]):
        if y_combined_total[notice_index, 1] >= 0.7 * y_combined_total[notice_index, 0] and \
                y_combined_total[notice_index, 1] <= 1.3 * y_combined_total[notice_index, 0]:
            third_probability_of_accurate_prediction += 1.0 / float(y_combined_total.shape[0])
    print('Probability of making a prediction in the range [0.7*(real price), 1.3*(real price)]:',
          third_probability_of_accurate_prediction)
    print('Program end.')
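As a quick sanity check of the closed-form ridge solution above (weights = (X^T X + lambda*I)^(-1) X^T y), the two helper functions can be exercised on synthetic data. The script below is only an illustrative sketch and is not part of the submitted pipeline; the file name, array sizes and values are made up.

# ridge_sanity_check.py (illustrative sketch only, not part of the submitted code)
import numpy as np
from linear_regression import ridge_regression, apply_linear_regression

np.random.seed(0)
X = np.random.rand(100, 3)                       # 100 synthetic notices with 3 features
X = np.append(np.ones((len(X), 1)), X, axis=1)   # bias column, as in the main script
true_weights = np.array([5.0, 2.0, -1.0, 0.5])
y = X.dot(true_weights) + 0.01 * np.random.randn(100)

weights = ridge_regression(X, y, lambda_reg=1.0)   # closed-form ridge solution
y_predict = apply_linear_regression(X, weights)
print('Recovered weights:', weights)
print('RSS:', np.sum((y - y_predict) ** 2))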
neural_network_driver.py:
"""Driver
for network2 for EEE485 project. """
#
Custom libraries
from
neural_network import *
#
Main function
if
__name__ == "__main__":
print('Program start.')
# Filename to contain the neural
network object
#network_filename =
'neural_network2_object_eee485.json'
network_filename =
'neural_network2_object_eee485_v0.1.json'
# An execution timer to track
execution durations
exe_timer = ExecutionTimer()
# Load the datasets
# EEE485 datasets
# Trial data files
#X_filename =
'normalized_deneme.csv'
#y_filename =
'normalized_deneme_results.csv'
# Real data files
X_filename = 'result_of_PCA.csv'
y_filename = 'price_indexed.csv'
X_frame =
pd.read_csv(X_filename, delimiter=';')
y_frame = pd.read_csv(y_filename,
delimiter=';')
# Remove index columns
X_frame = X_frame.iloc[:, 1:]
y_frame = y_frame.iloc[:, 1:]
# Convert data frames to numpy
arrays
X = X_frame.values
y = y_frame.values
# Limit the data size (optional)
data_size = len(X)
X = X[:data_size]
y = y[:data_size]
# Scale y values to avoid
overflow
y_scale = (6 * 10**6)
y = y / y_scale
# Shuffle original datasets
(optional)
shuffle_index =
np.arange(X.shape[0])
np.random.shuffle(shuffle_index)
X_shuffled = X[shuffle_index]
y_shuffled = y[shuffle_index]
#X = X_shuffled
#y = y_shuffled
# Split dataset into subsets
based on the given ratios
dataset_split_ratios = {'train':
0.8, 'test': 0.1, 'validation': 0.1}
dataset_split_counts = [int(len(X)*dataset_split_ratios[i])
for
i in dataset_split_ratios]
dataset_split_indexes =
np.cumsum(dataset_split_counts)
dataset_split_indexes =
np.insert(dataset_split_indexes, 0, 0, axis=0)
tuple_dataset =
create_tuple_dataset(X, y)
dataset_batches =
[tuple_dataset[start:end]
for start, end in
zip(dataset_split_indexes[:-1], dataset_split_indexes[1:])]
dataset = {key: dataset_batch
for (key,
dataset_batch) in zip(dataset_split_ratios.keys(), dataset_batches)}
training_data = dataset['train']
test_data = dataset['test']
validation_data =
dataset['validation']
print('Loaded the datasets.')
# Set the layer sizes
input_layer_size =
len(tuple_dataset[0][0])
output_layer_size =
len(tuple_dataset[0][1])
hidden_layer_sizes = [60]
sizes = [input_layer_size] +
hidden_layer_sizes + [output_layer_size]
print('Layer sizes for the
neural network:')
print(sizes)
# Construct the neural network
net = NeuralNetwork(sizes,
activation_fn=Sigmoid, cost=QuadraticCost)
#net = NeuralNetwork(sizes,
activation_fn=Sigmoid, cost=CrossEntropyCost)
net.large_weight_initializer()
# Define the hyperparameters for
training
hyperparameters = {
'epochs': 10,
'mini_batch_size':
10,
'eta': 0.05,
'lmbda': 0.2,
'acc_percent':
10,
'early_stopping_n':
500,
}
epochs =
hyperparameters['epochs']
mini_batch_size =
hyperparameters['mini_batch_size']
eta = hyperparameters['eta']
lmbda = hyperparameters['lmbda']
acc_percent =
hyperparameters['acc_percent']
early_stopping_n = hyperparameters['early_stopping_n']
# Load the neural network object
before training (optional)
print('Loading the neural
network object...')
net = load(network_filename)
# Train the network using SGD
try:
exe_timer.start()
print('Training
the neural network...')
net.SGD(training_data,
epochs, mini_batch_size, eta, lmbda=lmbda,
acc_percent=acc_percent,
test_data=test_data,
check_test_accuracy=True,
check_training_accuracy=True,
check_test_cost=False,
check_training_cost=False,
early_stopping_n=early_stopping_n)
except KeyboardInterrupt as ex:
print('Keyboard
Interrupt caught during SGD.')
option =
input('Do you want to save the neural network? (y/n):')
if option == 'y':
#
Save the trained neural network object
print('Saving
the neural network object...')
net.save(network_filename)
else:
print('Not
saving the neural network object.')
pass
else:
print('Finished
training the neural network.')
execution_time =
exe_timer.stop()
print('Execution
time: ' + str(execution_time) + ' sec.')
# Save the
trained neural network object
print('Saving the
neural network object...')
net.save(network_filename)
finally:
pass
# Load the neural network object
after training (optional)
print('Loading the neural
network object...')
net = load(network_filename)
# Make predictions and evaluate
them
# Extract training dataset
print('Training dataset
confidence interval computations:')
X_valid, y_valid =
extract_data(training_data)
# Scale outputs
y_valid = y_valid * y_scale
y_predict = net.predict(X_valid)
* y_scale
y_combined =
(np.hstack((y_valid, y_predict)))
for percentage in range(10, 40,
10):
probability_of_accurate_prediction
= net.accuracy(
training_data,
percent=percentage) / len(training_data)
print('Probability
of making a prediction with error percentage margin {}: {}'.format(
percentage,
probability_of_accurate_prediction))
# Compute coefficient of
determination.
tss = np.sum((y_valid -
np.mean(y_valid)) ** 2)
rss = np.sum((y_valid -
y_predict)**2)
coeff_of_det = 1 - rss / tss
print('Training dataset coeff.
of det.: ')
print(coeff_of_det)
# Print first 10 predictions on
terminal to give an idea
print('first ten predictions:')
print(y_combined[:10])
# Write results to external file
data_frame_to_write =
pd.DataFrame(np.real(y_combined))
data_frame_to_write.to_csv(
'result_of_neural_network2_train.csv',
sep=';', index=True)
# Extract validation dataset
print('Validation dataset
confidence interval computations:')
X_valid, y_valid =
extract_data(validation_data)
y_valid = y_valid * y_scale
y_predict = net.predict(X_valid)
* y_scale
y_combined =
(np.hstack((y_valid, y_predict)))
for percentage in range(10, 40,
10):
probability_of_accurate_prediction
= net.accuracy(
validation_data,
percent=percentage) / len(validation_data)
print('Probability
of making a prediction with error percentage margin {}: {}'.format(
percentage,
probability_of_accurate_prediction))
# Compute coefficient of
determination.
tss = np.sum((y_valid -
np.mean(y_valid)) ** 2)
rss = np.sum((y_valid -
y_predict)**2)
coeff_of_det = 1 - rss / tss
print('Validation dataset coeff.
of det.: ')
print(coeff_of_det)
# Print first 10 predictions on
terminal to give an idea
print('first ten predictions:')
print(y_combined[:10])
# Write results to external file
data_frame_to_write =
pd.DataFrame(np.real(y_combined))
data_frame_to_write.to_csv(
'result_of_neural_network2_valid.csv',
sep=';', index=True)
# Print accuracies of the neural
network
exe_timer.start()
percentage = 10
print('Accuracy of the neural
network on training data:')
accuracy =
net.accuracy(training_data, percent=percentage)
datasize = len(training_data)
print(str(accuracy) + ' / ' +
str(datasize))
print('Accuracy of the neural
network on validation data:')
accuracy =
net.accuracy(validation_data, percent=percentage)
datasize = len(validation_data)
print(str(accuracy) + ' / ' +
str(datasize))
print('Finished evaluating
accuracies of the neural network.')
execution_time =
exe_timer.stop()
print('Execution time: ' +
str(execution_time) + ' sec.')
# Print total costs of the
neural network
exe_timer.start()
print('Total cost of the neural
network on training data:')
total_cost = net.total_cost(training_data,
lmbda=lmbda)
print(total_cost)
print('Total cost of the neural
network on validation data:')
total_cost =
net.total_cost(validation_data, lmbda=lmbda)
print(total_cost)
print('Finished evaluating total
costs of the neural network.')
execution_time =
exe_timer.stop()
print('Execution time: ' +
str(execution_time) + ' sec.')
# Print the total execution time
execution_time =
exe_timer.get_total_time()
print('Total execution time: ' +
str(execution_time) + ' sec.')
print('Program end.')
neural_network.py:
"""Implementation
of feedforward neural network, using SGD.
We
were inspired by the following website: http://neuralnetworksanddeeplearning.com"""
#
Libraries
#
Standard library
import
json
import
sys
#
Third-party libraries
import
numpy as np
import
pandas as pd
#
Custom libraries
from
miscellaneous import *
#
Define the activation functions
class
Sigmoid(object):
@staticmethod
def fn(z):
"""The
sigmoid function."""
return
1.0/(1.0+np.exp(-z))
@staticmethod
def prime(z):
"""Derivative
of the sigmoid function."""
return
Sigmoid.fn(z)*(1-Sigmoid.fn(z))
class
ReLU(object):
@staticmethod
def fn(z):
"""The
ReLU function."""
# return
np.maximum(z, 0, z) # works in place, hence faster?
return
np.maximum(z, 0)
@staticmethod
def prime(z):
"""Derivative
of the ReLU function."""
return np.where(z
> 0, 1, 0)
class
Linear(object):
@staticmethod
def fn(z):
"""The
linear function."""
return z
@staticmethod
def prime(z):
"""Derivative
of the linear function."""
return
np.ones(z.shape)
#
Define the cost functions
class
QuadraticCost(object):
@staticmethod
def fn(a, y):
"""Return
the cost associated with an output."""
return
0.5*np.linalg.norm(a-y)**2
@staticmethod
def delta(z, a, y, activation):
"""Return
the error delta from the output layer for backpropagation."""
return (a-y) *
activation.prime(z)
class
CrossEntropyCost(object):
@staticmethod
def fn(a, y):
"""Return
the cost associated with an output."""
return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))
@staticmethod
def delta(z, a, y, activation):
"""Return
the error delta from the output layer for backpropagation."""
return (a-y)
#
Main Network class
class
NeuralNetwork(object):
def __init__(self, sizes,
activation_fn=Sigmoid, cost=CrossEntropyCost):
"""Construct
the neural network object."""
# List to hold
the neuron counts at each layer.
self.sizes =
sizes
self.num_layers =
len(self.sizes)
self.input_layer_size
= self.sizes[0]
self.output_layer_size
= self.sizes[self.num_layers-1]
# The activation
function to use.
self.activation_fn
= activation_fn
# The cost
function to use.
self.cost = cost
# Initialize the
weights of the network.
self.default_weight_initializer()
def
default_weight_initializer(self):
"""Initialize
weights and biases using Gaussian distributions."""
self.biases =
[np.random.randn(y, 1) for y in self.sizes[1:]]
self.weights =
[np.random.randn(y, x)/np.sqrt(x)
for
x, y in zip(self.sizes[:-1], self.sizes[1:])]
def
large_weight_initializer(self):
"""Initialize
the weights randomly again, but with larger means."""
self.biases =
[np.random.randn(y, 1) for y in self.sizes[1:]]
self.weights =
[np.random.randn(y, x)
for
x, y in zip(self.sizes[:-1], self.sizes[1:])]
def feedforward(self, a):
"""Return
the output of the network if a is input."""
for b, w in
zip(self.biases, self.weights):
a
= self.activation_fn.fn(np.dot(w, a)+b)
return a
def SGD(self, training_data,
epochs, mini_batch_size, eta,
lmbda=0.0,
acc_percent=0,
test_data=None,
check_test_cost=False,
check_test_accuracy=False,
check_training_cost=False,
check_training_accuracy=False,
early_stopping_n=0):
"""Apply
SGD to train the neural network."""
# Get the lengths
of the datasets.
training_data =
list(training_data)
n =
len(training_data)
if test_data:
test_data
= list(test_data)
n_data
= len(test_data)
# Variables for
early stopping functionality.
best_accuracy = 0
no_accuracy_change
= 0
# Lists to hold
the costs and accuracies.
test_cost,
test_accuracy = [], []
training_cost,
training_accuracy = [], []
# Loop for the
epochs.
for j in
range(epochs):
#
Shuffle the training data.
np.random.shuffle(training_data)
#
Construct mini batches from the training data.
mini_batches
= [
training_data[k:k+mini_batch_size]
for
k in range(0, n, mini_batch_size)]
#
Update the weights and biases based on mini batch SGDs.
for
mini_batch in mini_batches:
self.update_mini_batch(
mini_batch,
eta, lmbda, len(training_data))
#
Print progress of the learning for information.
if
(j % 50 == 0):
print("Epoch
%s of training complete" % j)
if
check_training_cost:
tot_cost =
self.total_cost(training_data, lmbda)
training_cost.append(tot_cost)
print("Cost
on training data: {}".format(tot_cost))
if
check_training_accuracy:
accuracy
= self.accuracy(
training_data,
percent=acc_percent)
training_accuracy.append(accuracy)
print("Accuracy
on training data: {} / {}".format(accuracy, n))
if
check_test_cost:
tot_cost
= self.total_cost(
test_data,
lmbda)
test_cost.append(tot_cost)
print("Cost
on evaluation data: {}".format(tot_cost))
if
check_test_accuracy:
accuracy
= self.accuracy(
test_data,
percent=acc_percent)
test_accuracy.append(accuracy)
print(
"Accuracy
on evaluation data: {} / {}".format(accuracy, n_data))
#
Early stopping checks.
if
early_stopping_n > 0:
if
accuracy > best_accuracy:
best_accuracy
= accuracy
no_accuracy_change
= 0
#print("Early-stopping:
Best so far {}".format(best_accuracy))
else:
no_accuracy_change
+= 1
if
(no_accuracy_change == early_stopping_n):
print(
"Stopping
early: No accuracy change in the last {}
epochs.".format(early_stopping_n))
return
test_cost, test_accuracy, training_cost, training_accuracy
return test_cost,
test_accuracy, \
training_cost,
training_accuracy
def update_mini_batch(self,
mini_batch, eta, lmbda, n):
"""Update
the weights and biases of the network using SGD.
eta is the
learning rate, lmbda is the regularization coefficient.
n is the total
size of the training dataset."""
# Initialize the
gradients.
grad_J_b =
[np.zeros(b.shape) for b in self.biases]
grad_J_w =
[np.zeros(w.shape) for w in self.weights]
# For each data
entry in mini batch, apply weight updates
for x, y in
mini_batch:
#
Apply backpropagation for the current data entry.
delta_grad_J_b,
delta_grad_J_w = self.backprop(x, y)
#
Update the gradients.
grad_J_b
= [gJb + dgJb for gJb,
dgJb
in zip(grad_J_b, delta_grad_J_b)]
grad_J_w
= [gJw + dgJw for gJw,
dgJw
in zip(grad_J_w, delta_grad_J_w)]
# Update the
weights and biases using the gradients and regularization.
self.weights =
[(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*gJw
for
w, gJw in zip(self.weights, grad_J_w)]
self.biases =
[b-(eta/len(mini_batch))*gJb
for b, gJb in zip(self.biases, grad_J_b)]
def backprop(self, x, y):
"""Return
a tuple representing the gradients for the cost function."""
grad_J_b =
[np.zeros(b.shape) for b in self.biases]
grad_J_w =
[np.zeros(w.shape) for w in self.weights]
a = x
# Lists to store
all the activations and z vectors, layer by layer.
a_list = [a]
z_list = []
# Feedforward the
input, and update the lists.
for b, w in
zip(self.biases, self.weights):
z
= np.dot(w, a)+b
z_list.append(z)
a
= self.activation_fn.fn(z)
a_list.append(a)
# Backpropagation
for output layer.
delta =
(self.cost).delta(
z_list[-1],
a_list[-1], y, self.activation_fn)
grad_J_b[-1] =
delta
grad_J_w[-1] =
np.dot(delta, a_list[-2].T)
# Backpropagation
for hidden layers.
for l in range(2,
self.num_layers):
z
= z_list[-l]
a_prime
= self.activation_fn.prime(z)
delta
= np.dot(self.weights[-l+1].T, delta) * a_prime
grad_J_b[-l]
= delta
grad_J_w[-l]
= np.dot(delta, a_list[-l-1].T)
return (grad_J_b,
grad_J_w)
def accuracy(self, data,
percent=0):
"""Return
the number of inputs in data for which the neural
network outputs
the correct result within given accuracy."""
if percent >
0:
#
Accuracy computation for regression.
X_valid,
y_valid = extract_data(data)
y_predict
= self.predict(X_valid)
percent
/= 100
y_accurate
= (y_predict < (y_valid * (1 + percent))) \
&
(y_predict > (y_valid * (1 - percent)))
result_accuracy
= np.sum(y_accurate)
return
result_accuracy
def total_cost(self, data,
lmbda):
"""Return
the total cost for the dataset."""
cost = 0.0
for x, y in data:
a
= self.feedforward(x)
cost
+= self.cost.fn(a, y)/len(data)
cost
+= 0.5*(lmbda/len(data)) * \
sum(np.linalg.norm(w)**2
for w in self.weights)
return cost
def save(self, filename):
"""Save
the neural network to a file with given filename."""
data =
{"sizes": self.sizes,
"weights":
[w.tolist() for w in self.weights],
"biases":
[b.tolist() for b in self.biases],
"cost":
str(self.cost.__name__)}
f =
open(filename, "w")
json.dump(data,
f)
f.close()
def predict(self, X):
"""Return
the output of the network if X is input.
A wrapper of
feedforward for outside use."""
return
self.feedforward(X.T).T
def
load(filename):
"""Load a neural
network from the file filename."""
f = open(filename,
"r")
data = json.load(f)
f.close()
cost =
getattr(sys.modules[__name__], data["cost"])
net =
NeuralNetwork(data["sizes"], cost=cost)
net.weights = [np.array(w) for w
in data["weights"]]
net.biases = [np.array(b) for b
in data["biases"]]
return net
def
create_tuple_dataset(X, y):
"""Create list of
tuples containing pairs of data,
from data and result matrices X
and y."""
dataset = [(X_row.reshape((X_row.shape[0],
1)),
y_row.reshape((y_row.shape[0],
1))) for X_row, y_row in zip(X, y)]
return dataset
def
extract_data(dataset):
"""Extract data
and results from tuple list dataset as an array."""
X, y = zip(*dataset)
X = np.asarray(X)
y = np.asarray(y)
X = np.reshape(X, X.shape[:2])
y = np.reshape(y, y.shape[:2])
return X, y
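For reference, the sketch below shows how the classes and helper functions above fit together on toy data. It is only an illustrative example (the layer sizes, sample counts and values are made up) and assumes neural_network.py and miscellaneous.py are on the Python path.

# toy_network_check.py (illustrative sketch only, not part of the submitted code)
import numpy as np
from neural_network import NeuralNetwork, Sigmoid, QuadraticCost, create_tuple_dataset, extract_data

np.random.seed(0)
X = np.random.rand(50, 4)      # 50 toy samples with 4 features
y = np.random.rand(50, 1)      # one regression target per sample, already scaled to [0, 1]
toy_data = create_tuple_dataset(X, y)

net = NeuralNetwork([4, 6, 1], activation_fn=Sigmoid, cost=QuadraticCost)
net.SGD(toy_data, epochs=20, mini_batch_size=5, eta=0.5)

X_check, y_check = extract_data(toy_data)
print('First three predictions:', net.predict(X_check)[:3].ravel())
print('First three targets:', y_check[:3].ravel())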
miscellaneous.py:
#### Libraries
# Standard library
import time
# Third-party libraries
import numpy as np


class ExecutionTimer(object):
    def __init__(self):
        """A code execution timer."""
        self.time_interval = {'start': None, 'end': None}
        self.total_time = 0
        self.running = False

    def start(self):
        """Start the timer."""
        if not self.running:
            self.time_interval['start'] = time.time()
            self.running = True
        else:
            print('Cannot start: timer is currently running.')

    def stop(self):
        """Stop the timer."""
        if self.running:
            self.time_interval['stop'] = time.time()
            self.running = False
            duration = self.time_interval['stop'] - self.time_interval['start']
            self.total_time += duration
            return duration
        else:
            print('Cannot stop: timer is currently not running.')
            return None

    def get_total_time(self):
        """Get the total time from the timer."""
        if not self.running:
            return self.total_time
        else:
            print('Cannot get total time: timer is currently running.')
            return None

    def reset(self):
        """Reset the timer."""
        self.time_interval = {'start': None, 'end': None}
        self.total_time = 0
        self.running = False


def k_fold_x_validation(X, y, learning_alg, loss_fcn, k):
    if (k > X.shape[0]):
        k = X.shape[0]
    subset_size = X.shape[0] // k
    #loss_vector = np.empty(subset_size * k)
    loss_vector = np.empty(k)  # for coeff of det
    for i in range(k):
        validation_set_index = [i * subset_size, (i + 1) * subset_size]
        X_test = X[validation_set_index[0]:validation_set_index[1]]
        X_train = np.concatenate((X[:validation_set_index[0]],
                                  X[validation_set_index[1]:]), axis=0)
        y_test = y[validation_set_index[0]:validation_set_index[1]]
        y_train = np.concatenate((y[:validation_set_index[0]],
                                  y[validation_set_index[1]:]), axis=0)
        learning_alg.train(X_train, y_train)
        loss_entry = loss_fcn(learning_alg, X_test, y_test)
        #loss_vector[validation_set_index[0]
        #            :validation_set_index[1]] = loss_entry
        loss_vector[i] = loss_entry  # for coeff of det
    return loss_vector
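Note that this k_fold_x_validation expects a learner object with a train method (and a loss function that takes the trained learner), whereas the version in linear_regression.py takes a plain training function. A minimal compatible learner is sketched below; MeanPredictor and rss_loss are hypothetical names used only for illustration.

# illustrative usage of miscellaneous.k_fold_x_validation (not part of the submitted code)
import numpy as np
from miscellaneous import k_fold_x_validation

class MeanPredictor(object):
    """Hypothetical learner that always predicts the mean of its training targets."""
    def train(self, X_train, y_train):
        self.mean_value = np.mean(y_train)
    def predict(self, X):
        return np.full(X.shape[0], self.mean_value)

def rss_loss(learner, X_test, y_test):
    return np.sum((y_test - learner.predict(X_test)) ** 2)

X = np.random.rand(20, 3)
y = np.random.rand(20)
print(k_fold_x_validation(X, y, MeanPredictor(), rss_loss, k=5))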