Bilkent University EEE 485 Term Project Report
Members of the Group:
Ceyhun Emre Öztürk / Dep.: EE
Ömer Musa Battal / Dep.: EE
Project Phase:
3 (Last Phase)
Introduction:
In this project, we implemented a house price predictor using several machine learning approaches in the Python programming environment. Our dataset was constructed by extracting data from zingat.com and sahibinden.com. We used three different learning methods that were shown in class to predict house prices: Linear Regression, K-means Clustering, and an SNN (Shallow Neural Network). We used K-means Clustering to divide the house notices into groups according to their features; our aim in doing so was to reduce the nonlinearity of our dataset. Then, we applied Linear Regression separately in each cluster.
Changes Made After Phase 2 Report: We decided to delete some of the features. We changed the imputation code so that if the percentage of empty cells in a feature column exceeds 50%, the feature is dropped. We also realized that we had made some mistakes while applying vectorization on the dataset (vectorization means assigning vectors in place of non-numerical feature values). We noticed that the creators of the notices left trailing white spaces at the end of some strings and numbers, so we added a few lines to strip them.
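The whitespace fix amounts to the following (a rough pandas sketch rather than the exact lines of our script):

import pandas as pd

df = pd.read_csv('whole_dataset.csv', delimiter=';')
# strip leading/trailing whitespace from every string-valued column
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.strip()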
Detailed Description of the Methods Used:
1) Feature Extraction:
1.1) Imputation Methods:
Imputation methods are statistical methods used to replace missing values with assigned values. We needed imputation methods because our datasets had missing cells. For example, the owner of a notice may not know whether there is an airport near the house on sale; therefore, they have the option of not giving information about whether there is an airport nearby or not. Another possibility is that the owner of the notice may want to hide some features of the house on sale. We used different types of imputation and combined the datasets with imputed values. This method is known as multiple imputation. Multiple imputation is the most appropriate imputation method for our project because it replaces missing values with values whose variance is larger than the simple residual variance. We wanted large noise in the imputed values because the values that are missing in the datasets are likely actually known by the owners of the notices of the houses on sale; they hide information to make their houses look more valuable.
Researchers who use this method can obtain approximately unbiased estimates of all the parameters. The steps of multiple imputation are given below in order:
- Impute the missing values by using an appropriate model, i.e. a model with random variation.
- Repeat the first step several times; let us assume that we repeat it m times.
- Apply the m models to replace the missing values, obtaining m different datasets.
- Average the m results to get the final values that replace the missing values (Multiple Imputation for Missing Data, Web).
In our project we chose m = 2 (we used at most 2 different imputation methods for each feature). There are 156 features in the original dataset.
Figure 1: Imputation Methods Used for Some of the Features

Feature          | Imputation Method 1 | Imputation Method 2
Age of Building  | Mod                 | Average
District         | Mod                 | Mod
Doorman          | Mod                 | Mod
Dry Cleaning     | Mod                 | Mod
Drywall          | Mod                 | Mod
Dues (TL)        | Mod                 | Average
Elevators        | Mod                 | Mod
Fire Alarm       | Mod                 | Mod
Fitness Center   | False               | False

Mod: assigning the mode of the values of the column (if there are non-numerical values) to the empty cells.
Average: assigning the mean of the values of the column (if all values are numerical) to the empty cells.
False: assigning “False” to the empty cells when it is known that having “True” for that feature is rare.
There were some features in our dataset whose columns were mostly empty. As part of imputation, we deleted any such feature that sounded irrelevant to house prices. We also deleted any feature whose column had more than 50% empty cells. We consulted a civil engineer (Ceyhun Emre Öztürk’s father) beforehand so that we could delete the right features. Consequently, our dataset had 107 features and 2086 notices in total.
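A minimal pandas sketch of the three imputation rules and the 50% drop criterion (the column names are examples from Figure 1; our actual implementation is dataset_imputation.py in the Appendix):

import pandas as pd

df = pd.read_csv('whole_dataset.csv', delimiter=';')

# drop any feature whose column is more than about 50% empty
df = df.dropna(axis=1, thresh=len(df) // 2)

# Mod: fill with the most frequent value of the column
df['District'] = df['District'].fillna(df['District'].mode().iloc[0])
# Average: fill with the mean of a numerical column
df['Dues (TL)'] = df['Dues (TL)'].fillna(df['Dues (TL)'].mean())
# False: fill with False when a True value is known to be rare
df['Fitness Center'] = df['Fitness Center'].fillna(False)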
1.2) Vectorization and Normalization of the Dataset:
We examined most of the values that a notice can have for each feature and thought over what each feature means. The values of some features were suitable for ordering. For example, we knew beforehand that the price of a house decreases as the age of the building increases; therefore, we did not change the values of features like “Age of Building”. However, for some features, different values clearly could not be compared. For example, it is not clear that the price of an apartment flat increases as the “Floor” of the flat increases. Therefore, we created n features to replace the “Floor” feature, where n is the number of different values observed in the “Floor” column of the dataset. For example, the value Floor = 0 got its own feature: if a house was on the ground floor, the cell indexed by that house and the “Floor=0” feature was 1; otherwise (if that house was on another floor), 0 was assigned to that cell. This process was applied to all similar features, and also to features with non-numeric values. For example, the original dataset had a feature called “District” which stores the name of the district of the corresponding house. After vectorization, the resulting dataset had one feature for each possible district, whose cells were either 1 or 0. This process is called “vectorization” since the information of every “district” is stored in an element of a vector instead of a text string or a scalar number. After applying vectorization, we had 444 features in total.
We applied normalization to the dataset so that applying PCA would give us the right results.
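A minimal pandas sketch of this one-hot vectorization and of the normalization step (the column names are examples; the full versions are dataset_vectorization.py and normalization_of_dataset.py in the Appendix):

import pandas as pd

df = pd.read_csv('whole_dataset_imputation.csv', delimiter=';')

# one 0/1 feature per observed value of each non-numeric (or unordered) column
df = pd.get_dummies(df, columns=['Floor', 'District'], prefix_sep='-')

# normalization: zero mean and unit variance for every numeric column
numeric = df.select_dtypes('number')
df[numeric.columns] = (numeric - numeric.mean()) / numeric.std()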
1.3) Principal Component Analysis (PCA):
PCA is a method that can be used to find the most relevant elements of the feature vector and eliminate unnecessary features. The algorithm of PCA is as follows: first, the covariance matrix of the dataset is found; then, the eigenvalues of this matrix are calculated. If there are eigenvalues that are much greater than the other eigenvalues, their corresponding eigenvectors are used to construct the new dataset with reduced dimensions. These eigenvectors are called principal eigenvectors. The formula for this process is given below:

X_new = X * E,

where X is the normalized dataset, E is the matrix consisting of the a principal eigenvectors (we ignored n - a eigenvectors, since the covariance matrix has size n x n), and X_new is the new dataset with the reduced number of features. It should be noted that the features found by PCA are completely different from the initial features. Therefore, they do not directly reflect the categories given in the real estate trade websites; for example, there is no feature representing “the number of rooms” directly anymore. We picked the eigenvectors for the E matrix such that the PVE (proportion of variance explained) by the selected eigenvectors was more than 0.7. After applying PCA, we were left with 170 features (principal components) in total.
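A minimal numpy sketch of this procedure, assuming the normalized dataset is stored in a matrix X with one row per notice (our full implementation is pca_on_normalized_dataset.py in the Appendix; the PVE threshold is a parameter):

import numpy as np

def pca_reduce(X, pve_threshold=0.7):
    # covariance matrix of the normalized (zero-mean) data
    cov = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric matrix -> real eigenpairs
    order = np.argsort(eigvals)[::-1]           # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # keep enough principal eigenvectors to explain the desired proportion of variance
    pve = np.cumsum(eigvals) / np.sum(eigvals)
    k = int(np.searchsorted(pve, pve_threshold)) + 1
    E = eigvecs[:, :k]
    return X @ E                                # X_new = X * E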
2) Learning
For this phase, we worked with three different learning algorithms to obtain two different predictors.
2.1) Clustering and Linear Regression:
2.1.1) Clustering:
Clustering is an unsupervised algorithm that is used to group data points. Our project is a problem that requires supervised regression; however, we found a way of using Clustering to complete our machine learning task. Our purpose was to divide the data into meaningful groups using Clustering and then to apply linear regression separately on these groups to predict the prices of the houses.
The loss function minimized by Clustering is

J = sum over all data points i and all clusters k of r_ik * d(x_i, mu_k),

where d(x_i, mu_k) is a general dissimilarity measure between x_i (the location of the ith data point) and mu_k (the location of the centroid of the kth cluster), and r_ik is the indicator of the relationship between the ith data point and the kth cluster. This indicator is 1 only when the ith data point is assigned to the kth cluster; otherwise, it is 0. The loss function is minimized by applying two steps iteratively. At first, initial values are assigned to the locations of the cluster centroids. Then, the two steps, expectation and maximization, are applied. In the expectation step, the loss function is minimized with respect to the r_ik's. In the maximization step, the loss function is minimized with respect to the mu_k's.
This algorithm is guaranteed to reach a local minimum. To reach the local minimum with the least loss, we ran the algorithm 10 times, selecting a different random initial location for each centroid at each run. After the 10 runs, the run with the least loss was kept. We wrote a loop such that if the loss J exceeds 400000, 10 new runs are performed. The threshold for the loss was derived experimentally (before adding the loop) by looking at the runtime results of the Clustering script.
Before the Phase 2 Demo, we validated the Clustering methods (K-means Clustering and K-medoids Clustering with weighted Euclidean distance) with 5 different random initial locations for each centroid. We evaluated RSS/TSS for each implementation to compare the coefficients of determination of the K-means Clustering model and the K-medoids Clustering model.
2.1.1.1) k-means Clustering:
This is the first type of Clustering that we implemented. For this type of Clustering, the dissimilarity measure is the squared Euclidean distance:

d(x_i, mu_k) = ||x_i - mu_k||^2.

For this equality, we have the following solution for the expectation step: r_ik = 1 if k = argmin_j ||x_i - mu_j||^2, and r_ik = 0 otherwise. In the maximization step, each centroid mu_k is set to the mean of the data points currently assigned to cluster k.
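A minimal numpy sketch of k-means with the two alternating steps and random restarts, written as an illustration of the definitions above rather than as a copy of the clustering.py script in the Appendix:

import numpy as np

def k_means(X, k, n_restarts=10, rng=np.random.default_rng(0)):
    best_loss, best_assign, best_centroids = np.inf, None, None
    for _ in range(n_restarts):
        centroids = X[rng.choice(X.shape[0], k, replace=False)]  # random initial centroids
        while True:
            # expectation: assign each point to its nearest centroid
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            assign = dists.argmin(axis=1)
            # maximization: move each centroid to the mean of its assigned points
            new_centroids = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                      else centroids[j] for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        loss = dists[np.arange(X.shape[0]), assign].sum()
        if loss < best_loss:
            best_loss, best_assign, best_centroids = loss, assign, centroids
    return best_assign, best_centroids, best_loss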
2.1.2) Linear Regression:
We applied linear regression to the clusters that were found using K-means Clustering. In reality, the properties of houses may have a nonlinear effect on their prices; however, linear regression can still be applied to nonlinear data. The advantage of using linear regression instead of nonlinear regression is that fitting a linear curve is simpler than fitting a nonlinear one. Also, the calculation of loss functions is easier with linear regression.
When using linear regression with nonlinear data, we need to use the following formula:

y = Beta_0 + Beta_1 * x_1^(m_1) + Beta_2 * x_2^(m_2) + ... + Beta_n * x_n^(m_n),

where y is the output, Beta_0 is the linear regression bias, x_i is the ith element of the feature vector (which has n elements in total), Beta_i is the weight for the ith element of the feature vector, and m_i is the power of the ith element of the feature vector. After choosing the m_i's, we are left with linear regression. The drawback of this procedure is that the terms depending on the x_i's may be insufficient in terms of complexity. We applied Ridge regularization to our loss function to avoid overfitting.
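The closed-form Ridge solution has the form w = (X^T X + lambda*I)^(-1) X^T y; a minimal numpy sketch is given below (it mirrors the ridge_regression routine of linear_regression.py in the Appendix; the cluster matrices and the lambda value in the usage comment are placeholders):

import numpy as np

def ridge_regression(X, y, lambda_reg=0.0):
    # closed-form ridge solution: w = (X^T X + lambda I)^(-1) X^T y
    XtX = X.T @ X
    w = np.linalg.inv(XtX + lambda_reg * np.identity(XtX.shape[0])) @ X.T @ y
    return w

# usage: fit on one cluster, then predict on its test half
# w = ridge_regression(X_cluster_train, y_cluster_train, lambda_reg=190)
# y_pred = X_cluster_test @ w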
2.2) Shallow Neural Network:
We used a shallow neural network as another learning algorithm for our purpose. Shallow neural networks are simply neural networks with one hidden layer, unlike deep neural networks, which may have many hidden layers. A feed forward neural network consists of layers of perceptrons, with the perceptrons of consecutive layers interconnected. All connections have associated weights, and each perceptron has its own bias. A perceptron applies a nonlinear activation function to the linear combination of the outputs of the previous layer plus its bias, and passes its output to the perceptrons of the next layer. This way, it somewhat simulates a neuron, and the overall network may be able to predict the output better than regular regression. The following figure illustrates the general structure of a feed forward neural network. The activations are computed layer by layer as

a' = σ(w * a + b),

where a' denotes the result of the current activation layer, a is the previous layer's activations, σ is the activation function, w is the weights of the previous layer, and b is the biases of the previous layer. The choice of the activation function is important for reasons that will be apparent later.
The neural network is trained by repeated applications of feed forward and back propagation. The input is given to the network, and the predetermined loss function is computed based on the output. Then, all the weights of the connections between the layers are updated through back propagation to minimize the loss function. The loss function, also called the cost function, has the following form:

C = sum over all input data x_i of l(a(x_i), y_i),

where l(.,.) is a measure of the error of the network output a(x_i) for input data x_i with true value y_i. The loss function is the sum of all such errors. The loss function can be chosen arbitrarily in this form; however, some established cost functions exist, which possibly provide better learning. One such loss function is the quadratic loss function, which is suitable for regression. It is given below:

l(a(x_i), y_i) = (1/2) * ||y_i - a(x_i)||^2.
To minimize the loss, gradient descent or stochastic gradient descent methods are used. Repetition of this process minimizes the error, hopefully at a global minimum. To ensure that the network is not stuck at a local minimum, stochastic methods may be used or the initial weights may be chosen differently.
Stochastic gradient descent (SGD) applies the gradient descent algorithm to randomly selected mini batches of the input data. It thus reduces the likelihood of getting stuck at a local minimum, while still representing the dataset well in a probabilistic sense. It has another favorable consequence: it enables the network to train faster, since the entire input is not used for a single mini batch back propagation. These factors make stochastic gradient descent a useful method of training.
Moreover, neural networks may be prone to overfitting to the training data, in which case the results cannot be generalized well to real life situations. To avoid this, regularization methods such as weight decay may be used. The update equations for gradient descent with weight decay in SGD are as follows:

w ← (1 - η*λ/n) * w - (η/m) * sum over the mini batch of ∂C_x/∂w,
b ← b - (η/m) * sum over the mini batch of ∂C_x/∂b.

In the above equations, η is the learning rate, λ is the regularization parameter, n is the size of the training set, m is the mini batch size, and C_x is the cost for a single training example x.
Although neural networks should theoretically be very capable, they also have some drawbacks. One such drawback is the vanishing gradient problem, which results from the gradient of the activation functions vanishing during back propagation computations. In that case, the network gets stuck and cannot learn further. Moreover, the activation function should allow small changes in the output when the weights are changed slightly, since this is the underlying principle of the gradient descent algorithm. The choice of the activation function is therefore essential: it should be a smooth function and preferably should not have a vanishing gradient. One such function is the sigmoid function, which is commonly used in neural networks, and we used it as well. Its derivative is also easy to compute, which is convenient for back propagation computations. The sigmoid function and its derivative are given below:

σ(z) = 1 / (1 + e^(-z)),  σ'(z) = σ(z) * (1 - σ(z)).
There is also some terminology regarding neural networks, particularly the SGD algorithm. A mini batch is a randomly selected subset of the training data that is used in one update of the SGD algorithm. When the entire training dataset has been covered by these mini batches, one epoch of training is said to be completed.
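The following is a minimal numpy sketch of one SGD update on a single training example for a network with one hidden layer, using the sigmoid activation, the quadratic cost and the weight decay rule described above (the layer shapes and hyperparameter values are placeholders, not those of our final model):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def sgd_step(x, y, W1, b1, W2, b2, eta, lam, n_train):
    # feed forward
    z1 = W1 @ x + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    # back propagation for the quadratic cost (1/2)*||y - a2||^2
    delta2 = (a2 - y) * sigmoid_prime(z2)
    delta1 = (W2.T @ delta2) * sigmoid_prime(z1)
    # gradient descent step with weight decay
    W2 = (1 - eta * lam / n_train) * W2 - eta * np.outer(delta2, a1)
    b2 = b2 - eta * delta2
    W1 = (1 - eta * lam / n_train) * W1 - eta * np.outer(delta1, x)
    b1 = b1 - eta * delta1
    return W1, b1, W2, b2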
3) Validation
3.1) k-fold Cross Validation:
This technique is robust for providing randomness in validation. The algorithm of k-fold cross validation is given as follows (a short sketch follows the list):
- Randomly split the dataset into k equal sized folds.
- Train the ith model on all folds except fold i.
- Validate the ith model using fold i.
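A minimal numpy sketch of the k-fold procedure (the train and validate functions are hypothetical placeholders):

import numpy as np

def k_fold_indices(n_samples, k, rng=np.random.default_rng(0)):
    indices = rng.permutation(n_samples)       # random split
    return np.array_split(indices, k)          # k (almost) equal sized folds

# usage sketch:
# folds = k_fold_indices(len(X), k=5)
# for i, val_idx in enumerate(folds):
#     train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
#     model_i = train(X[train_idx], y[train_idx])            # hypothetical
#     score_i = validate(model_i, X[val_idx], y[val_idx])    # hypothetical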
Description of the Dataset:
Our dataset was constructed by extracting data from zingat.com and sahibinden.com. The features were created using the features listed in the user interfaces of those sites. There were many empty cells in our dataset because the two sites have different standards for the features of the houses. For imputation, we used the mode and average values of the filled cells. Also, we deleted some of the features, since most of the houses did not have any information about them and they sounded irrelevant to the civil engineer whom we consulted. Then, we assigned numerical values to each cell that originally stored non-numerical values. After that, we normalized the data and applied PCA on the dataset. The resulting dataset was used as the input to all of our learning methods.
Simulation and Performance of the Training Methods:
1) Clustering:
We did not actually use Clustering for training in our final implementation. We used Clustering without making use of the prices of the houses, and we did not make price predictions with Clustering. Instead, we used Clustering to find similar house notices and to group them. Then, we applied Linear Regression on these groups to predict the house prices.
Before the Phase 2 Demo, we validated the Clustering methods (K-means Clustering and K-medoids Clustering with weighted Euclidean distance) with 5 different random initial locations for each centroid. We evaluated RSS/TSS for each implementation to compare the coefficients of determination of the K-means Clustering model and the K-medoids Clustering model. We reported in our Phase 2 Report that we used 10-fold cross validation to validate these models; however, we made some changes in the code before the Phase 2 Demo to create better testing conditions for the accuracy of the Clustering models. The RSS/TSS ratios for the cross-validation models are shown in the figure below:
The coefficient of determination is equal to 1 - RSS/TSS. Therefore, we observed from the figure above that the K-means Clustering model is more accurate. As a result, we used only the K-means Clustering method as our Clustering method in the final implementation. After deciding on the model to use, we ran the algorithm 10 times (instead of 5 times), selecting a different random location for each centroid at each run. We displayed the training time for each run in the console output. The average training time was found to be 2.851 seconds. This “training time” is actually the execution time of the algorithm; we kept calling it the training time because, until the Phase 2 Demo, we had used the Clustering model as a learning method to directly find house price predictions. Before the demo, we were not making use of Linear Regression on the clusters.
2) Linear Regression:
The value that we optimized was λ (the coefficient of Ridge regularization). For this phase, we applied linear regression to our data without applying any feature transformation. Therefore, the following version of the formula given in Section 2.1.2, with all m_i equal to 1, was used:

y = Beta_0 + Beta_1 * x_1 + Beta_2 * x_2 + ... + Beta_n * x_n.

We found by trial and error that using 3 clusters gives the best accuracy for the predictions made by linear regression. Linear Regression with Ridge Regularization was applied on the 3 clusters separately, and therefore 3 different optimal λ values were found. We used 5-fold cross validation to find the optimal λ value for each cluster. The results are given below:
Average training time for cluster 0: 0.0506 s. Optimal lambda value for cluster 0: 190.
Average training time for cluster 1: 0.079 s. Optimal lambda value for cluster 1: 320.
Average training time for cluster 2: 0.0534 s. Optimal lambda value for cluster 2: 240.
The coefficient of determination was defined as the average of the coefficient of determination values of the cross validation models. The coefficient of determination for Regularized Linear Regression was found to be 0.715. After finding the optimal λ values, we shuffled the house notices and then separated each cluster into two groups with the same number of members. One of the groups was used for training with the corresponding optimal λ value, and the other group was used for testing. We used the test set to calculate the probability of predicting the price within the given accuracy ranges. Our results are given below:
Probability of making a prediction in the range [0.9 × (real price), 1.1 × (real price)] = 0.217.
Probability of making a prediction in the range [0.8 × (real price), 1.2 × (real price)] = 0.432.
Probability of making a prediction in the range [0.7 × (real price), 1.3 × (real price)] = 0.592.
We thought that we could increase the accuracy of the algorithm by using the Ada-Boost method. To implement the Ada-Boost algorithm, we needed to create subsets of our clusters to be used in training instead of the clusters themselves. However, we did not have a sufficient number of house notices to work with. Our algorithm allowed clusters with at least 300 house notices, and we had 170 features, which means that the ratio of the number of data points to the number of features was already small, which may lead to inaccuracies. Therefore, we had no tolerance for working with a smaller dataset. That is the reason why we did not use the Ada-Boost algorithm in our first predictor model (the Clustering + Regularized Linear Regression model).
3) Shallow Neural Network:
In our implementation, we first defined various useful activation functions, together with their derivatives for the back propagation calculations. We also defined the quadratic and cross entropy cost functions. Then we defined a neural network class, which holds the perceptron weights and biases. The implementation of this class was inspired by an online source. It has a prominent SGD function, which utilizes a mini batch update function, which in turn uses the back propagation function. The SGD function can be configured to monitor the accuracy and the total cost of the neural network on a given test dataset, which is not used in training. The algorithm can be terminated prematurely if the accuracy on the test dataset does not change for a set number of epochs of training. At the end of the training, the performance of the network can be evaluated on another validation dataset, which is separate from the test dataset. As a performance measure, we used a method resembling a confidence interval: we compared the results of the neural network with the actual values and computed the statistic of the prediction being within a certain percentage interval of the true value. The driver program is configured to save the state of the neural network object and load this object in the next run, enabling incremental training from one run to the next.
The activation function, the number of layers and the perceptron count at each layer, the number of epochs, the mini batch size, the learning rate, the regularization method and the regularization coefficient were the hyperparameters of our learning algorithm. We adjusted these hyperparameters based on the obtained accuracy; that is, we tried to find the hyperparameters which maximized our prediction accuracy. We applied a heuristic approach for this purpose and tried various combinations until we settled on the current model. It may be argued that our model may not be optimal after this procedure, but it was the result of our best effort. The driver program could be modified to exhaustively search through all hyperparameter combinations, training the network from scratch for each combination. However, the amount of computation required would be infeasible, so we did not use such an approach.
One implementation detail is that our neural network does regression, but the output layer does not have a linear activation function. Instead, the real targets were scaled down when fed to the network, and the network outputs were scaled back up to obtain the real predictions. We resorted to this method because the linear output layer was causing numerical instability in the back propagation computations. This was not a major issue, since house prices are always positive and have a reasonable upper bound. Even if we had negative values in our regression problem, we could circumvent this by using a bias at the output, eliminating the problem.
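A minimal sketch of this scaling idea, assuming a sigmoid output unit in (0, 1) and a hypothetical upper bound on house prices:

PRICE_UPPER_BOUND = 10_000_000  # hypothetical upper bound in TL

def scale_price(price):
    # map a positive price into (0, 1) so a sigmoid output layer can represent it
    return price / PRICE_UPPER_BOUND

def unscale_prediction(network_output):
    # map the network output back to a price in TL
    return network_output * PRICE_UPPER_BOUND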
The total execution time of the SNN algorithm was 1023.20 seconds. The coefficient of determination for the test set was found to be 0.647, and the coefficient of determination for the training set was found to be 0.956. We used the test set to calculate the probability of predicting the price within the given accuracy ranges. Our results are given below:
Probability of making a prediction in the range [0.9 × (real price), 1.1 × (real price)] = 0.313.
Probability of making a prediction in the range [0.8 × (real price), 1.2 × (real price)] = 0.582.
Probability of making a prediction in the range [0.7 × (real price), 1.3 × (real price)] = 0.779.
As a result, we can say that for predicting house prices in Ankara, the SNN is a more effective learning method than the method we created by combining Clustering and Ridge Regularized Linear Regression.
Conclusion:
In the Expected Challenges section of our Phase 1 report, most of the problems which we stated we would have to counter were about data extraction and data cleaning. Our prediction was correct. We faced the following problems in the first phase of the project:
1) Our web crawler program crashed because the websites' protection tools denied our requests for HTML after the program had been extracting HTML from those websites for a certain amount of time.
2) We had to correct the names of the features, because house notices from zingat.com had Turkish labels whereas house notices from sahibinden.com had English labels. We converted all label names to English. Also, there were some features irrelevant to the houses of interest. For example, one of the extracted features was "closeness to Bosphorus". However, the Bosphorus is a strait in İstanbul, whereas we were interested only in houses from Ankara. The sad fact about this situation is that there were some notices where the house was located in Ankara but the owner of the notice claimed that the house was close to the Bosphorus by filling the corresponding cell with the "True/Yes" value. This made us question the reliability of the features of house notices.
Because of these problems, creating the dataset that we desired took several weeks. Therefore, we were able to create a dataset with only 2086 house notices. If we had more house notices, we could apply Ada-Boost learning to both of our learning models to improve their accuracy. We were not able to implement Ada-Boosting since we did not have a sufficient number of house notices to work with.
Ada-Boost is the abbreviation of Adaptive Boosting. In Adaptive Boosting, some data points are selected randomly from the training set for training. Then, the whole training set is used for testing. Then, some data points are again selected randomly from the training set for training; this time, each data point that has been predicted poorly has a higher probability of being selected than the other data points. The number of iterations of this training-testing sequence is fixed and chosen by the programmer.
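A minimal sketch of this resampling idea for regression (the subset size and the weight update rule are simple illustrative choices rather than the exact Ada-Boost equations; base_learner stands for a hypothetical fit routine, such as our ridge regression, that returns a model with a predict method):

import numpy as np

def boosted_resampling(X, y, base_learner, n_rounds=10, rng=np.random.default_rng(0)):
    n = len(y)
    weights = np.full(n, 1.0 / n)            # start with uniform selection probabilities
    models = []
    for _ in range(n_rounds):
        # sample a training subset according to the current selection probabilities
        idx = rng.choice(n, size=n // 2, replace=True, p=weights)
        model = base_learner(X[idx], y[idx])
        models.append(model)
        # increase the weight of poorly predicted points (illustrative rule)
        errors = np.abs(y - model.predict(X))
        weights = weights * (1.0 + errors / (errors.max() + 1e-12))
        weights = weights / weights.sum()
    return models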
One other problem is that we found the optimal number of clusters for implementing “Clustering + Ridge Regularized Linear Regression” on our fixed dataset to be 3 by trial and error. However, for a different dataset consisting of other house notices, the optimal number of clusters may differ. We need a way to find the optimal number of clusters automatically so that our Machine Learning approach can be used in a commercial application.
Statistics for the accuracy of our learning models are given below for convenience:
For Clustering + Ridge Regularized Linear Regression:
Coefficient of Determination = 0.715.
Probability of making a prediction in the range [0.9 × (real price), 1.1 × (real price)] = 0.217.
Probability of making a prediction in the range [0.8 × (real price), 1.2 × (real price)] = 0.432.
Probability of making a prediction in the range [0.7 × (real price), 1.3 × (real price)] = 0.592.
For Shallow Neural Network:
Coefficient of Determination = 0.647.
Probability of making a prediction in the range [0.9 × (real price), 1.1 × (real price)] = 0.313.
Probability of making a prediction in the range [0.8 × (real price), 1.2 × (real price)] = 0.582.
Probability of making a prediction in the range [0.7 × (real price), 1.3 × (real price)] = 0.779.
We thought about ways to increase the accuracy of our Machine Learning approach and to make it more suitable for commercial applications (in the form of a website or a mobile app). We propose the following ways which, in our opinion, would be useful:
1) We can get permission from the Real Estate Trade Websites to fetch their HTML content faster. They could fund us in order to profit from the commercial application.
2) If the owners of a Real Estate Trade Website make a deal with us as proposed in the first clause, they could adjust their user interface so that the labels we extract from that Website are more relevant to the 107 features that were left after imputation. We believe that those 107 features are sufficient to predict house prices in Ankara, since we found those features using two different Websites. We think that we can generalize our model by making use of these fixed features in the commercial application. One exception might be adding the listing date and the inflation rate as features. By adding these features, we can incorporate into our learning models the effect of price changes over time.
3) There were conflicts between labels coming from different Websites, and sometimes duplicate labels occurred. To solve this problem, our commercial application would make several predictions for the price of a house, such that each prediction makes use of a dataset constructed using only one Website as its source. This way, we can avoid conflicts between labels. The user of the application will have several predictions at hand to make use of.
4) Natural language processing approaches can be utilized to map the feature names of house notices to the desired names (to one of the names of the fixed 107 features). A feature will be discarded if it cannot be accurately represented as one of the fixed 107 features.
5) To find the optimal number of clusters for our Clustering algorithm, cross-validation will be used, and several values will be tested as the number of clusters. This way, the optimal number of clusters will be found automatically.
Works Cited:
“Multiple Imputation for Missing Data.” Statistics Solutions, www.statisticssolutions.com/multiple-imputation-for-missing-data/.
Appendix:
Python Scripts: crawler.py is responsible for extracting data from a targeted website. dataset_imputation.py is responsible for imputing the empty values of the dataset. dataset_vectorization.py is responsible for assigning numbers in place of non-numeric data. normalization_of_dataset.py centers and scales the data. pca_on_normalized_dataset.py applies PCA on the normalized data. clustering.py applies K-means Clustering on our dataset to divide it into separate groups based on similarity; our aim in doing so was to reduce the nonlinearity of our dataset. linear_regression.py is responsible for training and testing a linear regression model with Ridge regularization. neural_network.py is responsible for training and testing a Shallow Neural Network model. neural_network_driver.py is the script that contains the Neural Network model. miscellaneous.py is responsible for measuring how much time the algorithms inside the neural_network.py script consume.
crawler.py:
# installed libraries
from bs4 import BeautifulSoup
import pandas as pd
import requests
# built in libraries
import json
import logging
import random
import platform
import re
import subprocess
import time
# self written code
import data_extractors as data_ex


# give beep sound
def os_beep():
    if 'Linux' in platform.system():
        os_sound_name = 'alarm-clock-elapsed'
        subprocess.call(['/usr/bin/canberra-gtk-play', '--id', os_sound_name])
    elif 'Windows' in platform.system():
        print('to beep, or not to beep... That is the question.')
        time.sleep(2)


# get some free proxies from given url
def get_proxies(proxy_url='https://free-proxy-list.net/') -> list:
    # list to keep all proxies
    proxies = []
    response = persistent_request(proxy_url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
    # extract proxies table
    proxies_table_tag = soup.find('div', {'class': 'table-responsive'})
    proxy_tag_list = proxies_table_tag.find_all('tr')
    # determine proxy info key names
    keynames = []
    for keyname_tag in proxy_tag_list[0].find_all('th'):
        keynames.append(keyname_tag.text.strip())
    # add proxy information to proxies list
    for proxy_tag in proxy_tag_list:
        # get only elite proxies
        if proxy_tag.find(string='elite proxy'):
            proxy_info_list = proxy_tag.find_all('td')
            proxy = {}
            key_index = 0
            for proxy_info in proxy_info_list:
                proxy[keynames[key_index]] = proxy_info.text.strip()
                key_index = key_index + 1
            proxies.append(proxy)
    # return num_of_proxies_to_return proxies at maximum
    num_of_proxies_to_return = 20
    if len(proxies) >= num_of_proxies_to_return:
        proxy_list = random.sample(proxies, num_of_proxies_to_return)
        # proxy_list = proxies[:num_of_proxies_to_return]
        return proxy_list
    else:
        return proxies


# try a persistent http connection
def persistent_request(url: str, headers={}, proxies={}, timeout=0):
    while True:
        try:
            http_response = requests.get(url, headers=headers, proxies=proxies, timeout=timeout)
            # check if http response status code is not 200
            if http_response.status_code != 200:
                print('http connection not successful.')
                print('url: ' + http_response.url)
                print('http status code: ' + str(http_response.status_code))
                # act based on the http status code itself
                if http_response.status_code == 404:
                    print('Page not found errors are to be skipped.')
                    break
                else:
                    raise Exception('http connection error.')
        except Exception as ex:
            print('An exception occurred: ')
            print(ex)
            print('Retrying...')
            continue
        else:
            break
    return http_response


# try a persistent http connection with proxy changes
def persistent_proxy_request(url: str, headers={}, proxy_list=[], proxy_index=0, timeout=0):
    while True:
        try:
            print('Using proxy from the proxy set: ' + proxy_list[proxy_index])
            requests_proxies = {
                'http': proxy_list[proxy_index],
                'https': proxy_list[proxy_index],
            }
            http_response = requests.get(url, headers=headers, proxies=requests_proxies, timeout=timeout)
            # check if http response status code is not 200
            if http_response.status_code != 200:
                print('http connection not successful.')
                print('url: ' + http_response.url)
                print('http status code: ' + str(http_response.status_code))
                # act based on the http status code itself
                if http_response.status_code == 404:
                    print('Page not found errors are to be skipped.')
                    break
                else:
                    raise Exception('http connection error.')
        except Exception as ex:
            print('An exception occurred: ')
            print(ex)
            proxy_index += 1
            if proxy_index < 0 or proxy_index >= len(proxy_list):
                print('Proxy index is out of range.')
                proxy_index = 0
                print('Retrieving another set of proxies...')
                proxy_list = []
                for proxy_dict in get_proxies():
                    proxy = proxy_dict.get('IP Address') + ':' + proxy_dict.get('Port')
                    proxy_list.append(proxy)
                print('New proxy list: ')
                print(proxy_list)
            print('Retrying http connection...')
            continue
        else:
            break
    return {'http_response': http_response, 'proxy_list': proxy_list, 'proxy_index': proxy_index}


# main crawler body for all websites
def crawler_main(webpage_name: str, main_notices_page_no_interval: list):
    # time delay intervals in seconds to avoid being banned by website
    # TODO how long to wait???
    delay_between_notices_interval = [0, 0]
    delay_between_notice_sets_interval = [0, 0]
    delay_between_main_pages_interval = [0, 0]
    # http timeout
    requests_timeout = 5
    # user agents used in http requests
    requests_user_agents_list = ['Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36']
    # headers for http requests, needed to get response from some sites
    requests_headers = {'user-agent': requests_user_agents_list[0]}
    # websites to check visible ip address
    # ip_test_url = ['https://httpbin.org/ip', 'http://ipecho.net/plain']
    # initialize a list of usable free proxies
    proxies = []
    for proxy_dict in get_proxies():
        proxy = proxy_dict.get('IP Address') + ':' + proxy_dict.get('Port')
        proxies.append(proxy)
    print('proxy list: ')
    print(proxies)
    proxy_index = 0
    # check page interval bounds
    if main_notices_page_no_interval[0] < 1:
        print('Invalid start page number.')
        return False
    if main_notices_page_no_interval[1] < 1:
        print('Invalid end page number.')
        return False
    # list to contain all house dictionaries
    # uses too much RAM???
    houses_info_list = []
    # initialize the obtained data range
    # this information is used for storage file naming
    # decoded as follows: (start_page_no)-(start_notice_no)_(end_page_no)-(end_notice_no)
    retrieved_data_start_index = str(main_notices_page_no_interval[0]) + '-1'
    retrieved_data_end_index = str(main_notices_page_no_interval[0]) + '-0'
    retrieved_data_index_range = '[' + retrieved_data_start_index + '_' + retrieved_data_end_index + ']'
    # start main try block
    try:
        # list to contain all house notice links over all pages
        house_notice_url_list = []
        # loop over notices main pages
        for main_notices_page_no in range(main_notices_page_no_interval[0], main_notices_page_no_interval[1] + 1):
            # get website info list
            website_info_list = data_ex.get_webpage_info(webpage_name, main_notices_page_no)
            # get the webpage main url
            webpage_main = website_info_list[0]
            # get website customized main notices page url for given page number
            main_notices_page_url = website_info_list[1]
            # get the search term to use for finding notice links
            notice_links_search_term = website_info_list[2]
            # get the data extractor to be used
            data_extractor = website_info_list[3]
            # the http response object from main notices page
            main_notices_page_response_list = persistent_proxy_request(
                main_notices_page_url, headers=requests_headers,
                proxy_list=proxies, proxy_index=proxy_index, timeout=requests_timeout)
            main_notices_page_response = main_notices_page_response_list['http_response']
            proxies = main_notices_page_response_list['proxy_list']
            proxy_index = main_notices_page_response_list['proxy_index']
            # get beautifulsoup object
            main_notices_page_soup = BeautifulSoup(main_notices_page_response.text, 'html.parser')
            # wait a random amount of time so that the website does not ban us
            main_pages_time_delay = random.uniform(delay_between_main_pages_interval[0],
                                                   delay_between_main_pages_interval[1])
            print('Waiting time before new notice main page request: ' + f"{main_pages_time_delay:.3f}" + ' s.')
            time.sleep(main_pages_time_delay)
            # loop over each house notice link
            for notice_page_link in main_notices_page_soup.find_all('a', notice_links_search_term):
                # get the notice page
                notice_page_url = webpage_main + notice_page_link['href']
                # add it to the complete list
                house_notice_url_list.append(notice_page_url)
            # inform the user about current status
            print('Adding retrieved notice urls to the list...')
            print('Current number of total notices: ' + str(len(house_notice_url_list)))
        # index to keep number of notices processed in one page
        page_notice_no = 0
        # index to keep main notices page count
        main_notices_page_no = 1
        # process each house notice url
        for notice_page_url in house_notice_url_list:
            # the http response object for one notice page
            notice_page_response_list = persistent_proxy_request(
                notice_page_url, headers=requests_headers,
                proxy_list=proxies, proxy_index=proxy_index, timeout=requests_timeout)
            notice_page_response = notice_page_response_list['http_response']
            proxies = notice_page_response_list['proxy_list']
            proxy_index = notice_page_response_list['proxy_index']
            # wait a random amount of time so that the website does not ban us
            if page_notice_no % 10 == 0:
                notice_sets_time_delay = random.uniform(delay_between_notice_sets_interval[0],
                                                        delay_between_notice_sets_interval[1])
                print('Extra waiting time before new notice page request: '
                      + f"{notice_sets_time_delay:.3f}" + ' s.')
                time.sleep(notice_sets_time_delay)
            # get beautifulsoup object
            notice_page_soup = BeautifulSoup(notice_page_response.text, 'html.parser')
            #### Code to extract info for one house from notice page
            # dictionary to hold information of one house
            house_info_dict = {}
            # add notice url to the house info dictionary
            keyname = 'URL'
            valname = notice_page_url
            house_info_dict[keyname] = valname
            ## Custom extraction code for one notice webpage
            data_extractor(notice_page_soup, house_info_dict)
            # append the house info dictionary to the complete list
            houses_info_list.append(house_info_dict)
            # notify progress on terminal
            page_notice_no += 1
            print('Retrieved data: Page ' + str(main_notices_page_no) +
                  ', Notice ' + str(page_notice_no) + ', url=' + notice_page_url)
            # wait a random amount of time so that the website does not ban us
            notices_time_delay = random.uniform(delay_between_notices_interval[0],
                                                delay_between_notices_interval[1])
            print('Waiting time before new notice page request: ' + f"{notices_time_delay:.3f}" + ' s.')
            time.sleep(notices_time_delay)
            # determine the obtained data range
            # this information is used for storage file naming
            # decoded as follows: (start_page_no)-(start_notice_no)_(end_page_no)-(end_notice_no)
            retrieved_data_end_index = str(main_notices_page_no) + '-' + str(page_notice_no)
            retrieved_data_index_range = '[' + retrieved_data_start_index + '_' + retrieved_data_end_index + ']'
            # keep the index of main notices page
            notices_per_main_page = 50
            if page_notice_no % notices_per_main_page == 0:
                main_notices_page_no += 1
                page_notice_no = 0
    except:
        print('Exception caught, writing rescued data to file...')
        # rescue the dataset file
        # create the dataset
        house_dataset = pd.DataFrame(houses_info_list)
        # save rescued dataset to file
        house_dataset.to_csv(webpage_name + '_dataset_' + retrieved_data_index_range + '_rescue.csv', sep=';')
        # rescue the obtained house links
        # create the links dataset
        house_links_dataset = pd.DataFrame(house_notice_url_list)
        # save rescued dataset to file
        house_links_dataset.to_csv(webpage_name + '_links_dataset_'
                                   + retrieved_data_index_range + '_rescue.csv', sep=';')
        # display exception logs
        logging.getLogger().error('Some exception occurred during crawling.', exc_info=True)
    else:
        print('No exception occurred during crawling, writing collected data to file...')
        # create the dataset
        house_dataset = pd.DataFrame(houses_info_list)
        # save dataset to file
        house_dataset.to_csv(webpage_name + '_dataset_' + retrieved_data_index_range + '.csv', sep=';')
        # create the links dataset
        house_links_dataset = pd.DataFrame(house_notice_url_list)
        # save links dataset to file
        house_links_dataset.to_csv(webpage_name + '_links_dataset_'
                                   + retrieved_data_index_range + '.csv', sep=';')
    finally:
        # make sound when code finishes execution, due to any reason
        try:
            print('Use CTRL+C to stop alert sound.')
            for _1 in range(2):
                os_beep()
        except KeyboardInterrupt:
            print('Keyboard interrupt.')
        finally:
            print('Finished execution.')


# main program start
crawler_main('sahibinden', [1, 101])
# crawler_main('zingat', [1, 3])
dataset_imputation.py:
import numpy as np
import pandas as pd

print('Program start.')
dataset_file = 'whole_dataset.csv'
imputation_method_file = 'applied_imputation_methods.csv'
dataset_frame = pd.read_csv(dataset_file, delimiter=';')
imputation_method_frame = pd.read_csv(imputation_method_file, delimiter=';')
dataset_frame = dataset_frame.iloc[:, 1:]
dataset_frame = dataset_frame.replace('Yes', True)
dataset_frame = dataset_frame.replace('No', False)
dataset_frame = dataset_frame.replace('-', None)
dataset_frame = dataset_frame.replace('Not Specified', None)
num_of_nulls = dataset_frame.isnull().sum(axis=0)
for imputation_method_colname in imputation_method_frame.columns[1:]:
    # Drop column if data in column is too empty
    if num_of_nulls[imputation_method_colname] > dataset_frame.shape[0] / 2:
        print('Dropping column due to lack of data...')
        dataset_frame = dataset_frame.drop(columns=[imputation_method_colname])
        imputation_method_frame = imputation_method_frame.drop(columns=[imputation_method_colname])
mods_table = dataset_frame.mode()
mods = mods_table.iloc[0]
medians = dataset_frame.median()
# print(medians)
for imputation_method_colname in imputation_method_frame.columns[1:]:
    # Strip trailing whitespaces from the data
    print(dataset_frame[imputation_method_colname].dtype)
    if not np.isreal(dataset_frame[imputation_method_colname][0]):
        print('Stripping trailing whitespace...')
        dataset_frame[imputation_method_colname] = dataset_frame[imputation_method_colname].str.strip()
    imputation_method_col = imputation_method_frame[imputation_method_colname]
    dataset_col_mod_index = dataset_frame.columns.get_loc(imputation_method_colname)
    imputation_method = imputation_method_col[0]
    # Debugging
    print(imputation_method_colname)
    print(dataset_frame.columns[dataset_col_mod_index])
    print(mods[dataset_col_mod_index])
    print(dataset_col_mod_index)
    fillna_value = None
    if imputation_method == 'Mod':
        fillna_value = mods[dataset_col_mod_index]
    elif imputation_method == 'Average':
        if imputation_method_colname in medians.keys():
            fillna_value = medians[imputation_method_colname]
        else:
            fillna_value = mods[dataset_col_mod_index]
    elif imputation_method == False:
        fillna_value = False
    if imputation_method_col[1] == 'Average':
        if imputation_method_colname in medians.keys():
            # Debugging
            print(medians[imputation_method_colname])
            print(type(medians[imputation_method_colname]))
            print('########')
            print(mods[dataset_col_mod_index])
            print(type(mods[dataset_col_mod_index]))
            # average the median with the first mode value that can be cast to int
            for mod_index in range(len(mods_table.iloc[0])):
                try:
                    mod_to_use = mods_table.iloc[mod_index][dataset_col_mod_index]
                    fillna_value = (int(medians[imputation_method_colname]) + int(mod_to_use)) / 2
                    break
                except ValueError as valueErr:
                    continue
    dataset_frame[imputation_method_colname] = dataset_frame[imputation_method_colname].fillna(fillna_value)
for dataset_colname in dataset_frame.columns:
    # print(dataset_frame[dataset_colname])
    if dataset_frame[dataset_colname].dtype == 'float64':
        dataset_frame[dataset_colname] = dataset_frame[dataset_colname].astype(int)
dataset_frame = dataset_frame.fillna(dataset_frame.mode().iloc[0])
dataset_frame.to_csv('whole_dataset_imputation.csv', sep=';')
print('Program end.')
dataset_vectorization.py:
import numpy as np
import pandas as pd

print('Program start.')
dataset_file = 'whole_dataset_imputation.csv'
dataset_frame = pd.read_csv(dataset_file, delimiter=';')
dataset_frame = dataset_frame.iloc[:, 1:]
# print(dataset_frame)
# print(dataset_frame.applymap(np.isreal))
# the following method is better for determining cols to vectorize
# vectorize the false entries
cols_isreal = dataset_frame.iloc[0].map(np.isreal)
# print(cols_isreal)
cols_to_vectorize = list()
for col_is_real_colname in cols_isreal.keys():
    col_is_real = cols_isreal[col_is_real_colname]
    if not col_is_real:
        cols_to_vectorize.append(col_is_real_colname)
print(cols_to_vectorize)
cols_to_vectorize.remove('URL')
for col_to_vectorize in cols_to_vectorize:
    # print(dataset_frame[col_to_vectorize])
    for i in range(len(dataset_frame[col_to_vectorize])):
        element = dataset_frame.loc[i, col_to_vectorize]
        if element is None or element == '':
            continue
        # one new True/False column per observed value of the original column
        col_to_add = str(col_to_vectorize) + '-' + str(element)
        if col_to_add not in dataset_frame.columns:
            dataset_frame[col_to_add] = False
        dataset_frame.loc[i, col_to_vectorize] = ''
        dataset_frame.loc[i, col_to_add] = True
    dataset_frame = dataset_frame.drop(col_to_vectorize, axis=1)
    # old version below (removed one level of indentation)
    '''
    try:
        element = dataset_frame.loc[i, col_to_vectorize]
        if element is None or element == '':
            continue
        element = int(element)
    except ValueError as valueErr:
        col_to_add = str(col_to_vectorize) + '-' + str(element)
        if col_to_add not in dataset_frame.columns:
            dataset_frame[col_to_add] = False
        dataset_frame.loc[i, col_to_vectorize] = ''
        dataset_frame.loc[i, col_to_add] = True
        continue
    '''
dataset_frame = dataset_frame.reindex(sorted(dataset_frame.columns), axis=1)
dataset_frame = dataset_frame.drop('URL', axis=1)
print(dataset_frame.iloc[0].map(np.isreal))
'''
for element in dataset_frame[cols_to_vectorize[col_index]]:
    print('####')
    print(element)
    print(type(element))
'''
# new_col_name = str(cols_to_vectorize[col_index]) + '-' + str(element)
dataset_frame.to_csv('whole_dataset_vectorized.csv', sep=';')
print('Program end.')
normalization_of_dataset.py:
import pandas as pd
import numpy as np

file = "whole_dataset_vectorized_use_this.csv"
original_dataset = pd.read_csv(file, delimiter=';')
# compute the mean of every feature column
mean_vector = np.zeros((1, len(original_dataset.columns[1:])))
for i in range(0, len(original_dataset.index)):
    for j in range(0, len(original_dataset.columns[1:])):
        mean_vector[0, j] = mean_vector[0, j] + original_dataset.iat[i, j + 1] / len(original_dataset.index)
centered_dataset = np.zeros((len(original_dataset.index), len(original_dataset.columns[1:])))
df_centered_dataset = pd.DataFrame(centered_dataset)
df_scaled_dataset = pd.DataFrame(np.zeros(centered_dataset.shape))
df_scaled_dataset.columns = original_dataset.columns[1:]
# center the data by subtracting the column means
for i in range(0, len(original_dataset.index)):
    for j in range(0, len(original_dataset.columns[1:])):
        df_centered_dataset.iat[i, j] = original_dataset.iat[i, j + 1] - mean_vector[0, j]
# compute the variance of every centered column
variance_vector = np.zeros((1, len(original_dataset.columns[1:])))
for i in range(0, len(original_dataset.index)):
    for j in range(0, len(original_dataset.columns[1:])):
        variance_vector[0, j] = variance_vector[0, j] + (df_centered_dataset.iat[i, j]) ** 2 / len(original_dataset.index)
# scale the centered data by the standard deviation of each column
for i in range(0, len(original_dataset.index)):
    for j in range(0, len(original_dataset.columns[1:])):
        df_scaled_dataset.iat[i, j] = df_centered_dataset.iat[i, j] / (variance_vector[0, j]) ** (1 / 2)
df_scaled_dataset.to_csv('normalized_dataset.csv', sep=';', index=True)
pca_on_normalized_dataset.py:
import pandas as pd
import numpy as np
from numpy import linalg

file = "normalized_dataset.csv"
df_normalized_dataset = pd.read_csv(file, delimiter=';')
normalized_dataset = df_normalized_dataset.iloc[:, 1:].values
# Compute the covariance matrix of the normalized data.
covariance_matrix = (1 / len(df_normalized_dataset.index)) * \
    np.dot(np.transpose(normalized_dataset), normalized_dataset)
eigenvalues, eigenvectors = linalg.eig(covariance_matrix)
# Organize eigenvalues and eigenvectors as a dictionary.
eigenvectors_dict = {}
for i in range(0, len(eigenvalues)):
    keyname = eigenvalues[i]
    valname = eigenvectors[:, i]
    eigenvectors_dict[keyname] = valname
# Sort eigenvectors based on eigenvalues.
eigenvectors_list = []
for i in sorted(eigenvectors_dict, reverse=True):
    eigenvectors_list.append(eigenvectors_dict[i])
# Normalize eigenvectors.
sorted_eigenvectors = np.transpose(np.asarray(eigenvectors_list))
for i in range(len(eigenvalues)):
    sorted_eigenvectors[:, i] = sorted_eigenvectors[:, i] / \
        (np.sum(sorted_eigenvectors[:, i] ** 2)) ** (1 / 2)
# PREALLOCATION
# Here k is the number of principal components that we plan to use
PVE_by_first_k_eigvectors = 0
total_variance_times_n = 0  # Here n means number of observations
# Compute total variance times n, for PVE computation later.
for i in range(len(normalized_dataset[:, 0])):
    for j in range(len(normalized_dataset[0, :])):
        total_variance_times_n += normalized_dataset[i, j] ** 2
projected_dataset = np.dot(normalized_dataset, sorted_eigenvectors)
projected_dataset_squared = np.square(projected_dataset)
for j in range(projected_dataset.shape[1]):
    variance_explained_by_jth_eigvector_times_n = 0
    for i in range(projected_dataset.shape[0]):
        variance_explained_by_jth_eigvector_times_n += projected_dataset_squared[i, j]
    PVE_by_jth_eigvector = variance_explained_by_jth_eigvector_times_n / total_variance_times_n
    PVE_by_first_k_eigvectors = PVE_by_first_k_eigvectors + PVE_by_jth_eigvector
    if PVE_by_first_k_eigvectors > 0.6:
        break
number_of_prime_components_to_use = j + 1
reduced_projected_dataset = np.dot(
    normalized_dataset, sorted_eigenvectors[:, 0:number_of_prime_components_to_use])
data_frame_to_write = pd.DataFrame(np.real(reduced_projected_dataset))
data_frame_to_write.to_csv('result_of_PCA.csv', sep=';', index=True)
clustering.py:
import pandas as pd
import numpy as np
from numpy import linalg
import matplotlib.pyplot as plt
import time

file = 'result_of_PCA.csv'
file2 = 'price.csv'
df_dataset = pd.read_csv(file, delimiter=';')
df_price = pd.read_csv(file2, delimiter=';')
number_of_clusters = 5
# Evaluation of Total Sum of Squares
price_vector = np.squeeze(df_price.values)
dataset = df_dataset.iloc[:, 1:].values
shuffled_index = np.arange(price_vector.shape[0])
np.random.shuffle(shuffled_index)
dataset = dataset[shuffled_index]
price_vector = price_vector[shuffled_index]
number_of_trials_for_centroids = 10
training_time_for_k_means_at_each_trial = np.zeros((number_of_trials_for_centroids))
initial_centroid_locations_array = np.zeros((number_of_clusters, df_dataset.shape[1] - 1, number_of_trials_for_centroids))
centroid_locations_array = np.zeros((number_of_clusters, df_dataset.shape[1] - 1, number_of_trials_for_centroids))
total_loss_array = np.full((number_of_trials_for_centroids), 501000)
while np.amin(total_loss_array) > 500000:
    print('New iteration of 10 trials for initial locations of centroids')
    # Loss according to the loss functions of Clustering
    total_loss_array = np.zeros((number_of_trials_for_centroids))
    for trial_number in range(number_of_trials_for_centroids):
        # Initialization of locations of centroids
        initial_centroid_locations_array[0:int(number_of_clusters / 2), :, trial_number] = \
            10 * np.random.rand(int(number_of_clusters / 2), df_dataset.shape[1] - 1)
        initial_centroid_locations_array[int(number_of_clusters / 2):number_of_clusters, :, trial_number] = \
            -10 * np.random.rand(number_of_clusters - int(number_of_clusters / 2), df_dataset.shape[1] - 1)
        centroid_locations_array[:, :, trial_number] = initial_centroid_locations_array[:, :, trial_number]
        check_to_break = False
        # K-means Clustering Method (uses Euclidean Distance)
        # Initialization
        centroid_locations_array[:, :, trial_number] = initial_centroid_locations_array[:, :, trial_number]
        # Training Phase
        start_time = time.time()
        threshold_for_change_of_loss = 0.00001
        # Finding Initial Loss
        r_matrix = np.zeros((dataset.shape[0], number_of_clusters))
        # The vector below stores squares of smallest Euclidean norms
        smallest_euclidean_norm_squ_for_each_notice = np.zeros(dataset.shape[0])
        for i in range(0, dataset.shape[0]):
            candidates_for_smallest_euclidean_norm_squ = np.zeros(number_of_clusters)
            for j in range(0, number_of_clusters):
                candidates_for_smallest_euclidean_norm_squ[j] = (np.linalg.norm(
                    dataset[i, :] - centroid_locations_array[j, :, trial_number])) ** (2)
            smallest_euclidean_norm_squ_for_each_notice[i] = np.amin(candidates_for_smallest_euclidean_norm_squ)
            r_matrix[i, np.argmin(candidates_for_smallest_euclidean_norm_squ)] = 1
        initial_loss = np.sum(smallest_euclidean_norm_squ_for_each_notice)
        previous_loss = 0
        old_r_matrix = np.zeros((dataset.shape[0], number_of_clusters))
        while (True):
            # Expectation Step
            r_matrix = np.zeros((dataset.shape[0], number_of_clusters))
            # The vector below stores squares of smallest Euclidean norms
            smallest_euclidean_norm_squ_for_each_notice = np.zeros(dataset.shape[0])
            for i in range(0, dataset.shape[0]):
                candidates_for_smallest_euclidean_norm_squ = np.zeros(number_of_clusters)
                for j in range(0, number_of_clusters):
                    candidates_for_smallest_euclidean_norm_squ[j] = (np.linalg.norm(
                        dataset[i, :] - centroid_locations_array[j, :, trial_number])) ** (2)
                smallest_euclidean_norm_squ_for_each_notice[i] = np.amin(candidates_for_smallest_euclidean_norm_squ)
                r_matrix[i, :] = np.zeros(r_matrix[i, :].shape)
                r_matrix[i, np.argmin(candidates_for_smallest_euclidean_norm_squ)] = 1
            total_loss = np.sum(smallest_euclidean_norm_squ_for_each_notice)
            total_loss_array[trial_number] = total_loss
            previous_loss = total_loss
            # Check for Convergence
            total_loss_array[trial_number] = total_loss
            if np.array_equal(old_r_matrix, r_matrix):
                break
            old_r_matrix = r_matrix
            # Maximization Step
            for j in range(0, number_of_clusters):
                if np.sum(r_matrix[:, j]) > 0:  # If is used to avoid division by zero
                    weighted_sum_vector = np.zeros((1, dataset.shape[1]))
                    sum_of_r = 0
                    for i in range(0, dataset.shape[0]):
                        weighted_sum_vector += r_matrix[i, j] * dataset[i, :]
                        sum_of_r += r_matrix[i, j]
                    centroid_locations_array[j, :, trial_number] = weighted_sum_vector / sum_of_r
            total_loss = 0
            for i in range(0, dataset.shape[0]):
                for j in range(0, number_of_clusters):
                    total_loss += r_matrix[i, j] * (np.linalg.norm(
                        dataset[i, :] - centroid_locations_array[j, :, trial_number])) ** (2)
            total_loss_array[trial_number] = total_loss
            previous_loss = total_loss
        total_loss_array[trial_number] = total_loss
        training_time_for_k_means_at_each_trial[trial_number] = time.time() - start_time
# Comparisons According to Total Loss
k_means_best_locations_for_centroids_2 = centroid_locations_array[:, :, np.argmin(total_loss_array)]
df_k_means_best_locations_for_centroids_2 = pd.DataFrame(k_means_best_locations_for_centroids_2)
df_k_means_best_locations_for_centroids_2.to_csv('best_cluster_centroids_for_k_means_clusters.csv', sep=';', index=True)
cluster_indices_of_notices = np.zeros((dataset.shape[0]), dtype=int)
for i in range(0, dataset.shape[0]):
    norm_vector = np.zeros((number_of_clusters))
    for j in range(0, number_of_clusters):
        norm_vector[j] = np.linalg.norm(dataset[i, :] - centroid_locations_array[j, :, np.argmin(total_loss_array)])
    cluster_indices_of_notices[i] = np.argmin(norm_vector)
df_cluster_indices_of_notices = pd.DataFrame(cluster_indices_of_notices)
df_cluster_indices_of_notices.columns = ['Cluster Number']
df_cluster_indices_of_notices.to_csv('cluster_indices_of_notices.csv', sep=';', index=True)
print('Total loss for each trial:', total_loss_array)
print('Training time at each trial:', training_time_for_k_means_at_each_trial)
df_dataset = pd.DataFrame(dataset)
df_price = pd.DataFrame(price_vector)
df_dataset.to_csv('dataset_after_clustering.csv', sep=';', index=True)
df_price.to_csv('prices_after_clustering.csv', sep=';', index=True)
# Counting number of notices in each cluster
number_of_notices_in_each_cluster = np.zeros((number_of_clusters))
for cluster in range(number_of_clusters):
    for notice in range(cluster_indices_of_notices.shape[0]):
        if cluster_indices_of_notices[notice] == cluster:
            number_of_notices_in_each_cluster[cluster] += 1
# Finding clusters with at least 300 notices:
sufficient_clusters = np.empty((0), dtype=int)
clusters_to_erase = np.empty((0), dtype=int)
for cluster in range(number_of_clusters):
    if number_of_notices_in_each_cluster[cluster] >= 300:
        sufficient_clusters = np.append(sufficient_clusters, [cluster])
    else:
        clusters_to_erase = np.append(clusters_to_erase, [cluster])
# Reassigning a notice to the nearest sufficient cluster if its own cluster has too few notices
distance_to_each_centroid = np.zeros(sufficient_clusters.shape)
cluster_indices_of_notices_v2 = np.zeros((dataset.shape[0]))
for notice in range(cluster_indices_of_notices.shape[0]):
    for cluster in clusters_to_erase:
        if cluster_indices_of_notices[notice] == cluster:
            for cluster_candidate_index in range(sufficient_clusters.shape[0]):
                distance_to_each_centroid[cluster_candidate_index] = (np.linalg.norm(
                    dataset[notice, :] - centroid_locations_array[sufficient_clusters[cluster_candidate_index],
                                                                  :, np.argmin(total_loss_array)])) ** (2)
            cluster_indices_of_notices_v2[notice] = sufficient_clusters[np.argmin(distance_to_each_centroid)]
            break
    else:
        cluster_indices_of_notices_v2[notice] = cluster_indices_of_notices[notice]
df_cluster_indices_of_notices_v2 = pd.DataFrame(cluster_indices_of_notices_v2)
df_cluster_indices_of_notices_v2.columns = ['Cluster Number']
df_cluster_indices_of_notices_v2.to_csv('cluster_indices_of_notices.csv', sep=';', index=True)
linear_regression.py:
import pandas as pd
import numpy as np
from numpy import linalg
import matplotlib.pyplot as plt
import time


# Ridge regression algorithm.
# If no regularization coefficient lambda_reg is given, does simple linear regression.
def ridge_regression(X, y, lambda_reg=0):
    X_T = np.transpose(X)
    X_T_X = np.dot(X_T, X)
    reg_matrix = lambda_reg * np.identity(len(X_T_X))
    inverted_matrix = linalg.inv(X_T_X + reg_matrix)
    transformation_matrix = np.dot(inverted_matrix, X_T)
    weights = np.dot(transformation_matrix, y)
    return weights


def apply_linear_regression(X, weights):
    y_predict = np.dot(X, weights)
    return y_predict


def loss_rss(weights, X, y):
    y_predict = apply_linear_regression(X, weights)
    # clip prices from below
    y_predict = np.clip(y_predict, 10 * 10 ** 3, None)
    rss = sum((y - y_predict) ** 2)
    return rss


def k_fold_x_validation(X, y, learning_alg, hyperparameters, loss_fcn, k):
    coeff_of_det_list = []
    for i in range(k):
        subset_size = int(len(X) / k)
        validation_set_index = [i * subset_size, (i + 1) * subset_size]
        X_test = X[validation_set_index[0]:validation_set_index[1], :]
        X_train = np.append(X[:validation_set_index[0]],
                            X[validation_set_index[1]:, :], axis=0)
        y_test = y[validation_set_index[0]:validation_set_index[1]]
        y_train = np.append(y[:validation_set_index[0]],
                            y[validation_set_index[1]:], axis=0)
        learned_function = learning_alg(X_train, y_train, hyperparameters)
        loss = loss_fcn(learned_function, X_test, y_test)
        tss = sum((y_test - np.mean(y_test)) ** 2)
        coeff_of_det = 1 - loss / tss
        coeff_of_det_list.append(coeff_of_det)
    return np.mean(coeff_of_det_list)


if __name__ == "__main__":
    print('Program start.')
    # real data files
    dataset_file = 'dataset_after_clustering.csv'
    results_file = 'prices_after_clustering.csv'
    # read from data files
    dataset_frame = pd.read_csv(dataset_file, delimiter=';')
    dataset_frame = dataset_frame.iloc[:, 1:]
    results_frame = pd.read_csv(results_file, delimiter=';')
    results_frame = results_frame.iloc[:, 1:]
    cluster_indices_of_notices_frame = pd.read_csv('cluster_indices_of_notices.csv', delimiter=';')
    cluster_indices_of_notices_frame = cluster_indices_of_notices_frame.iloc[:, 1:]
    # convert to numpy arrays
    dataset = dataset_frame.values
    prices = results_frame.values
    cluster_indices_of_notices = cluster_indices_of_notices_frame.values
    # Feature transformation.
    # Add the squares of the most important features.
    num_of_features_to_square = 10
    dataset = np.append(dataset, dataset[:, :num_of_features_to_square] ** 2, axis=1)
    cluster_names = np.unique(cluster_indices_of_notices)
    number_of_clusters = len(cluster_names)
    y_combined_total = np.empty((0, 2))
    for cluster_index in range(number_of_clusters):
        X = np.empty((0, dataset.shape[1]))
        y = np.empty((0))
        for notice in range(dataset.shape[0]):
            if cluster_names[cluster_index] == cluster_indices_of_notices[notice]:
                X = np.append(X, np.reshape(dataset[notice, :], (1, dataset[notice, :].shape[0])), axis=0)
                y = np.append(y, prices[notice], axis=0)
        # append bias entries to data matrix
        X = np.append(np.ones((len(X), 1)), X, axis=1)
        print(y.shape)
        # lists to hold errors and lambdas
        coeff_of_det_list = []
        tss_mean_list = []
        lambda_reg_list = []
        elapsed_time_list = []
        # the interval to check lambda values
        lambda_reg_interval = np.arange(0, 1500, 10)
        # number of folds for cross validation
        cross_validation_k = 5
        # cross validate for all lambdas
        for lambda_reg in lambda_reg_interval:
            start_time = time.time()
            coeff_of_det = k_fold_x_validation(
                X, y, ridge_regression, lambda_reg, loss_rss, cross_validation_k)
            #print(loss_mean)
            end_time = time.time()
            elapsed_time = end_time - start_time
            elapsed_time_list.append(elapsed_time)
            coeff_of_det_list.append(coeff_of_det)
        # print average elapsed time
        average_elapsed_time = sum(elapsed_time_list) / len(elapsed_time_list)
        print('Average training time for cluster', cluster_index, ':', average_elapsed_time)
        optimal_lambda_reg = lambda_reg_interval[np.where(
            coeff_of_det_list == np.amax(coeff_of_det_list))[0][0]]
        print('Optimal lambda value for cluster', cluster_index, ':', optimal_lambda_reg)
        shuffled_index = np.arange(y.shape[0])
        np.random.shuffle(shuffled_index)
        X = X[shuffled_index]
        y = y[shuffled_index]
        # apply learned function to half of the input and write results to external file
        weights = ridge_regression(X[:round(X.shape[0]/2), :], y[:round(X.shape[0]/2)], optimal_lambda_reg)
        y_predict = apply_linear_regression(X[round(X.shape[0]/2):, :], weights)
        y_predict = np.clip(y_predict, 10*10**3, None)
        y_combined = np.column_stack((y[round(X.shape[0]/2):], y_predict))
        y_combined_total = np.concatenate((y_combined_total, y_combined), axis=0)
    data_frame_to_write = pd.DataFrame(np.real(y_combined_total),
                                       columns=['Real Prices (TL)', 'Predicted Prices (TL)'])
    data_frame_to_write.to_csv('result_of_ridge_regression_after_clustering.csv', sep=';', index=True)
    coeff_of_det = np.amax(coeff_of_det_list)
    print('Average coeff of det:')
    print(coeff_of_det)
    probability_of_accurate_prediction = 0
    for notice_index in range(y_combined_total.shape[0]):
        if y_combined_total[notice_index, 1] >= 0.9 * y_combined_total[notice_index, 0] and \
                y_combined_total[notice_index, 1] <= 1.1 * y_combined_total[notice_index, 0]:
            probability_of_accurate_prediction += 1.0 / float(y_combined_total.shape[0])
    print('Probability of making a prediction in the range [0.9*(real price), 1.1*(real price)]:',
          probability_of_accurate_prediction)
    second_probability_of_accurate_prediction = 0
    for notice_index in range(y_combined_total.shape[0]):
        if y_combined_total[notice_index, 1] >= 0.8 * y_combined_total[notice_index, 0] and \
                y_combined_total[notice_index, 1] <= 1.2 * y_combined_total[notice_index, 0]:
            second_probability_of_accurate_prediction += 1.0 / float(y_combined_total.shape[0])
    print('Probability of making a prediction in the range [0.8*(real price), 1.2*(real price)]:',
          second_probability_of_accurate_prediction)
    third_probability_of_accurate_prediction = 0
    for notice_index in range(y_combined_total.shape[0]):
        if y_combined_total[notice_index, 1] >= 0.7 * y_combined_total[notice_index, 0] and \
                y_combined_total[notice_index, 1] <= 1.3 * y_combined_total[notice_index, 0]:
            third_probability_of_accurate_prediction += 1.0 / float(y_combined_total.shape[0])
    print('Probability of making a prediction in the range [0.7*(real price), 1.3*(real price)]:',
          third_probability_of_accurate_prediction)
    print('Program end.')
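As a quick sanity check of the closed-form ridge solution above (weights = (X^T X + lambda*I)^(-1) X^T y), the two helper functions can be exercised on synthetic data. The script below is only an illustrative sketch and is not part of the submitted pipeline; the file name, array sizes and values are made up.

# ridge_sanity_check.py (illustrative sketch only, not part of the submitted code)
import numpy as np
from linear_regression import ridge_regression, apply_linear_regression

np.random.seed(0)
X = np.random.rand(100, 3)                       # 100 synthetic notices with 3 features
X = np.append(np.ones((len(X), 1)), X, axis=1)   # bias column, as in the main script
true_weights = np.array([5.0, 2.0, -1.0, 0.5])
y = X.dot(true_weights) + 0.01 * np.random.randn(100)

weights = ridge_regression(X, y, lambda_reg=1.0)   # closed-form ridge solution
y_predict = apply_linear_regression(X, weights)
print('Recovered weights:', weights)
print('RSS:', np.sum((y - y_predict) ** 2))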
neural_network_driver.py:
"""Driver
for network2 for EEE485 project. """
#
Custom libraries
from
neural_network import *
#
Main function
if
__name__ == "__main__":
print('Program start.')
# Filename to contain the neural
network object
#network_filename =
'neural_network2_object_eee485.json'
network_filename =
'neural_network2_object_eee485_v0.1.json'
# An execution timer to track
execution durations
exe_timer = ExecutionTimer()
# Load the datasets
# EEE485 datasets
# Trial data files
#X_filename =
'normalized_deneme.csv'
#y_filename =
'normalized_deneme_results.csv'
# Real data files
X_filename = 'result_of_PCA.csv'
y_filename = 'price_indexed.csv'
X_frame =
pd.read_csv(X_filename, delimiter=';')
y_frame = pd.read_csv(y_filename,
delimiter=';')
# Remove index columns
X_frame = X_frame.iloc[:, 1:]
y_frame = y_frame.iloc[:, 1:]
# Convert data frames to numpy
arrays
X = X_frame.values
y = y_frame.values
# Limit the data size (optional)
data_size = len(X)
X = X[:data_size]
y = y[:data_size]
# Scale y values to avoid
overflow
y_scale = (6 * 10**6)
y = y / y_scale
# Shuffle original datasets
(optional)
shuffle_index =
np.arange(X.shape[0])
np.random.shuffle(shuffle_index)
X_shuffled = X[shuffle_index]
y_shuffled = y[shuffle_index]
#X = X_shuffled
#y = y_shuffled
# Split dataset into subsets
based on the given ratios
dataset_split_ratios = {'train':
0.8, 'test': 0.1, 'validation': 0.1}
dataset_split_counts = [int(len(X)*dataset_split_ratios[i])
for
i in dataset_split_ratios]
dataset_split_indexes =
np.cumsum(dataset_split_counts)
dataset_split_indexes =
np.insert(dataset_split_indexes, 0, 0, axis=0)
tuple_dataset =
create_tuple_dataset(X, y)
dataset_batches =
[tuple_dataset[start:end]
for start, end in
zip(dataset_split_indexes[:-1], dataset_split_indexes[1:])]
dataset = {key: dataset_batch
for (key,
dataset_batch) in zip(dataset_split_ratios.keys(), dataset_batches)}
training_data = dataset['train']
test_data = dataset['test']
validation_data =
dataset['validation']
print('Loaded the datasets.')
# Set the layer sizes
input_layer_size =
len(tuple_dataset[0][0])
output_layer_size =
len(tuple_dataset[0][1])
hidden_layer_sizes = [60]
sizes = [input_layer_size] +
hidden_layer_sizes + [output_layer_size]
print('Layer sizes for the
neural network:')
print(sizes)
# Construct the neural network
net = NeuralNetwork(sizes,
activation_fn=Sigmoid, cost=QuadraticCost)
#net = NeuralNetwork(sizes,
activation_fn=Sigmoid, cost=CrossEntropyCost)
net.large_weight_initializer()
# Define the hyperparameters for
training
hyperparameters = {
'epochs': 10,
'mini_batch_size':
10,
'eta': 0.05,
'lmbda': 0.2,
'acc_percent':
10,
'early_stopping_n':
500,
}
epochs =
hyperparameters['epochs']
mini_batch_size =
hyperparameters['mini_batch_size']
eta = hyperparameters['eta']
lmbda = hyperparameters['lmbda']
acc_percent =
hyperparameters['acc_percent']
early_stopping_n = hyperparameters['early_stopping_n']
# Load the neural network object
before training (optional)
print('Loading the neural
network object...')
net = load(network_filename)
# Train the network using SGD
try:
exe_timer.start()
print('Training
the neural network...')
net.SGD(training_data,
epochs, mini_batch_size, eta, lmbda=lmbda,
acc_percent=acc_percent,
test_data=test_data,
check_test_accuracy=True,
check_training_accuracy=True,
check_test_cost=False,
check_training_cost=False,
early_stopping_n=early_stopping_n)
except KeyboardInterrupt as ex:
print('Keyboard
Interrupt caught during SGD.')
option =
input('Do you want to save the neural network? (y/n):')
if option == 'y':
#
Save the trained neural network object
print('Saving
the neural network object...')
net.save(network_filename)
else:
print('Not
saving the neural network object.')
pass
else:
print('Finished
training the neural network.')
execution_time =
exe_timer.stop()
print('Execution
time: ' + str(execution_time) + ' sec.')
# Save the
trained neural network object
print('Saving the
neural network object...')
net.save(network_filename)
finally:
pass
# Load the neural network object
after training (optional)
print('Loading the neural
network object...')
net = load(network_filename)
# Make predictions and evaluate
them
# Extract training dataset
print('Training dataset
confidence interval computations:')
X_valid, y_valid =
extract_data(training_data)
# Scale outputs
y_valid = y_valid * y_scale
y_predict = net.predict(X_valid)
* y_scale
y_combined =
(np.hstack((y_valid, y_predict)))
for percentage in range(10, 40,
10):
probability_of_accurate_prediction
= net.accuracy(
training_data,
percent=percentage) / len(training_data)
print('Probability
of making a prediction with error percentage margin {}: {}'.format(
percentage,
probability_of_accurate_prediction))
# Compute coefficient of
determination.
tss = np.sum((y_valid -
np.mean(y_valid)) ** 2)
rss = np.sum((y_valid -
y_predict)**2)
coeff_of_det = 1 - rss / tss
print('Training dataset coeff.
of det.: ')
print(coeff_of_det)
# Print first 10 predictions on
terminal to give an idea
print('first ten predictions:')
print(y_combined[:10])
# Write results to external file
data_frame_to_write =
pd.DataFrame(np.real(y_combined))
data_frame_to_write.to_csv(
'result_of_neural_network2_train.csv',
sep=';', index=True)
# Extract validation dataset
print('Validation dataset
confidence interval computations:')
X_valid, y_valid =
extract_data(validation_data)
y_valid = y_valid * y_scale
y_predict = net.predict(X_valid)
* y_scale
y_combined =
(np.hstack((y_valid, y_predict)))
for percentage in range(10, 40,
10):
probability_of_accurate_prediction
= net.accuracy(
validation_data,
percent=percentage) / len(validation_data)
print('Probability
of making a prediction with error percentage margin {}: {}'.format(
percentage,
probability_of_accurate_prediction))
# Compute coefficient of
determination.
tss = np.sum((y_valid -
np.mean(y_valid)) ** 2)
rss = np.sum((y_valid -
y_predict)**2)
coeff_of_det = 1 - rss / tss
print('Validation dataset coeff.
of det.: ')
print(coeff_of_det)
# Print first 10 predictions on
terminal to give an idea
print('first ten predictions:')
print(y_combined[:10])
# Write results to external file
data_frame_to_write =
pd.DataFrame(np.real(y_combined))
data_frame_to_write.to_csv(
'result_of_neural_network2_valid.csv',
sep=';', index=True)
# Print accuracies of the neural
network
exe_timer.start()
percentage = 10
print('Accuracy of the neural
network on training data:')
accuracy =
net.accuracy(training_data, percent=percentage)
datasize = len(training_data)
print(str(accuracy) + ' / ' +
str(datasize))
print('Accuracy of the neural
network on validation data:')
accuracy =
net.accuracy(validation_data, percent=percentage)
datasize = len(validation_data)
print(str(accuracy) + ' / ' +
str(datasize))
print('Finished evaluating
accuracies of the neural network.')
execution_time =
exe_timer.stop()
print('Execution time: ' +
str(execution_time) + ' sec.')
# Print total costs of the
neural network
exe_timer.start()
print('Total cost of the neural
network on training data:')
total_cost = net.total_cost(training_data,
lmbda=lmbda)
print(total_cost)
print('Total cost of the neural
network on validation data:')
total_cost =
net.total_cost(validation_data, lmbda=lmbda)
print(total_cost)
print('Finished evaluating total
costs of the neural network.')
execution_time =
exe_timer.stop()
print('Execution time: ' +
str(execution_time) + ' sec.')
# Print the total execution time
execution_time =
exe_timer.get_total_time()
print('Total execution time: ' +
str(execution_time) + ' sec.')
print('Program end.')
neural_network.py:
"""Implementation
of feedforward neural network, using SGD.
We
were inspired by the following website: http://neuralnetworksanddeeplearning.com"""
#
Libraries
#
Standard library
import
json
import
sys
#
Third-party libraries
import
numpy as np
import
pandas as pd
#
Custom libraries
from
miscellaneous import *
#
Define the activation functions
class
Sigmoid(object):
@staticmethod
def fn(z):
"""The
sigmoid function."""
return
1.0/(1.0+np.exp(-z))
@staticmethod
def prime(z):
"""Derivative
of the sigmoid function."""
return
Sigmoid.fn(z)*(1-Sigmoid.fn(z))
class
ReLU(object):
@staticmethod
def fn(z):
"""The
ReLU function."""
# return
np.maximum(z, 0, z) # works in place, hence faster?
return
np.maximum(z, 0)
@staticmethod
def prime(z):
"""Derivative
of the ReLU function."""
return np.where(z
> 0, 1, 0)
class
Linear(object):
@staticmethod
def fn(z):
"""The
linear function."""
return z
@staticmethod
def prime(z):
"""Derivative
of the linear function."""
return
np.ones(z.shape)
#
Define the cost functions
class
QuadraticCost(object):
@staticmethod
def fn(a, y):
"""Return
the cost associated with an output."""
return
0.5*np.linalg.norm(a-y)**2
@staticmethod
def delta(z, a, y, activation):
"""Return
the error delta from the output layer for backpropagation."""
return (a-y) *
activation.prime(z)
class
CrossEntropyCost(object):
@staticmethod
def fn(a, y):
"""Return
the cost associated with an output."""
return np.sum(np.nan_to_num(-y*np.log(a)-(1-y)*np.log(1-a)))
@staticmethod
def delta(z, a, y, activation):
"""Return
the error delta from the output layer for backpropagation."""
return (a-y)
#
Main Network class
class
NeuralNetwork(object):
def __init__(self, sizes,
activation_fn=Sigmoid, cost=CrossEntropyCost):
"""Construct
the neural network object."""
# List to hold
the neuron counts at each layer.
self.sizes =
sizes
self.num_layers =
len(self.sizes)
self.input_layer_size
= self.sizes[0]
self.output_layer_size
= self.sizes[self.num_layers-1]
# The activation
function to use.
self.activation_fn
= activation_fn
# The cost
function to use.
self.cost = cost
# Initialize the
weights of the network.
self.default_weight_initializer()
def
default_weight_initializer(self):
"""Initialize
weights and biases using Gaussian distributions."""
self.biases =
[np.random.randn(y, 1) for y in self.sizes[1:]]
self.weights =
[np.random.randn(y, x)/np.sqrt(x)
for
x, y in zip(self.sizes[:-1], self.sizes[1:])]
def
large_weight_initializer(self):
"""Initialize
the weights randomly again, but with larger means."""
self.biases =
[np.random.randn(y, 1) for y in self.sizes[1:]]
self.weights =
[np.random.randn(y, x)
for
x, y in zip(self.sizes[:-1], self.sizes[1:])]
def feedforward(self, a):
"""Return
the output of the network if a is input."""
for b, w in
zip(self.biases, self.weights):
a
= self.activation_fn.fn(np.dot(w, a)+b)
return a
def SGD(self, training_data,
epochs, mini_batch_size, eta,
lmbda=0.0,
acc_percent=0,
test_data=None,
check_test_cost=False,
check_test_accuracy=False,
check_training_cost=False,
check_training_accuracy=False,
early_stopping_n=0):
"""Apply
SGD to train the neural network."""
# Get the lengths
of the datasets.
training_data =
list(training_data)
n =
len(training_data)
if test_data:
test_data
= list(test_data)
n_data
= len(test_data)
# Variables for
early stopping functionality.
best_accuracy = 0
no_accuracy_change
= 0
# Lists to hold
the costs and accuracies.
test_cost,
test_accuracy = [], []
training_cost,
training_accuracy = [], []
# Loop for the
epochs.
for j in
range(epochs):
#
Shuffle the training data.
np.random.shuffle(training_data)
#
Construct mini batches from the training data.
mini_batches
= [
training_data[k:k+mini_batch_size]
for
k in range(0, n, mini_batch_size)]
#
Update the weights and biases based on mini batch SGDs.
for
mini_batch in mini_batches:
self.update_mini_batch(
mini_batch,
eta, lmbda, len(training_data))
#
Print progress of the learning for information.
if
(j % 50 == 0):
print("Epoch
%s of training complete" % j)
if
check_training_cost:
tot_cost =
self.total_cost(training_data, lmbda)
training_cost.append(tot_cost)
print("Cost
on training data: {}".format(tot_cost))
if
check_training_accuracy:
accuracy
= self.accuracy(
training_data,
percent=acc_percent)
training_accuracy.append(accuracy)
print("Accuracy
on training data: {} / {}".format(accuracy, n))
if
check_test_cost:
tot_cost
= self.total_cost(
test_data,
lmbda)
test_cost.append(tot_cost)
print("Cost
on evaluation data: {}".format(tot_cost))
if
check_test_accuracy:
accuracy
= self.accuracy(
test_data,
percent=acc_percent)
test_accuracy.append(accuracy)
print(
"Accuracy
on evaluation data: {} / {}".format(accuracy, n_data))
#
Early stopping checks.
if
early_stopping_n > 0:
if
accuracy > best_accuracy:
best_accuracy
= accuracy
no_accuracy_change
= 0
#print("Early-stopping:
Best so far {}".format(best_accuracy))
else:
no_accuracy_change
+= 1
if
(no_accuracy_change == early_stopping_n):
print(
"Stopping
early: No accuracy change in the last {}
epochs.".format(early_stopping_n))
return
test_cost, test_accuracy, training_cost, training_accuracy
return test_cost,
test_accuracy, \
training_cost,
training_accuracy
def update_mini_batch(self,
mini_batch, eta, lmbda, n):
"""Update
the weights and biases of the network using SGD.
eta is the
learning rate, lmbda is the regularization coefficient.
n is the total
size of the training dataset."""
# Initialize the
gradients.
grad_J_b =
[np.zeros(b.shape) for b in self.biases]
grad_J_w =
[np.zeros(w.shape) for w in self.weights]
# For each data
entry in mini batch, apply weight updates
for x, y in
mini_batch:
#
Apply backpropagation for the current data entry.
delta_grad_J_b,
delta_grad_J_w = self.backprop(x, y)
#
Update the gradients.
grad_J_b
= [gJb + dgJb for gJb,
dgJb
in zip(grad_J_b, delta_grad_J_b)]
grad_J_w
= [gJw + dgJw for gJw,
dgJw
in zip(grad_J_w, delta_grad_J_w)]
# Update the
weights and biases using the gradients and regularization.
self.weights =
[(1-eta*(lmbda/n))*w-(eta/len(mini_batch))*gJw
for
w, gJw in zip(self.weights, grad_J_w)]
self.biases =
[b-(eta/len(mini_batch))*gJb
for b, gJb in zip(self.biases, grad_J_b)]
def backprop(self, x, y):
"""Return
a tuple representing the gradients for the cost function."""
grad_J_b =
[np.zeros(b.shape) for b in self.biases]
grad_J_w =
[np.zeros(w.shape) for w in self.weights]
a = x
# Lists to store
all the activations and z vectors, layer by layer.
a_list = [a]
z_list = []
# Feedforward the
input, and update the lists.
for b, w in
zip(self.biases, self.weights):
z
= np.dot(w, a)+b
z_list.append(z)
a
= self.activation_fn.fn(z)
a_list.append(a)
# Backpropagation
for output layer.
delta =
(self.cost).delta(
z_list[-1],
a_list[-1], y, self.activation_fn)
grad_J_b[-1] =
delta
grad_J_w[-1] =
np.dot(delta, a_list[-2].T)
# Backpropagation
for hidden layers.
for l in range(2,
self.num_layers):
z
= z_list[-l]
a_prime
= self.activation_fn.prime(z)
delta
= np.dot(self.weights[-l+1].T, delta) * a_prime
grad_J_b[-l]
= delta
grad_J_w[-l]
= np.dot(delta, a_list[-l-1].T)
return (grad_J_b,
grad_J_w)
def accuracy(self, data,
percent=0):
"""Return
the number of inputs in data for which the neural
network outputs
the correct result within given accuracy."""
if percent >
0:
#
Accuracy computation for regression.
X_valid,
y_valid = extract_data(data)
y_predict
= self.predict(X_valid)
percent
/= 100
y_accurate
= (y_predict < (y_valid * (1 + percent))) \
&
(y_predict > (y_valid * (1 - percent)))
result_accuracy
= np.sum(y_accurate)
return
result_accuracy
def total_cost(self, data,
lmbda):
"""Return
the total cost for the dataset."""
cost = 0.0
for x, y in data:
a
= self.feedforward(x)
cost
+= self.cost.fn(a, y)/len(data)
cost
+= 0.5*(lmbda/len(data)) * \
sum(np.linalg.norm(w)**2
for w in self.weights)
return cost
def save(self, filename):
"""Save
the neural network to a file with given filename."""
data =
{"sizes": self.sizes,
"weights":
[w.tolist() for w in self.weights],
"biases":
[b.tolist() for b in self.biases],
"cost":
str(self.cost.__name__)}
f =
open(filename, "w")
json.dump(data,
f)
f.close()
def predict(self, X):
"""Return
the output of the network if X is input.
A wrapper of
feedforward for outside use."""
return
self.feedforward(X.T).T
def
load(filename):
"""Load a neural
network from the file filename."""
f = open(filename,
"r")
data = json.load(f)
f.close()
cost =
getattr(sys.modules[__name__], data["cost"])
net =
NeuralNetwork(data["sizes"], cost=cost)
net.weights = [np.array(w) for w
in data["weights"]]
net.biases = [np.array(b) for b
in data["biases"]]
return net
def
create_tuple_dataset(X, y):
"""Create list of
tuples containing pairs of data,
from data and result matrices X
and y."""
dataset = [(X_row.reshape((X_row.shape[0],
1)),
y_row.reshape((y_row.shape[0],
1))) for X_row, y_row in zip(X, y)]
return dataset
def
extract_data(dataset):
"""Extract data
and results from tuple list dataset as an array."""
X, y = zip(*dataset)
X = np.asarray(X)
y = np.asarray(y)
X = np.reshape(X, X.shape[:2])
y = np.reshape(y, y.shape[:2])
return X, y
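For reference, the sketch below shows how the classes and helper functions above fit together on toy data. It is only an illustrative example (the layer sizes, sample counts and values are made up) and assumes neural_network.py and miscellaneous.py are on the Python path.

# toy_network_check.py (illustrative sketch only, not part of the submitted code)
import numpy as np
from neural_network import NeuralNetwork, Sigmoid, QuadraticCost, create_tuple_dataset, extract_data

np.random.seed(0)
X = np.random.rand(50, 4)      # 50 toy samples with 4 features
y = np.random.rand(50, 1)      # one regression target per sample, already scaled to [0, 1]
toy_data = create_tuple_dataset(X, y)

net = NeuralNetwork([4, 6, 1], activation_fn=Sigmoid, cost=QuadraticCost)
net.SGD(toy_data, epochs=20, mini_batch_size=5, eta=0.5)

X_check, y_check = extract_data(toy_data)
print('First three predictions:', net.predict(X_check)[:3].ravel())
print('First three targets:', y_check[:3].ravel())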
miscellaneous.py:
#### Libraries
# Standard library
import time
# Third-party libraries
import numpy as np


class ExecutionTimer(object):
    def __init__(self):
        """A code execution timer."""
        self.time_interval = {'start': None, 'end': None}
        self.total_time = 0
        self.running = False

    def start(self):
        """Start the timer."""
        if not self.running:
            self.time_interval['start'] = time.time()
            self.running = True
        else:
            print('Cannot start: timer is currently running.')

    def stop(self):
        """Stop the timer."""
        if self.running:
            self.time_interval['stop'] = time.time()
            self.running = False
            duration = self.time_interval['stop'] - self.time_interval['start']
            self.total_time += duration
            return duration
        else:
            print('Cannot stop: timer is currently not running.')
            return None

    def get_total_time(self):
        """Get the total time from the timer."""
        if not self.running:
            return self.total_time
        else:
            print('Cannot get total time: timer is currently running.')
            return None

    def reset(self):
        """Reset the timer."""
        self.time_interval = {'start': None, 'end': None}
        self.total_time = 0
        self.running = False


def k_fold_x_validation(X, y, learning_alg, loss_fcn, k):
    if (k > X.shape[0]):
        k = X.shape[0]
    subset_size = X.shape[0] // k
    #loss_vector = np.empty(subset_size * k)
    loss_vector = np.empty(k)  # for coeff of det
    for i in range(k):
        validation_set_index = [i * subset_size, (i + 1) * subset_size]
        X_test = X[validation_set_index[0]:validation_set_index[1]]
        X_train = np.concatenate((X[:validation_set_index[0]],
                                  X[validation_set_index[1]:]), axis=0)
        y_test = y[validation_set_index[0]:validation_set_index[1]]
        y_train = np.concatenate((y[:validation_set_index[0]],
                                  y[validation_set_index[1]:]), axis=0)
        learning_alg.train(X_train, y_train)
        loss_entry = loss_fcn(learning_alg, X_test, y_test)
        #loss_vector[validation_set_index[0]
        #            :validation_set_index[1]] = loss_entry
        loss_vector[i] = loss_entry  # for coeff of det
    return loss_vector
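Note that this k_fold_x_validation expects a learner object with a train method (and a loss function that takes the trained learner), whereas the version in linear_regression.py takes a plain training function. A minimal compatible learner is sketched below; MeanPredictor and rss_loss are hypothetical names used only for illustration.

# illustrative usage of miscellaneous.k_fold_x_validation (not part of the submitted code)
import numpy as np
from miscellaneous import k_fold_x_validation

class MeanPredictor(object):
    """Hypothetical learner that always predicts the mean of its training targets."""
    def train(self, X_train, y_train):
        self.mean_value = np.mean(y_train)
    def predict(self, X):
        return np.full(X.shape[0], self.mean_value)

def rss_loss(learner, X_test, y_test):
    return np.sum((y_test - learner.predict(X_test)) ** 2)

X = np.random.rand(20, 3)
y = np.random.rand(20)
print(k_fold_x_validation(X, y, MeanPredictor(), rss_loss, k=5))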