PCA
Principal Component Analysis
Principal Component Analysis (PCA) is a dimensionality-reduction technique that is often used to remove attributes from a dataset before it is analyzed with some other machine learning technique. By identifying the least important features, it lets you simplify your data, possibly increasing the accuracy of your other models.
In the example below, a linear regression model is used to predict the price of a car based on its other attributes. Initially each car is described by 12 attributes; PCA reduces this to only two, lowering the sum-squared error (SSE) of the linear regression model by around 24%.
The code and output are shown below, and are also available as an IPython notebook and on Bitbucket. The data file used in the notebook is included here.
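To show where this is headed, here is a minimal sketch of the same pipeline using scikit-learn's built-in PCA instead of the hand-rolled version developed below. The array sizes are placeholders for illustration, not the cars data:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# toy stand-in data: 100 samples with 12 features (placeholder sizes)
rng = np.random.RandomState(0)
X = rng.rand(100, 12)
y = rng.rand(100)

# project onto the two most important directions, then fit a regression
X_reduced = PCA(n_components=2).fit_transform(X)
model = LinearRegression().fit(X_reduced[:75], y[:75])
predictions = model.predict(X_reduced[75:])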
In [1]:
import numpy as np
from matplotlib import pyplot as plt
from sklearn import linear_model
Dataset
The dataset, stored in 'cars2.csv', contains data about different cars. This experiment uses the numerical variables (listed after the next cell) to predict the price of each car. PCA will identify which features affect the results the most, allowing the less useful features to be filtered out to improve the results.
In [2]:
# represents a data set
class Data:
    def __init__(self, data, target, columns):
        self.data = np.array(data)
        self.target = np.array(target)
        self.columns = columns
        self.transformed = None

# converts string data to floats when possible
def fl(v):
    try:
        return float(v)
    except ValueError:
        return v

# build the data set, keeping only the columns used in this experiment
with open('cars2.csv', 'r') as f:
    vals = []
    tgt = []
    columns = [x.strip() for x in f.readline().split(',')]
    tgt_col = columns[-1]
    columns = columns[9:14] + columns[18:-1]
    for line in f:
        # skip rows with a missing value ('?') in the columns we use
        if '?' not in line[25:]:
            d = line.split(',')
            vals.append([fl(x) for x in d[9:14] + d[18:-1]])
            tgt.append(fl(d[-1]))
cars = Data(vals, tgt, columns)

print('Variables:')
print('\t', cars.columns)
print('\n\tUsed to predict:', tgt_col)
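As a quick sanity check on the loading step (an illustrative addition; the exact row count depends on how many rows contained '?' and were skipped):

print(cars.data.shape)    # (rows kept, 12 attributes)
print(cars.target.shape)  # one price per row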
Implementation
In [3]:
class myPCA:
    def __init__(self, k):
        self.k = k

    # centers the data (per feature, since rows are features here)
    def __center(self, data):
        avg = np.mean(data, axis=1, keepdims=True)
        return data - avg

    # performs PCA to transform the given data
    def transform(self, orig):
        # work with features as rows, samples as columns
        data = orig.transpose()
        # center data
        data = self.__center(data)
        # get covariance matrix of the features
        covariance_matrix = np.cov(data)
        # get eigenvectors & eigenvalues
        values, vectors = np.linalg.eig(covariance_matrix)
        # list of value-vector pairs
        pairs = [(np.abs(values[n]), vectors[:, n]) for n in range(len(values))]
        # sort by eigenvalue (descending)
        pairs.sort(key=lambda x: x[0], reverse=True)
        # keep the top 'k' eigenvectors as the change-of-basis matrix
        change_matrix = np.hstack([pairs[i][1].reshape(len(data), 1) for i in range(self.k)])
        # project onto the new basis and restore samples-as-rows
        return change_matrix.transpose().dot(data).transpose()
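One way to gain confidence in the implementation is to compare it against scikit-learn's PCA on random data. Eigenvector signs are arbitrary, so the sketch below compares absolute values; this check is an addition for illustration, not part of the original experiment:

from sklearn.decomposition import PCA

X_check = np.random.rand(50, 5)
mine = myPCA(2).transform(X_check)
theirs = PCA(n_components=2).fit_transform(X_check)
# the projected coordinates should agree up to the sign of each column
print(np.allclose(np.abs(mine), np.abs(theirs)))  # expect True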
In [4]:
# sum-squared error
def SSE(y_true, y_pred):
    return np.sum((y_pred - y_true) ** 2)
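
# a quick worked check of the metric (an illustrative addition):
# the errors are (0, 0, 2), so SSE = 2 ** 2 = 4
print(SSE(np.array([1.0, 2.0, 3.0]), np.array([1.0, 2.0, 5.0])))  # -> 4.0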
# find the best value of k for PCA with the given data set
def find_k(data):
# train on first 3/4, test with last 1/4
sp = int(-(len(data.data) / 4))
# RECORD SSEs
scores = {}
regr = linear_model.LinearRegression()
# try each k value
for k in range(1, data.data.shape[1] + 1):
# find SSE for this value of k
my_pca = myPCA(k)
data.transformed = my_pca.transform(data.data)
regr.fit(data.transformed[:sp], data.target[:sp])
scores[k] = SSE(data.target[-30:], regr.predict(data.transformed[-30:]))
# return best k value
return min(scores, key=scores.get)
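It can also help to look at the whole SSE-versus-k curve rather than just its minimum. A sketch using the pyplot import from the first cell (the helper names here are new, introduced only for this plot):

sse_per_k = {}
plot_regr = linear_model.LinearRegression()
plot_sp = -(len(cars.data) // 4)
for k in range(1, cars.data.shape[1] + 1):
    reduced = myPCA(k).transform(cars.data)
    plot_regr.fit(reduced[:plot_sp], cars.target[:plot_sp])
    sse_per_k[k] = SSE(cars.target[plot_sp:], plot_regr.predict(reduced[plot_sp:]))

plt.plot(list(sse_per_k), list(sse_per_k.values()))
plt.xlabel('k (number of components kept)')
plt.ylabel('SSE on held-out data')
plt.show()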
Calculate Results
In [5]:
sp = -45
regr = linear_model.LinearRegression()
In [6]:
# Find SSE for original data
regr.fit(cars.data[:sp], cars.target[:sp])
sse_1 = SSE(cars.target[sp:], regr.predict(cars.data[sp:]))
In [7]:
# Use PCA
k = find_k(cars)
print('Number of attributes lowered to', k)
transformed = myPCA(k).transform(cars.data)
# Find SSE for transformed dataset
regr.fit(transformed[:sp], cars.target[:sp])
sse_2 = SSE(cars.target[sp:], regr.predict(transformed[sp:]))
Results
PCA successfully lowers the SSE by around 24%.
In [8]:
print('Original SSE: ', sse_1)
print('SSE after PCA: ', sse_2)
print('SSE Lowered by: {0:.2f}%'.format((sse_1 - sse_2) / sse_1 * 100))