Gavin Scott

PCA
Principal Components Analysis

Principal Components Analysis (PCA) is a dimensionality-reduction technique that is often used to reduce the number of attributes in a dataset before it is analyzed with some other machine learning technique. By identifying the directions along which the data varies least and discarding them, it lets you simplify your data, possibly increasing the accuracy of the models trained on it.
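
As a quick standalone illustration (separate from the notebook below, with my own variable names), the whole technique fits in a few lines of NumPy: center the data, eigendecompose its covariance matrix, and project onto the top eigenvectors.

import numpy as np

# synthetic 2-D data where the two features are strongly correlated
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 0.5 * x + 0.05 * rng.normal(size=200)])

# center each feature, then eigendecompose the covariance matrix
Xc = X - X.mean(axis=0)
values, vectors = np.linalg.eigh(np.cov(Xc.T))

# keep the direction with the largest eigenvalue (the most variance)
top = vectors[:, np.argmax(values)]
reduced = Xc @ top
print(reduced.shape)  # (200,) -- two features reduced to one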

In the example below, a Linear Regression model is used to predict the price of a car from its other attributes. Initially each car is described by 12 numeric attributes; PCA reduces this to only two, lowering the sum-squared error (SSE) of the Linear Regression model by around 24%.

The code and output are shown below, and are also available as an iPython notebook and on BitBucket. The data file used in the notebook is included here.
In [1]:
import numpy as np
from matplotlib import pyplot as plt
from sklearn import linear_model

Dataset

This dataset, stored in 'cars2.csv', contains data about different cars. This experiment uses the numerical variables (listed after the next cell) to predict the price of the car. PCA will identify which features affect the results the most, allowing the less useful features to be filtered out to improve the results.
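
If you want to peek at the file before running the notebook (my addition; it assumes pandas is installed and only mirrors the column slicing used below):

import pandas as pd

df = pd.read_csv('cars2.csv')
print(df.shape)
# the numeric attribute columns used in this experiment
print(df.columns[9:14].tolist() + df.columns[18:-1].tolist())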

In [2]:
# represents a data set
class Data:
    def __init__(self, data, target, columns):
        self.data = np.array(data)
        self.target = np.array(target)
        self.columns = columns
        self.transformed = None

# converts a string to a float when possible (e.g. '13495' -> 13495.0),
# otherwise returns it unchanged (e.g. '?' stays '?')
def fl(v):
    try:
        return float(v)
    except ValueError:
        return v

# build the data set 
with open('cars2.csv', 'r') as f:
    vals = []
    tgt = []
    columns = [x.strip() for x in f.readline().split(',')]
    tgt_col = columns[-1]
    # keep only the numeric attribute columns
    columns = columns[9:14] + columns[18:-1]
    for line in f.readlines():
        # skip rows with a missing value ('?') outside the first few columns
        if '?' not in line[25:]:
            d = line.split(',')
            vals.append([fl(x) for x in d[9:14] + d[18:-1]])
            tgt.append(fl(d[-1]))

cars = Data(vals, tgt, columns)

print('Variables:')
print('\t', cars.columns)
print('\n\tUsed to predict:', tgt_col)
Variables:
	 ['wheel-base', 'length', 'width', 'height', 'curb-weight', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg']

	Used to predict: price

Implementation

In [3]:
class myPCA():
    def __init__(self, k):
        self.k = k
    
    # centers the data by subtracting each feature's mean
    def __center(self, data):
        avg = np.mean(data, axis=1, keepdims=True)
        return data - avg
    
    # performs PCA to transform the given data
    def transform(self, orig):
        # work with features as rows, samples as columns
        data = self.__center(orig.transpose())
        # get the covariance matrix of the features
        covariance_matrix = np.cov(data)
        # get its eigenvalues & eigenvectors
        values, vectors = np.linalg.eig(covariance_matrix)
        # list of (eigenvalue, eigenvector) pairs
        pairs = [(np.abs(values[n]), vectors[:, n]) for n in range(len(values))]
        # sort by eigenvalue (descending)
        pairs.sort(key=lambda x: x[0], reverse=True)
        # keep the top 'k' eigenvectors as the change-of-basis matrix
        change_matrix = np.column_stack([pairs[i][1] for i in range(self.k)])
        # project the data onto the top 'k' components
        return change_matrix.transpose().dot(data).transpose()
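
As a quick sanity check (my addition, not part of the original notebook), myPCA can be compared against scikit-learn's sklearn.decomposition.PCA. Each principal component is only defined up to a sign flip, so the two projections should agree in absolute value:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(50, 5)

mine = myPCA(2).transform(X)
theirs = PCA(n_components=2).fit_transform(X)

# compare magnitudes column by column; should print True for generic data
print(np.allclose(np.abs(mine), np.abs(theirs)))
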
In [4]:
# sum-squared error (implemented here as the mean of the squared errors;
# the constant 1/n factor doesn't affect which model scores better)
def SSE(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

# find the best value of k for PCA with the given data set
def find_k(data):
    # train on the first 3/4; the last 30 rows (held out
    # from training) are used for scoring
    sp = int(-(len(data.data) / 4))
    # record the SSE for each value of k
    scores = {}
    regr = linear_model.LinearRegression()
    # try each possible number of components
    for k in range(1, data.data.shape[1] + 1):
        # find the SSE for this value of k
        my_pca = myPCA(k)
        data.transformed = my_pca.transform(data.data)
        regr.fit(data.transformed[:sp], data.target[:sp])
        scores[k] = SSE(data.target[-30:], regr.predict(data.transformed[-30:]))
    # return the k with the lowest SSE
    return min(scores, key=scores.get)
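
An alternative way to choose k (my addition; the original notebook only uses the held-out SSE) is the explained-variance ratio: keep the smallest k whose eigenvalues account for, say, 95% of the total variance. A minimal sketch, with hypothetical names:

# smallest k whose components explain at least 'threshold' of the variance
def find_k_by_variance(data, threshold=0.95):
    # eigenvalues of the feature covariance matrix (np.cov centers internally)
    values = np.linalg.eigvalsh(np.cov(data.T))[::-1]  # descending order
    ratios = np.cumsum(values) / np.sum(values)
    return int(np.argmax(ratios >= threshold)) + 1

On the unscaled cars data this would likely pick a very small k, since a large-valued feature like curb-weight dominates the total variance.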

Calculate Results

In [5]:
# hold out the last 45 rows for the final comparison
sp = -45
regr = linear_model.LinearRegression()
In [6]:
# Find SSE for original data
regr.fit(cars.data[:sp], cars.target[:sp])
sse_1 = SSE(cars.target[sp:], regr.predict(cars.data[sp:]))
In [7]:
# Use PCA
k = find_k(cars)
print('Number of attributes lowered to', k)
transformed = myPCA(k).transform(cars.data)
# Find SSE for transformed dataset
regr.fit(transformed[:sp], cars.target[:sp])
sse_2 = SSE(cars.target[sp:], regr.predict(transformed[sp:]))
Number of attributes lowered to 2

Results

PCA successfully lowers the SSE by around 24%.

In [8]:
print('Original SSE:   ', sse_1)
print('SSE after PCA:  ', sse_2)
print('SSE Lowered by:  {0:.2f}%'.format((sse_1 - sse_2) / sse_1 * 100))
Original SSE:    14025991.431
SSE after PCA:   10636449.4596
SSE Lowered by:  24.17%
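
One caveat worth noting (my addition, not covered in the notebook above): PCA is sensitive to feature scale, so large-valued attributes like curb-weight dominate the covariance matrix of this dataset. Standardizing the features first is a common refinement; a minimal sketch using scikit-learn's StandardScaler (results will differ from those above):

from sklearn.preprocessing import StandardScaler

# rescale every feature to zero mean and unit variance before PCA
scaled = StandardScaler().fit_transform(cars.data)
scaled_cars = Data(scaled, cars.target, cars.columns)
print('Best k on standardized data:', find_k(scaled_cars))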