Principal Component Analysis | K Nearest Neighbour

Posted by Pritish Roy on November 23, 2020

An eigenface is the name given to a set of eigenvectors when used in the computer vision problem of human face recognition. The approach of using eigenfaces for recognition was developed by Sirovich and Kirby (1987) and used by Matthew Turk and Alex Pentland in face classification. The eigenvectors are derived from the covariance matrix of the probability distribution over the high-dimensional vector space of face images. The eigenfaces themselves form a basis set of all images used to construct the covariance matrix. This produces dimension reduction by allowing the smaller set of basis images to represent the original training images. Classification can be achieved by comparing how faces are represented by the basis set.

A set of eigenfaces can be generated by performing a mathematical process called principal component analysis (PCA) on a large set of images depicting different human faces. Informally, eigenfaces can be considered a set of "standardized face ingredients", derived from statistical analysis of many pictures of faces. Any human face can be considered a combination of these standard faces. For example, one's face might be composed of the average face plus 10 per cent of eigenface 1, 55 per cent of eigenface 2, and even minus 3 per cent of eigenface 3. Remarkably, it does not take many eigenfaces combined to achieve a fair approximation of most faces. Also, because a person's face is not stored as a full digital photograph but as just a list of values (one value for each eigenface in the database used), much less space is taken for each person's face.
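To make the "combination of standard faces" idea concrete, here is a minimal numpy sketch, with random arrays standing in for a mean face and three eigenfaces (the names and weights are illustrative, echoing the percentages in the example above):

import numpy as np

rng = np.random.default_rng(0)
mean_face = rng.random(1024)        # hypothetical 32x32 mean face, flattened
eigenfaces = rng.random((3, 1024))  # three hypothetical eigenfaces, one per row

# weights per eigenface: +10%, +55%, -3%, as in the example above
weights = np.array([0.10, 0.55, -0.03])

# a face is approximated as the mean face plus a weighted sum of eigenfaces
face_approx = mean_face + weights @ eigenfaces

# storing the 3 weights takes far less space than the 1024 pixel values
print(face_approx.shape, weights.nbytes, mean_face.nbytes)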

To understand why we perform Principal Component Analysis we first need to understand the curse of dimensionality. Complexity increases with dimension d, where d is the number of attributes. If d is large, the number of samples n may be too small for accurate parameter estimation. The sample data becomes sparse, and distances calculated between samples become less informative: two samples may look more dissimilar than they really are. Hence we should avoid creating lots of features; high dimensionality is challenging and often redundant. The main idea of PCA is to seek the most accurate data representation in a lower-dimensional space. For instance, if the data is in 2D space we try to project it onto a 1D subspace in a way that minimises the projection error. The best line to use for the projection lies in the direction of the largest variance. After the data is projected onto that line, we transform the coordinate system to get the 1D representation y. The new data y retains the largest variance of the old data x: PCA preserves as much of the variance in the data as possible.
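Here is a minimal sketch of that 2D-to-1D projection on synthetic data; the variable names are illustrative, and the check at the end confirms that the variance of the projected data equals the largest eigenvalue of the covariance matrix:

import numpy as np

rng = np.random.default_rng(1)
# hypothetical 2D data with most of its variance along one direction
x = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])

x_centred = x - x.mean(axis=0)          # centre the data
cov = np.cov(x_centred, rowvar=False)   # 2x2 covariance matrix
eig_val, eig_vec = np.linalg.eigh(cov)  # eigenvalues in ascending order

direction = eig_vec[:, -1]              # direction of largest variance
y = x_centred @ direction               # 1D representation

# the variance of y equals the largest eigenvalue of the covariance matrix
print(np.isclose(y.var(ddof=1), eig_val[-1]))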

From either objective (maximising the retained variance or minimising the projection error), it can be shown that the principal components are eigenvectors of the data's covariance matrix. Thus, the principal components are often computed by eigendecomposition of the data covariance matrix or by singular value decomposition (SVD) of the data matrix. PCA is the simplest of the true eigenvector-based multivariate analyses and is closely related to factor analysis. Factor analysis typically incorporates more domain-specific assumptions about the underlying structure and solves eigenvectors of a slightly different matrix.
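As a quick sanity check on synthetic data, here is a sketch showing that the two routes agree (up to sign); the names are illustrative:

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=(50, 5))
x_centred = x - x.mean(axis=0)

# route 1: eigendecomposition of the covariance matrix
eig_val, eig_vec = np.linalg.eigh(np.cov(x_centred, rowvar=False))

# route 2: SVD of the centred data matrix; the right-singular vectors
# are the same principal directions (up to sign)
_, s, vt = np.linalg.svd(x_centred, full_matrices=False)

# compare the top principal direction from both routes
print(np.allclose(np.abs(eig_vec[:, -1]), np.abs(vt[0])))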


  • Prepare a training set of face images, all resampled to a common pixel resolution (r x c). Each image is treated as one vector, formed by concatenating the rows of pixels of the original image into a single column with r x c elements. For this implementation, it is assumed that all images of the training set are stored in a single matrix T, where each column of the matrix is an image.
  • Subtract the mean. The average image a has to be calculated and then subtracted from each original image in T.
  • Calculate the eigenvectors and eigenvalues of the covariance matrix S. Each eigenvector has the same dimensionality (number of components) as the original images, and thus can itself be seen as an image. The eigenvectors of this covariance matrix are therefore called eigenfaces. They are the directions in which the images differ from the mean image. Usually, this is a computationally expensive step (if feasible at all), but the practical applicability of eigenfaces stems from the possibility of computing the eigenvectors of S efficiently, without ever computing S explicitly, as shown in the sketch after this list.
  • Choose the principal components. Sort the eigenvalues in descending order and arrange the eigenvectors accordingly. The number of principal components k is determined by setting a threshold on the total variance: k is the smallest number of components that satisfies the threshold.
  • Project each face image into the eigenface subspace: (image_vector - mean_face) · top_eigen_vectors.
  • Find the Euclidean distance between the projected training images and the projected test image, and classify the test image by its nearest neighbour (1NN) among the eigenface projections.
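The efficient computation mentioned in the third step works because the eigenvectors of the small n x n Gram matrix Tᵀ·T map directly to eigenvectors of the huge d x d covariance matrix T·Tᵀ. Here is a minimal sketch on random stand-in data (the names A, gram_vec and eigenfaces are illustrative, not part of the implementation below):

import numpy as np

rng = np.random.default_rng(3)
d, n = 1024, 20              # 32x32 pixels, 20 training images
A = rng.normal(size=(d, n))  # columns are mean-subtracted face vectors

# instead of the huge d x d covariance A @ A.T, eigendecompose the
# small n x n Gram matrix A.T @ A ...
gram_val, gram_vec = np.linalg.eigh(A.T @ A)

# ... and map each eigenvector u of A.T @ A to the eigenvector A @ u
# of A @ A.T (if (A.T A) u = λ u, then (A A.T)(A u) = λ (A u))
eigenfaces = A @ gram_vec
eigenfaces /= np.linalg.norm(eigenfaces, axis=0)  # normalise each column

# check against the direct (expensive) decomposition for the top component
big_val, big_vec = np.linalg.eigh(A @ A.T)
print(np.allclose(np.abs(eigenfaces[:, -1]), np.abs(big_vec[:, -1])))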

These eigenfaces can now be used to represent both existing and new faces: we can project a new (mean-subtracted) image onto the eigenfaces and thereby record how that new face differs from the mean face. The eigenvalue associated with each eigenface represents how much the images in the training set vary from the mean image in that direction. Information is lost by projecting the image onto a subset of the eigenvectors, but losses are minimized by keeping those eigenfaces with the largest eigenvalues. For instance, working with 100 × 100 images will produce up to 10,000 eigenvectors. In practical applications, most faces can typically be identified using a projection on between 100 and 150 eigenfaces, so that most of the 10,000 eigenvectors can be discarded.
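A common way to decide how many eigenfaces to keep is the variance threshold mentioned in the steps above. A minimal sketch, assuming eig_val holds the eigenvalues sorted in descending order (the spectrum here is randomly generated for illustration):

import numpy as np

rng = np.random.default_rng(4)
eig_val = np.sort(rng.random(100))[::-1]  # hypothetical spectrum, descending

threshold = 0.95                                # keep 95% of total variance
explained = np.cumsum(eig_val) / eig_val.sum()  # cumulative variance ratio

# k is the smallest number of components whose cumulative ratio
# reaches the threshold
k = int(np.searchsorted(explained, threshold)) + 1
print(k, explained[k - 1])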

Python Implementation:

# @author: roycek

import math
import sys

import matplotlib.pyplot as plt
import numpy as np

pixel = (32, 32)  # pixels
plt.figure(figsize=(5, 5))  # figure size
color_scale = 'gray'  # gray scale
image_rotation = 3  # rotate 270 degrees

KNN = 1  # k nearest neighbour
arg_req, top_faces, min_K = 5, 5, 5  # arguments required, top 5 eigenvectors, minimum value of K

def mean_face(array, plot_img=False) -> np.ndarray:
    """mean face of the training set and plot image if flagged"""
    m_f = array.mean(axis=1)
    if plot_img:
        plot_image(rotate_image(np.resize(m_f, pixel)), title='Mean Face')
    return m_f

def centre_mean(array) -> np.ndarray:
    """subtract the mean face (per-pixel mean across images) from the training set"""
    return array - array.mean(axis=0)  # rows are images, columns are pixels

def compute_covariance(matrix) -> np.ndarray:
    """(unnormalised) covariance matrix of the mean-centred dataset"""
    return np.dot(matrix.T, matrix)

def compute_eig_vector_values(cov_matrix) -> tuple:
    """returns eigenvalues and eigenvectors sorted by ascending eigenvalue (np.linalg.eigh)"""
    return np.linalg.eigh(cov_matrix)

def compute_dot_product(x, y) -> np.ndarray:
    """returns the dot product of two arrays"""
    return np.dot(x, y)

def rotate_image(vector) -> np.ndarray:
    """rotate image three times"""
    return np.rot90(vector, image_rotation)

def k_eig_vectors(eig_val, eig_vec, k_th) -> np.ndarray:
    """returns the k largest eigenvectors, sorted by descending eigenvalue and normalised"""
    eig_vectors = eig_vec[:, eig_val.argsort()[::-1]]    # reorder columns, largest eigenvalue first
    eig_vectors /= np.linalg.norm(eig_vectors, axis=0)   # normalise each eigenvector (column)
    return eig_vectors[:, :k_th]

def sort_tuple(tup, y) -> list:
    """returns the list of tuples sorted by the y-th element"""
    tup.sort(key=lambda x: x[y])
    return tup

def euclidean_distance(test_data_vector, training_data_vector) -> float:
    """calculates the euclidean distance between two vectors"""
    return math.sqrt(sum([(a - b) ** 2 for a, b in zip(test_data_vector, training_data_vector)]))

def reconstruct_image(vector, m_f, top_ev, k, weight=None, weights_calculated=False) -> np.ndarray:
    """reconstruct a vector from precomputed weights if flagged, else project, reconstruct and plot"""
    if weights_calculated:
        return compute_dot_product(weight, top_ev) + m_f
    reconstructed_image = m_f + compute_dot_product(compute_dot_product(vector - m_f, top_ev), top_ev.T)
    plot_image(rotate_image(np.resize(reconstructed_image, pixel)), title=f'Reconstructed Test Image for K: {k}')
    return reconstructed_image

def accuracy_metric(actual, predicted, correct=0) -> float:
    """calculate accuracy of the dataset comparing the actual test label vs the predicted label"""
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / len(actual) * 100.0

def plot_image(image, image_vector=None, subplot=False, horizontal=False, title=''):
    """plot a single image, or a horizontal/vertical strip of subplots based on flags"""
    if not subplot:
        plt.imshow(image, cmap=color_scale)
    else:
        for i in range(top_faces if horizontal else len(image)):
            if horizontal:
                image_vector = rotate_image(np.resize(image[:, i], pixel))
            plt.subplot(1 if horizontal else len(image), top_faces if horizontal else 1, i + 1)
            plt.imshow(image_vector if horizontal else image[i], cmap=color_scale)
    plt.suptitle(title)
    plt.show()

def plot_requirements(training_data, test_data, k, eig_value, eig_vector, train_label,
                      test_label, top_eig_vectors, incorrect_images, threshold):
    """plots the mean face of the training set, the top 5 eigenfaces, reconstructions of a test image
    for different K, the 1NN classification rate up to K, and the incorrectly classified faces"""
    # requirement 1: mean face and top 5 eigenfaces
    mean_face(training_data.T, plot_img=True)
    plot_image(top_eig_vectors, subplot=True, horizontal=True, title='Top 5 EigenFaces')

    # requirement 2: reconstruct the first test image with an increasing number of components
    for i in range(10, 120, 30):
        reconstruct_image(test_data[0], mean_face(training_data.T),
                          k_eig_vectors(eig_value, eig_vector, k_th=i), i)

    # requirement 3: 1NN classification rate as a function of the number of components
    print(f'Computing plot for the nearest-neighbour (1NN) classification rate of K: {k}')
    plt.plot(range(1, k), [knn_classifier(training_data, test_data, mean_face(training_data.T),
                                          k_eig_vectors(eig_value, eig_vector, i),
                                          train_label, test_label, threshold, project=True)[0]
                           for i in range(1, k)])
    plt.title(f'1NN classification rate of K: {k}')
    plt.show()

    # requirement 4: incorrectly classified faces
    plot_image(incorrect_images, subplot=True, title=f'Incorrect Classified Faces for K: {k}')

def knn_classifier(train_images, test_images, mean_f, top_eig_vectors,
                   train_label, test_label, threshold, print_statement=False, project=False) -> tuple:
    """finds the euclidean distance between reconstructed image and the test image, calculates the 1NN of the
    eigen-face and returns the accuracy, incorrect faces vectors and non-face rate"""
    no_face_detected = 0
    predicted, incorrect_images = [], []
    projected = [compute_dot_product(img_vec - mean_f, top_eig_vectors) for img_vec in train_images]
    for i in range(len(test_images)):
        distance = []
        for j in range(len(projected)):
            if project:
                # distance in the eigenface subspace between the projected training and test images
                eu_distance = euclidean_distance(projected[j],
                                                 compute_dot_product(test_images[i] - mean_f, top_eig_vectors))
            else:
                # distance between the reconstructed training image and the raw test image
                eu_distance = euclidean_distance(reconstruct_image(0, mean_f, top_eig_vectors.T, 0,
                                                                   weight=projected[j], weights_calculated=True),
                                                 test_images[i])
            distance.append((eu_distance, train_label[j]))
        minimum_distant_neighbours = sort_tuple(distance, 0)[:KNN]  # KNN=1: keep the single nearest neighbour

        nearest_neighbour, reconstruct_error = [], []
        for j in minimum_distant_neighbours:
            reconstruct_error.append(j[0]), nearest_neighbour.append(j[-1])

        predicted.append(max(nearest_neighbour, key=nearest_neighbour.count))

        if print_statement:
            print(f'({i + 1}) ~ TEST IMAGE LABEL: {int(test_label[i])} | '
                  f'RECONSTRUCTION ERROR: {round(reconstruct_error[0], 2)} | '
                  f'CLASSIFIED TRAINING LABEL: {int(predicted[i])}')

        if reconstruct_error[0] > threshold:
            no_face_detected += 1
            if print_statement:
                print(f'| TEST IMAGE: {int(test_label[i])} CLASSIFIED @ NON-FACE |')

        if test_label[i] != predicted[i]:
            # keep the misclassified test image alongside the training image it was matched to
            incorrect_images.append(rotate_image(np.resize(test_images[i], pixel)))
            incorrect_images.append(rotate_image(np.resize(train_images[int(predicted[i])], pixel)))

    non_face_rate = 100 * no_face_detected / len(test_images)
    return round(accuracy_metric(test_label, predicted), 2), incorrect_images, round(non_face_rate, 2)

def eig_faces(argv):
    """loads the training data, training labels, number of principal components K, test data and
    test labels, computes the top eigenvectors and classifies the test images"""
    try:
        if len(argv) < arg_req:
            raise Exception('insufficient arguments')

        # load training/test data and labels; K is the number of principal components
        training_data, training_data_label = np.genfromtxt(argv[0]), np.genfromtxt(argv[1])
        K = int(argv[2])
        test_data, test_data_label = np.genfromtxt(argv[3]), np.genfromtxt(argv[4])

        if K < min_K:
            print(f'Minimum value of K should be greater than or equal to {min_K}'.upper())
            raise Exception('K too small')

        eig_value, eig_vector = compute_eig_vector_values(compute_covariance(centre_mean(training_data)))
        top_eig_vectors = k_eig_vectors(eig_value, eig_vector, K)

        threshold = 1115  # reconstruction-error threshold above which a test image is flagged as a non-face
        accuracy, incorrect_images, non_face_rate = knn_classifier(training_data, test_data,
                                                                   mean_face(training_data.T), top_eig_vectors,
                                                                   training_data_label, test_data_label,
                                                                   threshold, print_statement=True)

        print(f'Accuracy: {accuracy}%\nNon-Face Rate: {non_face_rate}%')

        plot_requirements(training_data, test_data, K, eig_value, eig_vector, training_data_label,
                          test_data_label, top_eig_vectors, incorrect_images, threshold)
    except Exception as e:
        print(f'{str(e).upper()}\nFollow Command:\npython eigenfacesassignment.py faces_train.txt '
              f'faces_train_labels.txt 10 faces_test.txt faces_test_labels.txt')

if __name__ == '__main__':
    eig_faces(sys.argv[1:])