PCA from First Principles
Principal Component Analysis (PCA) is a fundamental dimensionality reduction technique. It transforms the data into a new coordinate system such that the greatest variance under any projection of the data lies along the first coordinate (the first principal component), the second-greatest variance along the second coordinate, and so on. This is achieved by finding the eigenvectors of the covariance matrix of the data.
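Concretely, the first principal component is the unit vector $w_1$ that maximizes the variance of the data projected onto it, and the maximizer of this quantity is the top eigenvector of the covariance matrix $C$ (a standard result, stated here for reference):

$$
w_1 = \arg\max_{\lVert w \rVert = 1} w^{\top} C w
$$

Each subsequent component solves the same problem subject to being orthogonal to the components already found, and the maximum attained is the corresponding eigenvalue.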
Your task is to implement PCA from scratch.
Implementation Details:
1. Mean Centering: Given a dataset $X$ (an $n \times d$ matrix, where $n$ is the number of samples and $d$ is the number of features), subtract the mean of each feature from the corresponding feature column.
2. Covariance Matrix: Compute the covariance matrix of the mean-centered data. For a mean-centered data matrix $X_c$, the covariance matrix is typically calculated as $C = \frac{1}{n-1} X_c^{\top} X_c$ (or $\frac{1}{n} X_c^{\top} X_c$ if $n$ is large, but $\frac{1}{n-1}$ is more common for sample covariance).
3. Eigen-decomposition: Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Sorting: Sort the eigenvectors by their corresponding eigenvalues in descending order. These sorted eigenvectors are the principal components.
5. Projection: Given a number of components $k$, select the top $k$ eigenvectors (principal components) to form a projection matrix $W$ of shape $d \times k$. Project the original mean-centered data onto these principal components to obtain the reduced-dimension data $X_{\text{reduced}} = X_c W$. (A NumPy sketch of steps 1-5 follows this list.)
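As a rough illustration of steps 1-5 (a minimal NumPy sketch, not a reference solution; the toy data and variable names are illustrative):

```python
import numpy as np

# Toy data: 6 samples, 3 features (illustrative values only)
X = np.random.default_rng(0).normal(size=(6, 3))
n, d = X.shape

# Step 1: mean centering
mean = X.mean(axis=0)
Xc = X - mean

# Step 2: sample covariance matrix, shape (d, d)
C = (Xc.T @ Xc) / (n - 1)  # equivalent to np.cov(Xc, rowvar=False)

# Step 3: eigen-decomposition (eigh is appropriate for symmetric matrices)
eigenvalues, eigenvectors = np.linalg.eigh(C)

# Step 4: sort eigenpairs by eigenvalue, descending
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 5: project onto the top k components
k = 2
W = eigenvectors[:, :k]  # projection matrix, shape (d, k)
X_reduced = Xc @ W       # reduced data, shape (n, k)
```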
Implement a Python class `MyPCA` with the following methods (a sketch of one possible implementation follows the list):
* `__init__(self, n_components)`: Initializes the PCA with the desired number of components.
* `fit(self, X)`: Computes the principal components from the input data `X`. Stores the mean, eigenvalues, and eigenvectors.
* `transform(self, X)`: Transforms new data `X` into the reduced-dimension space using the learned components.
* `fit_transform(self, X)`: Combines `fit` and `transform`.
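One possible shape for the class (a minimal sketch assuming NumPy and the $\frac{1}{n-1}$ sample covariance from step 2; the trailing-underscore attribute names mirror sklearn conventions but are an illustrative choice):

```python
import numpy as np

class MyPCA:
    def __init__(self, n_components):
        self.n_components = n_components
        self.mean_ = None
        self.eigenvalues_ = None
        self.components_ = None  # columns are principal components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        n = X.shape[0]
        # Mean-center and compute the sample covariance matrix
        self.mean_ = X.mean(axis=0)
        Xc = X - self.mean_
        C = (Xc.T @ Xc) / (n - 1)
        # Eigen-decomposition of the symmetric covariance matrix
        eigenvalues, eigenvectors = np.linalg.eigh(C)
        # Sort eigenpairs by eigenvalue, descending
        order = np.argsort(eigenvalues)[::-1]
        self.eigenvalues_ = eigenvalues[order]
        self.components_ = eigenvectors[:, order]
        return self

    def transform(self, X):
        # Center with the mean learned in fit, then project
        Xc = np.asarray(X, dtype=float) - self.mean_
        return Xc @ self.components_[:, :self.n_components]

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    @property
    def explained_variance_ratio_(self):
        # Fraction of total variance captured by each kept component
        return self.eigenvalues_[:self.n_components] / self.eigenvalues_.sum()
```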
Verification:
1. Generate a synthetic 2D or 3D dataset (e.g., using `np.random.multivariate_normal` with some correlation between features).
2. Apply your `MyPCA` implementation to reduce its dimensionality to 1 or 2 components.
3. Compare the results (transformed data and explained variance ratios) with `sklearn.decomposition.PCA` on the same dataset. Pay attention to the signs of the components: they may be flipped but still represent the same direction. The explained variance ratios should match closely.
4. Visualize the original and transformed data if possible (e.g., a 2D scatter plot). A sketch of this comparison follows.
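A verification sketch along these lines, assuming the `MyPCA` sketch above is in scope (the dataset parameters and the sign-alignment step are illustrative choices, not part of the task):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 2D data with correlated features
rng = np.random.default_rng(42)
cov = [[3.0, 1.5],
       [1.5, 1.0]]
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)

# Reduce to 1 component with both implementations
mine = MyPCA(n_components=1)
Z_mine = mine.fit_transform(X)
ref = PCA(n_components=1)
Z_ref = ref.fit_transform(X)

# Components may differ by sign; align before comparing
sign = np.sign(Z_mine[0, 0]) * np.sign(Z_ref[0, 0])
assert np.allclose(Z_mine, sign * Z_ref, atol=1e-8)
assert np.allclose(mine.explained_variance_ratio_,
                   ref.explained_variance_ratio_, atol=1e-8)
print("transformed data and explained variance ratios match")
```

For the visualization in step 4, a scatter plot of `X` alongside a histogram of `Z_mine` makes the variance captured by the first component easy to see.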