A tool for dimensionality reduction, lossy data compression, feature extraction, and data visualization

Definition
> The orthogonal projection of high-dimensional data onto a lower-dimensional linear space, called the *principal subspace*, such that the variance of the projected data is maximized

Orders the components by decreasing variance ("energy", i.e. the amount of information they carry)

Two definitions, same algorithm:
1. Orthogonal projection of the data onto a lower-dimensional space, the ***principal subspace***, that maximizes the variance of the projected data
2. Linear projection that minimizes the average projection cost
	- the mean squared distance between the data points and their projections
	- i.e. find the low-dimensional space that best represents the observed data

# Maximum Variance Formulation
- Start with a data set $\{\mathbf{x}_n\}_{n=1}^N$ of dimension $D$
- Project the data onto a subspace of dimension $M<D$ while maximizing the variance of the projected data
	- the direction of the (one-dimensional) subspace is given by $\mathbf{u}_1$
	- assume $\mathbf{u}_1$ is a unit vector, so $\mathbf{u}_1^T\mathbf{u}_1 = 1$
- Project each data point by taking the inner product $\mathbf{u}_1^T\mathbf{x}_n$
- Sample mean: $\bar{\mathbf{x}} = \frac{1}{N}\sum_{n=1}^N\mathbf{x}_n$
- Sample mean of the projected data: $\mathbf{u}_1^T\bar{\mathbf{x}}$
- Variance of the projected data:
$$
\frac{1}{N}\sum_{n=1}^N\big(\mathbf{u}_1^T\mathbf{x}_n - \mathbf{u}_1^T\bar{\mathbf{x}}\big)^2 = \mathbf{u}_1^T\mathbf{S}\mathbf{u}_1
$$
$$
\mathbf{S}=\frac{1}{N}\sum_{n=1}^N(\mathbf{x}_n-\bar{\mathbf{x}})(\mathbf{x}_n-\bar{\mathbf{x}})^T
$$
- $\mathbf{S}$ is the sample covariance matrix
- Enforce the unit-norm constraint with a Lagrange multiplier $\lambda_1$ and maximize:
$$
\frac{\partial}{\partial \mathbf{u}_1}\Big[\mathbf{u}_1^T\mathbf{S}\mathbf{u}_1+\lambda_1\big(1-\mathbf{u}_1^T\mathbf{u}_1\big)\Big] = 0
$$
$$
2\mathbf{S}\mathbf{u}_1-2\lambda_1\mathbf{u}_1 = 0
$$
$$
\mathbf{S}\mathbf{u}_1 = \lambda_1\mathbf{u}_1
$$
- so $\mathbf{u}_1$ must be an eigenvector of $\mathbf{S}$
- left-multiplying both sides by $\mathbf{u}_1^T$ (and using $\mathbf{u}_1^T\mathbf{u}_1 = 1$):
$$
\mathbf{u}_1^T\mathbf{S}\mathbf{u}_1 = \lambda_1\mathbf{u}_1^T\mathbf{u}_1 = \lambda_1
$$
- the variance equals the eigenvalue, so to maximize the variance choose $\lambda_1$ to be the largest eigenvalue
- $\mathbf{u}_1$, the eigenvector of $\mathbf{S}$ with the largest eigenvalue, is called the ***first principal component*** of the data set

Algorithm (a sketch follows the list):
1. Evaluate the sample mean $\bar{\mathbf{x}}$ and covariance $\mathbf{S}$
2. Find the $M$ eigenvectors of $\mathbf{S}$ corresponding to the $M$ largest eigenvalues $\{\lambda_1, \lambda_2,\dots,\lambda_M\}$
3. Project the data onto the subspace using $\mathbf{y} = \mathbf{U}^T\mathbf{x}$ (commonly applied to the centered data, $\mathbf{U}^T(\mathbf{x}-\bar{\mathbf{x}})$), where $\mathbf{U} = [\mathbf{u}_1,\mathbf{u}_2,\dots,\mathbf{u}_M]$ is the matrix with the eigenvectors as columns
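To make the three steps concrete, here is a minimal NumPy sketch of the algorithm. The function name `pca_project`, the array shapes, and the toy data are illustrative choices, not from the source; the sketch projects the centered data, matching the convention noted in step 3.

```python
import numpy as np

def pca_project(X, M):
    """Project data X (N x D) onto its M-dimensional principal subspace.

    Returns the projected coordinates Y (N x M), the eigenvector matrix
    U (D x M), and the eigenvalues of the retained components.
    """
    # 1. Sample mean and covariance matrix S
    x_bar = X.mean(axis=0)
    X_centered = X - x_bar
    S = X_centered.T @ X_centered / X.shape[0]

    # 2. Eigendecomposition of S; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1][:M]   # indices of the M largest eigenvalues
    U = eigvecs[:, order]                   # columns are u_1, ..., u_M
    lambdas = eigvals[order]

    # 3. Project the centered data: y_n = U^T (x_n - x_bar)
    Y = X_centered @ U
    return Y, U, lambdas

# Usage: reduce 5-dimensional toy data to M = 2 principal components
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y, U, lambdas = pca_project(X, M=2)
print(Y.shape, lambdas)   # (200, 2) and the two largest eigenvalues of S
```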
# Minimum Error Formulation
Start with a complete orthonormal basis $\{\mathbf{u}_i\}$ whose vectors satisfy:
$$
\mathbf{u}_i^T\mathbf{u}_j=\delta_{ij} =
\begin{cases}
1 \quad\text{if } i=j \\
0 \quad \text{otherwise}
\end{cases}
$$
Rewrite each data point in this basis:
$$
\mathbf{x}_n = \sum_{i=1}^D\alpha_{ni}\mathbf{u}_i
$$
- $\alpha_{ni}$ measures how strongly each $\mathbf{u}_i$ contributes to the data point
- the $\alpha_{ni}$ are the coordinates of the data point in the new basis
- i.e. the point $[x_{n1},x_{n2},\dots,x_{nD}]^T$ is represented by the new coordinates $[\alpha_{n1},\alpha_{n2},\dots,\alpha_{nD}]^T$
- orthonormality gives $\alpha_{nj} = \mathbf{x}_n^T\mathbf{u}_j$, so:
$$
\mathbf{x}_n = \sum_{i=1}^D(\mathbf{x}_n^T\mathbf{u}_i)\mathbf{u}_i
$$
Approximate $\mathbf{x}_n$ using only $M$ of the $D$ coordinates; the discarded directions are replaced by constants $b_i$ shared by all data points:
$$
\tilde{\mathbf{x}}_n = \sum_{i=1}^M z_{ni}\mathbf{u}_i +\sum_{i=M+1}^D b_i\mathbf{u}_i
$$
Choose $\mathbf{u}_i$, $z_{ni}$, and $b_i$ to minimize the distortion between $\tilde{\mathbf{x}}_n$ and $\mathbf{x}_n$:
$$
J = \frac{1}{N}\sum_{n=1}^N\|\mathbf{x}_n-\tilde{\mathbf{x}}_n\|^2
$$
- Differentiating with respect to $z_{ni}$ and $b_i$ and setting the results to zero yields:
$$
z_{nj} = \mathbf{x}_n^T\mathbf{u}_j\qquad j = 1, 2, \dots, M
$$
$$
b_j = \bar{\mathbf{x}}^T\mathbf{u}_j\qquad j = M+1,\dots, D
$$
The displacement vector then lies entirely in the space orthogonal to the principal subspace:
$$
\mathbf{x}_n-\tilde{\mathbf{x}}_n = \sum_{i=M+1}^D\big[(\mathbf{x}_n-\bar{\mathbf{x}})^T\mathbf{u}_i\big]\mathbf{u}_i
$$
Substitute into the distortion measure $J$:
$$
\begin{align}
J &= \frac{1}{N}\sum_{n=1}^N\bigg\|\sum_{i=M+1}^D\big[(\mathbf{x}_n-\bar{\mathbf{x}})^T\mathbf{u}_i\big]\mathbf{u}_i\bigg\|^2\\
&=\frac{1}{N}\sum_{n=1}^N\sum_{i=M+1}^D\big(\mathbf{x}_n^T\mathbf{u}_i-\bar{\mathbf{x}}^T\mathbf{u}_i\big)^2 \\
&=\sum_{i=M+1}^D\mathbf{u}_i^T\mathbf{S}\mathbf{u}_i
\end{align}
$$
Minimizing $J$ subject to the orthonormality constraints again gives an eigenvector equation:
$$
\mathbf{S}\mathbf{u}_i = \lambda_i\mathbf{u}_i
$$
$$
J = \sum_{i=M+1}^D\lambda_i
$$
The distortion is the sum of the eigenvalues of the discarded directions $i = M+1,\dots,D$, so it is minimized by discarding the directions with the smallest eigenvalues (a numerical check is sketched below).

[[Probabilistic PCA]]
[[Factor Analysis]]
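The identity $J = \sum_{i=M+1}^D\lambda_i$ can be verified numerically. The sketch below is an illustration with made-up data, not from the source; the reconstruction $\tilde{\mathbf{x}}_n = \bar{\mathbf{x}} + \mathbf{U}\mathbf{U}^T(\mathbf{x}_n - \bar{\mathbf{x}})$ follows from the optimal $z_{nj}$ and $b_j$ above.

```python
import numpy as np

# Check that the minimum distortion J equals the sum of the discarded
# eigenvalues of S (data and variable names are illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))   # correlated toy data
M = 3

x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(S)                 # ascending eigenvalues
U = eigvecs[:, np.argsort(eigvals)[::-1][:M]]        # top-M principal directions

# Reconstruct x_tilde_n = x_bar + U U^T (x_n - x_bar) and evaluate J
X_tilde = x_bar + Xc @ U @ U.T
J = np.mean(np.sum((X - X_tilde) ** 2, axis=1))

discarded = np.sort(eigvals)[::-1][M:]               # the D - M smallest eigenvalues
print(J, discarded.sum())   # the two values agree up to floating-point error
```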