Factor Models in Statistics

27 Mar 2024 21:45

\[ \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \]

Factor models are a specific kind of latent-variable model in multivariate statistics, where the latent variables and the observable variables are both continuous, and the relationship between the two is linear. The basic form is that the latent variable \( F \) is a \( q \)-dimensional vector, the observables form a \( p \)-dimensional vector \( X\), and the relationship is \[ X = F\mathbf{w} + \epsilon \] for a \( q\times p \) matrix of "loadings" \( \mathbf{w}\), and a \( p \)-dimensional noise vector \( \epsilon\), which is assumed to be independent of, or at least uncorrelated with, the factor vector \( F\), and with no correlations between the different coordinates of \( \epsilon \). The covariance matrix of \( \epsilon \) is thus a diagonal matrix \( \mathbf{\psi}\), and the covariance of \( X \) is \[ \Var{X} = \mathbf{w}^T \Var{F} \mathbf{w} + \mathbf{\psi} \]
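A minimal simulation sketch of the covariance formula above, assuming Gaussian factors and noise (the model itself does not require Gaussianity) and using illustrative values for the sample size, dimensions, \( \Var{F} \), and \( \mathbf{\psi} \). Rows of the arrays are observations, matching the row-vector convention \( X = F\mathbf{w} + \epsilon \).

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, p = 100_000, 2, 5            # sample size, latent dim, observed dim (illustrative)

w = rng.normal(size=(q, p))        # q x p loading matrix
var_f = np.array([[2.0, 0.6],
                  [0.6, 1.0]])     # Var[F], deliberately not the identity
psi = np.diag(rng.uniform(0.5, 1.5, size=p))   # diagonal noise covariance

# Draw factors and noise independently (Gaussian is an assumption of this sketch).
F = rng.multivariate_normal(np.zeros(q), var_f, size=n)
eps = rng.normal(size=(n, p)) * np.sqrt(np.diag(psi))
X = F @ w + eps

implied = w.T @ var_f @ w + psi    # w^T Var[F] w + psi
sample = np.cov(X, rowvar=False)   # empirical p x p covariance of the observables
max_abs_gap = np.max(np.abs(sample - implied))
```

With a sample this large, `max_abs_gap` should be small relative to the variances on the diagonal of `implied`; the gap is pure sampling error.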

Now, if \( \Var{F} \) is anything other than the identity matrix \( \mathbf{I}_q\), we can always use principal components analysis / singular value decomposition to define a new \( q \)-dimensional vector \( G \) where \( \Var{G} = \mathbf{I}_q \) and \( F = G\mathbf{r}\), for a \( q\times q \) matrix \( \mathbf{r} \). Then, since the relationship of \( F \) and \( X \) is linear, we also have \[ X=G \mathbf{r} \mathbf{w} +\epsilon \equiv G \mathbf{w}^{\prime} + \epsilon \]
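One way to build such an \( \mathbf{r} \) explicitly, sketched under the assumption that \( \Var{F} \) is positive definite: take the eigendecomposition \( \Var{F} = \mathbf{U}\mathbf{\Lambda}\mathbf{U}^T \) and set \( \mathbf{r} = \mathbf{\Lambda}^{1/2}\mathbf{U}^T \), so that \( \mathbf{r}^T\mathbf{r} = \Var{F} \). The names `var_f`, `w_prime`, and the numerical values are illustrative, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
q, p = 2, 4
var_f = np.array([[2.0, 0.6],
                  [0.6, 1.0]])         # an example positive-definite Var[F]
w = rng.normal(size=(q, p))            # original loadings

# Eigendecomposition: var_f = evecs @ diag(evals) @ evecs.T
evals, evecs = np.linalg.eigh(var_f)
r = np.diag(np.sqrt(evals)) @ evecs.T  # q x q, chosen so that r.T @ r == var_f

# G = F r^{-1}, hence Var[G] = (r^{-1})^T Var[F] r^{-1}
r_inv = np.linalg.inv(r)
var_g = r_inv.T @ var_f @ r_inv

w_prime = r @ w                        # re-defined loadings for the whitened factors
```

Here `var_g` comes out as the identity, and `w_prime.T @ w_prime` reproduces `w.T @ var_f @ w`, so the structured part of \( \Var{X} \) is unchanged.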

From a purely statistical point of view, therefore, we can always take the factor variables to have unit variance and no correlations. The covariance matrix of \( X \) is thus \[ \Var{X} = \mathbf{w}^T \mathbf{w} + \mathbf{\psi} \] Since \( \mathbf{w} \) has only \( q \) rows, \( \mathbf{w}^T \mathbf{w} \) has rank at most \( q \) (exactly \( q \) when the rows of \( \mathbf{w} \) are linearly independent), so the covariance matrix is of the form often called "low rank plus noise".
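A quick numerical check of the rank claim, as a sketch with randomly drawn loadings (an assumption made only for illustration). The \( p \times p \) matrix \( \mathbf{w}^T\mathbf{w} \) has rank \( q \) when the rows of \( \mathbf{w} \) are linearly independent and lower rank when they are not, while adding a positive diagonal \( \mathbf{\psi} \) makes the full covariance matrix full rank.

```python
import numpy as np

rng = np.random.default_rng(2)
q, p = 3, 8
w = rng.normal(size=(q, p))                    # generic loadings: rows independent
psi = np.diag(rng.uniform(0.5, 1.5, size=p))   # strictly positive noise variances

low_rank = w.T @ w                             # p x p, but built from only q rows
cov_x = low_rank + psi                         # "low rank plus noise"

rank_low = np.linalg.matrix_rank(low_rank)
rank_cov = np.linalg.matrix_rank(cov_x)

# A degenerate loading matrix with a repeated row gives rank below q.
w_dup = np.vstack([w[0], w[0], w[1]])
rank_dup = np.linalg.matrix_rank(w_dup.T @ w_dup)
```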

Change of Coordinates and the "Rotation Problem"

Geometrically, as \( F \) ranges through \( \mathbb{R}^q\), the "structured" part of \( X\), \( F\mathbf{w}\), traces out a \( q \)-dimensional linear subspace in \( \mathbb{R}^p \). Taking each row of \( \mathbf{w} \) as a vector, that subspace is the span of those vectors, and I'll abuse notation to write it \( \mathrm{span}(\mathbf{w}) \). The distribution of \( F\mathbf{w} \) is the distribution of values on \( \mathrm{span}(\mathbf{w}) \); \( \epsilon \) is the distribution of perturbations off of \( \mathrm{span}(\mathbf{w}) \). The use of a particular \( F \) is the choice of a particular coordinate system on \( \mathrm{span}(\mathbf{w})\), but statistically any coordinate system is equally good, because it won't change the distribution of \( F\mathbf{w} \). In fact, for any invertible linear transformation of coordinates, where \( F = G \mathbf{r}\), we also have \( F\mathbf{w} = G \mathbf{r}\mathbf{w}\), so if we re-define the loading matrix to \( \mathbf{r}\mathbf{w}\), we have another factor model which predicts exactly the same distribution of observables, and \( \mathrm{span}(\mathbf{w}) = \mathrm{span}(\mathbf{r}\mathbf{w}) \). (Nonlinear changes of coordinates, e.g., from Cartesian to polar, would break the linear relationship between the coordinates of \( F \) and those of \( X\), even though the linear relationship between vectors would remain.) The maximal identifiable parameters of the factor model are thus the distribution of \( F\mathbf{w} \) and \( \mathbf{\psi}\), not \( \mathbf{w} \) or the distribution of \( F \). (The subspace \( \mathrm{span}(\mathbf{w}) \) is identified because it's implied by a knowledge of the distribution of \( F\mathbf{w} \).) Of course, we might have reason based on some other source of knowledge to prefer one coordinate system over another, but this cannot come from the distribution of \( X\), the observable variables. (See my discussion of Rohe and Zeng 2023 below.)
Factor models can thus be seen as an example of manifold learning, with the special assumption that the manifold is a linear subspace.
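The change-of-coordinates argument can be checked numerically. This is a sketch, assuming an arbitrary invertible \( 2 \times 2 \) matrix `r` picked for illustration; `row_space_projector` is a helper introduced here, not something from the text. It computes the orthogonal projector onto the span of the rows of a matrix, which is one way to compare two subspaces.

```python
import numpy as np

rng = np.random.default_rng(3)
n, q, p = 1000, 2, 5
w = rng.normal(size=(q, p))          # original loadings
F = rng.normal(size=(n, q))          # factor scores in the original coordinates

r = np.array([[2.0, 1.0],
              [0.5, 3.0]])           # an arbitrary invertible coordinate change
G = F @ np.linalg.inv(r)             # new coordinates, so that F = G r
u = r @ w                            # re-defined loading matrix

def row_space_projector(m):
    """Orthogonal projector (p x p) onto the span of the rows of m."""
    return m.T @ np.linalg.solve(m @ m.T, m)

# The structured part of X is identical sample by sample...
same_structured_part = np.allclose(F @ w, G @ u)
# ...and the two loading matrices span the same subspace of R^p,
same_span = np.allclose(row_space_projector(w), row_space_projector(u))
# even though the loadings themselves are different.
loadings_differ = not np.allclose(w, u)
```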

("Confirmatory" factor analysis does not really change that last conclusion. It essentially tests goodness of fit of an unrestricted estimate of \( \mathbf{w} \) against the fit of a restricted model where one or more entries of \( \mathbf{w} \) are fixed a priori, usually to 0. But any loading matrix with those restrictions is observationally equivalent to a matrix \( \mathbf{u} = \mathbf{r}\mathbf{w} \) where \( \mathbf{r} \) is any linear coordinate change, and a continuous infinity of those \( \mathbf{u} \) will not have the desired 0s. Each zero in some coordinate system really imposes a one-equation algebraic constraint on \( \mathbf{w}^T \mathbf{w}\), and we're really testing those restrictions.)
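A small sketch of that point, using a made-up "simple structure" loading matrix in which each observable loads on only one factor. Rotating the factor coordinates destroys every zero in the loadings, but leaves \( \mathbf{w}^T\mathbf{w} \) unchanged, including the zero cross-block covariances that the original zeros imply. The numbers and the rotation angle are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical restricted loadings: observables 1-3 load only on factor 1,
# observables 4-6 only on factor 2.
w = np.array([[0.9, 0.8, 0.7, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.6, 0.5, 0.4]])

theta = np.pi / 6                    # any angle that is not a multiple of pi/2
r = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # orthogonal coordinate change
u = r @ w                            # observationally equivalent loadings

zeros_in_w = int(np.sum(w == 0.0))
zeros_in_u = int(np.sum(np.isclose(u, 0.0)))

gram_w = w.T @ w
gram_u = u.T @ u
# The testable restriction lives in w^T w: the cross-block entries are zero.
cross_block_w = gram_w[:3, 3:]
cross_block_u = gram_u[:3, 3:]
```

The zeros vanish from `u`, yet `cross_block_u` is still zero, which is the kind of restriction on \( \mathbf{w}^T\mathbf{w} \) that the data can actually speak to.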

All-Positive Covariances and "General Factors"

I remarked above that in a factor model, the covariance matrix of the observables \( \Var{X} \) always takes the form \( \mathbf{w}^T \mathbf{w} + \mathbf{\psi} \). Suppose that all of the entries in \( \Var{X} \) are non-negative, i.e., all of the observable variables are positively correlated with each other (or at worst uncorrelated). Now, it's evident that the off-diagonal entries of \( \Var{X} \) and \( \mathbf{w}^T\mathbf{w} \) are equal, because \( \mathbf{\psi} \) is a diagonal matrix. But the diagonal entries in \( \mathbf{w}^T\mathbf{w} \) are the variances of \( F\mathbf{w} \) along the observable coordinates, so they must be non-negative as well. Hence if \( \Var{X} \) has only non-negative entries, so does \( \mathbf{w}^T\mathbf{w} \). At this point, linear algebra tells us two things: (i) because \( \mathbf{w}^T\mathbf{w} \) is positive semi-definite, all of its eigenvalues are non-negative, and (ii) by the Frobenius-Perron theorem, the eigenvector corresponding to the largest eigenvalue can be chosen to have all non-negative entries, and all strictly positive entries when every entry of \( \mathbf{w}^T\mathbf{w} \) is strictly positive. We write \( \mathbf{w}^T \mathbf{w} = \mathbf{v}^T \mathbf{\lambda} \mathbf{v}\), with \( \mathbf{\lambda} \) the diagonal, non-negative matrix of eigenvalues and \( \mathbf{v} \) the corresponding matrix of eigenvectors. (In general, Frobenius-Perron allows for different left and right eigenvectors, but \( \mathbf{w}^T\mathbf{w} \) is manifestly symmetric.) We can thus choose to set \( \mathbf{w} = \mathbf{\lambda}^{1/2} \mathbf{v} \), keeping only the rows for the \( q \) largest eigenvalues, since the remaining eigenvalues are zero. In context, this means we can choose a coordinate system for \( F \) where each latent coordinate (i) "explains" some (positive) share of the variance of the observables, and (ii) the latent coordinate which "explains" the largest share of variance is positively correlated with all observables. If you want to call this latent coordinate a "general factor", feel free, but notice this has to exist, mathematically, as soon as you both have all-positive correlations and decide to use a factor model.
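The argument can be illustrated numerically. This is a sketch, assuming a loading matrix with strictly positive entries drawn at random, which is just one convenient way to get a \( \mathbf{w}^T\mathbf{w} \) with all-positive entries. Note that `np.linalg.eigh` returns eigenvectors as columns, so they are transposed to match the row convention of \( \mathbf{v} \) in the text, and the sign of an eigenvector is arbitrary, so the leading one is flipped to be positive.

```python
import numpy as np

rng = np.random.default_rng(4)
q, p = 2, 6
w = rng.uniform(0.1, 1.0, size=(q, p))   # strictly positive loadings (assumption)
gram = w.T @ w                           # p x p, every entry strictly positive

evals, evecs = np.linalg.eigh(gram)      # eigenvalues ascending, eigenvectors as columns
order = np.argsort(evals)[::-1][:q]      # indices of the q largest eigenvalues
lam = np.clip(evals[order], 0.0, None)   # guard against tiny negative round-off
v = evecs[:, order].T                    # q x p: rows are eigenvectors
v[0] *= np.sign(v[0].sum())              # fix the arbitrary sign of the leading one

# Loadings in the eigenvector coordinate system: w = lambda^{1/2} v
w_eig = np.sqrt(lam)[:, None] * v

leading_all_positive = bool(np.all(w_eig[0] > 0))
same_gram = np.allclose(w_eig.T @ w_eig, gram)
```

The first row of `w_eig` plays the role of the "general factor": every observable loads positively on it, purely as a consequence of the all-positive entries of `gram`.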

I should explain how factor analysis is related to, but different from, principal components analysis, but having just been writing about this in my lecture notes, I'll just refer to those (link below).

Questions: In large dimensions, how different do random covariance matrices look from low-rank-plus-noise matrices? For every \( q \)-factor model with given means and covariance matrices, there is a mixture model with \( q+1 \) discrete clusters and the same means and covariances; is the distinction between them identifiable if we go to higher moments?
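To make the second question concrete, here is a sketch of one way (an assumption of this example, not necessarily the construction behind the claim) to get \( q+1 \) clusters matching a \( q \)-factor model in mean and covariance: put equally weighted cluster centres at the vertices of a regular simplex in the latent space, scaled to have mean zero and identity covariance, and give each cluster the same diagonal within-cluster noise \( \mathbf{\psi} \). The loadings and noise variances are random illustrative values.

```python
import numpy as np

rng = np.random.default_rng(5)
q, p = 2, 5
w = rng.normal(size=(q, p))
psi = np.diag(rng.uniform(0.5, 1.5, size=p))

k = q + 1                                       # number of clusters
centering = np.eye(k) - np.ones((k, k)) / k     # projector orthogonal to the ones vector
evals, evecs = np.linalg.eigh(centering)        # eigenvalues: one 0, then q ones
basis = evecs[:, 1:]                            # k x q, orthonormal columns summing to zero
centers = np.sqrt(k) * basis                    # latent cluster centres, one per row

latent_mean = centers.mean(axis=0)              # equal cluster weights 1/k
latent_cov = centers.T @ centers / k            # valid because the mean is zero

mixture_cov = w.T @ latent_cov @ w + psi        # covariance of X under the mixture
factor_cov = w.T @ w + psi                      # covariance of X under the factor model

# Fourth moment of each latent coordinate under the mixture;
# a standard Gaussian coordinate would have fourth moment 3.
fourth_moment = np.mean(centers**4, axis=0)
```

The first two moments agree exactly, while the fourth moments of the discrete latent variable differ from the Gaussian value, which is at least suggestive that higher moments could separate the two models when the factors are assumed Gaussian.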

Notoriously, factor analysis comes out of psychology, and the attempt to infer general mental attributes from test scores. But I have said more than enough about that elsewhere.

Constructively: I am very interested in the possibilities of factor models for data reduction in high-dimensional time-series data, especially spatio-temporal data.

Update history