## Factor Models in Statistics

*07 Jan 2024 14:25*

\[ \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \]

Factor models are a specific kind of latent-variable model in multivariate statistics, where the latent variables and the observable variables are both continuous, and the relationship between the two is linear. The basic form is that the latent variable \( F \) is a \( q \)-dimensional vector, the observables form a \( p \)-dimensional vector \( X\), and the relationship is \[ X = F\mathbf{w} + \epsilon \] for a \( q\times p \) matrix of "loadings" \( \mathbf{w}\), and a \( p \)-dimensional noise vector \( \epsilon\), which is assumed to be independent of, or at least uncorrelated with, the factor vector \( F\), and with no correlations between the different coordinates of \( \epsilon \). The covariance matrix of \( \epsilon \) is thus a diagonal matrix \( \mathbf{\psi}\), and the covariance of \( X \) is \[ \Var{X} = \mathbf{w}^T \Var{F} \mathbf{w} + \mathbf{\psi} \]

Now, if \( \Var{F} \) is anything other than the identity matrix \( \mathbf{I}_q\), we can always use principal components analysis / singular value decomposition to define a new \( q \)-dimensional vector \( G \) where \( \Var{G} = \mathbf{I}_q \) and \( F = G\mathbf{r}\), for a \( q\times q \) matrix \( \mathbf{r} \). Then, since the relationship of \( F \) and \( X \) is linear, we also have \[ X=G \mathbf{r} \mathbf{w} +\epsilon \equiv G \mathbf{w}^{\prime} + \epsilon \]
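
As a small numerical sketch of this re-scaling (the sizes, seed and variable names are arbitrary choices for illustration, and I use an eigendecomposition of \( \Var{F} \), which is one of several ways to get a suitable \( \mathbf{r} \)):

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed
q, p = 3, 6                      # illustrative sizes only

# An arbitrary positive-definite Var[F] and an arbitrary q x p loading matrix w.
a = rng.normal(size=(q, q))
var_F = a @ a.T + np.eye(q)
w = rng.normal(size=(q, p))

# Eigendecomposition: var_F = evecs @ diag(evals) @ evecs.T
evals, evecs = np.linalg.eigh(var_F)

# With F = G r and Var[G] = I_q, we need r^T r = Var[F].
r = np.diag(np.sqrt(evals)) @ evecs.T
w_prime = r @ w                      # loadings on the new factors G

cov_from_F = w.T @ var_F @ w         # structured covariance in terms of F
cov_from_G = w_prime.T @ w_prime     # the same thing in terms of G

print(np.allclose(r.T @ r, var_F))
print(np.allclose(cov_from_F, cov_from_G))
```

Both checks should agree up to floating-point error, which is all the claim amounts to: the structured part of the covariance is the same whether we describe it with \( (F, \mathbf{w}) \) or with \( (G, \mathbf{w}^{\prime}) \).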

From a purely statistical point of view, therefore, we can always take the factor variables to have unit variance and no correlations. The covariance matrix of \( X \) is thus \[ \Var{X} = \mathbf{w}^T \mathbf{w} + \mathbf{\psi} \] Since \( \mathbf{w} \) has only \( q \) rows, \( \mathbf{w}^T \mathbf{w} \) has rank at most \( q \) (exactly \( q \) when the rows of \( \mathbf{w} \) are linearly independent), so the covariance matrix is of the form often called "low rank plus noise".
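
A quick simulation can make the "low rank plus noise" form concrete. This is only a sketch, assuming Gaussian factors and Gaussian noise (the model above does not require either), with made-up sizes and a made-up seed:

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed
q, p, n = 2, 5, 200_000               # illustrative sizes only

w = rng.normal(size=(q, p))               # q x p loadings
psi_diag = rng.uniform(0.5, 1.5, size=p)  # noise variances
psi = np.diag(psi_diag)                   # diagonal noise covariance

# One row per sample, following the row-vector convention X = F w + epsilon.
F = rng.normal(size=(n, q))                        # Var[F] = I_q (Gaussian by assumption)
eps = rng.normal(size=(n, p)) * np.sqrt(psi_diag)  # uncorrelated noise coordinates
X = F @ w + eps

implied_cov = w.T @ w + psi               # what the model says Var[X] should be
sample_cov = np.cov(X, rowvar=False)      # what the simulated data give

max_abs_error = np.max(np.abs(sample_cov - implied_cov))
rank_structured = np.linalg.matrix_rank(w.T @ w)
print(rank_structured, max_abs_error)
```

The structured part \( \mathbf{w}^T\mathbf{w} \) is a \( p \times p \) matrix of rank \( q \), and the sample covariance should approach \( \mathbf{w}^T \mathbf{w} + \mathbf{\psi} \) as the number of samples grows.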

Geometrically, as \( F \) ranges through \( \mathbb{R}^q\), the "structured" part
of \( X\), \( F\mathbf{w}\), traces out a \( q \)-dimensional linear subspace in
\( \mathbb{R}^p \). Taking each row of \( \mathbf{w} \) as a vector, that subspace is
the span of those vectors, and I'll abuse notation to write it
\( \mathrm{span}(\mathbf{w}) \). The distribution of \( F\mathbf{w} \) is the
distribution of values on \( \mathrm{span}(\mathbf{w}) \); \( \epsilon \) is the
distribution of perturbations off of \( \mathrm{span}(\mathbf{w}) \). The use of a
particular \( F \) is the choice of a particular coordinate system on
\( \mathrm{span}(\mathbf{w})\), but statistically any coordinate system is equally
good, because it won't change the distribution of \( F\mathbf{w} \). In fact, for
any invertible linear transformation of coordinates, where \( F = G \mathbf{r}\), we also
have \( F\mathbf{w} = G \mathbf{r}\mathbf{w}\), so if we re-define the loading
matrix to \( \mathbf{r}\mathbf{w}\), we have another factor model which predicts
exactly the same distribution of observables, and \( \mathrm{span}(\mathbf{w})
= \mathrm{span}(\mathbf{r}\mathbf{w}) \). (Nonlinear changes of coordinates,
e.g., from Cartesian to polar, would break the linear relationship between
the *coordinates* of \( F \) and those of \( X\), even though the linear
relationship between *vectors* would remain.) The
maximal identifiable parameters of
the factor model are thus the distribution of \( F\mathbf{w} \) and
\( \mathbf{\psi}\), *not* \( \mathbf{w} \) or the distribution of \( F \). (The
subspace \( \mathrm{span}(\mathbf{w}) \) is identified because it's implied by a
knowledge of the distribution of \( F\mathbf{w} \).) Of course, we might have
reason based on some other source of knowledge to prefer one coordinate system
over another, but this cannot come from the distribution of \( X\), the observable
variables. (See my discussion of Rohe and Zeng 2023 below.) Factor models can thus be seen as an example
of manifold learning, with the special
assumption that the manifold is a linear subspace.
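
The rotation argument can be checked numerically. The following is a sketch with hypothetical loadings and hypothetical coordinate changes (none of these numbers come from anywhere but my own convenience); `row_space_projector` is a helper name introduced just for this illustration:

```python
import numpy as np

# Hypothetical 2 x 4 loading matrix with linearly independent rows.
w = np.array([[1.0, 0.5, -0.3, 0.8],
              [0.2, -0.7, 0.6, 0.1]])

def row_space_projector(m):
    """Orthogonal projector onto the span of the rows of m (m must have full row rank)."""
    return m.T @ np.linalg.solve(m @ m.T, m)

# An orthogonal change of coordinates (a rotation by an arbitrary angle).
theta = 0.7
r_orth = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])

# A generic invertible, non-orthogonal change of coordinates.
r_any = np.array([[2.0, 1.0],
                  [0.5, 3.0]])

# Rotating unit-variance factors leaves the structured covariance alone...
same_structured_cov = np.allclose((r_orth @ w).T @ (r_orth @ w), w.T @ w)
# ...and any invertible r leaves span(w) alone...
same_span = np.allclose(row_space_projector(r_any @ w), row_space_projector(w))
# ...even though the loading matrix itself has changed.
loadings_changed = not np.allclose(r_orth @ w, w)

print(same_structured_cov, same_span, loadings_changed)
```

Comparing projectors is just one convenient way of testing whether two matrices have the same row space; the point is that the subspace and the covariance survive the change of coordinates while the individual loadings do not.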

("Confirmatory" factor analysis does not really change that last conclusion.
It essentially tests goodness of fit of an unrestricted estimate of \( \mathbf{w} \)
against the fit of a restricted model where one or more entries of \( \mathbf{w} \)
are fixed a priori, usually to 0. But any loading matrix with those
restrictions is observationally equivalent to a matrix \( \mathbf{u} =
\mathbf{r}\mathbf{w} \) where \( \mathbf{r} \) is any linear coordinate change, and a
continuous infinity of those \( \mathbf{u} \) will *not* have the desired
0s. Each zero in some coordinate system really imposes a one-equation
algebraic constraint on \( \mathbf{w}^T \mathbf{w}\), and we're really testing
those restrictions.)

I remarked above that in a factor model, the covariance matrix of the
observables \( \Var{X} \) always takes the form \( \mathbf{w}^T \mathbf{w} +
\mathbf{\psi} \). Suppose that all of the entries in \( \Var{X} \) are non-negative,
i.e., all of the observable variables are positively correlated with each other
(or at worst un-correlated). Now, it's evident that the off-diagonal entries
of \( \Var{X} \) and \( \mathbf{w}^T\mathbf{w} \) are equal, because \( \mathbf{\psi} \) is
a diagonal matrix. But the diagonal entries in \( \mathbf{w}^T\mathbf{w} \) are
the variances of \( F\mathbf{w} \) along the observable coordinates, so they must
be non-negative as well. Hence if \( \Var{X} \) has only non-negative entries, so
does \( \mathbf{w}^T\mathbf{w} \). At this point,
the Frobenius-Perron
theorem of linear algebra tells us that (i) all of the eigenvalues of
\( \mathbf{w}^T\mathbf{w} \) are non-negative, and (ii) the eigenvector
corresponding to the largest eigenvalue can be taken to have all positive entries (or at worst non-negative ones, if some correlations are exactly zero). We write
\( \mathbf{w}^T \mathbf{w} = \mathbf{v}^T \mathbf{\lambda} \mathbf{v}\), with
\( \mathbf{\lambda} \) the diagonal, non-negative matrix of eigenvalues and
\( \mathbf{v} \) the corresponding matrix of eigenvectors. (In general,
Frobenius-Perron allows for different left and right eigenvectors, but
\( \mathbf{w}^T\mathbf{w} \) is manifestly symmetric.) We can thus *choose*
to set \( \mathbf{w} = \mathbf{\lambda}^{1/2} \mathbf{v} \). In context, this
means we can choose a coordinate system for \( F \) where each latent coordinate
(i) "explains" some (positive) share of the variance of the observables, and
(ii) the latent coordinate which "explains" the largest share of variance is
positively correlated with all observables. If you want to call this latent
coordinate a "general factor", feel free, but notice this has to exist,
mathematically, as soon as you have both all-positive correlations *and*
decide to use a factor model.
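
Here is a sketch of that construction on a made-up example. The loading matrix is hypothetical, chosen only so that every entry of \( \mathbf{w}^T\mathbf{w} \) is positive; I also keep just the \( q \) eigenvectors with non-zero eigenvalues, since the rest contribute nothing to \( \mathbf{\lambda}^{1/2}\mathbf{v} \):

```python
import numpy as np

# Hypothetical 2 x 4 loadings; the second row has mixed signs.
w = np.array([[1.0, 0.8, 0.9, 0.7],
              [0.3, -0.3, 0.2, -0.2]])
q = w.shape[0]
structured = w.T @ w                     # all entries are positive for this w

# Symmetric eigendecomposition; eigenvectors come back as columns.
evals, evecs = np.linalg.eigh(structured)
top = np.argsort(evals)[::-1][:q]        # indices of the q largest eigenvalues
lam = evals[top]
v = evecs[:, top].T                      # rows are eigenvectors, largest eigenvalue first

# Eigenvectors are only defined up to sign; orient the leading one positively.
if v[0].sum() < 0:
    v[0] *= -1

w_new = np.sqrt(lam)[:, None] * v        # the re-chosen loadings, lambda^{1/2} v

print((w_new[0] > 0).all())                        # leading row: every loading positive
print(np.allclose(w_new.T @ w_new, structured))    # same structured covariance as before
```

The first row of `w_new` plays the role of the "general factor": it loads positively on every observable, and it appears purely as a consequence of the positive entries of \( \mathbf{w}^T\mathbf{w} \) plus the choice of coordinates.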

I should explain how factor analysis is related to, but different from, principal components analysis, but having just been writing about this in my lecture notes, I'll just refer to those (link below).

*Questions:* In large dimensions, how different do random covariance
matrices look from low-rank-plus-noise matrices? For every \( q \)-factor model
with given means and covariance matrices, there is a mixture model with \( q+1 \)
discrete clusters and the same means and covariances; is the distinction
between them identifiable?
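
For the simplest case, \( q = 1 \), the matching mixture is easy to write down: replace the unit-variance factor with a coin flip between \( +1 \) and \( -1 \), which gives two clusters centered at \( \pm \mathbf{w} \). The sketch below (with made-up loadings and noise variances) only checks the first two moments, using the law of total covariance rather than simulation; it says nothing about whether the two models can be told apart from higher moments.

```python
import numpy as np

# Hypothetical one-factor model: 1 x p loadings and diagonal noise covariance.
w = np.array([[0.9, 0.5, -0.4]])
psi = np.diag([0.3, 0.6, 0.5])

# Factor model with Var[F] = 1 and mean-zero F: mean 0, covariance w^T w + psi.
factor_mean = np.zeros(w.shape[1])
factor_cov = w.T @ w + psi

# Two-cluster mixture: F is +1 or -1 with probability 1/2 each, so the cluster
# centers are +w and -w, each with within-cluster covariance psi.
weights = np.array([0.5, 0.5])
centers = np.vstack([w[0], -w[0]])

mixture_mean = weights @ centers
# Law of total covariance: within-cluster part plus between-cluster part.
deviations = centers - mixture_mean
mixture_cov = psi + (weights[:, None] * deviations).T @ deviations

print(np.allclose(mixture_mean, factor_mean))
print(np.allclose(mixture_cov, factor_cov))
```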

Notoriously, factor analysis comes out of psychology, and the attempt to infer general mental attributes from test scores. But I have said more than enough about that elsewhere.

Constructively: I am very interested in the possibilities of factor models for data reduction in high-dimensional time-series data, especially spatio-temporal data.

- See also:
- Graphical Models
- Manifold Learning
- Mixture Models
- Partial Identification
- The Thomson Ability-Sampling Model
- Recommendation Systems and Collaborative Filtering
- Symmetries of Neural Networks

- Recommended, big picture:
- David J. Bartholomew, Latent Variable Models and Factor Analysis

- Recommended, close-ups:
- J. Scott Armstrong, "Derivation of theory by means of factor analysis or Tom Swift and his electric factor analysis machine", The American Statistician **21:5**(1967): 17--21 [Reprint]
- Philipp Fleig and Ilya Nemenman, "Statistical properties of large data sets with linear latent features", Physical Review E **106**(2022): 014102, arxiv:2111.04641
- Yi-hao Kao and Benjamin Van Roy, "Learning a Factor Model via Regularized PCA", Machine Learning **91**(2013): 279--303, arxiv:1111.6201
- Wim Krijnen, "Positive Loadings and Factor Correlations from Positive Covariances", Psychometrika **69**(2004): 655--660
- John C. Loehlin, Latent Variable Models: An Introduction to Factor, Path, and Structural Analysis
- Paul E. Meehl, "Four Queries About Factor Reality", History and Philosophy of Psychology Bulletin **5**(1993): 4--5
- Robert A. Peterson, "A Meta-Analysis of Variance Accounted for and Factor Loadings in Exploratory Factor Analysis", Marketing Letters **11**(2000): 261--275
- Kristopher J. Preacher and Robert C. MacCallum, "Repairing Tom Swift's Electric Factor Analysis Machine", Understanding Statistics **2**(2003): 13--43 [The objection about nonlinearity seems deeply unfair to me: if one is actually using factor analysis to *discover* things, when would one be willing to assert the linearity? Similarly, how is one to know which rotation of the factors is better? By definition, the different rotations are all empirically equivalent, so in a genuinely new domain we don't know which ones group related quantities together. Reprint]
- Karl Rohe, Muzhe Zeng, "Vintage Factor Analysis with Varimax Performs Statistical Inference", Journal of the Royal Statistical Society B **85**(2023): 1037--1060, arxiv:2004.05387 [Let me preface my remarks by saying that (1) this is a technically adept and well-written paper which makes nice contributions to a number of applied areas; (2) rhetorically, I figure in the paper as the stodgy voice of received error, to be overthrown by new knowledge; and (3) Karl was nonetheless very gracious when we corresponded about the paper. All of which said: the rotation problem leads to lack of identifiability of the factors (equivalently, of the loadings) because we're free to use any coordinate system we like for the factor space. We don't even have to use Cartesian coordinates, we could use (e.g.) spherical and then the observables would be nonlinear functions of the coordinates in the latent space. The assumption Drs. Rohe and Zeng are making here is (roughly) "there are preferred axes in the latent space, where most samples depart from the origin on just one (or a few) coordinates". *If* that is true then it is indeed reasonable, all else being equal, to prefer using those axes as the basis for factor coordinates. (Why they should be orthogonal to each other I couldn't begin to say.) Someone who arbitrarily rotated to a different basis would still not be *wrong*, they'd just find describing the distribution in the latent space more complicated than Rohe and Zeng would.]
- Mohamed Saidane, Xavier Bry and Christian Lavergne, "Generalized Linear Factor Models: A New Local EM Estimation Algorithm", Communications in Statistics: Theory and Methods **42**(2013): 2944--2958

- Recommended, historical:
- Charles Spearman, "'General Intelligence,' Objectively Determined and Measured", American Journal of Psychology **15**(1904): 201--293 [Online]
- Peter H. Schönemann and Ming-Mei Wang, "Some new results on factor indeterminacy", Psychometrika **37**(1972): 61--91
- James H. Steiger
  - "Factor indeterminacy in the 1930's and the 1970's some interesting parallels", Psychometrika **44**(1979): 157--167
  - "The relationship between external variables and common factors", Psychometrika **44**(1979): 93--97
- Godfrey H. Thomson, The Factorial Analysis of Human Ability
- L. L. Thurstone, "The Vectors of Mind", Psychological Review
**41**(1934): 1--32 [Online]

- Modesty forbids me to recommend:
- The chapters on principal components and factor analysis in Advanced Data Analysis from an Elementary Point of View

- To read:
- Umberto Amato, Anestis Antoniadis, Alexander Samarov, Alexander Tsybakov, "Noisy Independent Factor Analysis Model for Density Estimation and Classification", arxiv:0906.2885
- Ery Arias-Castro, Sébastien Bubeck, and Gábor Lugosi, "Detection of correlations", Annals of Statistics **40**(2012): 412--435
- Jushan Bai and Kunpeng Li, "Statistical analysis of factor models of high dimension", Annals of Statistics **40**(2012): 436--465
- Tony Cai, Zongming Ma, Yihong Wu, "Optimal Estimation and Rank Detection for Sparse Spiked Covariance Matrices", arxiv:1305.3235
- Venkat Chandrasekaran, Pablo A. Parrilo, and Alan S. Willsky, "Latent variable graphical model selection via convex optimization", Annals of Statistics **40**(2012): 1935--1967, arxiv:1008.1290
- Yunxiao Chen, Xiaoou Li & Siliang Zhang, "Structured Latent Factor Analysis for Large-scale Data: Identifiability, Estimability, and Their Implications", Journal of the American Statistical Association forthcoming (2019), arxiv:1712.08966
- John P. Cunningham, Zoubin Ghahramani, "Unifying linear dimensionality reduction", arxiv:1406.0873
- Jianqing Fan, Jianhua Guo and Shurong Zheng, "Estimating Number of Factors by Adjusted Eigenvalues Thresholding", Journal of the American Statistical Association **117**(2022): 852--861
- Jianqing Fan, Yuan Liao, "Learning Latent Factors from Diversified Projections and its Applications to Over-Estimated and Weak Factors", Journal of the American Statistical Association **117**(2022): 909--924, arxiv:1908.01252
- Jianqing Fan, Yuan Liao, and Martina Mincheva, "High-dimensional covariance matrix estimation in approximate factor models", Annals of Statistics **39**(2011): 3320--3356
- Philipp Fleig, Ilya Nemenman, "Generative probabilistic matrix model of data with different low-dimensional linear latent structures", arxiv:2212.02987
- Rémi Gribonval, Rodolphe Jenatton, Francis Bach, Martin Kleinsteuber, Matthias Seibert, "Sample Complexity of Dictionary Learning and other Matrix Factorizations", arxiv:1312.3790
- Giles Hooker, Steven Roberts, "Maximal Autocorrelation Functions in Functional Data Analysis", arxiv:1407.4578
- V. I. Koltchinskii, "Empirical geometry of multivariate data: a deconvolution approach", Annals of Statistics **28**(2000): 591--629
- Dennis Leung, Mathias Drton, Hisayuki Hara, "Identifiability of directed Gaussian graphical models with one latent source", arxiv:1505.01583
- Vinicius Diniz Mayrink and Joseph Edward Lucas, "Sparse latent factor models with interactions: Analysis of gene expression data", Annals of Applied Statistics **7**(2013): 799--822
- Roderick P. McDonald, "A simple comprehensive model for the analysis of covariance structures: Some remarks on applications", British Journal of Mathematical and Statistical Psychology **33**(1980): 161--183
- Art B. Owen and Jingshu Wang, "Bi-Cross-Validation for Factor Analysis", Statistical Science **31**(2016): 119--139, arxiv:1503.03515
- Patrick O. Perry, Art B. Owen, "A Rotation Test to Verify Latent Structure", Journal of Machine Learning Research **11**(2010): 603--624
- Armeen Taeb, Venkat Chandrasekaran, "Interpreting Latent Variables in Factor Models via Convex Optimization", arxiv:1601.00389
- Yichuan Tang, Ruslan Salakhutdinov, Geoffrey Hinton, "Deep Mixtures of Factor Analysers", arxiv:1206.4635
- Tyler J. VanderWeele, Stijn Vansteelandt, "A statistical test to reject the structural interpretation of a latent factor model", arxiv:2006.15899
- Xianli Zeng, Yingcun Xia, Linjun Zhang, "Double Cross Validation for the Number of Factors in Approximate Factor Models", arxiv:1907.01670
- Kai Zhang, "Rank-Extreme Association of Gaussian Vectors and Low-Rank Detection", arxiv:1306.0623

- To write:
- CRS, "General Factors in Correlational Psychology: Artifacts and Myths"
- CRS, "Semi-Parametric Generalized-Linear Factor Models" [Or, a brilliant scheme for replacing the linear dependence on the factors with a generalized additive model, if I can just get it to work...]

#### Update history

(Incomplete)

- Major revision 9 November 2017
- Fixed minor algebraic typos and a broken link, added new paragraph about positively-loaded "general factors" on 6 August 2019
- Added discussion of Rohe and Zeng (2023) after correspondence with Prof. Rohe on 7 January 2024