Random Feature Regularization
Last update: 01 May 2025 10:01
First version: 1 May 2025
Attention conservation notice: A slightly mad idea, written down to get it out of my head, without having worked on it at all.

\[ \DeclareMathOperator*{\argmin}{argmin} \]
Start with ordinary, unpenalized statistical estimation: \[ \hat{\theta} = \argmin_{\theta \in \Theta}{L_n(X_{1:n}, \theta)} \] where \( X_{1:n} \) are our $n$ data points, and \( L_n \) is a loss function, e.g., mean squared error, or normalized negative log-likelihood. Now add a penalty term: \[ \tilde{\theta} = \argmin_{\theta \in \Theta}{\left( L_n(X_{1:n}, \theta) + \lambda \pi(\theta) \right)} \] with $\pi$ being the penalty function and $\lambda$ a factor which says how strong the penalty is (perhaps varying with $n$).
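A minimal numerical sketch of the penalized estimator, directly instantiating the display above; the function names, the use of `scipy.optimize.minimize`, and squared-error loss as the example $L_n$ are illustrative choices of mine, nothing canonical:

```python
import numpy as np
from scipy.optimize import minimize

def penalized_estimate(X, y, loss, penalty, lam, theta0):
    """Numerically minimize L_n(X_{1:n}, theta) + lam * pi(theta)."""
    objective = lambda theta: loss(X, y, theta) + lam * penalty(theta)
    return minimize(objective, theta0, method="BFGS").x

def mse(X, y, theta):
    """Mean squared error of a linear model, one possible choice of L_n."""
    return np.mean((y - X @ theta) ** 2)

# Setting lam = 0 (or penalty = lambda theta: 0.0) recovers the
# ordinary, unpenalized estimator theta-hat.
```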
Adding a penalty term to a statistical model can help it predict better because it stabilizes the estimates: it adds bias (unless you're very lucky) but reduces variance. (Equivalently, a penalty enforces a constraint: only the subset of $\Theta$ where $\pi(\theta) \leq c$, for some $c$ varying with $\lambda$, is feasible ["a fine is a price"], and searching over that constrained set is more stable than searching over the full, unconstrained parameter space.) Ordinarily, the advice is to think carefully about the penalty function $\pi$, because different penalties will encourage different sorts of behavior in your model. I have written papers with such morals myself.
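In symbols, the constrained form is \[ \tilde{\theta} = \argmin_{\theta \in \Theta:\, \pi(\theta) \leq c}{L_n(X_{1:n}, \theta)}, \] with the penalized and constrained problems linked by the correspondence between $\lambda$ and $c$.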
Suppose instead that we draw $\pi$ at random from some distribution over (well-behaved) functions on the parameter space $\Theta$. How much stabilization will we get? How much bias and of what sorts? Say, to be really concrete, that $\Theta = \mathbb{R}^d$ and we sample sine waves on $\Theta$ from a Gaussian distribution of frequencies. (These are perfectly good random feature bases when we do this on the $X$ space.) What would a linear regression with such random-Fourier-feature regularization look like, generically?
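Here is a sketch of what that might look like, with the specific functional form (an average of random sinusoids with Gaussian frequencies and uniform random phases), the scale parameters, and all the names being guesses of mine rather than anything worked out:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def random_fourier_penalty(d, k=50, freq_scale=1.0):
    """Draw pi at random: an average of k sinusoids on Theta = R^d,
    with frequencies drawn from a Gaussian and phases drawn uniformly."""
    omega = rng.normal(scale=freq_scale, size=(k, d))  # random frequencies
    phase = rng.uniform(0.0, 2 * np.pi, size=k)        # random phases
    return lambda theta: np.mean(np.sin(omega @ theta + phase))

def random_feature_regularized_ols(X, y, lam=0.1, k=50, freq_scale=1.0):
    """Linear regression with a randomly drawn sinusoidal penalty on theta."""
    pi = random_fourier_penalty(X.shape[1], k=k, freq_scale=freq_scale)
    objective = lambda theta: np.mean((y - X @ theta) ** 2) + lam * pi(theta)
    return minimize(objective, np.zeros(X.shape[1]), method="BFGS").x

# Tiny simulation, to compare against ordinary least squares:
n, d = 100, 5
X = rng.normal(size=(n, d))
theta_true = rng.normal(size=d)
y = X @ theta_true + rng.normal(scale=0.5, size=n)
theta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
theta_rfr = random_feature_regularized_ols(X, y, lam=0.1)
```

Repeating the last step over many independent draws of $\pi$ (and of the data) would be one crude, empirical way to start on the how-much-stabilization and how-much-bias questions.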