# Regression Estimation Under Resource Constraints: Fisher, Bartlett & Kalman

## Vivak Patel

Department of Statistics, University of Chicago
January 15, 2016

# Acknowledgments

## Mihai Anitescu

• Senior Computational Mathematician in LANS at Argonne National Laboratory
• Part-time Professor at University of Chicago

# Stationary Parameters

Problem.

• Observed (Independent) Pairs $(Y_t,X_t) \in \mathbb{R} \times \mathbb{R}^p$ where $t=1,2,\ldots,N$
• Criterion Function: $c:\mathbb{R}\times\mathbb{R} \to \mathbb{R}$
• Mean Function: $\mu: \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$
• Find $\hat{\beta}_N \in \mathbb{R}^p$ such that $$\hat{\beta}_N \in \arg\min \left\lbrace \frac{1}{N} \sum_{t=1}^N c(Y_t,\mu(\beta,X_t)) : \beta \in \mathbb{R}^p \right\rbrace$$

Example Problem. Linear Regression $$\frac{1}{N} \sum_{t=1}^N c(Y_t, \mu(\beta,X_t)) = \frac{1}{N} \sum_{t=1}^N(Y_t - \mu(\beta,X_t))^2 = \frac{1}{N} \sum_{t=1}^N(Y_t - \beta^T X_t)^2$$
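For the linear-regression instance above, the empirical minimizer $\hat{\beta}_N$ has a closed form via the normal equations. A minimal numerical sketch (the simulated data, dimensions, and seed are arbitrary illustrative choices):

```python
import numpy as np

# Simulate (Y_t, X_t) pairs for the linear model Y_t = beta^T X_t + noise.
rng = np.random.default_rng(0)
N, p = 1000, 3
X = rng.normal(size=(N, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=N)

# The minimizer of (1/N) sum_t (Y_t - beta^T X_t)^2 solves the
# normal equations (X^T X) beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
```

With this much data and little noise, `beta_hat` recovers `beta_true` closely.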

# Stationary Parameters

Examples of Statistical Criterion Functions

• Negative log-likelihood (Fisher 1925, Bartlett 1953, Le Cam 1970, Hájek 1972)
• $$c(Y_t, \mu) = - \log \mathbb{P}(Y_t \vert \mu)$$
• Quasi Likelihood (Wedderburn 1974)
• $$\nabla_{\beta} c(Y_t, \mu) = \frac{Y_t - \mu(\beta,X_t)}{V(\mu(\beta,X_t))} \nabla_\beta \mu(\beta,X_t)$$
• Logistic Regression with Linear Predictor, Generalized Linear Model (Nelder & Wedderburn 1972)
• $$c(Y_t, \mu(\beta, X_t)) = -Y_t X_t^T \beta + \log( \exp(X_t^T \beta) + 1)$$
• "Nonparametric" Sieve Likelihood (Wong & Severini 1991)
• Wavelet Regression
• Bézier (Polynomial) Spline Regression
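As a sketch, the logistic GLM criterion and its gradient can be coded directly from the formula above (function names are my own; `np.logaddexp` is used for numerical stability of $\log(\exp(\eta)+1)$):

```python
import numpy as np

def logistic_criterion(y, x, beta):
    """c(Y_t, mu(beta, X_t)) = -Y_t * X_t^T beta + log(exp(X_t^T beta) + 1)."""
    eta = x @ beta                      # linear predictor X_t^T beta
    return -y * eta + np.logaddexp(0.0, eta)

def logistic_gradient(y, x, beta):
    """Gradient in beta: (sigmoid(X_t^T beta) - Y_t) * X_t."""
    mu = 1.0 / (1.0 + np.exp(-(x @ beta)))
    return (mu - y) * x
```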

# Problem Class

First Bartlett Identity $$\exists \beta^* ~\text{s.t.}~ \mathbb{E}\left[ \left. \nabla c^* \right\vert \sigma(X) \right] = 0$$

Second Bartlett Identity $$\exists \beta^* ~\text{s.t.}~ \mathbb{E}\left[ \left. \nabla^2 c^* \right\vert \sigma(X)\right] = \mathbb{E}\left[\left. \nabla c^* \nabla^T c^* \right\vert \sigma(X)\right]$$

Fisher Criteria $$\exists V: \mathbb{R} \to \mathbb{R}_{> 0} ~\text{s.t.}~ \mathbb{E}\left[ \left. \nabla c^* \nabla^T c^* \right\vert \sigma(X)\right] = \frac{1}{V(\mu(\beta^*,X))}\nabla \mu(\beta^*,X) \nabla^T \mu(\beta^*,X)$$

# Problem Class: Example

Non-Linear Logistic Regression
Suppose $Y_t \in \lbrace 0,1\rbrace$ and $\mu: \mathbb{R}^p \times \mathbb{R}^p \to [0,1]$. $$\mathbb{P}\left[ Y_t \vert X_t\right] = \left(\mu(\beta,X_t)\right)^{Y_t}\left( 1 - \mu(\beta,X_t)\right)^{1-Y_t}$$ Taking the negative logarithm: $$c_t = -Y_t \log\left[\frac{\mu(\beta,X_t)}{1 - \mu(\beta,X_t)} \right] - \log\left[1-\mu(\beta,X_t)\right]$$ Taking the derivative with respect to $\beta$: $$\nabla c_t = \frac{\partial c_t}{\partial \mu } \nabla \mu = - \frac{Y_t - \mu}{\mu(1-\mu)} \nabla \mu$$ Notice that the first Bartlett Identity is satisfied since $\mathbb{E}\left[ \left. \frac{\partial c_t}{\partial \mu} \right\vert \sigma(X_t)\right] = 0$.

# Problem Class: Example

Non-Linear Logistic Regression
Now taking the second derivative with respect to $\beta$: \begin{aligned} \nabla^2 c_t &= \left(\nabla \frac{\partial c_t}{\partial \mu}\right) \nabla^T \mu + \frac{\partial c_t}{\partial \mu} \nabla^2 \mu \\ &= \frac{\partial^2 c_t}{\partial \mu^2} \nabla \mu \nabla^T \mu + \frac{\partial c_t}{\partial \mu} \nabla^2 \mu \\ &= \left(\frac{1}{\mu(1-\mu)} + \frac{(Y_t - \mu)(1-2\mu)}{\mu^2(1-\mu)^2} \right)\nabla \mu \nabla^T \mu + \frac{\partial c_t}{\partial \mu} \nabla^2 \mu \end{aligned} Taking the conditional expectation, the Fisher Criteria is satisfied with $V(\mu) = \mu(1-\mu)$: $$\mathbb{E}\left[ \nabla^2 c_t \vert \sigma(X_t) \right] = \frac{1}{\mu(1-\mu)}\nabla \mu \nabla^T \mu$$ Moreover, since $\mathrm{Var}(Y_t \vert X_t) = \mu (1-\mu)$, the second Bartlett Identity is satisfied: $\mathbb{E}\left[ \left. \nabla c_t \nabla^T c_t \right\vert \sigma(X_t) \right] = \mathbb{E}\left[ \left. \nabla^2 c_t \right\vert \sigma(X_t) \right]$.

# Problem Class: More Examples

Problem Class Criterion Functions

• Linear and Non-linear Least Squares
• Generalized Linear Models and Quasi-Likelihood Models
• Sieve Regression
• Wavelet Regression
• H-likelihood Models
• Partial Likelihood Models (does not satisfy Fisher Criteria)

# Optimization

Classical Optimization Techniques

• Gauss-Newton Method (see Nocedal & Wright 2006, Ch. 10)
• Fisher's Scoring Method (Fisher 1925)
• Levenberg-Marquardt (Levenberg 1944, Marquardt 1963)
• Iteratively Reweighted Least Squares (Nelder & Wedderburn 1972, Wedderburn 1974)

Classical Performance Metrics

• Rate of Convergence (Sublinear, Linear, Superlinear, Quadratic)
• Objective, Gradient and Hessian Evaluations
• Floating Point Operations, Memory Requirements
• Well-defined Stop Conditions
• Sensitivity to Conditioning

# Optimization

Classical Metrics for Classical Methods

| Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Newton | $N$ | $N$ | $\mathcal{O}(p^2 N)$ | $\mathcal{O}(p^2)$ | Quadratic | Deterministic | Mild |
| Quasi-Newton | $N$ | $0$ | $\mathcal{O}(p^2 + nN)$ | $\mathcal{O}(p^2)$ | Super-Linear | Deterministic | Mild |
| IRLS | $N$ | $0$ | $\mathcal{O}(p^2 N)$ | $\mathcal{O}(p^2)$ | Super-Linear (?) | Deterministic | Mild |
| Gradient Descent | $N$ | $0$ | $\mathcal{O}(pN)$ | $\mathcal{O}(p)$ | Linear | Deterministic | Moderate |

Question 1. For $p$ and $N$ large, do these metrics provide enough information?

Question 2. If not, what metrics should we add or reconsider to make better algorithmic choices?

# Optimization

Suggested Additional Metrics under Resource Constraints

• p-Feasibility. As $p$ increases, what are communication costs?
• N-Feasibility. As $N$ increases, what are communication costs?
• Update Efficiency. How many data points must be assimilated before the parameter can be incremented?

| Method | p-Feasibility | N-Feasibility | Efficiency |
| --- | --- | --- | --- |
| Newton | Prohibitive | Prohibitive | $N$ |
| Quasi-Newton | Prohibitive | Moderate | $N$ |
| L-Quasi-Newton | Moderate | Moderate | $N$ |
| IRLS | Prohibitive | Prohibitive | $N$ |
| Gradient Descent | Mild | Mild | $N$ |

# Stochastic Gradient Descent

Basic Idea: Minimize $\frac{1}{N}\sum_{t=1}^N c(Y_t,\mu(\beta,X_t))$ using: $$\beta_{k+1}^{SGD} = \beta_k^{SGD} - \alpha_k \nabla_\beta c(Y_t, \mu(\beta_k^{SGD}, X_t))$$
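A minimal sketch of the SGD recursion above on a simulated linear model; the step-size schedule $\alpha_k \propto (k+1)^{-0.6}$ is an arbitrary Robbins-Monro-style choice, not one prescribed in the talk:

```python
import numpy as np

# Simulated linear model: one gradient evaluation per update.
rng = np.random.default_rng(1)
N, p = 5000, 2
X = rng.normal(size=(N, p))
beta_star = np.array([2.0, -1.0])
Y = X @ beta_star + rng.normal(scale=0.1, size=N)

beta = np.zeros(p)
for k in range(N):
    # Gradient of 0.5 * (Y_k - X_k^T beta)^2 at the current iterate.
    g = -(Y[k] - X[k] @ beta) * X[k]
    # Decaying step sizes: alpha_k -> 0 while sum_k alpha_k diverges.
    beta -= (0.1 / (k + 1) ** 0.6) * g
```

One pass over the data (efficiency $1$ in the tables below) already brings `beta` close to `beta_star`.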

History

• Least Mean Squares (Cotes 1700s, Legendre 1805, Gauss 1806, Macchi & Eweda 1983, Gerencsér 1995)
• Stochastic Gradient Descent (Robbins & Monro 1951, Neveu 1975, Murata 1998)

Simplifying Notation $$c_{t} = c(Y_t, \mu(\beta, X_t))$$ $$c_{t,k}^{SGD} = c(Y_t, \mu(\beta_k^{SGD},X_t))$$ $$\nabla c_{t,k}^{SGD} = \nabla_\beta c(Y_t, \mu(\beta_{k}^{SGD},X_t))$$

Classical Metrics

| Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Newton | $N$ | $N$ | $\mathcal{O}(p^2 N)$ | $\mathcal{O}(p^2)$ | Quadratic | Deterministic | Mild |
| Quasi-Newton | $N$ | $0$ | $\mathcal{O}(p^2 + nN)$ | $\mathcal{O}(p^2)$ | Super-Linear | Deterministic | Mild |
| L-Quasi-Newton | $N$ | $0$ | $\mathcal{O}(pm + nN)$ | $\mathcal{O}(pm)$ | Super-Linear (?) | Deterministic | Mild |
| Gradient Descent | $N$ | $0$ | $\mathcal{O}(pN)$ | $\mathcal{O}(p)$ | Linear | Deterministic | Moderate |
| SGD | $1$ | $0$ | $\mathcal{O}(p)$ | $\mathcal{O}(p)$ | Sub-Linear | None | Severe |

| Method | p-Feasibility | N-Feasibility | Efficiency |
| --- | --- | --- | --- |
| Newton | Prohibitive | Prohibitive | $N$ |
| Quasi-Newton | Prohibitive | Moderate | $N$ |
| L-Quasi-Newton | Moderate | Moderate | $N$ |
| IRLS | Prohibitive | Prohibitive | $N$ |
| Gradient Descent | Mild | Mild | $N$ |
| SGD | Mild | Very Mild | $1$ |

Incremental Optimization Methods

• Goal is to calculate $\hat{\beta}_N$ within some error.
• Treat current available data as the population
• SGD, Amari 2000, Schraudolph 2007, Roux 2012, Richtárik 2013, Sohl-Dickstein 2013, Zhang 2014

Incremental Estimator Methods

• Goal is to find a computationally inexpensive estimator of the minimizer of $\mathbb{E}\left[c(Y,\mu(\beta,X))\right]$
• Treat currently available data as a sample
• SGD, Patel 2015

The remainder of this talk will discuss going from SGD to a generalization of Patel 2015.

# Proximal Operator

Recall $$\beta_{k+1}^{SGD} = \beta_{k}^{SGD} - \alpha_k \nabla c_{t,k}^{SGD}$$

Proximal Form of SGD $$\beta_{k+1}^{SGD} = \arg\min \left\lbrace c_{t,k}^{SGD} + (\beta-\beta_k^{SGD})^T \nabla c_{t,k}^{SGD} + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k^{SGD} \right\Vert^2: \beta \in \mathbb{R}^p \right\rbrace$$
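Setting the gradient of the proximal objective to zero recovers the SGD recursion exactly, which is why this is an equivalent form rather than an approximation:

```latex
% First-order optimality condition of the proximal form:
0 = \nabla c_{t,k}^{SGD} + \frac{1}{\alpha_k}\left(\beta_{k+1}^{SGD} - \beta_k^{SGD}\right)
\quad \Longrightarrow \quad
\beta_{k+1}^{SGD} = \beta_k^{SGD} - \alpha_k \nabla c_{t,k}^{SGD}.
```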

Non-approximative Form $$\beta_{k+1} \in \arg\min \left\lbrace c_t + \frac{1}{2\alpha_k}\left\Vert \beta - \beta_{k} \right\Vert^2: \beta \in \mathbb{R}^p \right\rbrace$$

# Proximal Operator

Reweighted Form

Let $\mathcal{M}_k = \mathbb{E}\left[ \left. (\beta_k - \beta^*)(\beta_k - \beta^*)^T \right\vert \sigma(X_1,\ldots,X_k) \right]$. Consider the update given by $$\beta_{k+1} \in \arg\min \left\lbrace c_t + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k \right\Vert_{\mathcal{M}_k^{-1}}^2: \beta \in \mathbb{R}^p \right\rbrace$$ Here, as $\beta_k$ becomes more accurate, the observation $(Y_t,X_t)$ will become less relevant.

Example: Simple Linear Regression

We have that $2c_t = \left\Vert Y_t - X_t^T \beta \right\Vert^2$ and $V(\mu) = \sigma^2 > 0$. $$\beta_{k+1} \in \arg\min \left\lbrace \frac{1}{2\sigma^2} \left\Vert Y_t - X_t^T \beta \right\Vert^2 + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k \right\Vert_{\mathcal{M}_k^{-1}}^2: \beta \in \mathbb{R}^p \right\rbrace$$

# Proximal Operators

Problems with Reweighted Form

1. $\mathcal{M}_k$ is certainly unknown.
2. The form of $c_t$ may make calculating $\beta_{k+1}$ expensive compared to communication costs.

Estimating $\mathcal{M}_k$ (Kalman 1960)

• Considers the case where $Y_t$ is modeled by non-linear least squares.
• Estimates $\mathcal{M}_k$ with $M_k$: $$M_{k+1} = \left(I - \frac{M_{k}\nabla \mu_{k} \nabla^T \mu_{k}}{\sigma_{k+1}^2 + \nabla^T \mu_{k} M_{k}\nabla \mu_k} \right) M_k$$
• Kalman was originally motivated by missile guidance and the Wiener-Hopf problem.

# Kalman Filter & Theory

Kalman Filter $$\beta_{k+1} = \arg\min \left\lbrace \frac{1}{2\sigma_{k+1}^2} \left\Vert Y_t - \mu_{k} - \nabla^T \mu_{k}(\beta-\beta_k) \right\Vert^2 + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k \right\Vert_{M_k^{-1}}^2: \beta \in \mathbb{R}^p \right\rbrace$$ $$M_{k+1} = \left(I - \frac{M_{k}\nabla \mu_{k} \nabla^T \mu_{k}}{\sigma_{k+1}^2 + \nabla^T \mu_{k} M_{k}\nabla \mu_k} \right) M_k$$
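A sketch of the two recursions above for the linear model $\mu = X^T\beta$ (so $\nabla \mu_k = X_k$), taking $\alpha_k = 1$ and a constant tuning parameter $\sigma_{k+1}^2$; these simplifications and the simulated data are assumptions for illustration:

```python
import numpy as np

# Simulated linear model Y_t = X_t^T beta* + noise.
rng = np.random.default_rng(2)
N, p = 4000, 3
X = rng.normal(size=(N, p))
beta_star = np.array([1.0, 0.5, -1.5])
Y = X @ beta_star + rng.normal(scale=0.2, size=N)

beta = np.zeros(p)
M = np.eye(p)        # M_0: any symmetric positive definite matrix
sigma2 = 0.2 ** 2    # constant tuning parameter (here the true noise variance)

for k in range(N):
    g = X[k]                              # gradient of mu at beta_k (linear model)
    denom = sigma2 + g @ M @ g
    K = (M @ g) / denom                   # Kalman gain
    beta = beta + K * (Y[k] - X[k] @ beta)  # minimizer of the proximal objective
    M = M - np.outer(K, g @ M)            # M_{k+1} = (I - M g g^T / denom) M
```

For this linear case the recursion coincides with recursive least squares, so `beta` tracks the batch estimate closely after a single pass.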

Bayesian "Convergence" Result. If $M_0$ and $\sigma_k^2$ are correct, then $M_k$ will be correct and $\beta_k \to \beta^*$.

Problem. $M_0$ and $\sigma_k^2$ will never be known in real problems.

# Kalman Filter & Theory

Control Theoretic Assumptions.

• $\exists \beta^* \in \mathbb{R}^p$ such that $Y_t = X_t^T \beta^*$.
• There are $0 < \alpha \leq \beta$ such that $\alpha I \prec M_k \prec \beta I$.
• $\sigma_k^2$ are replaced with a tuning parameter $\lambda \in (0,1)$

Control Theory Results.

• For arbitrary $\beta_0$, $\beta_k \to \beta^*$ exponentially.
• Johnstone et al. 1982, Bittanti et al. 1990, Parkum et al. 1992, Cao et al. 2003.

Problems.

• Focus was on dynamic parameter estimation.
• Real applications have noise.

# Kalman Filter & Theory

(Our) Probabilistic Assumptions.

• $\exists \beta^* \in \mathbb{R}^p$ such that $Y_t = X_t^T \beta^* + \epsilon_t$.
• $\epsilon_t$ are independent (of all other variables), mean zero, with identical, finite variance.
• $X_t$ are independent and identically distributed.
• $X_1$ has a finite second moment, and there is no collinearity.
• Replace $\sigma_{k+1}^2$ with bounded tuning parameters $\gamma_k^2$.
• Let $M_0$ be any positive definite symmetric matrix.

Probabilistic Results.

• $\mathbb{E} \left[ \left. \left\Vert \beta_{k} - \beta^*\right\Vert \right\vert \sigma(X_1,\ldots,X_k) \right] \to 0$ almost surely.
• Asymptotically almost surely, $\frac{1-\epsilon}{\max_k \gamma_k^2}M_k \prec \frac{1}{\sigma^2}\mathcal{M}_k \prec \frac{1+\epsilon}{\min_k \gamma_{k}^2} M_k$.
• Suppose $X_1$ is bounded. Then $\exists K$ such that for $k \geq K$ $$\mathbb{E}[c(\beta_k) \vert \sigma(X_1,\ldots,X_k)] - \mathbb{E}[c(\beta^*)] = \mathcal{O}\left( \frac{\sigma^2 p}{k} \right) \quad \mathbb{P} = 1 - o\left( k^{-2}\right)$$

# Kalman Filter & Experiment

Data Set

• Source: Public Use File from the Centers for Medicare & Medicaid Services.
• Described health care expenses, type of visit, patient demographics.
• Contained $N=2.8$ million records (too big to fit in my computer's 8GB memory).

Linear Model

• Response: Cost of visit.
• Predictors: Patient's gender, Type of Facility, Patient's age.
• An intercept term was included, giving $p=31$ unknown parameters.

Experiment

• Compared against SGD; three different types of learning rates were selected, each tuned to perform best in its class.
• The Kalman filter's tuning parameters (three different types) were selected arbitrarily.

# Kalman Filter & Experiment

Convergence of Parameter Estimate.

• Mean Residuals Squared is our measure of convergence.
• The Gentleman-Miller algorithm (i.e., incremental QR) was used to compute the smallest achievable mean residuals squared.

# Kalman Filter & Experiment

Convergence of Estimated Covariance.

• Recall: $\frac{1-\epsilon}{\max_k \gamma_k^2}M_k \prec \frac{1}{\sigma^2}\mathcal{M}_k \prec \frac{1+\epsilon}{\min_k \gamma_{k}^2} M_k$.
• Estimated covariance tracks well with mean residuals squared.

# Kalman Filter & Metrics

| Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Newton | $N$ | $N$ | $\mathcal{O}(p^2 N)$ | $\mathcal{O}(p^2)$ | Quadratic | Deterministic | Mild |
| Quasi-Newton | $N$ | $0$ | $\mathcal{O}(p^2 + nN)$ | $\mathcal{O}(p^2)$ | Super-Linear | Deterministic | Mild |
| L-Quasi-Newton | $N$ | $0$ | $\mathcal{O}(pm + nN)$ | $\mathcal{O}(pm)$ | Super-Linear (?) | Deterministic | Mild |
| Gradient Descent | $N$ | $0$ | $\mathcal{O}(pN)$ | $\mathcal{O}(p)$ | Linear | Deterministic | Moderate |
| Incremental Optimization | $N+m$ | $0$ | $\mathcal{O}(p(N+m))$ | $\mathcal{O}(p)$ | Linear | Probabilistic | Moderate |

| Method | Gradient Evals | Hessian Evals | Floating Point | Memory Cost | RoC | Stop Condition | Conditioning |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Kalman Filter | $1$ | $0$ | $\mathcal{O}(p^2)$ | $\mathcal{O}(p^2)$ | Sub-Linear | Almost Sure | Mild |
| SGD | $1$ | $0$ | $\mathcal{O}(p)$ | $\mathcal{O}(p)$ | Sub-Linear | None | Severe |

# Kalman Filter & Metrics

| Method | p-Feasibility | N-Feasibility | Efficiency |
| --- | --- | --- | --- |
| Newton | Prohibitive | Prohibitive | $N$ |
| Quasi-Newton | Prohibitive | Moderate | $N$ |
| L-Quasi-Newton | Moderate | Moderate | $N$ |
| IRLS | Prohibitive | Prohibitive | $N$ |
| Gradient Descent | Mild | Mild | $N$ |
| Incremental Optimization | Varies | Varies | $(N+m)/2$ |

| Method | p-Feasibility | N-Feasibility | Efficiency |
| --- | --- | --- | --- |
| Kalman Filter | Prohibitive | Very Mild | $1$ |
| SGD | Mild | Very Mild | $1$ |

# Kalman-based SGD

Recall: Problems with Reweighted Form

1. True covariance is unknown. Estimate with $M_k$.
2. The form of $c_t$ may make calculating $\beta_{k+1}$ expensive compared to communication costs.

Example: El Niño
We want to model atmospheric pressure $Y$ using the month of the year $X$. \begin{aligned} Y &= \beta_1 + \beta_2\cos\left(\frac{\pi X}{6}\right) + \beta_3\sin\left(\frac{\pi X}{6}\right) \\ &+ \beta_5\cos( 2\pi X\beta_4) + \beta_6\sin( 2\pi X\beta_4) + \beta_8\cos( 2\pi X\beta_7 ) + \beta_9\sin( 2\pi X\beta_7 )+\epsilon \end{aligned} Each update to find $$\beta_{k+1} \in \arg\min \left\lbrace c_t + \frac{1}{2\alpha_k} \left\Vert \beta - \beta_k \right\Vert_{M_k^{-1}}^2: \beta \in \mathbb{R}^9 \right\rbrace$$ would be too expensive to carry out on a large data set.

# Kalman-based SGD

Approximation

Replacing $c_t$ with a linear approximation: $$\arg\min \left\lbrace c_{t,k} + \nabla^T c_{t,k}(\beta - \beta_k) + \frac{1}{2\alpha_k}\left\Vert \beta - \beta_{k} \right\Vert_{M_k^{-1}}^2 \right\rbrace$$

Replacing $c_t$ with a quadratic approximation: $$\arg\min \left\lbrace c_{t,k} + \nabla^T c_{t,k}(\beta - \beta_k) + \frac{1}{2}(\beta - \beta_k)^T \nabla^2 c_{t,k} (\beta - \beta_{k}) + \frac{1}{2\alpha_k}\left\Vert \beta - \beta_{k} \right\Vert_{M_k^{-1}}^2 \right\rbrace$$

Replacing $c_t$ with a conditional expectation: $$\arg\min \left\lbrace c_{t,k} + \nabla^T c_{t,k}(\beta - \beta_k) + \frac{1}{2V(\mu)}\left[(\beta-\beta_k)^T \nabla \mu_k\right]^2 + \frac{1}{2\alpha_k}\left\Vert \beta - \beta_{k} \right\Vert_{M_k^{-1}}^2 \right\rbrace$$
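For instance, the linear approximation admits a closed-form minimizer (a direct first-order-condition computation, included here for clarity rather than stated on the slide), which is the Kalman-weighted analogue of the SGD step:

```latex
0 = \nabla c_{t,k} + \frac{1}{\alpha_k} M_k^{-1}\left(\beta_{k+1} - \beta_k\right)
\quad \Longrightarrow \quad
\beta_{k+1} = \beta_k - \alpha_k M_k \nabla c_{t,k}.
```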

# Thank You

Ongoing Projects

• Current: theory and examples for the general problem class.
• Straightforward: parallelization over the data set.
• Theoretically difficult: a low-memory version.

Slides
This slide deck and its iterations can be found at galton.uchicago.edu/~vpatel.

Reference

Patel, Vivak. "Kalman-based Stochastic Gradient Method with Stop Condition and Insensitivity to Conditioning." arXiv preprint arXiv:1512.01139 (2015).
