Department of Statistics, University of Chicago

April 21, 2017

II. An HMM Formulation (not novel)

III. Some Results for Linear Regression (novel)

IV. Ongoing Work (novel)

Let $X,X_1,X_2,\ldots \in \mathbb{R}^{d}$ be independent, identically distributed, $L^2$ random variables such that $$ 0 \prec \mathbb{E}\left[XX'\right], $$ i.e., $\mathbb{E}[XX']$ is positive definite.

Let $\epsilon,\epsilon_1,\epsilon_2,\ldots$ be independent, identically distributed, mean zero, finite variance random variables.

Let $\beta^* \in \mathbb{R}^d$, and define $$ Y = X'\beta^* + \epsilon \text{ and } Y_k = X_k'\beta^* + \epsilon_k$$

**Problem:** Given $\lbrace (Y_k,X_k):k=1,\ldots,N \rbrace$, find an estimator for $\beta^*$ whose computation remains tractable as $N \to \infty$.


**Solution 1:** At each $N$ compute
$$\beta_N = \arg \min_\beta \Vert Y - X\beta \Vert_2^2,$$
where $Y = (Y_1,\ldots,Y_N)'$ and $X$ is the matrix with rows $X_1',\ldots,X_N'$.
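As a minimal sketch on simulated data (the dimensions, sample size, and $\beta^*$ below are illustrative, not from the talk), Solution 1 is a standard least-squares computation:

```python
# Batch least squares on simulated data (d = 3, N = 200 are illustrative).
import numpy as np

rng = np.random.default_rng(0)
d, N = 3, 200
beta_star = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(N, d))             # row k is X_k'
Y = X @ beta_star + rng.normal(size=N)  # Y_k = X_k' beta* + eps_k

# beta_N = argmin_beta ||Y - X beta||_2^2
beta_N, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Note that the fit must be recomputed from all $N$ observations whenever a new sample arrives, which is the scaling issue the incremental solution addresses.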

**Solution 2:** At each $N$ compute
$$\beta_N = \beta_{N-1} + \alpha_{N-1}X_{N}(Y_N - X_{N}'\beta_{N-1}).$$
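Solution 2 can be sketched as a streaming update; the step-size choice $\alpha_{k-1} = 1/k$ is one illustrative Robbins–Monro schedule, not a prescription from the talk:

```python
# Streaming (stochastic-gradient) update for linear regression:
# beta_N = beta_{N-1} + alpha_{N-1} X_N (Y_N - X_N' beta_{N-1}).
import numpy as np

rng = np.random.default_rng(1)
beta_star = np.array([1.0, -2.0, 0.5])
beta = np.zeros(3)                      # beta_0

for k in range(1, 5001):
    X_k = rng.normal(size=3)
    Y_k = X_k @ beta_star + rng.normal()
    beta = beta + (1.0 / k) * X_k * (Y_k - X_k @ beta)
```

Each step touches only the latest sample, so the cost per observation is $O(d)$ regardless of $N$.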

**Problem:** Given $\lbrace (Y_k,X_k):k=1,\ldots,N \rbrace$ with binary responses $Y_k \in \lbrace 0,1 \rbrace$, find an estimator for $\beta^*$ whose computation remains tractable as $N \to \infty$.

**Solution 1:** At each $N$ compute
$$\beta_N = \arg \min_\beta \sum_{k=1}^N \log(1 + \exp(X_k'\beta)) - Y_k X_k'\beta.$$

**Solution 2:** At each $N$ compute
$$\beta_N = \beta_{N-1} + \alpha_{N-1}X_N \left( Y_N - \frac{1}{1 + \exp(-X_N'\beta_{N-1})} \right).$$
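The logistic-regression analogue can be sketched the same way; the step sizes $\alpha_{k-1} = 4/k$ and the data-generating model below are illustrative assumptions:

```python
# Streaming update for logistic regression:
# beta_N = beta_{N-1} + alpha_{N-1} X_N (Y_N - sigmoid(X_N' beta_{N-1})).
import numpy as np

rng = np.random.default_rng(2)
beta_star = np.array([1.0, -1.0])
beta = np.zeros(2)

for k in range(1, 20001):
    X_k = rng.normal(size=2)
    p_k = 1.0 / (1.0 + np.exp(-X_k @ beta_star))
    Y_k = float(rng.random() < p_k)     # binary response
    pred = 1.0 / (1.0 + np.exp(-X_k @ beta))
    beta = beta + (4.0 / k) * X_k * (Y_k - pred)
```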

If we view the estimation problem as a filtering problem, then we can use a statistical filter (Kalman, Extended Kalman, Unscented, Particle, etc.) to estimate $\beta^*$.

**Kalman Filter for Linear Regression**
$$ \beta_N = \beta_{N-1} + M_{N-1}X_{N}\frac{ Y_N - X_{N}'\beta_{N-1}}{\gamma + X_{N}'M_{N-1}X_{N}}$$
where
$$ M_{N} = \left(I - \frac{M_{N-1}X_{N}X_{N}'}{\gamma + X_{N}'M_{N-1}X_N}\right)M_{N-1}.$$
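The two recursions above can be sketched together; the choices $\gamma = 1$, $M_0 = I$, and the simulated data are illustrative:

```python
# Kalman-filter (recursive least-squares) recursions for linear regression.
import numpy as np

rng = np.random.default_rng(3)
beta_star = np.array([1.0, -2.0, 0.5])
beta = np.zeros(3)
M = np.eye(3)                           # M_0 (illustrative)
gamma = 1.0

for _ in range(2000):
    X_k = rng.normal(size=3)
    Y_k = X_k @ beta_star + rng.normal()
    MX = M @ X_k                        # M_{N-1} X_N
    denom = gamma + X_k @ MX
    beta = beta + MX * (Y_k - X_k @ beta) / denom
    M = M - np.outer(MX, MX) / denom    # equals (I - M X X'/denom) M, by symmetry of M
```

No matrix inversion is needed; each update costs $O(d^2)$.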

- The parameter estimator is consistent.
- The covariance estimator is consistent up to a multiplicative factor.
- There is a robust set of choices for the parameter $\gamma$.
- The method achieves the (information-theoretic) "lower bound" convergence rate.

- Asymptotic distribution results
- A framework for extending the Kalman Filter to other estimation problems

## Patel, Vivak. "Kalman-Based Stochastic Gradient Method with Stop Condition and Insensitivity to Conditioning." SIAM Journal on Optimization 26, no. 4 (2016): 2620-2648.

II. Consistency of the Kalman Filter for GLMs

III. A Kalman Filter for Models with Convex Constraints

IV. On Why SGD Fails in Practice: Divergence, Stalling, Conditioning

V. A Survey of Computationally Tractable Incremental Estimation

II. An HMM Formulation (novel)

III. Some Numerical Experiments

**Problem:** Consider the following optimization problem
$$ \min_\theta \sum_{i=1}^N f_i(\theta),$$
where $N$ is very large. For such problems, it is often infeasible to evaluate all $N$ component functions at each iteration owing to memory and communication limitations, which rules out standard optimization methods such as gradient descent and quasi-Newton.

**Solution 1:** Subsample the objective function and use an incremental estimator.

**Solution 2:** Subsample the objective function and use a stochastic method designed for a discrete, finite probability space.
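The subsampling idea common to both solutions can be sketched as minibatch stochastic gradient descent; the quadratic components $f_i(\theta) = (a_i'\theta - b_i)^2$, the batch size, and the step sizes are all illustrative assumptions:

```python
# Subsample the N component functions: draw a random minibatch per iteration.
import numpy as np

rng = np.random.default_rng(4)
N, d, batch = 100_000, 3, 32
theta_star = np.array([1.0, -2.0, 0.5])
A = rng.normal(size=(N, d))
b = A @ theta_star + rng.normal(size=N)

theta = np.zeros(d)
for k in range(1, 3001):
    idx = rng.integers(0, N, size=batch)        # subsampled component indices
    resid = A[idx] @ theta - b[idx]
    grad = 2.0 * A[idx].T @ resid / batch       # minibatch gradient estimate
    theta = theta - (0.5 / k) * grad
```

Each iteration touches only `batch` of the $N$ components, sidestepping the memory and communication bottleneck.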

**Problem:** Consider the problem of optimizing an objective function $f(\theta)$, but any attempt to evaluate $f(\theta)$ only returns
$$ f(\theta) + \epsilon,$$
where $\epsilon$ is a random variable which may be strongly correlated with $f(\theta)$. Moreover, the gradient of $f(\theta)$ cannot be evaluated, even noisily.

**Solution:** ...

**Problem:** Let $T$ and $h$ be matrix-valued and vector-valued random variables. Consider
$$\begin{aligned}
\min_{\theta} ~& c'\theta + \mathbb{E}\left[\min_{y} q'y \right] \\
\text{subject to} ~& A\theta = b \\
& T\theta + Wy = h \\
& \theta \in \Theta \\
& y \in Y
\end{aligned}$$

**Solution:** Generate samples of $T,h$ and replace the expectation with an empirical estimate.
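With independent samples $(T_1,h_1),\ldots,(T_S,h_S)$ of $(T,h)$, the empirical replacement is the sample-average approximation (the sample count $S$ is a parameter of this sketch):
$$\begin{aligned}
\min_{\theta} ~& c'\theta + \frac{1}{S}\sum_{s=1}^{S} \min_{y_s}\left\lbrace q'y_s : T_s\theta + Wy_s = h_s,\; y_s \in Y \right\rbrace \\
\text{subject to} ~& A\theta = b, \quad \theta \in \Theta,
\end{aligned}$$
which is a deterministic (if large) optimization problem.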

Consider the following generic problem $$\begin{aligned} \min_{\theta}~& \mathbb{E}\left[ f(\theta,\xi) \right] \end{aligned}$$ where $\xi$ is a random variable which we can independently sample from at each iteration to produce $\xi_1,\xi_2,\ldots$.

Suppose also that $$ \nabla_\theta \mathbb{E}\left[ f(\theta,\xi) \right] = \mathbb{E}\left[ \nabla_\theta f(\theta,\xi) \right]$$

Finally, suppose that for given values of $\theta$ and $\xi$, $f(\theta,\xi)$ and $\nabla_\theta f(\theta,\xi)$ can be evaluated.
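Under these assumptions, the basic stochastic gradient recursion can be sketched on a toy objective; $f(\theta,\xi) = \tfrac{1}{2}\Vert\theta - \xi\Vert_2^2$ with $\xi \sim N(\mu, I)$ is an illustrative choice whose minimizer is $\mu = \mathbb{E}[\xi]$:

```python
# Generic stochastic gradient recursion for min_theta E[f(theta, xi)].
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([2.0, -1.0])
theta = np.zeros(2)

for k in range(1, 10001):
    xi = rng.normal(size=2) + mu        # sample xi_k
    grad = theta - xi                   # grad_theta f(theta, xi_k)
    theta = theta - (1.0 / k) * grad
```

With step sizes $1/k$, this particular recursion reduces exactly to the running sample mean of $\xi_1,\ldots,\xi_k$.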

## Note: We can use a model based approach if $\nabla_\theta f(\theta,\xi)$ cannot be evaluated. Other modifications can also be made if $f$ can be decomposed into a non-random component and random components.

**Hidden States:** $\mathbb{E}[f]$ and $\mathbb{E}[\nabla f]$. **Observed States:** $f(\cdot,\xi_k)$ and $\nabla f(\cdot, \xi_k)$.

**Transition Dynamics:**
$$ \mathbb{E}[f(\theta,\xi)] = \mathbb{E}[f(\theta_k,\xi)] + \nabla_\theta \mathbb{E}[f(\theta_k,\xi)]'(\theta - \theta_k) + \epsilon \Vert \theta - \theta_k \Vert_2^2,$$
where $\epsilon$ is mean zero and has some finite variance.

**Likelihood Model:**
$$ f(\theta,\xi_k) = \mathbb{E}[f(\theta,\xi)] + \eta,$$
where $\eta$ is mean zero and has some finite variance.

## Note: It is unlikely that $\eta$ is normally distributed, but we can average multiple independent samples and appeal to the central limit theorem to justify a normal approximation.
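The averaging argument can be checked numerically: the variance of an $m$-sample average shrinks by a factor of $m$, and the CLT makes the averaged noise approximately normal. The function and noise model below are illustrative:

```python
# Variance reduction from averaging m independent noisy evaluations.
import numpy as np

rng = np.random.default_rng(6)

def noisy_f(theta, m):
    # mean of m independent evaluations of f(theta) + noise (noise var 1)
    return (theta**2 + rng.normal(scale=1.0, size=m)).mean()

theta = 1.5
single = np.var([noisy_f(theta, 1) for _ in range(5000)])   # ~ 1
avged = np.var([noisy_f(theta, 25) for _ in range(5000)])   # ~ 1/25
```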

Thank You
