5. Marginalization in optimization problems

Optimization problems

A general optimization problem can be described as

\[\min_{x}{||y- f(x)||^2_W}\]

or a general linear form

\[\min_{x}{||y- Ax||^2_W}\]
where $x$ is the optimization variable to be estimated, $y$ is the observation, and $W$ is the weight matrix. Here we define the weighted norm $||x||^2_W = x^T W x$.
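As a quick numerical illustration, here is a minimal NumPy sketch (with made-up toy values for $A$, $y$, and $W$) that evaluates the weighted norm and solves the general linear form through the weighted normal equations $A^T W A\, x = A^T W y$:

```python
import numpy as np

# Toy problem: estimate x from y ≈ A x under a weight matrix W.
# A, y, W below are made-up illustrative data, not from the text.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([0.9, 2.1, 2.9])
W = np.diag([1.0, 4.0, 1.0])       # weight (information) matrix, W = P^{-1}

# Weighted norm ||r||_W^2 = r^T W r of the residual at some guess x0
x0 = np.zeros(2)
r = y - A @ x0
weighted_norm_sq = r @ W @ r

# Minimizer of ||y - A x||_W^2 from the weighted normal equations
x_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
print(weighted_norm_sq, x_hat)
```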

Consider the special case $W=I$. A common approach is to construct a residual for each observation or constraint, so that the problem can be written as

\[\min_{x} \frac{1}{2} \sum_{i=1}^N ||y_i-f_i(x_i)||^2\]

where $N$ is the number of constraints. The number of constraints grows quickly as more measurements are collected, so the optimization becomes a large-scale problem that is difficult to solve in real time. One strategy is to apply a sliding window that restricts the number of residual blocks. A general way to implement a sliding window while preserving all previous information is marginalization.

Marginalization using the Schur complement

Marginalization can be described using the Schur complement. Consider the variable $x= \left[x_m, x_r \right]^T $, where $x_m$ is the part to be marginalized out and $x_r$ is the remaining part. The problem can be written as

\[\min_{x}{||y- Ax||^2}= \min_{x}{||r(y,x)||^2}= \min_{x} \left[\left(y- Ax \right) ^T \left(y- Ax \right) \right].\]

The minimizer (for an invertible $A$; here $A$ is additionally symmetric, playing the role of an information matrix) satisfies the linear system $Ax = y$, written in block form as

\[\begin{bmatrix} A_m & A_c \\ A_c^T & A_r \end{bmatrix} \begin{bmatrix} x_m \\ x_r\end{bmatrix} = \begin{bmatrix} y_m \\ y_r \end{bmatrix},\]

where

\[A =\begin{bmatrix} A_m & A_c \\ A_c^T & A_r \end{bmatrix}, \quad y=\begin{bmatrix} y_m \\ y_r\end{bmatrix},\]

Left-multiplying both sides by an elimination matrix gives

\[\begin{bmatrix}I&0 \\ -A_c^T A_m^{-1}&I\end{bmatrix} \begin{bmatrix} A_m & A_c \\ A_c^T & A_r \end{bmatrix} \begin{bmatrix} x_m \\ x_r\end{bmatrix} = \begin{bmatrix}I&0 \\ -A_c^T A_m^{-1}&I\end{bmatrix} \begin{bmatrix} y_m \\ y_r \end{bmatrix},\] \[\begin{bmatrix}A_m&A_c \\ 0&A_r-A_c^TA_m^{-1}A_c\end{bmatrix}\begin{bmatrix} x_m \\ x_r \end{bmatrix}= \begin{bmatrix} y_m \\ y_r-A_c^TA_m^{-1}y_m \end{bmatrix},\]

and finally, we have

\[(A_r- A_c^T A_m^{-1} A_c ) x_r= y_r-A_c^TA_m^{-1} y_m,\]

which can be expressed as

\[(A_r- A_{marg} ) x_r= y_r-y_{marg},\]

where $A_{marg} = A_c^T A_m^{-1} A_c$ and $y_{marg} = A_c^TA_m^{-1} y_m$ are the marginalization terms. Note that both terms are evaluated at the current linearization point $\hat x$ and are treated as constants in subsequent iterations, which introduces linearization error. The fill-in produced by $A_{marg}$ makes the overall system matrix dense in subsequent computations, increasing the computation time.

Using the equation involving only $x_r$ above, we can marginalize $x_m$ out of the optimization problem. The terms $A_c^T A_m^{-1} A_c$ and $A_c^TA_m^{-1} y_m$ retain the information contributed by $x_m$ as prior information when solving for $x_r$, so the prior related to $x_m$ can be computed directly from this equation. Note that $A_r$ may be sparse, but after adding $A_c^T A_m^{-1} A_c$ the resulting matrix is generally no longer sparse, so in practice we need a way to sparsify it.
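The elimination above is easy to verify numerically. The following sketch (random toy blocks, with names $A_m$, $A_c$, $A_r$ following the text) forms the reduced system for $x_r$ and checks that it matches the corresponding block of the full solve:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive-definite block system (toy data for illustration)
n_m, n_r = 3, 4                              # sizes of x_m (marginalized) and x_r (kept)
G = rng.standard_normal((n_m + n_r, n_m + n_r))
A = G @ G.T + (n_m + n_r) * np.eye(n_m + n_r)
y = rng.standard_normal(n_m + n_r)

A_m, A_c = A[:n_m, :n_m], A[:n_m, n_m:]
A_r      = A[n_m:, n_m:]
y_m, y_r = y[:n_m], y[n_m:]

# Marginalization terms, computed once at the current linearization point
A_marg = A_c.T @ np.linalg.solve(A_m, A_c)   # A_c^T A_m^{-1} A_c
y_marg = A_c.T @ np.linalg.solve(A_m, y_m)   # A_c^T A_m^{-1} y_m

# Reduced system (A_r - A_marg) x_r = y_r - y_marg; note that even if A_r is
# sparse, A_r - A_marg is generally dense because of the fill-in from A_marg.
x_r = np.linalg.solve(A_r - A_marg, y_r - y_marg)

# Consistency check: the reduced solve matches the x_r block of the full solve
x_full = np.linalg.solve(A, y)
print(np.allclose(x_r, x_full[n_m:]))        # True
```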

Solving nonlinear problems

Following 《视觉SLAM十四讲》 (14 Lectures on Visual SLAM) and 崔华坤's derivation for VINS, we use the Gauss-Newton method to minimize the objective function: we look for a small increment of the optimization variables that reduces the objective. The objective above can be written in incremental form as

\[\begin{aligned} \min_{\Delta x}{\frac{1}{2}||r(y,x+\Delta x)||^2_W} &=\min_{\Delta x}{\frac{1}{2}||r(y,x)+J\Delta x||^2_W} \\ &= \min_{\Delta x}{\frac{1}{2}\left(||r(y,x)||^2_W +r(y,x)^T W J\Delta x + \Delta x^T J^T W r(y,x) +||J\Delta x||^2_W\right)}, \end{aligned}\]

where $J(x)$, abbreviated as $J$ (similar shorthand is used below), is the Jacobian of the residual $r(y,x)$ with respect to all optimization variables $x$. Expanding the expression above, differentiating with respect to $\Delta x$, and setting the derivative to zero gives the equation for the increment $\Delta x$:

\[J^TWJ\Delta x=-J^TWr,\]

where the weight matrix $W$ is an information (symmetric) matrix. Let $P$ be the covariance matrix of the measurement noise; then $W=P^{-1}=L^TL$. Such a Cholesky-type factorization of the information matrix makes it convenient to express the optimization term as a Mahalanobis distance, $d=r^T L^T Lr$. We can rewrite the equation above as

\[H \Delta x = b\]
where $H=J^TWJ$ and $b=-J^TWr$; $H$ can be regarded as a Hessian matrix (strictly, it is usually only called the Hessian when $W=I$). Solving this equation requires $H$ to be invertible (here, positive definite), but the $J^TWJ$ computed from data is only guaranteed to be positive semi-definite. Even when $H$ is positive definite, if it is ill-conditioned, or if $\Delta x$ is so large that the first-order (Jacobian) approximation of the residual is no longer accurate, the iteration is not guaranteed to converge. We therefore impose the constraint $||D \Delta x||^2 \leq \mu$; when $D=I$ this restricts $\Delta x$ to a ball of radius $\sqrt{\mu}$. The Levenberg-Marquardt method takes $D$ to be a non-negative diagonal matrix, commonly the square roots of the diagonal elements of $J^T J$, so that the constraint region is slightly larger along dimensions where the gradient is small. Using a Lagrange multiplier, the constrained problem can be converted into the unconstrained problem
\[\min_{\Delta x}{\frac{1}{2}||r(y,x)+J\Delta x||^2_W+ \frac{\lambda}{2}||D \Delta x||^2},\]

Solving this problem yields the increment equation

\[\left(H+\lambda D^TD \right) \Delta x = b,\]

or, when $D=I$,

\[\left(H+\lambda I \right) \Delta x = b.\]
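As a concrete illustration of the damped increment step, here is a minimal NumPy sketch on a made-up curve-fitting problem (the model $y \approx e^{at+b}$, the data, and the fixed damping $\lambda$ are all assumptions for illustration; a full Levenberg-Marquardt implementation would also adapt $\lambda$ depending on whether each step reduces the cost):

```python
import numpy as np

# Toy nonlinear least-squares problem: fit y ≈ exp(a*t + b).
# All data, the model, and the fixed damping below are made up for illustration.
t = np.linspace(0.0, 1.0, 20)
y = np.exp(0.8 * t + 0.2) + 0.01 * np.random.default_rng(1).standard_normal(t.size)
W = np.eye(t.size)                          # information matrix (here simply W = I)

def residual_and_jacobian(x):
    a, b = x
    f = np.exp(a * t + b)
    r = y - f                               # residual r(y, x) = y - f(x)
    J = np.column_stack([-t * f, -f])       # Jacobian of r with respect to (a, b)
    return r, J

x = np.array([0.0, 0.0])                    # initial guess
lam = 1e-3                                  # damping factor lambda (kept fixed here)
for _ in range(20):
    r, J = residual_and_jacobian(x)
    H = J.T @ W @ J                         # Gauss-Newton approximation of the Hessian
    b = -J.T @ W @ r
    DtD = np.diag(np.diag(J.T @ J))         # D^T D with D = sqrt(diag(J^T J))
    dx = np.linalg.solve(H + lam * DtD, b)  # (H + lambda D^T D) dx = b
    x = x + dx                              # update x_k = x_{k-1} + dx
print(x)                                    # should approach (0.8, 0.2)
```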

Each iteration applies the update $x_k=x_{k-1}+\Delta x$. Applying the Schur-complement elimination described earlier to this equation yields the marginalization-based sliding-window method. For a given window size $M$, the sliding-window method uses marginalization to limit the number of residuals entering the objective function and the number of corresponding variables to be estimated, so that the objective can be optimized quickly:

\[\min_{x} \frac{1}{2} \sum_{i=k}^{k+M-1} ||y_i-f_i(x_i)||^2.\]

where $x= \left[x_1,\dots, x_{k+M-1} \right]$ collects all optimization variables, $x_m = \left[x_1,\dots, x_{k-1} \right]^T$ denotes the variables to be marginalized out (not necessarily in temporal order; variables may be marginalized as needed), and $x_r = \left[x_k,\dots, x_{k+M-1} \right]^T$ denotes the variables that remain to be estimated and updated. In practice, one creates a marginalization residual that records the information retained after marginalization, including the marginalized variables and the prior information required for subsequent updates. This marginalization prior must be updated after every marginalization step, as sketched below.
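To make this bookkeeping concrete, the following simplified linear-Gaussian sketch (toy 1-D states and measurements; the helper name marginalize_oldest is hypothetical) maintains the window in information form $(H, b)$ and folds the oldest state into a marginalization prior via the Schur complement whenever the window exceeds size $M$. A real estimator such as VINS would additionally relinearize the residuals around the current estimates.

```python
import numpy as np

def marginalize_oldest(H, b, n):
    """Drop the first n variables from the information form (H, b)
    via the Schur complement, returning the prior on the rest."""
    H_m, H_c, H_r = H[:n, :n], H[:n, n:], H[n:, n:]
    b_m, b_r = b[:n], b[n:]
    H_prior = H_r - H_c.T @ np.linalg.solve(H_m, H_c)
    b_prior = b_r - H_c.T @ np.linalg.solve(H_m, b_m)
    return H_prior, b_prior

# Toy 1-D sliding window: scalar states x_i with absolute measurements
# y_i ≈ x_i and relative measurements z ≈ x_i - x_{i-1} (made-up data).
rng = np.random.default_rng(2)
M = 4                                      # window size
H = np.zeros((0, 0)); b = np.zeros(0)      # running information matrix / vector

for k in range(10):
    # Append a new state to the window (grow H, b by one dimension)
    H = np.pad(H, ((0, 1), (0, 1))); b = np.pad(b, (0, 1))
    n = H.shape[0]

    # Absolute measurement y_k = x_k + noise adds 1 to the new diagonal entry
    y_k = k + 0.1 * rng.standard_normal()
    H[-1, -1] += 1.0; b[-1] += y_k

    # Relative measurement z = x_k - x_{k-1} + noise links the last two states
    if n >= 2:
        z = 1.0 + 0.1 * rng.standard_normal()
        J = np.zeros(n); J[-2], J[-1] = -1.0, 1.0
        H += np.outer(J, J); b += J * z

    # Marginalize the oldest state into the prior once the window exceeds M
    if n > M:
        H, b = marginalize_oldest(H, b, 1)

    x_window = np.linalg.solve(H, b)       # current window estimate
print(x_window)
```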