Hey there reader! It has been quite a while since I wrote a blog post, but I have had a ton of things on my mind that I wanted to write about! I am stoked to be able to write about some of them now!

Since leaving my job in defense to go to graduate school — yes, I am obviously into being a poor student — I have been busy with classes and research and have been introduced to some pretty cool new ideas. One awesome algorithm I was introduced to was related to the estimation of the Range (also called Column Space) of some matrix operator using a randomized algorithm.

To be more rigorous, we can define a matrix operator as $M: \mathbb{R}^n \rightarrow \mathbb{R}^m$ with $m \leq n$, meaning it maps an $n$-dimensional vector to an $m$-dimensional vector. The goal of the algorithm introduced here is to find an approximate orthonormal basis for the range (column space) of $M$, ideally with minimal computation. Time to see where we can go with this!

So given some matrix $M \in \mathbb{R}^{m \times n}$, it actually is not very tough to find a basis for the range of $M$. The easiest way to do this is a QR decomposition, i.e. $M = QR$, where the columns of $Q$ are orthonormal vectors that span the range of $M$. The sad thing about using this approach directly is that its computational complexity is $O(m^2 n)$. Can we do better? The answer: we can.

So let us assume we know the approximate rank, $k$, of the matrix $M$ where $k \leq m \leq n$. Let us then assume we can generate some random set of input samples $\Omega \in \mathbb{R}^{n \times k}$ where each column of $\Omega$ is a random vector. If we assume we can compute each element of $\Omega$ in constant complexity, a randomized algorithm can proceed in the following steps with each associated computational complexity:

\begin{align}

&\text{Construct random matrix $\Omega \in \mathbb{R}^{n \times k}$ } &\rightarrow &O(nk) \tag{1}\\

&\text{Get measurements from $M$ by doing $Y = M\Omega$ } &\rightarrow &O(mnk) \tag{2}\\

&\text{Perform QR of $Y$, giving $Y = QR$ } &\rightarrow &O(mk^2) \tag{3}\\

&\text{Return range estimate, $Q$} &\rightarrow &O(mk) \tag{4}

\end{align}

The described algorithm has a dominating complexity of $O(mnk)$, potentially a huge speed up if $k \ll m$ and at least an improvement over the baseline approach, which has an $O(m^2 n)$ complexity. The whole idea of this method is to use random inputs from $\Omega$ to extract information from the range of $M$ and then use a QR decomposition to turn that information into an estimate of the basis that spans $M$’s range. If you want to find the most dominant $k$ basis vectors more reliably, you can also throw a power-iteration-style loop into the algorithm. This modifies the algorithm into the following:

\begin{align}

&\text{Construct random matrix $\Omega \in \mathbb{R}^{n \times k}$ } &\rightarrow &O(nk) \tag{1}\\

&\text{Get measurements from $M$ by doing $Y = M\Omega$ } &\rightarrow &O(mnk) \tag{2}\\

&\text{For some small constant $N$ iterations, do: } \tag*{}\\

&\quad\text{Perform QR of $Y$, giving $Y = QR$ } &\rightarrow &O(mk^2)\tag{i}\\

&\quad\text{Compute $U = M^T Q$ } &\rightarrow &O(mnk)\tag{ii}\\

&\quad\text{Perform QR of $U$, giving $U = QR$ } &\rightarrow &O(nk^2)\tag{iii}\\

&\quad\text{Compute $Y = MQ$ } &\rightarrow &O(mnk)\tag{iv}\\

&\text{Perform QR of $Y$, giving $Y = QR$ } &\rightarrow &O(mk^2) \tag{3}\\

&\text{Return range estimate, $Q$} &\rightarrow &O(mk) \tag{4}

\end{align}

Note that, again, the overall computational complexity ends up being $O(mnk)$ but in this case the basis found that spans the range of $M$ will be much more precise thanks to the power iteration. For those that like to have some code to look at, the following function can be used to perform the above computation.

    import numpy as np

    def randproj(A, k, random_seed=17, num_power_method=5):
        # Author: Christian Howard
        # This function is designed to take some input matrix A
        # and approximate it by the low-rank form A = Q*(Q^T*A) = Q*B.
        # This form is achieved using randomized algorithms.
        #
        # Inputs:
        # A: Matrix to be approximated by low-rank form
        # k: The target rank the algorithm will strive for.
        # random_seed: The random seed used in code so things are repeatable.
        # num_power_method: Number of power method iterations

        # set the random seed
        np.random.seed(seed=random_seed)

        # get dimensions of A
        (r, c) = A.shape

        # get the random input and measurements from column space
        omega = np.random.randn(c, k)
        Y = np.matmul(A, omega)

        # refine the measurements using the power method
        for i in range(num_power_method):
            Q1, R1 = np.linalg.qr(Y)
            Q2, R2 = np.linalg.qr(np.matmul(A.T, Q1[:, :k]))
            Y = np.matmul(A, Q2[:, :k])

        # get final k orthogonal vector estimates from column space
        Q3, R3 = np.linalg.qr(Y)
        Q = Q3[:, :k]

        # return the range estimate
        return Q
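
As a rough operation-count comparison for the algorithm above (ignoring constants and the cheaper QR steps), each pass of the power-iteration loop adds two more products with $M$, so with $N$ loop passes the dominant work is about $(2N+1)\,mnk$ multiplications versus roughly $m^2 n$ for the direct QR. Borrowing the sizes of the MNIST example later in this post ($m = 784$, $n = 10{,}000$, $k = 36$) purely as an illustration:

\begin{align*}
\frac{m^2 n}{mnk} &= \frac{m}{k} = \frac{784}{36} \approx 21.8 & &\text{(no power iteration)} \\
\frac{m^2 n}{(2N+1)\,mnk} &= \frac{m}{(2N+1)k} = \frac{784}{11 \cdot 36} \approx 1.98 & &\text{($N = 5$ loop passes)}
\end{align*}

These are order-of-magnitude counts rather than timing predictions, but they suggest the savings are largest when $k \ll m$ and only a few power iterations are needed.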

Sweet, we found a better way to go about finding the range of some matrix $M$! So what? Why is this even valuable to know?

Well, one huge opportunity for this algorithm is when one wishes to find a low rank approximation for the matrix $M$. The development of the above algorithm, based on the power iteration, actually depended on an implicit assumption that we were approximating $M$ with the form $M \approx Q Q^T M$.

Since the above algorithm finds $Q$, we can in turn approximate $M$ with a low rank factorization of $M \approx Q B$ where $B = Q^T M$. That in and of itself can be very useful to shrink down datasets represented by $M$, given $k \ll m \leq n$.
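
To see the $M \approx QB$ factorization in action, here is a small dependency-free sketch in plain Python. It is not the `randproj` function above: it skips the power iteration, swaps the library QR for modified Gram-Schmidt, and the helper names (`matmul`, `transpose`, `orthonormalize`, `range_finder`) as well as the tiny rank-2 test matrix are made up for illustration.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def orthonormalize(Y):
    """Return Q whose columns are an orthonormal basis for the columns of Y
    (modified Gram-Schmidt, standing in for a library QR)."""
    basis = []
    for col in zip(*Y):
        v = list(col)
        for q in basis:
            proj = sum(a * b for a, b in zip(q, v))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        basis.append([a / norm for a in v])
    return transpose(basis)

def range_finder(M, k, rng):
    """Basic randomized range finder: Q from the QR of Y = M * Omega."""
    n = len(M[0])
    omega = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    return orthonormalize(matmul(M, omega))

rng = random.Random(17)
m, n, k = 6, 8, 2

# Build an exactly rank-2 test matrix as a product of thin random factors.
left = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(m)]
right = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
M = matmul(left, right)

Q = range_finder(M, k, rng)      # m x k orthonormal basis estimate
B = matmul(transpose(Q), M)      # k x n reduced representation
M_approx = matmul(Q, B)          # m x n reconstruction Q * B

# Largest entrywise reconstruction error |M - Q*B|
max_err = max(abs(a - b) for ra, rb in zip(M, M_approx) for a, b in zip(ra, rb))
```

Because the test matrix has rank exactly $k$, the reconstruction error here should sit at round-off level; for a matrix that is only approximately low rank, the error would instead depend on how quickly its singular values decay.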

Another use for this is when we want to perform Principal Component Analysis (PCA) on some matrix $M$. It turns out that the columns of $Q$ act as a set of $k$ dominant feature vectors we can use to drop a dataset represented by $M$ to $k$ dimensions, and $B$ is the dataset in the lower dimensional form. In practice, one would then use $B$ as the independent variable data that a model would be built from.

The badass thing about using the above algorithm for PCA is that the orthonormal features you get approximately span the same subspace as the dominant directions you would get performing PCA via an SVD, but you avoid the computational cost of a full SVD!

Just as an example, a dataset of $10{,}000$ images of handwritten digits from the MNIST dataset was used to test out dimensionality reduction using the above algorithm. Note that each image is made up of $784$ pixels, meaning each point in the dataset has $784$ dimensions. Now to give one an idea of how the dataset looks, the first figure below shows what a random set of digits might look like.

The second figure shows the $36$ features extracted using the randomized range finder algorithm, effectively making the dataset reducible to having $36$ dimensional data points instead of the original $784$ dimensional points!

The above example using this Randomized Range Finder is pretty cool and it certainly can be used in plenty more applications! For example, I personally have used the above algorithm as a stepping stone to building an efficient randomized SVD code.

Additionally, one can modify the above algorithm to be adaptive, meaning we do not need to specify some rank $k$, and can instead allow the algorithm to find the optimal value for $k$ given some tolerance. This can be very useful in the above examples so we can trade off accuracy with run-time/data compression.

As we saw in this post, there are methods we can use to estimate the range of some matrix. Using some pretty simple randomized approaches, we can make this estimation much more efficient at the expense of some minor approximation error. As shown, the resulting Randomized Range Finder can find use in many things, ranging from data compression to feature extraction and more. Basically, this algorithm is pretty cool!

In a recent post, principles of Dynamic Programming were used to derive a recursive control algorithm for Deterministic Linear Control systems. The challenge with the approach used in that blog post is that it is only readily useful for Linear Control Systems with quadratic cost functions. What if, instead, we had a Nonlinear System to control or a cost function that is not quadratic? Such a problem would be challenging to solve using the approach described in the former blog post.

In this blog post, we are going to cover a more general approximate Dynamic Programming approach that approximates the optimal controller by essentially discretizing the state space and control space. This approach will be shown to generalize to any nonlinear problem, no matter if the nonlinearity comes from the dynamics or the cost function. While this approximate solution scheme is conveniently general in a mathematical sense, the limitations with respect to the Curse of Dimensionality will show why this approach cannot be used for every problem.

To approach approximating these Dynamic Programming problems, we must first start out with an applicable formulation. One of the first steps will be defining various items that will help make the work later more precise and understandable. The first two quantities are that of the complete State Space and Control Space. We can define those two spaces in the following manner:

\begin{align}

\mathcal{X} &= \bigtimes_{i=1}^{n} \lbrack x_{l}^{(i)}, x_{u}^{(i)}\rbrack \\

\mathcal{U} &= \bigtimes_{i=1}^{m} \lbrack u_{l}^{(i)}, u_{u}^{(i)}\rbrack

\end{align}

where $\mathcal{X} \subset \mathbb{R}^{n}$ is the State Space, $x_{l}^{(i)}, x_{u}^{(i)}$ are the lower and upper bounds of the $i^{th}$ State Space dimension, $\mathcal{U} \subset \mathbb{R}^{m}$ is the Control Space, and $u_{l}^{(i)}, u_{u}^{(i)}$ are the lower and upper bounds of the $i^{th}$ Control Space dimension. These spaces represent the complete State Space and Control Space. To approximate the Dynamic Programming problem, though, we will instead discretize the State Space and Control Space into finite subsets $\mathcal{X}_{D} \subset \mathcal{X}$ and $\mathcal{U}_{D} \subset \mathcal{U}$. We can thus define $\mathcal{X}_{D}$ and $\mathcal{U}_{D}$ in the following manner:

\begin{align}

\mathcal{X}_{D} &= \bigtimes_{i=1}^{n} L( x_{l}^{(i)}, x_{u}^{(i)}, N_i ) \label{xd} \\

\mathcal{U}_{D} &= \bigtimes_{i=1}^{m} L( u_{l}^{(i)}, u_{u}^{(i)}, M_i ) \label{ud} \\

L(a,b,N) &= \left \lbrace a + j \Delta : \Delta = \frac{b-a}{N-1}, j \in \lbrace 0, 1, 2, \cdots, N-1 \rbrace \right \rbrace

\end{align}

What the formulation above shows is that we generate a finite subset of both $\mathcal{X}$ and $\mathcal{U}$ by breaking up the interval between the bounds of each dimension into evenly spaced points. With these definitions, we can proceed with the mathematical and algorithmic formulation of the problem!
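
As a minimal sketch of these definitions, the grid $L(a,b,N)$ and the Cartesian products can be written in a few lines of Python. The function names `linspace_set` and `discretize` and the example bounds are hypothetical, chosen only to illustrate the construction.

```python
from itertools import product

def linspace_set(a, b, N):
    """L(a, b, N): N evenly spaced points from a to b inclusive (needs N >= 2)."""
    delta = (b - a) / (N - 1)
    return [a + j * delta for j in range(N)]

def discretize(bounds, counts):
    """Cartesian product of per-dimension grids.

    bounds is a list of (lower, upper) pairs, counts the number of points per dimension.
    """
    axes = [linspace_set(lo, hi, N) for (lo, hi), N in zip(bounds, counts)]
    return list(product(*axes))

# Two-dimensional state grid with N_1 = 5 and N_2 = 3 points.
X_D = discretize([(-1.0, 1.0), (0.0, 2.0)], [5, 3])

# One-dimensional control grid with M_1 = 3 points.
U_D = discretize([(-1.0, 1.0)], [3])
```

Each element of `X_D` is a tuple holding one grid point, so the number of elements is the product of the per-dimension counts.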

To make a general (deterministic) control problem applicable to Dynamic Programming, it needs to fit within the following framework:

\begin{align}

\boldsymbol{x}_{k+1} &= f(\boldsymbol{x}_{k},\boldsymbol{u}_{k}) \\

%

\mu_{k}^{*}(\boldsymbol{x}_{j}) &= \arg\min_{\hat{\boldsymbol{u}} \in \mathcal{U}_{D}} g_{k}(\boldsymbol{x}_{j},\hat{\boldsymbol{u}}) + V_{k+1}^{*}(f(\boldsymbol{x}_{j},\hat{\boldsymbol{u}}))\\

%

V_{k}^{*}(\boldsymbol{x}_{j}) &= g_{k}(\boldsymbol{x}_{j},\mu_{k}^{*}(\boldsymbol{x}_{j})) + V_{k+1}^{*}(\boldsymbol{x}_{k+1})\\

%

V_{k}^{*}(\boldsymbol{x}_{j}) &= g_{k}(\boldsymbol{x}_{j},\mu_{k}^{*}(\boldsymbol{x}_{j})) + V_{k+1}^{*}(f(\boldsymbol{x}_{j},\mu_{k}^{*}(\boldsymbol{x}_{j}))) \nonumber \\

%

V_{N}^{*}(\boldsymbol{x}_{N}) &= g_{N}(\boldsymbol{x}_{N})

\end{align}

$\forall k \in \lbrace 1, 2, 3, \cdots, N-1 \rbrace$, and $\forall j \in \lbrace 1, 2, 3, \cdots, |\mathcal{X}_{D}| \rbrace $. Note as well that $\mu_{k}^{*}(\boldsymbol{x})$ is the optimal controller (or policy) at the $k^{th}$ timestep as a function of some state $\boldsymbol{x} \in \mathcal{X}_{D}$. The idea of the above formulation is we compute a cost at some terminal time, $t_{N}$, using the cost function $g_{N}(\cdot)$, and then work backwards in time recursively to gradually obtain the optimal policy for the problem at each timestep. With the mathematical formulation resolved, the next step is to put all of this into an algorithm!

The algorithm can be defined in pseudocode using the following:
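
One way the backward recursion could look in code is sketched below. This is a toy Python version for a scalar state, not the C++ implementation mentioned in this post: the `nearest` lookup (snapping $f(\boldsymbol{x}_j, \hat{\boldsymbol{u}})$ to the closest grid point so that $V_{k+1}^{*}$ can be read from a table), the function names, and the small integer example problem are all assumptions made for illustration.

```python
def nearest(grid, x):
    """Snap a scalar state to the closest grid point (assumed lookup rule)."""
    return min(grid, key=lambda g: abs(g - x))

def backward_dp(X_D, U_D, f, g_k, g_N, N):
    """Return (policy, V) where policy[k][x] is the minimizing control at step k
    and V[k][x] is the associated cost-to-go, for k = N-1 down to 1."""
    V = {N: {x: g_N(x) for x in X_D}}   # terminal cost V_N
    policy = {}
    for k in range(N - 1, 0, -1):       # work backwards in time
        V[k], policy[k] = {}, {}
        for x in X_D:
            best_u, best_cost = None, float("inf")
            for u in U_D:                # brute-force search over the control grid
                x_next = nearest(X_D, f(x, u))
                cost = g_k(x, u) + V[k + 1][x_next]
                if cost < best_cost:
                    best_u, best_cost = u, cost
            V[k][x], policy[k][x] = best_cost, best_u
    return policy, V

# Toy problem: integer state grid, move by -1, 0 or +1 each step.
X_D = [-2, -1, 0, 1, 2]
U_D = [-1, 0, 1]
policy, V = backward_dp(
    X_D, U_D,
    f=lambda x, u: x + u,               # discrete dynamics
    g_k=lambda x, u: x**2 + u**2,       # running cost
    g_N=lambda x: 10 * x**2,            # terminal cost
    N=4,
)
```

For this toy problem the resulting policy pushes the state toward $0$ from either side. For vector-valued states, the nearest-point lookup would have to be replaced by a per-dimension snap or an interpolation of $V_{k+1}^{*}$, which is a design choice this sketch does not try to settle.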

With the algorithm defined above, one can translate it into code and apply it to some interesting problems! I have written a C++ implementation of the above algorithm, which can be found at my Github. With an implementation in hand, the next step is to try it out on some control problems! Let’s take a look at an example.

The Nonlinear Pendulum control problem is one classically considered in most introductory control classes. The full nonlinear problem can be formulated with a nonlinear Second-Order Ordinary Differential Equation (ODE) in the following manner:

\begin{align}

\ddot{\theta}(t) + c \dot{\theta}(t) + \kappa \sin\left(\theta(t)\right) &= u \\

\theta(t_0) &= \theta_{0} \\

\dot{\theta}(t_0) &= \dot{\theta}_{0}

\end{align}

where $u$ is a torque the controller can apply, and $c,\kappa$ are constants based on the exact pendulum system configuration. For this particular problem, we are going to try and build a controller that can invert the pendulum. Additionally, we are going to constrain the values for $u$ such that the torque is too weak to directly lift the pendulum up to an inverted position. The constraint is to, at least, make $|u| \lt \kappa$, ensuring the controller needs to find some different strategy to get the pendulum to an inverted position. Given we now have this problem statement, let’s make this problem solvable using the Approximate Dynamic Programming approach shown earlier!

First and foremost, we should take the dynamical system in the problem statement and convert it into the discrete dynamic equation Dynamic Programming requires. Our first step is to pose this problem as a system of First-Order differential equations in the following way:

\begin{align}

\dot{x_1} &= x_2 \\

\dot{x_2} &= u - c x_2 - \kappa \sin( x_1 ) \\

&\text{or} \nonumber \\

\dot{\boldsymbol{x}} &= F(\boldsymbol{x},u)

\end{align}

where $\lbrack \theta,\dot{\theta}\rbrack ^{T} = \lbrack x_1,x_2\rbrack^{T} = \boldsymbol{x}$. We can then discretize this dynamical system, using Finite Differences, into one that can be used in Dynamic Programming. This is done using the below steps:

\begin{align}

\dot{\boldsymbol{x}} &= F(\boldsymbol{x},u) \nonumber \\

\frac{\boldsymbol{x}_{k+1} - \boldsymbol{x}_{k}}{\Delta t} &\approx F(\boldsymbol{x}_{k},u_k) \nonumber \\

\boldsymbol{x}_{k+1} &= \boldsymbol{x}_{k} + \Delta t F(\boldsymbol{x}_{k},u_k) \nonumber \\

\boldsymbol{x}_{k+1} &= f(\boldsymbol{x}_{k},u_k)

\end{align}

where $f(\cdot,\cdot)$ becomes the discrete dynamical map that is used within the Dynamic Programming formulation. Now the next step is to define our discrete State and Control sets, $\mathcal{X}_{D}$ and $\mathcal{U}_{D}$ respectively, that will be used. These sets will be defined as the following for this problem:

\begin{align}

\mathcal{X}_{D} &= L(\theta_{min},\theta_{max},N_{\theta}) \bigtimes L(\dot{\theta}_{min},\dot{\theta}_{max},N_{\dot{\theta}}) \\

\mathcal{U}_{D} &= L(-u_{max},u_{max},N_{u})

\end{align}

where $\theta_{min},\theta_{max},\dot{\theta}_{min},\dot{\theta}_{max},u_{max}, N_{\theta},N_{\dot{\theta}},N_{u}$ will have exact values assigned based on the specific pendulum problem being solved. The last items needed to make this problem well posed are the cost functions needed to penalize different possible trajectories. For this problem, we will use the cost functions defined below:

\begin{align}

g_{N}(\boldsymbol{x}) = g_{N}(\theta,\dot{\theta}) &= Q_{f} (|\theta|-\pi)^2 + \dot{\theta}^2\\

g_{k}(\boldsymbol{x},u) = g_{k}(\theta,\dot{\theta},u) &= Q (|\theta|-\pi)^2 + Ru^2

\end{align}

where $g_{k}(\cdot,\cdot)$ is defined $\forall k \in \lbrace 1, 2, \cdots, N-1 \rbrace$ and $R, Q, Q_f$ are scalar weighting factors that can be defined depending on how smooth you want the control to be and how quickly you want the pendulum to become inverted. Now that all the items are defined so Dynamic Programming can be used, let’s solve this problem and see what we get!

Based on the Dynamic Programming formulation above of the Nonlinear Pendulum Control problem, we can crank out an optimal controller (at each timestep) algorithmically. To test the approach, the code I wrote (which can be found at my Github) uses the following values for the parameters mentioned earlier:

\begin{align*}

N &= 80 \\

c &= 0.0 \\

\kappa &= 5.0 \\

\theta_{min} &= -\pi \\

\theta_{max} &= \pi \\

N_{\theta} &= 3000\\

\dot{\theta}_{min} &= -3 \pi \\

\dot{\theta}_{max} &= 3 \pi \\

N_{\dot{\theta}} &= 3000 \\

u_{max} &= 1.0 \\

N_{u} &= 5 \\

R &= 0 \\

Q &= 10 \\

Q_f &= 100

\end{align*}

Note that, due to the discontinuity of the angle $\theta$ at $\theta = -\pi$ and $\theta = \pi$, the discrete dynamics actually need to be modified for the equation updating $\theta$ at each timestep. This equation can be updated to the following:

\begin{align}

\theta_{k+1} = B( \theta_{k} + \Delta t \dot{\theta}_{k} )

\end{align}

where $B(\theta)$ bounds the input angle $\theta$ to be between $-\pi$ and $\pi$ and is defined as the following:

\begin{align}

B(\theta) = \begin{cases}

\theta \,- 2\pi & \text{if } \theta \gt \pi \\

\theta + 2\pi & \text{if } \theta \lt -\pi \\

\theta & \text{Otherwise}

\end{cases}

\end{align}
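
Putting the Euler discretization and the angle wrap together, one step of the modified pendulum dynamics might be sketched in Python as below. The function names are hypothetical, and the timestep `dt=0.05` in the example call is an assumed value, since this post does not list the $\Delta t$ that was used.

```python
import math

def wrap_angle(theta):
    """B(theta): shift by 2*pi once so an overshoot lands back in [-pi, pi]."""
    if theta > math.pi:
        return theta - 2.0 * math.pi
    if theta < -math.pi:
        return theta + 2.0 * math.pi
    return theta

def pendulum_step(x, u, dt, c=0.0, kappa=5.0):
    """One forward-Euler step of theta'' + c*theta' + kappa*sin(theta) = u.

    x is the pair (theta, theta_dot); the defaults for c and kappa are the
    parameter values listed earlier in this post.
    """
    theta, theta_dot = x
    theta_next = wrap_angle(theta + dt * theta_dot)
    theta_dot_next = theta_dot + dt * (u - c * theta_dot - kappa * math.sin(theta))
    return (theta_next, theta_dot_next)

# Start just below theta = pi and swing past it; the angle wraps to near -pi.
x_next = pendulum_step((3.1, 2.0), u=1.0, dt=0.05)
```

Note that `wrap_angle` only corrects a single crossing, exactly as $B(\theta)$ is defined, so it implicitly assumes $\Delta t \, |\dot{\theta}|$ stays below $2\pi$ per step.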

Given we use the modified dynamics for the pendulum, we can use the Approximate Dynamic Programming algorithm described earlier to produce an optimal controller shown below. Note that this controller is actually just the optimal controller found for the first timestep. However, since the cost function penalizes the pendulum for not being inverted throughout its whole trajectory, the controllers made via Dynamic Programming are actually individually capable of inverting and stabilizing the pendulum. Thus, one only really needs one of these optimal controllers to get the desired result.

The graphic below shows the value the controller produces for any given $\theta$ and $\dot{\theta}$ within $\mathcal{X}_{D}$. Yellow is a positive torque value for $u$, while blue is a negative torque value for $u$.

As we can see in the graphic above, the optimal controller produced using Dynamic Programming is extremely nonlinear. Looking at the result, it would be hard to think of a great way to even represent this controller using a finite approximation with some set of basis functions.

The result is also interesting due to the complexity of the controller and patterns produced for various values for $\theta$ and $\dot{\theta}$. While there is certainly analysis that could be done to further understand what the optimal controller is doing, it would probably just be better to get a glimpse at what this policy actually does via visualization. Below is a video showing how it performs!

Now while the above algorithm has proven to produce some pretty awesome results, the practicality of the algorithm as-is is pretty limited. For starters, the amount of space needed for storing the complete controller at each timestep is on the order of $O( N |\mathcal{X}_{D}| )$, while the algorithmic computation is on the order of $O(N |\mathcal{X}_{D}| |\mathcal{U}_{D}| )$. For low dimensional problems, this may not seem like a big deal, but both $|\mathcal{X}_{D}|$ and $|\mathcal{U}_{D}|$ blow up as dimensions increase due to the Curse of Dimensionality.

For example, given equations $\eqref{xd}$ and $\eqref{ud}$, we can compute the cardinality of $\mathcal{X}_{D}$ and $\mathcal{U}_{D}$ to be the following:

\begin{align}

|\mathcal{X}_{D}| &= \prod_{i=1}^{n} N_{i} \\

|\mathcal{U}_{D}| &= \prod_{i=1}^{m} M_{i}

\end{align}

These cardinality results show that each dimension we add multiplies the size of the State and Control spaces, in turn making the values of $|\mathcal{X}_{D}|$ and $|\mathcal{U}_{D}|$ potentially huge! For example, if all we did was model a rocket in 3D, the state is 12 dimensions (or 13 if you use quaternions). Chopping up each dimension into just $10$ discrete pieces would make $|\mathcal{X}_{D}| = 10^{12}$ … which is way too huge a number to use practically, and $10$ discrete pieces per dimension is not even a lot! So even without looking at any discretized control space, this Approximate Dynamic Programming method proves impractical for a realistic problem.
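
To put numbers on this for the pendulum example above, we can plug the listed parameter values ($N = 80$, $N_{\theta} = N_{\dot{\theta}} = 3000$, $N_u = 5$) into the storage and computation estimates:

\begin{align*}
|\mathcal{X}_{D}| &= 3000 \cdot 3000 = 9 \times 10^{6} \\
|\mathcal{U}_{D}| &= 5 \\
N |\mathcal{X}_{D}| &= 80 \cdot 9 \times 10^{6} = 7.2 \times 10^{8} \\
N |\mathcal{X}_{D}| |\mathcal{U}_{D}| &= 7.2 \times 10^{8} \cdot 5 = 3.6 \times 10^{9}
\end{align*}

Even this two-dimensional problem needs hundreds of millions of table entries, so a 12-dimensional state with $10^{12}$ grid points per timestep is far out of reach for the same approach.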

Within this post, we saw a way to use Dynamic Programming and approximately tackle deterministic control problems… no matter how nonlinear the dynamics or cost functions are! We saw the algorithm described used to find a nonlinear optimal controller for a Nonlinear Pendulum and invert the pendulum. We also saw how impractical this method, as-is, can be for realistic problems of larger dimensionality.

While dimensionality does become an issue for a variety of problems, there are fortunately still some problems that can be adequately solved using the above approach. Those looking for something more capable can investigate other Approximate Dynamic Programming techniques in the literature. A related area of potential interest is Reinforcement Learning, which attempts to solve the same problem but with more flexibility than traditional Dynamic Programming.

Oh control. Who doesn’t enjoy having control of things in life every so often? While many of us probably wish life could be more easily controlled, alas things often have too much chaos to be adequately predicted and in turn controlled. While lack of complete controllability is the case for many things in life, like getting the ultimate job or getting even one for that matter (sorry I’m a Millennial), there are still plenty of things that can be controlled. And many are controlled by engineers.

While controlling systems is a challenge, especially if they are nonlinear, stochastic, partially controllable, etc., there is a wealth of work that has been done to tackle these problems. Now, it is useful to gain familiarity with some of these mathematical tools successfully used in control problems. One interesting tool is that of Optimal Control.

In this blog post, I’m going to cover the use of **Dynamic Programming** to tackle deterministic Discrete-Time Linear Control problems for the case of bringing the state to $\boldsymbol{0}$. I will also assume for simplicity all states are observable. With this tool, one will have a useful formulation that can be used to approximately tackle control problems in some optimal sense.

In the general deterministic, nonlinear dynamics formulation, we can represent the dynamics in a State Space form written as shown below:

\begin{align}

\frac{d \boldsymbol{x}}{dt} &= f(t,\boldsymbol{x},\boldsymbol{u})

\end{align}

where $t \in \mathbb{R}$ is the time, $\boldsymbol{x} \in \mathbb{R}^n$ is the state vector, $\boldsymbol{u} \in \mathbb{R}^m$ is the control vector, and $f(\cdot,\cdot,\cdot)$ is the mapping used to compute the time derivative of the state vector, defined as the following:

\begin{align}

f &: \mathbb{R} \times \mathbb{R}^n \times \mathbb{R}^m \rightarrow \mathbb{R}^n

\end{align}

Given we have the dynamics written in this differential form, we can convert it to a discrete form by approximating the time derivative of $\boldsymbol{x}$ with a simple Finite Difference formula. Thus, we can derive the discrete form in the following way:

\begin{align}

\left.\frac{d \boldsymbol{x}}{dt}\right|_{t_k} &= f(t_k,\boldsymbol{x}_k,\boldsymbol{u}_k) \nonumber \\

\frac{\boldsymbol{x}_{k+1} - \boldsymbol{x}_k }{\Delta t} &\approx f(t_k,\boldsymbol{x}_k,\boldsymbol{u}_k) \nonumber \\

\boldsymbol{x}_{k+1} &\approx \boldsymbol{x}_k + \Delta t f(t_k,\boldsymbol{x}_k,\boldsymbol{u}_k)

\end{align}

Now we can assume the system is linear, making the actual discrete-time system of the following form:

\begin{align}

\boldsymbol{x}_{k+1} &\approx A_{k}\boldsymbol{x}_k + B_{k}\boldsymbol{u}_k

\end{align}

where $A_{k}$ and $B_{k}$ are matrices that can be dependent on time. With this discrete form, we have one of the main ingredients needed to approach optimal control problems using Dynamic Programming. Next, we need to investigate how we will define optimality.
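
To make the link between the finite-difference step and the linear form concrete, suppose (as an assumption for illustration; the names $F$ and $G$ do not appear elsewhere in this post) that the continuous dynamics are $f(t,\boldsymbol{x},\boldsymbol{u}) = F(t)\boldsymbol{x} + G(t)\boldsymbol{u}$. Then:

\begin{align*}
\boldsymbol{x}_{k+1} &\approx \boldsymbol{x}_k + \Delta t \left( F(t_k)\boldsymbol{x}_k + G(t_k)\boldsymbol{u}_k \right) \\
&= \left( I + \Delta t\, F(t_k) \right)\boldsymbol{x}_k + \Delta t\, G(t_k)\,\boldsymbol{u}_k \\
\Rightarrow \quad A_k &= I + \Delta t\, F(t_k), \qquad B_k = \Delta t\, G(t_k)
\end{align*}

For the scalar system $\dot{x} = -x + u$, this gives $A_k = 1 - \Delta t$ and $B_k = \Delta t$, which matches the `alpha` and `beta` values set in the test script later in this post.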

For many optimal control problems, we can get away with using a cost function of the form shown below:

\begin{align}

J &= \left.\left( \boldsymbol{x}^{T}(t) Q_f \boldsymbol{x}(t)\right)\right|_{t=t_N} + \sum_{k=1}^{N-1} \left.\left( \boldsymbol{x}^{T}(t) Q(t) \boldsymbol{x}(t) + \boldsymbol{u}^{T}(t) R(t) \boldsymbol{u}(t) \right)\right|_{t=t_k} \nonumber \\

&= \boldsymbol{x}^{T}_N Q_f \boldsymbol{x}_N + \sum_{k=1}^{N-1} \left( \boldsymbol{x}^{T}_k Q_k \boldsymbol{x}_k + \boldsymbol{u}^{T}_k R_k \boldsymbol{u}_k \right)

\end{align}

The cost function shown above ends up being quadratic with respect to both the state and control vectors at each timestep the controller acts on. Note that the matrices $Q_f$, $Q_k \; \forall k$, and $R_k \; \forall k$ are chosen by the control engineer to ensure desirable properties. The need to choose these various matrices for the cost function is one of the big parts that makes control engineering more of an art than a science, and it benefits from lots of experience. Anyway, now that we have the cost function, we can start investigating how to obtain an optimal controller using Dynamic Programming!

So when tackling some optimal control problem, you are essentially striving to find an optimal time history of both the state, $\boldsymbol{x}$, and the control, $\boldsymbol{u}$. This time history of values can be viewed as optimal paths for $\boldsymbol{x}$ and $\boldsymbol{u}$. Now the premise of Dynamic Programming is essentially that a single optimal path can be broken up into optimal sub-paths that are optimal in their own domain, yet when unioned together become the desired complete optimal trajectory.

Using the concept behind Dynamic Programming, one should be able to obtain the state and control trajectories in pieces and wind up with the overall optimal trajectories when done. So given that, where can we even begin to start coming up with an optimal trajectory?

Since we earlier wrote out the cost function we’re going to use, we know that we have some desired end state and then a penalty cost based on the actual trajectory we take. Since we really want to end up with the final state, a smart strategy is to start at the end of the trajectory and work backwards in time! Might make sense, but that kind of sounds hard or even impossible. Well, to help us think this through, let’s start out with a simpler control problem to illustrate the approach!

For this problem, let’s assume the below scalar dynamics and cost function:

\begin{align}

x_{k+1} &= \alpha x_{k} + \beta u_{k}\\

J &= Q_f x_{3}^2 + \sum_{k=1}^{2} \left( Q x_{k}^2 + R u_{k}^2 \right)

\end{align}

Notice that the discrete dynamics for $x_{k+1}$ is linear with respect to $x_{k}$ and $u_{k}$. This is a simplification for the sake of the example, but real world problems will not necessarily be linear. We will investigate nonlinear problems in future posts, but for this post linear is what we will stick to! Now back to the problem… Since we want to work from the end back to the beginning, let’s first assume our final cost is:

\begin{align}

V_{3} &= Q_f x_{3}^2

\end{align}

We then know recursively the following costs exist for the other parts of the control problem:

\begin{align}

V_{2} &= Q x_{2}^2 + R u_{2}^2 + V_{3}\\

V_{1} &= Q x_{1}^2 + R u_{1}^2 + V_{2}

\end{align}

Notice how we have broken up our problem into successive pieces of the overall cost function. This strategy gives us an ability to recursively tackle finding optimal state and control trajectories piece by piece.

So let us first start with finding $u_{2}$, the last control action that will be taken in the sequence. Looking at $V_{2}$, we know that $V_{3}$ is dependent on $x_{3}$. Through our dynamics, we know we can relate $x_{3}$ to $u_{2}$ using the fact that $x_{3} = \alpha x_{2} + \beta u_{2}$. Substituting this into the $V_{3}$ term of the $V_{2}$ equation gives us the following:

\begin{align*}

V_{2} &= Q x_{2}^2 + R u_{2}^2 + V_{3}\\

&= Q x_{2}^2 + R u_{2}^2 + Q_f x_{3}^2\\

&= Q x_{2}^2 + R u_{2}^2 + Q_f \left(\alpha x_{2} + \beta u_{2}\right)^2\\

&= Q x_{2}^2 + R u_{2}^2 + Q_f\alpha^2 x_{2}^2 + Q_f \beta^2 u_{2}^2 + 2 Q_f \alpha \beta x_{2}u_{2}\\

&= \left(Q + Q_f \alpha^2 \right) x_{2}^2 + \left( R + Q_f \beta^2 \right) u_{2}^2 + 2 Q_f \alpha \beta x_{2}u_{2}

\end{align*}

Now given our expression for $V_{2}$, let’s compute the value for $u_{2}$ that minimizes it! Thus:

\begin{align}

\frac{\partial V_{2}}{\partial u_{2}} = 0 &= 2 \left( R + Q_f \beta^2 \right) u_{2} + 2 Q_f \alpha \beta x_{2} \nonumber \\

u_{2} &= -\left( R + Q_f \beta^2 \right)^{-1} Q_f \alpha \beta x_{2}\\

&= -K_{2} x_{2}

\end{align}

where we can see $u_{2}$ ends up as a feedback controller based on the state $x_{2}$ and $K_{2}$ is the linear control gain. That’s pretty interesting! But so, how do we proceed to obtain $u_{1}$? Well, we first use our result for $u_{2}$ to get $V_{2}$ in terms of just $x_{2}$ and then apply the same process to $V_{1}$! Let’s try it out:

\begin{align}

V_{2} &= \left(Q + Q_f \alpha^2 \right) x_{2}^2 + \left( R + Q_f \beta^2 \right) u_{2}^2 + 2 Q_f \alpha \beta x_{2}u_{2} \nonumber \\

&= \left(Q + Q_f \alpha^2 \right) x_{2}^2 + \left( R + Q_f \beta^2 \right) K_{2}^2 x_{2}^2 - 2 K_{2} Q_f \alpha \beta x_{2}^2 \nonumber \\

&= \left(Q + Q_f \alpha^2 + (R + Q_f \beta^2)K_{2}^2 - 2 K_{2} Q_f \alpha \beta \right) x_{2}^2 \nonumber \\

&= P_{2} x_{2}^2

\end{align}

So now that we have worked out $V_{2}$ in terms of quantities we know and just with respect to $x_{2}$, we can apply the same steps from earlier to end up with the control $u_{1}$ by doing the following:

\begin{align*}

V_{1} &= Q x_{1}^2 + R u_{1}^2 + V_{2}\\

&= Q x_{1}^2 + R u_{1}^2 + P_{2} x_{2}^2\\

&= Q x_{1}^2 + R u_{1}^2 + P_{2} \left(\alpha x_{1} + \beta u_{1}\right)^2\\

&= Q x_{1}^2 + R u_{1}^2 + P_{2} \alpha^2 x_{1}^2 + P_{2} \beta^2 u_{1}^2 + 2 P_{2} \alpha \beta x_{1}u_{1}\\

&= \left(Q + P_{2} \alpha^2 \right) x_{1}^2 + \left( R + P_{2} \beta^2 \right) u_{1}^2 + 2 P_{2} \alpha \beta x_{1}u_{1} \\

\frac{\partial V_{1}}{\partial u_{1}} = 0 &= 2 \left( R + P_{2} \beta^2 \right) u_{1} + 2 P_{2} \alpha \beta x_{1} \\

u_{1} &= -\left( R + P_{2} \beta^2 \right)^{-1} P_{2} \alpha \beta x_{1}\\

&= -K_{1} x_{1}

\end{align*}

Very interesting! We have indeed found the feedback control formulas for the control actions needed to minimize the cost function we have defined. Now we did a simple problem, but have you been able to see a pattern in the algorithmic steps to finding the optimal control sequence? Generalizing the steps above, we can define recursive equations for finding each gain needed to perform the optimal control action. Here are the equations:

\begin{align}

K_{k} &= (R + P_{k+1} \beta^2 )^{-1} P_{k+1} \alpha \beta \\

P_{k} &= Q + P_{k+1} \alpha^2 + (R + P_{k+1} \beta^2 )K_{k}^2 - 2 K_{k} P_{k+1} \alpha \beta

\end{align}

where $k \in \lbrace 1, 2, \cdots, N-1 \rbrace$, $P_{N} = Q_f$, and we compute the feedback control via the equation:

\begin{align}

u_{k} &= -K_{k} x_{k}

\end{align}
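
Before moving on, the recursion above can be checked by hand on a tiny case. The Python sketch below uses hypothetical function names and made-up values $\alpha = \beta = Q = R = Q_f = 1$ with $N = 3$, chosen only because the numbers come out clean: $K_2 = 0.5$, $P_2 = 1.5$, and $K_1 = 0.6$.

```python
def control_gains(alpha, beta, Qf, Q, R, N):
    """Backward recursion for K_1..K_{N-1}; returns a list with K[k-1] = K_k."""
    K = [0.0] * (N - 1)
    P = Qf  # P_N = Q_f
    for k in range(N - 1, 0, -1):
        # K_k = P_{k+1} * alpha * beta / (R + P_{k+1} * beta^2)
        K[k - 1] = P * alpha * beta / (R + P * beta**2)
        # P_k from the quadratic cost-to-go V_k = P_k * x_k^2
        P = (Q + P * alpha**2 + (R + P * beta**2) * K[k - 1] ** 2
             - 2.0 * K[k - 1] * P * alpha * beta)
    return K

def simulate(alpha, beta, K, x0):
    """Apply u_k = -K_k * x_k through the scalar dynamics and return all states."""
    xs = [x0]
    for gain in K:
        u = -gain * xs[-1]
        xs.append(alpha * xs[-1] + beta * u)
    return xs

K = control_gains(alpha=1.0, beta=1.0, Qf=1.0, Q=1.0, R=1.0, N=3)
xs = simulate(1.0, 1.0, K, x0=1.0)
```

Starting from $x_1 = 1$, the two control steps take the state to $x_2 = 0.4$ and then $x_3 = 0.2$, so the state is already shrinking toward $0$ even with such a short horizon.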

To prove the worth of these equations, let us implement some code and show the system is stabilized, i.e. $x_{N}$ reaches a value near $0$.

Let’s get this computational proof rolling by first implementing a function that will compute the sequential gains needed for the control problem. Below shows a Matlab code to do just that!

    function [ K ] = getControlGains( alpha, beta, Qf, Q, R, N )
    %GETCONTROLGAINS Method to get control gains sequence for linear scalar
    % control problem tackled using Dynamic Programming
    % Author: C. Howard

    K = zeros(N-1,1);
    P = Qf;   % P_N = Q_f
    for i = (N-1):-1:1
        % gain K_k = P_{k+1}*alpha*beta / (R + P_{k+1}*beta^2)
        K(i) = (P*alpha*beta)/(R + P*beta^2);
        % cost-to-go coefficient P_k from the recursion above
        P = Q + P*alpha^2 + (R + P*beta^2)*K(i)^2 - 2*K(i)*P*alpha*beta;
    end

    end

The next code is simply one that does a simple Monte Carlo sampling of tuning parameter values, simulates the control and dynamics based on them, and saves off a figure. This code can be found below:

    % script to test control sequence
    % for linear scalar control problem

    % set constants
    N  = 500;   % number of control steps to use in algorithm
    Qf = 2;     % final cost weighting (overwritten by the random samples below)
    Q  = 10;    % cost weighting of state (overwritten below)
    R  = 80;    % cost weighting of control (overwritten below)
    dt = 1e-2;
    alpha = (1-dt);
    beta  = dt.*1;
    NMC   = 10; % number of Monte Carlo samples

    % make sure the output folder for the figures exists
    if ~exist('plots','dir'), mkdir('plots'); end

    for i = 1:NMC
        Qf = rand()*20;   % final cost weighting
        Q  = rand()*10;   % cost weighting of state
        R  = rand()*150;  % cost weighting of control

        % Get Control Gains
        K = getControlGains(alpha,beta,Qf,Q,R,N);

        % Get ready to simulate
        x0 = 1;
        x = zeros(N,1); x(1) = x0;
        u = zeros(N-1,1);

        % Do simulation (k is used so the outer loop index i is not shadowed)
        for k = 2:N
            u(k-1) = -K(k-1)*x(k-1);
            x(k) = alpha*x(k-1) + beta*u(k-1);
        end

        % plot results
        time = dt.*(0:N-1);
        figure(1)
        plot(time,x,'-','Color',[0.5,0,1.0],'LineWidth',2)
        hold on
        plot(time(1:end-1),u,'-','Color',[0.9,0,0.2],'LineWidth',2)
        grid on
        xlabel('Time','FontSize',16)
        ylabel('Value','FontSize',16)
        title({sprintf('$$Q_{f} = %0.2f | Q = %0.2f | R = %0.2f | \\alpha = %0.2f | \\beta = %0.2f$$',Qf,Q,R,alpha,beta)},'interpreter','latex','FontSize',16)
        legend({'x','u'},'Location','Best')
        hold off
        axis([0,time(end),-x0,x0])
        print(gcf,'-dpng','-r300',sprintf('plots/qf%0.2f_q%0.2f_r%0.2f_a%0.2f_b%0.2f_sim.png',Qf,Q,R,alpha,beta))
        close all;
    end

Now below we see some results from the above code for different values of the tuning parameters:

If one looks at these figures, we can see that smaller values for $R$ allow $u$ to take on larger magnitudes at a given time. As $R$ increases, though, we see that the magnitude of $u$ shrinks, which in turn makes the dynamical system take longer to approach the desired steady state value of $0$.

This all makes sense since $R$ penalizes large magnitudes for $u$, meaning a large value for $R$ should shrink the magnitude of $u$ over the time history. Now if we were to set larger values for $Q_{f}$ and $Q$, we could expect the magnitude of $u$ to increase over the time history since the cost function would focus less on minimizing control and instead focus on getting the state to the desired steady state value. I think it is pretty neat seeing how the tuning variables can intuitively affect the control performance!

Now this was all for a scalar problem, but how about we extend this to Linear Multiple Input Multiple Output (MIMO) systems!

Let us first assume the dynamics can be written like the following:

\begin{align*}

\boldsymbol{x}_{k+1} &= A_{k} \boldsymbol{x}_{k} + B_{k} \boldsymbol{u}_{k}

\end{align*}

where $A_{k}$ and $B_{k}$ can be time varying matrices. Let’s then define the Dynamic Programming cost recursively as the following:

\begin{align}

V_{k} &= \boldsymbol{x}_{k}^{T}Q_{k}\boldsymbol{x}_{k} + \boldsymbol{u}_{k}^{T}R_{k}\boldsymbol{u}_{k} + \boldsymbol{x}_{k+1}^{T}P_{k+1}\boldsymbol{x}_{k+1} \\

P_{N} &= Q_{f}

\end{align}

We can then substitute the Linear MIMO dynamics to replace the $\boldsymbol{x}_{k+1}$ term in the recursive cost equation. This results in:

\begin{align}

V_{k} &= \boldsymbol{x}_{k}^{T}Q_{k}\boldsymbol{x}_{k} + \boldsymbol{u}_{k}^{T}R_{k}\boldsymbol{u}_{k} + \left(A_{k} \boldsymbol{x}_{k} + B_{k} \boldsymbol{u}_{k}\right)^{T}P_{k+1}\left(A_{k} \boldsymbol{x}_{k} + B_{k} \boldsymbol{u}_{k}\right) \label{eq1}

\end{align}

We can then take the derivative of $V_{k}$ with respect to $\boldsymbol{u}_{k}^{T}$ and solve for an optimal value for $\boldsymbol{u}_{k}$ by doing the following:

\begin{align*}

\frac{\partial V_{k}}{\partial \boldsymbol{u}_{k}^{T}} = 0 &= 2 R_{k}\boldsymbol{u}_{k} + 2 B_{k}^{T}P_{k+1}\left(A_{k} \boldsymbol{x}_{k} + B_{k} \boldsymbol{u}_{k}\right) \\

0 &= \left( R_{k} + B_{k}^{T}P_{k+1}B_{k} \right) \boldsymbol{u}_{k} + B_{k}^{T}P_{k+1}A_{k} \boldsymbol{x}_{k} \\

\boldsymbol{u}_{k} &= -\left( R_{k} + B_{k}^{T}P_{k+1}B_{k} \right)^{-1} B_{k}^{T}P_{k+1}A_{k} \boldsymbol{x}_{k} \\

\boldsymbol{u}_{k} &= -K_{k} \boldsymbol{x}_{k}

\end{align*}

where $K_{k} = \left( R_{k} + B_{k}^{T}P_{k+1}B_{k} \right)^{-1} B_{k}^{T}P_{k+1}A_{k}$. Now with the solution to the optimal control specified as $\boldsymbol{u}_{k} = -K_{k} \boldsymbol{x}_{k}$, we can substitute the result into equation $\refp{eq1}$ and get a recursive expression for $P_{k}$ in terms of known quantities. Thus:

\begin{align}

V_{k} &= \boldsymbol{x}_{k}^{T}Q_{k}\boldsymbol{x}_{k} + \boldsymbol{u}_{k}^{T}R_{k}\boldsymbol{u}_{k} + \left(A_{k} \boldsymbol{x}_{k} + B_{k} \boldsymbol{u}_{k}\right)^{T}P_{k+1}\left(A_{k} \boldsymbol{x}_{k} + B_{k} \boldsymbol{u}_{k}\right) \\

V_{k} &= \boldsymbol{x}_{k}^{T}Q_{k}\boldsymbol{x}_{k} + \boldsymbol{x}_{k}^{T}K_{k}^{T}R_{k}K_{k} \boldsymbol{x}_{k} + \boldsymbol{x}_{k}^{T}\left(A_{k} - B_{k}K_{k}\right)^{T}P_{k+1}\left(A_{k} - B_{k}K_{k}\right)\boldsymbol{x}_{k} \\

V_{k} &= \boldsymbol{x}_{k}^{T}\left(Q_{k} + K_{k}^{T}R_{k}K_{k} + \left(A_{k} - B_{k}K_{k}\right)^{T}P_{k+1}\left(A_{k} - B_{k}K_{k}\right)\right)\boldsymbol{x}_{k} \\

V_{k} &= \boldsymbol{x}_{k}^{T}P_{k}\boldsymbol{x}_{k} \\

\therefore P_{k} &= Q_{k} + K_{k}^{T}R_{k}K_{k} + \left(A_{k} - B_{k}K_{k}\right)^{T}P_{k+1}\left(A_{k} - B_{k}K_{k}\right)

\end{align}

With this, our full recursive equations with the base case are the following:

\begin{align}

K_{k} &= \left( R_{k} + B_{k}^{T}P_{k+1}B_{k} \right)^{-1} B_{k}^{T}P_{k+1}A_{k} \\

P_{k} &= Q_{k} + K_{k}^{T}R_{k}K_{k} + \left(A_{k} - B_{k}K_{k}\right)^{T}P_{k+1}\left(A_{k} - B_{k}K_{k}\right) \\

\boldsymbol{u}_{k} &= -K_{k} \boldsymbol{x}_{k} \\

P_{N} &= Q_{f}

\end{align}

where $k \in \lbrace 1, 2, \cdots, N-1 \rbrace$. With these equations, given a fully observable Linear MIMO system, one can work backwards from some time $t_{N}$ and figure out the sequence of gain matrices, $K_{k} \forall k$, that minimize the quadratic cost function! This is really cool because now we have a pretty general set of equations that can be used for many real control problems!
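To make the recursion concrete, here is a minimal MATLAB sketch of the backward pass followed by a forward simulation. The double-integrator matrices, the weights, and the horizon are illustrative assumptions rather than values from this post, and the system is taken to be time invariant so that $A_k = A$ and $B_k = B$.

```matlab
% Sketch: backward recursion for the gains K_k, then a forward simulation.
% All numerical values below are assumptions chosen for illustration.
dt = 0.1;
A  = [1, dt; 0, 1];      % position/velocity dynamics (assumed)
B  = [0.5*dt^2; dt];     % acceleration input (assumed)
Q  = eye(2);             % state cost weighting
R  = 1;                  % control cost weighting
Qf = 10*eye(2);          % final cost weighting, P_N = Q_f
N  = 200;                % number of time steps

K = zeros(1, 2, N-1);    % K(:,:,k) stores the gain matrix K_k
P = Qf;                  % base case
for k = (N-1):-1:1
    Kk  = (R + B'*P*B) \ (B'*P*A);    % K_k from P_{k+1}
    Acl = A - B*Kk;                   % closed-loop matrix A - B*K_k
    P   = Q + Kk'*R*Kk + Acl'*P*Acl;  % P_k from P_{k+1}
    K(:,:,k) = Kk;
end

% forward simulation using u_k = -K_k x_k
x = zeros(2, N);
x(:,1) = [1; 0];
for k = 1:(N-1)
    u = -K(:,:,k)*x(:,k);
    x(:,k+1) = A*x(:,k) + B*u;
end
fprintf('norm of final state: %g\n', norm(x(:,N)));
```

Using the backslash operator instead of forming the inverse explicitly is a design choice: it solves the same linear system but is usually better behaved numerically.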

In this post we have gone over some fundamentals of developing optimal controllers using Dynamic Programming to tackle Linear control problems. In future posts, we will seek to consider nonlinear dynamical systems. Diving into this more complicated area is much more challenging and will require a new post devoted to introducing the topic.

With all that said, we have conquered a lot of stuff in this post and have good reason to feel good about ourselves! There is only one thing left to say…

Mathematics and the methods to communicate and work with it have evolved ever since mankind began. In our modern world, most are driven to represent concepts and expressions in clear forms based on topics like Linear Algebra, keeping things in the form of matrices and vectors. While this is quite useful for visualization and ease of computation in many cases, there are other approaches to viewing and tackling problems that ignore some of the mainstream methods. One example I am going to provide an introduction to is the area of Tensor Notation.

Tensor notation is a tool to represent and work with mathematics in a way that essentially uses indices to represent other dimensions of a quantity. This notation allows for easily working with tensors of all varieties and in turn generalizes better than most typical Linear Algebra techniques.

The basic concepts of tensor notation are the following:

- Terms that share indices represent a summation
  - $a_i b_i = \sum_{i=1}^{n} a_i b_i$
- Shared indices are dummy variables, so they can be renamed freely
  - $a_i b_i = a_k b_k = a_p b_p$
- Order of tensor variables next to each other doesn’t matter
  - $a_i b_j c_k = b_j c_k a_i = b_j a_i c_k$
- Vectors are represented in some unit vector basis $\left\lbrace \hat{e}_i \right\rbrace$
  - $\textbf{u} = u_i \hat{e}_i$
- Dot product between unit vectors in an orthonormal basis results in the Kronecker Delta
  - $\hat{e}_i \cdot \hat{e}_j = \delta_{ij}$
- Derivative of a tensor with respect to itself results in the Kronecker Delta
  - $\frac{\partial x_i}{\partial x_j} = \delta_{ij}$
  - $\frac{\partial A_{ij}}{\partial A_{mn}} = \delta_{im}\delta_{jn}$
- Multiplying a tensor with a Kronecker Delta that shares one of its indices is equivalent to an index swap
  - $A_{ij}\delta_{jp} = A_{ip}$
- Cross products are represented by third-order tensors
  - $(\textbf{a} \times \textbf{b})_i = \mathcal{E}_{ijk} a_j b_k$
- Transpose of a second-order tensor is just a swap of its indices
  - $A_{ij}^{T} = A_{ji}$

where

\begin{align}

\delta_{ij} &= \begin{cases}

1 & i = j \\

0 & i \neq j

\end{cases} \\

%

\mathcal{E}_{ijk} &= \begin{cases}

1 & (i,j,k) \text{ is an even permutation of (1,2,3)} \\

-1 & (i,j,k) \text{ is an odd permutation of (1,2,3)} \\

0 & \text{otherwise}

\end{cases}

\end{align}
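As a quick sanity check of the summation and cross product rules above, here is a small MATLAB sketch that spells out the repeated-index sums as explicit loops. The vectors `a` and `b` are arbitrary values chosen for illustration.

```matlab
% Sketch: repeated indices written out as explicit sums.
a = [1; 2; 3];   % example vectors (assumed values)
b = [4; 5; 6];

% a_i b_i : the shared index i is summed over
s = 0;
for i = 1:3
    s = s + a(i)*b(i);
end

% Levi-Civita symbol E_ijk stored as a 3x3x3 array
E = zeros(3,3,3);
E(1,2,3) = 1;  E(2,3,1) = 1;  E(3,1,2) = 1;    % even permutations
E(3,2,1) = -1; E(2,1,3) = -1; E(1,3,2) = -1;   % odd permutations

% (a x b)_i = E_ijk a_j b_k : j and k are summed, i is a free index
c = zeros(3,1);
for i = 1:3
    for j = 1:3
        for k = 1:3
            c(i) = c(i) + E(i,j,k)*a(j)*b(k);
        end
    end
end

fprintf('a_i b_i minus dot(a,b): %g\n', s - dot(a,b));
fprintf('norm of E_ijk a_j b_k minus cross(a,b): %g\n', norm(c - cross(a,b)));
```

Both differences should print as zero, since the loops are just the built-in `dot` and `cross` written in index form.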

Given we have some simple basics listed out, let’s do a set of examples to try and solidify an understanding of the basics!

**Problem**

Expand the expression $3 a_i b_i$ given $i \in \left\lbrace 1,2\right\rbrace$.

**Solution**

\begin{align*}

3 a_i b_i = 3 \left(a_1 b_1 + a_2 b_2 \right)

\end{align*}

**Problem**

Use tensor notation to represent the inner product between vectors $\textbf{u}$ and $\textbf{v}$.

**Solution**

\begin{align*}

\textbf{u} \cdot \textbf{v} &= u_i \hat{e}_i \cdot v_j \hat{e}_j \\

&= u_i v_j (\hat{e}_i \cdot \hat{e}_j) \\

&= u_i v_j \delta_{ij} \\

&= u_i v_i

\end{align*}

**Problem**

Expand the expression $A_{ij} b_{j}$ given $j \in \left\lbrace 1,2\right\rbrace$.

**Solution**

\begin{align*}

A_{ij} b_{j} = A_{i1} b_{1} + A_{i2} b_{2}

\end{align*}

**Problem**

Compute the derivative of the quantity $J = x^{T}Ax$ with respect to $x$, where $A$ is symmetric.

**Solution**

\begin{align*}

J = x^{T}Ax &= x_{i}A_{ij}x_{j}\\

\frac{\partial J}{\partial x_{k}} &= \frac{\partial}{\partial x_{k}}\left(x_{i}A_{ij}x_{j}\right) \\

\frac{\partial J}{\partial x_{k}} &= \frac{\partial x_{i}}{\partial x_{k}}A_{ij}x_{j} + x_{i}A_{ij}\frac{\partial x_{j}}{\partial x_{k}} \\

\frac{\partial J}{\partial x_{k}} &= \delta_{ik}A_{ij}x_{j} + x_{i}A_{ij}\delta_{jk} \\

\frac{\partial J}{\partial x_{k}} &= A_{kj}x_{j} + x_{i}A_{ik} \\

\frac{\partial J}{\partial x_{k}} &= A_{kj}x_{j} + A_{jk}x_{j} \\

\frac{\partial J}{\partial x_{k}} &= A_{kj}x_{j} + A_{kj}x_{j} \\

\frac{\partial J}{\partial x_{k}} &= 2A_{kj}x_{j}

\end{align*}
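If you want to convince yourself of this result numerically, a rough check is to perturb each component of $x$ by a small amount and compare against $2A_{kj}x_{j}$. The matrix `A`, the vector `x`, and the perturbation size `h` in this sketch are assumed values.

```matlab
% Sketch: numerically check dJ/dx_k = 2 A_kj x_j for a symmetric A.
A = [2, 1, 0; 1, 3, -1; 0, -1, 4];   % symmetric example matrix (assumed)
x = [1; -2; 0.5];                    % example point (assumed)
J = @(z) z'*A*z;

h = 1e-6;                            % small perturbation (assumed)
g_num = zeros(3,1);
for k = 1:3
    e_k = zeros(3,1);
    e_k(k) = 1;                      % unit vector along x_k
    g_num(k) = (J(x + h*e_k) - J(x - h*e_k))/(2*h);
end

g_exact = 2*A*x;                     % the tensor notation result
fprintf('max difference: %g\n', max(abs(g_num - g_exact)));
```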

**Problem**

Assuming some matrix $C$ is invertible, find $\frac{\partial C^{-1}_{ij}}{\partial C_{kl}}$.

**Solution**

First, we know that since $C$ is invertible, the following is true: $C_{ik}C^{-1}_{kj} = \delta_{ij}$

Given this fact, the following derivation can be done:

\begin{align*}

C_{ik}C^{-1}_{kj} &= \delta_{ij}\\

\frac{\partial}{\partial C_{lm}}\left(C_{ik}C^{-1}_{kj}\right) &= \frac{\partial \delta_{ij}}{\partial C_{lm}}\\

\frac{\partial C_{ik}}{\partial C_{lm}}C^{-1}_{kj} + \frac{\partial C^{-1}_{kj}}{\partial C_{lm}}C_{ik} &= 0 \\

\frac{\partial C^{-1}_{kj}}{\partial C_{lm}}C_{ik} &= -\delta_{il}\delta_{km}C^{-1}_{kj} \\

\frac{\partial C^{-1}_{kj}}{\partial C_{lm}}C_{ik}C^{-1}_{ri} &= -\delta_{il}C^{-1}_{mj}C^{-1}_{ri} \\

\frac{\partial C^{-1}_{kj}}{\partial C_{lm}}\delta_{rk} &= -C^{-1}_{mj}C^{-1}_{rl} \\

\frac{\partial C^{-1}_{rj}}{\partial C_{lm}} &= -C^{-1}_{mj}C^{-1}_{rl} \\

\frac{\partial C^{-1}_{ij}}{\partial C_{kl}} &= -C^{-1}_{lj}C^{-1}_{ik} \\

\end{align*}
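This identity is easy to get wrong by an index, so a numerical check is worthwhile. The sketch below perturbs one entry $C_{kl}$ at a time and compares the change in the inverse against $-C^{-1}_{ik}C^{-1}_{lj}$. The matrix `C` and the perturbation size `h` are assumed values.

```matlab
% Sketch: numerically check d(C^-1)_ij / dC_kl = -(C^-1)_ik (C^-1)_lj.
C = [4, 1, 0; 2, 5, 1; 0, 1, 3];   % invertible example matrix (assumed)
Cinv = inv(C);
h = 1e-6;                          % small perturbation (assumed)

maxErr = 0;
for k = 1:3
    for l = 1:3
        dC = zeros(3);
        dC(k,l) = h;               % perturb only the entry C_kl
        % numerical derivative of every entry (i,j) of the inverse
        D_num = (inv(C + dC) - inv(C - dC))/(2*h);
        % column k of C^-1 carries index i, row l of C^-1 carries index j
        D_exact = -Cinv(:,k)*Cinv(l,:);
        maxErr = max(maxErr, max(abs(D_num(:) - D_exact(:))));
    end
end
fprintf('max difference over all (i,j,k,l): %g\n', maxErr);
```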

With this post, we have covered some basic aspects of Tensor Notation and investigated how to use it for various derivations. I have found that this skill has proven very useful in doing derivations with respect to matrices, which is a common task to complete in deriving control algorithms based on state-space models of the dynamics. I know Tensor Notation also finds its way into modern physics, though I am sure there are many other disciplines that use it often as well.

In the future, we may investigate using Tensor Notation and applying it to some area of study, like controls or some aspect of physics.

*After a while since my last post, I have finally come back to getting some things written! Been really involved with work and other important things for myself, but I recently got most of these things out of the way so it is time to get back to the blogging grind! WOOT!*

In the world of calculus, a fundamental part is the computation of derivatives. The classic formalization of a derivative is in the following form below:

\begin{align}

\left.\frac{df}{dx}\right|_x = \lim_{h \rightarrow 0} \frac{f(x+h) - f(x)}{h} \label{eq1}

\end{align}

This is great and all, but what happens when you try to use equation $\refp{eq1}$ to compute a derivative on a computer? As one might guess, setting $h = 0$ produces a division by $0$ in the denominator, which causes some undesirable effects on a computer, one being an estimate for the derivative that is not at all correct. So how can we begin to estimate the value of a derivative at some location? One of the common approaches is something referred to as **Finite Differences**.

The reality is estimating a derivative is much easier than it first appears. Let us pretend instead of making $h = 0$ in equation $\refp{eq1}$, we choose $h = \epsilon$, where $\epsilon \ll 1$. Computing the derivative in this fashion produces a basic **Finite Difference** scheme, where the name comes from the fact we use small and finite changes in a function, based on a small $\epsilon$, to estimate a value for the derivative of that function. Using this approach, we can then just use the classic derivative formulation and approximate the derivative like so:

$$ \left.\frac{df}{dx}\right|_x \approx \frac{f(x+\epsilon) - f(x)}{\epsilon} $$

As we can imagine, for small values of $\epsilon$, the estimate is close to the true derivative. However, this is really just an approximation. What’s the theoretical error of this approximation? Well we can estimate this by using a Taylor Series! This can be done using the following steps:

\begin{align*}

\left.\frac{df}{dx}\right|_x &= \frac{f(x+\epsilon) - f(x)}{\epsilon} + \text{Error} \\

\left.\frac{df}{dx}\right|_x &= \frac{1}{\epsilon} f(x+\epsilon) - \frac{1}{\epsilon} f(x) + \text{Error} \\

\left.\frac{df}{dx}\right|_x &\approx \frac{1}{\epsilon} \left(f(x) + \epsilon \left.\frac{df}{dx}\right|_x + \frac{\epsilon^2}{2!}\left.\frac{d^2f}{dx^2}\right|_x \right) - \frac{1}{\epsilon} f(x) + \text{Error} \\

\left.\frac{df}{dx}\right|_x &\approx \left.\frac{df}{dx}\right|_x + \frac{\epsilon}{2!}\left.\frac{d^2f}{dx^2}\right|_x + \text{Error} \\

\text{Error} &\approx -\frac{\epsilon}{2!}\left.\frac{d^2f}{dx^2}\right|_x = O(\epsilon)

\end{align*}

What we get from this result is that the true error is approximately proportional to the value of $\epsilon$. So basically, if you cut the value of $\epsilon$ in half, you should expect the error of the derivative approximation to be cut in half. That’s pretty neat!
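Here is a small MATLAB sketch to see that halving behavior in practice. The test function $\sin(x)$, the evaluation point, and the list of step sizes are assumptions picked for illustration.

```matlab
% Sketch: the forward difference error should roughly halve when epsilon halves.
f  = @(x) sin(x);    % test function (assumed)
df = @(x) cos(x);    % its exact derivative
x0 = 1;              % evaluation point (assumed)

ep_vals = 1e-2 * 0.5.^(0:4);   % each step size is half of the previous one
err = zeros(size(ep_vals));
for n = 1:numel(ep_vals)
    ep = ep_vals(n);
    err(n) = abs((f(x0 + ep) - f(x0))/ep - df(x0));
end

ratios = err(1:end-1)./err(2:end);   % expected to be close to 2
disp(ratios);
```

The ratios are not exactly $2$ because the higher order Taylor terms we dropped still contribute a little.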

So now that we have proven we can approximate a derivative with this scheme, we’re done and it’s time to wrap things up… Oh wait, you want to know if we can do better than this scheme? Well that’s an interesting thought. Let’s try something. First, let’s assume we can approximate a derivative using the following weighted average:

$$ \left.\frac{df}{dx}\right|_{x_i} \approx \sum_{j=n}^{m} a_j f(x_i + jh)$$

where we are finding the derivative at some location $x_i$ and where $n$ and $m$ are integers one can choose such that $n \lt m$. Given this, let’s also state that the Taylor Series of some $f(x_i + jh)$ can be written out like so:

$$ f(x_i + jh) = f_{i+j} = f_i + \sum_{k=1}^{\infty} \frac{(jh)^k}{k!}f_i^{(k)}$$

What’s interesting is we can see that each $f_{i+j}$ in the weighted average will share similar Taylor Series, only differing in the $(jh)^{k}$ coefficient of each term. We can use this pattern to set up a system of equations, as shown below:

\begin{align}

\left.\frac{df}{dx}\right|_{x_i} &\approx \sum_{j=n}^{m} a_j f(x_i + jh) \nonumber \\

\left.\frac{df}{dx}\right|_{x_i} &\approx \sum_{j=n}^{m} a_j \left( f_i + \sum_{k=1}^{\infty} \frac{(jh)^k}{k!}f_i^{(k)} \right) \nonumber \\

\left.\frac{df}{dx}\right|_{x_i} &\approx \left(\sum_{j=n}^{m} a_j\right)f_i + \sum_{j=n}^{m} a_j\sum_{k=1}^{\infty} \frac{(jh)^k}{k!}f_i^{(k)} \nonumber \\

\left.\frac{df}{dx}\right|_{x_i} &\approx \left(\sum_{j=n}^{m} a_j\right)f_i + \sum_{k=1}^{\infty} \left(\sum_{j=n}^{m} a_j j^{k}\right)\frac{h^{k}}{k!}f_i^{(k)} \label{eq_s}

\end{align}

If we match the coefficients of $f_i$ and each of its derivatives on both sides of this equation, we end up with the following set of equations to solve for the $(m-n+1)$ weights:

\begin{align*}

0 &= \left(\sum_{j=n}^{m} a_j\right) \\

1 &= \left(\sum_{j=n}^{m} a_j j\right)h \\

0 &= \left(\sum_{j=n}^{m} a_j j^{2}\right) \\

0 &= \left(\sum_{j=n}^{m} a_j j^{3}\right) \\

&\vdots \\

0 &= \left(\sum_{j=n}^{m} a_j j^{m-n}\right)

\end{align*}

If we solve this system of equations, we obtain the weights for the weighted average form of the Finite Difference scheme such that we zero out all but one of the $(m-n+1)$ terms in the truncated Taylor Series. This zeroing out of terms typically results in a numerical scheme of order $O(h^{m-n})$ for first order derivative approximations, though the exact order can be found by obtaining the first nonzero Taylor Series term found in the $\text{Error}$ after you plug in the values for $\left\lbrace a_j\right\rbrace$.

As an example, let’s choose the case where $n = -1$ and $m = 1$. Using these values, we end up with the following system of equations:

\begin{align*}

0 &= a_{-1} + a_{0} + a_{1}\\

\frac{1}{h} &= -a_{-1} + a_{1}\\

0 &= a_{-1} + a_{1}

\end{align*}

We will solve this set of equations analytically as an example, but typically you’d likely want to compute these schemes numerically using some Linear Algebra routines. Based on these equations, we can first see from the third equation that $a_{-1} = -a_{1}$. Thus, by the second equation, $a_{1} = \frac{1}{2h}$ and in turn $a_{-1} = -\frac{1}{2h}$. Plugging in the values for $a_{-1}$ and $a_{1}$ into the first equation results in $a_{0} = 0$. Thus, our resulting Finite Difference scheme, known as a Second Order Central Difference since its error is $O(h^2)$, is:

$$ \left.\frac{df}{dx}\right|_{x_i} \approx \frac{f_{i+1} - f_{i-1}}{2h} $$

That’s pretty convenient! What’s interesting with this setup is we can pretty easily compute Finite Differences for more than just a single first order derivative.
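For anyone who would rather let the computer do the algebra, below is a rough MATLAB sketch that assembles the system of equations for a first derivative with any choice of $n$ and $m$ and solves it with the backslash operator. The variable names, the step size, and the test function are my own assumptions for illustration.

```matlab
% Sketch: solve for the weights a_j, j = n..m, of a first derivative scheme.
n = -1; m = 1;          % stencil limits (example values)
h = 0.1;                % step size (assumed)
offsets = n:m;          % the integers j used in the weighted average
powers  = (0:(m-n))';   % exponents k = 0, 1, ..., m-n

% row k+1 of M holds offsets.^k, i.e. the equation for the k-th Taylor term
M = bsxfun(@power, offsets, powers);

rhs = zeros(m-n+1, 1);
rhs(2) = 1/h;           % keep only the first derivative term

a = M \ rhs;            % weights [a_n; ...; a_m]

% apply the scheme to f(x) = exp(x) at x_i = 0, where the true derivative is 1
f  = @(x) exp(x);
xi = 0;
approx = f(xi + offsets*h) * a;
fprintf('weights: %s\n', mat2str(a', 4));
fprintf('derivative estimate: %.6f\n', approx);
```

Changing `n` and `m` gives one-sided or wider stencils without touching the rest of the code.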

To show how Finite Difference schemes can be derived for more complicated expressions, let’s try a second example. So how about we try to compute a Finite Difference scheme to estimate the following quantity:

\begin{align}

\alpha\left.\frac{df}{dx}\right|_{x_i} + \beta\left.\frac{d^2f}{dx^2}\right|_{x_i} \approx \sum_{j=-2}^{2} a_j f(x_i + jh) \label{ex2}

\end{align}

If we use the right-hand side of equation $\refp{eq_s}$ from earlier and equate it to the left-hand side of equation $\refp{ex2}$, we end up with the following:

\begin{align*}

0 &= \left(\sum_{j=-2}^{2} a_j\right) \\

\alpha &= \left(\sum_{j=-2}^{2} a_j j\right)h \\

\beta &= \left(\sum_{j=-2}^{2} a_j j^{2}\right)\frac{h^2}{2!} \\

0 &= \left(\sum_{j=-2}^{2} a_j j^{3}\right) \\

0 &= \left(\sum_{j=-2}^{2} a_j j^{4}\right)

\end{align*}

If we expand the various series in the equations above and then write all these equations in matrix form, we end up with the following matrix equations to solve for the unknown coefficients $\left\lbrace a_j \right\rbrace$:

\begin{align}

\begin{pmatrix}

1 & 1 & 1 & 1 & 1 \\

-2 & -1 & 0 & 1 & 2 \\

4 & 1 & 0 & 1 & 4 \\

-8 & -1 & 0 & 1 & 8 \\

16 & 1 & 0 & 1 & 16

\end{pmatrix}

\begin{pmatrix}

a_{-2} \\

a_{-1} \\

a_{0} \\

a_{1} \\

a_{2}

\end{pmatrix}

=

\begin{pmatrix}

0 \\

\frac{\alpha}{h} \\

\frac{2\beta}{h^2} \\

0 \\

0

\end{pmatrix}

\label{ex2_mat}

\end{align}

After solving this set of equations, using whatever method you prefer (numerically, symbolically, by hand, etc.), one is able to employ the scheme in whatever problems are needed. As one can see from this example, the thing that changes the most in this formulation is just the vector on the right-hand side of the matrix equation where you basically express which derivative quantities you want the Finite Difference scheme to approximate. So as you can see, it’s really not too difficult to develop a Finite Difference scheme!
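As a rough illustration, the matrix equation above can be solved numerically in a few lines of MATLAB. The values of $\alpha$, $\beta$, and $h$, along with the test function $\sin(x)$, are assumptions made just for this sketch.

```matlab
% Sketch: solve the 5-point system for alpha*f' + beta*f'' and test it.
alpha = 1; beta = 0.5;   % assumed coefficients
h = 0.1;                 % assumed step size
offsets = -2:2;          % j = -2, ..., 2

% rows are offsets.^0, offsets.^1, ..., offsets.^4, matching the matrix above
M   = bsxfun(@power, offsets, (0:4)');
rhs = [0; alpha/h; 2*beta/h^2; 0; 0];
a   = M \ rhs;           % weights [a_{-2}; a_{-1}; a_0; a_1; a_2]

% compare against the exact value for f(x) = sin(x)
f  = @(x) sin(x);
xi = 0.7;                % evaluation point (assumed)
approx = f(xi + offsets*h) * a;
exact  = alpha*cos(xi) - beta*sin(xi);
fprintf('approx = %.8f, exact = %.8f\n', approx, exact);
```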

Now obtaining Finite Differences in this way is actually not the only approach. One potentially more straightforward way is to build Finite Difference schemes using Lagrange Interpolation. Essentially, the idea is to use Lagrange Interpolation to build an interpolant based on the number of points you wish to use to approximate the derivative. Then, you just take whatever derivatives you need of this interpolant to obtain the derivative you want! So to jump into it, we can write out our Lagrange Interpolant in 1-D, $\hat{f}(x)$, as the following, given we are evaluating our function $f(\cdot)$ at $x_j = x_i + jh \;\; \forall j \in \left\lbrace n, n+1, \cdots, m-1, m \right\rbrace$:

\begin{align}

\hat{f}(x) = \sum_{j=n}^{m} f(x_j) \prod_{k=n,k\neq j}^m \frac{(x-x_k)}{(x_j - x_k)}

\end{align}

Given this expression, we can then compute the necessary derivatives we wish to approximate, evaluate the results at $x_i$, and obtain our Finite Difference scheme. For example, let’s try this again against Example 1. First, we can expand the interpolant to be the following based on the fact $n=-1$ and $m=1$:

\begin{align}

\hat{f}(x) = f(x_{-1})\frac{(x-x_{0})}{(x_{-1} - x_{0})}\frac{(x-x_{1})}{(x_{-1} - x_{1})} +

f(x_{0})\frac{(x-x_{-1})}{(x_{0} - x_{-1})}\frac{(x-x_{1})}{(x_{0} - x_{1})} +

f(x_{1})\frac{(x-x_{0})}{(x_{1} - x_{0})}\frac{(x-x_{-1})}{(x_{1} - x_{-1})}

\end{align}

We then take the derivative once, resulting in the expression below:

\begin{align}

\frac{d\hat{f}}{dx}(x) = f(x_{-1})\frac{(x-x_{0}) + (x-x_{1})}{(x_{-1} - x_{1})(x_{-1} - x_{0})} +

f(x_{0})\frac{(x-x_{-1}) + (x-x_{1})}{(x_{0} - x_{-1})(x_{0} - x_{1})} +

f(x_{1})\frac{(x-x_{0}) + (x-x_{-1})}{(x_{1} - x_{0})(x_{1} - x_{-1})}

\end{align}

We then evaluate this derivative expression at $x_i$ and simplify the numerators and denominators, resulting in the following:

\begin{align}

\frac{d\hat{f}}{dx}(x_i) &= f(x_{-1})\frac{-h}{(-2h)(-h)} +

f(x_{0})\frac{0}{(h)(-h)} +

f(x_{1})\frac{h}{(h)(2h)} \nonumber \\

\frac{d\hat{f}}{dx}(x_i) &= f(x_{1})\frac{1}{2h} - f(x_{-1})\frac{1}{2h} \nonumber \\

\frac{d\hat{f}}{dx}(x_i) &= \frac{f(x_{1}) - f(x_{-1})}{2h} \nonumber \\

\frac{d\hat{f}}{dx}(x_i) &= \frac{f_{i+1} - f_{i-1}}{2h} \label{ex1_v2}

\end{align}

As we can see looking at equation $\refp{ex1_v2}$, the Lagrange Interpolant reproduced a Second Order Central Difference scheme at some location $x_i$, showing there’s more than one approach to generating a Finite Difference scheme. But now knowing a Lagrange Interpolant can help derive these schemes, there’s something quite interesting we can now understand with respect to Finite Differences.
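The equivalence can also be checked numerically. In this sketch I lean on MATLAB's `polyfit`, which returns the unique quadratic through three points and is therefore the same polynomial as the Lagrange Interpolant, and on `polyder` to differentiate it. The test function, the location $x_i$, and the step size are assumed values.

```matlab
% Sketch: derivative of the 3-point interpolant vs. the central difference.
f  = @(x) exp(x);        % test function (assumed)
xi = 0.3; h = 0.1;       % location and step size (assumed)
xn = xi + (-1:1)*h;      % the nodes x_{-1}, x_0, x_1

% quadratic through the three nodes, written in the shifted variable (x - xi)
c  = polyfit(xn - xi, f(xn), 2);
dc = polyder(c);                 % derivative of the interpolant
d_interp  = polyval(dc, 0);      % evaluated at x = xi
d_central = (f(xi + h) - f(xi - h))/(2*h);

fprintf('difference between the two estimates: %g\n', abs(d_interp - d_central));
```

Shifting the nodes by `xi` before fitting is only there to keep the fit well conditioned; it does not change the interpolant.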

In model building, there exists a phenomenon named Runge’s Phenomenon that essentially displays how fitting a polynomial model on the order of the number of data points you’re fitting to results in oscillations between points due to overfitting. An example of this phenomenon can be seen below.

As one can see based on the graphic, the polynomial based on Lagrange interpolation goes through each data point, but gets large deviations between points as it works its way away from the center. This oscillation actually results in large errors in derivatives, which makes sense if you look at the picture. If we just compare the slopes at the right most point on the plot, for example, we can see the slope of the true function is much smaller in magnitude than the slope based on the Lagrange interpolation.

This phenomenon often occurs when data points are roughly equally spaced apart, though it can be shown placing the data differently (like using Chebyshev points) can greatly mitigate the occurrence of Runge’s Phenomenon. Additionally, the distance between points makes a large difference, where larger distances between points worsen the problem. The example plot below shows how the fit using Lagrange Interpolation, even for a high order polynomial, does fine when the distances between points are small.

The improved fit using Lagrange interpolation makes sense because the true function approaches linear behavior between data points as the distance between points shrinks, which results in approximate fits being quite accurate.
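To reproduce the flavor of these plots yourself, here is a minimal MATLAB sketch that interpolates the classic Runge test function $1/(1+25x^2)$ with $11$ points, once on equally spaced points and once on Chebyshev points, and compares the worst-case error. The choice of function, interval, and number of points are assumptions on my part and may differ from what was used for the figures.

```matlab
% Sketch: Lagrange interpolation on equispaced vs. Chebyshev points.
f    = @(x) 1./(1 + 25*x.^2);    % classic Runge test function (assumed)
nPts = 11;                       % number of interpolation points (assumed)
xe = linspace(-1, 1, nPts);                   % equally spaced points
xc = cos((2*(1:nPts) - 1)*pi/(2*nPts));       % Chebyshev points
xx = linspace(-1, 1, 1001);                   % fine grid for evaluation

pe = zeros(size(xx));            % interpolant on equispaced points
pc = zeros(size(xx));            % interpolant on Chebyshev points
for j = 1:nPts
    Le = ones(size(xx));
    Lc = ones(size(xx));
    for k = [1:j-1, j+1:nPts]    % every node except the j-th
        Le = Le .* (xx - xe(k)) / (xe(j) - xe(k));
        Lc = Lc .* (xx - xc(k)) / (xc(j) - xc(k));
    end
    pe = pe + f(xe(j))*Le;
    pc = pc + f(xc(j))*Lc;
end

errE = max(abs(pe - f(xx)));
errC = max(abs(pc - f(xx)));
fprintf('max error, equispaced: %g | Chebyshev: %g\n', errE, errC);
```

Adding `plot(xx, f(xx), xx, pe, xx, pc)` at the end gives a picture of the oscillations near the ends of the interval.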

Now with respect to Finite Differences, since they can be modeled using Lagrange Interpolation, we can see that derivatives based on high order fits (or Finite Differences using many points in the weighted average) can pick up a fair bit of error, especially if the distance between the points isn’t very small. This property of Finite Differences makes them trickier to use successfully if you want to implement, for example, a $9^{th}$ order accurate Finite Difference scheme to estimate a first order derivative.

That said, if you do manage to make the stepsize, $h$, particularly small, you will likely get a fine approximation using a high order Finite Difference scheme. It is still important to validate that the scheme is providing the sort of error one expects, especially since finite arithmetic in floating point numbers can generate its own set of problems (which won’t be covered here).

In this post, we covered some fundamental theory and examples revolving around the derivation of Finite Differences. In future posts, we will investigate the use of Finite Differences in various problems to get a feel for their value in computational mathematics.

For those interested, I recommend working out some of the math, deriving some error terms for some different Finite Difference formulas, and taking the mathematical equations and trying to build some codes around them… aka be like Bender:

For further investigation in the fundamentals of Finite Difference and related topics covered here, I recommend the book *Fundamentals of Engineering Numerical Analysis* by **Parviz Moin**. This covers a lot of topics fairly well for anyone first investigating the subject of Numerical Analysis and who may not be a mathematician (aka this book isn’t super mathematically rigorous). It is a good read nonetheless!

As we strive to piece together the data into something that can make useful predictions, we can find ourselves wondering,

How can we build models based on the data?

One of the most common ways to build a model is based on something called **Least Square Regression**, which essentially corresponds to finding a parametric model that minimizes the mean squared error between the model output and expected outputs in the dataset.

In this blog post, we’re going to go over some of the fun math involved with Least Square Regression (both linear and nonlinear), discuss some basic approaches to solving these problems, and tackle some sample problems!

Note that it is expected the reader is comfortable with calculus and basic linear algebra. The codes written below are also done in the language MATLAB. If you have any questions about what you read, please let me know in a comment below!

So before we hit up solving some problem, we need to step through some of the fun theory! So, one of the first things we should define when tackling some regression problem is the actual model we are building. In this case, the model $f(\cdot,\cdot)$ will be defined as the following mapping:

$$ f: \mathbb{R}^{m} \times \mathbb{R}^{n} \rightarrow \mathbb{R}$$

In English, this is saying $f(\cdot,\cdot)$ takes two inputs, the first being an $m$ dimensional vector, the latter being an $n$ dimensional vector. This mapping then produces and returns a scalar value based on the two inputs. In my definition, the parameter vector $\vec{\beta}$, which is what will be solved for using regression, is the first parameter and thus $\vec{\beta} \in \mathbb{R}^{m}$. The second input is a given input data vector, defined as $\vec{X} \in \mathbb{R}^{n}$, which is associated to some output scalar value $Y \in \mathbb{R}$. The end goal is to end up with a $\vec{\beta}$ that, when plugged into $f(\cdot,\cdot)$, can estimate the correct value of $Y$ for some input $\vec{X}$.

Now given a set of data, $\mathcal{D} = \lbrace (\vec{x}_i,y_i) \in \mathbb{R}^{n} \times \mathbb{R} : i \in[1,N] \rbrace$, we wish to compute the value of $\vec{\beta}$ such that the following cost function is minimized:

$$ J\left(\vec{\beta}\right) = \frac{1}{2N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta},\vec{x}_i) - y_i\right)^2 $$

where $w\left(\cdot\right)$ is a weight function that produces a positive scalar weight value based on the input $\vec{x}_i$. In this cost function, we are aiming to estimate a value for $\vec{\beta}$ that minimizes the weighted squared error between the model and the known data, typically referred to as **Weighted** Least Squares. I have the weight function, $w\left(\cdot\right)$, in there for generalizing the solution and because it can be useful at times to weight certain data more than others.

Now given the above cost function, the goal is to solve the above optimization problem, hoping our parametric model can learn to represent the trends in the data well. The challenge of solving this optimization problem is dependent on the form of $f(\cdot,\cdot)$: an $f(\cdot,\cdot)$ that is linear with respect to $\vec{\beta}$ will produce a convex optimization problem, while a nonlinear one will generally create a non-convex optimization problem. The former is simple to solve, while the latter is much more difficult. The reason non-convex problems are trickier is the presence of various local optima that you can get stuck at while looking for the global optimum.

Let us start with the case where our model is linear with respect to $\vec{\beta}$, which means our model is of the form:

$$ f(\vec{\beta},\vec{x}) = \sum_{j=1}^{m} \beta_j \phi_j(\vec{x})$$

where $\beta_j$ is the $j^{th}$ component of $\vec{\beta}$ and $\phi_j(\cdot)$ is the $j^{th}$ basis function of this linear model. *Please note that a basis function is just a function that is part of a set of functions, called a basis, that we suspect is a sufficient finite dimensional representation of our solution*. Now one additional identity that will be useful is the derivative of our model with respect to one of the parameters $\beta_k$:

$$ \left.\frac{\partial f}{\partial \beta_k}\right|_{(\vec{\beta},\vec{x}_i)} = \frac{\partial f_i}{\partial \beta_k} = \phi_k(\vec{x}_i)$$

As stated earlier, the cost function we’re using is the Weighted Least Square cost function:

$$ J\left(\vec{\beta}\right) = \frac{1}{2N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta},\vec{x}_i) - y_i\right)^2 $$

We know from calculus that finding an optimum of some function requires taking a derivative of the function, with respect to the variable of interest, and finding where the derivative equates to $0$. Following this logic, we can obtain the solution via the following steps using indicial notation:

$$\begin{align}

\frac{\partial J}{\partial \beta_k} = 0 &= \frac{1}{N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta^{*}},\vec{x}_i) - y_i\right)\frac{\partial f_i}{\partial \beta_k} \;\;\; \forall k\\

%

0 &= \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta^{*}},\vec{x}_i) - y_i\right)\phi_k(\vec{x}_i)\\

%

\sum_{i=1}^{N} w\left(\vec{x}_i\right)f(\vec{\beta^{*}},\vec{x}_i)\phi_k(\vec{x}_i) &= \sum_{i=1}^{N} w\left(\vec{x}_i\right)y_i\phi_k(\vec{x}_i) \;\;\; \text{Put unknowns on left.}\\

%

\sum_{i=1}^{N} w\left(\vec{x}_i\right)\beta^{*}_j \phi_j(\vec{x}_i)\phi_k(\vec{x}_i) &= \sum_{i=1}^{N} w\left(\vec{x}_i\right)y_i\phi_k(\vec{x}_i) \;\;\; \text{Insert Linear form of model.}\\

%

\sum_{i=1}^{N} w\left(\vec{x}_i\right)\phi_j(\vec{x}_i)\phi_k(\vec{x}_i)\beta^{*}_j &= \sum_{i=1}^{N} w\left(\vec{x}_i\right)y_i\phi_k(\vec{x}_i) \;\;\; \text{Rearrange left side so $\beta^*_j$ is on right.}\\

%

\sum_{i=1}^{N} Q_{k,i}^{T} W_{i,i} Q_{i,j}\beta^{*}_j &= \sum_{i=1}^{N} Q_{k,i}^{T}W_{i,i}y_i \;\;\;\;\;\;\;\;\;\; \text{Make substitutions where } Q_{i,j} = Q_{j,i}^{T} = \phi_{j}(\vec{x}_i), W_{i,i} = w\left(\vec{x}_i\right)\\

%

Q^{T}WQ\vec{\beta^{*}} &= Q^{T}W\vec{y} \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \text{Convert into Matrix form.}\\

%

\vec{\beta^{*}} &= \left(Q^{T}WQ\right)^{-1}Q^{T}W\vec{y}

\end{align}$$

As we can see, obtaining the optimal parameter vector solution isn’t actually too bad and the result is quite concise when our model is linear with respect to $\vec{\beta}$. Pretty dope!
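To make the final formula a bit more concrete, here is a tiny worked example. The three data points and the two-function basis are made up purely for illustration (they are not the robot problem below): take unit weights, the basis $\lbrace 1, x \rbrace$, and the data $(0,1), (1,3), (2,5)$.

$$\begin{align}
Q &= \begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{bmatrix}, \;\;\; W = I, \;\;\; \vec{y} = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix}\\
Q^{T}WQ &= \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}, \;\;\; Q^{T}W\vec{y} = \begin{bmatrix} 9 \\ 13 \end{bmatrix}\\
\vec{\beta^{*}} &= \begin{bmatrix} 3 & 3 \\ 3 & 5 \end{bmatrix}^{-1}\begin{bmatrix} 9 \\ 13 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}
\end{align}$$

So the fit is $f(x) = 1 + 2x$, which passes through all three points exactly, as you would expect since these points lie on a line.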

So let’s imagine we live in a 1-D world and have some robot that is trying to measure the ground altitude variation with respect to some starting location. This robot, using an Inertial Measurement Unit (IMU) and Global Positioning System (GPS), manages to travel to some desired end location and collect a set of noisy altitude measurements during its travels.

Since the robot has completed its journey, we really want to post process the data and compute the 1-D altitude model relative to the initial starting location. We decide that tackling this using Weighted Least Square Regression is an awesome idea!

For this problem, we will define the following quantities based on our prior definition of Weighted Least Square Regression:

$$\begin{align}

w(x) &= 1\\

y(x) &= 4\left(e^{-(x-4)^2} + e^{-5\cdot10^{-2}(x-10)^2}\right)\\

m(x) &= y(x) + \eta \\

\mathcal{D} &= \lbrace (x_i,m_i): x_i \in [0,10], m_i = m(x_i), i \in [1,300] \rbrace \\

\lbrace \phi_i \rbrace &= \lbrace 1, x, x^2, x^3,x^4,e^{-(x-4)^2}\rbrace \\

\end{align}$$

where $m(\cdot)$ is the measurement function, $m_i \forall i$ represent measurements the robot takes, $\eta$ is a scalar number drawn from a normal distribution with $\mu = 0$ and $\sigma = 10^{-1}$, and $\lbrace \phi_i \rbrace$ is the basis we’re using!

A sample set of measurements the robot gets might look like the following:

So given this, let’s first write the method to generate the sensor data, $\mathcal{D}$:

    function [x,m] = genSensorData(xStart,xEnd,N)
    % This method generates robot sensor data
    % related to altitude measurements along some distance
    x = linspace(xStart,xEnd,N)';
    y = @(x) 4*( exp( -(x-4).^2 ) + exp( -5e-2*(x-10).^2 ) );
    eta = 0.1*randn(size(x));
    m = y(x) + eta;
    end

Next, let’s write the method to do the Weighted Least Square (WLS) Solution for some input data set:

    function beta = WLS(w,x,y)
    % This method generates the optimal parameter vector for a fit
    % using the basis {1, x, x^2, x^3, x^4, exp(-(x-4)^2)}
    W = diag(w(x),0);                                      % diagonal weight matrix
    Q = [ones(size(x)),x,x.^2,x.^3,x.^4,exp(-(x-4).^2)];   % basis evaluated at each x
    Qt = Q';
    T = Qt*W;
    beta = (T*Q)\(T*y);                                    % solve Q'WQ*beta = Q'W*y
    end

Next we will define the weight function, $w(\cdot)$, as the following:

    function weights = w(x)
    % This method generates weights all of value 1.0
    weights = ones(size(x));
    end

Lastly, we will define a function to evaluate the resulting fit at some given set of input values:

    function y = evalFit(x, beta)
    % This method is used to evaluate the fit (quartic polynomial
    % plus Gaussian term) at some set of input locations
    y = zeros(size(x));
    for i = 1:length(beta)-1
        y = y + beta(i)*x.^(i-1);
    end
    y = y + beta(end)*exp( -(x-4).^2 );
    end

Our final script that we can write and run will be the following:

    % Script to generate fit to noisy sensor data from robot
    % Author: Christian Howard

    % start processing
    N = 300;
    [x, m] = genSensorData(0,10,N);
    beta = WLS(@w,x,m);

    % show results
    figure(1)
    plot(x,m,'ro',x,evalFit(x, beta),'b-','LineWidth',2)
    xlabel('X Position (m)','FontSize',16)
    ylabel('Altitude (m)','FontSize',16)
    title('Altitude Model vs Raw Data','FontSize',16)
    legend({'Raw Data','Fit'})
    axis([0,10,0,10])
    % Done processing

When you run this script, you’ll get a graphic with something along these lines:

As one can tell from the above plot, our model is doing pretty well from the looks of it! Definitely captures the trends we would hope to and makes sense! So now that we have tackled solving Linear Least Square problems, it’s time to try and look at Nonlinear Least Square Regression.

As discussed earlier, we described the cost function for Weighted Least Square Regression to be the following:

$$ J\left(\vec{\beta}\right) = \frac{1}{2N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta},\vec{x}_i) - y_i\right)^2 $$

This problem, given that $f(\vec{\beta},\vec{x})$ is nonlinear with respect to $\vec{\beta}$, can now be viewed as an unconstrained optimization problem of a non-convex cost function. The end goal obviously becomes, like earlier, to find the following:

$$\vec{\beta^{*}} = \arg \min J(\vec{\beta})$$

Solving the above problem when $J(\cdot)$ is non-convex is challenging because the cost surface is typically riddled with local minima that aren’t necessarily the global minimum.

As can be seen in the figure above, the circles on the curves represent local minima. In the convex case, the local minimum is the global minimum (the lowest point). In the non-convex case, there are two local minima, but only one of them is the global minimum.

Finding the global minimum is often a challenge because, first, it’s hard to know for sure whether a local minimum you found is the global minimum. Additionally, most algorithms are built to find local minima efficiently, so you typically have to use heuristic strategies to have a better chance at getting near the global minimum (such as multi-start or Bayesian estimation strategies).

Some algorithms for optimizing any non-convex cost function are the following:

- *Gradient/Hessian Based Methods*
  - **Gradient Descent with Constant/Annealing step size**
  - **Gradient Descent with Momentum**
  - **Newton Step Algorithm**
  - **Quasi-Newton Algorithms**
    - Nonlinear Conjugate Gradient
    - Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm
  - **RProp Variants**
    - Note that these methods were originally designed to tackle training Neural Networks, but if you look at the original paper, it’s really a general algorithm that can be applied if you can generate gradients of the cost function with respect to the parameters you’re finding
- *Stochastic Heuristic Global Optimization Algorithms*
  - **Genetic Algorithms**
  - **Particle Swarm Optimization**
  - **Simulated Annealing**
- *Other Algorithms*
  - **Unscented Kalman Filter** for Nonlinear Parameter Estimation

A couple other algorithms tailored to regression styled problems are:

- *Stochastic/Mini-batch Gradient Descent*
  - This method uses either a single or a small number of random data samples from the dataset to estimate the gradient and take a step towards finding a local minimum
  - This approach is used a lot in Machine Learning for big datasets since it is fairly efficient relative to full batch methods, can be scaled via parallel programming environments (GPU, cluster), and it’s simple to implement
  - The randomness of data used to estimate the gradient also gives a potential ability to overcome local optima that would otherwise stop one from making progress to the global optimum
- *Levenberg-Marquardt Method*
  - This is a Newton-like (damped Gauss-Newton) method formulated to use only first derivatives of the model and tackle Nonlinear Least Square problems

One other interesting point, just for reference, is that the above algorithms can also be used to iteratively compute the solution for Linear Least Square problems, if that makes sense for your application.

Now with all of this said, solving a Nonlinear Least Square Regression problem really comes down to defining the cost function, $J(\vec{\beta})$, and picking an optimization algorithm to use to solve it. With that, let’s revisit solving the last example problem but using a Nonlinear Least Square approach.

For this problem, we will define the following quantities based on our prior definition of Weighted Least Square Regression and assumption of a nonlinear model:

$$\begin{align}

w(x) &= 1\\

y(x) &= 4\left(e^{-(x-4)^2} + e^{-5\cdot10^{-2}(x-10)^2}\right)\\

m(x) &= y(x) + \eta \\

\mathcal{D} &= \lbrace (x_i,m_i): x_i \in [0,10], m_i = m(x_i), i \in [1,300] \rbrace \\

f(\vec{\beta},x) &= \beta_1 e^{-\beta_2(x-\beta_3)^2} + \beta_4 e^{-\beta_5(x-\beta_6)^2} \\

\end{align}$$

where $m(\cdot)$ is the measurement function, $m_i \forall i$ represent measurements the robot takes, $\eta$ is a scalar number drawn from a normal distribution with $\mu = 0$ and $\sigma = 10^{-1}$, and $f(\cdot,\cdot)$ is the nonlinear model we’ll be using!

To solve this problem, we’re going to implement a mini-batch gradient descent algorithm. To do this, we need to first make sure we can compute the gradient of the cost function with respect to the model parameters. Earlier, it was noted that this gradient can be computed by using the following equation:

$$ \frac{\partial J}{\partial \beta_k} = \frac{1}{N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta},\vec{x}_i) - y_i\right)\frac{\partial f_i}{\partial \beta_k} \;\;\; \forall k$$

To compute this quantity, we also need the gradients of $f(\cdot,\cdot)$ with respect to the model parameters. These gradient terms are the following:

$$\begin{align}

\frac{\partial f}{\partial \beta_1} &= e^{-\beta_2(x-\beta_3)^2}\\

\frac{\partial f}{\partial \beta_2} &= -\beta_1 (x-\beta_3)^2 e^{-\beta_2(x-\beta_3)^2}\\

\frac{\partial f}{\partial \beta_3} &= 2(x-\beta_3)\beta_1\beta_2 e^{-\beta_2(x-\beta_3)^2}\\

\frac{\partial f}{\partial \beta_4} &= e^{-\beta_5(x-\beta_6)^2}\\

\frac{\partial f}{\partial \beta_5} &= -\beta_4 (x-\beta_6)^2 e^{-\beta_5(x-\beta_6)^2}\\

\frac{\partial f}{\partial \beta_6} &= 2(x-\beta_6)\beta_4 \beta_5 e^{-\beta_5(x-\beta_6)^2}

\end{align}$$
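Hand-derived gradients are an easy place to slip up, so it can be worth checking them numerically before trusting them in an optimizer. Below is a minimal C++ sketch of such a check, comparing the analytic partial derivatives above against a central finite difference. The names `Params`, `model`, `analyticGradient`, `numericGradient`, and `maxGradientError` are hypothetical helpers made up for this sketch, and indices are 0-based (so `b[0]` plays the role of $\beta_1$).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Params = std::array<double, 6>;

// Nonlinear model f(beta, x): the sum of two Gaussian-shaped bumps.
double model(const Params& b, double x) {
    return b[0] * std::exp(-b[1] * (x - b[2]) * (x - b[2])) +
           b[3] * std::exp(-b[4] * (x - b[5]) * (x - b[5]));
}

// Analytic partial derivatives, copied from the formulas above.
Params analyticGradient(const Params& b, double x) {
    const double d1 = x - b[2], d2 = x - b[5];
    const double e1 = std::exp(-b[1] * d1 * d1);
    const double e2 = std::exp(-b[4] * d2 * d2);
    return {e1, -b[0] * d1 * d1 * e1, 2.0 * d1 * b[0] * b[1] * e1,
            e2, -b[3] * d2 * d2 * e2, 2.0 * d2 * b[3] * b[4] * e2};
}

// Central finite-difference estimate of the same gradient.
Params numericGradient(const Params& b, double x, double h = 1e-6) {
    assert(h > 0.0);
    Params g{};
    for (std::size_t k = 0; k < b.size(); ++k) {
        Params up = b, down = b;
        up[k] += h;
        down[k] -= h;
        g[k] = (model(up, x) - model(down, x)) / (2.0 * h);
    }
    return g;
}

// Largest absolute disagreement between the two gradients.
double maxGradientError(const Params& b, double x) {
    const Params a = analyticGradient(b, x);
    const Params n = numericGradient(b, x);
    double worst = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        worst = std::max(worst, std::fabs(a[k] - n[k]));
    }
    return worst;
}
```

If `maxGradientError` comes back tiny for a handful of parameter values and inputs, the formulas are very likely right; a large value usually points at a sign or factor mistake in one of the partials.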

Given the quantities above, let’s start by implementing the nonlinear model. The way we’re going to implement the model is such that it will return the value it computes for the model **AND** the gradient.

    function [fval, grad] = NonlinearModel(x, beta)
    % Method representing the nonlinear model's output and its gradient
    exp1 = exp(-beta(2).*(x-beta(3)).^2);
    exp2 = exp(-beta(5).*(x-beta(6)).^2);
    fval = beta(1)*exp1 + beta(4)*exp2;
    grad = [exp1;
            -beta(1)*(x-beta(3)).^2.*exp1;
            2*(x-beta(3)).*beta(1)*beta(2).*exp1;
            exp2;
            -beta(4)*(x-beta(6)).^2.*exp2;
            2*(x-beta(6))*beta(4)*beta(5).*exp2];
    end

Next, we want to write the function that will do the mini-batch gradient descent step:

    function [cost, dbeta] = MiniBatchStep( x, y, beta, Model, batch_size, step_size)
    % Method to perform a mini-batch based update to the
    % current parameter vector
    N = batch_size;                     % get batch size
    inds = ceil(rand(N,1)*length(x));   % get random batch indices
    xs = x(inds);                       % get mini batch input values
    ys = y(inds);                       % get mini batch output values

    % initialize net gradient vector
    net_grad = zeros(size(beta));
    cost = 0;

    % compute cost and gradient over the mini batch
    for i = 1:N
        [fval,grad] = Model(xs(i),beta);
        % the 1/(2N) factor matches the definition of J
        cost = cost + w(xs(i))*(fval-ys(i))^2/(2*N);
        net_grad = net_grad + (w(xs(i))*(fval-ys(i))*grad/N);
    end

    % compute change in beta
    dbeta = -step_size*net_grad;
    end

Lastly, let’s set up a script to run the training algorithm:

    % Script to train a nonlinear model using noisy data
    % and the mini-batch gradient descent algorithm
    % Author: C. Howard

    %% Get Data
    N = 300;
    [x, m] = genSensorData(0,10,N);

    %% Train Model
    beta = [5,1,3,3,0.1,9]'; % initial guess for beta
                             % this can really affect convergence
    beta0 = beta;
    max_iters = 50000; % set max number of optimization iterations
    beta_soln = zeros(length(beta),max_iters);
    beta_soln(:,1) = beta0;

    % do optimization
    for k = 2:max_iters
        [cost, db] = MiniBatchStep(x, m, beta, @NonlinearModel, 100, 1e-3);
        fprintf('The current cost was found to be J = %f.\n',cost);
        beta = beta + db;
        beta_soln(:,k) = beta;
    end

    %% show results
    % plot comparison and put into gif
    figure(1)
    filename = 'out.gif';
    for i = 1:500:max_iters
        plot(x,m,'ro',x,NonlinearModel(x,beta_soln(:,i)),'b-','LineWidth',2)
        xlabel('X Position (m)','FontSize',16)
        ylabel('Altitude (m)','FontSize',16)
        title( sprintf('Altitude Model vs Raw Data @ Iteration %i',i),'FontSize',16)
        legend({'Raw Data','Fit'})
        axis([0,10,0,10])
        drawnow
        frame = getframe(1);
        im = frame2im(frame);
        [imind,cm] = rgb2ind(im,256);
        if i == 1
            imwrite(imind,cm,filename,'gif','Loopcount',inf);
        else
            imwrite(imind,cm,filename,'gif','WriteMode','append');
        end
        pause(0.01)
    end
    % Done processing

After running the above script, you can produce the following GIF to visualize the convergence:

That turned out pretty cool, huh? Now, while this example worked out okay, one should be careful to note that the convergence of the above solution is fairly dependent on how good the initial guess is. There will likely be times a bad guess will result in sub-optimal results. A visual as to why this is can be seen below:

As stated earlier, there are heuristic approaches to help avoid this issue, such as multi-start algorithms (just running the same optimization with a variety of different initial conditions). However, you will have to experiment with these approaches to see which you feel most comfortable with.
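To give a feel for what a multi-start strategy looks like, here is a minimal C++ sketch. It is not the robot problem: the 1-D cost `cost(b)` below is a made-up non-convex function with one global and one merely local minimum, and `gradientDescent` and `multiStart` are hypothetical helper names chosen for this illustration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1-D non-convex cost with two minima (for illustration only).
double cost(double b) { return (b * b - 1.0) * (b * b - 1.0) + 0.3 * b; }
double costGradient(double b) { return 4.0 * b * (b * b - 1.0) + 0.3; }

// Plain gradient descent with a constant step size from one initial guess.
double gradientDescent(double b0, double step, int iters) {
    double b = b0;
    for (int k = 0; k < iters; ++k) b -= step * costGradient(b);
    return b;
}

// Multi-start: run the same local optimizer from every initial guess and
// keep whichever result has the lowest cost.
double multiStart(const std::vector<double>& guesses, double step, int iters) {
    assert(!guesses.empty());
    double best = gradientDescent(guesses.front(), step, iters);
    for (double g : guesses) {
        const double candidate = gradientDescent(g, step, iters);
        if (cost(candidate) < cost(best)) best = candidate;
    }
    return best;
}
```

Starting only from `b0 = 2.0`, gradient descent settles in the local minimum on the positive side, while something like `multiStart({-2.0, -0.5, 0.5, 2.0}, 0.01, 2000)` also explores the negative side and keeps the lower-cost answer. The price is simply running the optimizer several times, and there is still no guarantee the best of the runs is the global minimum.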

Phew! That was a lot of work but we got through it and hopefully got a better understanding of Linear and Nonlinear Least Square Regression. We learned to implement the algorithms and have some areas we can investigate further, like what optimization methods to use, if we feel up for it!

With this new knowledge, we will now be able to take some data set and take steps towards building a predictive model when the Least Square cost function makes sense!


Please make note that this post will be using things from areas of calculus, differential equations, and computing, so be prepared to do some homework if you read something you’ve never heard of. Any code pieces will be done in C++, too.

Now to get started, let’s first look at solving differential equations that are time varying. A typical system of time varying differential equations can be described in a form, typically referred to as a **State Space** form, where the system is decomposed into only a system of first order differential equations. In mathematical notation, it can be written as:

$$ \frac{d\textbf{q}}{dt} = f\left(t,\textbf{q}\right)$$

where $t$ represents time, $\textbf{q}$ is a vector of quantities, called the state, that are changing as a function of time, and $f(\cdot,\cdot)$ represents a mapping that takes an input of time and the state and produces the associated time derivatives for each state variable. Using this equation, the goal is to then compute $\textbf{q} = \textbf{q}(t)$, meaning find the state variables’ values as a function of time.

One simple way to do this is to approximate the derivative on the left-hand side using the formal definition of a derivative. If we think back to calculus class, a derivative definition can be the following:

$$ \frac{dx}{dt} = \lim_{h \to 0} \frac{x(t+h) - x(t)}{h} $$

For the sake of approximation, instead of making $h$ go to $0$, we can instead choose $h$ to be small. Thus, we end up with the following approximation:

$$ \frac{dx}{dt} \approx \frac{x(t+h) - x(t)}{h} $$

where $h \ll 1$ and the approximation formula is called a first order accurate **Finite Difference** formula for a first order derivative. Given $h$ is also a step in time, you can find the above formula being called the **Explicit Euler** time stepping formula. Now using this formula, we can take the differential equation described above and simplify the equation into the following:
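Before using the formula on a differential equation, it can help to see what "first order accurate" means in practice. The C++ sketch below applies the finite difference formula to $x(t) = \sin(t)$, a test function picked just for illustration because its exact derivative, $\cos(t)$, is known. The names `forwardDifference` and `sineDerivativeError` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// First order accurate forward difference: (x(t+h) - x(t)) / h.
template <class F>
double forwardDifference(F x, double t, double h) {
    assert(h > 0.0);
    return (x(t + h) - x(t)) / h;
}

// Absolute error of the approximation for x(t) = sin(t), whose exact
// derivative is cos(t). The test function is an assumption for illustration.
double sineDerivativeError(double t, double h) {
    const double approx =
        forwardDifference([](double s) { return std::sin(s); }, t, h);
    return std::fabs(approx - std::cos(t));
}
```

Comparing `sineDerivativeError(1.0, 1e-2)` with `sineDerivativeError(1.0, 1e-3)`, the error should shrink by roughly a factor of 10 when $h$ shrinks by a factor of 10. That proportionality between error and $h$ is what the "first order" label refers to.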

$$\begin{align}

\frac{d\textbf{q}}{dt} &= f\left(t,\textbf{q}\right)\\

\frac{\textbf{q}(t+\Delta t) - \textbf{q}(t)}{\Delta t} &\approx f\left(t,\textbf{q}\right)\\

\textbf{q}(t+\Delta t) &= \textbf{q}(t) + \Delta t f\left(t,\textbf{q}(t)\right)\\

\textbf{q}_{i+1} &= \textbf{q}_i + \Delta t f\left(t,\textbf{q}_i\right)

\end{align}$$

where $\textbf{q}(t+j\Delta t) = \textbf{q}_{i+j}$ and $\Delta t$ represents the time step. As you can see based on the final result, we are actually able to, given we know $\textbf{q}_{i}$, estimate $\textbf{q}_{i+1}$. We are on our way to predicting the future using differential equation models!

One thing I should note, which may appear obvious is, we need to actually know some starting $\textbf{q}_{i}$ value to be able to make predictions based on it. This starting $\textbf{q}_{i}$, let’s call it $\textbf{q}_{0}$, is called the **Initial Condition** for the equation. This is required to ensure the solution to this problem is a unique one. Otherwise, we could pick any initial condition and get some different result for each initial condition we try. With that, let’s jump into implementing something!
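As a warm-up, here is a minimal C++ sketch of the time stepping formula on a scalar problem where the exact answer is known. The test equation $\frac{dq}{dt} = -q$ with $q_0 = 1$ (exact solution $e^{-t}$) is an assumption chosen for illustration, and `eulerIntegrate` and `decayAtOne` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Take `steps` Explicit Euler steps of size dt on dq/dt = f(t, q):
// q_{i+1} = q_i + dt * f(t_i, q_i), starting from the initial condition q0.
template <class F>
double eulerIntegrate(F f, double t0, double q0, double dt, int steps) {
    assert(dt > 0.0 && steps >= 0);
    double t = t0, q = q0;
    for (int i = 0; i < steps; ++i) {
        q += dt * f(t, q);
        t += dt;
    }
    return q;
}

// Hypothetical test problem: dq/dt = -q, q(0) = 1, integrated up to t = 1.
double decayAtOne(int steps) {
    const double dt = 1.0 / steps;
    return eulerIntegrate([](double, double q) { return -q; },
                          0.0, 1.0, dt, steps);
}
```

Calling `decayAtOne(100)` and `decayAtOne(1000)` and comparing both against `std::exp(-1.0)` shows the estimate getting closer to the exact value as the time step shrinks, which is a handy sanity check before moving to a system where no exact solution is available.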

For the sake of learning, we’re going to try tackling a simple pendulum problem. Note that the dynamics for a simple pendulum are the following:

$$ \ddot{\theta} + \frac{c}{m} \dot{\theta} + \frac{g}{l} \sin(\theta) = 0$$

where $\dot{\theta}=\frac{d\theta}{dt}$ and $\ddot{\theta} = \frac{d^2\theta}{dt^2}$. Now for those with a differential equations background, you’ll note that this equation is not a first order differential equation, but in fact a second order differential equation (the highest derivative order in the equation is 2 aka the $\ddot{\theta}$). This means, to make our above formulation work using the State Space form, we need to transform this equation. We can transform this equation into a State Space form by doing the following:

$$\begin{align}

[\theta,\dot{\theta}]^{T} &= [x_1, x_2]^{T}\\

\frac{dx_1}{dt} &= x_2\\

\frac{dx_2}{dt} &= \;- \left(\frac{g}{l} \sin(x_1) + \frac{c}{m} x_2 \right)

\end{align}$$

As you can see, we have now taken the original second order equation and broken it into two first order equations, and it actually wasn’t too much work! The first equation states that the derivative of $\theta$ is $\dot{\theta}$, which is true by definition. The second equation then says that the derivative of $\dot{\theta}$, which is equivalent to $\ddot{\theta}$, equals everything else in the original equation moved to the right-hand side. Pretty straightforward!

Using the State Space form of the equation and our time stepping approach from the first part of the blog post, we can put together the following codes:

    // PendulumDynamics.hpp
    #include <vector>
    #include <cmath>

    /* Class representing the pendulum dynamics that will be required
       to estimate the pendulum's state in the future */
    class PendulumDynamics {
    public:
        typedef std::vector< double > vec;

        // dynamics constructor
        PendulumDynamics():c(0.1),m(1.0),g(9.81),l(5.0) {}

        // mathematical f(.,.) operator for dynamics
        void operator()(double t, const vec & q, vec & dqdt){
            double x1 = q[0], x2 = q[1];
            dqdt[0] = x2;
            dqdt[1] = -((g/l)*std::sin(x1) + (c/m)*x2 );
        }

        // setter methods for physical variables
        void setMass(double m_){ m = m_; }
        void setGravity( double g_){ g = g_; }
        void setDampening( double c_){ c = c_; }
        void setLength( double l_){ l = l_; }

    private:
        double c, m, g, l; // physical constants
    };

    // ExplicitEuler.hpp
    #include <vector>
    #include <cstddef>

    namespace integration {

        /* function representing time stepping via Explicit Euler scheme */
        template< class Dynamics >
        void explicitEuler( Dynamics & dyn, double t, double dt,
                            const typename Dynamics::vec & q_old,
                            typename Dynamics::vec & q_new )
        {
            // evaluate dq/dt at the old state before writing q_new,
            // so q_old and q_new may refer to the same vector
            typename Dynamics::vec dqdt(q_old.size(),0);
            dyn(t,q_old,dqdt);
            for(std::size_t i = 0; i < q_old.size(); ++i){
                q_new[i] = q_old[i] + dt*dqdt[i];
            }
        }

    }// end integration namespace

    #include "ExplicitEuler.hpp"
    #include "PendulumDynamics.hpp"
    #include <cmath>   // M_PI (may need _USE_MATH_DEFINES on some compilers)
    #include <cstdio>

    int main( int argc, char** argv ){

        // define constants
        double rad2deg = 180.0/M_PI;

        // define the dynamics
        PendulumDynamics pendulum;
        pendulum.setMass(2.0);
        pendulum.setDampening(1.0);

        // define the time bounds
        double time = 0; // initial value is starting time
        double endTime = 20;
        double dt = 1e-2;

        // define initial state
        PendulumDynamics::vec q(2,0);
        q[0] = M_PI/6.0; // initial pendulum angle is pi/6 radians = 30 degrees
        q[1] = 0.0;      // pendulum has zero initial angular rate (radian/second)

        // print initial condition
        printf("q(%lf) = [%lf degrees, %lf deg/s]\n",time,q[0]*rad2deg,q[1]*rad2deg);

        // do time stepping
        while( time < endTime ){
            integration::explicitEuler(pendulum, time, dt, q, q);
            time += dt;
            printf("q(%lf) = [%lf degrees, %lf deg/s]\n",time,q[0]*rad2deg,q[1]*rad2deg);
        }

        // finish
        return 0;
    }

After putting these codes together, you should compile and get a result along the lines of the following when you run it:

    q(0.000000) = [30.000000 degrees, 0.000000 deg/s]
    q(0.010000) = [30.000000 degrees, -0.562072 deg/s]
    q(0.020000) = [29.994379 degrees, -1.121333 deg/s]
    q(0.030000) = [29.983166 degrees, -1.677702 deg/s]
    q(0.040000) = [29.966389 degrees, -2.231099 deg/s]
    q(0.050000) = [29.944078 degrees, -2.781444 deg/s]
    q(0.060000) = [29.916263 degrees, -3.328658 deg/s]
    q(0.070000) = [29.882977 degrees, -3.872663 deg/s]
    q(0.080000) = [29.844250 degrees, -4.413382 deg/s]
    q(0.090000) = [29.800116 degrees, -4.950738 deg/s]
    q(0.100000) = [29.750609 degrees, -5.484656 deg/s]
    q(0.110000) = [29.695763 degrees, -6.015062 deg/s]
    q(0.120000) = [29.635612 degrees, -6.541881 deg/s]
    q(0.130000) = [29.570193 degrees, -7.065040 deg/s]
    q(0.140000) = [29.499543 degrees, -7.584468 deg/s]
    q(0.150000) = [29.423698 degrees, -8.100092 deg/s]
    q(0.160000) = [29.342697 degrees, -8.611843 deg/s]
    q(0.170000) = [29.256579 degrees, -9.119650 deg/s]
    q(0.180000) = [29.165382 degrees, -9.623444 deg/s]
    q(0.190000) = [29.069148 degrees, -10.123158 deg/s]
    q(0.200000) = [28.967916 degrees, -10.618724 deg/s]
    .. etc

Which, when plotted, gives you the following:

What the figure above shows is we have simulated the motion of a pendulum and can now predict things related to its motion, like its angle at some point in time or its angular velocity!

Now while this example is fairly simple, the code above could be modified to use different dynamics, instead of the pendulum, while still using the same integration code. This could in turn allow someone to make predictions based on other time dependent differential equations!

In this blog post, you got an idea of how to implement and tackle using models based on time varying differential equations. We covered how to implement and use the simple Explicit Euler time integration scheme to make future predictions of dynamical systems, such as a pendulum.

The content of this post is introductory, since there is much more that could be learned in this subject. For example, there are many more methods out there for integrating differential equations, some examples being the **Implicit Euler** scheme, the **Runge-Kutta $4^{th}$ Order** scheme, and the **Trapezoidal** scheme. Additionally, the choice of time step, $\Delta t$, is typically constrained so that the solution remains numerically stable. The theory going into finding and obeying this constraint has not been touched here, but may be in future posts.
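Without going into that theory, a quick experiment gives a feel for the time step constraint. For the simple test equation $\frac{dq}{dt} = -\lambda q$ (an assumption picked for illustration, not the pendulum), each Explicit Euler step multiplies $q$ by $(1 - \lambda \Delta t)$, so the numerical solution only stays bounded when $|1 - \lambda \Delta t| \leq 1$. The C++ sketch below, with the hypothetical helper `eulerDecay`, lets you try time steps on either side of that limit.

```cpp
#include <cassert>
#include <cmath>

// Apply `steps` Explicit Euler updates to dq/dt = -lambda * q from q(0) = 1.
// Each update multiplies q by (1 - lambda * dt).
double eulerDecay(double lambda, double dt, int steps) {
    assert(lambda > 0.0 && dt > 0.0 && steps >= 0);
    double q = 1.0;
    for (int i = 0; i < steps; ++i) q += dt * (-lambda * q);
    return q;
}
```

With $\lambda = 10$, a call like `eulerDecay(10.0, 0.05, 50)` decays towards zero like the true solution does, while `eulerDecay(10.0, 0.25, 50)` grows in magnitude every step even though the exact solution is still decaying. More complicated systems have their own limits, but the flavor of the issue is the same.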

Lastly, the example of a pendulum is really only an **Ordinary Differential Equation**. Other equations that could use this type of approach are **Partial Differential Equations**, which explicitly take into account spatial dimensions as well. Examples of a Partial Differential Equation are the **Transient Heat Equation** and the **Navier-Stokes** equation.

In future blog posts, I may dive into some of the details of these things that weren’t covered. But in the meantime, happy coding!

What will the weather be like today?

Will there be traffic on my way to work?

Will I finally get the recognition and promotion I have been working for?

In life, there are many things we don’t have enough data to predict. To us, many things might appear like chance because we don’t understand the long-term effect of choices and actions we take that might impact our life trajectories. I like to think of life as a complex system of differential algebraic equations, which we are very far from ever understanding in a mathematical sense.

However, in many areas, such as science and engineering, models have been discovered and used to make predictions of many phenomena, ranging from simple predictions of pendulum motion to complicated predictions of the stable flight of the space shuttle (**how else can they know the guidance systems should work?**).

So how are these predictions made, one might ask? The answer generally lies in how the model is formulated. For example, let’s imagine we have a statistical model that can state the probability that someone of a particular age will go clubbing on a Friday night, defined as the following:

$$

\begin{align}

p &= \text{WillGoClubbing}(\text{Age})\\

\text{where } p &\in [0,1]

\end{align}

$$

In this case, we could state a probability that a person will go clubbing on a Friday night given their age. We could even use this to find the age someone is expected to be if they are going clubbing! Pretty interesting way to make predictions. These sorts of statistical predictions are done all the time in areas such as Machine Learning, Quantitative Finance, and even areas like Missile Performance (**shameless plug since I work on this stuff.. okay?**).

However, these aren’t the only types of predictions that are made. For example, how do we predict the flight of a rocket, or the weather, or the trajectory of a bullet? How do we predict earthquakes or estimate where hurricanes will end up? Typically, these sorts of problems are tackled using models based on differential equations, some time varying and others steady state (**meaning they won’t change over time**). Below is an example of a few differential equations that you might find being simulated:

$$

\begin{align}

m \vec{a} = \sum_i^n \vec{F}_i

\end{align}

$$

$$

\begin{align}

\frac{\partial T}{\partial t} - \nabla \cdot \nabla T = g(t,T)

\end{align}

$$

$$

\begin{align}

\frac{\partial}{\partial t}\left(\rho \textbf{u}\right) + \nabla \cdot \left(\rho \textbf{u} \otimes \textbf{u} + p\textbf{I} \right) = \nabla \cdot \mathbf{\tau} + \rho \textbf{g}

\end{align}

$$

Many of these equations are used today by researchers to make predictions of very complicated and sensitive phenomena. Depending on the problem, solving these equations accurately, such as Navier-Stokes, can require resources like clusters or supercomputers that run for weeks or months and generate enough data to leave researchers working to understand the results for months or more. These sorts of activities are definitely no cakewalk.

With all this said, there are still many things out there we don’t have the models or data for to make adequate predictions. Additionally, many of the models we do have are idealized or incomplete enough that they don’t always capture the real world phenomena accurately. This is why at times the weather predictions are wrong or why we can’t predict the stock market too well. The models used to make these predictions just break down over long periods of time, whether by assumptions, incomplete modeling, or not adequately taking into account transient inputs to the dynamical system.

Fortunately, due to the great abundance of data becoming available (think Big Data) and the great advancements in AI, models are being built statistically that can make improved predictions of many things: from what you’ll likely want to purchase, to what types of shows you might like based on the types of things you watch, to predicting captions for pictures, and more. Using these new techniques and data, more precise models are being empirically created and understood, helping pave the way to understanding more complicated phenomena down the road.

After explaining a few of the fundamentals in prediction and how it’s used, I hope in future posts to dive into algorithms that can be built to help make predictions of various kinds. In the meantime, thanks for reading and best wishes.
