Predicting the Future – Building Models with Data & Least Square Approach

When I look at the world we have now, especially after the emergence of Big Data, I just see everyone freaking out over needing to collect and learn from data.. And fortunately, we have more data available than we can handle!

Data Everywhere

As we strive to piece together the data into something that can make useful predictions, we can find ourselves wondering,

How can we build models based on the data?

One of the most common ways to build a model is based on something called Least Square Regression, which essentially corresponds to finding a parametric model that minimizes the mean squared error between the model output and expected outputs in the dataset.

In this blog post, we’re going to go over some of the fun math involved with Least Square Regression (both linear and nonlinear), discuss some basic approaches to solving these problems, and tackle some sample problems!

Note that it is expected the reader is comfortable with calculus and basic linear algebra. The codes written below are also done in the language MATLAB. If you have any questions about what you read, please let me know in a comment below!


Mathematical Intro

So before we hit up solving some problem, we need to step through some of the fun theory! So, one of the first things we should define when tackling some regression problem is the actual model we are building. In this case, the model $f(\cdot,\cdot)$ will be defined as the following mapping:

$$ f: \mathbb{R}^{m} \times \mathbb{R}^{n} \rightarrow \mathbb{R}$$

In English, this is saying $f(\cdot,\cdot)$ takes two inputs, the first being an $m$ dimensional vector, the latter being an $n$ dimensional vector. This mapping then produces and returns a scalar value based on the two inputs. In my definition, the parameter vector $\vec{\beta}$, which is what will be solved for using regression, is the first parameter and thus $\vec{\beta} \in \mathbb{R}^{m}$. The second input is a given input data vector, defined as $\vec{X} \in \mathbb{R}^{n}$, which is associated to some output scalar value $Y \in \mathbb{R}$. The end goal is to end up with a $\vec{\beta}$ that, when plugged into $f(\cdot,\cdot)$, can estimate the correct value of $Y$ for some input $\vec{X}$.

Now given a set of data, $\mathcal{D} = \lbrace (\vec{x}_i,y_i) \in \mathbb{R}^{n} \times \mathbb{R} : i \in[1,N] \rbrace$, we wish to compute the value of $\vec{\beta}$ such that the following cost function is minimized:

$$ J\left(\vec{\beta}\right) = \frac{1}{2N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta},\vec{x}_i) – y_i\right)^2 $$

where $w\left(\cdot\right)$ is a weight function that produces a positive scalar weight value based on the input $\vec{x}_i$. In this cost function, we are aiming to estimate a value for $\vec{\beta}$ that minimizes the weighted squared error between the model and the known data, typically referred to Weighted Least Squares. I have the weight function, $w\left(\cdot\right)$, in there for generalizing the solution and because it can be useful at times to weight certain data more than others.

Now given the above cost function, the goal is to solve the above optimization problem, hoping our parametric model can learn to represent the trends in the data well. The challenge of solving this optimization problem is dependent on the form of $f(\cdot,\cdot)$, where a $f(\cdot,\cdot)$ that is linear with respect to $\vec{\beta}$ will produce a convex optimization problem but will otherwise create a non-convex optimization problem. The former is simple to solve, while the latter is much more difficult. The reasoning non-convex optimization algorithms are trickier is due to the presence of various local optima that you can get stuck at while looking for the global optimum.

convex vs nonconvex

The Linear Case

Solution Derivation

Since our model is linear with respect to $\vec{\beta}$, this means our model is of the form:

$$ f(\vec{\beta},\vec{x}) = \sum_{j=1}^{m} \beta_j \phi_j(\vec{x})$$

where $\beta_j$ is the $j^{th}$ component of $\vec{\beta}$ and $\phi_j(\cdot)$ is the $j^{th}$ basis function of this linear model. Please note that a basis function is just a function that is part of a set of functions, called a basis, that we suspect is a sufficient finite dimensional representation of our solution. Now one additional identity that will be useful is the derivative of our model with respect to one of the parameters $\beta_j$:

$$ \left.\frac{\partial f}{\partial \beta_k}\right|_{(\vec{\beta},\vec{x}_i)} = \frac{\partial f_i}{\partial \beta_k} = \phi_k(\vec{x}_i)$$

As stated earlier, the cost function we’re using is the Weighted Least Square cost function:

$$ J\left(\vec{\beta}\right) = \frac{1}{2N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta},\vec{x}_i) – y_i\right)^2 $$

We know from calculus that finding an optimum of some function requires taking a derivative of the function, with respect to the variable of interest, and finding where the derivative equates to $0$. Following this logic, we can obtain the solution via the following steps using indicial notation:

\frac{\partial J}{\partial \beta_k} = 0 &= \frac{1}{N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta^{*}},\vec{x}_i) – y_i\right)\frac{\partial f_i}{\partial \beta_k} \;\;\; \forall k\\
0 &= \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta^{*}},\vec{x}_i) – y_i\right)\phi_k(\vec{x}_i)\\
\sum_{i=1}^{N} w\left(\vec{x}_i\right)f(\vec{\beta^{*}},\vec{x}_i)\phi_k(\vec{x}_i) &= \sum_{i=1}^{N} w\left(\vec{x}_i\right)y_i\phi_k(\vec{x}_i) \;\;\; \text{Put unknowns on left.}\\
\sum_{i=1}^{N} w\left(\vec{x}_i\right)\beta^{*}_j \phi_j(\vec{x}_i)\phi_k(\vec{x}_i) &= \sum_{i=1}^{N} w\left(\vec{x}_i\right)y_i\phi_k(\vec{x}_i) \;\;\; \text{Insert Linear form of model.}\\
\sum_{i=1}^{N} w\left(\vec{x}_i\right)\phi_j(\vec{x}_i)\phi_k(\vec{x}_i)\beta^{*}_j &= \sum_{i=1}^{N} w\left(\vec{x}_i\right)y_i\phi_k(\vec{x}_i) \;\;\; \text{Rearrange left side so $\beta^*_j$ is on right.}\\
\sum_{i=1}^{N} Q_{k,i}^{T} W_{i,i} Q_{i,j}\beta^{*}_j &= \sum_{i=1}^{N} Q_{k,i}^{T}W_{i,i}y_i \;\;\;\;\;\;\;\;\;\; \text{Make substitutions where } Q_{i,j} = Q_{j,i}^{T} = \phi_{j}(\vec{x}_i), W_{i,i} = w\left(\vec{x}_i\right)\\
Q^{T}WQ\vec{\beta^{*}} &= Q^{T}W\vec{y} \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \text{Convert into Matrix form.}\\
\vec{\beta^{*}} &= \left(Q^{T}WQ\right)^{-1}Q^{T}W\vec{y}

As we can see, obtaining the optimal parameter vector solution isn’t actually too bad and the result is quite concise when our model is linear with respect to $\vec{\beta}$. Pretty dope!


Example Problem – Situation

So let’s imagine we live in a 1-D world and have have some robot that is trying to measure the ground altitude variation with respect to some starting location. This robot, using an Inertial Measurement Unit (IMU) and Global Positioning System (GPS), manages to travel to some desired end location and collect a set of noisy altitude measurements during its travels.


Since the robot has completed its journey, we really want to post process the data and compute the 1-D altitude model relative to the initial starting location. We decide that tackling this using Weighted Least Square Regression is an awesome idea!

Example Problem – Mathematical Definition

For this problem, we will define the following quantities based on our prior definition of Weighted Least Square Regression:

w(x) &= 1\\
y(x) &= 4\left(e^{-(x-4)^2} + e^{-5\cdot10^{-2}(x-10)^2}\right)\\
m(x) &= y(x) + \eta \\
\mathcal{D} &= \lbrace (x_i,m_i): x_i \in [0,10], m_i = m(x_i), i \in [1,300] \rbrace \\
\lbrace \phi_i \rbrace &= \lbrace 1, x, x^2, x^3,x^4,e^{-(x-4)^2}\rbrace \\

where $m(\cdot)$ is the measurement function, $m_i \forall i$ represent measurements the robot takes, $\eta$ is a scalar number drawn from a normal distribution with $\mu = 0$ and $\sigma = 10^{-1}$, and $\lbrace \phi_i \rbrace$ is the basis we’re using!

A sample set of measurements the robot gets might look like the following:


So given this, let’s first write the method to generate the sensor data, $\mathcal{D}$:

function [x,m] = genSensorData(xStart,xEnd,N)
% This method generates robot sensor data
% related to altitude measurements along some distance
x   = linspace(xStart,xEnd,N)';
y   = @(x) 4*( exp( -(x-4).^2 ) + exp( -5e-2*(x- 10).^2 ));
eta = 0.1*randn(size(x));
m   = y(x) + eta;

Next, let’s write the method to do the Weighted Least Square (WLS) Solution for some input data set:

function beta = WLS(w,x,y)
% This method generates optimal parameter vector
% for quartic polynomial fit to data
W = diag(w(x),0);
Q = [ones(size(x)),x,x.^2,x.^3,x.^4,exp(-(x-4).^2)];
Qt= Q';
T = Qt*W;
beta = (T*Q)\(T*y);

Next we will define the weight function, $w(\cdot)$, as the following:

function weights = w(x)
% This method generates weights all of value 1.0
weights = ones(size(x));

Lastly, we will define a function to evaluate the resulting fit at some given set of input values:

function y = evalFit(x, beta)
% This method is used to evaluate polynomial fit 
% at some set of input locations
y = zeros(size(x));
for i = 1:length(beta)-1
    y = y + beta(i)*x.^(i-1);
y = y + beta(end)*exp( -(x-4).^2 );


Our final script that we can write and run will be the following:

% Script to generate fit to noisy sensor data from robot
% Author: Christian Howard
% start processing

N = 300;
[x, m] = genSensorData(0,10,N);
beta = WLS(@w,x,m);

% show results
plot(x,m,'ro',x,evalFit(x, beta),'b-','LineWidth',2)
xlabel('X Position (m)','FontSize',16)
ylabel('Altitude (m)','FontSize',16)
title('Altitude Model vs Raw Data','FontSize',16)
legend({'Raw Data','Fit'})

% Done processing

When you run this script, you’ll get a graphic with something along these lines:

As one can tell from the above plot, our model is doing pretty well from the looks of it! Definitely captures the trends we would hope to and makes sense! So now that we have tackled solving Linear Least Square problems, it’s time to try and look at Nonlinear Least Square Regression.

The Nonlinear Case

Solution Discussion

As discussed earlier, we described the cost function for Weighted Least Square Regression to be the following:
$$ J\left(\vec{\beta}\right) = \frac{1}{2N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta},\vec{x}_i) – y_i\right)^2 $$

This problem, given that $f(\vec{\beta},\vec{x})$ is nonlinear with respect to $\vec{\beta}$, can now be viewed as an unconstrained optimization problem of a non-convex cost function. The end goal obviously becomes, like earlier, to find the following:

$$\vec{\beta^{*}} = \arg \min J(\vec{\beta})$$

Solving the above problem when $J(\cdot)$ is non-convex is a challenging problem because typically the cost surface is riddled with local minima that aren’t necessarily the global minima.


As can be seen in the figure above, the circles on the curves represent local minima. In the convex case, the local minima is the global minima (the lowest point). In the non-convex case, there’s two local minima, but one is the global minima while the other is not.

Finding the global minima is often a challenge because, first, it’s hard to know for sure whether a local minima you found is the global minima. Additionally, most algorithms are built to find local minima efficiently, so you typically have to do heuristic strategies to have a better chance at getting near the global minimum (such as multi-start or Bayesian estimation strategies).

Some algorithms for optimizing any non-convex cost function are the following:

  • Gradient/Hessian Based Methods
    • Gradient Descent with Constant/Annealing step size
    • Gradient Descent with Momentum
    • Newton Step Algorithm
    • Quasi-Newton Algorithms
      • Nonlinear Conjugate Gradient
      • Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm
    • RProp Variants
      • Note that these methods were originally designed to tackle training Neural Networks, but if you look at the original paper, it’s really a general algorithm that can be applied if you can generate gradients of the cost function with respect to the parameters you’re finding
  • Stochastic Heuristic Global Optimization Algorithms
    • Genetic Algorithms
    • Particle Swarm Optimization
    • Simulated Annealing
  • Other Algorithms
    • Unscented Kalman Filter for Nonlinear Parameter Estimation

A couple other algorithms tailored to regression styled problems are:

  • Stochastic/Mini-batch Gradient Descent
    • This method uses either a single or a small number of random data samples from the dataset to estimate the gradient and take a step towards finding a local minima
    • This approach is used a lot in Machine Learning for big datasets since it is fairly efficient relative to full batch methods, can be scaled via parallel programming environments (GPU, cluster), and it’s simple to implement
    • The randomness of data used to estimate the gradient also gives a potential ability to overcome local optima that would otherwise stop one from making progress to the global optimum
  • Levenberg-Marquardt Method
    • This a Quasi-Newton method formulated to use only gradients and tackle Nonlinear Least Square problems

One other interesting point, just for reference, is that the above algorithms can also be used to iteratively compute the solution for Linear Least Square problems, if that makes sense for your application.

Now with all of this said, solving a Nonlinear Least Square Regression problem really comes down to defining the cost function, $J(\vec{\beta})$, and picking an optimization algorithm to use to solve it. With that, let’s revisit solving the last example problem but using a Nonlinear Least Square approach.

Example Problem – Mathematical Definition

For this problem, we will define the following quantities based on our prior definition of Weighted Least Square Regression and assumption of a nonlinear model:

w(x) &= 1\\
y(x) &= 4\left(e^{-(x-4)^2} + e^{-5\cdot10^{-2}(x-10)^2}\right)\\
m(x) &= y(x) + \eta \\
\mathcal{D} &= \lbrace (x_i,m_i): x_i \in [0,10], m_i = m(x_i), i \in [1,300] \rbrace \\
f(\vec{\beta},x) &= \beta_1 e^{-\beta_2(x-\beta_3)^2} + \beta_4 e^{-\beta_5(x-\beta_6)^2} \\

where $m(\cdot)$ is the measurement function, $m_i \forall i$ represent measurements the robot takes, $\eta$ is a scalar number drawn from a normal distribution with $\mu = 0$ and $\sigma = 10^{-1}$, and $f(\cdot,\cdot)$ is the nonlinear model we’ll be using!

To solve this problem, we’re going to implement a mini-batch gradient descent algorithm. To do this, we need to first make sure we can compute the gradient of the cost function with respect to the model parameters. Earlier, it was noted that this gradient can be computed by using the following equation:

$$ \frac{\partial J}{\partial \beta_k} = \frac{1}{N} \sum_{i=1}^{N} w\left(\vec{x}_i\right)\left(f(\vec{\beta^{*}},\vec{x}_i) – y_i\right)\frac{\partial f_i}{\partial \beta_k} \;\;\; \forall k$$

To compute this quantity, we also need the gradients of $f(\cdot,\cdot)$ with respect to the model parameters. We can find these gradient terms are the following:
\frac{\partial f}{\partial \beta_1} &= e^{-\beta_2(x-\beta_3)^2}\\
\frac{\partial f}{\partial \beta_2} &= -\beta_1 (x-\beta_3)^2 e^{-\beta_2(x-\beta_3)^2}\\
\frac{\partial f}{\partial \beta_3} &= 2(x-\beta_3)\beta_1\beta_2 e^{-\beta_2(x-\beta_3)^2}\\
\frac{\partial f}{\partial \beta_4} &= e^{-\beta_5(x-\beta_6)^2}\\
\frac{\partial f}{\partial \beta_5} &= -\beta_4 (x-\beta_6)^2 e^{-\beta_5(x-\beta_6)^2}\\
\frac{\partial f}{\partial \beta_6} &= 2(x-\beta_6)\beta_4 \beta_5 e^{-\beta_5(x-\beta_6)^2}

Given the quantities above, let’s start by implementing the nonlinear model. The way we’re going to implement the model is such that it will return the value it computes for the model AND the gradient.

function [fval, grad] = NonlinearModel(x, beta)
% Method representing the nonlinear model's output and its gradient
  exp1 = exp(-beta(2).*(x-beta(3)).^2);
  exp2 = exp(-beta(5).*(x-beta(6)).^2);

  fval = beta(1)*exp1 + beta(4)*exp2;
  grad = [exp1;

Next, we want to compute the function that will do the mini-batch gradient descent step:

function [cost, dbeta] = MiniBatchStep( x, y, beta, Model, batch_size, step_size)
% Method to perform a mini-batch based update to the
% current parameter vector
N = batch_size;         % get batch size
inds =ceil(rand(N,1)*length(x)); % get random batch indices
xs = x(inds);               % get mini batch input values
ys = y(inds);                % get mini batch output values

% initialize net gradient vector
net_grad = zeros(size(beta));
cost = 0;

% compute gradient
for i = 1:N
  [fval,grad] = Model(xs(i),beta);
  cost = cost + w(xs(i))*(fval-ys(i))^2/N;
  net_grad = net_grad + (w(xs(i))*(fval-ys(i))*grad/N);

% compute change in beta
dbeta = -step_size*net_grad;


Lastly, let’s setup a script to run the training algorithms:

% Script to train a nonlinear model using noisy data
% and the mini-batch gradient descent algorithm
% Author: C. Howard
% start processing

%% Get Data
N = 300;
[x, m] = genSensorData(0,10,N);

%% Train Model
beta = [5,1,3,3,0.1,9]'; % initial guess for beta
                         % this can really affect convergence
beta0 = beta;
max_iters = 50000; % set max number of optimization iterations
beta_soln = zeros(length(beta),max_iters);
beta_soln(:,1) = beta0;

% do optimization
for k = 2:max_iters
  [cost, db] = MiniBatchStep(x, m, beta, @NonlinearModel, 100, 1e-3);
  msg = sprintf('The current cost was found to be J = %f.',cost)
  beta = beta + db;
  beta_soln(:,k) = beta;

%% show results
% plot comparison and put into gif
filename = 'out.gif';

for i = 1:500:max_iters
    xlabel('X Position (m)','FontSize',16)
    ylabel('Altitude (m)','FontSize',16)
    title( sprintf('Altitude Model vs Raw Data @ Iteration %i',i),'FontSize',16)
    legend({'Raw Data','Fit'})
    frame = getframe(1);
    im = frame2im(frame);
    [imind,cm] = rgb2ind(im,256);
    if i == 1;
      imwrite(imind,cm,filename,'gif', 'Loopcount',inf);

% Done processing

After running the above script, you can produce the following GIF to visualize the convergence:

That turned out pretty cool, huh? Now, while this example worked out okay, one should be careful to note that the convergence of the above solution is fairly dependent on how good the initial guess is. There will likely be times a bad guess will result in sub-optimal results. A visual as to why this is can be seen below:


As stated earlier, there are heuristic approaches to helping avoid this issue, such as multi-start algorithms (just running the same optimization with a variety of different initial conditions). However, one will have to experiment with these approaches to see what you feel most comfortable with.


Phew! That was a lot of work but we got through it and hopefully got a better understanding of Linear and Nonlinear Least Square Regression. We learned to implement the algorithms and have some areas we can investigate further, like what optimization methods to use, if we feel up for it!

With this new knowledge, we will now be able to take some data set and take steps towards building a predictive model when the Least Square cost function makes sense!


6 comments Add yours
  1. First of all, thank you for your great and nice works! I like your blog and your reasearches which you updated in I’m learning about some model predictive control and neural-filters and i would like to ask you about references books relate to these problems. At least, in “MissSim” which you’ve updated i cannot find the history files in your folder with this line code
    “std::string historyFile(“/Users/christianjhoward/history.txt”
    Please can you add some initial condition files to run your works for me and other readers?
    Thank you and wish you all the best and happiness!

    1. Hey Tom,

      Thanks for the compliment, glad you’re getting something out of this! In terms of MPC, I personally don’t have much experience implementing things with that yet. I could certainly look up things related to it, but that wouldn’t be too useful since you could as well. In terms of Neural-Filters, the only stuff I have seen previously was ADELINE styled filters. I have tinkered with that and implemented some for time-series estimation, though I have found using nonlinear activation functions can be quite good compared to just a linear approach. Where are you at with those topics?

      In terms of the MissileSim code I have on GitHub, the “history.txt” file is actually the output file of the simulation. You could change that path in there to something you want and have it output data. At the moment, that code base doesn’t use an input file since I still haven’t setup certain models (like the thrust model and sensor models for the missile), so technically you could take the code and run with it as long as you have the dependencies (which might be some of my other projects on my GitHub).

      Let me know if you have any other comments or questions!

  2. Dear Christian,
    Thank you for your message, i’m glad to read something useful from your explain. I have tried to create a project in Visual C++ 2013 by using your code in the terms of the “missile Simulator” as you guided but there are some errors occurred and i really don’t know how to debug them. I need your help to successfully compile this work. I usually using the Matlab/Simulink software to simulate flight dynamics and control based on FDC model but it’s not good idea to ensure simulation in real time. I will post all of errors occurred bellow and please help me to solve them. Thank you very much.

    error C2872: ‘vec3’ : ambiguous symbol
    i think this error occurred in the function :
    NED CoordTransforms::NEDtoENU(const NED & vec){

    return vec3(vec[1], vec[0], -vec[2]);
    and when i tried to change the return of function as:
    return NED(vec[1], vec[0], -vec[2]);
    After that there are some errors generated:

    error LNK2005: “public: double __thiscall Fraction::convert(void)const ” (??$convert@N@Fraction@@QBENXZ) already defined in Fraction.obj

    error LNK2005: “public: double __thiscall Fraction::convert(void)const ” (??$convert@N@Fraction@@QBENXZ) already defined in Fraction.obj

    error LNK1169: one or more multiply defined symbols found \Visual Studio 2013\projectcu\MissileSim_Chris\Debug\MissileSim_Chris.exe 1 1 MissileSim_Chris

    I appreciate you taking the time to read my questions. Thank you again!

    1. First, you should consider putting up this project and the code into some Github project so I can investigate your setup.

      Second, the errors I see are partly due to linking issues where it thinks you’re trying to generate object code that has already been made. The ambiguity might also be because vec3 is defined a second time in some other part of the code, making its use ambiguous. I would have to see the Visual Studio project to be sure.


    1. Hi Tom,
      Sorry for the delay, I have had a lot going on for the last month. I will probably setup a Visual Studio project (using VS 2015 though) that you can download soon so you can have something working on Windows. It might take until early January, which is when I will have some time off from work.

So what do you want to tell me?