## Introduction

Welcome back! In a previous discussion on artificial intelligence (AI), we touched on various machine learning (ML) methods. Within ML, we listed a variety of techniques we could use to model, sort, and classify data. Of these, linear and logistic regressions were the first options that came to mind; this is not surprising, since much of ML hinges on data reduction and regression analysis. In order to build a solid foundation, let’s take a look at the simplest case: linear regressions.

### Regressions

To better understand linear and logistic regressions, let’s imagine we own a company, and our company makes a Widget. To see how we’re doing, we ask our sales team to keep track of how many sales they make over nine months. If our company is successful, we would expect the total number of sales to increase as the year progresses. But here’s a more interesting question: Is there a relationship between the month and the total sales we’ve made? Let’s say our sales team collects the following data:

| Month | Total sales |
| --- | --- |
| 1 | 5 |
| 2 | 16 |
| 3 | 31 |
| 4 | 59 |
| 5 | 62 |
| 6 | 78 |
| 7 | 90 |
| 8 | 98 |
| 9 | 101 |

We can plot the number of sales as a function of time, and that might begin to get us the answer we’re looking for. But we’d really like to fit some sort of curve to the data to see if we can approximately describe the relationship with an equation. But what kind of curve? This is the difference between a linear regression (“linear” implying we fit a straight line) and a logistic regression (“logistic” implying we fit an S-shaped logistic curve, which is typically used to model the probability of a yes/no outcome rather than a steadily growing quantity).

#### Linear Regression: Implementation

As the name implies, in a linear regression we assume the best fit to our dataset is a straight line. From algebra, we recall that any straight line can be defined by the equation $Y = m x + b$, where $m$ is the slope and $b$ is the Y-axis offset. We can easily perform a linear regression to fit our dataset using linear algebra, starting from two matrices:

$$month=\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & 9 \end{bmatrix}, \qquad sales=\begin{bmatrix} 5 \\ 16 \\ 31 \\ \vdots \\ 101 \end{bmatrix}$$
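
To make the link between the line equation and these matrices explicit, the model can be written as a single matrix equation. This is a sketch under the assumption that the column of ones pairs with the offset $b$ and the column of month numbers pairs with the slope $m$; with nine data points and only two unknowns, the equality can generally hold only approximately, which is why we look for a best fit:

$$
\underbrace{\begin{bmatrix} 5 \\ 16 \\ 31 \\ \vdots \\ 101 \end{bmatrix}}_{sales}
\approx
\underbrace{\begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ \vdots & \vdots \\ 1 & 9 \end{bmatrix}}_{month}
\begin{bmatrix} b \\ m \end{bmatrix}
$$

Each row reads $sales_i \approx b \cdot 1 + m \cdot month_i$, which is just $Y = m x + b$ evaluated at one month.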

Effectively, these matrices are the $Y$ (sales) and $x$ (month) variables in the equation for our line; the column of ones in $month$ lets the fit include the offset $b$. We can solve for $m$ and $b$ by left-dividing $sales$ by $month$ with Julia’s backslash operator, which returns the least-squares solution, as in the Julia code below. The two coefficients come back in our variable $V$: the first entry is $b$ and the second is $m$! Once we have these, we can plot the line representing our linear fit:

```julia
using PyPlot

# Create table of sales values (the column of ones accounts for the offset b)
month = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6; 1 7; 1 8; 1 9]
sales = [5; 16; 31; 59; 62; 78; 90; 98; 101]

# Determine the linear fit: V[1] is the offset b, V[2] is the slope m
V = month\sales
r = range(1, 9, step=0.1)
linear_fit = V[1] .+ r.*V[2]

# Plot points and linear fit
figure()
scatter(month[:,2], sales, c="blue", s=8)
plot(r, linear_fit, c="black"); show(); gcf()
```
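
As a sanity check on the matrix approach, the same slope and offset can be computed from the textbook least-squares formulas. The snippet below is a minimal sketch, not part of the original routine: the variable names `x_mean`, `y_mean`, `m`, and `b` are chosen here for illustration, and the data are the nine sales values from the example.

```julia
using Statistics

# Same data as in the example: months 1 to 9 and the recorded sales
x = collect(1.0:9.0)
y = Float64[5, 16, 31, 59, 62, 78, 90, 98, 101]

# Textbook least-squares formulas for a straight line y = m*x + b
x_mean, y_mean = mean(x), mean(y)
m = sum((x .- x_mean) .* (y .- y_mean)) / sum((x .- x_mean) .^ 2)  # slope
b = y_mean - m * x_mean                                            # offset

# Compare against matrix left-division on the [ones x] design matrix
V = [ones(length(x)) x] \ y   # V[1] is the offset, V[2] is the slope
println((offset = b, slope = m))
println(isapprox(V, [b, m]))
```

For this dataset the slope works out to $767/60 \approx 12.78$ sales per month and the offset to roughly $-3.92$, and both routes should agree to within floating-point rounding.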


While this technique is explicit and yields a reliable solution, it’s not the most computationally efficient routine we can muster. In fact, we can write a function that skips the general-purpose solver and instead solves the small 2×2 system known as the normal equations, using a Cholesky factorization. Beware, this is less clean than our previous example, but it does less work per fit, which matters most on large datasets.

```julia
using LinearAlgebra
using PyPlot

# Define the optimized linear regression function.
# It solves the 2x2 normal equations with an in-place Cholesky factorization
# and returns [offset, slope].
function optimized_linreg(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
    (N = length(x)) == length(y) || throw(DimensionMismatch())
    ldiv!(cholesky!(Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)), [sum(y), dot(x, y)])
end

# Determine the linear fit (the function expects the month numbers as a plain vector)
x,y = float.(month[:,2]), float.(sales)
V = optimized_linreg(x,y)
r = range(1, 9, step=0.1)
linear_fit = V[1] .+ r.*V[2]

# Plot the results
figure()
scatter(month[:,2], y, c="blue", s=8)
plot(r, linear_fit, c="black"); show(); gcf()
```
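
Because the one-line solve in optimized_linreg is dense, it may help to see the same idea unrolled step by step. The following is a sketch only: `normal_equations_linreg` is a hypothetical helper name introduced here, and it uses the non-mutating `cholesky` for readability, so it is not meant to match the speed of the in-place version.

```julia
using LinearAlgebra

# Hypothetical, unrolled version of the normal-equations solve (illustration only)
function normal_equations_linreg(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    length(x) == length(y) || throw(DimensionMismatch("x and y must have the same length"))
    N = length(x)
    # Left-hand side: X'X for the design matrix X = [ones x]
    A = float.([N sum(x); sum(x) sum(abs2, x)])
    # Right-hand side: X'y
    rhs = float.([sum(y), dot(x, y)])
    # A is symmetric positive definite when x holds at least two distinct values,
    # so a Cholesky factorization can solve A * [offset, slope] = rhs
    return cholesky(Symmetric(A)) \ rhs
end

x = collect(1.0:9.0)
y = Float64[5, 16, 31, 59, 62, 78, 90, 98, 101]
coeffs = normal_equations_linreg(x, y)

# The result should agree with plain left-division up to rounding
println(isapprox(coeffs, [ones(length(x)) x] \ y))
```

One design note: forming $X^TX$ squares the conditioning of the problem, so for poorly scaled data the plain left-division route is generally the safer choice even if it is slower.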


#### Linear Regression: Error analysis

Great! We now have a linear model that seems to fit our dataset appropriately. But we can still apply a numerical metric to tell us exactly how good our match is. The method we’ll demonstrate here is called the “R-squared” method, which, for a least-squares line like ours, produces a decimal between 0 (not great) and 1 (a perfect fit). This will also allow us to compare the accuracy of our optimized function to our initial matrix division technique.

We’ll start by defining the R-squared measurement as one minus the ratio between the sum of squared residuals ($SSR$) and the total sum of squares ($SST$), or:

$$R^2=1-\frac{SSR}{SST}=1-\frac{\Sigma(y_i-\hat{y}_i)^2}{\Sigma(y_i-\bar{y})^2}$$

where $y_i$ are the realized data (sales), $\hat{y}_i$ are the predicted (fit) data, and $\bar{y}$ is the mean of the real data points. Similar to our linear regression models, we can write this up as a compact function (not optimized for speed). Below, our function r_squared ingests our realized data values (y) and the data predicted by our fit (fitted_data) at the same x-values (months); the fitted values are read from the second column of fitted_data.

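```julia
using Statistics

# fitted_data is a two-column array; the fitted values live in its second column
function r_squared(y,fitted_data)
    ssr = sum((y.-fitted_data[:,2]).^2)   # sum of squared residuals
    sst = sum((y.-mean(y)).^2)            # total sum of squares
    rs = 1-(ssr/sst)
    return rs
end
```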
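
To build some intuition for the two ends of the scale, we can feed the formula two extreme sets of predictions. This is a minimal sketch: `r2` is a hypothetical, vector-based variant of the function above (it takes a plain vector of fitted values rather than a two-column array), introduced only for this illustration.

```julia
using Statistics

# Hypothetical vector-based variant of r_squared, for illustration only:
# `fitted` is a plain vector with one prediction per observation
r2(y, fitted) = 1 - sum((y .- fitted) .^ 2) / sum((y .- mean(y)) .^ 2)

y = Float64[5, 16, 31, 59, 62, 78, 90, 98, 101]

perfect  = r2(y, y)                          # predictions equal the data
baseline = r2(y, fill(mean(y), length(y)))   # always predict the mean
println((perfect = perfect, baseline = baseline))
```

Predictions that reproduce the data exactly leave no residual, so the score is 1; predicting the mean every time makes $SSR$ equal to $SST$, so the score is 0.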


Putting this all together, we end up with the following code snippet. The plot produced by the code shows an overlap between our optimized linear fit function (optimized_linreg) and our original linear fit function (original_linreg), so it is not surprising that their $R^2$ values match (0.965).

```julia
using PyPlot
using Statistics
using LinearAlgebra

# Least-squares fit via matrix left-division; returns [offset, slope]
function original_linreg(x, y)
    V = x\y
    return V
end

# Least-squares fit via the 2x2 normal equations; returns [offset, slope]
function optimized_linreg(x::AbstractVector{T}, y::AbstractVector{T}) where {T<:AbstractFloat}
    (N = length(x)) == length(y) || throw(DimensionMismatch())
    ldiv!(cholesky!(Symmetric([T(N) sum(x); zero(T) sum(abs2, x)], :U)), [sum(y), dot(x, y)])
end

# Calculate error (the fitted values are read from the second column of fitted_data)
function r_squared(y,fitted_data)
    ssr = sum((y.-fitted_data[:,2]).^2)
    sst = sum((y.-mean(y)).^2)
    rs = 1-(ssr/sst)
    return rs
end

# Create table of sales values
month = [1 1; 1 2; 1 3; 1 4; 1 5; 1 6; 1 7; 1 8; 1 9]
sales = [5; 16; 31; 59; 62; 78; 90; 98; 101]

# Original regression
V = original_linreg(month,sales)
r = range(1, 9, step=0.1)
fine_orig_fit = V[1] .+ r.*V[2]
coarse_orig_fit = V[1] .+ month.*V[2]   # second column holds the fit at months 1 to 9

# Optimized regression
x,y = float.(month[:,2]), float.(sales)
V = optimized_linreg(x,y)
rf = range(1, 9, step=0.1)
fine_optim_fit = V[1] .+ rf.*V[2]
coarse_optim_fit = V[1] .+ month.*V[2]  # second column holds the fit at months 1 to 9

# Linear fit error analysis
r_squared_orig = r_squared(sales,coarse_orig_fit)
r_squared_optim = r_squared(sales,coarse_optim_fit)
@info("R-squared", r_squared_orig, r_squared_optim)

# Plot the results
figure()
scatter(month[:,2], sales, c="blue", s=8)
plot(r, fine_orig_fit, c="gray")
plot(rf, fine_optim_fit, c="black", linestyle=":")
show(); gcf()
```
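
The reported value can also be checked by hand. The arithmetic below is a worked sketch using the nine data points from the example; it relies on the standard identity that, for a least-squares line, the residual sum of squares equals $SST$ minus $\left(\Sigma(x_i-\bar{x})(y_i-\bar{y})\right)^2 / \Sigma(x_i-\bar{x})^2$.

$$
\begin{aligned}
\bar{x} &= 45/9 = 5, \qquad \bar{y} = 540/9 = 60 \\
\Sigma(x_i-\bar{x})(y_i-\bar{y}) &= 3467 - 9 \cdot 5 \cdot 60 = 767 \\
\Sigma(x_i-\bar{x})^2 &= 285 - 9 \cdot 5^2 = 60 \\
m &= 767/60 \approx 12.78, \qquad b = 60 - 5m \approx -3.92 \\
SST &= \Sigma(y_i-\bar{y})^2 = 10156 \\
SSR &= 10156 - \frac{767^2}{60} \approx 351.18 \\
R^2 &= 1 - \frac{351.18}{10156} \approx 0.965
\end{aligned}
$$

This agrees with the $R^2$ of 0.965 obtained from both fitting routines.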


## Conclusion

In this tutorial, we took a look at what linear regressions are, how they can be used as a data analysis tool, and how their accuracy can be measured using the $R^2$ metric. We then wrote a short Julia program to show these algorithms in action. Now that we have a better understanding of linear regressions, we can begin discussing how they are used in an ML-based context.

In our next post, we’ll apply what we’ve learned to build a machine learning algorithm based on a much larger dataset. To stay updated as I post, feel free to like, comment, and subscribe! See you next time, and thank you for joining me — things are just starting to heat up!
