Line of Best Fit icon Line of Best Fit
University Project #Data Science#Mathematics

Line of Best Fit#

Creating a line of best fit to predict future country population, optimizing line hyperparameters through a genetic algorithm and error function.

.rmd knitted with execution results available for download here

Part 1: Download the file#

# url = "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2022_Demographic_Indicators_Medium.zip"
# download.file(url, "output.zip")

Part 2: Unzip the file, put into dataframe#

# unzip
# zip = "output.zip"
# unzip(zip)
# put into dataframe
# file_name = "WPP2022_Demographic_Indicators_Medium.csv"
# full_data = read.csv(file_name)
full_data = read.csv(url("https://raw.githubusercontent.com/julien-arino/math-of-data-science/main/CODE/WPP2022_Demographic_Indicators_Medium.csv"))
# filter out irrelevant data
data = full_data[full_data$Location == "Canada" & full_data$Time <= 2023,]

Part 3: Plot the total population of that country#

Multiply the population by 1000 to get the real number, not in a measure of thousands

# sort data by time ascending
sorted = data[order(data$Time),]
# plot data
plot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")

Part 4a: Fit lines to the plot#

Start with fitting the line: y = a + bx#

Initially, create the matrix ‘A’, which consists of two columns of the coefficients for ‘a’ and ‘b’

The coefficients for the ‘a’ value is just 1’s, so fill the first column with a 1 for every ‘x’ value in the plot

The coefficients for the ‘b’ value are the ‘x’ values, so the second column is all of the values in ‘data$Time’

A = matrix(c(rep(1, length(data$Time)), data$Time), nr=length(data$Time), nc=2)
A
## [,1] [,2]
## [1,] 1 1950
## [2,] 1 1951
## [3,] 1 1952
## [4,] 1 1953
## [5,] 1 1954
## [6,] 1 1955
## [7,] 1 1956
## [8,] 1 1957
## [9,] 1 1958
## [10,] 1 1959
## [11,] 1 1960
## [12,] 1 1961
## [13,] 1 1962
## [14,] 1 1963
## [15,] 1 1964
## [16,] 1 1965
## [17,] 1 1966
## [18,] 1 1967
## [19,] 1 1968
## [20,] 1 1969
## [21,] 1 1970
## [22,] 1 1971
## [23,] 1 1972
## [24,] 1 1973
## [25,] 1 1974
## [26,] 1 1975
## [27,] 1 1976
## [28,] 1 1977
## [29,] 1 1978
## [30,] 1 1979
## [31,] 1 1980
## [32,] 1 1981
## [33,] 1 1982
## [34,] 1 1983
## [35,] 1 1984
## [36,] 1 1985
## [37,] 1 1986
## [38,] 1 1987
## [39,] 1 1988
## [40,] 1 1989
## [41,] 1 1990
## [42,] 1 1991
## [43,] 1 1992
## [44,] 1 1993
## [45,] 1 1994
## [46,] 1 1995
## [47,] 1 1996
## [48,] 1 1997
## [49,] 1 1998
## [50,] 1 1999
## [51,] 1 2000
## [52,] 1 2001
## [53,] 1 2002
## [54,] 1 2003
## [55,] 1 2004
## [56,] 1 2005
## [57,] 1 2006
## [58,] 1 2007
## [59,] 1 2008
## [60,] 1 2009
## [61,] 1 2010
## [62,] 1 2011
## [63,] 1 2012
## [64,] 1 2013
## [65,] 1 2014
## [66,] 1 2015
## [67,] 1 2016
## [68,] 1 2017
## [69,] 1 2018
## [70,] 1 2019
## [71,] 1 2020
## [72,] 1 2021
## [73,] 1 2022
## [74,] 1 2023

Now make the vector ‘b’, which is just the ‘y’ values of the plot

b = data$TPopulation1Jan

Now create the matrix multiplication of ‘A’ transpose times ‘A’, and get it’s inverse

AtA = t(A) %*% A
# Look at the result, and it's determinate
AtA
## [,1] [,2]
## [1,] 74 147001
## [2,] 147001 292051249
det(AtA)
## [1] 2498425
# Invert the result
invAtA = solve(AtA)

Since the determinate of ‘AtA’ is non-zero, and the columns that make up ‘A’ are clearly linearly independent, there is a unique solution.

The unique solution is the inverse of ‘AtA’, times the transpose of ‘A’, times ‘b’.

solution_linear = invAtA %*% t(A) %*% b
solution_linear
## [,1]
## [1,] -627577.8531
## [2,] 329.1654
Now fit the line: y = a + bx + cx^2#

It’s pretty much the same, only ‘A’ has a third column consisting of the coefficients for ‘c’, namely a column of ‘x^2’

However, due to the large values of ‘x^2’, that will continue to get larger as we work towards a solution for our quadratic line, we could run into problems when we try and find the inverse, namely the value of our determinate will get so extreme that the result will be functionally 0.

This can be solved by modifying the values of ‘x’, so long as they maintain the gaps between each value there will be no problem.

So by subtracting the minimum ‘x’ value from all ‘x’ values, we get smaller numbers that maintain the gap between the original ‘x’ values, thereby solving the issue of the extreme determinate.

A2 = matrix(c(rep(1, length(data$Time)), data$Time-min(data$Time), (data$Time-min(data$Time))^2), nr=length(data$Time), nc=3)
A2
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 1 1 1
## [3,] 1 2 4
## [4,] 1 3 9
## [5,] 1 4 16
## [6,] 1 5 25
## [7,] 1 6 36
## [8,] 1 7 49
## [9,] 1 8 64
## [10,] 1 9 81
## [11,] 1 10 100
## [12,] 1 11 121
## [13,] 1 12 144
## [14,] 1 13 169
## [15,] 1 14 196
## [16,] 1 15 225
## [17,] 1 16 256
## [18,] 1 17 289
## [19,] 1 18 324
## [20,] 1 19 361
## [21,] 1 20 400
## [22,] 1 21 441
## [23,] 1 22 484
## [24,] 1 23 529
## [25,] 1 24 576
## [26,] 1 25 625
## [27,] 1 26 676
## [28,] 1 27 729
## [29,] 1 28 784
## [30,] 1 29 841
## [31,] 1 30 900
## [32,] 1 31 961
## [33,] 1 32 1024
## [34,] 1 33 1089
## [35,] 1 34 1156
## [36,] 1 35 1225
## [37,] 1 36 1296
## [38,] 1 37 1369
## [39,] 1 38 1444
## [40,] 1 39 1521
## [41,] 1 40 1600
## [42,] 1 41 1681
## [43,] 1 42 1764
## [44,] 1 43 1849
## [45,] 1 44 1936
## [46,] 1 45 2025
## [47,] 1 46 2116
## [48,] 1 47 2209
## [49,] 1 48 2304
## [50,] 1 49 2401
## [51,] 1 50 2500
## [52,] 1 51 2601
## [53,] 1 52 2704
## [54,] 1 53 2809
## [55,] 1 54 2916
## [56,] 1 55 3025
## [57,] 1 56 3136
## [58,] 1 57 3249
## [59,] 1 58 3364
## [60,] 1 59 3481
## [61,] 1 60 3600
## [62,] 1 61 3721
## [63,] 1 62 3844
## [64,] 1 63 3969
## [65,] 1 64 4096
## [66,] 1 65 4225
## [67,] 1 66 4356
## [68,] 1 67 4489
## [69,] 1 68 4624
## [70,] 1 69 4761
## [71,] 1 70 4900
## [72,] 1 71 5041
## [73,] 1 72 5184
## [74,] 1 73 5329

The vector ‘b’ remains unchanged

AtA2 = t(A2) %*% A2
# Look at the result, and it's determinate
AtA2
## [,1] [,2] [,3]
## [1,] 74 2701 132349
## [2,] 2701 132349 7295401
## [3,] 132349 7295401 428943109
det(AtA2)
## [1] 3.0772e+13
# Invert the result
invAtA2 = solve(AtA2)
invAtA2
## [,1] [,2] [,3]
## [1,] 1.152774e-01 -6.273115e-03 7.112376e-05
## [2,] -6.273115e-03 4.622882e-04 -5.926980e-06
## [3,] 7.112376e-05 -5.926980e-06 8.119150e-08

Now compute the solution which is the inverse of ‘AtA2’, times the transpose of ‘A2’, times ‘b’

solution_quadratic = invAtA2 %*% t(A2) %*% b
solution_quadratic
## [,1]
## [1,] 14104.2697665
## [2,] 345.0353083
## [3,] -0.2173958

Part 4b: Fit lines to the plot using Least Squares Fit#

Starting with y = a + bx#

Start by creating a function to model a line.

Then we need to find the error between this line and the actual values.

This is done by taking the difference between the actual points and the modeled points

# Create a function to output the estimated values
# y = a + bx
linear = function(x, a, b){
return(a + b*x)
}
# Create the error function for a linear function
# e_i = y_i - y_i~
# total is sqrt of sum of e_i ^ 2
error_linear = function(a, b, points){
yt = linear(points$x, a, b)
e = points$y - yt
return(sqrt(sum(e^2)))
}

But we can’t actually do anything with this unless we have the parameters needed to model a line.

To solve this we can use a genetic algorithm to automatically optimize our hyper parameters using our error function.

# Create points list for model
points = list()
points$x = c(data$Time-min(data$Time))
points$y = c(data$TPopulation1Jan)*1000
# installs 'GA' package if not already installed
if(!require(GA)){
install.packages("GA")
library(GA)
}
## Loading required package: GA
## Loading required package: foreach
## Loading required package: iterators
## Package 'GA' version 3.2.3
## Type 'citation("GA")' for citing this R package in publications.
##
## Attaching package: 'GA'
## The following object is masked from 'package:utils':
##
## de
library(GA)
GA = ga(type = "real-valued",
fitness = function(val) {return(-error_linear(a = val[1], b = val[2], points))},
suggestions = c(a = 1.4e+07, b = 300000),
lower = c(0, 0), upper = c(4e+07, 1.5e+10),
popSize = 200, maxiter = 500)
plot(GA)

And the values for ‘a’ and ‘b’ are:

GA@solution
## x1 x2
## [1,] 13629086 347462.6
-GA@fitnessValue
## [1] 4271004
Now for y = a + bx + cx^2#

Simply create a new quadratic function and an error function to go along with it.

The formula for the error stays the same.

# Create a function to output the estimated values
# y = a + bx + cx^2
quadratic = function(x, a, b, c){
return(a + b*x + c*(x^2))
}
# Create the error function for a quadratic function
# e_i = y_i - y_i~
# total is sqrt of sum of e_i ^ 2
error_quadratic = function(a, b, c, points){
yt = quadratic(points$x, a, b, c)
e = points$y - yt
return(sqrt(sum(e^2)))
}

Then just preform the genetic algorithm with the new error function

GA2 = ga(type = "real-valued",
fitness = function(val) {return(-error_quadratic(a = val[1], b = val[2], c=val[3], points))},
suggestions = c(a = 1.4e+07, b = 3000, c = 10),
lower = c(0, 0, 0), upper = c(2e+07, 1.5e+7, 3000),
popSize = 200, maxiter = 500)
plot(GA2)
GA2@solution
## x1 x2 x3
## [1,] 13736523 339490.7 8.124464
-GA2@fitnessValue
## [1] 3614614

Part 5: Plot lines on graph#

# use the modified x-values
x = data$Time-min(data$Time)
# plot data every time
# plot y = a + bx
plot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")
title("y = a + bx")
lines(x=data$Time, y=1000*(solution_linear[1] + solution_linear[2]*data$Time), type="l", lwd=2, col="red")
# plot y = a + bx + cx^2
plot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")
title("y = a + bx + cx^2")
lines(x=data$Time, y=1000*(solution_quadratic[1] + solution_quadratic[2]*x + solution_quadratic[3]*(x ^ 2)), type="l", lwd=1, col="blue")
# plot y = a + bx least squares
plot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")
title("y = a + bx least squares")
lines(x=data$Time, y=(linear((points$x), GA@solution[1], GA@solution[2])), lwd=2, col="black")
# plot y = a + bx + cx^2 least squares
plot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")
title("y = a + bx + cx^2 least squares")
lines(x=data$Time, y=(quadratic((points$x), GA2@solution[1], GA2@solution[2], GA2@solution[3])), lwd=2, col="green")

Part 6: Predict the population at 2030#

Predicting the population at 2030 just involves plugging in 2030 for the ‘x’ values used in the ‘y=’ part of the line plots.

Though, you have to remember to shift 2030 by the earliest year value for the quadratic prediction and the least squares prediction.

linear_pred = 1000*(solution_linear[1] + solution_linear[2]*2030)
quad_pred = 1000*(solution_quadratic[1] + solution_quadratic[2]*(2030-min(data$Time)) + solution_quadratic[3]*((2030-min(data$Time)) ^ 2))
least_lpred = linear((2030-min(data$Time)), GA@solution[1], GA@solution[2])
least_qpred = quadratic((2030-min(data$Time)), GA2@solution[1], GA2@solution[2], GA2@solution[3])
linear_pred
## [1] 40627942
quad_pred
## [1] 40315761
least_lpred
## [1] 41426094
least_qpred
## [1] 40947774

It should be noted that the original data has predictions for Canada’s future population up to the year 2100, so we can compare our predicted result, to the predicted result in the .csv.

pred = full_data[full_data$Location == "Canada" & full_data$Time == 2030,]
pred$TPopulation1Jan * 1000
## [1] 40852177

From this we can see that the closest model to the given prediction is the least squares quadratic model.

← Back to Projects