Line of Best Fit Line of Best Fit
Creating a line of best fit to predict future country population, optimizing line hyperparameters through a genetic algorithm and error function.
.rmd knitted with execution results available for download here
Part 1: Download the file
# url = "https://population.un.org/wpp/Download/Files/1_Indicators%20(Standard)/CSV_FILES/WPP2022_Demographic_Indicators_Medium.zip"
# download.file(url, "output.zip")Part 2: Unzip the file, put into dataframe
# unzip# zip = "output.zip"# unzip(zip)
# put into dataframe# file_name = "WPP2022_Demographic_Indicators_Medium.csv"# full_data = read.csv(file_name)
full_data = read.csv(url("https://raw.githubusercontent.com/julien-arino/math-of-data-science/main/CODE/WPP2022_Demographic_Indicators_Medium.csv"))
# filter out irrelevant datadata = full_data[full_data$Location == "Canada" & full_data$Time <= 2023,]Part 3: Plot the total population of that country
Multiply the population by 1000 to get the real number, not in a measure of thousands
# sort data by time ascendingsorted = data[order(data$Time),]
# plot dataplot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")Part 4a: Fit lines to the plot
Start with fitting the line: y = a + bx
Initially, create the matrix ‘A’, which consists of two columns of the coefficients for ‘a’ and ‘b’
The coefficients for the ‘a’ value is just 1’s, so fill the first column with a 1 for every ‘x’ value in the plot
The coefficients for the ‘b’ value are the ‘x’ values, so the second column is all of the values in ‘data$Time’
A = matrix(c(rep(1, length(data$Time)), data$Time), nr=length(data$Time), nc=2)A## [,1] [,2]## [1,] 1 1950## [2,] 1 1951## [3,] 1 1952## [4,] 1 1953## [5,] 1 1954## [6,] 1 1955## [7,] 1 1956## [8,] 1 1957## [9,] 1 1958## [10,] 1 1959## [11,] 1 1960## [12,] 1 1961## [13,] 1 1962## [14,] 1 1963## [15,] 1 1964## [16,] 1 1965## [17,] 1 1966## [18,] 1 1967## [19,] 1 1968## [20,] 1 1969## [21,] 1 1970## [22,] 1 1971## [23,] 1 1972## [24,] 1 1973## [25,] 1 1974## [26,] 1 1975## [27,] 1 1976## [28,] 1 1977## [29,] 1 1978## [30,] 1 1979## [31,] 1 1980## [32,] 1 1981## [33,] 1 1982## [34,] 1 1983## [35,] 1 1984## [36,] 1 1985## [37,] 1 1986## [38,] 1 1987## [39,] 1 1988## [40,] 1 1989## [41,] 1 1990## [42,] 1 1991## [43,] 1 1992## [44,] 1 1993## [45,] 1 1994## [46,] 1 1995## [47,] 1 1996## [48,] 1 1997## [49,] 1 1998## [50,] 1 1999## [51,] 1 2000## [52,] 1 2001## [53,] 1 2002## [54,] 1 2003## [55,] 1 2004## [56,] 1 2005## [57,] 1 2006## [58,] 1 2007## [59,] 1 2008## [60,] 1 2009## [61,] 1 2010## [62,] 1 2011## [63,] 1 2012## [64,] 1 2013## [65,] 1 2014## [66,] 1 2015## [67,] 1 2016## [68,] 1 2017## [69,] 1 2018## [70,] 1 2019## [71,] 1 2020## [72,] 1 2021## [73,] 1 2022## [74,] 1 2023Now make the vector ‘b’, which is just the ‘y’ values of the plot
b = data$TPopulation1JanNow create the matrix multiplication of ‘A’ transpose times ‘A’, and get it’s inverse
AtA = t(A) %*% A
# Look at the result, and it's determinateAtA## [,1] [,2]## [1,] 74 147001## [2,] 147001 292051249det(AtA)## [1] 2498425# Invert the resultinvAtA = solve(AtA)Since the determinate of ‘AtA’ is non-zero, and the columns that make up ‘A’ are clearly linearly independent, there is a unique solution.
The unique solution is the inverse of ‘AtA’, times the transpose of ‘A’, times ‘b’.
solution_linear = invAtA %*% t(A) %*% bsolution_linear## [,1]## [1,] -627577.8531## [2,] 329.1654Now fit the line: y = a + bx + cx^2
It’s pretty much the same, only ‘A’ has a third column consisting of the coefficients for ‘c’, namely a column of ‘x^2’
However, due to the large values of ‘x^2’, that will continue to get larger as we work towards a solution for our quadratic line, we could run into problems when we try and find the inverse, namely the value of our determinate will get so extreme that the result will be functionally 0.
This can be solved by modifying the values of ‘x’, so long as they maintain the gaps between each value there will be no problem.
So by subtracting the minimum ‘x’ value from all ‘x’ values, we get smaller numbers that maintain the gap between the original ‘x’ values, thereby solving the issue of the extreme determinate.
A2 = matrix(c(rep(1, length(data$Time)), data$Time-min(data$Time), (data$Time-min(data$Time))^2), nr=length(data$Time), nc=3)A2## [,1] [,2] [,3]## [1,] 1 0 0## [2,] 1 1 1## [3,] 1 2 4## [4,] 1 3 9## [5,] 1 4 16## [6,] 1 5 25## [7,] 1 6 36## [8,] 1 7 49## [9,] 1 8 64## [10,] 1 9 81## [11,] 1 10 100## [12,] 1 11 121## [13,] 1 12 144## [14,] 1 13 169## [15,] 1 14 196## [16,] 1 15 225## [17,] 1 16 256## [18,] 1 17 289## [19,] 1 18 324## [20,] 1 19 361## [21,] 1 20 400## [22,] 1 21 441## [23,] 1 22 484## [24,] 1 23 529## [25,] 1 24 576## [26,] 1 25 625## [27,] 1 26 676## [28,] 1 27 729## [29,] 1 28 784## [30,] 1 29 841## [31,] 1 30 900## [32,] 1 31 961## [33,] 1 32 1024## [34,] 1 33 1089## [35,] 1 34 1156## [36,] 1 35 1225## [37,] 1 36 1296## [38,] 1 37 1369## [39,] 1 38 1444## [40,] 1 39 1521## [41,] 1 40 1600## [42,] 1 41 1681## [43,] 1 42 1764## [44,] 1 43 1849## [45,] 1 44 1936## [46,] 1 45 2025## [47,] 1 46 2116## [48,] 1 47 2209## [49,] 1 48 2304## [50,] 1 49 2401## [51,] 1 50 2500## [52,] 1 51 2601## [53,] 1 52 2704## [54,] 1 53 2809## [55,] 1 54 2916## [56,] 1 55 3025## [57,] 1 56 3136## [58,] 1 57 3249## [59,] 1 58 3364## [60,] 1 59 3481## [61,] 1 60 3600## [62,] 1 61 3721## [63,] 1 62 3844## [64,] 1 63 3969## [65,] 1 64 4096## [66,] 1 65 4225## [67,] 1 66 4356## [68,] 1 67 4489## [69,] 1 68 4624## [70,] 1 69 4761## [71,] 1 70 4900## [72,] 1 71 5041## [73,] 1 72 5184## [74,] 1 73 5329The vector ‘b’ remains unchanged
AtA2 = t(A2) %*% A2
# Look at the result, and it's determinateAtA2## [,1] [,2] [,3]## [1,] 74 2701 132349## [2,] 2701 132349 7295401## [3,] 132349 7295401 428943109det(AtA2)## [1] 3.0772e+13# Invert the resultinvAtA2 = solve(AtA2)invAtA2## [,1] [,2] [,3]## [1,] 1.152774e-01 -6.273115e-03 7.112376e-05## [2,] -6.273115e-03 4.622882e-04 -5.926980e-06## [3,] 7.112376e-05 -5.926980e-06 8.119150e-08Now compute the solution which is the inverse of ‘AtA2’, times the transpose of ‘A2’, times ‘b’
solution_quadratic = invAtA2 %*% t(A2) %*% bsolution_quadratic## [,1]## [1,] 14104.2697665## [2,] 345.0353083## [3,] -0.2173958Part 4b: Fit lines to the plot using Least Squares Fit
Starting with y = a + bx
Start by creating a function to model a line.
Then we need to find the error between this line and the actual values.
This is done by taking the difference between the actual points and the modeled points
# Create a function to output the estimated values# y = a + bxlinear = function(x, a, b){ return(a + b*x)}
# Create the error function for a linear function# e_i = y_i - y_i~# total is sqrt of sum of e_i ^ 2error_linear = function(a, b, points){ yt = linear(points$x, a, b) e = points$y - yt return(sqrt(sum(e^2)))}But we can’t actually do anything with this unless we have the parameters needed to model a line.
To solve this we can use a genetic algorithm to automatically optimize our hyper parameters using our error function.
# Create points list for modelpoints = list()points$x = c(data$Time-min(data$Time))points$y = c(data$TPopulation1Jan)*1000
# installs 'GA' package if not already installedif(!require(GA)){ install.packages("GA") library(GA)}## Loading required package: GA## Loading required package: foreach## Loading required package: iterators## Package 'GA' version 3.2.3## Type 'citation("GA")' for citing this R package in publications.#### Attaching package: 'GA'## The following object is masked from 'package:utils':#### delibrary(GA)GA = ga(type = "real-valued", fitness = function(val) {return(-error_linear(a = val[1], b = val[2], points))}, suggestions = c(a = 1.4e+07, b = 300000), lower = c(0, 0), upper = c(4e+07, 1.5e+10), popSize = 200, maxiter = 500)plot(GA)And the values for ‘a’ and ‘b’ are:
GA@solution## x1 x2## [1,] 13629086 347462.6-GA@fitnessValue## [1] 4271004Now for y = a + bx + cx^2
Simply create a new quadratic function and an error function to go along with it.
The formula for the error stays the same.
# Create a function to output the estimated values# y = a + bx + cx^2quadratic = function(x, a, b, c){ return(a + b*x + c*(x^2))}
# Create the error function for a quadratic function# e_i = y_i - y_i~# total is sqrt of sum of e_i ^ 2error_quadratic = function(a, b, c, points){ yt = quadratic(points$x, a, b, c) e = points$y - yt return(sqrt(sum(e^2)))}Then just preform the genetic algorithm with the new error function
GA2 = ga(type = "real-valued", fitness = function(val) {return(-error_quadratic(a = val[1], b = val[2], c=val[3], points))}, suggestions = c(a = 1.4e+07, b = 3000, c = 10), lower = c(0, 0, 0), upper = c(2e+07, 1.5e+7, 3000), popSize = 200, maxiter = 500)plot(GA2)GA2@solution## x1 x2 x3## [1,] 13736523 339490.7 8.124464-GA2@fitnessValue## [1] 3614614Part 5: Plot lines on graph
# use the modified x-valuesx = data$Time-min(data$Time)
# plot data every time
# plot y = a + bxplot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")title("y = a + bx")lines(x=data$Time, y=1000*(solution_linear[1] + solution_linear[2]*data$Time), type="l", lwd=2, col="red")# plot y = a + bx + cx^2plot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")title("y = a + bx + cx^2")lines(x=data$Time, y=1000*(solution_quadratic[1] + solution_quadratic[2]*x + solution_quadratic[3]*(x ^ 2)), type="l", lwd=1, col="blue")# plot y = a + bx least squaresplot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")title("y = a + bx least squares")lines(x=data$Time, y=(linear((points$x), GA@solution[1], GA@solution[2])), lwd=2, col="black")# plot y = a + bx + cx^2 least squaresplot(x=data$Time ,y=data$TPopulation1Jan*1000, xlab="Year", ylab="Population")title("y = a + bx + cx^2 least squares")lines(x=data$Time, y=(quadratic((points$x), GA2@solution[1], GA2@solution[2], GA2@solution[3])), lwd=2, col="green")Part 6: Predict the population at 2030
Predicting the population at 2030 just involves plugging in 2030 for the ‘x’ values used in the ‘y=’ part of the line plots.
Though, you have to remember to shift 2030 by the earliest year value for the quadratic prediction and the least squares prediction.
linear_pred = 1000*(solution_linear[1] + solution_linear[2]*2030)quad_pred = 1000*(solution_quadratic[1] + solution_quadratic[2]*(2030-min(data$Time)) + solution_quadratic[3]*((2030-min(data$Time)) ^ 2))least_lpred = linear((2030-min(data$Time)), GA@solution[1], GA@solution[2])least_qpred = quadratic((2030-min(data$Time)), GA2@solution[1], GA2@solution[2], GA2@solution[3])
linear_pred## [1] 40627942quad_pred## [1] 40315761least_lpred## [1] 41426094least_qpred## [1] 40947774It should be noted that the original data has predictions for Canada’s future population up to the year 2100, so we can compare our predicted result, to the predicted result in the .csv.
pred = full_data[full_data$Location == "Canada" & full_data$Time == 2030,]pred$TPopulation1Jan * 1000## [1] 40852177From this we can see that the closest model to the given prediction is the least squares quadratic model.