Data Analyst Project: Temperature Anomalies Forecasting

Introduction

In this project, we will analyze the annual temperature anomalies from 1850 to 2021 for the northern hemisphere. This data set we use is a time series data set as it has a metric temperature anomaly compared over time and it is important to analyze as temperature changes are vital to our planet and environment. It is valuable and necessary to be able to know how temperature will change over time so we can implement strategies to accomodate such temperature changes.

Materials and Methods

For this project, data was obtained from the Climate Research Center at the University of East Anglia, UK. The data has 172 observations each representing a year starting in 1850 and ending in 2021. The other column in the data set is a numeric measure of temperature anomalies which ranges from -0.702 to 1.2760.

There are several time series methods used in this project. First we use graphical methods such as line charts, ACF plots, and PACF plots to assess the stationarity of the data. We use methods of model selection such as AIC to select appropriate ARIMA models for the data. We also use graphical and numerical methods to assess diagnostics on the final model such as normal probability plots, Ljung-Box tests, and histograms. Finally we fit other models such as an ARMA model through spline estimation of trend and forecast future values.

Results

Method 1: modeling the stationary series

Visualization of Original and First Differenced Data

	Temperature anomalies
Minimum	-0.70200
25% quantile	-0.36125
Median	-0.12400
Mean	-0.03456
75% quantile	0.10075
Maximum	1.27600

The boxplot on the left summarizes the distribution of the temperature anomalies variable. We can find the summary Statistics from the table above.

As seen from the above plot on the right, the series doesn’t look stationary because the mean is not the same for all years. The variance may not have a constant value that it fluctuates around as well as having fluctuations that change over time. This can be seen as time increases, the vertical fluctuation shrinks compared to lower values of time. We will need to difference the series to achieve stationarity.

The plot above on is the first differenced series. This series has a constant value that the series fluctuates around as well as having roughly equation vertical fluctuation over time. The mean seems to stay roughly the same around 0. This is assumed to be stationary and we will proceed with this series as our data from here out.

The above plots are the ACF and PACF plots for the first differenced series. For the ACF plot, lag 0 and lag 1 are both outside of the 95% confidence interval and the remaining lags are all within the confidence interval. This indicates that an MA(1) model may be appropriate for the data.

For the PACF plot, lags 1,2, and 3 are all outside of the 95% confidence interval and the remaining lags are all within the confidence interval. This indicates that an AR(3) model may be appropriate for the data.

Putting all of this together, our preliminary identification of which model to fit is an ARIMA(3,1,1) model.

Fitting an ARIMA(3,1,1) model obtained via preliminary identification

The plots above displays the properties of the residuals for our preliminary ARIMA(3,1,1) model. The line graph for the residuals over time has no apparent pattern and have roughly equal variance. The ACF plot shows that no lags have an ACF value outside of the 95% confidence interval indicating that the residuals are iid. The histogram of the residuals is roughly normally distributed and lastly, the normal probability plot has no large variation from the straight line indicating that the data is normally distributed.

Selecting a Model Based on AIC Criterion

We now use the AIC criterion to select the most appropriate ARIMA(p,d,q) model for the data. We consider 16 models with p = 0, 1 ,2 and 3, and q = 0, 1, 2 and 3, where p is the AR order and q is the MA order.

AIC table:

	q = 0	q = 1	q = 2	q = 3
p = 0	-0.9229414	-1.094005	-1.119878	-1.108348
p = 1	-0.9922384	-1.114102	-1.108236	-1.109690
p = 2	-1.0455561	-1.114930	-1.112556	-1.114580
p = 3	-1.1233958	-1.116518	-1.110752	-1.103335

From the table above, we compare the AIC values from 16 ARIMA models. You can see that the AIC criterion is minimized for the first differenced series with AR(3) and MA(0) resulting in an AIC of -1.1234. This suggests that the best model is an ARIMA(3,1,0). We now will fit the ARIMA(3,1,0) model and analyze the appropriate diagnostics.

Model Diagnostics on the final ARIMA(3,1,0) Mode

The plots above displays the properties of the residuals for our ARIMA(3,1,0) model that minimized the AIC criterion. The line graph for the residuals over time has no apparent pattern and have roughly equal variance. The ACF plot shows that no lags have an ACF value outside of the 95% confidence interval. The histogram of the residuals is roughly normally distributed and lastly, the normal probability plot has no large variation from the straight line indicating that the data is normally distributed.

Since the ACF plot has no lags outside of the 95% confidence interval, the data can be said to be i.i.d. To check numerically, we will conduct a Ljung-Box test.

## 
##  Box-Ljung test
## 
## data:  residuals(mod)
## X-squared = 7.032, df = 10, p-value = 0.7224

For the Ljung Box test above, the hypotheses are such:

\[H_0: \rho(0) = \rho(1) = ... = \rho(10) = 0\] \[H_a: \text{At least 1 of the } \rho(0),\rho(1),...,\rho(10) \neq 0\] We fail to reject \(H_0\) because the p-value is 0.7224 which is greater than 0.05. There is not sufficient evidence to claim that the sequence is not i.i.d.

Final Model

Table of the estimated parameters and the standard errors:

	Coefficients	Standard Errors
ar1	-0.4113455	0.0738018
ar2	-0.3429713	0.0758651
ar3	-0.2837367	0.0738616

The final model is:

\[X_t =-0.41135X_{t-1} -0.34297X_{t-2} - 0.28374X_{t-3} + \varepsilon_t\]

where \(X_t = \nabla Y_t = Y_t- Y_{t-1}\)

Forecasting temperature anomalies using the final ARIMA(3,1,0) model for year 2016 to 2021

	95% CI lower bound	Predicted value	Standard error of the Predicted value	95% CI upper bound
2016	0.7290739	0.9931122	0.1347134	1.257151
2017	0.6349624	0.9391939	0.1552202	1.243425
2018	0.6225290	0.9459326	0.1650018	1.269336
2019	0.6644738	1.0004318	0.1714072	1.336390
2020	0.6241784	0.9901110	0.1867003	1.356044
2021	0.5822893	0.9740973	0.1999021	1.365905

The above plot is a plot showing how the ARIMA(3,1,0) model fit only on the data from 1850 to 2015 performs on forecasting for the last 6 years in the data set. The blue region is 95% confidence interval for the forecasted values. All true values for the last 6 years are inside of the 95% confidence interval. In addition, as you can see the model typically predicted a lower value for anomaly than the true value but for the year 2018 the forecasted value is very close to the true value.

Method 2: modeling the series with trend using Spline and the rough using ARMA

We first fit the Spline trend and then exam if the rough is stationary using a plot of the rough against time.

Plot the rough against time

The plot of the rough against time showS that the mean of the rough seems to stay at 0. The series has a constant value that the series fluctuates around as well as having roughly equation vertical fluctuation over time. This is assumed to be stationary.

Obtain the ACF and PACF plots

From the ACF plot, we can see that lag1 is outside of the 95% confident interval, and the rest of the lags have 0 value. Hence, a MA(1) is a reasonable model based on the ACF plot.

From the PACF plot, lag1, lag13, lag16 and lag21 are outside of the 95% confident interval. The PACF plot cannot tell us what AR model we should use. So we will use the AIC criterion to select the appropriate model.

Model selection using AIC criterion for estimating the rough

We use the AIC criterion to select the most appropriate ARMA(p,q) model for the data. We consider 16 models with p = 0, 1 ,2 and 3, and q = 0, 1, 2 and 3, where p is the AR order and q is the MA order.

AIC table:

	q = 0	q = 1	q = 2	q = 3
p = 0	-228.5292	-235.6150	-233.6283	-238.1587
p = 1	-234.2372	-233.6198	-233.7616	-254.8977
p = 2	-235.4474	-256.2489	-234.9504	-253.3076
p = 3	-235.8595	-234.6004	-254.0339	-251.3119

Based on the information from the AIC table above, we decided to fit a ARMA(2,1) model for the rough because ARMA(2,1) has the smallest AIC value.

Forecasting Temperature Anomalies using Spline and ARMA (2,1)

	Estimated trend	Estimated rough	Standard error of the Estimated rough	Predicted value
2016	1.354	0.059693842	0.1076235	1.413694
2017	1.577	0.005939844	0.1081116	1.582940
2018	1.800	-0.013887872	0.1110765	1.786112
2019	2.023	-0.017161129	0.1153997	2.005839
2020	2.246	-0.013965717	0.1181625	2.232034
2021	2.469	-0.009367361	0.1193805	2.459633

The plot above shows the forecasts for the temperature anomalies for year 2016 to 2021 for both method 1 and method 2.

Method 1: The green line is the forecast for ARIMA(3,1,0) model using the first difference of the series. As we mentioned at the end of the Method 1 session, all the true values are inside of the 95% confident interval which indicates that the ARIMA(3,1,0) is a good model.

However, we can see that the method 2 didn’t seem to be a good model. The reason is the forecasting values for the last 6 years increases as time increases. But the model can predict the temperature anomaly well for year 2016. May be the method 2 is only good for a very short term forecasting. It is not good for the long term forecasting.

Conclusion and Discussion

In this project, we used the annual temperature anomalies from 1850 to 2015 for the northern hemisphere to forecast the temperature anomalies for year 2016 to 2021. One method we used is fitting a ARIMA(3,1,0) model using the first difference of the series. The other method we used is fitting a Spline model and a ARMA(2,1) model using the original data. Comparing these two methods, we think the ARIMA(3,1,0) is a better model. The reason is all 6 predicted values are close to the 6 true values. However, for method 2, only the year 2016 were predicted reasonably. The forecast trend looks like an upward straight line which seems not reasonable.

If we just focus on the year 2016, method 2 had a better result than method 1. Based on this result, we think that method 2 is still a good model but it is only for short term forecasting. Hence, we concluded that the method 2 is good for short term forecasting, and method 1 is good for a longer term forecasting. But we are not sure if we apply these two methods to a different data set, our conclusion still holds.

Appendix (R Code)

knitr::opts_chunk$set(echo = FALSE)
library(flextable)
library(kableExtra)
library(readxl)
library(forecast)
library(astsa)
library(tseries)
library(Hmisc)
library(dplyr)
data <- read_excel("TempNH_1850_2021.xlsx")
x <- data[, 2]
#summary(x)

par(mfrow = c(1,2))
boxplot(x, ylab = " Temperture anomalies", main = "Boxplot of temperture anomalies ")
my_ts <- ts(x, start=c(1850,1))
plot.ts(my_ts, ylab="Temperature anomalies ", main="Annual temperature anomalies ")

summary_table<-data.frame(summary = c("Minimum","25% quantile","Median","Mean", "75% quantile","Maximum"),
                 A = c( -0.70200, -0.36125, -0.12400, -0.03456, 0.10075,1.27600))
knitr::kable(summary_table,"pipe",col.name=c("","Temperature anomalies"),align=c("l","c"))
x<-unlist(x)
y<-diff(x,1)
plot.ts(y, ylab="Diff_Temp_Anom", main = "First Differenced Series")
par(mfrow = c(1,2))
acf(y, main = "ACF of the first diff. series")
pacf(y, main = "PACF of the first diff. series")
initial_mod<-arima(x,order=c(3,1,1))
par(mfrow = c(2,2))
ts.plot(residuals(initial_mod), ylab="Residuals", main="Residuals vs. Time")
acf(residuals(initial_mod), main="ACF Plot")
hist(residuals(initial_mod), freq = F, main = "Histogram", xlab = "Rough Part")
qqnorm(residuals(initial_mod), main = "Normal Plot")
qqline(residuals(initial_mod))
AIC<-matrix(0,4,4)
for (i in 1:4){
  for (j in 1:4){
      AIC[i,j]<-sarima(x,p=i-1,d=1,q=j-1,details=FALSE)$AIC ##use x, not y
   }
}
rownames(AIC) <- c("AR(0)", "AR(1)", "AR(2)", "AR(3)")
colnames(AIC) <- c("q=0", "q=1", "q=2", "q=3")
AIC_table<-data.frame(P = c("p = 0", "p = 1", "p = 2", "p = 3"),
                 A = c( -0.9229414, -0.9922384, -1.0455561, -1.1233958),
                 B = c( -1.094005, -1.114102, -1.114930,-1.116518),
                 C = c( -1.119878, -1.108236, -1.112556, -1.110752),
                 D = c( -1.108348, -1.109690, -1.114580,-1.103335))
knitr::kable(AIC_table,"pipe",col.name=c("","q = 0", "q = 1", "q = 2", "q = 3"),align=c("l","c", "c", "c", "c"))
## Fitting the ARIMA(3,1,0) model
mod<-arima(x,order=c(3,1,0))
par(mfrow = c(2,2))
ts.plot(residuals(mod), ylab="Residuals", main="Residuals vs. Time")
acf(residuals(mod), main="ACF Plot")
hist(residuals(mod), freq = F, main = "Histogram", xlab = "Rough Part")
qqnorm(residuals(mod), main = "Normal Plot")
qqline(residuals(mod))
test <- Box.test(residuals(mod), lag = 10, type = "Ljung-Box")
test
fin_mod_table <- data.frame(coefs = mod$coef, se = sqrt(diag(mod$var.coef)))
knitr::kable(fin_mod_table,"pipe",col.name=c("Coefficients", "Standard Errors"),align=c("l","c"))
n <- length(x)
#split data
xnew <- x[1:(n-6)]
xlast <- x[(n-5):n]

#fit
model_new <- arima(xnew,order = c(3,1,0))

#prediction
h <- 6
m <- n - h
fcast <- predict(model_new, n.ahead=h)
upper <- fcast$pred+1.96*fcast$se
lower <- fcast$pred-1.96*fcast$se

pred_table<-data.frame(P = c("2016", "2017", "2018", "2019","2020","2021"),lower,fcast,upper)
knitr::kable(pred_table,"pipe",col.name=c("","95% CI lower bound", "Predicted value", "Standard error of the Predicted value", "95% CI upper bound"),align=c("l","c", "c", "c","c"))

plot(2016:2021, xlast,type = "b", col = "black", lty = 2,lwd = 2, ylab = "First_Diff_Temp_Anom", xlab = "Year",main = "Observed and forecasted values from 2016 to 2021", ylim = c(0.5, 1.5))
polygon(x=c(2016:2021,2021:2016), y=c(upper,rev(lower)), col='lightblue', border=NA)
lines(2016:2021, y=fcast$pred,col='blue', type ="b")
lines(2016:2021, y=xlast,col='black',type = "b")
legend("top", legend = c("True","Predicted"), lty=c(1, 1), col = c("black","blue"))

## Use Prof. Burman’s spline trend function that estimates the spline trend.
trend_spline=function(y, lam){ 
 n=length(y)
 p=length(lam)
 rsq=rep(0, p) 
 y=sapply(y,as.numeric) 
 tm=seq(1/n, 1, by=1/n) 
 xx=cbind(tm, tm^2, tm^3) 
 knot=seq(.1, .9, by=.1) 
 m=length(knot)
 for (j in 1:m){ 
 u=pmax(tm-knot[j], 0); u=u^3
 xx=cbind(xx,u)
 }
 for (i in 1:p){
 if (lam[i]==0){ 
 ytran=log(y)
 } else { 
 ytran=(y^lam[i]-1)/lam[i]
 }
 ft=lm(ytran~xx)
 res=ft$resid; sse=sum(res^2)
 ssto=(n-1)*var(ytran)
 rsq[i]=1-sse/ssto
 }
 ii=which.max(rsq); lamopt=lam[ii] 
 if (lamopt==0) {
 ytran=log(y) 
 } else {
 ytran=y^lamopt
 }
 ft=lm(ytran~xx);
 best_ft=step(ft, trace=0)
 fit=best_ft$fitted; res=best_ft$resid
 result=list(ytrans=ytran, fitted=fit, residual=res, rsq=rsq, lamopt=lamopt)
 return(result)
}

## Fitting spline trend
splinetrnd <- trend_spline(xnew, 1)
#plot the rough against time 
plot(splinetrnd$residual, type = "l", xlab = "Time", ylab = expression(paste(X[t])), main = "Plot of the rough vs. Time")
par(mfrow = c(1,2))
acf(splinetrnd$residual, main = "ACF of the rough")
pacf(splinetrnd$residual, main = "PACF of the rough")
AIC<-matrix(0,4,4)
for (i in 1:4){
  for (j in 1:4){
      AIC[i,j]<-arima(splinetrnd$residual,order=c(i-1,0,j-1))$aic
   }
}

rownames(AIC) <- c("p=0", "p=1", "p=2", "p=3")
colnames(AIC) <- c("q=0", "q=1", "q=2", "q=3")

#AICc table
#AIC

#The smallest AICc 
#which(AIC == min(AIC), arr.ind = TRUE)
AIC_table<-data.frame(P = c("p = 0", "p = 1", "p = 2", "p = 3"),
                 A = c( -228.5292, -234.2372, -235.4474, -235.8595),
                 B = c( -235.6150, -233.6198, -256.2489, -234.6004),
                 C = c( -233.6283, -233.7616, -234.9504, -254.0339),
                 D = c( -238.1587, -254.8977, -253.3076, -251.3119))
knitr::kable(AIC_table,"pipe",col.name=c("","q = 0", "q = 1", "q = 2", "q = 3"),align=c("l","c", "c", "c", "c"))
## guessing the trend for these years
new_data<-data[1:166,]
out <- c(2016,2017,2018,2019,2020,2021)
New <- approxExtrap(x=new_data$Year, y=new_data$Anomaly,xout=out)
#New$y
#Forecasting the rough
model <- arima(splinetrnd$residual,order=c(2,0,1))
fcast_r <- predict(model, n.ahead=6)
#fcast_r$pred

upper <- fcast_r$pred+1.96*fcast_r$se
lower <- fcast_r$pred-1.96*fcast_r$se
obs <- as.numeric(tail(x,n=6))
forec_spine <- as.numeric(New$y + fcast_r$pred)

method2_table<-data.frame(P = c("2016", "2017", "2018", "2019","2020","2021"),New$y,fcast_r,forec_spine)
knitr::kable(method2_table,"pipe",col.name=c("","Estimated trend", "Estimated rough", "Standard error of the Estimated rough", "Predicted value"),align=c("l","c", "c", "c","c"))

forec_arima<- as.numeric(fcast$pred)

plot(2016:2021, obs, type = "b", col = "red", lty = 2,lwd = 2, xlab = "Year", ylab = "Temperature anomalies",
main = "Observed and forecasted vs. year", ylim = c(0.8, 2.50))
lines(2016:2021, forec_spine, type = "b", col = "blue", lty = 3, lwd = 3)
lines(2016:2021, forec_arima, type = "b", col = "green", lty = 3, lwd = 3)
legend("topleft", legend = c("Observed", "Forecasted Spline and ARMA (2,1)", "Forecasted ARIMA (3,1,0)"),
col = c("red", "blue","green"), lty = c(2,3), lwd = c(2,3))