## The Beauty of Modelling: How AVES determines the best trees to fight climate change

Modelling is a beautiful thing.

To be clear, I am talking about statistical modelling.

Modelling is a powerful tool that allows us to find and quantify relationships in the world. Indeed, AVES uses modelling to determine which tree species are best for fighting climate change by predicting how much a tree will have grown after a given number of years.

**What is Modelling?**

Simply put, modelling consists of finding the best way to fit a line (or multiple lines) through data. This allows us to determine the value of a variable (the dependent variable) based on one or more other variables (the independent variables). In this case, AVES aims to predict the diameter of a tree based on its age. The diameter will allow us to use more complex equations to estimate the total amount of carbon stored in a tree of a certain species during its lifetime (the tree's carbon sequestration potential).
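To make this concrete, here is a minimal sketch of fitting a straight line through (age, diameter) data. The numbers below are invented for illustration only and are not AVES measurements; the real models are fit in R, but the idea is the same in any language:

```python
import numpy as np

# Hypothetical (age, diameter) measurements -- illustrative values only.
age = np.array([5, 10, 15, 20, 25, 30], dtype=float)      # years
diameter = np.array([4.1, 8.3, 11.9, 16.2, 19.8, 24.1])   # cm

# Fit a straight line: diameter = intercept + slope * age.
slope, intercept = np.polyfit(age, diameter, 1)

# Use the fitted line to predict the diameter of a 12-year-old tree.
predicted = intercept + slope * 12
print(round(slope, 3), round(predicted, 1))
```

The slope is the regression parameter: it tells us how many centimetres of diameter we expect per additional year of growth.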

**Types of Models: Linear and Additive**

Linear and additive models (figures 1 and 2, respectively) are two types of models used in statistical modelling. Linear models are among the most common and useful models available. They give us a formula with a regression parameter that quantifies the relationship between the dependent and independent variables. Although linear models are not limited to straight lines, sometimes models need extra “flexibility” in order to fit more wavering patterns. In that case, we can apply additive models, which use something called “smoothers” to fit the data properly.

**Model Validation:**

The models described above need data that meet certain conditions in order to be valid—in other words, we need to verify that the models do not violate underlying assumptions to be confident that they work properly. These assumptions include equal variance (also called homogeneity), normality, independence, and deterministic variables. Here, we will only discuss variance to keep things brief:

**Variance:**

Variance is the amount of dispersion (the scatter or the spread) in the data. It is one of the most important aspects of statistical modelling because the more variability there is, the greater the uncertainty in the model.

As mentioned earlier, linear and additive models need to meet the homogeneity (equal variance) assumption. However, data often won’t behave like this. Consider an intuitive example: in a 10-second race with five runners, the variance increases as time goes on. At time zero, all the runners are on equal footing at the starting line, and only as time passes can they distance themselves from one another. If we keep things simple and assume no runner overtakes another once the race has started, the variance structure takes on a funnel shape: the fastest runners progressively pull further and further away from the slowest ones as time increases. Although in this example we can clearly see the variance structure when we plot our data (figure 3), a common approach to verifying the variance is to look at the residuals (figure 4), which are the observed values minus the fitted values (i.e. how far each data point is from the line we fit).
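The runner example can be simulated in a few lines. This is a toy sketch (the speeds are made up) showing that the spread between runners grows with time:

```python
import numpy as np

# Five runners with different constant speeds (m/s) -- a toy illustration.
speeds = np.array([6.0, 6.5, 7.0, 7.5, 8.0])
times = np.arange(0, 11)  # 0..10 seconds

# Position of each runner at each time: rows = times, columns = runners.
positions = np.outer(times, speeds)

# Variance across runners at each time step: zero at the start,
# then growing steadily -- the "funnel" shape.
spread = positions.var(axis=1)
print(spread[0], spread[5], spread[10])
```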

The same pattern applies to many ecological phenomena including tree growth. The more time has passed, the more a tree has had the opportunity to grow differently than the others. Therefore, violation of the homogeneity principle is expected and we need to account for this in how we model our data.

**GLS (Generalized Least Squares):**

Using GLS, we can apply various mathematical parameterizations to account for a non-equal spread (also called heterogeneity) in the variance structure. Without going into the details, we will model three variance structures—we’ll call these variance structures 1, 2, and 3 (very creative names, I know).
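As a rough illustration of the idea (not the exact parameterizations AVES uses), here is the simplest possible variance structure: assume the residual variance grows in proportion to age, and down-weight the older, noisier observations accordingly. This is generalized (here, weighted) least squares on simulated data:

```python
import numpy as np

# Simulated data where the spread of diameters grows with age.
rng = np.random.default_rng(42)
age = np.repeat(np.arange(1, 31), 5).astype(float)   # 5 trees per age, 1-30 yrs
diameter = 0.8 * age + rng.normal(0, 0.1 * age)      # noise sd grows with age

# Assume Var(residual) is proportional to age, so weight each point by 1/age
# (analogous in spirit to one of nlme's variance functions).
X = np.column_stack([np.ones_like(age), age])
w = 1.0 / age

# Weighted least squares: solve (X' W X) beta = X' W y.
XtW = X.T * w
beta = np.linalg.solve(XtW @ X, XtW @ diameter)
print(np.round(beta, 3))  # [intercept, slope]; slope should land near 0.8
```

Because the noisiest points no longer dominate the fit, the estimates are more reliable than an ordinary least-squares fit on the same data would be.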

**Model Selection**

Now that we have a set of potential models, how do we decide which one is the best? We simply follow the maxim “everything should be made as simple as possible, but no simpler.” Luckily for us, we can quantify this trade-off using AIC (Akaike’s Information Criterion). AIC finds the "sweet spot" between the fit of a model and its complexity (the number of parameters), then assigns each model a numerical value which reflects its "quality". The model with the lowest AIC score is the best model*.
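In the least-squares setting, AIC can be computed from the residual sum of squares. A small sketch on invented data, comparing an underfit straight line to a quadratic that matches the true curve:

```python
import numpy as np

def aic_ls(residuals, k):
    """AIC for a Gaussian least-squares fit (up to an additive constant):
    n*ln(RSS/n) + 2k, where k counts the estimated parameters."""
    n = len(residuals)
    rss = np.sum(residuals**2)
    return n * np.log(rss / n) + 2 * k

# Toy data following a curve; compare a straight line vs. a quadratic.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0, 1, 100)

line = np.polyval(np.polyfit(x, y, 1), x)
quad = np.polyval(np.polyfit(x, y, 2), x)

aic_line = aic_ls(y - line, k=3)  # intercept, slope, error variance
aic_quad = aic_ls(y - quad, k=4)  # one extra parameter, far better fit
print(aic_line, aic_quad)
```

The quadratic pays a 2-point penalty for its extra parameter, but its much smaller residuals more than make up for it, so it gets the lower (better) AIC.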

So, we take all the models with the different variance structures that we created and compare them to each other using AIC:

| Model | AIC score |
| --- | --- |
| Original Model | 22079.49 |
| Variance Structure 1 | 20689.13 |
| Variance Structure 2 | 20557.77 |
| Variance Structure 3 | 21102.41 |

We see that the best model is the one with variance structure 2 since it has the lowest AIC score.

**Final Model Validation**

Even when using GLS modelling, we still need to validate our model by looking for signs of heterogeneity that could compromise its effectiveness. Since our model now allows for heterogeneity in the ordinary residuals, we instead plot and inspect the standardized residuals. We obtain these by taking the observed values minus the fitted values, then dividing by the square root of the modelled variance. Below are the ordinary residuals (figure 5) and the standardized residuals (figure 6) obtained from Common Hackberry (Celtis occidentalis) data from the Montreal region:

Beautiful! The funnel shape is gone and there are no obvious patterns in the graph. We can now use our model to determine a value for the expected tree diameter of this species after a given number of years*:
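The standardization step described above can be sketched as follows. The residuals, ages, and variance parameter here are made-up numbers, and Var = σ² · age is just one possible variance structure:

```python
import numpy as np

# Ordinary residuals (observed - fitted) at different tree ages --
# illustrative numbers only, not Common Hackberry data.
age = np.array([5.0, 10.0, 20.0, 40.0])
residuals = np.array([0.4, -0.9, 1.7, -3.1])

# Hypothetical variance model: Var(residual) = sigma2 * age.
sigma2 = 0.25  # hypothetical estimated variance parameter

# Standardize: divide each residual by the square root of its modelled variance.
standardized = residuals / np.sqrt(sigma2 * age)
print(np.round(standardized, 2))
```

If the variance model is right, the standardized residuals should show a roughly constant spread across ages, with no funnel shape left.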

The next step is to plug the diameter value, along with other measurements, into a published tree equation to obtain a “bioscore” representing the CO2 sequestration potential for that species.

AVES is continuously improving its models and is now working on developing models that take additional variables into consideration (e.g. habitat and environmental factors). I hope to have your support as I undertake this challenging yet fulfilling endeavor.

Notes:

* There are many important aspects such as data exploration, variable significance, etc. which are not mentioned in this text

* The data points in figure 3 are not independent since we are taking repeated measures, but the example is simply used to understand/visualize variance

* Attributing the lowest AIC score to the best model is a slight oversimplification because it does not account for uninformative parameters, but this is beyond the scope of this text; see Arnold (2010) for details

* Due to lack of data for some species, other means that are not described here were used to obtain diameter readings for these species

Sources:

Arnold, T.W., 2010. Uninformative parameters and model selection using Akaike's Information Criterion. *The Journal of Wildlife Management*, 74(6), pp. 1175–1178.

Burnham, K.P. and Anderson, D.R., 2002. *Model selection and multimodel inference: a practical information-theoretic approach*, 2nd ed. Springer, New York.

Pinheiro, J., Bates, D., DebRoy, S., Sarkar, D. and R Core Team, 2020. *nlme: Linear and Nonlinear Mixed Effects Models*. R package version 3.1-147. https://CRAN.R-project.org/package=nlme

R Core Team, 2020. *R: A language and environment for statistical computing*. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/

Zuur, A., Ieno, E.N., Walker, N., Saveliev, A.A. and Smith, G.M., 2009. *Mixed effects models and extensions in ecology with R*. Springer Science & Business Media.

Images:

Model posing in trees: jovibingelyte from Pixabay

Statistical Models: Patrick J. Turgeon, made using R (2020)

Celtis occidentalis: Chhe (public domain)
