After a well-deserved vacation involving a trip back home to Chicago, an ArcGIS course and a wedding (not mine), I find myself getting ready to return to work. Having just finished that online ArcGIS course, I have been spending my free time trying to get my hands on any and all books that could enlighten me on even more creative uses of this program for biologists. With all this newfound knowledge, I am ready for and anticipating a new project. Upon returning to the office, my project is set before me! The next question is, can I do it?
At first my heart plummets as the project is explained to me. The sudden drop is not due to disappointment at being assigned a boring project, or even one that doesn’t relate to my interests; it is caused by the intensity of the project. The project is exactly what I wanted, an outlet to try my newfound knowledge in ArcGIS and continue expanding my repertoire of R commands, but now, sitting in front of my mentors, the task seems overly daunting. The explanation is peppered with statistical and technical terms I have never heard of and requests to “write an application”, “figure out the best algorithm”, and “obviously make sure to statistically analyze your models using an AIC”. As the mentors leave with promises of getting me a few papers and the data sets, I sit there in shell shock. The world is spinning around me. Doubt slowly creeps in; the fear of failure blinds me. I take a deep breath. Focus. I’ve got this.
This brings me to a small tangent. BIOLOGISTS NEED MORE MATH in their training. I know that oftentimes students get to choose some of their courses and one could specialize in math, but what about the core courses? I do not particularly like math, but as I work on different projects as an ecologist I have been reminded how essential truly understanding statistics is to experimental design. Statisticians and GIS professionals are good at what they do, but oftentimes don’t seem to grasp the underlying biological concepts well enough to be of much use to biologists. So, in essence, I think that this is just my little wake-up call to each and every one of you to become a stronger biologist by solidifying an intimate knowledge of statistics, modeling and programming. I did not think I would be going back to school (I just finished my Masters), but it seems that this internship might have gotten me super stoked about ecological modeling. Any good programs you guys can recommend?
Anyway, my mentor soon returns with a thumb drive loaded with all sorts of goodies. Welcome back! The equivalent of 3 to 4 inches of manila folders filled with reading and data sets gets thrown on my desk.
The key to succeeding at this project, and the hardest part, is starting. With that in mind, I start out by determining what exactly AIC stands for and how I use it to analyze my models.
With the help of The R coding Handbook and Wikipedia (oh, how the students I TA’d would love to yell at me for this, as I once did at them) I compiled a little fact sheet about AIC.
I do not expect this to be very useful to you guys, but since I wrote it out for myself, I figured I would not be selfish and would share it.
AIC Definition, History and R coding
The Akaike information criterion, also known as AIC, is a measure of the relative quality of a statistical model for a given set of data. It is, more officially, known in the statistics trade as a “penalized log-likelihood”. It weighs complexity against goodness of fit, providing a means for model selection. It is important to remember that this criterion cannot test a hypothesis or provide an absolute test telling us how well the models fit the data; it just tells you which ones fit better relative to the others. It was formally published in 1974 by Hirotugu Akaike, and Burnham & Anderson’s 2002 book on model selection did a lot to bring it to biologists.
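The AIC is:
AIC = 2K - 2 ln(L)
K = number of parameters in the model (for a regression, the coefficients plus the estimated variance)
L = maximum value of the likelihood function for the model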
*The best model is the one with the lowest AIC, and simpler is better (too many parameters in an equation are penalized, while goodness of fit is rewarded)
**This form assumes a large sample size (number of observations); for a small sample there is a correction, AICc = AIC + 2K(K+1)/(n-K-1), where n is the sample size
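A minimal sketch of the small-sample correction in R, using the standard formula AICc = AIC + 2K(K+1)/(n-K-1). The helper name aicc() is hypothetical (it is not a base R function), and the built-in cars data set is only a stand-in for real data.

```r
# aicc() is a hypothetical helper, not part of base R.
aicc <- function(model) {
  n <- nobs(model)                  # number of observations
  k <- attr(logLik(model), "df")    # parameters R counts, including the variance
  AIC(model) + (2 * k * (k + 1)) / (n - k - 1)
}

# cars ships with R and is used here purely as an example data set
fit <- lm(dist ~ speed, data = cars)
AIC(fit)
aicc(fit)   # a little larger than AIC(fit); the gap shrinks as n grows
```

With lots of observations the two numbers are nearly identical, which is why the plain AIC is usually fine for large data sets.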
In practice, once you have calculated the AIC for each of your models, you have to decide which ones lose the least information. This is done by looking at the relative likelihood of each model, that is, how probable it is that the model in question minimizes information loss compared with the best one: exp((AICmin-AIC)/2).
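Here is a rough sketch of that relative likelihood, exp((AICmin-AIC)/2), for a handful of candidate models. The three models and the built-in cars data set are assumptions chosen only so the example runs on its own; they are not the models from my project.

```r
# Candidate models on the built-in cars data (stand-ins for real models)
fits <- list(
  null      = lm(dist ~ 1, data = cars),
  linear    = lm(dist ~ speed, data = cars),
  quadratic = lm(dist ~ speed + I(speed^2), data = cars)
)

aic <- sapply(fits, AIC)               # named vector of AIC values

# Relative likelihood: exactly 1 for the lowest-AIC model, below 1 for the rest
rel_lik <- exp((min(aic) - aic) / 2)

round(cbind(AIC = aic, rel_lik = rel_lik), 3)
```

A model with a relative likelihood of, say, 0.3 is 0.3 times as probable as the best model to minimize the information loss.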
R CODING
#Getting AIC
data <- read.table("—–", header = TRUE)
attach(data)
names(data)  # the names are growth and tannin, let's pretend
model <- lm(growth ~ tannin)  # fit the linear regression model for this data in R
# now to define all the variables in the equation
n <- length(growth)
sse <- sum((growth - fitted(model))^2)
s2 <- sse / (n - 2)
s <- sqrt(s2)
# computing the log likelihood
# (R's own logLik() divides sse by n rather than n - 2, so its value can differ slightly)
loglik <- -(n / 2) * log(2 * pi) - n * log(s) - sse / (2 * s2)
# now to calculate the AIC: -2 * loglikelihood + 2 * (p + 1)
p <- 2  # number of parameters in the model (intercept and slope)
-2 * loglik + 2 * (p + 1)
# Once you have the AIC for each model, you want to compare them to each other
model.1 <- lm(Fruit ~ Grazing * Root)  # with the interaction
model.2 <- lm(Fruit ~ Grazing + Root)  # without the interaction
AIC(model.1, model.2)
# if you have more than two models you do this
# (model.3 and model.4 would be fitted the same way as the two above):
models <- list(model.1, model.2, model.3, model.4)
aic <- unlist(lapply(models, AIC))  # this extracts the AIC values; aic is a vector in which you can search for the minimum
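As a sanity check on the hand calculation above, this sketch compares -2 * loglikelihood + 2 * (p + 1) with what R's AIC() reports. It assumes the maximum-likelihood variance (sse divided by n, not n - 2), because that is what R's logLik() uses for lm models, and it uses the built-in cars data set as a stand-in since my own data is not included here.

```r
# cars ships with R; dist ~ speed is only an example model
model <- lm(dist ~ speed, data = cars)

n     <- nobs(model)
sse   <- sum(residuals(model)^2)
s2_ml <- sse / n                      # maximum-likelihood variance (divides by n)

# Gaussian log likelihood evaluated at the fitted values and s2_ml
loglik <- -(n / 2) * log(2 * pi) - (n / 2) * log(s2_ml) - sse / (2 * s2_ml)

p <- length(coef(model))              # intercept and slope
aic_by_hand <- -2 * loglik + 2 * (p + 1)

all.equal(loglik, as.numeric(logLik(model)))  # hand value vs. R's logLik()
all.equal(aic_by_hand, AIC(model))            # hand value vs. R's AIC()
```

If you divide by n - 2 instead, as in the hand calculation, the numbers come out close but not identical to R's.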
More on this project to come. See you next time! That is, if the monsoons do not wash me and my computer away.