Entries by Frank Davenport (10)

Wednesday, March 14, 2012

Plotting forecast() objects in ggplot part 1: Extracting the Data

Lately I've been using Rob J Hyndman's excellent forecast package. The package comes with some built-in plotting functions, but I found I wanted to customize and make my own plots in ggplot. In order to do that, I need a generalizable function that will extract all the data I want (forecasts, fitted values, training data, actual observations in the forecast period, confidence intervals, et cetera) and place it into a data.frame with a properly formatted date field (i.e., not a ts() object).

The function below does all that and should work for any forecast object (though I've only tested it on Arima() outputs). The only arguments it takes are the original observations and the forecast object (whatever results from calling forecast()). In my next post I'll give some examples of plotting the results using ggplot and explain why I wanted more than the default plot.forecast() function.

 

#--Produces a data.frame with the source/training data, fitted values, forecast values, and the forecast confidence intervals
funggcast<-function(dn,fcast){ 
	require(zoo) #needed for the 'as.yearmon()' function
 
	en<-max(time(fcast$mean)) #extract the max date used in the forecast
 
	#Extract Source and Training Data
	ds<-as.data.frame(window(dn,end=en))
	names(ds)<-'observed'
	ds$date<-as.Date(time(window(dn,end=en)))
 
	#Extract the Fitted Values (need to figure out how to grab confidence intervals)
	dfit<-as.data.frame(fcast$fitted)
	dfit$date<-as.Date(time(fcast$fitted))
	names(dfit)[1]<-'fitted'
 
	ds<-merge(ds,dfit,all.x=T) #Merge fitted values with source and training data
 
	#Extract the Forecast values and confidence intervals
	dfcastn<-as.data.frame(fcast)
	dfcastn$date<-as.Date(as.yearmon(row.names(dfcastn)))
	names(dfcastn)<-c('forecast','lo80','hi80','lo95','hi95','date')
 
	pd<-merge(ds,dfcastn,all.x=T) #final data.frame for use in ggplot
	return(pd)
 
}
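As a quick illustration (this example is mine, not from the original post), here is a sketch of calling the function on R's built-in AirPassengers series; the ARIMA specification and two-year holdout are arbitrary choices.

#--Example call (illustrative only; the series and model spec are my own choices)
library(forecast) #provides Arima() and forecast(); funggcast() pulls in zoo itself
train<-window(AirPassengers,end=c(1958,12)) #hold out the last two years as the forecast period
fit<-Arima(train,order=c(1,1,1),seasonal=c(0,1,1))
fc<-forecast(fit,h=24)
pd<-funggcast(AirPassengers,fc) #one data.frame with observed, fitted, forecast, and interval columns
head(pd)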


 

Monday, March 12, 2012

Property rights and the economic origins of the Sicilian mafia

This is fascinating. Originally posted on FT Alphaville, the paper seems like a great example of work that intersects econ, physical geography, and political science. I'll just post the link for now and revisit once I've had time to read the actual paper.

Friday, March 9, 2012

Recovering Marginal Effects and Standard Errors of Interaction Terms Pt. II: Implement and Visualize

In the last post I presented a function for recovering marginal effects of interaction terms. Here we implement the function with simulated data and plot the results using ggplot2.  

 

 

#---Simulate Data and Fit a linear model with an interaction term
y<-rnorm(100,5,1)
x<-rnorm(100,5,1)
d<-data.frame(y=y,x=x,fac=sample(letters[1:3],100,replace=T))
 
mod<-lm(y~x*fac,data=d)
 
#========================================================
 
#---Extract the Main Effects, including the baseline, into a data.frame
dusp<-funinteff(mod,'x') #returns a data.frame of the Estimate and Standard Error, row.names correspond to the variables
 
#----Now Set the data up to visualize in ggplot-----
library(ggplot2)
#------Quick ggplot (move into graph code later)
#quick convenience function to compute significance at .95
funsig<-function(d){
	tstat<-abs(d$b/d$se)
	sig<-ifelse(tstat>=1.96,'yes','no')
	return(sig)
}
 
 
names(dusp)[1:2]<-c('b','se') #change the names to make typing easier
 
#Add confidence intervals and significance test
dusp$hi<-dusp$b+1.96*dusp$se
dusp$lo<-dusp$b-1.96*dusp$se
dusp$sig95<-funsig(dusp)
 
dusp$var<-row.names(dusp)
 
 
pd<-dusp
 
p1<-ggplot(data=pd,aes(x=var,y=b,shape=sig95))
p1<-p1+geom_hline(yintercept=0,col='grey')+geom_line()
p1<-p1+geom_pointrange(aes(ymin=lo,ymax=hi)) #+coord_flip() #uncomment coord_flip to switch the axes
p1<-p1+scale_y_continuous(name='Marginal Effect of Interaction Terms')
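One small addition of my own: if this is run from a script rather than an interactive session, the plot has to be printed (or saved) explicitly, e.g.:

print(p1) #explicitly draw the plot when sourcing a script
ggsave('marginal_effects.png',p1,width=7,height=5) #optionally write it to disk; the file name is just a placeholder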


 

Monday, March 5, 2012

Recovering Marginal Effects and Standard Errors from Interaction Terms in R

When I fit models with interactions, I often want to recover not only the interaction effect but also the marginal effect (the main effect + the interaction) and, of course, the standard errors. There are a couple of ways to do this in R, but I ended up writing my own function (essentially a wrapper around the deltaMethod() function) to fit my needs.

In this case, I have a model where a continuous variable has been interacted with a discrete variable. I want to create a data.frame that stores the main effect, the marginal effects of the interactions, and their standard errors.

I also want the standard errors for the marginal effects of the interactions to match the standard errors on the main effect, which in this case were HAC (Heteroskedasticity and Autocorrelation Consistent, i.e., Newey-West).

The function below is a little sloppy, as I built it iteratively while working through a specific problem, but it is generalized to any lm()-type object. The basic use case is that you have one or more models, each with one or more interactions, and the discrete terms in the interactions have more than two levels (it will work with two levels, but in that case the function might be overkill). In my case I fit a series of region-specific panel models where two of the main effects were interacted with a year term. I wanted a generalizable function where I could just feed it the model object, variable name, and covariance estimator and get back a data.frame with estimates and standard errors. In the next post I'll demonstrate the function with simulated data and show how to visualize the marginal effects in ggplot.

 Note: The code below assumes you have already loaded car (for deltaMethod()), sandwich (for vcovHAC), and lmtest (for coeftest()).
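For completeness, that just means something like:

library(car)      #deltaMethod()
library(sandwich) #vcovHAC()
library(lmtest)   #coeftest()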

 

##Convenience wrapper around coeftest() (from lmtest) that plugs in a covariance estimator from the sandwich package
sehac<-function(fit,vcov=vcovHAC){ #Convenience function for HAC standard errors
	coeftest(fit,vcov)
}
 
funinteff<-function(mod,var,vcov=vcovHAC){ #mod is an lm() object, var is the name of the main effect that was interacted, vcov is the variance-covariance estimator you want to use
	#Extract coefficient names and create 'beta names' to feed to deltaMethod()
	cnames<-coef(mod)
	pnams<-data.frame('b'=paste('b',0:(length(cnames)-1),sep=""),'est'=cnames) #assign parameter names so that deltaMethod does not throw an error
 
	#Extract the specific parameters of interest
	vars<-grep(var,names(cnames),value=T)
	var1<-vars[1]
	intvars<-vars[2:length(vars)]
	bi<-pnams[var1,'b']
 
	#--Create Data Frame to store Main Effect
	int<-sehac(mod,vcov=vcov)[var1,c('Estimate','Std. Error')]
	int<-as.data.frame(t(int))
	names(int)<-c('Estimate','SE')
	row.names(int)<-var1
 
	#Loop through and store the results in a data.frame
	for(i in 1:length(intvars)){
		bint<-pnams[intvars[i],'b']
		eq<-paste(bi,bint,sep="+")
		interac<-deltaMethod(mod,eq,parameterNames=pnams[,1],vcov=vcov)
		row.names(interac)<-intvars[i]
 
		int<-rbind(int,interac)
	}
	return(int)
}
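One further note of my own: since the example in the post above uses the default HAC estimator, swapping in a different covariance estimator from sandwich is just a matter of passing it through the vcov argument, e.g. a plain heteroskedasticity-consistent estimator (mod and 'x' refer to the simulated example above; depending on your version of car, deltaMethod() may want a covariance matrix rather than a function):

dusp_hc<-funinteff(mod,'x',vcov=vcovHC) #same call as before, but with vcovHC instead of vcovHAC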


 

Thursday, November 11, 2010

Spatial Tunnel Vision

Geography has a problem I call 'spatial tunnel vision'. In a nerdy nutshell, you could summarize it as the failure to treat the absence of Tobler's Law as a null hypothesis that should be rejected. Somewhat more plainly put: we always assume spatial correlation, and, what's worse, we almost always assume that correlation is the result of some underlying spatial process rather than the result of some unobserved and spatially concentrated influence.

For example, unemployment often exhibits high spatial correlation. Does this mean that if your neighbor loses her job it will likely happen to you too, simply because you live next door? No: people with similar incomes, and often those who work in similar industries, tend to live in close proximity to one another. So the same forces that impacted your neighbor's employer might also impact yours.

This may seem like an obvious argument, but I see this mistake a lot. People, usually geographers, use the fact that many things are spatially correlated to argue that 'place matters' and 'spatial is special'. Well, sort of. In the above example, yes, place does matter, but only as a special case of cross-sectional correlation that, if properly controlled for, will not affect whatever broader statistical inference you want to make with the data.

I think this problem stems from the back seat that Geography took in academia for much of the 20th century. Prior to the quantitative revolution (in geography), and especially before the availability of consumer-accessible GIS packages, Geographers spent a lot of time persuading others that what they did was relevant. This has created a great sense of camaraderie within the community but, especially after the rise of GIS and Remote Sensing, it has also created a sense that every other discipline is misguided until it sees the spatial light.

I used to feel this way. As a PhD student in a prominent geography department, I see (and sympathize with) this view a lot. It was not until I really delved into statistics and started working with economists that I realized that yeah, spatial is special, but not all the time, and the burden of proof lies with the geographers. And even if spatial is not special, there may still be a lot of interesting questions to ask of spatially correlated data.
