Thursday, March 31, 2011

Empirical confidence intervals on scatter plots

Update: see this instead for a better solution.




Here's a little R-trick for you nerds out there in cyberland, but mainly for me, so I won't have to rediscover this little bit of code.

>xyplot(results3gr[,1]~results3gr[,2], xlim=c(0, 5000), cex=.1, ylim=c(-200, 200))



So you've got a dense scatter plot. Damn, there's points EVERYWHERE on this stupid thing, but how many more are being overplotted along the x axis versus some other blue-colored spot?

Well, how about using a 2D-kernel density estimator to get the relative density of points across a regular grid? Then you could find the ordinates correspond to 95% of the mass at a given slice of the grid.

So, the package KernSmooth should do the trick:

> ks = bkde2D(results3gr, bandwidth=c(10, 30), range.x=list(c(-200, 200), c(0, 1e4)))

Then we need to find the marginal mass at each abscissa of the grid, and the cumulative mass:

> ksmarg = apply(ks$fhat, 2, sum)
> kscum = apply(ks$fhat, 2, cumsum)

Then divide the cumulative mass by the marginal mass, and get a list of linear approximations


> kscond = as.data.frame((t(t(kscum)/ksmarg)))
> kscdf = list()
> for (i in 1:ncol(kscond)){kscdf[[i]] = approxfun(kscond[[i]], ks$x1)}

to get a cumulative distribution function for each abscissa of the grid. Now we get a list of 1% and 99% ordinates for each value in the abscissa:

> confbars = matrix(nrow=2, ncol = ncol(kscond))
> for ( i in 1:ncol(kscond)){confbars[1, i] = kscdf[[i]](.025); confbars[2, i] = kscdf[[i]](.975)}


Now do some lattice witchcraft and plot:

> xyplot(results3gr[,1]~results3gr[,2], panel=function(x, y, ...){panel.xyplot(x, y, cex=.3); panel.xyplot(ks$x2, confbars[1,], type="l", col="black"); panel.xyplot(ks$x2, confbars[2, ], type="l", col="black")}, xlim=c(0, 5e3), ylim=c(-200, 200))



The bars appear to be over-conservative, probably because of sparsity...but oh well. It'll have to do.


Hat tip to Yihui to for the reference to KernSmooth.