
Bubble scatterplots for overlapping data

Ever have this problem?
You are running a regression, but many of your data points overlap, so the figure doesn't accurately reflect the analysis you want to convey.
Here's an example:

# A sample dataset of two variables, a and b
a=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5)

b=c(2,2,2,4,8,2,4,3,7,4,4,4,6,10,6,6,7,8,8,8,3,8,10,2,8,10,10)

dat=data.frame(a,b)
dat
   a  b
1  1  2
2  1  2
3  1  2
4  1  4
5  1  8
6  1  2
7  2  4
8  2  3
9  2  7
10 2  4
11 2  4
12 2  4
13 3  6
14 3 10
15 3  6
16 3  6
17 3  7
18 4  8
19 4  8
20 4  8
21 4  3
22 4  8
23 5 10
24 5  2
25 5  8
26 5 10
27 5 10

You can easily show that there is a statistically significant positive linear relationship between a and b:
mod=lm(dat$b~dat$a,data=dat)
summary(mod)
...gives you:
Call:

lm(formula = dat$b ~ dat$a, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
-6.3394 -1.0925  0.0874  0.9807  4.5142

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)   2.2724     0.9785   2.322 0.028651 * 
dat$a         1.2134     0.3038   3.993 0.000504 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.247 on 25 degrees of freedom
Multiple R-squared: 0.3895,    Adjusted R-squared: 0.365
F-statistic: 15.95 on 1 and 25 DF,  p-value: 0.0005038



However, if you simply plot this data:
plot(a,b,pch=20)

... the relationship does not look terribly convincing.
This is because the scatterplot hides the overlapping data points that are important to your analysis: the 27 observations fall on only 14 distinct positions. So here is the code to produce a better figure, which scales the size of each "bubble" point according to the number of data points at that position:


# Sample data
a=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5)
b=c(2,2,2,4,8,2,4,3,7,4,4,4,6,10,6,6,7,8,8,8,3,8,10,2,8,10,10)

dat=data.frame(a,b)
dat

# Create an "aggregated" dataset with three columns: variable a, variable b, and N, the number of samples that have that combination of values.
newdat=aggregate(list(N=dat$a),by=list(a=dat$a,b=dat$b),FUN=length)
newdat

# a bubble plot of "newdat"; with inches=FALSE, each circle's radius is N/20 in x-axis units
symbols(newdat$a,newdat$b,circles=newdat$N/20,inches=FALSE,xlab="variable a",ylab="variable b")

# a linear model of the relationship between a and b, from the original dataset ("dat")
mod=lm(dat$b~dat$a,data=dat)
summary(mod)

# draw the regression line on the bubble plot
lines(dat$a,predict(mod),lty=1,lwd=2)
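
One thing to be aware of: with circles=newdat$N/20 the radius grows in proportion to N, so a bubble's area grows with N squared and large counts look more dominant than they are. If you would rather have the area track the count, a possible variation (a sketch, not part of the original recipe) is to scale the radius by the square root of N. The names counts, max_radius and radius are made up for this illustration, and 0.3 is an arbitrary size choice.

```r
# Sample data from the example above
a <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5)
b <- c(2,2,2,4,8,2,4,3,7,4,4,4,6,10,6,6,7,8,8,8,3,8,10,2,8,10,10)
dat <- data.frame(a, b)

# One row per distinct (a, b) pair; N is the number of observations at that pair
counts <- aggregate(list(N = dat$a), by = list(a = dat$a, b = dat$b), FUN = length)

# Sanity check: the counts must add up to the number of rows in dat
stopifnot(sum(counts$N) == nrow(dat))

# Radius proportional to sqrt(N), so that area is proportional to N.
# max_radius (in x-axis units) is an arbitrary choice for this sketch.
max_radius <- 0.3
radius <- max_radius * sqrt(counts$N / max(counts$N))

# Bubble plot with area-scaled circles, plus the fitted regression line
symbols(counts$a, counts$b, circles = radius, inches = FALSE,
        xlab = "variable a", ylab = "variable b")
abline(lm(b ~ a, data = dat), lwd = 2)
```

Here abline() draws the fitted line across the full width of the plot, whereas lines(dat$a,predict(mod)) only spans the observed range of a; either works for a simple one-predictor model.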



