Datavis with ggplot2
The ggplot2 package is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts.
Overview
- Introduction: Plotting with ggplot2
- Scatterplot
- Faceting
- Title and axis
- Histograms
- Boxplots and violin plots
- Output and saving
1. Introduction: Plotting with ggplot2
First, install and load the ggplot2 package
install.packages("ggplot2")
library(ggplot2)
For this session, we will explore the iris
data that is already pre-loaded in R.
head(iris)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
You can read about the data by typing
?iris
The general formula is qplot(x, y, ...)
. To produce a basic scatter plot, type
qplot(x=Sepal.Length, y=Petal.Length, data=iris)
Adding color per species and size depending of petals width
qplot(Sepal.Length, Petal.Length, data=iris,
color=Species, size=Petal.Width,
xlab="Sepal", ylab="Petal", main="Iris dataset")
We also add a title and labels for the x and y-axis, using main
, xlab
and ylab
.
To start at 0 for the y-axis: add qplot(...,ylim=c(0,35))
This set lower and upper bounds for y axis.
You can do the same for the x-axis
qplot(Sepal.Length, Petal.Length, data=iris,
color=Species, size=Petal.Width, alpha=I(0.7),
xlab="Sepal Length", ylab="Petal Length", main="Iris dataset")
By setting the alpha of each point to 0.7, we reduce the effects of overplotting.
1.2. A robust function: ggplot
General formula:
ggplot(data, aes(x,y)) + geom_*()
ggplot begins a plot that you finish by adding layers to, using geom()
. ggplot provides more control than qplot().
- aes: aesthetic, visual properties of the graph
- options aes: color, fill, shape, size
- geom: graphical property
- geom_line; geom_bar; geom_histogram
- geom_chart; geom_hex, geom_c(point,line) etc. You can specify for each geom the aesthetic mappings, and a default stat and position adjustment:
geom_*(aes(color=, fill=, size=....))
- additional elements :
- You can add a smoothing trend : +
geom_smooth (method="lm")
- You can change the background (Themes):
+ theme_bw()
, `+theme_classic()
- You can add a smoothing trend : +
ggplot(mtcars,aes(x=disp,y=mpg))+ geom_point()
From now on, we will only work using ggplot.
2. Scatterplot
ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()
Adding color: the color of the points is determined by the clarity of the diamonds.
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()
Here we changed the color by the parameter “cut” of the dataset diamonds.
ggplot(diamonds, aes(x=carat, y=price, color=clarity, size=cut)) + geom_point()
Here we changed the size of the points according to the parameter cut.
3. Faceting
- Divide your plot up into multiple plots, one for each level.
facet_*()
function
Write a tilde (~), and then the attribute we would like to divide the plots by (here “clarity”)
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + facet_wrap(~ clarity)
To divide your graph based on two different attributes: facet_grid()
Here we could use "color ~ clarity"
. The tilde (~) means “explained by.”
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + facet_grid(color ~ clarity)
4. Title and axis
Title: add to the end of the line of code ggtile()
ggplot(diamonds, aes(carat,price)) + geom_point() + ggtitle("My scatter plot")
Axis label: xlab
and ylab
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("My scatter plot") + xlab("Weight (carats)")
To limit the range of the x or the y axes: xlim
or ylim
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("My scatter plot") + xlab("Weight (carats)") + xlim(0, 2)
This code limits the diamonds weight from 0 to 2 carats.
Another possibility is to put one of the axes on a log scale. You can do this with the scale_y_log10()
function:
ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("My scatter plot") + xlab("Weight (carats)") + scale_y_log10()
5. Histograms
Using the data set diamonds, we visualize the density of the distribution of price:
ggplot(diamonds, aes(x=price)) + geom_histogram()
You notice that we do not precise the y axis in the aesthetic. Indeed, using histograms to visualize the density of distribution, y is directly equal to the density within each bin!
You can change the width of each bin as an option, using binwidth
inside the geom_histogram()
function:
ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=2000)
And we can add all the options that we used with a basci scatter plot. For instance, the facet_wrap()
function:
ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=200) + facet_wrap(~ clarity)
Here, all the subplots share the same y-axis: this makes it hard to interpret the frequencies…
Some subplots have far more points than others.
To have different y-axis for each subplot, add scale="free_y"
ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=200) + facet_wrap(~ clarity, scale="free_y")
The same function can be used differentiating for the x-axis.
To make a stacked histogram based here on the clarity attribute: Try adding the fill
aesthetic within the aesthetic of the ggplot()
function.
ggplot(diamonds, aes(x=price, fill=clarity)) + geom_histogram()
6. Boxplots and violin plots
Boxplots are useful to compare multiple distributions. Compare the distribution of the price within each color using geom_boxplot()
:
ggplot(diamonds, aes(x=color, y=price)) + geom_boxplot()
Put the y-axis on a log scale: we get a better sense of how the distribution of price differs across multiple colors.
ggplot(diamonds, aes(x=color, y=price)) + geom_boxplot() + scale_y_log10()
A box plot will always look like a square. We can instead view the distribution as a density using a “violin plot”.
ggplot(diamonds, aes(x=color, y=price)) + geom_violin() + scale_y_log10()
ggplot(diamonds, aes(x=color, y=price)) + geom_violin() + scale_y_log10() + facet_wrap(~ clarity)
We used the facet_wrap()
function to display multiple subplots.
7. Output and saving
Run your line of code and save it to a variable. For instance, call it p for plot:
p = ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
Save that plot to a file
ggsave(filename="diamonds.png", p) ## as a png
ggsave(filename="diamonds.pdf", p) ## as a pdf
ggsave(filename="diamonds.jpeg", p) ## as a jpeg
One useful shortcut is that if you just displayed a plot, like in a line like this:
ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
ggsave("diamonds.png")
Then ggsave()
will know to save that plot by default when you perform ggsave
- you don’t even have to tell it which plot you’re saving.