Datavis with ggplot2

The ggplot2 package is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts.


1. Introduction: Plotting with ggplot2

First, install and load the ggplot2 package:

install.packages("ggplot2")
library(ggplot2)

For this session, we will explore the iris data that is already pre-loaded in R.

head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

You can read about the data by typing

?iris


1.1. A simple function: qplot

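The general formula is qplot(x, y, ...). Note that qplot() is deprecated as of ggplot2 3.4.0, so recent versions print a warning when you call it. To produce a basic scatter plot, type: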

qplot(x=Sepal.Length, y=Petal.Length, data=iris)   

Add a color per species and a point size that depends on petal width:

qplot(Sepal.Length, Petal.Length, data=iris,
      color=Species, size=Petal.Width, 
      xlab="Sepal", ylab="Petal", main="Iris dataset")

We also add a title and labels for the x- and y-axes, using main, xlab and ylab.

To start the y-axis at 0, add ylim to the call: qplot(..., ylim=c(0,35))
This sets the lower and upper bounds of the y-axis.
You can do the same for the x-axis with xlim.

qplot(Sepal.Length, Petal.Length, data=iris, 
      color=Species, size=Petal.Width, alpha=I(0.7), 
      xlab="Sepal Length", ylab="Petal Length", main="Iris dataset")

By setting the alpha of each point to 0.7, we reduce the effects of overplotting.
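
For comparison, the same plot can be written with ggplot(), which the next section introduces. This is a minimal sketch of one equivalent form, not the only one; the variable name p is just an illustrative choice.

```r
library(ggplot2)

# Same mappings as the qplot() call: color by species, size by petal width
p <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                      color = Species, size = Petal.Width)) +
  geom_point(alpha = 0.7) +   # fixed transparency, set outside aes()
  labs(x = "Sepal Length", y = "Petal Length", title = "Iris dataset")

print(p)
```

Here alpha is set as a plain argument of geom_point() rather than inside aes(), which plays the same role as I(0.7) in the qplot() call.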


1.2. A robust function: ggplot

General formula:
ggplot(data, aes(x, y)) + geom_*()
ggplot() begins a plot that you finish by adding layers with geom_*() functions. It provides more control than qplot().

  • aes: aesthetic mappings, which tie variables to visual properties of the graph
    • common aesthetics: color, fill, shape, size
  • geom: geometric object, the type of mark drawn for the data
    • geom_point; geom_line; geom_bar; geom_histogram; geom_hex; etc.
    • Each geom accepts its own aesthetic mappings and has a default stat and position adjustment: geom_*(aes(color=, fill=, size=, ...))
  • additional elements:
    • You can add a smoothing trend: + geom_smooth(method="lm")
    • You can change the background (themes): + theme_bw(), + theme_classic()

ggplot(mtcars, aes(x=disp, y=mpg)) + geom_point()
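
As a sketch of how these pieces combine, the mtcars scatter plot can be extended with a linear trend and a different theme. The object name p is only for illustration.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +                 # layer 1: the raw observations
  geom_smooth(method = "lm") +   # layer 2: linear fit with its confidence band
  theme_bw()                     # white background with grid lines

print(p)
```

Each + adds one component, so layers and themes can be added or removed independently.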

From now on, we will only work using ggplot.



2. Scatterplot

The examples from here on use the diamonds dataset, which ships with ggplot2.

ggplot(diamonds, aes(x=carat, y=price)) + geom_point()

ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()

Adding color: the color of the points is determined by the clarity of the diamonds.

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()

Here the color is mapped to the cut variable of the diamonds dataset instead.

ggplot(diamonds, aes(x=carat, y=price, color=clarity, size=cut)) + geom_point()

Here the size of the points is mapped to the cut variable. Because cut is a discrete variable, ggplot2 warns that using size for a discrete variable is not advised.

Back to top


3. Faceting

  • Divide your plot up into multiple plots, one for each level.
  • facet_*() function

3.1. Facet wrap

Write a tilde (~), and then the attribute we would like to divide the plots by (here “clarity”)

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + facet_wrap(~ clarity)

3.2. Facet grid

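To divide your graph based on two different attributes, use facet_grid().
Here we could use "color ~ clarity". In a modeling formula the tilde (~) means “explained by”; in facet_grid() the variable to the left of the tilde defines the rows of the grid and the variable to the right defines the columns.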

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point() + facet_grid(color ~ clarity)
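
To facet along only one direction of the grid, a dot can stand in for the missing side of the formula. The sketch below shows both cases; the names base, by_cols and by_rows are illustrative.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

# A single row of panels, one column per clarity level
by_cols <- base + facet_grid(. ~ clarity)

# A single column of panels, one row per color level
by_rows <- base + facet_grid(color ~ .)

print(by_cols)
```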



4. Title and axis

Title: add ggtitle() to the end of the line of code

ggplot(diamonds, aes(carat,price)) + geom_point() + ggtitle("My scatter plot")

Axis label: xlab and ylab

ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("My scatter plot") + xlab("Weight (carats)")

To limit the range of the x or the y axes: xlim or ylim

ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("My scatter plot") + xlab("Weight (carats)") + xlim(0, 2)

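This code restricts the x-axis to diamonds weighing between 0 and 2 carats. Heavier diamonds are dropped from the plot, and ggplot2 prints a warning about the removed rows.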
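
If the goal is only to zoom in without discarding any data, coord_cartesian() is an alternative to xlim(). The sketch below counts the rows that lie outside the 0 to 2 carat range and then builds a zoomed plot that keeps them; n_outside and p_zoom are illustrative names.

```r
library(ggplot2)

# Diamonds heavier than 2 carats, i.e. the rows that xlim(0, 2) would drop
n_outside <- sum(diamonds$carat > 2)
n_outside

# coord_cartesian() changes the visible window without removing any rows
p_zoom <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 2))

print(p_zoom)
```

The difference matters mostly when a layer computes a summary, such as geom_smooth(), because the summary then uses all rows rather than only the visible ones.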

Another possibility is to put one of the axes on a log scale. You can do this with the scale_y_log10() function:

ggplot(diamonds, aes(x=carat, y=price)) + geom_point() + ggtitle("My scatter plot") + xlab("Weight (carats)") + scale_y_log10()



5. Histograms

Using the diamonds dataset, we visualize the distribution of price:

ggplot(diamonds, aes(x=price)) + geom_histogram()

Notice that we do not specify the y-axis in the aesthetic. In a histogram, y is computed for you: by default it is the number of observations that fall within each bin.

You can change the width of each bin as an option, using binwidth inside the geom_histogram() function:

ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=2000)
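
If you want the y-axis of a histogram to show a density rather than a count, one option is to map y to the density computed by the histogram stat. This sketch assumes a ggplot2 version that provides after_stat() (3.3.0 or later); p_count and p_density are illustrative names.

```r
library(ggplot2)

# Default: bar height is the number of diamonds in each bin
p_count <- ggplot(diamonds, aes(x = price)) +
  geom_histogram(binwidth = 2000)

# Rescaled: bar height is a density, so the bar areas sum to 1
p_density <- ggplot(diamonds, aes(x = price, y = after_stat(density))) +
  geom_histogram(binwidth = 2000)

print(p_density)
```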

We can also add all the options that we used with a basic scatter plot. For instance, the facet_wrap() function:

ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=200) + facet_wrap(~ clarity)

Here, all the subplots share the same y-axis. Because some subplots contain far more points than others, the frequencies in the sparser ones are hard to read.

To give each subplot its own y-axis, add scales="free_y":

ggplot(diamonds, aes(x=price)) + geom_histogram(binwidth=200) + facet_wrap(~ clarity, scales="free_y")

The same argument frees the x-axis with scales="free_x", or both axes with scales="free".

To make a stacked histogram, here based on the clarity attribute, add the fill aesthetic inside aes() in the ggplot() call:

ggplot(diamonds, aes(x=price, fill=clarity)) + geom_histogram()



6. Boxplots and violin plots

Boxplots are useful to compare multiple distributions. Compare the distribution of the price within each color using geom_boxplot():

ggplot(diamonds, aes(x=color, y=price)) + geom_boxplot()

Put the y-axis on a log scale: we get a better sense of how the distribution of price differs across multiple colors.

ggplot(diamonds, aes(x=color, y=price)) + geom_boxplot() + scale_y_log10()

A box plot always reduces a distribution to a box and whiskers, whatever its actual shape. We can instead view the distribution as a density using a “violin plot”.

ggplot(diamonds, aes(x=color, y=price)) + geom_violin() + scale_y_log10()

ggplot(diamonds, aes(x=color, y=price)) + geom_violin() + scale_y_log10() + facet_wrap(~ clarity)

We used the facet_wrap() function to display multiple subplots.
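
The two views can also be combined by layering a narrow box plot on top of each violin. This is only a sketch: the width of 0.1 and the hidden outliers are illustrative choices, and p_both is an illustrative name.

```r
library(ggplot2)

p_both <- ggplot(diamonds, aes(x = color, y = price)) +
  geom_violin() +                                    # shape of the distribution
  geom_boxplot(width = 0.1, outlier.shape = NA) +    # narrow box, outlier points hidden
  scale_y_log10()

print(p_both)
```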



7. Output and saving

Build your plot and assign it to a variable. For instance, call it p for plot:

p <- ggplot(diamonds, aes(x=carat, y=price)) + geom_point()

Save that plot to a file:

ggsave(filename="diamonds.png", plot=p)  ## as a png
ggsave(filename="diamonds.pdf", plot=p)  ## as a pdf
ggsave(filename="diamonds.jpeg", plot=p) ## as a jpeg

A useful shortcut: if you have just displayed a plot, as in

ggplot(diamonds, aes(x=carat, y=price)) + geom_point()
ggsave("diamonds.png")

then ggsave() saves that plot by default. Its plot argument defaults to the last plot displayed, so you don't have to tell it which plot you're saving.
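
ggsave() also lets you control the size and resolution of the output. The sketch below writes to a temporary file so that it leaves nothing behind; the file name, dimensions and dpi are illustrative values, not recommendations.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) + geom_point()

# Hypothetical output path; a temporary directory keeps the example self-contained
out_file <- file.path(tempdir(), "diamonds_example.png")

# width and height are in inches by default; dpi sets the raster resolution
ggsave(filename = out_file, plot = p, width = 6, height = 4, dpi = 150)

file.exists(out_file)
```

The file format is inferred from the extension of the file name, which is why the earlier examples only needed to change .png to .pdf or .jpeg.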
