---
jupytext:
  cell_metadata_filter: -all
  formats: py:percent,md:myst,ipynb
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.13.8
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Effective data visualization {#viz}

```{r viz-setup, include = FALSE}
library(tidyverse)
library(cowplot)
library(knitr)
library(kableExtra)
library(magick)


knitr::opts_chunk$set(fig.align = "center")
```

## Overview 
This chapter will introduce concepts and tools relating to data visualization
beyond what we have seen and practiced so far.  We will focus on guiding
principles for effective data visualization and explaining visualizations
independent of any particular tool or programming language.  In the process, we
will cover some specifics of creating visualizations (scatter plots, bar
plots, line plots, and histograms) for data using R. 

## Chapter learning objectives

By the end of the chapter, readers will be able to do the following:

- Describe when to use the following kinds of visualizations to answer specific questions using a data set:
    - scatter plots
    - line plots
    - bar plots 
    - histogram plots
- Given a data set and a question, select from the above plot types and use R to create a visualization that best answers the question.
- Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question.
- Referring to the visualization, communicate the conclusions in non-technical terms.
- Identify rules of thumb for creating effective visualizations. 
- Define the three key aspects of ggplot objects:
    - aesthetic mappings
    - geometric objects
    - scales
- Use the `ggplot2` package in R to create and refine the above visualizations using:
    - geometric objects: `geom_point`, `geom_line`, `geom_histogram`, `geom_bar`, `geom_vline`, `geom_hline`
    - scales: `xlim`, `ylim`
    - aesthetic mappings: `x`, `y`, `fill`, `color`, `shape`
    - labeling: `xlab`, `ylab`, `labs`
    - font control and legend positioning: `theme`
    - subplots: `facet_grid`
- Describe the difference in raster and vector output formats.
- Use `ggsave` to save visualizations in `.png` and `.svg` format.

## Choosing the visualization
#### *Ask a question, and answer it* {-}

The purpose of a visualization is to answer a question
\index{question!visualization} about a data set of interest. So naturally, the
first thing to do **before** creating a visualization is to formulate the
question about the data you are trying to answer.  A good visualization will
clearly answer your question without distraction; a *great* visualization will
suggest even what the question was itself without additional explanation.
Imagine your visualization as part of a poster presentation for a project; even
if you aren't standing at the poster explaining things, an effective
visualization will convey your message to the audience.

Recall the different data analysis questions 
from Chapter \@ref(intro). 
With the visualizations we will cover in this chapter, 
we will be able to answer *only descriptive and exploratory* questions. 
Be careful to not answer any *predictive, inferential, causal* 
*or mechanistic* questions with the visualizations presented here, 
as we have not learned the tools necessary to do that properly just yet.  

As with most coding tasks, it is totally fine (and quite common) to make
mistakes and iterate a few times before you find the right visualization for
your data and question. There are many different kinds of plotting
graphics available to use (see Chapter 5 of *Fundamentals of Data Visualization* [@wilkeviz] for a directory). 
The types of plot that we introduce in this book are shown in Figure \@ref(fig:plot-sketches);
which one you should select depends on your data 
and the question you want to answer. 
In general, the guiding principles of when to use each type of plot 
are as follows:

\index{visualization!line}
\index{visualization!histogram}
\index{visualization!scatter}
\index{visualization!bar}

- **scatter plots** visualize the relationship between two quantitative variables
- **line plots** visualize trends with respect to an independent, ordered quantity (e.g., time)
- **bar plots** visualize comparisons of amounts
- **histograms** visualize the distribution of one quantitative variable (i.e., all its possible values and how often they occur) \index{distribution}

```{r plot-sketches, echo = FALSE, fig.width = 4.5, fig.height = 4.65, fig.align = 'center', fig.cap = "Examples of scatter, line and bar plots, as well as histograms."}
set.seed(1)

scatter_plot <- tibble(x = seq(0.25, 10, by = 0.5) + rnorm(20, 1, 1.5),
                       y = seq(0.25, 10, by = 0.5) + rnorm(20, 1, 0.5)) |>
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 12), breaks = seq(0, 12, by = 3)) +
  scale_y_continuous(limits = c(0, 12), breaks = seq(0, 12, by = 3)) +
  ggtitle("Scatter plot") +
  theme_classic()

line_plot <- tibble(x = seq(0.5, 10, by = 1) + rnorm(10, 1, 0.5),
                       y = seq(0.5, 10, by = 1) + rnorm(10, 1, 0.1)) |>
  ggplot(aes(x = x, y = y)) +
  geom_line() +
  scale_x_continuous(limits = c(0, 12), breaks = seq(0, 12, by = 3)) +
  scale_y_continuous(limits = c(0, 12), breaks = seq(0, 12, by = 3)) +
  ggtitle("Line plot") +
  theme_classic()

bar_plot <- tibble(count = c(35, 27, 21),
                   category = as_factor(c("Group 1", "Group 2", "Group 3"))) |>
  ggplot(aes(y = count, x = category)) +
  geom_bar(stat = "identity") +
  ggtitle("Bar plot") +
  theme_classic()

histogram_plot <- tibble(measurements = rnorm(200, 25, 5)) |>
  ggplot(aes(x = measurements)) +
  geom_histogram(binwidth = 3) +
  ggtitle("Histogram") +
  theme_classic()

plot_grid(scatter_plot,
          line_plot,
          bar_plot,
          histogram_plot,
          ncol = 2, 
          greedy = FALSE)
```

All types of visualization have their (mis)uses, but three kinds are usually
hard to understand or are easily replaced with an oft-better alternative.  In
particular, you should avoid **pie charts**; it is generally better to use
bars, as it is easier to compare bar heights than pie slice sizes.  You should
also not use **3-D visualizations**, as they are typically hard to understand
when converted to a static 2-D image format. Finally, do not use tables to make
numerical comparisons; humans are much better at quickly processing visual
information than text and math. Bar plots are again typically a better
alternative.

+++

## Refining the visualization
#### *Convey the message, minimize noise* {-}

Just being able to make a visualization in R with `ggplot2` (or any other tool
for that matter) doesn't mean that it effectively communicates your message to
others. Once you have selected a broad type of visualization to use, you will
have to refine it to suit your particular need.  Some rules of thumb for doing
this are listed below. They generally fall into two classes: you want to 
*make your visualization convey your message*, and you want to *reduce visual noise*
as much as possible. Humans have limited cognitive ability to process
information; both of these types of refinement aim to reduce the mental load on
your audience when viewing your visualization, making it easier for them to
understand and remember your message quickly.

**Convey the message**

- Make sure the visualization answers the question you have asked most simply and plainly as possible.
- Use legends and labels so that your visualization is understandable without reading the surrounding text.
- Ensure the text, symbols, lines, etc., on your visualization are big enough to be easily read.
- Ensure the data are clearly visible; don't hide the shape/distribution of the data behind other objects (e.g.,  a bar).
- Make sure to use color schemes that are understandable by those with
  colorblindness (a surprisingly large fraction of the overall 
  population&mdash;from about 1% to 10%, depending on sex and ancestry [@deebblind]).
  For example, [ColorBrewer](https://colorbrewer2.org) 
  and [the `RColorBrewer` R package](https://cran.r-project.org/web/packages/RColorBrewer/index.html) [@RColorBrewer]  
  provide the ability to pick such color schemes, and you can check
  your visualizations after you have created them by uploading to online tools
  such as a [color blindness simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/).
- Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience.

**Minimize noise**

- Use colors sparingly. Too many different colors can be distracting, create false patterns, and detract from the message. 
- Be wary of overplotting. Overplotting is when marks that represent the data
  overlap, and is problematic as it prevents you from seeing how many data
  points are represented in areas of the visualization where this occurs. If your
  plot has too many dots or lines and starts to look like a mess, you need to do
  something different.
- Only make the plot area (where the dots, lines, bars are) as big as needed. Simple plots can be made small.
- Don't adjust the axes to zoom in on small differences. If the difference is small, show that it's small!

+++

## Creating visualizations with `ggplot2` 
#### *Build the visualization iteratively* {-}

This section will cover examples of how to choose and refine a visualization given a data set and a question that you want to answer, 
and then how to create the visualization in R \index{ggplot} using `ggplot2`.  To use the `ggplot2` package, we need to load the `tidyverse` metapackage.

```{r 03-tidyverse, warning=FALSE, message=FALSE}
library(tidyverse)
```

```{r 03-warn-off, echo = FALSE, results = 'hide', message = FALSE, warning = FALSE}
options(warn = -1)
```

### Scatter plots and line plots: the Mauna Loa CO$_{\text{2}}$ data set

The [Mauna Loa CO$_{\text{2}}$ data set](https://www.esrl.noaa.gov/gmd/ccgg/trends/data.html), 
curated by Dr. Pieter Tans, NOAA/GML 
and Dr. Ralph Keeling, Scripps Institution of Oceanography,
records the atmospheric concentration of carbon dioxide 
(CO$_{\text{2}}$, in parts per million) 
at the Mauna Loa research station in \index{Mauna Loa} Hawaii 
from 1959 onward [@maunadata].
For this book, we are going to focus on the last 40 years of the data set,
1980-2020.

**Question:** \index{question!visualization} 
Does the concentration of atmospheric CO$_{\text{2}}$ change over time, 
and are there any interesting patterns to note?

```{r, echo = FALSE, warning = FALSE, message = FALSE}
# convert year and month to date column
# note: eventually move this to the data script
library(lubridate)
read_csv("data/mauna_loa.csv") |>
  unite(col = "date_measured", year:month, remove = TRUE, sep = "-") |>
  mutate(date_measured = ym(date_measured)) |>
  select(-date_decimal) |>
  filter(ppm > 0, date_measured > date("1980/01/01")) |>
  write_csv("data/mauna_loa_data.csv")
```

To get started, we will read and inspect the data:

```{r 03-data-co2, warning=FALSE, message=FALSE}
# mauna loa carbon dioxide data
co2_df <- read_csv("data/mauna_loa_data.csv")
co2_df
```

We see that there are two columns in the `co2_df` data frame; `date_measured` and `ppm`. 
The `date_measured` column holds the date the measurement was taken, 
and is of type `date`.
The `ppm` column holds the value of CO$_{\text{2}}$ in parts per million 
that was measured on each date, and is type `double`.

> **Note:** `read_csv` was able to parse the `date_measured` column into the
> `date` vector type because it was entered 
> in the international standard date format, 
> called ISO 8601, which lists dates as `year-month-day`.
> `date` vectors are `double` vectors with special properties that allow 
> them to handle dates correctly.
> For example, `date` type vectors allow functions like `ggplot` 
> to treat them as numeric dates and not as character vectors, 
> even though they contain non-numeric characters 
> (e.g., in the `date_measured` column in the `co2_df` data frame).
> This means R will not accidentally plot the dates in the wrong order 
> (i.e., not alphanumerically as would happen if it was a character vector). 
> An in-depth study of dates and times is beyond the scope of the book, 
> but interested readers 
> may consult the Dates and Times chapter of *R for Data Science* [@wickham2016r];
> see the additional resources at the end of this chapter.

Since we are investigating a relationship between two variables 
(CO$_{\text{2}}$ concentration and date), 
a scatter plot is a good place to start. 
Scatter plots show the data as individual points with `x` (horizontal axis) 
and `y` (vertical axis) coordinates.
Here, we will use the measurement date as the `x` coordinate 
and the CO$_{\text{2}}$ concentration as the `y` coordinate. 
When using the `ggplot2` package, 
we create a plot object with the `ggplot` function. 
There are a few basic aspects of a plot that we need to specify:
\index{ggplot!aesthetic mapping}
\index{ggplot!geometric object}

- The name of the data frame object to visualize.
    - Here, we specify the `co2_df` data frame.
- The **aesthetic mapping**, which tells \index{aesthetic mapping} `ggplot` how the columns in the data frame map to properties of the visualization.
    - To create an aesthetic mapping, we use the `aes` function.
    - Here, we set the plot `x` axis to the `date_measured` variable, and the plot `y` axis to the `ppm` variable.
- The `+` operator, which tells `ggplot` that we would like to add another layer to the plot.\index{aaaplussymb@$+$|see{ggplot!add layer}}\index{ggplot!add layer}
- The **geometric object**, which specifies \index{aesthetic mapping} how the mapped data should be displayed.
    - To create a geometric object, we use a `geom_*` function (see the [ggplot reference](https://ggplot2.tidyverse.org/reference/) for a list of geometric objects).
    - Here, we use the `geom_point` function to visualize our data as a scatter plot.

Figure \@ref(fig:03-ggplot-function-scatter) 
shows how each of these aspects map to code
for creating a basic scatter plot of the `co2_df` data.
Note that we could pass many other possible arguments to the aesthetic mapping
and geometric object to change how the plot looks. For the purposes of quickly
testing things out to see what they look like, though, we can just start with the
default settings.
\index{ggplot!aes}
\index{ggplot!geom\_point}
 
(ref:03-ggplot-function-scatter) Creating a scatter plot with the `ggplot` function.

```{r 03-ggplot-function-scatter, echo = FALSE, fig.cap = "(ref:03-ggplot-function-scatter)", message = FALSE, out.width = "100%"}
image_read("img/ggplot_function_scatter.jpeg") |>
  image_crop("1625x1900")
```

\newpage

```{r 03-data-co2-scatter, warning=FALSE, message=FALSE, fig.height = 3.1, fig.width = 4.5, fig.align = "center", fig.cap = "Scatter plot of atmospheric concentration of CO$_{2}$ over time."}
co2_scatter <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_point()

co2_scatter
```

Certainly, the visualization in Figure \@ref(fig:03-data-co2-scatter) 
shows a clear upward trend 
in the atmospheric concentration of CO$_{\text{2}}$ over time.
This plot answers the first part of our question in the affirmative, 
but that appears to be the only conclusion one can make 
from the scatter visualization. 

One important thing to note about this data is that one of the variables
we are exploring is time.
Time is a special kind of quantitative variable 
because it forces additional structure on the data&mdash;the 
data points have a natural order. 
Specifically, each observation in the data set has a predecessor 
and a successor, and the order of the observations matters; changing their order 
alters their meaning.
In situations like this, we typically use a line plot to visualize
the data. Line plots connect the sequence of `x` and `y` coordinates
of the observations with line segments, thereby emphasizing their order.

We can create a line plot in `ggplot` using the `geom_line` function. 
Let's now try to visualize the `co2_df` as a line plot 
with just the default arguments: 
\index{ggplot!geom\_line}

```{r 03-data-co2-line, warning=FALSE, message=FALSE, fig.height = 3.1, fig.width = 4.5, fig.align = "center", fig.cap = "Line plot of atmospheric concentration of CO$_{2}$ over time."}
co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_line()

co2_line
```

Aha! Figure \@ref(fig:03-data-co2-line) shows us there *is* another interesting
phenomenon in the data: in addition to increasing over time, the concentration
seems to oscillate as well.  Given the visualization as it is now, it is still
hard to tell how fast the oscillation is, but nevertheless, the line seems to
be a better choice for answering the question than the scatter plot was. The
comparison between these two visualizations also illustrates a common issue with
scatter plots: often, the points are shown too close together or even on top of
one another, muddling information that would otherwise be clear
(*overplotting*). \index{overplotting}

Now that we have settled on the rough details of the visualization, it is time
to refine things. This plot is fairly straightforward, and there is not much
visual noise to remove. But there are a few things we must do to improve
clarity, such as adding informative axis labels and making the font a more
readable size.  To add axis labels, we use the `xlab` and `ylab` functions. To
change the font size, we use the `theme` function with the `text` argument:
\index{ggplot!xlab,ylab}
\index{ggplot!theme}

```{r 03-data-co2-line-2, warning=FALSE, message=FALSE, fig.height = 3.1, fig.width = 4.5, fig.align = "center",  fig.cap = "Line plot of atmospheric concentration of CO$_{2}$ over time with clearer axes and labels."}
co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_line() +
  xlab("Year") +
  ylab("Atmospheric CO2 (ppm)") +
  theme(text = element_text(size = 12))

co2_line
```

> **Note:** The `theme` function is quite complex and has many arguments 
> that can be specified to control many non-data aspects of a visualization.
> An in-depth discussion of the `theme` function is beyond the scope of this book.
> Interested readers may consult the `theme` function documentation;
> see the additional resources section at the end of this chapter.

Finally, let's see if we can better understand the oscillation by changing the
visualization slightly. Note that it is totally fine to use a small number of
visualizations to answer different aspects of the question you are trying to
answer. We will accomplish this by using *scales*, \index{ggplot!scales}
another important feature of `ggplot2` that easily transforms the different
variables and set limits.  We scale the horizontal axis using the `xlim` function,
and the vertical axis with the `ylim` function.
In particular, here, we will use the `xlim` function to zoom in 
on just five years of data (say, 1990-1994).
`xlim` takes a vector of length two 
to specify the upper and lower bounds to limit the axis. 
We can create that using the `c` function.
Note that it is important that the vector given to `xlim` must be of the same
type as the data that is mapped to that axis. 
Here, we have mapped a date to the x-axis, 
and so we need to use the `date` function 
(from the `tidyverse` [`lubridate` R package](https://lubridate.tidyverse.org/) [@lubridate; @lubridatepaper]) 
to convert the character strings we provide to `c` to `date` vectors.

> **Note:** `lubridate` is a package that is installed by the `tidyverse` metapackage,
> but is not loaded by it. 
> Hence we need to load it separately in the code below.

```{r 03-data-co2-line-3, warning = FALSE, message = FALSE, fig.height = 3.1, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Line plot of atmospheric concentration of CO$_{2}$ from 1990 to 1994."}
library(lubridate)

co2_line <- ggplot(co2_df, aes(x = date_measured, y = ppm)) +
  geom_line() +
  xlab("Year") +
  ylab("Atmospheric CO2 (ppm)") +
  xlim(c(date("1990-01-01"), date("1993-12-01"))) +
  theme(text = element_text(size = 12))

co2_line
```

Interesting! It seems that each year, the atmospheric CO$_{\text{2}}$ increases until it reaches its peak somewhere around April, decreases until around late September, 
and finally increases again until the end of the year. In Hawaii, there are two seasons: summer from May through October, and winter from November through April.
Therefore, the oscillating pattern in CO$_{\text{2}}$ matches up fairly closely with the two seasons.

As you might have noticed from the code used to create the final visualization
of the `co2_df` data frame, 
we construct the visualizations in `ggplot` with layers.
New layers are added with the `+` operator, 
and we can really add as many as we would like!
A useful analogy to constructing a data visualization is painting a picture.
We start with a blank canvas, 
and the first thing we do is prepare the surface 
for our painting by adding primer. 
In our data visualization this is akin to calling `ggplot` 
and specifying the data set we will be using.
Next, we sketch out the background of the painting. 
In our data visualization, 
this would be when we map data to the axes in the `aes` function.
Then we add our key visual subjects to the painting.
In our data visualization, 
this would be the geometric objects (e.g., `geom_point`, `geom_line`, etc.).
And finally, we work on adding details and refinements to the painting.
In our data visualization this would be when we fine tune axis labels,
change the font, adjust the point size, and do other related things.

### Scatter plots: the Old Faithful eruption time data set
The `faithful` data set \index{Old Faithful} contains measurements 
of the waiting time between eruptions 
and the subsequent eruption duration (in minutes) of the Old Faithful
geyser in Yellowstone National Park, Wyoming, United States. 
The `faithful` data set is available in base R as a data frame,
so it does not need to be loaded.
We convert it to a tibble to take advantage of the nicer print output 
these specialized data frames provide.

**Question:** \index{question!visualization} 
Is there a relationship between the waiting time before an eruption 
and the duration of the eruption? 

```{r 03-data-faithful, warning=FALSE, message=FALSE}
# old faithful eruption time / wait time data
faithful <- as_tibble(faithful)
faithful
```

Here again, we investigate the relationship between two quantitative variables 
(waiting time and eruption time). 
But if you look at the output of the data frame, 
you'll notice that unlike time in the Mauna Loa CO$_{\text{2}}$ data set,
neither of the variables here have a natural order to them.
So a scatter plot is likely to be the most appropriate
visualization. Let's create a scatter plot using the `ggplot`
function with the `waiting` variable on the horizontal axis, the `eruptions` 
variable on the vertical axis, and the `geom_point` geometric object.
The result is shown in Figure \@ref(fig:03-data-faithful-scatter).

```{r 03-data-faithful-scatter, warning=FALSE, message=FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of waiting time and eruption time."}
faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()

faithful_scatter
```

We can see in Figure \@ref(fig:03-data-faithful-scatter) that the data tend to fall
into two groups: one with short waiting and eruption times, and one with long
waiting and eruption times. Note that in this case, there is no overplotting:
the points are generally nicely visually separated, and the pattern they form
is clear.  In order to refine the visualization, we need only to add axis
labels and make the font more readable:

```{r 03-data-faithful-scatter-2, warning=FALSE, message=FALSE, fig.height = 3.5, fig.width = 3.75, fig.align = "center",  fig.pos = "H", out.extra="", fig.cap = "Scatter plot of waiting time and eruption time with clearer axes and labels."}
faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  labs(x = "Waiting Time (mins)", y = "Eruption Duration (mins)") +
  theme(text = element_text(size = 12))

faithful_scatter
```

+++

\newpage

### Axis transformation and colored scatter plots: the Canadian languages data set

Recall the `can_lang` data set [@timbers2020canlang] from Chapters \@ref(intro), \@ref(reading), and \@ref(wrangling),
\index{Canadian languages} which contains counts of languages from the 2016
Canadian census.

**Question:** \index{question!visualization} Is there a relationship between
the percentage of people who speak a language as their mother tongue and 
the percentage for whom that is the primary language spoken at home?
And is there a pattern in the strength of this relationship in the
higher-level language categories (Official languages, Aboriginal languages, or
non-official and non-Aboriginal languages)?

To get started, we will read and inspect the data:

```{r 03-canlang-example, message = F, warning = F}
can_lang <- read_csv("data/can_lang.csv")
can_lang
```

We will begin with a scatter plot of the `mother_tongue` and `most_at_home` columns from our data frame.
The resulting plot is shown in Figure \@ref(fig:03-mother-tongue-vs-most-at-home).
\index{ggplot!geom\_point}

```{r 03-mother-tongue-vs-most-at-home, fig.height=3.5, fig.width=3.75, fig.align = "center", warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home."}
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point()
``` 

To make an initial improvement in the interpretability 
of Figure \@ref(fig:03-mother-tongue-vs-most-at-home), we should 
replace the default axis
names with more informative labels. We can use `\n` to create a line break in
the axis names so that the words after `\n` are printed on a new line. This will
make the axes labels on the plots more readable.
\index{escape character} We should also increase the font size to further 
improve readability.

```{r 03-mother-tongue-vs-most-at-home-labs, fig.height=3.5, fig.width=3.75, fig.align = "center", warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with x and y labels."}
ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point() +
  labs(x = "Language spoken most at home \n (number of Canadian residents)",
       y = "Mother tongue \n (number of Canadian residents)") +
  theme(text = element_text(size = 12))
```

```{r mother-tongue-hidden-summaries, echo = FALSE, warning = FALSE, message = FALSE}
numlang_speakers <- can_lang |> 
              select(mother_tongue) |> 
              summarize(maxsp = max(mother_tongue), 
                        minsp = min(mother_tongue))

maxlang_speakers <- numlang_speakers |> 
  pull(maxsp)

minlang_speakers <- numlang_speakers |> 
  pull(minsp)
```

Okay! The axes and labels in Figure \@ref(fig:03-mother-tongue-vs-most-at-home-labs) are
much more readable and interpretable now. However, the scatter points themselves could use
some work; most of the 214 data points are bunched
up in the lower left-hand side of the visualization. The data is clumped because
many more people in Canada speak English or French (the two points in
the upper right corner) than other languages. 
In particular, the most common mother tongue language 
has `r  format(maxlang_speakers, scientific = FALSE, big.mark = ",")` speakers,
while the least common has only `r  format(minlang_speakers, scientific = FALSE, big.mark = ",")`.
That's a `r as.integer(floor(log10(maxlang_speakers/minlang_speakers)))`-decimal-place difference
in the magnitude of these two numbers!
We can confirm that the two points in the upper right-hand corner correspond
to Canada's two official languages by filtering the data:
\index{filter}

```{r english-mother-tongue}
can_lang |>
  filter(language == "English" | language == "French")
```

Recall that our question about this data pertains to *all* languages;
so to properly answer our question, 
we will need to adjust the scale of the axes so that we can clearly
see all of the scatter points.
In particular, we will improve the plot by adjusting the horizontal
and vertical axes so that they are on a **logarithmic** (or **log**) scale. \index{logarithmic scale}
Log scaling is useful when your data take both *very large* and *very small* values,
because it helps space out small values and squishes larger values together.
For example, $\log_{10}(1) = 0$, $\log_{10}(10) = 1$, $\log_{10}(100) = 2$, and $\log_{10}(1000) = 3$;
on the logarithmic scale, \index{ggplot!logarithmic scaling} 
the values 1, 10, 100, and 1000 are all the same distance apart!
So we see that applying this function is moving big values closer together 
and moving small values farther apart.
Note that if your data can take the value 0, logarithmic scaling may not 
be appropriate (since `log10(0) = -Inf` in R). There are other ways to transform
the data in such a case, but these are beyond the scope of the book. 

We can accomplish logarithmic scaling in a `ggplot` visualization
using the `scale_x_log10` and `scale_y_log10` functions.
Given that the x and y axes have large numbers, we should also format the axis labels
to put commas in these numbers to increase their readability.
We can do this in R by passing the `label_comma` function (from the `scales` package)
to the `labels` argument of the `scale_x_log10` and `scale_x_log10` functions.

```{r 03-mother-tongue-vs-most-at-home-scale, message = FALSE, warning =  FALSE, fig.height=3.5,  fig.width=3.75, fig.align = "center",  fig.pos = "H", out.extra="", fig.cap = "Scatter plot of number of Canadians reporting a language as their mother tongue vs the primary language at home with log adjusted x and y axes."}
library(scales)

ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point() +
  labs(x = "Language spoken most at home \n (number of Canadian residents)",
       y = "Mother tongue \n (number of Canadian residents)") +
  theme(text = element_text(size = 12)) +
  scale_x_log10(labels = label_comma()) +
  scale_y_log10(labels = label_comma())
```

```{r 03-changing-the-units, include = FALSE}
english_mother_tongue <- can_lang |>
  filter(language == "English") |>
  pull(mother_tongue)

census_popn <- 35151728
```

Similar to some of the examples in Chapter \@ref(wrangling), 
we can convert the counts to percentages to give them context 
and make them easier to understand.
We can do this by dividing the number of people reporting a given language 
as their mother tongue or primary language at home 
by the number of people who live in Canada and multiplying by 100\%. 
For example, 
the percentage of people who reported that their mother tongue was English 
in the 2016 Canadian census 
was `r  format(english_mother_tongue, scientific = FALSE, big.mark = ",") ` 
/ `r  format(census_popn, scientific = FALSE, big.mark = ",")` $\times$ 
`r 100` \% =
`r format(round(english_mother_tongue/census_popn*100, 2), scientific = FALSE, big.mark = ",")`\%.

Below we use `mutate` to calculate the percentage of people reporting a given
language as their mother tongue and primary language at home for all the
languages in the `can_lang` data set. Since the new columns are appended to the
end of the data table, we selected the new columns after the transformation so
you can clearly see the mutated output from the table.
\index{mutate}\index{select}

```{r}
can_lang <- can_lang |>
  mutate(
    mother_tongue_percent = (mother_tongue / 35151728) * 100,
    most_at_home_percent = (most_at_home / 35151728) * 100
  )

can_lang |> 
  select(mother_tongue_percent, most_at_home_percent)
```

Finally, we will edit the visualization to use the percentages we just computed
(and change our axis labels to reflect this change in 
units). Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props) displays
the final result.

```{r 03-mother-tongue-vs-most-at-home-scale-props, fig.height=3.5,  fig.width=3.75, fig.align = "center",  warning=FALSE, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home."}
ggplot(can_lang, aes(x = most_at_home_percent, y = mother_tongue_percent)) +
  geom_point() +
  labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
       y = "Mother tongue \n (percentage of Canadian residents)") +
  theme(text = element_text(size = 12)) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma)
```

Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props) is the appropriate
visualization to use to answer the first question in this section, i.e.,
whether there is a relationship between the percentage of people who speak 
a language as their mother tongue and the percentage for whom that
is the primary language spoken at home.
To fully answer the question, we need to use
 Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props)
to assess a few key characteristics of the data: 

- **Direction:** if the y variable tends to increase when the x variable increases, then y has a **positive** relationship with x. If 
  y tends to decrease when x increases, then y has a **negative** relationship with x. If y does not meaningfully increase or decrease 
  as x increases, then y has **little or no** relationship with x. \index{relationship!positive, negative, none}
- **Strength:** if the y variable *reliably* increases, decreases, or stays flat as x increases,
  then the relationship is **strong**. Otherwise, the relationship is **weak**. Intuitively, \index{relationship!strong, weak}
  the relationship is strong when the scatter points are close together and look more like a "line" or "curve" than a "cloud."
- **Shape:** if you can draw a straight line roughly through the data points, the relationship is **linear**. Otherwise, it is **nonlinear**. \index{relationship!linear, nonlinear}

In Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props), we see that 
as the percentage of people who have a language as their mother tongue increases, 
so does the percentage of people who speak that language at home. 
Therefore, there is a **positive** relationship between these two variables.
Furthermore, because the points in Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props) 
are fairly close together, and the points look more like a "line" than a "cloud",
we can say that this is a **strong** relationship. 
And finally, because drawing a straight line through these points in 
Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props)
would fit the pattern we observe quite well, we say that the relationship is **linear**.

Onto the second part of our exploratory data analysis question!
Recall that we are interested in knowing whether the strength 
of the relationship we uncovered 
in Figure \@ref(fig:03-mother-tongue-vs-most-at-home-scale-props) depends
on the higher-level language category (Official languages, Aboriginal languages,
and non-official, non-Aboriginal languages).
One common way to explore this
is to color the data points on the scatter plot we have already created by
group. For example, given that we have the higher-level language category for
each language recorded in the 2016 Canadian census, we can color the points in
our previous 
scatter plot to represent each language's higher-level language category.

Here we want to distinguish the values according to the `category` group with
which they belong.  We can add an argument to the `aes` function, specifying
that the `category` column should color the points. Adding this argument will
color the points according to their group and add a legend at the side of the
plot. 

```{r 03-scatter-color-by-category, warning=FALSE, fig.height=3.5, fig.width=5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category."}
ggplot(can_lang, aes(x = most_at_home_percent, 
                     y = mother_tongue_percent, 
                     color = category)) +
  geom_point() +
  labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
       y = "Mother tongue \n (percentage of Canadian residents)") +
  theme(text = element_text(size = 12)) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma)
```

The legend in Figure \@ref(fig:03-scatter-color-by-category) 
takes up valuable plot area. 
We can improve this by moving the legend title using the `legend.position`
and `legend.direction`
arguments of the `theme` function. 
Here we set `legend.position` to `"top"` to put the legend above the plot
and `legend.direction` to `"vertical"` so that the legend items remain 
vertically stacked on top of each other.
When the `legend.position` is set to either `"top"` or `"bottom"` 
the default direction is to stack the legend items horizontally.
However, that will not work well for this particular visualization 
because the legend labels are quite long 
and would run off the page if displayed this way.

```{r 03-scatter-color-by-category-legend-edit, warning=FALSE, fig.height=4.75,  fig.width=3.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with the legend edited."}
ggplot(can_lang, aes(x = most_at_home_percent, 
                     y = mother_tongue_percent, 
                     color = category)) +
  geom_point() +
  labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
       y = "Mother tongue \n (percentage of Canadian residents)") +
  theme(text = element_text(size = 12),
        legend.position = "top",
        legend.direction = "vertical") +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma)
```

In Figure \@ref(fig:03-scatter-color-by-category-legend-edit), the points are colored with
the default `ggplot2` color palette. But what if you want to use different
colors? In R, two packages that provide alternative color 
palettes \index{color palette} are `RColorBrewer` [@RColorBrewer]
and `ggthemes` [@ggthemes]; in this book we will cover how to use `RColorBrewer`.
You can visualize the list of color
palettes that `RColorBrewer` has to offer with the `display.brewer.all`
function. You can also print a list of color-blind friendly palettes by adding
`colorblindFriendly = TRUE` to the function. 

(ref:rcolorbrewer) Color palettes available from the `RColorBrewer` R package.

```{r rcolorbrewer, fig.height = 7, fig.cap = "(ref:rcolorbrewer)"}
library(RColorBrewer)
display.brewer.all(colorblindFriendly = TRUE)
```

From Figure \@ref(fig:rcolorbrewer), 
we can choose the color palette we want to use in our plot. 
To change the color palette, 
we add the `scale_color_brewer` layer indicating the palette we want to use. 
You can use 
this [color blindness simulator](https://www.color-blindness.com/coblis-color-blindness-simulator/) to check 
if your visualizations \index{color palette!color blindness simulator} 
are color-blind friendly.
Below we pick the `"Set2"` palette, with the result shown
in Figure \@ref(fig:scatter-color-by-category-palette).
We also set the `shape` aesthetic mapping to the `category` variable as well;
this makes the scatter point shapes different for each category. This kind of 
visual redundancy&mdash;i.e., conveying the same information with both scatter point color and shape&mdash;can
further improve the clarity and accessibility of your visualization.

```{r scatter-color-by-category-palette, fig.height=4.75,  fig.width=3.75, fig.align = "center", warning=FALSE, fig.pos = "H", out.extra="",  fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with color-blind friendly colors."}
ggplot(can_lang, aes(x = most_at_home_percent, 
                     y = mother_tongue_percent, 
                     color = category, 
                     shape = category)) +
  geom_point() +
  labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
       y = "Mother tongue \n (percentage of Canadian residents)") +
  theme(text = element_text(size = 12),
        legend.position = "top",
        legend.direction = "vertical") +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma) +
  scale_color_brewer(palette = "Set2")
```

From the visualization in Figure \@ref(fig:scatter-color-by-category-palette), 
we can now clearly see that the vast majority of Canadians reported one of the official languages 
as their mother tongue and as the language they speak most often at home. 
What do we see when considering the second part of our exploratory question? 
Do we see a difference in the relationship
between languages spoken as a mother tongue and as a primary language
at home across the higher-level language categories? 
Based on Figure \@ref(fig:scatter-color-by-category-palette), there does not
appear to be much of a difference.
For each higher-level language category, 
there appears to be a strong, positive, and linear relationship between 
the percentage of people who speak a language as their mother tongue 
and the percentage who speak it as their primary language at home. 
The relationship looks similar regardless of the category. 

Does this mean that this relationship is positive for all languages in the
world? And further, can we use this data visualization on its own to predict how many people
have a given language as their mother tongue if we know how many people speak
it as their primary language at home? The answer to both these questions is
"no!" However, with exploratory data analysis, we can create new hypotheses,
ideas, and questions (like the ones at the beginning of this paragraph).
Answering those questions often involves doing more complex analyses, and sometimes
even gathering additional data. We will see more of such complex analyses later on in 
this book.  

<!--
Finally, we can go one step further and distinguish English and French languages
with different colors in our visualization. To separate these languages, we
will filter the rows where the language is either English or French and mutate
the `category` column to equal the corresponding language. 
\index{filter}\index{mutate}

```{r 03-separate-English-French}
english_and_french <- can_lang |>
  filter(language == "English" | language == "French") |>
  mutate(category = language)
english_and_french
```

Next we will bind \index{bind\_rows} the mutated data set `english_and_french` that we just created with the remaining rows in the `can_lang` data set:
```{r 03-bind-english-french}
can_lang <- bind_rows(
  english_and_french,
  can_lang |>
    filter(language != "English" & language != "French")
)
```

+++

We have added a few more layers to make the data visualization in Figure \@ref(fig:03-nachos-to-cheesecake) even more effective. Specifically, we used have improved the visualizations accessibility by choosing colors that are easier to distinguish, mapped category to shape, and handled the problem of overlapping data points by making them slightly transparent. \index{ggplot!transparency}\index{alpha|see{ggplot}}

```{r 03-nachos-to-cheesecake, fig.width=7.75, fig.height=4, warning=FALSE, message=FALSE, fig.cap = "Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category."}
ggplot(can_lang, aes(
  x = most_at_home_percent,
  y = mother_tongue_percent,
  color = category,
  shape = category # map categories to different shapes
)) + 
  geom_point(alpha = 0.6) + # set the transparency of the points
  labs(x = "Language spoken most at home \n (percentage of Canadian residents)",
       y = "Mother tongue \n (percentage of Canadian residents)") +
  theme(text = element_text(size = 12)) +
  scale_x_log10(labels = comma) +
  scale_y_log10(labels = comma) + 
  scale_color_brewer(palette = "RdYlBu")
```
-->

### Bar plots: the island landmass data set
The `islands.csv` data set \index{Island landmasses} contains a list of Earth's landmasses as well as their area (in thousands of square miles) [@islandsdata]. 

**Question:** \index{question!visualization} Are the continents (North / South America, Africa, Europe, Asia, Australia, Antarctica) Earth's seven largest landmasses? If so, what are the next few largest landmasses after those?

```{r, echo = FALSE, message = FALSE, warning = FALSE}
islands_df <- read_csv("data/islands.csv")
continents <- c("Africa", "Antarctica", "Asia", "Australia", 
                "Europe", "North America", "South America")

islands_df <- mutate(islands_df, 
                     landmass_type = ifelse(landmass %in% continents, 
                                            "Continent", "Other"))

write_csv(islands_df, "data/islands.csv")
```

To get started, we will read and inspect the data:

```{r 03-data-islands, warning=FALSE, message=FALSE}
# islands data
islands_df <- read_csv("data/islands.csv")
islands_df
```

Here, we have a data frame of Earth's landmasses, 
and are trying to compare their sizes. 
The right type of visualization to answer this question is a bar plot. 
In a bar plot, the height of the bar represents the value of a summary statistic 
(usually a size, count, proportion or percentage).
They are particularly useful for comparing summary statistics between different
groups of a categorical variable.

We specify that we would like to use a bar plot
via the `geom_bar` function in `ggplot2`. 
However, by default, `geom_bar` sets the heights
of bars to the number of times a value appears in a data frame (its *count*); here, we want to plot exactly the values in the data frame, i.e.,
the landmass sizes. So we have to pass the `stat = "identity"` argument to `geom_bar`. The result is 
shown in Figure \@ref(fig:03-data-islands-bar).
\index{ggplot!geom\_bar}

```{r 03-data-islands-bar, warning=FALSE, message=FALSE,  fig.width=5, fig.height=2.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Bar plot of all Earth's landmasses' size with squished labels."}
islands_bar <- ggplot(islands_df, aes(x = landmass, y = size)) +
  geom_bar(stat = "identity")

islands_bar
```

Alright, not bad! The plot in Figure \@ref(fig:03-data-islands-bar) is
definitely the right kind of visualization, as we can clearly see and compare
sizes of landmasses. The major issues are that the smaller landmasses' sizes
are hard to distinguish, and the names of the landmasses are obscuring each
other as they have been squished into too little space. But remember that the
question we asked was only about the largest landmasses; let's make the plot a
little bit clearer by keeping only the largest 12 landmasses. We do this using
the `slice_max` function.  Then to help us make sure the labels have enough
space, we'll use horizontal bars instead of vertical ones. We do this by
swapping the `x` and `y` variables:
\index{slice\_max}

```{r 03-data-islands-bar-2, warning=FALSE, message=FALSE, fig.width=5, fig.height=2.75, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Bar plot of size for Earth's largest 12 landmasses."}
islands_top12 <- slice_max(islands_df, order_by = size, n = 12)
islands_bar <- ggplot(islands_top12, aes(x = size, y = landmass)) +
  geom_bar(stat = "identity") 

islands_bar
```

The plot in Figure \@ref(fig:03-data-islands-bar-2) is definitely clearer now, 
and allows us to answer our question 
("are the top 7 largest landmasses continents?") in the affirmative. 
But the question could be made clearer from the plot 
by organizing the bars not by alphabetical order
but by size, and to color them based on whether they are a continent. 
The data for this is stored in the `landmass_type` column. 
To use this to color the bars, 
we add the `fill` argument to the aesthetic mapping 
and set it to `landmass_type`. 

To organize the landmasses by their `size` variable, 
we will use the `tidyverse` `fct_reorder` function
in the aesthetic mapping to organize the landmasses by their `size` variable.
The first argument passed to `fct_reorder` is the name of the factor column
whose levels we would like to reorder (here, `landmass`). 
The second argument is the column name 
that holds the values we would like to use to do the ordering (here, `size`).
The `fct_reorder` function uses ascending order by default, 
but this can be changed to descending order 
by setting  `.desc = TRUE`.
We do this here so that the largest bar will be closest to the axis line,
which is more visually appealing.

To label the x and y axes, we will use the `labs` function
instead of the `xlab` and `ylab` functions from earlier in this chapter. 
The `labs` function is more general; we are using it in this case because 
 we would also like to change the legend label.
The default label is the name of the column being mapped to `fill`. Here that
would be `landmass_type`;
however `landmass_type` is not proper English (and so is less readable).
Thus we use the `fill` argument inside `labs` to change that to "Type."
Finally, we again \index{ggplot!reorder} use the `theme` function 
to change the font size.

```{r 03-data-islands-bar-4, warning = FALSE, message = FALSE, fig.width=5, fig.height=2.75, fig.align="center", fig.pos = "H", out.extra="", fig.cap = "Bar plot of size for Earth's largest 12 landmasses colored by whether its a continent with clearer axes and labels."}
islands_bar <- ggplot(islands_top12, 
                      aes(x = size,
                          y = fct_reorder(landmass, size, .desc = TRUE), 
                          fill = landmass_type)) +
  geom_bar(stat = "identity") +
  labs(x = "Size (1000 square mi)", y = "Landmass",  fill = "Type") +
  theme(text = element_text(size = 12))

islands_bar
```

The plot in Figure \@ref(fig:03-data-islands-bar-4) is now a very effective
visualization for answering our original questions. Landmasses are organized by
their size, and continents are colored differently than other landmasses,
making it quite clear that continents are the largest seven landmasses.

### Histograms: the Michelson speed of light data set
The `morley` data set \index{Michelson speed of light} 
contains measurements of the speed of light 
collected in experiments performed in 1879.
Five experiments were performed, 
and in each experiment, 20 runs were performed&mdash;meaning that 
20 measurements of the speed of light were collected 
in each experiment [@lightdata].
The `morley` data set is available in base R as a data frame,
so it does not need to be loaded.
Because the speed of light is a very large number 
(the true value is 299,792.458 km/sec), the data is coded
to be the measured speed of light minus 299,000.
This coding allows us to focus on the variations in the measurements, which are generally
much smaller than 299,000.
If we used the full large speed measurements, the variations in the measurements
would not be noticeable, making it difficult to study the differences between the experiments.
Note that we convert the `morley` data to a tibble to take advantage of the nicer print output 
these specialized data frames provide.

**Question:** \index{question!visualization} Given what we know now about the speed of 
light (299,792.458 kilometres per second), how accurate were each of the experiments?

```{r 03-data-morley, warning=FALSE, message=FALSE}
# michelson morley experimental data
morley <- as_tibble(morley)
morley
```

In this experimental data, 
Michelson was trying to measure just a single quantitative number 
(the speed of light). 
The data set contains many measurements of this single quantity. 
\index{distribution} 
To tell how accurate the experiments were, 
we need to visualize the distribution of the measurements 
(i.e., all their possible values and how often each occurs). 
We can do this using a *histogram*. 
A histogram \index{ggplot!histogram} 
helps us visualize how a particular variable is distributed in a data set 
by separating the data into bins, 
and then using vertical bars to show how many data points fell in each bin. 

To create a histogram in `ggplot2` we will use the `geom_histogram` geometric
object, setting the `x` axis to the `Speed` measurement variable. As usual, 
let's use the default arguments just to see how things look.

```{r 03-data-morley-hist, warning=FALSE, message=FALSE,  fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of Michelson's speed of light data."}
morley_hist <- ggplot(morley, aes(x = Speed)) +
  geom_histogram()

morley_hist
```

Figure \@ref(fig:03-data-morley-hist) is a great start. 
However, 
we cannot tell how accurate the measurements are using this visualization 
unless we can see the true value.
In order to visualize the true speed of light, 
we will add a vertical line with the `geom_vline` function.
To draw a vertical line with `geom_vline`,  \index{ggplot!geom\_vline}
we need to specify where on the x-axis the line should be drawn. 
We can do this by setting the `xintercept` argument. 
Here we set it to 792.458, which is the true value of light speed 
minus 299,000; this ensures it is coded the same way as the 
measurements in the `morley` data frame.
We would also like to fine tune this vertical line, 
styling it so that it is dashed and 1 point in thickness.
A point is a measurement unit commonly used with fonts, 
and 1 point is about 0.353 mm. 
We do this by setting `linetype = "dashed"` and `size = 1`, respectively. 
There is a similar function, `geom_hline`, 
that is used for plotting horizontal lines. 
Note that 
*vertical lines* are used to denote quantities on the *horizontal axis*, 
while *horizontal lines* are used to denote quantities on the *vertical axis*. 

```{r 03-data-morley-hist-2, warning=FALSE,  fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", message=FALSE,fig.cap = "Histogram of Michelson's speed of light data with vertical line indicating true speed of light."}
morley_hist <- ggplot(morley, aes(x = Speed)) +
  geom_histogram() +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1)

morley_hist
```

In Figure \@ref(fig:03-data-morley-hist-2), 
we still cannot tell which experiments (denoted in the `Expt` column) 
led to which measurements; 
perhaps some experiments were more accurate than others. 
To fully answer our question, 
we need to separate the measurements from each other visually. 
We can try to do this using a *colored* histogram, 
where counts from different experiments are stacked on top of each other 
in different colors. 
We can create a histogram colored by the `Expt` variable 
by adding it to the `fill` aesthetic mapping. 
We make sure the different colors can be seen 
(despite them all sitting on top of each other) 
by setting the `alpha` argument in `geom_histogram` to `0.5` 
to make the bars slightly translucent. 
We also specify `position = "identity"` in `geom_histogram` to ensure 
the histograms for each experiment will be overlaid side-by-side, 
instead of stacked bars 
(which is the default for bar plots or histograms 
when they are colored by another categorical variable).

```{r 03-data-morley-hist-3, warning=FALSE, message=FALSE,  fig.height = 2.75, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of Michelson's speed of light data colored by experiment."}
morley_hist <- ggplot(morley, aes(x = Speed, fill = Expt)) +
  geom_histogram(alpha = 0.5, position = "identity") +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)

morley_hist
```

Alright great, Figure \@ref(fig:03-data-morley-hist-3) looks...wait a second! The
histogram is still all the same color! What is going on here? Well, if you 
recall from Chapter \@ref(wrangling), the *data type* you use for each variable
can influence how R and `tidyverse` treats it. Here, we indeed have an issue
with the data types in the `morley` data frame. In particular, the `Expt` column
is currently an *integer* (you can see the label `<int>` underneath the `Expt` column in \index{integer} the printed
data frame at the start of this section). But we want to treat it as a
*category*, i.e., there should be one category per type of experiment.  

To fix this issue we can convert the `Expt` variable into a *factor* by \index{factor}
passing it to `as_factor` in the `fill` aesthetic mapping.
Recall that factor is a data type in R that is often used to represent
categories. By writing
`as_factor(Expt)` we are ensuring that R will treat this variable as a factor,
and the color will be mapped discretely.
\index{factor!usage in ggplot}

```{r 03-data-morley-hist-with-factor, warning=FALSE, message=FALSE,  fig.height = 2.75, fig.width = 5, fig.pos = "H", out.extra="", fig.align = "center", fig.cap = "Histogram of Michelson's speed of light data colored by experiment as factor."}
morley_hist <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) +
  geom_histogram(alpha = 0.5, position = "identity") +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)

morley_hist
```

> **Note:** Factors impact plots in two ways:
> (1) ensuring a color is mapped as discretely where appropriate (as in this
> example) and (2) the ordering of levels in a plot. `ggplot` takes into account
> the order of the factor levels as opposed to the order of data in
> your data frame. Learning how to reorder your factor levels will help you with
> reordering the labels of a factor on a plot.  
 
Unfortunately, the attempt to separate out the experiment number visually has
created a bit of a mess. All of the colors in Figure
\@ref(fig:03-data-morley-hist-with-factor) are blending together, and although it is
possible to derive *some* insight from this (e.g., experiments 1 and 3 had some
of the most incorrect measurements), it isn't the clearest way to convey our
message and answer the question. Let's try a different strategy of creating
grid of separate histogram plots.

+++

We use the `facet_grid` function to create a plot 
that has multiple subplots arranged in a grid.
The argument to `facet_grid` specifies the variable(s) used to split the plot 
into subplots, and how to split them (i.e., into rows or columns).
If the plot is to be split horizontally, into rows, 
then the `rows` argument is used.
If the plot is to be split vertically, into columns, 
then the `columns` argument is used.
Both the `rows` and `columns` arguments take the column names on which to split the data when creating the subplots. 
Note that the column names must be surrounded by the `vars` function.
This function allows the column names to be correctly evaluated 
in the context of the data frame.
\index{ggplot!facet\_grid}

```{r 03-data-morley-hist-4, warning=FALSE, message=FALSE, fig.height = 5, fig.width = 4.25, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of Michelson's speed of light data split vertically by experiment."}
morley_hist <- ggplot(morley, aes(x = Speed, fill = as_factor(Expt))) +
  geom_histogram() +
  facet_grid(rows = vars(Expt)) +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)

morley_hist
```

The visualization in Figure \@ref(fig:03-data-morley-hist-4) 
now makes it quite clear how accurate the different experiments were 
with respect to one another. 
The most variable measurements came from Experiment 1. 
There the measurements ranged from about 650&ndash;1050 km/sec.
The least variable measurements came from Experiment 2.
There, the measurements ranged from about 750&ndash;950 km/sec.
The most different experiments still obtained quite similar results!

There are two finishing touches to make this visualization even clearer. First and foremost, we need to add informative axis labels
using the `labs` function, and increase the font size to make it readable using the `theme` function. Second, and perhaps more subtly, even though it 
is easy to compare the experiments on this plot to one another, it is hard to get a sense 
of just how accurate all the experiments were overall. For example, how accurate is the value 800 on the plot, relative to the true speed of light?
To answer this question, we'll use the `mutate` function to transform our data into a relative measure of accuracy rather than absolute measurements:
\index{ggplot!labs}\index{ggplot!theme}

```{r 03-data-morley-hist-5, warning=FALSE, message=FALSE, fig.height = 5.25, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of relative accuracy split vertically by experiment with clearer axes and labels."}
morley_rel <- mutate(morley, 
                     relative_accuracy = 100 * 
                       ((299000 + Speed) - 299792.458) / (299792.458))

morley_hist <- ggplot(morley_rel, 
                      aes(x = relative_accuracy, 
                          fill = as_factor(Expt))) +
  geom_histogram() +
  facet_grid(rows = vars(Expt)) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 1.0) +
  labs(x = "Relative Accuracy (%)", 
       y = "# Measurements", 
       fill = "Experiment ID") +
  theme(text = element_text(size = 12))

morley_hist
```

Wow, impressive! These measurements of the speed of light from 1879 had errors around *0.05%* of the true speed. Figure \@ref(fig:03-data-morley-hist-5) shows you that even though experiments 2 and 5 were perhaps the most accurate, all of the experiments did quite an 
admirable job given the technology available at the time.

\newpage

#### Choosing a binwidth for histograms {-}

When you create a histogram in R, the default number of bins used is 30.
Naturally, this is not always the right number to use.
You can set the number of bins yourself by using
the `bins` argument in the `geom_histogram` geometric object.
You can also set the *width* of the bins using the
`binwidth` argument in the `geom_histogram` geometric object.
But what number of bins, or bin width, is the right one to use? 

Unfortunately there is no hard rule for what the right bin number
or width is. It depends entirely on your problem; the *right* number of bins 
or bin width is 
the one that *helps you answer the question* you asked. 
Choosing the correct setting for your problem 
is something that commonly takes iteration.
We recommend setting the *bin width* (not the *number of bins*) because
it often more directly corresponds to values in your problem of interest. For example,
if you are looking at a histogram of human heights,
a bin width of 1 inch would likely be reasonable, while the number of bins to use is 
not immediately clear.
It's usually a good idea to try out several bin widths to see which one
most clearly captures your data in the context of the question
you want to answer.

To get a sense for how different bin widths affect visualizations, 
let's experiment with the histogram that we have been working on in this section.
In Figure \@ref(fig:03-data-morley-hist-binwidth),
we compare the default setting with three other histograms where we set the 
`binwidth` to 0.001, 0.01 and 0.1.
In this case, we can see that both the default number of bins 
and the binwidth of 0.01 are effective for helping answer our question.
On the other hand, the bin widths of 0.001 and 0.1 are too small and too big, respectively.

```{r 03-data-morley-hist-binwidth, echo = FALSE, warning = FALSE, message = FALSE, fig.height = 10, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Effect of varying bin width on histograms."}
morley_hist_default <- ggplot(morley_rel, 
                              aes(x = relative_accuracy, 
                                  fill = as_factor(Expt))) +
  geom_histogram() +
  facet_grid(rows = vars(Expt)) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 1.0) +
  labs(x = "Relative Accuracy (%)", 
       y = "# Measurements", 
       fill = "Experiment ID") +
  theme(legend.position = "none") +
  ggtitle("Default (bins = 30)")  + 
  theme(text = element_text(size = 14), axis.title=element_text(size=14)) 

morley_hist_big <- ggplot(morley_rel, 
                          aes(x = relative_accuracy, 
                              fill = as_factor(Expt))) +
  geom_histogram(binwidth = 0.1) +
  facet_grid(rows = vars(Expt)) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 1.0) +
  labs(x = "Relative Accuracy (%)", 
       y = "# Measurements", 
       fill = "Experiment ID") +
  theme(legend.position = "none") +
  ggtitle( "binwidth = 0.1")  + 
  theme(text = element_text(size = 14), axis.title=element_text(size=14)) 

morley_hist_med <- ggplot(morley_rel, 
                          aes(x = relative_accuracy, 
                              fill = as_factor(Expt))) +
  geom_histogram(binwidth = 0.01) +
  facet_grid(rows = vars(Expt)) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 1.0) +
  labs(x = "Relative Accuracy (%)", 
       y = "# Measurements", 
       fill = "Experiment ID") +
  theme(legend.position = "none") +
  ggtitle("binwidth = 0.01")  + 
  theme(text = element_text(size = 14), axis.title=element_text(size=14)) 

morley_hist_small <- ggplot(morley_rel, 
                            aes(x = relative_accuracy, 
                                fill = as_factor(Expt))) +
  geom_histogram(binwidth = 0.001) +
  facet_grid(rows = vars(Expt)) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 1.0) +
  labs(x = "Relative Accuracy (%)", 
       y = "# Measurements", 
       fill = "Experiment ID") +
  theme(legend.position = "none") +
  ggtitle("binwidth = 0.001")  + 
  theme(text = element_text(size = 14), axis.title=element_text(size=14)) 

plot_grid(morley_hist_default, 
          morley_hist_small, 
          morley_hist_med, 
          morley_hist_big, 
          ncol = 2)
```

#### Adding layers to a `ggplot` plot object {-}

One of the powerful features of `ggplot` is that you 
can continue to iterate on a single plot object, adding and refining
one layer \index{ggplot!add layer} at a time. If you stored your plot as a named object
using the assignment symbol (`<-`), you can 
add to it using the `+` operator.
For example, if we wanted to add a title to the last plot we created (`morley_hist`), 
we can use the `+` operator to add a title layer with the `ggtitle` function.
The result is shown in Figure \@ref(fig:03-data-morley-hist-addlayer).

```{r 03-data-morley-hist-addlayer, warning = FALSE, message = FALSE, fig.height = 5.25, fig.width = 4.5, fig.align = "center", fig.pos = "H", out.extra="", fig.cap = "Histogram of relative accuracy split vertically by experiment with a descriptive title highlighting the take home message of the visualization."}
morley_hist_title <- morley_hist +
  ggtitle("Speed of light experiments \n were accurate to about 0.05%")

morley_hist_title
```

> **Note:** Good visualization titles clearly communicate 
> the take home message to the audience. Typically, 
> that is the answer to the question you posed before making the visualization.

## Explaining the visualization
#### *Tell a story* {-}

Typically, your visualization will not be shown entirely on its own, but rather
it will be part of a larger presentation.  Further, visualizations can provide
supporting information for any aspect of a presentation, from opening to
conclusion.  For example, you could use an exploratory visualization in the
opening of the presentation to motivate your choice of a more detailed data
analysis / model, a visualization of the results of your analysis to show what
your analysis has uncovered, or even one at the end of a presentation to help
suggest directions for future work. 

Regardless of where it appears, a good way to discuss your visualization \index{visualization!explanation} is as
a story: 

1) Establish the setting and scope, and describe why you did what you did. 
2) Pose the question that your visualization answers. Justify why the question is important to answer.
3) Answer the question using your visualization. Make sure you describe *all* aspects of the visualization (including describing the axes). But you 
   can emphasize different aspects based on what is important to answer your question:
    - **trends (lines):** Does a line describe the trend well? If so, the trend is *linear*, and if not, the trend is *nonlinear*. Is the trend increasing, decreasing, or neither?
                        Is there a periodic oscillation (wiggle) in the trend? Is the trend noisy (does the line "jump around" a lot) or smooth?
    - **distributions (scatters, histograms):** How spread out are the data? Where are they centered, roughly? Are there any obvious "clusters" or "subgroups", which would be visible as multiple bumps in the histogram?
    - **distributions of two variables (scatters):** Is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible
      relationship (the data are too noisy to make any conclusion)?
    - **amounts (bars):** How large are the bars relative to one another? Are there patterns in different groups of bars? 
4) Summarize your findings, and use them to motivate whatever you will discuss next.

Below are two examples of how one might take these four steps in describing the example visualizations that appeared earlier in this chapter.
Each of the steps is denoted by its numeral in parentheses, e.g. (3).

**Mauna Loa Atmospheric CO$_{\text{2}}$ Measurements:** (1) \index{Mauna Loa} Many 
current forms of energy generation and conversion&mdash;from automotive
engines to natural gas power plants&mdash;rely on burning fossil fuels and produce
greenhouse gases, typically primarily carbon dioxide (CO$_{\text{2}}$), as a
byproduct. Too much of these gases in the Earth's atmosphere will cause it to
trap more heat from the sun, leading to global warming. (2) In order to assess
how quickly the atmospheric concentration of CO$_{\text{2}}$ is increasing over
time, we (3) used a data set from the Mauna Loa observatory in Hawaii,
consisting of CO$_{\text{2}}$ measurements from 1980 to 2020. We plotted the
measured concentration of CO$_{\text{2}}$ (on the vertical axis) over time (on
the horizontal axis). From this plot, you can see a clear, increasing, and
generally linear trend over time. There is also a periodic oscillation that
occurs once per year and aligns with Hawaii's seasons, with an amplitude that
is small relative to the growth in the overall trend. This shows that
atmospheric CO$_{\text{2}}$ is clearly increasing over time, and (4) it is
perhaps worth investigating more into the causes.

**Michelson Light Speed Experiments:** (1) \index{Michelson speed of light} Our
modern understanding of the physics of light has advanced significantly from
the late 1800s when Michelson and Morley's experiments first demonstrated that
it had a finite speed. We now know, based on modern experiments, that it moves at
roughly 299,792.458 kilometers per second. (2) But how accurately were we first
able to measure this fundamental physical constant, and did certain experiments
produce more accurate results than others?  (3) To better understand this, we
plotted data from 5 experiments by Michelson in 1879, each with 20 trials, as
histograms stacked on top of one another.  The horizontal axis shows the
accuracy of the measurements relative to the true speed of light as we know it
today, expressed as a percentage.  From this visualization, you can see that
most results had relative errors of at most 0.05%. You can also see that
experiments 1 and 3 had measurements that were the farthest from the true
value, and experiment 5 tended to provide the most consistently accurate
result. (4) It would be worth further investigating the differences between
these experiments to see why they produced different results.

## Saving the visualization
#### *Choose the right output format for your needs* {-}

Just as there are many ways to store data sets, there are many ways to store
visualizations and images.  Which one you choose can depend on several factors,
such as file size/type limitations (e.g., if you are submitting your
visualization as part of a conference paper or to a poster printing shop) and
where it will be displayed (e.g., online, in a paper, on a poster, on a
billboard, in talk slides).  Generally speaking, images come in two flavors:
*raster* \index{bitmap|see{raster graphics}}\index{raster graphics} formats 
and *vector* \index{vector graphics} formats.

**Raster** images are represented as a 2-D grid of square pixels, each
with its own color. Raster images are often *compressed* before storing so they
take up less space. A compressed format is *lossy* if the image cannot be
perfectly re-created when loading and displaying, with the hope that the change
is not noticeable. *Lossless* formats, on the other hand, allow a perfect
display of the original image.
\index{raster graphics!file types}

- *Common file types:* 
    - [JPEG](https://en.wikipedia.org/wiki/JPEG) (`.jpg`, `.jpeg`): lossy, usually used for photographs 
    - [PNG](https://en.wikipedia.org/wiki/Portable_Network_Graphics) (`.png`): lossless, usually used for plots / line drawings
    - [BMP](https://en.wikipedia.org/wiki/BMP_file_format) (`.bmp`): lossless, raw image data, no compression (rarely used)
    - [TIFF](https://en.wikipedia.org/wiki/TIFF) (`.tif`, `.tiff`): typically lossless, no compression, used mostly in graphic arts, publishing
- *Open-source software:* [GIMP](https://www.gimp.org/)

**Vector** images are represented as a collection of mathematical 
objects (lines, surfaces, shapes, curves). When the computer displays the image, it 
redraws all of the elements using their mathematical formulas.
\index{vector graphics!file types}

- *Common file types:* 
    - [SVG](https://en.wikipedia.org/wiki/Scalable_Vector_Graphics) (`.svg`): general-purpose use 
    - [EPS](https://en.wikipedia.org/wiki/Encapsulated_PostScript) (`.eps`), general-purpose use (rarely used)
- *Open-source software:* [Inkscape](https://inkscape.org/)

Raster and vector images have opposing advantages and disadvantages. A raster
image of a fixed width / height takes the same amount of space and time to load
regardless of what the image shows (the one caveat is that the compression algorithms may
shrink the image more or run faster for certain images). A vector image takes
space and time to load corresponding to how complex the image is, since the
computer has to draw all the elements each time it is displayed. For example,
if you have a scatter plot with 1 million points stored as an SVG file, it may
take your computer some time to open the image. On the other hand, you can zoom
into / scale up vector graphics as much as you like without the image looking
bad, while raster images eventually start to look "pixelated." 

> **Note:** The portable document format [PDF](https://en.wikipedia.org/wiki/PDF) (`.pdf`) is commonly used to
> store *both* raster and vector formats. If you try to open a PDF and it's taking a long time
> to load, it may be because there is a complicated vector graphics image that your computer is rendering. 
\index{PDF}
\index{portable document format|see{PDF}}

Let's learn how to save plot images to these different file formats using a 
scatter plot of 
the [Old Faithful data set](https://www.stat.cmu.edu/~larry/all-of-statistics/=data/faithful.dat) [@faithfuldata],
shown in Figure \@ref(fig:03-plot-line).

```{r 03-plot-line, collapse=TRUE, warning=FALSE, message=FALSE, fig.width = 3.75, fig.height = 3.5, fig.pos = "H", out.extra="", fig.cap = "Scatter plot of waiting time and eruption time."}
library(svglite) # we need this to save SVG files
faithful_plot <- ggplot(data = faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  labs(x = "Waiting time to next eruption \n (minutes)",
       y = "Eruption time \n (minutes)") + 
  theme(text = element_text(size = 12))

faithful_plot
```

Now that we have a named `ggplot` plot object, we can use the `ggsave` function
to save a file containing this image. 
`ggsave` works by taking a file name to create for the image 
as its first argument. 
This can include the path to the directory where you would like to save the file 
(e.g., `img/filename.png` to save a file named `filename` to the `img` directory),
and the name of the plot object to save as its second argument.
The kind of image to save is specified by the file extension.
For example, 
to create a PNG image file, we specify that the file extension is `.png`.
Below we demonstrate how to save PNG, JPG, BMP, TIFF and SVG file types 
for the `faithful_plot`:

```{r warning=FALSE, message=FALSE}
ggsave("img/faithful_plot.png", faithful_plot)
ggsave("img/faithful_plot.jpg", faithful_plot)
ggsave("img/faithful_plot.bmp", faithful_plot)
ggsave("img/faithful_plot.tiff", faithful_plot)
ggsave("img/faithful_plot.svg", faithful_plot)
```

```{r, filesizes, echo = FALSE}
file_sizes <- tibble(`Image type` = c("Raster", 
                        "Raster", 
                        "Raster", 
                        "Raster",
                        "Vector"),
       `File type` = c("PNG", "JPG", "BMP", "TIFF", "SVG"),
       `Image size` = c(paste(round(file.info("img/faithful_plot.png")["size"] 
                                    / 1000000, 2), "MB"),
                        paste(round(file.info("img/faithful_plot.jpg")["size"] 
                                    / 1000000, 2), "MB"),
                        paste(round(file.info("img/faithful_plot.bmp")["size"] 
                                    / 1000000, 2), "MB"),
                        paste(round(file.info("img/faithful_plot.tiff")["size"] 
                                    / 1000000, 2), "MB"),
                        paste(round(file.info("img/faithful_plot.svg")["size"] 
                                    / 1000000, 2), "MB")))
kable(file_sizes,
      caption = "File sizes of the scatter plot of the Old Faithful data set when saved as different file formats.") |>
  kable_styling(latex_options = "hold_position")
```

Take a look at the file sizes in Table \@ref(tab:filesizes).
Wow, that's quite a difference! Notice that for such a simple plot with few
graphical elements (points), the vector graphics format (SVG) is over 100 times
smaller than the uncompressed raster images (BMP, TIFF). Also, note that the
JPG format is twice as large as the PNG format since the JPG compression
algorithm is designed for natural images (not plots). 

In Figure \@ref(fig:03-raster-image), we also show what
the images look like when we zoom in to a rectangle with only 3 data points.
You can see why vector graphics formats are so useful: because they're just
based on mathematical formulas, vector graphics can be scaled up to arbitrary
sizes.  This makes them great for presentation media of all sizes, from papers
to posters to billboards.

(ref:03-raster-image) Zoomed in `faithful`, raster (PNG, left) and vector (SVG, right) formats.

```{r 03-raster-image, echo=FALSE, fig.cap = "(ref:03-raster-image)", fig.show="hold", fig.align= "center", message =F, out.width="100%"}
knitr::include_graphics("img/png-vs-svg.png")
```

## Exercises

Practice exercises for the material covered in this chapter 
can be found in the accompanying 
[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme)
in the "Effective data visualization" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
You can also preview a non-interactive version of the worksheet by clicking "view worksheet."
If you instead decide to download the worksheet and run it on your own machine,
make sure to follow the instructions for computer setup
found in Chapter \@ref(move-to-your-own-machine). This will ensure that the automated feedback
and guidance that the worksheets provide will function as intended.

## Additional resources
- The [`ggplot2` R package page](https://ggplot2.tidyverse.org) [@ggplot] is
  where you should look if you want to learn more about the functions in this
  chapter, the full set of arguments you can use, and other related functions.
  The site also provides a very nice cheat sheet that summarizes many of the data
  wrangling functions from this chapter.
- The *Fundamentals of Data Visualization* [@wilkeviz] has
  a wealth of information on designing effective visualizations. It is not
  specific to any particular programming language or library. If you want to
  improve your visualization skills, this is the next place to look.
- *R for Data Science* [@wickham2016r] has a [chapter on creating visualizations using
  `ggplot2`](https://r4ds.had.co.nz/data-visualisation.html). This reference is
  specific to R and `ggplot2`, but provides a much more detailed introduction to
  the full set of tools that `ggplot2` provides. This chapter is where you should
  look if you want to learn how to make more intricate visualizations in
  `ggplot2` than what is included in this chapter.
- The [`theme` function documentation](https://ggplot2.tidyverse.org/reference/theme.html)
  is an excellent reference to see how you can fine tune the non-data aspects 
  of your visualization.
- *R for Data Science* [@wickham2016r] has a chapter on [dates and
  times](https://r4ds.had.co.nz/dates-and-times.html).  This chapter is where
  you should look if you want to learn about `date` vectors, including how to
  create them, and how to use them to effectively handle durations, periods and
  intervals using the `lubridate` package.
