Cleaning and wrangling data

Overview

This chapter is centered around defining tidy data—a data format that is suitable for analysis—and the tools needed to transform raw data into this format. This will be presented in the context of a real-world data science application, providing more practice working through a whole case study.

Chapter learning objectives

By the end of the chapter, readers will be able to do the following:

  • Define the term “tidy data”.

  • Discuss the advantages of storing data in a tidy data format.

  • Define what lists, series and data frames are in Python, and describe how they relate to each other.

  • Describe the common types of data in Python and their uses.

  • Recall and use the following functions for their intended data wrangling tasks:

    • .agg

    • .apply

    • .assign

    • .groupby

    • .melt

    • .pivot

    • .str.split

  • Recall and use the following operators for their intended data wrangling tasks:

    • ==

    • in

    • and

    • or

    • df[]

    • .iloc[]

    • .loc[]

Data frames, series, and lists

In Chapters intro and reading, data frames were the focus: we learned how to import data into Python as a data frame, and perform basic operations on data frames in Python. In the remainder of this book, this pattern continues. The vast majority of tools we use will require that data are represented as a pandas data frame in Python. Therefore, in this section, we will dig more deeply into what data frames are and how they are represented in Python. This knowledge will be helpful in effectively utilizing these objects in our data analyses.

What is a data frame?

A data frame is a table-like structure for storing data in Python. Data frames are important to learn about because most data that you will encounter in practice can be naturally stored as a table. In order to define data frames precisely, we need to introduce a few technical terms:

  • variable: a characteristic, number, or quantity that can be measured.

  • observation: all of the measurements for a given entity.

  • value: a single measurement of a single variable for a given entity.

Given these definitions, a data frame is a tabular data structure in Python that is designed to store observations, variables, and their values. Most commonly, each column in a data frame corresponds to a variable, and each row corresponds to an observation. For example, Fig. 12 displays a data set of city populations. Here, the variables are “region, year, population”; each of these are properties that can be collected or measured. The first observation is “Toronto, 2016, 2235145”; these are the values that the three variables take for the first entity in the data set. There are 13 entities in the data set in total, corresponding to the 13 rows in Fig. 12.


Fig. 12 A data frame storing data regarding the population of various regions in Canada. In this example data frame, the row that corresponds to the observation for the city of Vancouver is colored yellow, and the column that corresponds to the population variable is colored blue.

What is a series?

In Python, pandas series are arrays with labels. They are strictly one-dimensional and can contain any data type (integers, strings, floats, etc.), including a mix of them (in which case the series has the object type); Python has several different basic data types, as shown in Table 3. You can create a pandas series using the pd.Series() function. For example, to create the series region as shown in Fig. 13, you can write:

import pandas as pd
region = pd.Series(["Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa"])
region
0      Toronto
1     Montreal
2    Vancouver
3      Calgary
4       Ottawa
dtype: object

Fig. 13 Example of a pandas series whose type is string.


Table 3 Basic data types in Python

| English name | Type name | Type Category | Description | Example |
| --- | --- | --- | --- | --- |
| integer | int | Numeric Type | positive/negative whole numbers | 42 |
| floating point number | float | Numeric Type | real number in decimal form | 3.14159 |
| boolean | bool | Boolean Values | true or false | True |
| string | str | Sequence Type | text | "Can I have a cheezburger?" |
| list | list | Sequence Type | a collection of objects - mutable & ordered | ['Ali', 'Xinyi', 'Miriam'] |
| tuple | tuple | Sequence Type | a collection of objects - immutable & ordered | ('Thursday', 6, 9, 2018) |
| dictionary | dict | Mapping Type | mapping of key-value pairs | {'name':'DSCI', 'code':100, 'credits':2} |
| none | NoneType | Null Object | represents no value | None |

It is important in Python to make sure you represent your data with the correct type. Many of the pandas functions we use in this book treat the various data types differently. You should use the integer and float types (which both fall under the “numeric” umbrella) to represent numbers and perform arithmetic. Strings are used to represent data that should be thought of as “text”, such as words, names, paths, URLs, and more. There are other basic data types in Python, such as set and complex, but we do not use these in this textbook.
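
As a quick check of these types, you can pass any value to the built-in type function. A minimal sketch using the example values from Table 3:

type(42)                           # <class 'int'>
type(3.14159)                      # <class 'float'>
type(True)                         # <class 'bool'>
type("Can I have a cheezburger?")  # <class 'str'>
type(None)                         # <class 'NoneType'>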

What is a list?

Lists are built-in objects in Python that store multiple, ordered elements; a pandas series can be thought of as a list with labels (an index) attached.
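
For example, here is a small sketch (with illustrative values) contrasting a plain Python list with the labeled series you get by wrapping it in pd.Series:

# A plain Python list: ordered elements, accessible by position only
cities = ["Toronto", "Montreal", "Vancouver"]
cities[0]  # 'Toronto'

# Wrapping the list in a series attaches labels (an index) to the elements
cities_series = pd.Series(cities)
cities_series.index  # RangeIndex(start=0, stop=3, step=1)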

What does this have to do with data frames?

A data frame is really just a collection of series stuck together that follows two rules:

  1. Each element (column) is itself a series.

  2. Each element (series) must have the same length.

Not all columns in a data frame need to be of the same type. Fig. 14 shows a data frame where the columns are series of different types.
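
To make this concrete, here is a minimal sketch (with illustrative values) that builds a small data frame whose columns are series of different types:

# Each column is a series; all columns must have the same length
population_df = pd.DataFrame({
    "region": ["Toronto", "Montreal", "Vancouver"],  # string column
    "year": [2016, 2016, 2016],                      # integer column
    "population": [2235145, 1704694, 631486],        # integer column
})
population_df["region"]  # extracting a single column returns a pandas series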


Fig. 14 Data frame and series types.

Note: You can use the type function on any data object. For example, we can check the type of the Canadian languages data set, can_lang, that we worked with in the previous chapters, and we see it is a pandas.core.frame.DataFrame.

can_lang = pd.read_csv("data/can_lang.csv")
type(can_lang)
pandas.core.frame.DataFrame

Lists, series, and data frames are basic data structures in Python that are core to most data analyses. We summarize them in Table 4. There are several other data structures in the Python programming language (e.g., matrices), but these are beyond the scope of this book.

Table 4 Basic data structures in Python

| Data Structure | Description |
| --- | --- |
| list | A 1D ordered collection of values that can store multiple data types at once. |
| Series | A 1D labeled, ordered collection of values that can store multiple data types at once. |
| DataFrame | A 2D labeled data structure with columns of potentially different types. |

Tidy data

There are many ways a tabular data set can be organized. This chapter will focus on introducing the tidy data format of organization and how to make your raw (and likely messy) data tidy. A tidy data frame satisfies the following three criteria [Wickham, 2014]:

  • each row is a single observation,

  • each column is a single variable, and

  • each value is a single cell (i.e., its entry in the data frame is not shared with another value).

Fig. 15 demonstrates a tidy data set that satisfies these three criteria.


Fig. 15 Tidy data satisfies three criteria.

There are many good reasons for making sure your data are tidy as a first step in your analysis. The most important is that it is a single, consistent format that nearly every function in pandas recognizes. No matter what the variables and observations in your data represent, as long as the data frame is tidy, you can manipulate it, plot it, and analyze it using the same tools. If your data is not tidy, you will have to write special, bespoke code in your analysis that will not only be error-prone, but hard for others to understand. Beyond making your analysis more accessible to others and less error-prone, tidy data is also typically easy for humans to interpret. Given these benefits, it is well worth spending the time to get your data into a tidy format upfront. Fortunately, there are many well-designed pandas data cleaning/wrangling tools to help you easily tidy your data. Let’s explore them below!

Note: Is there only one shape for tidy data for a given data set? Not necessarily! It depends on the statistical question you are asking and what the variables are for that question. For tidy data, each variable should be its own column. So, just as it’s essential to match your statistical question with the appropriate data analysis tool, it’s important to match your statistical question with the appropriate variables and ensure they are represented as individual columns to make the data tidy.

Tidying up: going from wide to long using .melt

One task that is commonly performed to get data into a tidy format is to combine values that are stored in separate columns, but are really part of the same variable, into one column. Data is often stored this way because this format is sometimes more intuitive for human readability and understanding, and humans create data sets. In Fig. 16, the table on the left is in an untidy, “wide” format because the year values (2006, 2011, 2016) are stored as column names. As a consequence, the values for the population of the various cities over these years are also split across several columns.

For humans, this table is easy to read, which is why you will often find data stored in this wide format. However, this format is difficult to work with when performing data visualization or statistical analysis using Python. For example, if we wanted to find the latest year, it would be challenging because the year values are stored as column names instead of as values in a single column. So before we could apply a function to find the latest year (for example, by using max), we would have to first extract the column names as a list and then apply a function to extract the latest year from that list. The problem only gets worse if we would like to find the value of the population for a given region for the latest year. Both of these tasks are greatly simplified once the data is tidied.
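
As a small sketch of the first task (using made-up values shaped like the wide table in Fig. 16):

# A made-up wide table: the years live in the column names, not in a column
pop_wide = pd.DataFrame({
    "region": ["Toronto", "Vancouver"],
    "2006": [2503281, 578041],
    "2011": [2615060, 603502],
    "2016": [2235145, 631486],
})

# To find the latest year, we must first dig it out of the column names
latest_year = max(int(year) for year in pop_wide.columns[1:])  # 2016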

Another problem with data in this format is that we don’t know what the numbers under each year actually represent. Do those numbers represent population size? Land area? It’s not clear. To solve both of these problems, we can reshape this data set to a tidy data format by creating a column called “year” and a column called “population.” This transformation—which makes the data “longer”—is shown as the right table in Fig. 16.


Fig. 16 Melting data from a wide to long data format.

We can achieve this effect in Python using the .melt function from the pandas package. The .melt function combines columns, and is usually used during data tidying when we need to make the data frame longer and narrower. To learn how to use .melt, we will work through an example with the region_lang_top5_cities_wide.csv data set. This data set contains the counts of how many Canadians cited each language as their mother tongue for five major Canadian cities (Toronto, Montréal, Vancouver, Calgary and Edmonton) from the 2016 Canadian census. To get started, we will use pd.read_csv to load the (untidy) data.

lang_wide = pd.read_csv("data/region_lang_top5_cities_wide.csv")
lang_wide
category language Toronto Montréal Vancouver Calgary Edmonton
0 Aboriginal languages Aboriginal languages, n.o.s. 80 30 70 20 25
1 Non-Official & Non-Aboriginal languages Afrikaans 985 90 1435 960 575
2 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 360 240 45 45 65
3 Non-Official & Non-Aboriginal languages Akan (Twi) 8485 1015 400 705 885
4 Non-Official & Non-Aboriginal languages Albanian 13260 2450 1090 1365 770
... ... ... ... ... ... ... ...
209 Non-Official & Non-Aboriginal languages Wolof 165 2440 30 120 130
210 Aboriginal languages Woods Cree 5 0 20 10 155
211 Non-Official & Non-Aboriginal languages Wu (Shanghainese) 5290 1025 4330 380 235
212 Non-Official & Non-Aboriginal languages Yiddish 3355 8960 220 80 55
213 Non-Official & Non-Aboriginal languages Yoruba 3380 210 190 1430 700

214 rows × 7 columns

What is wrong with the untidy format above? The table on the left in Fig. 17 represents the data in the “wide” (messy) format. From a data analysis perspective, this format is not ideal because the values of the variable region (Toronto, Montréal, Vancouver, Calgary and Edmonton) are stored as column names. Thus they are not easily accessible to the data analysis functions we will apply to our data set. Additionally, the mother tongue variable values are spread across multiple columns, which will prevent us from doing any desired visualization or statistical tasks until we combine them into one column. For instance, suppose we want to know the language with the highest number of Canadians reporting it as their mother tongue among all five regions. This question would be tough to answer with the data in its current format. We could find the answer in this format, though it would be much easier to answer if we tidy our data first. If mother tongue were instead stored as one column, as shown in the tidy data on the right in Fig. 17, we could simply use one line of code (df["mother_tongue"].max()) to get the maximum value.


Fig. 17 Going from wide to long with the .melt function.

Fig. 18 details the arguments that we need to specify in the .melt function to accomplish this data transformation.


Fig. 18 Syntax for the .melt function.

We use .melt to combine the Toronto, Montréal, Vancouver, Calgary, and Edmonton columns into a single column called region, and create a column called mother_tongue that contains the count of how many Canadians report each language as their mother tongue for each metropolitan area. We specify value_vars to be all the columns between Toronto and Edmonton:

lang_mother_tidy = lang_wide.melt(
    id_vars=["category", "language"],
    value_vars=["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
    var_name="region",
    value_name="mother_tongue",
)

lang_mother_tidy
category language region mother_tongue
0 Aboriginal languages Aboriginal languages, n.o.s. Toronto 80
1 Non-Official & Non-Aboriginal languages Afrikaans Toronto 985
2 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. Toronto 360
3 Non-Official & Non-Aboriginal languages Akan (Twi) Toronto 8485
4 Non-Official & Non-Aboriginal languages Albanian Toronto 13260
... ... ... ... ...
1065 Non-Official & Non-Aboriginal languages Wolof Edmonton 130
1066 Aboriginal languages Woods Cree Edmonton 155
1067 Non-Official & Non-Aboriginal languages Wu (Shanghainese) Edmonton 235
1068 Non-Official & Non-Aboriginal languages Yiddish Edmonton 55
1069 Non-Official & Non-Aboriginal languages Yoruba Edmonton 700

1070 rows × 4 columns

Note: In the code above, the call to the .melt function is split across several lines. This is allowed in certain cases; for example, when calling a function as above, Python knows to keep reading on the next line as long as the function call’s parentheses have not yet been closed. Splitting long lines like this across multiple lines is encouraged, as it helps significantly with code readability. Generally speaking, you should limit each line of code to about 80 characters.

The data above is now tidy because all three criteria for tidy data have now been met:

  1. All the variables (category, language, region and mother_tongue) are now their own columns in the data frame.

  2. Each observation, i.e., each combination of category, language, region, and the count of Canadians reporting that language as their mother tongue, is in a single row.

  3. Each value is a single cell, i.e., its row, column position in the data frame is not shared with another value.

Tidying up: going from long to wide using .pivot

Suppose we have observations spread across multiple rows rather than in a single row. For example, in Fig. 19, the table on the left is in an untidy, long format because the count column contains values of three variables (population, commuter count, and incorporated count), and the information about each observation (here, the population, commuter, and incorporated counts for a region) is split across three rows. Remember: one of the criteria for tidy data is that each observation must be in a single row.

Using data in this format—where two or more variables are mixed together in a single column—makes it harder to apply many usual pandas functions. For example, finding the maximum number of commuters would require an additional step of filtering for the commuter values before the maximum can be computed. In comparison, if the data were tidy, all we would have to do is compute the maximum value for the commuter column. To reshape this untidy data set to a tidy (and in this case, wider) format, we need to create columns called “population”, “commuters”, and “incorporated.” This is illustrated in the right table of Fig. 19.


Fig. 19 Going from long to wide data.

To tidy this type of data in Python, we can use the .pivot function. The .pivot function generally increases the number of columns (widens) and decreases the number of rows in a data set. To learn how to use .pivot, we will work through an example with the region_lang_top5_cities_long.csv data set. This data set contains the number of Canadians reporting their primary language at home and at work for five major cities (Toronto, Montréal, Vancouver, Calgary and Edmonton).

lang_long = pd.read_csv("data/region_lang_top5_cities_long.csv")
lang_long
region category language type count
0 Montréal Aboriginal languages Aboriginal languages, n.o.s. most_at_home 15
1 Montréal Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0
2 Toronto Aboriginal languages Aboriginal languages, n.o.s. most_at_home 50
3 Toronto Aboriginal languages Aboriginal languages, n.o.s. most_at_work 0
4 Calgary Aboriginal languages Aboriginal languages, n.o.s. most_at_home 5
... ... ... ... ... ...
2135 Calgary Non-Official & Non-Aboriginal languages Yoruba most_at_work 0
2136 Edmonton Non-Official & Non-Aboriginal languages Yoruba most_at_home 280
2137 Edmonton Non-Official & Non-Aboriginal languages Yoruba most_at_work 0
2138 Vancouver Non-Official & Non-Aboriginal languages Yoruba most_at_home 40
2139 Vancouver Non-Official & Non-Aboriginal languages Yoruba most_at_work 0

2140 rows × 5 columns

What makes the data set shown above untidy? In this example, each observation is a language in a region. However, each observation is split across multiple rows: one where the count for most_at_home is recorded, and the other where the count for most_at_work is recorded. Suppose the goal with this data was to visualize the relationship between the number of Canadians reporting their primary language at home and work. Doing that would be difficult with this data in its current form, since these two variables are stored in the same column. Fig. 20 shows how this data will be tidied using the .pivot function.


Fig. 20 Going from long to wide with the .pivot function.

Fig. 21 details the arguments that we need to specify in the .pivot function.


Fig. 21 Syntax for the .pivot function.

We will apply the function as detailed in Fig. 21.

lang_home_tidy = lang_long.pivot(
    index=["region", "category", "language"], columns=["type"], values=["count"]
).reset_index()

lang_home_tidy.columns = [
    "region",
    "category",
    "language",
    "most_at_home",
    "most_at_work",
]
lang_home_tidy
region category language most_at_home most_at_work
0 Calgary Aboriginal languages Aboriginal languages, n.o.s. 5 0
1 Calgary Aboriginal languages Algonquian languages, n.i.e. 0 0
2 Calgary Aboriginal languages Algonquin 0 0
3 Calgary Aboriginal languages Athabaskan languages, n.i.e. 0 0
4 Calgary Aboriginal languages Atikamekw 0 0
... ... ... ... ... ...
1065 Vancouver Non-Official & Non-Aboriginal languages Wu (Shanghainese) 2495 45
1066 Vancouver Non-Official & Non-Aboriginal languages Yiddish 10 0
1067 Vancouver Non-Official & Non-Aboriginal languages Yoruba 40 0
1068 Vancouver Official languages English 1622735 1330555
1069 Vancouver Official languages French 8630 3245

1070 rows × 5 columns

lang_home_tidy.dtypes
region          object
category        object
language        object
most_at_home     int64
most_at_work     int64
dtype: object

The data above is now tidy! We can go through the three criteria again to check that this data is a tidy data set.

  1. All the statistical variables are their own columns in the data frame (i.e., most_at_home, and most_at_work have been separated into their own columns in the data frame).

  2. Each observation, (i.e., each language in a region) is in a single row.

  3. Each value is a single cell (i.e., its row, column position in the data frame is not shared with another value).

You might notice that we have the same number of columns in the tidy data set as we did in the messy one. Therefore .pivot didn’t really “widen” the data. This is just because the original type column only had two categories in it. If it had more than two, .pivot would have created more columns, and we would see the data set “widen.”
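
To see the widening in action, here is a minimal sketch with a made-up long table whose type column has three categories instead of two:

# A made-up long table where `type` takes three values per region
toy_long = pd.DataFrame({
    "region": ["Calgary"] * 3 + ["Edmonton"] * 3,
    "type": ["population", "commuter", "incorporated"] * 2,
    "count": [1392609, 567000, 1, 1321426, 510000, 1],
})

# Pivoting spreads the three categories into three separate columns
toy_wide = toy_long.pivot(index="region", columns="type", values="count").reset_index()

Here the long table has 3 columns and 6 rows, while the pivoted result has 4 columns (region, commuter, incorporated, population) and only 2 rows, so the data set genuinely widens.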

Tidying up: using .str.split to deal with multiple delimiters

Data are also not considered tidy when multiple values are stored in the same cell. The data set we show below is even messier than the ones we dealt with above: the Toronto, Montréal, Vancouver, Calgary and Edmonton columns contain the number of Canadians reporting their primary language at home and at work in one column, separated by the delimiter (/). The column names are the values of a variable, and each value does not have its own cell! To turn this messy data into tidy data, we’ll have to fix both of these issues.

lang_messy = pd.read_csv("data/region_lang_top5_cities_messy.csv")
lang_messy
category language Toronto Montréal Vancouver Calgary Edmonton
0 Aboriginal languages Aboriginal languages, n.o.s. 50/0 15/0 15/0 5/0 10/0
1 Non-Official & Non-Aboriginal languages Afrikaans 265/0 10/0 520/10 505/15 300/0
2 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. 185/10 65/0 10/0 15/0 20/0
3 Non-Official & Non-Aboriginal languages Akan (Twi) 4045/20 440/0 125/10 330/0 445/0
4 Non-Official & Non-Aboriginal languages Albanian 6380/215 1445/20 530/10 620/25 370/10
... ... ... ... ... ... ... ...
209 Non-Official & Non-Aboriginal languages Wolof 75/0 770/0 5/0 65/0 90/10
210 Aboriginal languages Woods Cree 0/10 0/0 5/0 0/0 20/0
211 Non-Official & Non-Aboriginal languages Wu (Shanghainese) 3130/30 760/15 2495/45 210/0 120/0
212 Non-Official & Non-Aboriginal languages Yiddish 350/20 6665/860 10/0 10/0 0/0
213 Non-Official & Non-Aboriginal languages Yoruba 1080/10 45/0 40/0 350/0 280/0

214 rows × 7 columns

First we’ll use .melt to create two columns, region and value, similar to what we did previously. The new region column will contain the region names, and the new value column will be a temporary holding place for the data that we need to further separate, i.e., the number of Canadians reporting their primary language at home and at work.

lang_messy_longer = lang_messy.melt(
    id_vars=["category", "language"],
    value_vars=["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
    var_name="region",
    value_name="value",
)

lang_messy_longer
category language region value
0 Aboriginal languages Aboriginal languages, n.o.s. Toronto 50/0
1 Non-Official & Non-Aboriginal languages Afrikaans Toronto 265/0
2 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. Toronto 185/10
3 Non-Official & Non-Aboriginal languages Akan (Twi) Toronto 4045/20
4 Non-Official & Non-Aboriginal languages Albanian Toronto 6380/215
... ... ... ... ...
1065 Non-Official & Non-Aboriginal languages Wolof Edmonton 90/10
1066 Aboriginal languages Woods Cree Edmonton 20/0
1067 Non-Official & Non-Aboriginal languages Wu (Shanghainese) Edmonton 120/0
1068 Non-Official & Non-Aboriginal languages Yiddish Edmonton 0/0
1069 Non-Official & Non-Aboriginal languages Yoruba Edmonton 280/0

1070 rows × 4 columns

Next we’ll use .str.split to split the value column into two columns. One column will contain only the counts of Canadians that speak each language most at home, and the other will contain the counts of Canadians that speak each language most at work for each region. Fig. 22 outlines what we need to specify to use .str.split.


Fig. 22 Syntax for the .str.split function.

tidy_lang = (
    pd.concat(
        (lang_messy_longer, lang_messy_longer["value"].str.split("/", expand=True)),
        axis=1,
    )
    .rename(columns={0: "most_at_home", 1: "most_at_work"})
    .drop(columns=["value"])
)

tidy_lang
category language region most_at_home most_at_work
0 Aboriginal languages Aboriginal languages, n.o.s. Toronto 50 0
1 Non-Official & Non-Aboriginal languages Afrikaans Toronto 265 0
2 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. Toronto 185 10
3 Non-Official & Non-Aboriginal languages Akan (Twi) Toronto 4045 20
4 Non-Official & Non-Aboriginal languages Albanian Toronto 6380 215
... ... ... ... ... ...
1065 Non-Official & Non-Aboriginal languages Wolof Edmonton 90 10
1066 Aboriginal languages Woods Cree Edmonton 20 0
1067 Non-Official & Non-Aboriginal languages Wu (Shanghainese) Edmonton 120 0
1068 Non-Official & Non-Aboriginal languages Yiddish Edmonton 0 0
1069 Non-Official & Non-Aboriginal languages Yoruba Edmonton 280 0

1070 rows × 5 columns

tidy_lang.dtypes
category        object
language        object
region          object
most_at_home    object
most_at_work    object
dtype: object

Is this data set now tidy? If we recall the three criteria for tidy data:

  • each row is a single observation,

  • each column is a single variable, and

  • each value is a single cell.

We can see that this data now satisfies all three criteria, making it easier to analyze. But we aren’t done yet! Notice in the table that all of the variables are “object” data types; object data types are columns of strings or columns with mixed types. In the previous example in Section Tidying up: going from long to wide using .pivot, the most_at_home and most_at_work variables were int64 (integer), which is a numeric data type (you can verify this by calling df.dtypes). The change here is due to the delimiter (/): Python read these columns in as string types, and by default, .str.split returns columns with the object data type.

It makes sense for region, category, and language to be stored as an object type. However, suppose we want to apply functions that treat the most_at_home and most_at_work columns as numbers (e.g., finding rows above a numeric threshold of a column). That won’t be possible if the variables are stored as objects. Fortunately, the pandas.to_numeric function provides a natural way to fix problems like this: it will convert the columns to the best numeric data types.

tidy_lang["most_at_home"] = pd.to_numeric(tidy_lang["most_at_home"])
tidy_lang["most_at_work"] = pd.to_numeric(tidy_lang["most_at_work"])
tidy_lang
category language region most_at_home most_at_work
0 Aboriginal languages Aboriginal languages, n.o.s. Toronto 50 0
1 Non-Official & Non-Aboriginal languages Afrikaans Toronto 265 0
2 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. Toronto 185 10
3 Non-Official & Non-Aboriginal languages Akan (Twi) Toronto 4045 20
4 Non-Official & Non-Aboriginal languages Albanian Toronto 6380 215
... ... ... ... ... ...
1065 Non-Official & Non-Aboriginal languages Wolof Edmonton 90 10
1066 Aboriginal languages Woods Cree Edmonton 20 0
1067 Non-Official & Non-Aboriginal languages Wu (Shanghainese) Edmonton 120 0
1068 Non-Official & Non-Aboriginal languages Yiddish Edmonton 0 0
1069 Non-Official & Non-Aboriginal languages Yoruba Edmonton 280 0

1070 rows × 5 columns

tidy_lang.dtypes
category        object
language        object
region          object
most_at_home     int64
most_at_work     int64
dtype: object

Now we see that the most_at_home and most_at_work columns are both of the int64 data type, indicating that they are integers (i.e., numbers)!

Using .loc[] and .iloc[] to extract a range of columns

Now that the tidy_lang data is indeed tidy, we can start manipulating it using the powerful suite of functions from pandas. For the first example, recall .loc[] from Chapter intro, which lets us create a subset of columns from a data frame. Suppose we wanted to select only the columns language, region, most_at_home and most_at_work from the tidy_lang data set. Using what we learned in Chapter intro, we would pass all of these column names into the square brackets:

selected_columns = tidy_lang.loc[:, ["language", "region", "most_at_home", "most_at_work"]]
selected_columns
language region most_at_home most_at_work
0 Aboriginal languages, n.o.s. Toronto 50 0
1 Afrikaans Toronto 265 0
2 Afro-Asiatic languages, n.i.e. Toronto 185 10
3 Akan (Twi) Toronto 4045 20
4 Albanian Toronto 6380 215
... ... ... ... ...
1065 Wolof Edmonton 90 10
1066 Woods Cree Edmonton 20 0
1067 Wu (Shanghainese) Edmonton 120 0
1068 Yiddish Edmonton 0 0
1069 Yoruba Edmonton 280 0

1070 rows × 4 columns

Here we wrote out the names of each of the columns. However, this method is time-consuming, especially if you have a lot of columns! Another approach is to index with integers. .iloc[] makes it easier for us to select columns; for instance, we can use .iloc[] to choose a range of columns rather than typing each column name out. To do this, we use the colon (:) operator to denote the range. For example, to get all the columns in the tidy_lang data frame from language to most_at_work, we pass : before the comma indicating we want to retrieve all rows, and 1: after the comma indicating we want only the columns from index 1 (i.e., language) onwards.

column_range = tidy_lang.iloc[:, 1:]
column_range
language region most_at_home most_at_work
0 Aboriginal languages, n.o.s. Toronto 50 0
1 Afrikaans Toronto 265 0
2 Afro-Asiatic languages, n.i.e. Toronto 185 10
3 Akan (Twi) Toronto 4045 20
4 Albanian Toronto 6380 215
... ... ... ... ...
1065 Wolof Edmonton 90 10
1066 Woods Cree Edmonton 20 0
1067 Wu (Shanghainese) Edmonton 120 0
1068 Yiddish Edmonton 0 0
1069 Yoruba Edmonton 280 0

1070 rows × 4 columns

Notice that we get the same output as we did above, but with less (and clearer!) code. This type of operator is especially handy for large data sets.
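
If we wanted only a slice from the middle of the data frame, we could specify both endpoints of the range. For example, a short sketch keeping just language and region (the columns at integer positions 1 and 2):

# Slices are half-open: columns from position 1 up to (but not including) 3
tidy_lang.iloc[:, 1:3]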

Suppose instead we wanted to extract columns that follow a particular pattern rather than just selecting a range. For example, let’s say we wanted to select only the columns most_at_home and most_at_work. There are other functions that allow us to select variables based on their names. In particular, we can use the .str.startswith method to choose only the columns that start with the word “most”:

tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
most_at_home most_at_work
0 50 0
1 265 0
2 185 10
3 4045 20
4 6380 215
... ... ...
1065 90 10
1066 20 0
1067 120 0
1068 0 0
1069 280 0

1070 rows × 2 columns

We could also have chosen the columns containing an underscore _ by using .str.contains("_"), since we notice the columns we want contain underscores and the others don’t.

tidy_lang.loc[:, tidy_lang.columns.str.contains('_')]
most_at_home most_at_work
0 50 0
1 265 0
2 185 10
3 4045 20
4 6380 215
... ... ...
1065 90 10
1066 20 0
1067 120 0
1068 0 0
1069 280 0

1070 rows × 2 columns

There are many different functions that help with selecting variables based on certain criteria. The additional resources section at the end of this chapter provides a comprehensive resource on these functions.
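
One such helper worth knowing about is the .filter method of pandas data frames, which selects columns whose names contain a substring or match a regular expression. A brief sketch:

# Keep columns whose names contain the substring "most"
tidy_lang.filter(like="most")

# Keep columns whose names match a regular expression
tidy_lang.filter(regex="^most_at")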

Using df[] to extract rows

Next, we revisit df[] from Chapter intro, which lets us create a subset of rows from a data frame. Recall that the argument to df[] is either column names or a logical statement that evaluates to True or False; df[] works by returning the rows where the logical statement evaluates to True. This section will highlight more advanced usage of df[]. In particular, it provides an in-depth treatment of the variety of logical statements one can use inside df[] to select subsets of rows.

Extracting rows that have a certain value with ==

Suppose we are only interested in the subset of rows in tidy_lang corresponding to the official languages of Canada (English and French). We can extract these rows by using the equivalency operator (==) to compare the values of the category column with the value "Official languages". With these arguments, df[] returns a data frame with all the columns of the input data frame but only the rows we asked for in the logical statement, i.e., those where the category column holds the value "Official languages". We name this data frame official_langs.

official_langs = tidy_lang[tidy_lang["category"] == "Official languages"]
official_langs
category language region most_at_home most_at_work
54 Official languages English Toronto 3836770 3218725
59 Official languages French Toronto 29800 11940
268 Official languages English Montréal 620510 412120
273 Official languages French Montréal 2669195 1607550
482 Official languages English Vancouver 1622735 1330555
487 Official languages French Vancouver 8630 3245
696 Official languages English Calgary 1065070 844740
701 Official languages French Calgary 8630 2140
910 Official languages English Edmonton 1050410 792700
915 Official languages French Edmonton 10950 2520

Extracting rows that do not have a certain value with !=

What if we want all the other language categories in the data set except for those in the "Official languages" category? We can accomplish this with the != operator, which means “not equal to”. So if we want to find all the rows where the category does not equal "Official languages" we write the code below.

tidy_lang[tidy_lang["category"] != "Official languages"]
category language region most_at_home most_at_work
0 Aboriginal languages Aboriginal languages, n.o.s. Toronto 50 0
1 Non-Official & Non-Aboriginal languages Afrikaans Toronto 265 0
2 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. Toronto 185 10
3 Non-Official & Non-Aboriginal languages Akan (Twi) Toronto 4045 20
4 Non-Official & Non-Aboriginal languages Albanian Toronto 6380 215
... ... ... ... ... ...
1065 Non-Official & Non-Aboriginal languages Wolof Edmonton 90 10
1066 Aboriginal languages Woods Cree Edmonton 20 0
1067 Non-Official & Non-Aboriginal languages Wu (Shanghainese) Edmonton 120 0
1068 Non-Official & Non-Aboriginal languages Yiddish Edmonton 0 0
1069 Non-Official & Non-Aboriginal languages Yoruba Edmonton 280 0

1060 rows × 5 columns

Extracting rows satisfying multiple conditions using &

Suppose now we want to look at only the rows for the French language in Montréal. To do this, we need to filter the data set to find rows that satisfy multiple conditions simultaneously. We can do this with the ampersand symbol (&), which is interpreted by Python as “and”. We write the code as shown below to filter the tidy_lang data frame to subset the rows where region == "Montréal" and language == "French".

tidy_lang[(tidy_lang["region"] == "Montréal") & (tidy_lang["language"] == "French")]
category language region most_at_home most_at_work
273 Official languages French Montréal 2669195 1607550

Extracting rows satisfying at least one condition using |

Suppose we were interested in only those rows corresponding to cities in Alberta in the official_langs data set (Edmonton and Calgary). We can’t use & as we did above because region cannot be both Edmonton and Calgary simultaneously. Instead, we can use the vertical pipe (|) logical operator, which gives us the cases where one condition, the other condition, or both are satisfied. In the code below, we ask Python to return the rows where the region column is equal to “Calgary” or “Edmonton”.

official_langs[
    (official_langs["region"] == "Calgary") | (official_langs["region"] == "Edmonton")
]
category language region most_at_home most_at_work
696 Official languages English Calgary 1065070 844740
701 Official languages French Calgary 8630 2140
910 Official languages English Edmonton 1050410 792700
915 Official languages French Edmonton 10950 2520

Extracting rows with values in a list using .isin()

Next, suppose we want to see the populations of our five cities. Let’s read in the region_data.csv file that comes from the 2016 Canadian census, as it contains statistics for number of households, land area, population and number of dwellings for different regions.

region_data = pd.read_csv("data/region_data.csv")
region_data
region households area population dwellings
0 Belleville 43002 1354.65121 103472 45050
1 Lethbridge 45696 3046.69699 117394 48317
2 Thunder Bay 52545 2618.26318 121621 57146
3 Peterborough 50533 1636.98336 121721 55662
4 Saint John 52872 3793.42158 126202 58398
... ... ... ... ... ...
30 Ottawa - Gatineau 535499 7168.96442 1323783 571146
31 Calgary 519693 5241.70103 1392609 544870
32 Vancouver 960894 3040.41532 2463431 1027613
33 Montréal 1727310 4638.24059 4098927 1823281
34 Toronto 2135909 6269.93132 5928040 2235145

35 rows × 5 columns

To get the populations of the five cities, we can filter the data set using the .isin method. The .isin method is used to see if an element belongs to a list. Here we are filtering for rows where the value in the region column matches any of the five cities we are interested in: Toronto, Montréal, Vancouver, Calgary, and Edmonton.

city_names = ["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"]
five_cities = region_data[region_data["region"].isin(city_names)]
five_cities
region households area population dwellings
29 Edmonton 502143 9857.77908 1321426 537634
31 Calgary 519693 5241.70103 1392609 544870
32 Vancouver 960894 3040.41532 2463431 1027613
33 Montréal 1727310 4638.24059 4098927 1823281
34 Toronto 2135909 6269.93132 5928040 2235145

Note: What’s the difference between == and .isin? Suppose we have two Series, seriesA and seriesB. If you type seriesA == seriesB into Python it will compare the series element by element. Python checks if the first element of seriesA equals the first element of seriesB, the second element of seriesA equals the second element of seriesB, and so on. On the other hand, seriesA.isin(seriesB) compares the first element of seriesA to all the elements in seriesB. Then the second element of seriesA is compared to all the elements in seriesB, and so on. Notice the difference between == and .isin in the example below.

pd.Series(["Vancouver", "Toronto"]) == pd.Series(["Toronto", "Vancouver"])
0    False
1    False
dtype: bool
pd.Series(["Vancouver", "Toronto"]).isin(pd.Series(["Toronto", "Vancouver"]))
0    True
1    True
dtype: bool

Extracting rows above or below a threshold using > and <

We saw in Section Extracting rows satisfying multiple conditions using & that 2,669,195 people reported speaking French in Montréal as their primary language at home. If we are interested in finding the official languages in regions with higher numbers of people who speak it as their primary language at home compared to French in Montréal, then we can use df[] to obtain rows where the value of most_at_home is greater than 2,669,195.

official_langs[official_langs["most_at_home"] > 2669195]
category language region most_at_home most_at_work
54 Official languages English Toronto 3836770 3218725

This operation returns a data frame with only one row, indicating that when considering the official languages, only English in Toronto is reported by more people as their primary language at home than French in Montréal according to the 2016 Canadian census.

Using .assign to modify or add columns

Using .assign to modify columns

In Section Tidying up: using .str.split to deal with multiple delimiters, when we first read in the "region_lang_top5_cities_messy.csv" data, all of the variables were “object” data types. During the tidying process, we used the pandas.to_numeric function to convert the most_at_home and most_at_work columns to the desired integer (i.e., numeric) data types, and then used df[] to overwrite those columns. But suppose we didn’t use df[], and needed to modify the columns some other way. Below we create such a situation so that we can demonstrate how to use .assign to change the column types of a data frame. .assign is a useful method for modifying or creating new data frame columns.

lang_messy = pd.read_csv("data/region_lang_top5_cities_messy.csv")
lang_messy_longer = lang_messy.melt(
    id_vars=["category", "language"],
    value_vars=["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
    var_name="region",
    value_name="value",
)
tidy_lang_obj = (
    pd.concat(
        (lang_messy_longer, lang_messy_longer["value"].str.split("/", expand=True)),
        axis=1,
    )
    .rename(columns={0: "most_at_home", 1: "most_at_work"})
    .drop(columns=["value"])
)
official_langs_obj = tidy_lang_obj[tidy_lang_obj["category"] == "Official languages"]

official_langs_obj
category language region most_at_home most_at_work
54 Official languages English Toronto 3836770 3218725
59 Official languages French Toronto 29800 11940
268 Official languages English Montréal 620510 412120
273 Official languages French Montréal 2669195 1607550
482 Official languages English Vancouver 1622735 1330555
487 Official languages French Vancouver 8630 3245
696 Official languages English Calgary 1065070 844740
701 Official languages French Calgary 8630 2140
910 Official languages English Edmonton 1050410 792700
915 Official languages French Edmonton 10950 2520
official_langs_obj.dtypes
category        object
language        object
region          object
most_at_home    object
most_at_work    object
dtype: object

To use the .assign method, again we first specify the object to be the data set, and in the following arguments, we specify the name of the column we want to modify or create (here most_at_home and most_at_work), an = sign, and then the function we want to apply (here pandas.to_numeric). In the function we want to apply, we refer to the column upon which we want it to act (here most_at_home and most_at_work). In our example, we are naming the columns the same names as columns that already exist in the data frame (“most_at_home”, “most_at_work”) and this will cause .assign to overwrite those columns (also referred to as modifying those columns in-place). If we were to give the columns a new name, then .assign would create new columns with the names we specified. .assign’s general syntax is detailed in Fig. 23.


Fig. 23 Syntax for the .assign function.

Below we use .assign to convert the columns most_at_home and most_at_work to numeric data types in the official_langs_obj data set as described in Fig. 23:

official_langs_numeric = official_langs_obj.assign(
    most_at_home=pd.to_numeric(official_langs_obj["most_at_home"]),
    most_at_work=pd.to_numeric(official_langs_obj["most_at_work"]),
)

official_langs_numeric
category language region most_at_home most_at_work
54 Official languages English Toronto 3836770 3218725
59 Official languages French Toronto 29800 11940
268 Official languages English Montréal 620510 412120
273 Official languages French Montréal 2669195 1607550
482 Official languages English Vancouver 1622735 1330555
487 Official languages French Vancouver 8630 3245
696 Official languages English Calgary 1065070 844740
701 Official languages French Calgary 8630 2140
910 Official languages English Edmonton 1050410 792700
915 Official languages French Edmonton 10950 2520
official_langs_numeric.dtypes
category        object
language        object
region          object
most_at_home     int64
most_at_work     int64
dtype: object

Now we see that the most_at_home and most_at_work columns are both int64 (which is a numeric data type)!

Using .assign to create new columns

We can see in the table that 3,836,770 people reported speaking English in Toronto as their primary language at home, according to the 2016 Canadian census. What does this number mean to us? To understand this number, we need context. In particular, how many people were in Toronto when this data was collected? From the 2016 Canadian census profile, the population of Toronto was reported to be 5,928,040 people. The number of people who report that English is their primary language at home is much more meaningful when we report it in this context. We can even go a step further and transform this count to a relative frequency or proportion. We can do this by dividing the number of people reporting a given language as their primary language at home by the number of people who live in Toronto. For example, the proportion of people who reported that their primary language at home was English in the 2016 Canadian census was 0.65 in Toronto.
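
The arithmetic behind that Toronto proportion is just this division:

3836770 / 5928040  # proportion reporting English at home in Toronto: ~0.647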

Let’s use .assign to create a new column in our data frame that holds the proportion of people who speak English for our five cities of focus in this chapter. To accomplish this, we will need to do two tasks beforehand:

  1. Create a list containing the population values for the cities.

  2. Filter the official_langs data frame so that we only keep the rows where the language is English.

To create a list containing the population values for the five cities (Toronto, Montréal, Vancouver, Calgary, Edmonton), we will use square brackets [] (recall that we can also use list() to create a list):

city_pops = [5928040, 4098927, 2463431, 1392609, 1321426]
city_pops
[5928040, 4098927, 2463431, 1392609, 1321426]

And next, we will filter the official_langs data frame so that we only keep the rows where the language is English. We will name the new data frame we get from this english_langs:

english_langs = official_langs[official_langs["language"] == "English"]
english_langs
category language region most_at_home most_at_work
54 Official languages English Toronto 3836770 3218725
268 Official languages English Montréal 620510 412120
482 Official languages English Vancouver 1622735 1330555
696 Official languages English Calgary 1065070 844740
910 Official languages English Edmonton 1050410 792700

Finally, we can use .assign to create a new column, named most_at_home_proportion, whose values correspond to the proportion of people reporting English as their primary language at home. We will compute this by dividing the most_at_home column by our list of city populations.

english_langs = english_langs.assign(
    most_at_home_proportion=english_langs["most_at_home"] / city_pops
)

english_langs
category language region most_at_home most_at_work most_at_home_proportion
54 Official languages English Toronto 3836770 3218725 0.647224
268 Official languages English Montréal 620510 412120 0.151384
482 Official languages English Vancouver 1622735 1330555 0.658730
696 Official languages English Calgary 1065070 844740 0.764802
910 Official languages English Edmonton 1050410 792700 0.794906

In the computation above, we had to ensure that we ordered the city_pops list in the same order as the cities are listed in the english_langs data frame. This is because Python performs the division by dividing each element of the most_at_home column by each element of the city_pops list, matching them up by position. Failing to do this would have resulted in the incorrect math being performed.

Note: In more advanced data wrangling, one might solve this problem in a less error-prone way though using a technique called “joins”. We link to resources that discuss this in the additional resources at the end of this chapter.
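
As a sketch of that alternative, pandas provides a merge method that matches rows by a shared column value rather than by position. Here, assuming the region_data frame loaded earlier is still available:

# Join on "region" so each row is matched by name, not by position
english_with_pops = english_langs.merge(
    region_data[["region", "population"]], on="region"
)
english_with_pops = english_with_pops.assign(
    most_at_home_proportion=english_with_pops["most_at_home"]
    / english_with_pops["population"]
)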

Combining functions by chaining the methods

In Python, we often have to call multiple methods in a sequence to process a data frame. The basic ways of doing this can quickly become unreadable if there are many steps. For example, suppose we need to perform three operations on a data frame called data:

  1. add a new column new_col that is double another column old_col,

  2. filter for rows where another column, other_col, is more than 5, and

  3. select only the new column new_col for those rows.

One way of performing these three steps is to just write multiple lines of code, storing temporary objects as you go:

output_1 = data.assign(new_col=data["old_col"] * 2)
output_2 = output_1[output_1["other_col"] > 5]
output = output_2.loc[:, "new_col"]

This is difficult to understand for multiple reasons. The reader may be tricked into thinking that the output_1 and output_2 objects are important for some reason, when they are in fact just temporary intermediate computations. Further, the reader has to look through and find where output_1 and output_2 are used in each subsequent line.

Chaining the sequential functions solves this problem, resulting in cleaner and easier-to-follow code. The code below accomplishes the same thing as the previous two code blocks:

output = (
    data.assign(new_col=data["old_col"] * 2)
    .query("other_col > 5")
    .loc[:, "new_col"]
)

Note: You might also have noticed that we split the function calls across lines, similar to when we did this earlier in the chapter for long function calls. Again, this is allowed and recommended, especially when the chained function calls create a long line of code. Doing this makes your code more readable. When you do this, it is important to use parentheses to tell Python that your code is continuing onto the next line.

Chaining df[] and .loc

Let’s work with the tidy tidy_lang data set from Section Tidying up: using .str.split to deal with multiple delimiters, which contains the number of Canadians reporting their primary language at home and work for five major cities (Toronto, Montréal, Vancouver, Calgary, and Edmonton):

tidy_lang
category language region most_at_home most_at_work
0 Aboriginal languages Aboriginal languages, n.o.s. Toronto 50 0
1 Non-Official & Non-Aboriginal languages Afrikaans Toronto 265 0
2 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. Toronto 185 10
3 Non-Official & Non-Aboriginal languages Akan (Twi) Toronto 4045 20
4 Non-Official & Non-Aboriginal languages Albanian Toronto 6380 215
... ... ... ... ... ...
1065 Non-Official & Non-Aboriginal languages Wolof Edmonton 90 10
1066 Aboriginal languages Woods Cree Edmonton 20 0
1067 Non-Official & Non-Aboriginal languages Wu (Shanghainese) Edmonton 120 0
1068 Non-Official & Non-Aboriginal languages Yiddish Edmonton 0 0
1069 Non-Official & Non-Aboriginal languages Yoruba Edmonton 280 0

1070 rows × 5 columns

Suppose we want to create a subset of the data with only the languages and counts of each language spoken most at home for the city of Vancouver. To do this, we can use df[] and .loc. First, we use df[] to create a data frame called van_data that contains only values for Vancouver.

van_data = tidy_lang[tidy_lang["region"] == "Vancouver"]
van_data
category language region most_at_home most_at_work
428 Aboriginal languages Aboriginal languages, n.o.s. Vancouver 15 0
429 Non-Official & Non-Aboriginal languages Afrikaans Vancouver 520 10
430 Non-Official & Non-Aboriginal languages Afro-Asiatic languages, n.i.e. Vancouver 10 0
431 Non-Official & Non-Aboriginal languages Akan (Twi) Vancouver 125 10
432 Non-Official & Non-Aboriginal languages Albanian Vancouver 530 10
... ... ... ... ... ...
637 Non-Official & Non-Aboriginal languages Wolof Vancouver 5 0
638 Aboriginal languages Woods Cree Vancouver 5 0
639 Non-Official & Non-Aboriginal languages Wu (Shanghainese) Vancouver 2495 45
640 Non-Official & Non-Aboriginal languages Yiddish Vancouver 10 0
641 Non-Official & Non-Aboriginal languages Yoruba Vancouver 40 0

214 rows × 5 columns

We then use .loc on this data frame to keep only the variables we want:

van_data_selected = van_data.loc[:, ["language", "most_at_home"]]
van_data_selected
language most_at_home
428 Aboriginal languages, n.o.s. 15
429 Afrikaans 520
430 Afro-Asiatic languages, n.i.e. 10
431 Akan (Twi) 125
432 Albanian 530
... ... ...
637 Wolof 5
638 Woods Cree 5
639 Wu (Shanghainese) 2495
640 Yiddish 10
641 Yoruba 40

214 rows × 2 columns

Although this is valid code, there is a more readable approach we could take by chaining the operations. With chaining, we do not need to create an intermediate object to store the output from df[]. Instead, we can directly call .loc upon the output of df[]:

van_data_selected = tidy_lang[tidy_lang["region"] == "Vancouver"].loc[
    :, ["language", "most_at_home"]
]

van_data_selected
language most_at_home
428 Aboriginal languages, n.o.s. 15
429 Afrikaans 520
430 Afro-Asiatic languages, n.i.e. 10
431 Akan (Twi) 125
432 Albanian 530
... ... ...
637 Wolof 5
638 Woods Cree 5
639 Wu (Shanghainese) 2495
640 Yiddish 10
641 Yoruba 40

214 rows × 2 columns

As you can see, both of these approaches—with and without chaining—give us the same output, but the second approach is clearer and more readable.

Chaining more than two functions

Chaining works with any Python methods, as long as each call returns an object on which the next method can be called. Additionally, we can chain together more than two functions. For example, we can chain together three functions to:

  • extract rows (df[]) to include only those where the counts of the language most spoken at home are greater than 10,000,

  • extract only the columns (.loc) corresponding to region, language and most_at_home, and

  • sort the data frame rows in order (.sort_values) by counts of the language most spoken at home from smallest to largest.

As we saw in Chapter intro, we can use the .sort_values function \index{arrange} to order the rows in the data frame by the values of one or more columns. Here we pass the column name most_at_home to sort the data frame rows by the values in that column, in ascending order.

large_region_lang = (
    tidy_lang[tidy_lang["most_at_home"] > 10000]
    .loc[:, ["region", "language", "most_at_home"]]
    .sort_values(by="most_at_home")
)

large_region_lang
region language most_at_home
865 Edmonton Arabic 10590
402 Montréal Tamil 10670
584 Vancouver Russian 10795
1034 Edmonton Spanish 10880
915 Edmonton French 10950
... ... ... ...
910 Edmonton English 1050410
696 Calgary English 1065070
482 Vancouver English 1622735
273 Montréal French 2669195
54 Toronto English 3836770

67 rows × 3 columns
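
If we instead wanted the largest counts first, we could pass ascending=False to .sort_values (a brief sketch):

large_region_lang.sort_values(by="most_at_home", ascending=False)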

Now that we’ve shown you chaining as an alternative to storing temporary objects and composing code, does this mean you should never store temporary objects or compose code? Not necessarily! There are times when you will still want to do these things. For example, you might store a temporary object before feeding it into a plot function so you can iteratively change the plot without having to redo all of your data transformations. Additionally, chaining many functions can be overwhelming and difficult to debug; you may want to store a temporary object midway through to inspect your result before moving on with further steps.
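
For example, we might store and inspect the filtered rows before selecting columns (a minimal sketch; the names vancouver_rows and vancouver_langs are purely illustrative):

# store the filtered rows so we can inspect them before continuing
vancouver_rows = tidy_lang[tidy_lang["region"] == "Vancouver"]
vancouver_rows.head()  # check that the filter worked as intended

# then carry on with the remaining wrangling steps
vancouver_langs = vancouver_rows.loc[:, ["language", "most_at_home"]]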

Aggregating data with .assign, .agg and .apply

Calculating summary statistics on whole columns

As a part of many data analyses, we need to calculate a summary value for the data (a summary statistic). \index{summarize} Examples of summary statistics we might want to calculate are the number of observations, the average/mean value for a column, the minimum value, etc. Oftentimes, this summary statistic is calculated from the values in a data frame column, or columns, as shown in Fig. 24.

_images/summarize.001.jpeg

Fig. 24 Calculating summary statistics on one or more column(s). In its simplest use case, it creates a new data frame with a single row containing the summary statistic(s) for each column being summarized. The darker, top row of each table represents the column headers.

We can use .assign, as mentioned in Section Using .assign to modify or add columns, along with appropriate summary functions to create an aggregated column.

First a reminder of what region_lang looks like:

region_lang = pd.read_csv("data/region_lang.csv")
region_lang
region category language mother_tongue most_at_home most_at_work lang_known
0 St. John's Aboriginal languages Aboriginal languages, n.o.s. 5 0 0 0
1 Halifax Aboriginal languages Aboriginal languages, n.o.s. 5 0 0 0
2 Moncton Aboriginal languages Aboriginal languages, n.o.s. 0 0 0 0
3 Saint John Aboriginal languages Aboriginal languages, n.o.s. 0 0 0 0
4 Saguenay Aboriginal languages Aboriginal languages, n.o.s. 5 5 0 0
... ... ... ... ... ... ... ...
7485 Ottawa - Gatineau Non-Official & Non-Aboriginal languages Yoruba 265 65 10 910
7486 Kelowna Non-Official & Non-Aboriginal languages Yoruba 5 0 0 0
7487 Abbotsford - Mission Non-Official & Non-Aboriginal languages Yoruba 20 0 0 50
7488 Vancouver Non-Official & Non-Aboriginal languages Yoruba 190 40 0 505
7489 Victoria Non-Official & Non-Aboriginal languages Yoruba 20 0 0 90

7490 rows × 7 columns

We apply min to calculate the minimum, and max to calculate the maximum, number of Canadians reporting a particular language as their primary language at home across all regions, and use .assign to give each result a column name:

lang_summary = pd.DataFrame()
lang_summary = lang_summary.assign(min_most_at_home=[min(region_lang["most_at_home"])])
lang_summary = lang_summary.assign(max_most_at_home=[max(region_lang["most_at_home"])])
lang_summary
min_most_at_home max_most_at_home
0 0 3836770

From this we see that there are some languages in the data set that no one speaks as their primary language at home. We also see that the most commonly spoken primary language at home is spoken by 3,836,770 people.

Calculating summary statistics when there are NaNs

In a pandas data frame, the value NaN is often used to denote missing data. Many of the base Python statistical summary functions (e.g., max, min, sum, etc.) will return NaN when applied to columns containing NaN values. \index{missing data}\index{NA|see{missing data}} Usually that is not what we want; instead, we would like Python to ignore the missing entries and calculate the summary statistic using all of the other non-NaN values in the column. Fortunately, pandas provides equivalent methods (e.g., .max, .min, .sum, etc.) that take an extra argument, skipna, which tells the method what to do when it encounters NaN values. In particular, if we specify skipna=True (the default), the method will ignore missing values and return a summary of all the non-missing entries. We show an example of this below.

First we create a new version of the region_lang data frame, named region_lang_na, that has a seemingly innocuous NaN in the first row of the most_at_home column:
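
One way to construct such a data frame (a minimal sketch, assuming numpy is imported as np) is to copy region_lang and replace a single value with NaN. Note that NaN is a float value, so the column is upcast to a float type; this is why the most_at_home values below are displayed with decimals:

import numpy as np

region_lang_na = region_lang.copy()
# NaN is a float, so convert the column to float before inserting the missing value
region_lang_na["most_at_home"] = region_lang_na["most_at_home"].astype("float")
region_lang_na.loc[0, "most_at_home"] = np.nan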

region_lang_na
region category language mother_tongue most_at_home most_at_work lang_known
0 St. John's Aboriginal languages Aboriginal languages, n.o.s. 5 NaN 0 0
1 Halifax Aboriginal languages Aboriginal languages, n.o.s. 5 0.0 0 0
2 Moncton Aboriginal languages Aboriginal languages, n.o.s. 0 0.0 0 0
3 Saint John Aboriginal languages Aboriginal languages, n.o.s. 0 0.0 0 0
4 Saguenay Aboriginal languages Aboriginal languages, n.o.s. 5 5.0 0 0
... ... ... ... ... ... ... ...
7485 Ottawa - Gatineau Non-Official & Non-Aboriginal languages Yoruba 265 65.0 10 910
7486 Kelowna Non-Official & Non-Aboriginal languages Yoruba 5 0.0 0 0
7487 Abbotsford - Mission Non-Official & Non-Aboriginal languages Yoruba 20 0.0 0 50
7488 Vancouver Non-Official & Non-Aboriginal languages Yoruba 190 40.0 0 505
7489 Victoria Non-Official & Non-Aboriginal languages Yoruba 20 0.0 0 90

7490 rows × 7 columns

Now if we apply the built-in Python summary functions as above, we see that we no longer get the minimum and maximum returned, but just NaN instead!

lang_summary_na = pd.DataFrame()
lang_summary_na = lang_summary_na.assign(
    min_most_at_home=[min(region_lang_na["most_at_home"])]
)
lang_summary_na = lang_summary_na.assign(
    max_most_at_home=[max(region_lang_na["most_at_home"])]
)
lang_summary_na
min_most_at_home max_most_at_home
0 NaN NaN

We can fix this by using the pandas Series methods (i.e., .min and .max) with skipna=True, as explained above:

lang_summary_na = pd.DataFrame()
lang_summary_na = lang_summary_na.assign(
    min_most_at_home=[region_lang_na["most_at_home"].min(skipna=True)]
)
lang_summary_na = lang_summary_na.assign(
    max_most_at_home=[region_lang_na["most_at_home"].max(skipna=True)]
)
lang_summary_na
min_most_at_home max_most_at_home
0 0.0 3836770.0

Calculating summary statistics for groups of rows

A common pairing with summary functions is .groupby. Pairing these functions \index{group_by} together can let you summarize values for subgroups within a data set, as illustrated in Fig. 25. For example, we can use .groupby to group the regions of the tidy_lang data frame and then calculate the minimum and maximum number of Canadians reporting the language as the primary language at home for each of the regions in the data set.

_images/summarize.002.jpeg

Fig. 25 Calculating summary statistics on one or more column(s) for each group. It creates a new data frame—with one row for each group—containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The gray, blue, and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example.

The .groupby function takes at least one argument—the columns to use in the grouping. Here we use only one column for grouping (region), but more than one can also be used. To do this, pass a list of column names to the by argument.
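
For example, to find the maximum count for each combination of region and category, we could group by both columns (a brief sketch):

region_lang.groupby(by=["region", "category"])["most_at_home"].max()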

region_summary = pd.DataFrame()
region_summary = region_summary.assign(
    min_most_at_home=region_lang.groupby(by="region")["most_at_home"].min(),
    max_most_at_home=region_lang.groupby(by="region")["most_at_home"].max()
).reset_index()

region_summary.columns = ["region", "min_most_at_home", "max_most_at_home"]
region_summary
region min_most_at_home max_most_at_home
0 Abbotsford - Mission 0 137445
1 Barrie 0 182390
2 Belleville 0 97840
3 Brantford 0 124560
4 Calgary 0 1065070
... ... ... ...
30 Trois-Rivières 0 149835
31 Vancouver 0 1622735
32 Victoria 0 331375
33 Windsor 0 270715
34 Winnipeg 0 612595

35 rows × 3 columns

pandas also has a convenient method .agg (shorthand for .aggregate) that allows us to apply multiple aggregate functions in one line of code. We just need to pass in a list of function names to .agg as shown below.

region_summary = (
    region_lang.groupby(by="region")["most_at_home"].agg(["min", "max"]).reset_index()
)
region_summary.columns = ["region", "min_most_at_home", "max_most_at_home"]
region_summary
region min_most_at_home max_most_at_home
0 Abbotsford - Mission 0 137445
1 Barrie 0 182390
2 Belleville 0 97840
3 Brantford 0 124560
4 Calgary 0 1065070
... ... ... ...
30 Trois-Rivières 0 149835
31 Vancouver 0 1622735
32 Victoria 0 331375
33 Windsor 0 270715
34 Winnipeg 0 612595

35 rows × 3 columns

Notice that .groupby converts a DataFrame object to a DataFrameGroupBy object, which contains information about the groups of the data frame. We can then apply aggregating functions to the DataFrameGroupBy object.

region_lang.groupby("region")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x16da8ea30>

Calculating summary statistics on many columns

Sometimes we need to calculate summary statistics across many columns. An example of this is illustrated in Fig. 26. In such a case, using summary functions alone means that we have to type out the name of each column we want to summarize. In this section we will meet two strategies for performing this task. First we will see how we can slice the columns with .iloc[] before applying summary functions. Then we will explore how we can use a more general iteration method, .apply, to accomplish the same thing.

_images/summarize.003.jpeg

Fig. 26 .iloc[] or .apply is useful for efficiently calculating summary statistics on many columns at once. The darker, top row of each table represents the column headers.

Aggregating on a data frame for calculating summary statistics on many columns

Recall from Section Using .loc[] and .iloc[] to extract a range of columns that we can use .iloc[] to extract a range of columns by their integer indices. Here we demonstrate finding the maximum value of each of the numeric columns of the region_lang data set by pairing .iloc[] with .max. This works because the summary methods (e.g., .min, .max, .sum, etc.) can be used on data frames as well as on series.

pd.DataFrame(region_lang.iloc[:, 3:].max(axis=0)).T
mother_tongue most_at_home most_at_work lang_known
0 3061820 3836770 3218725 5600480

.apply for calculating summary statistics on many columns

An alternative way to apply a function to many columns is the .apply method. Let’s again find the maximum value of each column of the region_lang data frame, but this time using .apply with the max function. We focus on the two arguments of .apply: the function that you would like to apply to each column, and the axis along which the function will be applied (0 for columns, 1 for rows). Note that .apply does not have an argument to specify which columns to apply the function to. Therefore, we will use .iloc[] before calling .apply to choose the columns for which we want the maximum.

pd.DataFrame(region_lang.iloc[:, 3:].apply(max, axis=0)).T
mother_tongue most_at_home most_at_work lang_known
0 3061820 3836770 3218725 5600480

Note: Just as the base Python statistical summary functions (e.g., max, min, sum, etc.) return NaN when there are NaNs, pairing .apply with these base Python summary functions also returns NaN values when applied to columns that contain NaN values. \index{missing data}

To avoid this, we again need to use the pandas variants of the summary functions (i.e., .max, .min, .sum, etc.) with skipna=True. To use these with .apply, we construct an anonymous function (a lambda) that calls the .max method with skipna=True, as illustrated below:

pd.DataFrame(
    region_lang_na.iloc[:, 3:].apply(lambda col: col.max(skipna=True), axis=0)
).T
mother_tongue most_at_home most_at_work lang_known
0 3061820.0 3836770.0 3218725.0 5600480.0

The .apply function is generally quite useful for solving many problems involving repeatedly applying functions in Python. Additionally, a variant of .apply is .applymap, which can be used to apply functions element-wise. To learn more about these functions, see the additional resources section at the end of this chapter.
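
For instance, .applymap could be used to apply a transformation to every individual value in the count columns (a brief sketch; the scaling shown is purely illustrative):

# divide every count by 1000 to express the values in thousands
region_lang.iloc[:, 3:].applymap(lambda x: x / 1000)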

Apply functions across many columns with .apply

Sometimes we need to apply a function to many columns in a data frame. For example, we would need to do this when converting units of measurements across many columns. We illustrate such a data transformation in Fig. 27.

_images/summarize.005.jpeg

Fig. 27 .apply is useful for applying functions across many columns. The darker, top row of each table represents the column headers.

For example, imagine that we wanted to convert all the numeric columns in the region_lang data frame from the int64 type to the int32 type using the .astype method. When we revisit the region_lang data frame, we can see that the relevant columns are those from mother_tongue to lang_known.

region_lang
region category language mother_tongue most_at_home most_at_work lang_known
0 St. John's Aboriginal languages Aboriginal languages, n.o.s. 5 0 0 0
1 Halifax Aboriginal languages Aboriginal languages, n.o.s. 5 0 0 0
2 Moncton Aboriginal languages Aboriginal languages, n.o.s. 0 0 0 0
3 Saint John Aboriginal languages Aboriginal languages, n.o.s. 0 0 0 0
4 Saguenay Aboriginal languages Aboriginal languages, n.o.s. 5 5 0 0
... ... ... ... ... ... ... ...
7485 Ottawa - Gatineau Non-Official & Non-Aboriginal languages Yoruba 265 65 10 910
7486 Kelowna Non-Official & Non-Aboriginal languages Yoruba 5 0 0 0
7487 Abbotsford - Mission Non-Official & Non-Aboriginal languages Yoruba 20 0 0 50
7488 Vancouver Non-Official & Non-Aboriginal languages Yoruba 190 40 0 505
7489 Victoria Non-Official & Non-Aboriginal languages Yoruba 20 0 0 90

7490 rows × 7 columns

To accomplish such a task, we can use .apply. Column selection works in the same way as we saw in Section .apply for calculating summary statistics on many columns earlier: we again use .iloc to specify the columns, and .apply to specify the function we want to apply to those columns. However, a key difference is that we are not using an aggregating function here, which means that we get back a data frame with the same number of rows.

region_lang.dtypes
region           object
category         object
language         object
mother_tongue     int64
most_at_home      int64
most_at_work      int64
lang_known        int64
dtype: object
# convert each of the numeric columns to the int32 type
region_lang_int32 = region_lang.iloc[:, 3:].apply(lambda col: col.astype("int32"), axis=0)
# reattach the first three (non-numeric) columns to the converted columns
region_lang_int32 = pd.concat((region_lang.iloc[:, :3], region_lang_int32), axis=1)
region_lang_int32
region category language mother_tongue most_at_home most_at_work lang_known
0 St. John's Aboriginal languages Aboriginal languages, n.o.s. 5 0 0 0
1 Halifax Aboriginal languages Aboriginal languages, n.o.s. 5 0 0 0
2 Moncton Aboriginal languages Aboriginal languages, n.o.s. 0 0 0 0
3 Saint John Aboriginal languages Aboriginal languages, n.o.s. 0 0 0 0
4 Saguenay Aboriginal languages Aboriginal languages, n.o.s. 5 5 0 0
... ... ... ... ... ... ... ...
7485 Ottawa - Gatineau Non-Official & Non-Aboriginal languages Yoruba 265 65 10 910
7486 Kelowna Non-Official & Non-Aboriginal languages Yoruba 5 0 0 0
7487 Abbotsford - Mission Non-Official & Non-Aboriginal languages Yoruba 20 0 0 50
7488 Vancouver Non-Official & Non-Aboriginal languages Yoruba 190 40 0 505
7489 Victoria Non-Official & Non-Aboriginal languages Yoruba 20 0 0 90

7490 rows × 7 columns

region_lang_int32.dtypes
region           object
category         object
language         object
mother_tongue     int32
most_at_home      int32
most_at_work      int32
lang_known        int32
dtype: object

We see that we get back a data frame with the same number of columns and rows; the only change is the type of the specified columns (here mother_tongue to lang_known), which are now int32.

Apply functions across columns within one row with .apply

What if you want to apply a function across columns but within one row? We illustrate such a data transformation in Fig. 28.

_images/summarize.004.jpeg

Fig. 28 .apply is useful for applying functions across columns within one row. The darker, top row of each table represents the column headers.

For instance, suppose we want to know the maximum value among mother_tongue, most_at_home, most_at_work and lang_known for each language and region in the region_lang data set. In other words, we want to apply the max function row-wise. Before we use .apply, we will again use .iloc to select only the count columns so we can easily see all the columns of the data frame’s output in the book. So for this demonstration, the data set we are operating on looks like this:

region_lang.iloc[:, 3:]
mother_tongue most_at_home most_at_work lang_known
0 5 0 0 0
1 5 0 0 0
2 0 0 0 0
3 0 0 0 0
4 5 5 0 0
... ... ... ... ...
7485 265 65 10 910
7486 5 0 0 0
7487 20 0 0 50
7488 190 40 0 505
7489 20 0 0 90

7490 rows × 4 columns

Now we use .apply with the argument axis=1 to tell Python that we would like the max function to be applied across the columns within each row, as opposed to down each column (axis=0, which is the default behavior of .apply):

region_lang_rowwise = region_lang.assign(
    maximum=region_lang.iloc[:, 3:].apply(max, axis=1)
)

region_lang_rowwise
region category language mother_tongue most_at_home most_at_work lang_known maximum
0 St. John's Aboriginal languages Aboriginal languages, n.o.s. 5 0 0 0 5
1 Halifax Aboriginal languages Aboriginal languages, n.o.s. 5 0 0 0 5
2 Moncton Aboriginal languages Aboriginal languages, n.o.s. 0 0 0 0 0
3 Saint John Aboriginal languages Aboriginal languages, n.o.s. 0 0 0 0 0
4 Saguenay Aboriginal languages Aboriginal languages, n.o.s. 5 5 0 0 5
... ... ... ... ... ... ... ... ...
7485 Ottawa - Gatineau Non-Official & Non-Aboriginal languages Yoruba 265 65 10 910 910
7486 Kelowna Non-Official & Non-Aboriginal languages Yoruba 5 0 0 0 5
7487 Abbotsford - Mission Non-Official & Non-Aboriginal languages Yoruba 20 0 0 50 50
7488 Vancouver Non-Official & Non-Aboriginal languages Yoruba 190 40 0 505 505
7489 Victoria Non-Official & Non-Aboriginal languages Yoruba 20 0 0 90 90

7490 rows × 8 columns

We see that an additional column, named maximum, has been added to the data frame; it contains the maximum value among mother_tongue, most_at_home, most_at_work and lang_known for each language and region.

Summary

Cleaning and wrangling data can be a very time-consuming process. However, it is a critical step in any data analysis. We have explored many different functions for cleaning and wrangling data into a tidy format. Table 5 summarizes some of the key wrangling functions we learned in this chapter. In the following chapters, you will learn how you can take this tidy data and do so much more with it to answer your burning data science questions!

Table 5 Summary of wrangling functions

Function      Description
.agg          calculates aggregated summaries of inputs
.apply        allows you to apply function(s) to multiple columns/rows
.assign       adds or modifies columns in a data frame
.groupby      allows you to apply function(s) to groups of rows
.iloc         subsets columns/rows of a data frame using integer indices
.loc          subsets columns/rows of a data frame using labels
.melt         generally makes the data frame longer and narrower
.pivot        generally makes a data frame wider and decreases the number of rows
.str.split    splits up a string column into multiple columns

Exercises

Practice exercises for the material covered in this chapter can be found in the accompanying worksheets repository in the “Cleaning and wrangling data” row. You can launch an interactive version of the worksheet in your browser by clicking the “launch binder” button. You can also preview a non-interactive version of the worksheet by clicking “view worksheet.” If you instead decide to download the worksheet and run it on your own machine, make sure to follow the instructions for computer setup found in Chapter move-to-your-own-machine. This will ensure that the automated feedback and guidance that the worksheets provide will function as intended.

Additional resources

  • The pandas package documentation is another resource to learn more about the functions in this chapter, the full set of arguments you can use, and other related functions. The site also provides a very nice cheat sheet that summarizes many of the data wrangling functions from this chapter.

  • Python for Data Analysis [McKinney, 2012] has a few chapters related to data wrangling that go into more depth than this book. For example, the data wrangling chapter covers tidy data, .melt and .pivot, but also covers missing values and additional wrangling functions (like .stack). The data aggregation chapter covers .groupby, aggregating functions, .apply, etc.

  • You will occasionally encounter a case where you need to iterate over items in a data frame, but none of the above functions are flexible enough to do what you want. In that case, you may consider using a for loop [McKinney, 2012].
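
For example, one way to iterate over the rows of a data frame is with the .iterrows method (a minimal sketch; the printing shown is purely illustrative):

# print the region and count of the first few rows, one row at a time
for index, row in region_lang.head().iterrows():
    print(row["region"], row["most_at_home"])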