\documentclass{beamer}
\usepackage{framed}
\usepackage{graphicx}
\begin{document}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\noindent \textbf{Plotting with categorical data}
\begin{itemize}
\item We previously learned how to use scatterplots and regression model fits to visualize the relationship between two variables and how it changes across levels of additional categorical variables.
\item However, what if one of the main variables you are interested in is categorical? In this case, the scatterplot and regression model approach won’t work. There are several options, however, for visualizing such a relationship, which we will discuss in this tutorial.
\end{itemize}
\end{frame}
%===================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{itemize}
\item It’s useful to divide seaborn’s categorical plots into two groups: those that show the full distribution of observations within each level of the categorical variable, and those that apply a statistical estimation to show a measure of central tendency and confidence interval.
\item The former includes the functions \texttt{stripplot()}, \texttt{boxplot()}, and \texttt{violinplot()}, while the latter includes the functions \texttt{barplot()}, \texttt{countplot()}, and \texttt{pointplot()}.
\item These functions all share a basic API for how they accept data, although each has specific parameters that control the particulars of the visualization that is applied to that data.
\end{itemize}
\end{frame}
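%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\small
As a rough illustration of the second group, the sketch below computes a mean and a percentile bootstrap 95\% confidence interval with NumPy. The helper name \texttt{bootstrap\_ci} and the data values are made up for this example, and seaborn's own estimation code may differ in detail.
\footnotesize
\begin{verbatim}
import numpy as np

def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap interval for the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record each mean
    idx = rng.integers(0, len(values),
                       size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    lo = (100 - ci) / 2
    return np.percentile(boot_means, [lo, 100 - lo])

bills = [16.99, 10.34, 21.01, 23.68, 24.59, 25.29]  # toy data
low, high = bootstrap_ci(bills)
print(np.mean(bills), low, high)
\end{verbatim}
\end{frame}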
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
Much like the relationship between \texttt{regplot()} and \texttt{lmplot()}, seaborn offers both relatively low-level and relatively high-level approaches for making categorical plots. The functions named above are all low-level in that they plot onto a specific matplotlib axes. There is also the higher-level \texttt{factorplot()} (renamed \texttt{catplot()} in seaborn 0.9), which combines these functions with a \texttt{FacetGrid} to apply a categorical plot across a grid of figure panels.
\end{frame}
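%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\small
A hedged sketch of the two levels, assuming seaborn 0.9 or later, where the figure-level function is named \texttt{catplot()}. The small DataFrame is invented so that the example does not depend on a downloaded dataset.
\footnotesize
\begin{verbatim}
import matplotlib
matplotlib.use("Agg")  # draw off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri"] * 2,
    "time": ["Lunch"] * 4 + ["Dinner"] * 4,
    "total_bill": [12.5, 18.0, 15.2, 22.1,
                   9.8, 30.4, 11.4, 27.3],
})

# Axes-level: draws onto the axes we pass in
fig, ax = plt.subplots()
ax = sns.stripplot(x="day", y="total_bill", data=df, ax=ax)

# Figure-level: builds its own FacetGrid of panels
grid = sns.catplot(x="day", y="total_bill", col="time",
                   data=df, kind="strip")
\end{verbatim}
\end{frame}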
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
It is easiest and best to invoke these functions with a DataFrame that is in “tidy” format, although the lower-level functions also accept wide-form DataFrames or simple vectors of observations. See below for examples.
\end{frame}
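%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\small
To make the distinction concrete, the sketch below builds a small wide-form DataFrame and reshapes it to tidy (long) form with the pandas \texttt{melt()} method. The column names and values are invented for illustration.
\footnotesize
\begin{verbatim}
import pandas as pd

# Wide form: one column per category level
wide = pd.DataFrame({"Thur": [12.5, 18.0, 9.8],
                     "Fri": [15.2, 22.1, 11.4]})

# Tidy form: one row per observation,
# one column per variable
tidy = wide.melt(var_name="day", value_name="total_bill")
print(tidy)
\end{verbatim}
\small
With the tidy frame, each variable can be named directly, for example \texttt{x="day"} and \texttt{y="total\_bill"}.
\end{frame}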
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{verbatim}
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
np.random.seed(sum(map(ord, "categorical")))
titanic = sns.load_dataset("titanic")
tips = sns.load_dataset("tips")
iris = sns.load_dataset("iris")
\end{verbatim}
\end{frame}
%====================================%
\section{Distributions of observations within categories}
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\noindent \textbf{Distributions of observations within categories}
\begin{itemize}
\item The first set of functions shows the full distribution of the quantitative variable within each level of the categorical variable(s).
\item These generalize some of the approaches we discussed in the distributions chapter to the case where we want to quickly compare across several distributions.
\end{itemize}
\end{frame}
\section{Categorical scatterplots}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\noindent \textbf{Categorical scatterplots}\\
A simple way to show the distribution of some quantitative variable across the levels of a categorical variable is to use \texttt{stripplot()}, which generalizes a scatterplot to the case where one of the variables is categorical:
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{verbatim}
sns.stripplot(x="day", y="total_bill", data=tips);
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_9_0}
\end{figure}
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{itemize}
\item It’s also possible to add a nested categorical variable with the \texttt{hue} parameter.
\item In the previous plot, the color and the position on the categorical axis were redundant; with \texttt{hue}, each provides information about one of the two variables:
\end{itemize}
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{verbatim}
sns.stripplot(x="day", y="total_bill", hue="time", data=tips);
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_11_0}
\end{figure}
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{verbatim}
sns.stripplot(x="size", y="total_bill", data=tips.sort_values("size"));
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_13_0}
\end{figure}
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
With these plots, it’s often helpful to put the categorical variable on the vertical axis (this is particularly useful when the category names are relatively long or there are many categories):
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{verbatim}
sns.stripplot(x="total_bill", y="day", hue="time", data=tips);
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_15_0}
\end{figure}
\end{frame}
\section{Boxplots}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\noindent \textbf{Boxplots}\\
\begin{itemize}
\item At a certain point, the categorical scatterplot approach becomes limited in the information it can provide about the distribution of values within each category.
\item There are several ways to summarize this information in ways that facilitate easy comparisons across the category levels.
\end{itemize}
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{itemize}
\item The first is the familiar \texttt{boxplot()}. This kind of plot shows the three quartile values of the distribution along with extreme values.
\item The “whiskers” extend to the most extreme observations that lie within 1.5 IQRs (interquartile ranges) of the lower and upper quartiles; observations that fall outside this range are displayed individually.
\item Importantly, this means that each value in the boxplot corresponds to an actual observation in the data:
\end{itemize}
\end{frame}
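%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\small
The whisker rule can be sketched with NumPy as follows. The sample values are made up, and the quartiles use NumPy's default linear interpolation, which is an assumption about how the plotted quartiles are computed.
\footnotesize
\begin{verbatim}
import numpy as np

x = np.array([3.1, 7.3, 10.1, 12.0, 13.4,
              15.0, 16.2, 18.9, 21.5, 48.3])
q1, med, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# Whiskers stop at the most extreme observations
# that lie within 1.5 IQRs of the quartiles
inside = x[(x >= q1 - 1.5 * iqr) & (x <= q3 + 1.5 * iqr)]
low_whisker, high_whisker = inside.min(), inside.max()

# Anything beyond the whiskers is drawn on its own
outliers = x[(x < low_whisker) | (x > high_whisker)]
print(low_whisker, high_whisker, outliers)
\end{verbatim}
\end{frame}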
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\begin{verbatim}
sns.boxplot(x="day", y="total_bill", hue="time", data=tips);
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_17_0}
\end{figure}
\end{frame}
\section{Violinplots}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\noindent \textbf{Violinplots}\\
A different approach is \texttt{violinplot()}, which combines a boxplot with the kernel density estimation procedure described in the distributions tutorial:
\begin{verbatim}
sns.violinplot(x="total_bill", y="day", hue="time", data=tips);
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_19_0}
\end{figure}
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{itemize}
\item This approach uses the kernel density estimate to provide a better description of the distribution of values.
\item Additionally, the quartile and whisker values from the boxplot are shown inside the violin.
\item Because the violinplot uses a KDE, there are some other parameters that may need tweaking, adding some complexity relative to the straightforward boxplot:
\end{itemize}
\end{frame}
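%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\small
To show why the bandwidth matters, here is a minimal Gaussian KDE written directly in NumPy. It is a sketch of the general technique, not seaborn's implementation: the function name \texttt{gaussian\_kde\_1d}, the data, and the bandwidth values are illustrative, and \texttt{bw} here is an absolute width in data units, which may not match how seaborn scales its own \texttt{bw} argument.
\footnotesize
\begin{verbatim}
import numpy as np

def gaussian_kde_1d(data, grid, bw):
    """Average of Gaussian bumps, one per observation."""
    z = (grid[:, None] - data[None, :]) / bw
    bumps = np.exp(-0.5 * z**2) / (bw * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

data = np.array([10.0, 11.0, 12.5, 20.0, 21.0, 35.0])
grid = np.linspace(0, 45, 451)
narrow = gaussian_kde_1d(data, grid, bw=0.5)  # wiggly
wide = gaussian_kde_1d(data, grid, bw=5.0)    # smooth

# A smaller bandwidth gives taller, sharper peaks
print(narrow.max(), wide.max())
\end{verbatim}
\end{frame}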
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large
\begin{verbatim}
sns.violinplot(x="total_bill", y="day", hue="time", data=tips,
bw=.1, scale="count", scale_hue=False);
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_21_0}
\end{figure}
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
\large It’s also possible to “split” the violins when the hue parameter has only two levels, which can allow for a more efficient use of space:
\begin{verbatim}
sns.violinplot(x="day", y="total_bill", hue="sex", data=tips, split=True);
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_23_0}
\end{figure}
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
Finally, there are several options for the plot that is drawn on the interior of the violins, including ways to show each individual observation instead of the summary boxplot values:
\begin{verbatim}
sns.violinplot(x="day", y="total_bill", hue="sex", data=tips,
split=True, inner="stick", palette="Set3");
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_25_0}
\end{figure}
\end{frame}
%====================================%
\begin{frame}[fragile]
\frametitle{Seaborn Workshop}
It can also be useful to combine \texttt{stripplot()} with \texttt{violinplot()} or \texttt{boxplot()} to show each observation along with a summary of the distribution:
\begin{verbatim}
sns.violinplot(x="day", y="total_bill", data=tips, inner=None)
sns.stripplot(x="day", y="total_bill", data=tips, jitter=True, size=4);
\end{verbatim}
\begin{figure}
\centering
\includegraphics[width=0.7\linewidth]{images/categorical_27_0}
\end{figure}
\end{frame}
%============================================= %
\end{document}