Joy of Plotting
Vikram Ranga
2023-12-06
Plots
In statistics, Exploratory Data Analysis (EDA) is an approach in which an understanding of the data and its behavior is built by plotting one or more variables. This is in contrast to confirmatory analysis, in which a definite relationship between variables is established. That leads to model building: a model is a mathematical relationship between two or more variables. For example, below is a model representing a line, popularly known as the linear regression model.
\[ x = \alpha + \beta y + \epsilon \]
Here \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\epsilon\) is the error term. However, there are numerous models, and each comes with its own assumptions. These assumptions can be checked through careful inspection of EDA plots. People use a number of plots, such as boxplots, histograms, scatterplots, line charts, and lollipop plots, to name a few. Here I take some freely available data and make some nice (or at least I try :-p) plots. In R, arguably one of the most popular programming languages, there are three plotting systems:
1. Base plotting system
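library(dplyr)  # provides filter(); without it, stats::filter() is called and fails
penguins <- modeldata::penguins
# Keep only rows where both bill measurements are present
penguins <- penguins |>
  filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm))
billL <- penguins$bill_length_mm
billD <- penguins$bill_depth_mm
plot(billL, billD, type = 'p', xlab = "Bill Length", ylab = "Bill Depth")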
2. Lattice plotting system
library(lattice)
xyplot(billD ~ billL, type = 'p', xlab = "Bill Length", ylab = "Bill Depth")
Now, this is a great improvement over the base plotting system, with neat and tidy tick marks.
3. Grammar of graphics, or ggplot2, plotting system
library(ggplot2)
penguins <- modeldata::penguins
penguins |>
  filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm)) |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  labs(x = "Bill Length",
       y = "Bill Depth") +
  theme_bw()
As you can see, the difference is that the graph is built from layers of commands joined by ‘+’. This is the philosophy of the grammar of graphics, developed by Leland Wilkinson and implemented in the ggplot2 package for R.
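To make the layering idea concrete, here is a minimal sketch that builds the same scatterplot one layer at a time. The object names `base`, `with_points`, and `with_trend` are only illustrative, and the `geom_smooth()` layer is an extra that is not part of the plot above.

```r
library(ggplot2)

# Drop rows with missing bill measurements (base R subsetting, no dplyr needed)
penguins <- modeldata::penguins
penguins <- penguins[!is.na(penguins$bill_length_mm) &
                       !is.na(penguins$bill_depth_mm), ]

# Step 0: data and aesthetic mapping only; printing this draws empty axes
base <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm))

# Step 1: add a layer of points
with_points <- base + geom_point()

# Step 2: add a straight-line fit on top of the points
with_trend <- with_points + geom_smooth(method = "lm", formula = y ~ x)

# Each '+' appended one layer to the plot object
length(with_points$layers)
length(with_trend$layers)

with_trend
```

Because every intermediate object is itself a complete plot, you can print any of them to see what a given layer contributes.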
In any of these plotting systems, it is possible to produce a very wide variety of graphs. Some of the most popular are boxplots, histograms, scatterplots, and line graphs.
Boxplots
library(ggplot2)
penguins <- modeldata::penguins
penguins |>
  filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm)) |>
  ggplot(aes(x = bill_length_mm, colour = species)) +
  geom_boxplot() +
  scale_color_brewer(palette = "Set2") +
  coord_flip() +
  labs(x = "Bill Length",
       title = "Bill Lengths of Penguin Species") +
  theme_dark()
It is not very hard to see that Adelie penguins have shorter bills than the other two species. With just one graph, we are able to see differences in the data.
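What the boxplot shows can be backed up with numbers. The sketch below computes the median and interquartile range of bill length per species, which are the quantities a boxplot draws as the centre line and the box. The name `bill_summary` and its column names are my own choices for illustration.

```r
library(dplyr)

penguins <- modeldata::penguins

# Median and IQR of bill length for each species
bill_summary <- penguins |>
  filter(!is.na(bill_length_mm)) |>
  group_by(species) |>
  summarise(
    n = n(),
    median_bill_length = median(bill_length_mm),
    iqr_bill_length = IQR(bill_length_mm)
  )

bill_summary
```

Reading the table next to the plot is a quick check that the picture and the numbers tell the same story.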
Scatterplots
penguins <- modeldata::penguins
penguins |>
  filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm)) |>
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm, colour = species)) +
  geom_point() +
  scale_color_brewer(palette = "Set2") +
  labs(x = "Bill Length",
       y = "Bill Depth",
       title = "Bill Length vs. Bill Depth of Penguin Species") +
  theme_dark()
OK! Things are getting clearer, aren’t they? We can see that the three species differ, and that their characteristics are quite distinct!
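A scatterplot like this is also where EDA meets the linear model from the introduction. As a rough sketch, the code below fits a straight line with `lm()`, taking bill depth as the response and bill length as the predictor (that choice is mine, for illustration), once for all penguins pooled together and once per species. The names `pooled_fit` and `species_fits` are hypothetical.

```r
# Drop rows with missing bill measurements
penguins <- modeldata::penguins
penguins <- penguins[!is.na(penguins$bill_length_mm) &
                       !is.na(penguins$bill_depth_mm), ]

# One line for all penguins pooled together
pooled_fit <- lm(bill_depth_mm ~ bill_length_mm, data = penguins)

# One line per species: split the data, fit each piece, keep the coefficients
species_fits <- lapply(
  split(penguins, penguins$species),
  function(d) coef(lm(bill_depth_mm ~ bill_length_mm, data = d))
)

coef(pooled_fit)              # intercept and slope for the pooled data
do.call(rbind, species_fits)  # one row of intercept and slope per species
```

Comparing the pooled slope with the per-species slopes is one way to check whether the grouping visible in the scatterplot matters for the model you intend to build.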
Histograms
library(ggplot2)
library(patchwork)
penguins <- modeldata::penguins
# Histogram of bill length, one panel per species
bilL <- penguins |>
  filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm)) |>
  ggplot(aes(x = bill_length_mm, fill = species)) +
  geom_histogram(bins = 50) +
  facet_wrap(~species) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +  # fill scale, to match the fill aesthetic
  labs(x = "Bill Length") +
  theme_dark()
# Histogram of bill depth, one panel per species
bilD <- penguins |>
  filter(!is.na(bill_length_mm) & !is.na(bill_depth_mm)) |>
  ggplot(aes(x = bill_depth_mm, fill = species)) +
  geom_histogram(bins = 50) +
  facet_wrap(~species) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Bill Depth") +
  theme_dark()
# Stack the two plots vertically with patchwork
bilL / bilD
Voila! A lot is clearer now with just two variables. I once attended a conference where a statistician said: “If you don’t understand your scatterplot, print it on the largest possible paper, roll it out, and see how the points are distributed.” Well, this is a bit extreme, but EDA is taken seriously everywhere!
Spatial Plots or Maps
Now I want to show the power of spatial plots, aka maps! A number of software packages have been released in recent years that give tough competition to traditional GUI-based proprietary software. In R, it is quite possible to use the grammar of graphics to make impressive maps, thanks to the tmap package.
library(tmap)
library(spData)
usState <- spData::us_states
# Rename the second column (the state names) so the mapped variable reads "States"
names(usState)[2] <- "States"
# Build the map layer by layer with tmap (arguments as in tmap version 3)
tm_shape(usState) +
  tm_grid() +
  tm_polygons(col = "States", legend.show = FALSE) +
  tm_compass(position = c("right", "top")) +
  tm_scale_bar()
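The same layered approach also works for maps in ggplot2 itself. The following is only a sketch of an alternative, not the method used above: it assumes the sf package is installed, uses `geom_sf()` to draw the state polygons, and colours them by the `REGION` column of `us_states`. The object name `region_map` is hypothetical.

```r
library(ggplot2)
library(sf)  # lets geom_sf() work with the geometry column of us_states

us_states <- spData::us_states

# One polygon layer, filled by region, built with '+' like any other ggplot
region_map <- ggplot(us_states) +
  geom_sf(aes(fill = REGION)) +
  scale_fill_brewer(palette = "Set2") +
  labs(fill = "Region", title = "US States by Region") +
  theme_bw()

region_map
```

Which package to use is largely a matter of taste: tmap offers map furniture such as a compass and a scale bar out of the box, while ggplot2 keeps maps in the same system as the other plots in this post.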
The maps and plots above do not even scratch the surface of the capabilities of R and other programming languages such as Python, but they are pointers in the direction of tremendous possibilities.