Tidy tuesday 25-06-2024
Vikram Ranga
25-06-2024
The Data set
This blog post is in response to very nice initiative under
tidytuesday. Each week a dataset is provided to work upon. You can learn
more about it by visiting: https://github.com/rfordatascience/tidytuesday.
This week’s dataset is on the movies released addressing LGBTQ topic.
Before diving in the dataset I want to use the colours from a package
called ‘gglgbtq’. The package provides multiple colour palettes
based on pride flags with tailored themes. Let’s have a look at some of
the palettes
#install.packages("gglgbtq") # needs to be done only once.
library(gglgbtq)
library(tidyverse)
palette_lgbtq("rainbow")
palette_lgbtq("philadelphia")
palette_lgbtq("progress")
I am not listing all of them because it can be checked easily by vignette(“gallery”, package = “gglgbtq”). Now, let’s explore the dataset.
# tuesdata <- tidytuesdayR::tt_load('2024-06-25')
# #str(tuesdata) # -> It is a list
# lgbtq_movies <- tuesdata[[1]] # extract the dataset in the form of a tibble.
# saveRDS(lgbtq_movies, 'lgbtq_movies.rds') # I saved it to avoid repeated downloads
lgbtq_movies <- readRDS("~/Documents/R Scripts/tidyTuesdays/lgbtq_movies.rds")
dim(lgbtq_movies)
## [1] 7165 12
There are over 7K movies released over the years.
glimpse(lgbtq_movies)
## Rows: 7,165
## Columns: 12
## $ id <dbl> 860159, 719088, 632632, 929477, 197158, 398818, 5912…
## $ title <chr> "Crush", "Yes, No, or Maybe Half?", "Given", "Heart …
## $ original_title <chr> "Crush", "イエスかノーか半分か", "映画 ギヴン", "Hea…
## $ original_language <chr> "en", "ja", "ja", "en", "pt", "en", "en", "en", "en"…
## $ overview <chr> "When an aspiring young artist is forced to join her…
## $ release_date <date> 2022-04-29, 2020-12-11, 2020-08-22, 2022-02-17, 198…
## $ popularity <dbl> 321.755, 139.229, 110.140, 88.760, 76.302, 85.425, 8…
## $ vote_average <dbl> 7.5, 7.1, 8.4, 5.4, 4.3, 8.2, 6.8, 8.0, 5.7, 8.0, 6.…
## $ vote_count <dbl> 120, 26, 318, 37, 46, 10242, 1542, 14668, 814, 14654…
## $ adult <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ video <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ genre_ids <chr> "[35, 10749]", "[16, 18, 10749]", "[16, 18, 10402, 1…
There is quite some information in the data set.
Let’s see how
many languages are there in it.
unique(lgbtq_movies$original_language) |> length()
## [1] 65
There are 65 languages in which movies were released. Let’s see which are the front runner languages.
The Plot
lgbtq_movies |>
group_by(original_language) |>
summarise(n = n()) |>
mutate(
Perc = n/sum(n) * 100,
original_language = case_when(
original_language == "en" ~ "English",
original_language == "es" ~ "Spanish",
original_language == "fr" ~ "French",
original_language == "de" ~ "German",
original_language == "ja" ~ "Japanese",
original_language == "pt" ~ "Portuguese",
original_language == "it" ~ "Italian",
original_language == "ko" ~ "Korean",
.default = original_language
)
) |>
arrange(desc(Perc)) |>
slice_head(n = 8) |>
mutate(original_language = reorder(original_language, Perc),
Perc1 = case_when(
original_language == 'English' ~ Perc,
.default = NA
),
Perc2 = case_when(
original_language != 'English' ~ Perc,
.default = NA
)) |>
ggplot() +
geom_col(aes(x = original_language,
y = Perc,
fill = original_language)) +
geom_text(aes(x = original_language,
y = Perc,
label = scales::percent(round(Perc1/100, 3))),# label it with percent sign
colour = "white",
hjust = 1.2) +
geom_text(aes(x = original_language,
y = Perc,
label = scales::percent(round(Perc2/100, 3))),# label it with percent sign
colour = "black",
hjust = -1)+
scale_fill_manual(values = palette_lgbtq("philadelphia")) +
theme_bw() +
theme(
panel.border = element_blank(),
axis.ticks = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major = element_blank(),
panel.grid.major.y = element_blank(),
axis.text.x = element_blank()
) +
coord_flip() +
geom_vline(xintercept = seq(0.5, 8.5, 1),
colour = 'grey50',
linewidth = 0.2)+
# to give it a clean look put a vline at zero.
geom_hline(yintercept = 0,
colour = "black",
linewidth = 0.5) +
guides(fill = FALSE) +
labs(
y = "Percentage",
x = "Language in which Movie originally released",
title = "Top Eight languages",
caption = 'There are 65 different languages listed in the dataset'
)
The English language, by far, outruns all others, the nearest,
Spanish is 64.3 percentage points behind.
Now, I really would
like to see the same thing but for years. Is there any trend? ok let’s
use line graph to check for any trends.
# The dataset has date on which movie was released
lgbtq_movies |>
mutate(Year = lubridate::year(release_date)) |>
group_by(Year) |>
count() |>
ggplot() +
geom_line(aes(x = Year,
y = n),
colour = palette_lgbtq("progress")[8],
size = 1.2) +
geom_point(aes(
x = 2019,
y = 526),
colour = palette_lgbtq("philadelphia")[8],
alpha = 1/100,
size = 5) +
geom_curve(data = data.frame(x = 2017, y = 526, xend = 2000, yend = 480),
aes(x = x, y = y, xend = xend, yend = yend),
curvature = 0.20,
arrow = arrow(30L, unit(0.1, "inches"),
"first", "closed")) +
geom_label(aes(x = 1995,
y = 450,
label = "The year '2019' \nsaw most movies")
) +
theme_bw() +
labs(y = 'Counts',
x = 'Years') +
theme(
panel.grid = element_line(
colour = palette_lgbtq("philadelphia")[6],
linetype = 'dotdash'
),
panel.border = element_blank()
)
The year 2019 has seen highest movies so far i.e. 526!
The data
set is still quite big and a lot more EDA can still be done but I will
take that up in a new tidy tuesday data set.