Tuesday, June 25, 2024

Movies data set Tidy tuesday

Tidy tuesday 25-06-2024

The Data set

This blog post is in response to very nice initiative under tidytuesday. Each week a dataset is provided to work upon. You can learn more about it by visiting: https://github.com/rfordatascience/tidytuesday.
This week’s dataset is on the movies released addressing LGBTQ topic. Before diving in the dataset I want to use the colours from a package called ‘gglgbtq’. The package provides multiple colour palettes based on pride flags with tailored themes. Let’s have a look at some of the palettes

#install.packages("gglgbtq") # needs to be done only once.
library(gglgbtq)
library(tidyverse)
palette_lgbtq("rainbow")

palette_lgbtq("philadelphia")

palette_lgbtq("progress")

Thing to note is not all palettes have same length and can be easily checked with command show_pride(). It will show you the following:

I am not listing all of them because it can be checked easily by vignette(“gallery”, package = “gglgbtq”). Now, let’s explore the dataset.

# tuesdata <- tidytuesdayR::tt_load('2024-06-25')
# #str(tuesdata) # -> It is a list
# lgbtq_movies <- tuesdata[[1]] # extract the dataset in the form of a tibble.
# saveRDS(lgbtq_movies, 'lgbtq_movies.rds') # I saved it to avoid repeated downloads
lgbtq_movies <- readRDS("~/Documents/R Scripts/tidyTuesdays/lgbtq_movies.rds")
dim(lgbtq_movies)
## [1] 7165   12

There are over 7K movies released over the years.

glimpse(lgbtq_movies)
## Rows: 7,165
## Columns: 12
## $ id                <dbl> 860159, 719088, 632632, 929477, 197158, 398818, 5912…
## $ title             <chr> "Crush", "Yes, No, or Maybe Half?", "Given", "Heart …
## $ original_title    <chr> "Crush", "イエスかノーか半分か", "映画 ギヴン", "Hea…
## $ original_language <chr> "en", "ja", "ja", "en", "pt", "en", "en", "en", "en"…
## $ overview          <chr> "When an aspiring young artist is forced to join her…
## $ release_date      <date> 2022-04-29, 2020-12-11, 2020-08-22, 2022-02-17, 198…
## $ popularity        <dbl> 321.755, 139.229, 110.140, 88.760, 76.302, 85.425, 8…
## $ vote_average      <dbl> 7.5, 7.1, 8.4, 5.4, 4.3, 8.2, 6.8, 8.0, 5.7, 8.0, 6.…
## $ vote_count        <dbl> 120, 26, 318, 37, 46, 10242, 1542, 14668, 814, 14654…
## $ adult             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ video             <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ genre_ids         <chr> "[35, 10749]", "[16, 18, 10749]", "[16, 18, 10402, 1…

There is quite some information in the data set.
Let’s see how many languages are there in it.

unique(lgbtq_movies$original_language) |> length()
## [1] 65

There are 65 languages in which movies were released. Let’s see which are the front runner languages.

The Plot

lgbtq_movies |>
  group_by(original_language) |> 
  summarise(n = n()) |> 
  mutate(
    Perc = n/sum(n) * 100,
    original_language = case_when(
      original_language == "en" ~ "English",
      original_language == "es" ~ "Spanish",
      original_language == "fr" ~ "French",
      original_language == "de" ~ "German",
      original_language == "ja" ~ "Japanese",
      original_language == "pt" ~ "Portuguese",
      original_language == "it" ~ "Italian",
      original_language == "ko" ~ "Korean",
      .default = original_language
    )
  ) |> 
  arrange(desc(Perc)) |> 
  slice_head(n = 8) |> 
  mutate(original_language = reorder(original_language, Perc),
         Perc1 = case_when(
           original_language == 'English' ~ Perc,
           .default = NA
         ),
         Perc2 = case_when(
           original_language != 'English' ~ Perc,
           .default = NA
         )) |> 
  ggplot() +
  geom_col(aes(x = original_language,
               y = Perc,
               fill = original_language)) +
  geom_text(aes(x = original_language,
               y = Perc,
               label = scales::percent(round(Perc1/100, 3))),# label it with percent sign
            colour = "white",
            hjust = 1.2) +
  geom_text(aes(x = original_language,
               y = Perc,
               label = scales::percent(round(Perc2/100, 3))),# label it with percent sign
            colour = "black",
            hjust = -1)+
  scale_fill_manual(values = palette_lgbtq("philadelphia")) +
  theme_bw() +
  theme(
    panel.border = element_blank(),
    axis.ticks = element_blank(), 
    panel.grid.minor.x = element_blank(),
    panel.grid.major = element_blank(),
    panel.grid.major.y = element_blank(),
    axis.text.x = element_blank()
  ) +
  coord_flip() +
  geom_vline(xintercept = seq(0.5, 8.5, 1),
             colour = 'grey50',
             linewidth = 0.2)+
  # to give it a clean look put a vline at zero.
  geom_hline(yintercept = 0,
             colour = "black",
             linewidth = 0.5) +
  guides(fill = FALSE) +
  labs(
    y = "Percentage",
    x = "Language in which Movie originally released",
    title = "Top Eight languages",
    caption = 'There are 65 different languages listed in the dataset'
  )

The English language, by far, outruns all others, the nearest, Spanish is 64.3 percentage points behind.
Now, I really would like to see the same thing but for years. Is there any trend? ok let’s use line graph to check for any trends.

# The dataset has date on which movie was released
lgbtq_movies |> 
  mutate(Year = lubridate::year(release_date)) |> 
  group_by(Year) |> 
  count() |> 
  ggplot() +
  geom_line(aes(x = Year,
                y = n),
            colour = palette_lgbtq("progress")[8],
            size = 1.2) +
  geom_point(aes(
    x = 2019,
    y = 526),
    colour = palette_lgbtq("philadelphia")[8],
    alpha = 1/100,
    size = 5) +
  geom_curve(data = data.frame(x = 2017, y = 526, xend = 2000, yend = 480),
             aes(x = x, y = y, xend = xend, yend = yend),
             curvature = 0.20,
             arrow = arrow(30L, unit(0.1, "inches"),
"first", "closed")) +
  geom_label(aes(x = 1995,
                y = 450, 
                label = "The year '2019' \nsaw most  movies")
  ) +
  theme_bw() +
  labs(y = 'Counts',
       x = 'Years') +
  theme(
    panel.grid = element_line(
      colour = palette_lgbtq("philadelphia")[6],
      linetype = 'dotdash'
    ),
    panel.border = element_blank()
  )

The year 2019 has seen highest movies so far i.e. 526!
The data set is still quite big and a lot more EDA can still be done but I will take that up in a new tidy tuesday data set.