index – MDR

Helly vs. Helena, a voice analysis

To my ear, Helena’s voice has always sounded deeper to me compared to Helly’s. I first noticied this when I suspected that it was Helena on the severed floor at the beginning of season 2, and have continued to hear it. Now after episode 9, I am again wondering who is actually on the severed floor, so I decided to see if we could collect some data. I started by taking a random sample of vocals we know are Helly’s. I took the transcript data from the mdr R package and filtered to include just rows where Helly was the speaker and the dialogue was at least 100 characters. I then randomly sampled 10 (although we ended up with 9, one was accidentally a line from season 1 episode 4 where Helena was speaking to Helly via the TV). Here is my Helly sample:

Code

library(mdr)
library(glue)
library(voice)
library(tidyverse)
library(tidymodels)

set.seed(2938)

helly_sample <- transcripts |>
  filter(grepl("Helly", speaker),
         nchar(dialogue) > 100,
         (season == 1 & episode != 9 | season == 2 & episode == 5),
         !(season == 1 & episode == 1 & minute(timestamp) == 30)) |>
  slice_sample(n = 10) |>
  arrange(season, episode)

helena_sample <- transcripts |>
  filter(grepl("Helly", speaker),
         nchar(dialogue) > 100,
         season == 2, episode %in% c(1, 4)) |>
  arrange(season, episode) |>
  slice(1:5)

helly_sample

# A tibble: 10 × 5
   season episode timestamp speaker dialogue                                    
    <int>   <int> <time>    <chr>   <chr>                                       
 1      1       2 14'50"    Helly   I really don’t. I guess I went home last ni…
 2      1       2 48'10"    Helly   Okay, but how good are the scanners? Like, …
 3      1       3 46'56"    Helly   “Forgive me for the harm I have caused this…
 4      1       3 22'14"    Helly   Mark, I don’t wanna work here with you. So …
 5      1       4 21'16"    Helly   Well, boss. I guess this is the part where …
 6      1       4 21'56"    Helly   Helly, I watched your video asking that I r…
 7      1       5 39'37"    Helly   I mean, what if the goats are the numbers? …
 8      1       7 35'18"    Helly   Yeah, see. There are two lever switches tha…
 9      1       8 28'07"    Helly   And we don’t know how long Dylan will be ab…
10      2       5 25'10"    Helly   You don't. You just have to trust me. This …

I then sampled the same from the episodes we know are Helena. After filtering to only lines of dialogue that were at least 100 characters, excluding one where she and Mark are laughing (so the audio is messy), we ended up with 5 such samples.

# A tibble: 5 × 5
  season episode timestamp speaker dialogue                                     
   <int>   <int> <time>    <chr>   <chr>                                        
1      2       1 33'06"    Helly R "Then I went outside and found a guy. He loo…
2      2       1 36'25"    Helly R "Right. Of course. I mean, assuming she's st…
3      2       1 36'54"    Helly R "We're not the same, actually. Us and the ou…
4      2       4 10'36"    Helly   "\"I was not born into this world alone. The…
5      2       4 10'55"    Helly   "\"In infancy, he was my bosom friend. But a…

I likewise took a sample from episode 9, when Helly/Helena was talking to Dylan (labeled ‘unknown’ below). I then recorded the audio and estimated the fundamental frequency through a short-term cepstral transform. The figure below shows the distribution of average fundamental frequency by who was speaking. Notably, the ‘unknown’ sample aligns more closely with Helena than with Helly. Specifically, the mean pitch of ‘unknown’ falls within Helena’s interquartile range (the ‘box’), whereas it aligns with Helly’s whiskers (the less common values).

Code

get_path <- function(n, who) {
  glue("data/o{n}_{who}.wav")
}
get_pitch <- function(path, i) {
  pitch_values <- extract_features(path, stereo2mono = TRUE) |>
    as_tibble()
  max_seconds <- max(pitch_values$section_seq_file)
  pitch_values$norm_section <- pitch_values$section_seq_file / max_seconds
  pitch_values |>
    mutate(i = i) |>
    select(norm_section, f0:f8, i)
}

get_avg_pitch <- function(pitch_values, what = "f0") {
  pitch_values <- pitch_values |> filter(f0 < 300)
  tibble(mean = mean(pitch_values[[what]], na.rm = TRUE),
         sd = sd(pitch_values[[what]], na.rm = TRUE),
         median = median(pitch_values[[what]], na.rm = TRUE),
         skewness = moments::skewness(pitch_values[[what]], na.rm = TRUE),
         kurtosis = moments::kurtosis(pitch_values[[what]], na.rm = TRUE))
}

helly_path <- purrr::map(1:9, get_path, who = "helly")
helly_pitch <- purrr::imap(helly_path, get_pitch) 
helly_pitch_df <- helly_pitch |> bind_rows()

helena_path <- purrr::map(1:5, get_path, who = "helena")
helena_pitch <- purrr::imap(helena_path, get_pitch) 
helena_pitch_df <- helena_pitch |> bind_rows()

unknown_pitch <- get_pitch("data/q1.wav", 1) 


helly_avg <- purrr::map_df(helly_pitch, get_avg_pitch)
helly_avg$class <- "helly"
helena_avg <- purrr::map_df(helena_pitch, get_avg_pitch)
helena_avg$class <- "helena"
unknown_avg <- get_avg_pitch(unknown_pitch)
unknown_avg$class <- "unknown"
data <- bind_rows(helly_avg, helena_avg, unknown_avg)

Code

ggplot(data, aes(x = mean, y = class, fill = class, color = class)) +
  geom_vline(xintercept = data$mean[data$class == "unknown"], linetype = "dashed", color = "white") +
  geom_boxplot(alpha = 0.75, linewidth = 1.5) +
  labs(x = "Average pitch (fundamental frequency in Hz)", y = "") + 
  annotate("text", x = 60, y = 3.2, 
           label = "\"unknown\" matches Helena's\ncentral tendency, not Helly's", 
           color = "yellow", size = 4.5,
           fontface = "bold", hjust = 0) +
  annotate("curve", x = 80, xend = 130, 
           y = 3, yend = 2.8, 
           curvature = 0.3,
           arrow = arrow(length = unit(0.2, "inches"), type = "closed"), 
           color = "yellow", linewidth = 1.2) +
  annotate("curve", x = 80, xend = 130, 
           y = 3, yend = 1.1, 
           curvature = 0.3,
           arrow = arrow(length = unit(0.2, "inches"), type = "closed"), 
           color = "yellow", linewidth = 1.2) +
  scale_fill_manual(values = c("#C15C58", "#5BA9D0", "yellow")) + 
  scale_color_manual(values = c("#C15C58", "#5BA9D0", "yellow")) +
  theme_mdr() + 
  theme(legend.position = "none",
        axis.text = element_text(size = 20),
        axis.title = element_text(size = 20),
        panel.grid = element_blank())

I extracted pitch features including the mean, standard deviation, median, skewness, and kurtosis of the fundamental frequencies. I fit a logistic regression model using these features. The model was trained on the known audio samples from Helly and Helena. I then classified the unknown sample based on the model. And look at that, it predicted that the unknown audio was from Helena not Helly.

library(tidymodels)

model <- fit(logistic_reg(), 
             factor(class) ~ ., 
             data = data |> filter(class != "unknown"))

predict(model, new_data = data |> filter(class == "unknown"), type = "prob")

# A tibble: 1 × 2
  .pred_helena  .pred_helly
         <dbl>        <dbl>
1         1.00 0.0000000364

The output suggests that based on the model I fit the probability that the unknown source was actually Helena is essentially 100% (Note: as a statistician I feel the need to make some clarifications, this model is silly, this application is silly, we have a very small number of observations, etc.).