Project 1: Happy Moments EDA

HappyDB is a corpus of 100,000 crowd-sourced happy moments via Amazon’s Mechanical Turk. You can read more about it on https://arxiv.org/abs/1801.07746.

Here, we explore this data set and try to answer the question, “What makes people happy?” ### Step 0 - Load all the required libraries

library(tidyverse)
library(tidytext)
library(DT)
library(scales)
library(wordcloud2)
library(gridExtra)
library(ngram)
library(shiny) 
library(wordcloud)
library(wordcloud2)
library(ggplot2)
library(tm)
library(SentimentAnalysis)

Step 1 - Load the processed text data along with demographic information on contributors

We use the processed data for our analysis and combine it with the demographic information available.

hm_data <- read_csv("/Users/wpj/Downloads/processed_moments.csv")

urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'
demo_data <- read_csv(urlfile)

# hm_data

# demo_data

Combine both the data sets and keep the required columns for analysis

We select a subset of the data that satisfies specific row conditions.

hm_data <- hm_data %>%
  inner_join(demo_data, by = "wid") %>%
  select(wid,
         original_hm,
         gender, 
         marital, 
         parenthood,
         reflection_period,
         age, 
         country, 
         ground_truth_category, 
         text) %>%
  mutate(count = sapply(hm_data$text, wordcount)) %>%
  filter(gender %in% c("m", "f")) %>%
  filter(marital %in% c("single", "married")) %>%
  filter(parenthood %in% c("n", "y")) %>%
  filter(reflection_period %in% c("24h", "3m")) %>%
  mutate(reflection_period = fct_recode(reflection_period, 
                                        months_3 = "3m", hours_24 = "24h"))

# head(hm_data)

Topic: Does marital status affect the way women get happiness?

Points to Explore:

How do the word clouds and word2vec vectors of unmarried and married women differ?
How do unmarried and married women differ in terms of ground truth category and text length?
How does marital status affect men’s happiness, and is it similar to the effect it has on women’s happiness?
What other factors, such as geographic location and culture, influence the relationship between marital status and happiness in women?

1. Word Analysis: Unmarried vs. Married Women

1.1 Word Clouds for Unmarried vs. Married Females

We begin our analysis by comparing unmarried and married females in terms of the words they use to describe their happy moments. This analysis involves the creation of word clouds to identify the most frequent words and Word2Vec to find the most related words.

# Splitting data into single and married females
single_females <- hm_data[hm_data$gender == 'f' & hm_data$marital == 'single', ]
married_females <- hm_data[hm_data$gender == 'f' & hm_data$marital == 'married', ]

# Break text into bag of words for single females
bag_of_words_single <-  single_females %>%
  unnest_tokens(word, text)

word_count_single <- bag_of_words_single %>%
  count(word, sort = TRUE)

# word_count_single
# is a dataframe with two columns: "word" containing a word, "n" containing the word's frequency in desc order

# Break text into bag of words for married females
bag_of_words_married <-  married_females %>%
  unnest_tokens(word, text)

word_count_married <- bag_of_words_married %>%
  count(word, sort = TRUE)

Wordcloud generated for single and married bag of words.

word_count_single_head <- head(word_count_single, 200)
wordcloud(words = word_count_single_head$word, freq = word_count_single_head$n, scale=c(2, 0.5), min.freq = 1, colors=brewer.pal(8, "Dark2"))

word_count_married_head <- head(word_count_married, 200)
wordcloud(words = word_count_married_head$word, freq = word_count_married_head$n, scale=c(2, 0.5), min.freq = 1, colors=brewer.pal(8, "Dark2"))

We can get a broad sense of what frequent for single or married females. But since there are a lot of overlapping words, it’s hard to tell if there are significant differences.

Therefore, for a closer look, we filter out top words that are unique to single women and create two wordclouds: one is unique to single females, and the other unique to married women.

# anti_join() returns a set difference for single females
unique_to_single <- anti_join(word_count_single_head, word_count_married_head, by = "word")
# wordcloud2(unique_to_single, size = 1, color = "random-dark")

# worldcloud2() isn't knitting for my computer, so I inserted the image directly.
knitr::include_graphics("/Users/wpj/Desktop/boyfriend.png")

The most prominent words include “boyfriend,” “uncle,” “coworker,” and “partner,” suggesting that close relationships play a significant role in the happiness of unmarried women. Additionally, words like “interview” and “ready” hint at active and challenging experiences contributing to their happiness.

unique_to_married <- anti_join(word_count_married_head, word_count_single_head, by = "word")
# wordcloud2(unique_to_married, size = 1, color = "random-dark")
knitr::include_graphics("/Users/wpj/Desktop/husband.png")

Similarly, we create a word cloud for married women to identify words that are distinctive to them. The most common words include “husband,” “children,” and “child,” indicating that for married women, happiness is often derived from their immediate family. Words like “garden,” “flowers,” and “planted” suggest a sense of home and domestic happiness, while “sitting” and “stay” imply a more relaxed and settled lifestyle.

Analysis: It is interesting to note that the top words for single women suggest that they are more active and challenged than married women. This could be due to the fact that single women may have more time and energy to pursue their own interests and goals. They may also be more likely to be in new and exciting situations, which can lead to happiness.

On the other hand, the top words for married women suggest that they are more focused on their families and their home life. This could be due to the fact that married women often have more responsibilities, such as caring for children and managing a household. However, it is also possible that married women simply find more happiness in the simple things in life, such as spending time with loved ones and creating a home.

1.2 Word2vec for Unmarried vs. Married Females

We use word2vec to see what are words, to single and married women, are similar to “happiness”.

library(word2vec)

model_single <- word2vec(x = single_females$text, type = "cbow", dim = 15, iter = 20)
embedding_single <- as.matrix(model_single)
embedding_single <- predict(model_single, c("happy"), type = "embedding")
lookslike_single <- predict(model_single, c("happiness"), type = "nearest", top_n = 50)

We use word2vec to take a look at what are words that sound similar to happiness for both single and married women.

library(word2vec)

model_married <- word2vec(x = married_females$text, type = "cbow", dim = 15, iter = 20)
embedding_married <- as.matrix(model_married)
embedding_married <- predict(model_married, c("happy"), type = "embedding")
lookslike_married <- predict(model_married, c("happiness"), type = "nearest", top_n = 50)

word_count_single_vec_head <- head(lookslike_single, 100)
word_count_married_vec_head <- head(lookslike_married, 100)

Then we perform similar set difference operations, to see which words are unique to single females, and which are unique to married females.

unique_to_single_vec <- anti_join(word_count_single_vec_head$happiness, word_count_married_vec_head$happiness, by = "term2")

wordcloud(words = unique_to_single_vec$term2, freq = unique_to_single_vec$rank, scale=c(2, 0.5), min.freq = 1, colors=brewer.pal(8, "Dark2"))

unique_to_married_vec <- anti_join(word_count_married_vec_head$happiness, word_count_single_vec_head$happiness, by = "term2")
wordcloud(words = unique_to_married_vec$term2, freq = unique_to_married_vec$rank, scale=c(2, 0.5), min.freq = 1, colors=brewer.pal(8, "Dark2"))

2. Ground Truth Categories: Unmarried vs. Married Women

To gain further insights, we compare unmarried and married women based on their classified ground truth categories. These categories classify happy moments into specific happiness types, such as “achievement,” “enjoying the moment,” and “bonding.” We filter out rows corresponding to single females with valid values in the ground truth category and present our findings through a box plot. This identifies the predominant sources of happiness among single women.

filtered_data_single <- single_females %>%
  filter(!is.na(ground_truth_category))

word_freq <- filtered_data_single %>%
  count(ground_truth_category, sort = TRUE)

word_freq <- word_freq %>%
  arrange(desc(n))

ggplot(word_freq, aes(x = reorder(ground_truth_category, -n), y = n)) +
  geom_bar(stat = "identity", fill = "grey") +
  labs(x = "Word", y = "Frequency") +
  coord_flip() +  # Horizontal bars
  theme_minimal()

We do the same for married women.

filtered_data_married <- married_females %>%
  filter(!is.na(ground_truth_category))

word_freq <- filtered_data_married %>%
  count(ground_truth_category, sort = TRUE)

word_freq <- word_freq %>%
  arrange(desc(n))

ggplot(word_freq, aes(x = reorder(ground_truth_category, -n), y = n)) +
  geom_bar(stat = "identity", fill = "grey") +
  labs(x = "Word", y = "Frequency") +
  coord_flip() +  # Horizontal bars
  theme_minimal()

Analysis: our visualizations indicate that both single and married women derive happiness from “affection” and “achievement.” However, single women obtain happiness from achievement almost as much as affection, while married women obtain happiness as half as affection. Notably, single women rank “bonding” as their third most common source of happiness, highlighting the importance of social connections for this demographic. In contrast, married women appear to prioritize “enjoying the moment,” emphasizing the value of being fully engaged in the present.

3. Text Length Analysis: Single vs. Married Women

Next, we examine whether there is a significant difference in the text length of happy moments between single and married women.

# count characters for every happy moment original text entry
text_lengths_single <- nchar(single_females$original_hm)
mean_text_length_single <- mean(text_lengths_single)

text_lengths_married <- nchar(married_females$original_hm)
mean_text_length_married <- mean(text_lengths_married)

# draw a box plot to compare
mean_text_lengths <- data.frame(
  Category = c("Single Females", "Married Females"),
  Mean_Text_Length = c(mean_text_length_single, mean_text_length_married)
)
ggplot(mean_text_lengths, aes(x = Category, y = Mean_Text_Length, fill = Category)) +
  geom_bar(stat = "identity") +
  labs(x = NULL, y = "Mean Text Length") +
  theme_minimal()

Analysis: this reveals that married females tend to write slightly longer descriptions of their happy moments compared to single women. However, this difference is relatively small. This could be because married women have more people to share their happiness with, or it could be that they simply have more time to reflect on their happy moments and write them down.

4. Does marriage affect the way men get happiness, in a similar way?

We use a similar strategy: filter out top words that are unique to single and married men, and create wordclouds.

# Filter out subsets of unmarried females and married males
single_males <- hm_data[hm_data$gender == 'm' & hm_data$marital == 'single', ]
married_males <- hm_data[hm_data$gender == 'm' & hm_data$marital == 'married', ]

bag_of_words_single_male <-  single_males %>%
  unnest_tokens(word, text)

word_count_single_male <- bag_of_words_single_male %>%
  count(word, sort = TRUE)

bag_of_words_married_male <-  married_males %>%
  unnest_tokens(word, text)

word_count_married_male <- bag_of_words_married_male %>%
  count(word, sort = TRUE)

word_count_single_head_male <- head(word_count_single_male, 200)
word_count_married_head_male <- head(word_count_married_male, 200)

unique_to_single_male <- anti_join(word_count_single_head_male, word_count_married_head_male, by = "word")
# unique_to_single_male
# wordcloud2(unique_to_single_male, size = 0.3, color = "random-dark")
knitr::include_graphics("/Users/wpj/Desktop/pizza.png")

Single men seem to be more materialistic and easy to satisfy, as we see “pizza”, “drink”, “dollars”, and “computer” dominating the top unique words. Words such as “gym”, “graduation”, and “goal” give a sense of liveliness and strive. In the top words, there’s no presence of “girlfriend”, which is a stark contrast to the huge “boyfriend” in single women wordcloud. Words are also more equally valued, as we can see words in the wordcloud are similar in size.

Notably, the prominent words in this cloud suggest that single men may exhibit a more materialistic and easy-to-satisfy outlook. Words such as “pizza,” “drink,” “dollars,” and “computer” dominate the vocabulary. Additionally, words like “gym,” “graduation,” and “goal” convey a sense of vitality and ambition. Interestingly, unlike single women, there is no significant mention of “girlfriend” in the word cloud, indicating a difference in relationship dynamics.

unique_to_married_male <- anti_join(word_count_married_head_male, word_count_single_head_male, by = "word")
# unique_to_married_male
# wordcloud2(unique_to_married_male, size = 1, color = "random-dark")
knitr::include_graphics("/Users/wpj/Desktop/wife.png")

In the word cloud for married men, we observe the prominent presence of words like “wife,” “daughter,” and “kids,” underscoring the significance of family in a married man’s life. Words such as “baby,” “child,” “marriage,” “share,” and “bed” further reinforce the importance of a stable family environment. Additionally, words like “temple” and “smile” suggest a sense of contentment associated with married life.

Analysis: It is interesting to note that the top words for single men suggest that they are more materialistic and easy to satisfy than single women. It is possible that single men are more likely to enjoy simple pleasures, or obtain happiness from simple pleasures, such as eating pizza and drinking beer.

Overall, our findings suggest that marriage does affect the way men get happiness in a similar way to women. Both single and married women and men find happiness in their close relationships, but married women and men tend to place a greater emphasis on their families.

5. Does geography and culture affect how women obtain happiness, in general?

We split data into 4 groups: single & USA, single & IND, married & USA, married & IND, and conduct a ground_truth_category analysis on the 4 subsets. USA and IND were chosen because both countries have a significant amount of data in the data set, and are representative of western and eastern countries.

# Splitting data into USA and India females
usa_single_females <- hm_data[hm_data$country == 'USA' & hm_data$gender == 'f' & hm_data$marital == 'single', ]
ind_single_females <- hm_data[hm_data$country == 'IND' & hm_data$gender == 'f' & hm_data$marital == 'single', ]
usa_married_females <- hm_data[hm_data$country == 'USA' & hm_data$gender == 'f' & hm_data$marital == 'married', ]
ind_married_females <- hm_data[hm_data$gender == 'IND' & hm_data$gender == 'f' & hm_data$marital == 'married', ]

data_subsets <- list(
  usa_single_females,
  ind_single_females,
  usa_married_females,
  ind_married_females
)
subset_labels <- c("USA Single Females", "IND Single Females", "USA Married Females", "IND Married Females")
plots <- list()

# Create plots for each subset, omitting NA values
for (i in 1:length(data_subsets)) {
  filtered_data <- na.omit(data_subsets[[i]])
  
  p <- ggplot(data = filtered_data, aes(x = ground_truth_category)) +
    geom_bar(aes(fill = ground_truth_category)) +
    labs(title = subset_labels[i], x = "Ground Truth Category", y = "Count") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  plots[[i]] <- p
}
library(gridExtra)
grid.arrange(grobs = plots, ncol = 2)

Since there is no data on married Indian females, it would be hard for us to examine whether both USA and IND women, in general, change their way of happiness pre or post marriage. However, the graph suggests that single Indian women obtain happiness in a pattern similar to married USA women: both groups prioritize affection over achievement and seem to enjoy_the_moment more than bonding.

Summary: Our exploratory data analysis suggests that 1. Women change their way of obtaining happiness after marriage. 2. Men exhibit similar changes in their sources of happiness after marriage, similar to women. 3. Indian single women show comparable ways of obtaining happiness to married U.S. women. The analysis is very general and every individual in every demographic acquire happiness in their own unique ways.