HappyDB is a corpus of 100,000 crowd-sourced happy moments collected via Amazon’s Mechanical Turk. You can read more about it at https://arxiv.org/abs/1801.07746.
Here, we explore this data set and try to answer the question, “What makes people happy?”
### Step 0 - Load all the required libraries
library(tidyverse)
library(tidytext)
library(DT)
library(scales)
library(wordcloud2)
library(gridExtra)
library(ngram)
library(shiny)
library(wordcloud)
library(ggplot2)
library(tm)
library(SentimentAnalysis)
We use the processed data for our analysis and combine it with the demographic information available.
hm_data <- read_csv("/Users/wpj/Downloads/processed_moments.csv")
urlfile<-'https://raw.githubusercontent.com/rit-public/HappyDB/master/happydb/data/demographic.csv'
demo_data <- read_csv(urlfile)
# hm_data
# demo_data
We select a subset of the data that satisfies specific row conditions.
hm_data <- hm_data %>%
inner_join(demo_data, by = "wid") %>%
select(wid,
original_hm,
gender,
marital,
parenthood,
reflection_period,
age,
country,
ground_truth_category,
text) %>%
# count words on the joined rows, not on the pre-join hm_data
mutate(count = sapply(text, wordcount)) %>%
filter(gender %in% c("m", "f")) %>%
filter(marital %in% c("single", "married")) %>%
filter(parenthood %in% c("n", "y")) %>%
filter(reflection_period %in% c("24h", "3m")) %>%
mutate(reflection_period = fct_recode(reflection_period,
months_3 = "3m", hours_24 = "24h"))
# head(hm_data)
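The join-and-filter step above can be tried on a toy example. This is only a minimal sketch: the two small tables and their values are made up for illustration and merely mimic the shape of the HappyDB files.

```r
library(dplyr)
library(forcats)

# Hypothetical stand-ins for the moments and demographic tables
moments <- tibble(
  wid = c(1, 1, 2, 3),
  text = c("ate pizza", "saw friend", "got job", "walked dog"),
  reflection_period = c("24h", "3m", "24h", "3m")
)
workers <- tibble(
  wid = c(1, 2, 3),
  gender = c("f", "m", "o"),
  marital = c("single", "married", "single")
)

joined <- moments %>%
  inner_join(workers, by = "wid") %>%   # attach worker columns to every moment
  filter(gender %in% c("m", "f")) %>%   # keeps workers 1 and 2 only
  mutate(reflection_period = fct_recode(reflection_period,
                                        months_3 = "3m", hours_24 = "24h"))

joined
```

Each moment keeps its own row after the join, so a worker with several moments appears several times.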
We begin our analysis by comparing unmarried and married females in terms of the words they use to describe their happy moments. This analysis involves the creation of word clouds to identify the most frequent words and Word2Vec to find the most related words.
# Splitting data into single and married females
single_females <- hm_data[hm_data$gender == 'f' & hm_data$marital == 'single', ]
married_females <- hm_data[hm_data$gender == 'f' & hm_data$marital == 'married', ]
# Break text into bag of words for single females
bag_of_words_single <- single_females %>%
unnest_tokens(word, text)
word_count_single <- bag_of_words_single %>%
count(word, sort = TRUE)
# word_count_single
# is a dataframe with two columns: "word" containing a word, "n" containing the word's frequency in desc order
# Break text into bag of words for married females
bag_of_words_married <- married_females %>%
unnest_tokens(word, text)
word_count_married <- bag_of_words_married %>%
count(word, sort = TRUE)
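To make the bag-of-words step concrete, here is a small sketch on three made-up sentences (the `toy` table is hypothetical, not HappyDB data). It shows what `unnest_tokens()` followed by `count()` returns.

```r
library(dplyr)
library(tidytext)

# Hypothetical cleaned moments, one per row
toy <- tibble(text = c("ate pizza friend", "friend visited", "pizza friend"))

toy_counts <- toy %>%
  unnest_tokens(word, text) %>%   # one row per word
  count(word, sort = TRUE)        # frequency table, most frequent first

toy_counts
```

The result has one row per distinct word and a column `n` with its frequency, which is the shape the word cloud calls expect.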
We generate a word cloud from the 200 most frequent words of each group.
word_count_single_head <- head(word_count_single, 200)
wordcloud(words = word_count_single_head$word, freq = word_count_single_head$n, scale=c(2, 0.5), min.freq = 1, colors=brewer.pal(8, "Dark2"))
word_count_married_head <- head(word_count_married, 200)
wordcloud(words = word_count_married_head$word, freq = word_count_married_head$n, scale=c(2, 0.5), min.freq = 1, colors=brewer.pal(8, "Dark2"))
We can get a broad sense of which words are frequent for single or married females. But since many words overlap, it’s hard to tell whether there are meaningful differences.
Therefore, for a closer look, we keep only the top words that appear in one group’s top 200 but not the other’s, and create two word clouds: one with words unique to single women, and the other with words unique to married women.
# anti_join() returns a set difference for single females
unique_to_single <- anti_join(word_count_single_head, word_count_married_head, by = "word")
# wordcloud2(unique_to_single, size = 1, color = "random-dark")
# wordcloud2() isn't knitting on my computer, so I inserted the image directly.
knitr::include_graphics("/Users/wpj/Desktop/boyfriend.png")
The most prominent words include “boyfriend,” “uncle,” “coworker,” and “partner,” suggesting that close relationships play a significant role in the happiness of unmarried women. Additionally, words like “interview” and “ready” hint at active and challenging experiences contributing to their happiness.
unique_to_married <- anti_join(word_count_married_head, word_count_single_head, by = "word")
# wordcloud2(unique_to_married, size = 1, color = "random-dark")
knitr::include_graphics("/Users/wpj/Desktop/husband.png")
Similarly,
we create a word cloud for married women to identify words that are
distinctive to them. The most common words include “husband,”
“children,” and “child,” indicating that for married women, happiness is
often derived from their immediate family. Words like “garden,”
“flowers,” and “planted” suggest a sense of home and domestic happiness,
while “sitting” and “stay” imply a more relaxed and settled
lifestyle.
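The set-difference step behind both clouds can be illustrated on two tiny frequency tables. The words and counts below are invented for the sketch.

```r
library(dplyr)

# Hypothetical top-word tables for two groups
top_a <- tibble(word = c("friend", "boyfriend", "dinner"), n = c(10, 7, 5))
top_b <- tibble(word = c("friend", "husband", "dinner"), n = c(12, 9, 4))

# Rows of the first table whose word does not occur in the second
only_a <- anti_join(top_a, top_b, by = "word")
only_b <- anti_join(top_b, top_a, by = "word")

only_a
only_b
```

Because each table is cut off at a fixed number of top words, a word can show up as "unique" simply because it falls just below the cutoff in the other group, so the resulting clouds are best read as a rough contrast.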
Analysis: It is interesting to note that the top words for single women suggest that they are more active and challenged than married women. This could be due to the fact that single women may have more time and energy to pursue their own interests and goals. They may also be more likely to be in new and exciting situations, which can lead to happiness.
On the other hand, the top words for married women suggest that they are more focused on their families and their home life. This could be due to the fact that married women often have more responsibilities, such as caring for children and managing a household. However, it is also possible that married women simply find more happiness in the simple things in life, such as spending time with loved ones and creating a home.
We use word2vec to find the words that single women use in contexts most similar to “happiness”.
library(word2vec)
model_single <- word2vec(x = single_females$text, type = "cbow", dim = 15, iter = 20)
embedding_single <- as.matrix(model_single)
embedding_single <- predict(model_single, c("happy"), type = "embedding")
lookslike_single <- predict(model_single, c("happiness"), type = "nearest", top_n = 50)
We train the same model on the happy moments of married women and again look up the words nearest to “happiness”.
library(word2vec)
model_married <- word2vec(x = married_females$text, type = "cbow", dim = 15, iter = 20)
embedding_married <- as.matrix(model_married)
embedding_married <- predict(model_married, c("happy"), type = "embedding")
lookslike_married <- predict(model_married, c("happiness"), type = "nearest", top_n = 50)
word_count_single_vec_head <- head(lookslike_single, 100)
word_count_married_vec_head <- head(lookslike_married, 100)
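Conceptually, a "nearest words" lookup ranks the other words by how close their vectors are to the target word's vector. The sketch below illustrates the idea with cosine similarity on a hand-made two-dimensional embedding; the vectors are invented, and the similarity measure used internally by the word2vec package is not restated here, so treat cosine as an assumption.

```r
# Hypothetical 2-d embeddings, one row per word
emb <- rbind(
  happiness = c(1.0, 0.0),
  joy       = c(0.9, 0.1),
  family    = c(0.5, 0.5),
  tax       = c(-1.0, 0.2)
)

# Cosine similarity between two numeric vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

target <- emb["happiness", ]
others <- emb[rownames(emb) != "happiness", , drop = FALSE]

# Similarity of every other word to the target, highest first
sims <- apply(others, 1, cosine, b = target)
nearest <- sort(sims, decreasing = TRUE)

nearest
```

With only 15 dimensions and a modest corpus per group, the trained neighbours can vary between runs, so the lists are best treated as exploratory.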
Then we perform similar set difference operations, to see which words are unique to single females, and which are unique to married females.
unique_to_single_vec <- anti_join(word_count_single_vec_head$happiness, word_count_married_vec_head$happiness, by = "term2")
# rank 1 is the closest neighbour, so invert the rank to draw closer words larger
wordcloud(words = unique_to_single_vec$term2, freq = max(unique_to_single_vec$rank) - unique_to_single_vec$rank + 1, scale=c(2, 0.5), min.freq = 1, colors=brewer.pal(8, "Dark2"))
unique_to_married_vec <- anti_join(word_count_married_vec_head$happiness, word_count_single_vec_head$happiness, by = "term2")
wordcloud(words = unique_to_married_vec$term2, freq = max(unique_to_married_vec$rank) - unique_to_married_vec$rank + 1, scale=c(2, 0.5), min.freq = 1, colors=brewer.pal(8, "Dark2"))
To gain further insights, we compare unmarried and married women based on their ground truth categories. These categories classify happy moments into specific happiness types, such as “achievement,” “enjoying the moment,” and “bonding.” We keep the rows for single females that have a non-missing ground truth category and present the counts in a bar chart. This identifies the predominant sources of happiness among single women.
filtered_data_single <- single_females %>%
filter(!is.na(ground_truth_category))
word_freq <- filtered_data_single %>%
count(ground_truth_category, sort = TRUE)
word_freq <- word_freq %>%
arrange(desc(n))
ggplot(word_freq, aes(x = reorder(ground_truth_category, -n), y = n)) +
geom_bar(stat = "identity", fill = "grey") +
labs(x = "Ground Truth Category", y = "Count") +
coord_flip() + # Horizontal bars
theme_minimal()
We do the same for married women.
filtered_data_married <- married_females %>%
filter(!is.na(ground_truth_category))
word_freq <- filtered_data_married %>%
count(ground_truth_category, sort = TRUE)
word_freq <- word_freq %>%
arrange(desc(n))
ggplot(word_freq, aes(x = reorder(ground_truth_category, -n), y = n)) +
geom_bar(stat = "identity", fill = "grey") +
labs(x = "Ground Truth Category", y = "Count") +
coord_flip() + # Horizontal bars
theme_minimal()
Analysis: our visualizations indicate that both single and married women derive happiness mainly from “affection” and “achievement.” However, single women cite achievement almost as often as affection, while married women cite achievement only about half as often as affection. Notably, single women rank “bonding” as their third most common source of happiness, highlighting the importance of social connections for this demographic. In contrast, married women appear to prioritize “enjoying the moment,” emphasizing the value of being fully engaged in the present.
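Since the two groups differ in size, comparing shares within each group can be easier than comparing raw counts. The following is a minimal sketch on invented labels (the `toy` table is hypothetical); it computes each category's share per marital status.

```r
library(dplyr)

# Hypothetical labelled moments: five per group
toy <- tibble(
  marital = c(rep("single", 5), rep("married", 5)),
  ground_truth_category = c(
    "affection", "affection", "achievement", "achievement", "bonding",
    "affection", "affection", "affection", "affection", "achievement"
  )
)

shares <- toy %>%
  count(marital, ground_truth_category) %>%
  group_by(marital) %>%
  mutate(share = n / sum(n)) %>%   # proportion within each marital group
  ungroup()

shares
```

In this toy data the achievement-to-affection ratio is 1 for the single group and 0.25 for the married group, which is the kind of contrast the analysis above describes.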
Next, we compare the mean text length of happy moments written by single and married women.
# count characters for every happy moment original text entry
text_lengths_single <- nchar(single_females$original_hm)
mean_text_length_single <- mean(text_lengths_single)
text_lengths_married <- nchar(married_females$original_hm)
mean_text_length_married <- mean(text_lengths_married)
# draw a bar chart to compare the two means
mean_text_lengths <- data.frame(
Category = c("Single Females", "Married Females"),
Mean_Text_Length = c(mean_text_length_single, mean_text_length_married)
)
ggplot(mean_text_lengths, aes(x = Category, y = Mean_Text_Length, fill = Category)) +
geom_bar(stat = "identity") +
labs(x = NULL, y = "Mean Text Length") +
theme_minimal()
Analysis: this reveals that married females tend to write
slightly longer descriptions of their happy moments compared to single
women. However, this difference is relatively small.
This could be because married women have more people to share
their happiness with, or it could be that they simply have more time to
reflect on their happy moments and write them down.
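A bar chart of two means does not show whether the gap is larger than what chance alone would produce. One way to check is a two-sample test on the per-moment lengths. The sketch below uses simulated lengths (the Poisson means of 90 and 95 are arbitrary assumptions, not HappyDB values) and Welch's t-test from base R.

```r
set.seed(1)

# Simulated character counts for two groups (hypothetical values)
len_single <- rpois(200, lambda = 90)
len_married <- rpois(200, lambda = 95)

# Welch two-sample t-test: does the mean length differ between groups?
res <- t.test(len_married, len_single)

res$estimate   # the two group means
res$p.value    # small values suggest a difference beyond chance
```

Real text lengths tend to be right-skewed, so a rank-based alternative such as `wilcox.test()` may be a reasonable cross-check.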
We use a similar strategy: filter out top words that are unique to single and married men, and create wordclouds.
# Select subsets of single males and married males
single_males <- hm_data[hm_data$gender == 'm' & hm_data$marital == 'single', ]
married_males <- hm_data[hm_data$gender == 'm' & hm_data$marital == 'married', ]
bag_of_words_single_male <- single_males %>%
unnest_tokens(word, text)
word_count_single_male <- bag_of_words_single_male %>%
count(word, sort = TRUE)
bag_of_words_married_male <- married_males %>%
unnest_tokens(word, text)
word_count_married_male <- bag_of_words_married_male %>%
count(word, sort = TRUE)
word_count_single_head_male <- head(word_count_single_male, 200)
word_count_married_head_male <- head(word_count_married_male, 200)
unique_to_single_male <- anti_join(word_count_single_head_male, word_count_married_head_male, by = "word")
# unique_to_single_male
# wordcloud2(unique_to_single_male, size = 0.3, color = "random-dark")
knitr::include_graphics("/Users/wpj/Desktop/pizza.png")
The prominent words in this cloud suggest that single men may have a more materialistic and easy-to-satisfy outlook: “pizza,” “drink,” “dollars,” and “computer” dominate the top unique words. Words like “gym,” “graduation,” and “goal” convey a sense of vitality and ambition. Unlike single women, whose cloud is dominated by a large “boyfriend,” single men have no “girlfriend” among their top unique words, which hints at a difference in relationship dynamics. The words are also more evenly weighted, as they appear in similar sizes in the word cloud.
unique_to_married_male <- anti_join(word_count_married_head_male, word_count_single_head_male, by = "word")
# unique_to_married_male
# wordcloud2(unique_to_married_male, size = 1, color = "random-dark")
knitr::include_graphics("/Users/wpj/Desktop/wife.png")
In the word
cloud for married men, we observe the prominent presence of words like
“wife,” “daughter,” and “kids,” underscoring the significance of family
in a married man’s life. Words such as “baby,” “child,” “marriage,”
“share,” and “bed” further reinforce the importance of a stable family
environment. Additionally, words like “temple” and “smile” suggest a
sense of contentment associated with married life.
Analysis: It is interesting to note that the top words for single men suggest that they are more materialistic and easier to satisfy than single women. It is possible that single men are more likely to find happiness in simple pleasures, such as eating pizza and drinking beer.
Overall, our findings suggest that marital status is associated with sources of happiness for men in much the same way as for women. Both single and married women and men find happiness in their close relationships, but married women and men tend to place a greater emphasis on their families.
We split the female data into 4 groups: single & USA, single & IND, married & USA, and married & IND, and conduct a ground_truth_category analysis on the 4 subsets. USA and IND were chosen because both countries have a large amount of data in the data set, and they serve here as examples of a Western and an Eastern country.
# Splitting data into USA and India females
usa_single_females <- hm_data[hm_data$country == 'USA' & hm_data$gender == 'f' & hm_data$marital == 'single', ]
ind_single_females <- hm_data[hm_data$country == 'IND' & hm_data$gender == 'f' & hm_data$marital == 'single', ]
usa_married_females <- hm_data[hm_data$country == 'USA' & hm_data$gender == 'f' & hm_data$marital == 'married', ]
ind_married_females <- hm_data[hm_data$country == 'IND' & hm_data$gender == 'f' & hm_data$marital == 'married', ]
data_subsets <- list(
usa_single_females,
ind_single_females,
usa_married_females,
ind_married_females
)
subset_labels <- c("USA Single Females", "IND Single Females", "USA Married Females", "IND Married Females")
plots <- list()
# Create plots for each subset, omitting NA values
for (i in 1:length(data_subsets)) {
filtered_data <- na.omit(data_subsets[[i]])
p <- ggplot(data = filtered_data, aes(x = ground_truth_category)) +
geom_bar(aes(fill = ground_truth_category)) +
labs(title = subset_labels[i], x = "Ground Truth Category", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
plots[[i]] <- p
}
library(gridExtra)
grid.arrange(grobs = plots, ncol = 2)
The panel for married Indian females is empty, so we cannot examine whether women in both the USA and India differ in their sources of happiness before and after marriage. However, the graph suggests that single Indian women obtain happiness in a pattern similar to married USA women: both groups prioritize affection over achievement and report “enjoy_the_moment” more often than “bonding.”
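An empty panel can come from a group that really has no rows or from a subsetting condition that can never be true, so it helps to count rows per group before plotting. This is a small sketch on an invented table (the `toy` rows are hypothetical).

```r
library(dplyr)

# Hypothetical rows with the same column names as hm_data
toy <- tibble(
  country = c("USA", "USA", "IND", "IND", "USA"),
  marital = c("single", "married", "single", "married", "married"),
  gender  = rep("f", 5)
)

# Rows available in each country-by-marital-status cell
group_sizes <- toy %>%
  filter(country %in% c("USA", "IND"), gender == "f") %>%
  count(country, marital)

group_sizes
```

If a cell is missing from this table or has only a handful of rows, conclusions drawn from the corresponding panel deserve extra caution.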
Summary: Our exploratory data analysis suggests that 1. Single and married women differ in their sources of happiness. 2. Men show similar differences between the single and married groups. 3. Single Indian women show sources of happiness comparable to those of married U.S. women. The analysis is very general, and every individual in every demographic acquires happiness in their own unique way.