Introduction

Note: This project contains spoilers for DDLC

Hello, I'm nopynospy and I love Doki Doki Literature Club. I stumbled across the Monika After Story mod for the game, whose fans are adding new dialogue for the character Monika. Guidelines are provided for dialogue contributors, but I feel I can make them more specific by drawing insights from the original scripts with some basic natural language processing. Luckily, someone has extracted all the game files in this repo. As for my contribution to the repo, this is the fork.

Someone else may want to use this for mods that focus on other characters, so I may as well run the analysis on all the characters.

Data Preprocessing and Engineering

In this section, I will extract the scripts from the game files and do some data cleaning. I will also generate some new data.
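As a rough sketch of the extraction step, assume the scripts are Ren'Py files in which a dialogue line is a short speaker tag, an optional expression code, and then the quoted text. The tag-to-name mapping, the sample format and the extract_dialogue helper below are illustrative assumptions, not taken from the game files, so adjust them to whatever the extracted scripts actually contain.

```python
import re

# Assumed mapping from speaker tags to character names (hypothetical).
SPEAKERS = {"s": "Sayori", "m": "Monika", "n": "Natsuki", "y": "Yuri", "mc": "MC"}

# Matches lines shaped like:  m 1a "Some dialogue"
# group 1 = speaker tag, group 2 = the quoted dialogue text.
DIALOGUE_RE = re.compile(r'^\s*(\w+)(?:\s+[\w ]+?)?\s+"((?:[^"\\]|\\.)*)"\s*$')

# Ren'Py text tags such as {i}...{/i} are stripped during cleaning.
TEXT_TAG_RE = re.compile(r"\{[^}]*\}")


def extract_dialogue(script_text):
    """Return a list of {"character", "line"} dicts for known speakers."""
    rows = []
    for raw in script_text.splitlines():
        match = DIALOGUE_RE.match(raw)
        if match is None:
            continue  # not a "tag + quoted text" line
        tag, text = match.groups()
        if tag not in SPEAKERS:
            continue  # narration, commands, or an unknown speaker
        text = TEXT_TAG_RE.sub("", text).strip()
        if text:
            rows.append({"character": SPEAKERS[tag], "line": text})
    return rows
```

Lines that are not spoken by one of the mapped characters (narration, show or play statements, Python assignments) are simply skipped, which keeps the resulting table to one row per spoken line.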

Data Exploration and Visualization

In this section, I use plots to explore the data for more insights.

Monika speaks the most, due to her monologues. The MC ranks second, as he does not die. Sayori is the first to die, so she has the fewest lines.

Sayori and Monika have the highest polarity; Sayori scores the highest of all despite her depression. Yuri and Natsuki score the lowest, probably because they argue a lot. The MC is, well, dense, so it makes sense that he is in the middle.

Monika has the highest subjectivity, probably because of her constant monologues in the climax of the plot.

Monika has the highest number of words, unsurprisingly. Sayori has the lowest; is that related to her depression, though? Yuri is very quiet, but even she has more words per line than both Sayori and the dense MC.

All of the characters have high lexical diversity, but Monika actually has the lowest.
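For reference, here is a minimal sketch of how words per line and lexical diversity could be computed. It assumes lexical diversity means the type-token ratio (unique words divided by total words) and uses a very simple tokenizer; the original analysis may define either differently, and the function names are my own.

```python
import re


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def line_stats(lines):
    """Average words per line and type-token ratio for a list of lines."""
    tokens = [tok for line in lines for tok in tokenize(line)]
    return {
        "words_per_line": len(tokens) / len(lines) if lines else 0.0,
        "lexical_diversity": len(set(tokens)) / len(tokens) if tokens else 0.0,
    }
```

One caveat under this definition: the ratio tends to fall as the amount of text grows, so the character with the most lines may score lowest partly for that reason.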

Sayori's most frequent words

Sayori is the MC's childhood friend, after all, and has depression, so the MC (player) naturally appears a lot in her lines. Interestingly, Yuri and Natsuki both appear exactly 12 times. Sayori also has more exclamations than questions.

MC's most frequent words

The MC and Sayori are childhood friends, so despite her having the shortest screen time, Sayori appears the most in the MC's lines. She is followed by Yuri, Natsuki and Monika. The MC asks a lot of questions too.

Yuri's most frequent words

After the MC, Natsuki is mentioned the most by Yuri, probably because they argue a lot. Sayori is not even among her 100 most frequent words; perhaps they are not close. Yuri asks more than she exclaims. She also uses ellipses almost five times more often than she asks, likely due to her shy nature.

Natsuki's most frequent words

Natsuki is very similar to Yuri in some ways, despite the two being polar opposites. After the MC, Yuri is mentioned the most by Natsuki, probably because they argue a lot. Sayori is not even among her 100 most frequent words; perhaps they are not close. Natsuki exclaims more than she asks. Unlike Yuri, Natsuki is quite blunt, and she does not use as many ellipses as the others.

Monika's most frequent words

Surprisingly, Monika is the only girl who does not mention the MC the most. Instead, she mentions Sayori the most. She asks almost twice as often as she exclaims.
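The counts discussed in the per-character sections above (most frequent words, plus questions, exclamations and ellipses) could be gathered along these lines. This is only a sketch: the tokenizer, the optional stopword set and the frequency_profile name are assumptions of mine, not the exact procedure behind the numbers quoted here.

```python
import re
from collections import Counter


def frequency_profile(lines, top_n=100, stopwords=frozenset()):
    """Return (most common words, punctuation counts) for a list of lines."""
    words = Counter()
    marks = {"question": 0, "exclamation": 0, "ellipsis": 0}
    for line in lines:
        tokens = re.findall(r"[a-z']+", line.lower())
        words.update(tok for tok in tokens if tok not in stopwords)
        marks["question"] += line.count("?")
        marks["exclamation"] += line.count("!")
        # Three or more dots in a row, or the single ellipsis character.
        marks["ellipsis"] += len(re.findall(r"\.{3,}|\u2026", line))
    return words.most_common(top_n), marks
```

Checking whether a name is among the top 100 words then comes down to looking it up in the first returned list.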

Data Modelling

In this section, I will use clustering on each character to get the number of categories in their lines. I will also compare the clusters.
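The notes in the subsections below pick the number of clusters at the point where a score jumps the most from one cluster count to the next. A small sketch of that rule of thumb follows, assuming some quality score per candidate count where higher is better (a silhouette score would be one option; which metric the plots actually show is not stated here). The function name and the example scores are made up purely for illustration.

```python
def pick_k_by_largest_gain(scores_by_k):
    """Return the k whose score improves most over the previous k.

    scores_by_k maps a cluster count to a quality score (higher is better).
    Returns None when fewer than two cluster counts are given.
    """
    ks = sorted(scores_by_k)
    best_k, best_gain = None, float("-inf")
    for prev_k, k in zip(ks, ks[1:]):
        gain = scores_by_k[k] - scores_by_k[prev_k]
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k


# Illustrative scores only, not results from this project.
example_scores = {2: 0.10, 3: 0.12, 4: 0.13, 5: 0.21, 6: 0.22}
chosen_k = pick_k_by_largest_gain(example_scores)
```

In practice I still eyeball the plot as well, since a large jump at a very high cluster count may not be worth the extra clusters.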

Sayori's clusters

Probably 5, 8 is too high

5, because the increment from 4 to 5 is a lot

The number of lines in each cluster is not large, but she has the shortest screen time. She was also later revealed to be depressed, despite her usual cheerful manner. Perhaps this contrast is why she has quite a high number of clusters.

Cluster 2 has the lowest word count, polarity and subjectivity, and its words include sorry and the like. Cluster 3 has the highest polarity, with words such as happy and like.
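Comparisons like the one above rest on per-cluster averages. Here is a hedged sketch of how such a summary could be built, assuming each line is a dict that already carries a cluster label plus word count, polarity and subjectivity; the key names and the summarize_clusters helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def summarize_clusters(rows, metrics=("n_words", "polarity", "subjectivity")):
    """Average each metric per cluster and count the lines in each cluster."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["cluster"]].append(row)

    summary = {}
    for cluster, members in sorted(grouped.items()):
        summary[cluster] = {m: mean(r[m] for r in members) for m in metrics}
        summary[cluster]["n_lines"] = len(members)
    return summary
```

The cluster with the lowest average polarity, for example, is then min(summary, key=lambda c: summary[c]["polarity"]).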

MC's clusters

Probably 4, as 7 and 9 are too many

Yes, 4 it is, since the increase is a lot from 3 to 4

Clusters 0 and 1 have the lowest polarity and the lowest word counts, and their words are mainly fillers such as maybe, sure and mean. Clusters 2 and 3 consist of mentions of other characters, mostly in a positive manner.

Yuri's clusters

3? 6 is probably too many.

Yup, 3, otherwise the increase is diminishing

Cluster 1 has the lowest polarity, subjectivity and number of words; it contains mostly fillers such as mean, thanks and yeah. Cluster 0 has a higher number of words, polarity and subjectivity than cluster 2. Interestingly, cluster 2 has more mentions of other characters, perhaps because she is the most introverted?

Natsuki's clusters

4? This is quite vague

4 looks ok

Cluster 0 has a high number of words but the lowest polarity. It has a lot of non-third-person-singular present-tense verbs, such as know, like and think. These are probably lines where she is talking about herself. This is the opposite of what is found for Yuri.

Monika's clusters

This is very vague, maybe 4?

Probably 3, since the increase is a lot

Cluster 1 has the lowest number of words, polarity and subjectivity. It consists of fillers such as well and really, and also apologies such as sorry. Cluster 0 has the second-highest number of words and the most lines. As the club president, she mentions other members more often in this cluster. Cluster 2 is mostly her opinions, likely about poems from members, as seen in its high number of words, subjectivity and polarity.

Saving the dataframes as CSVs
