The One with all the Data

Friends Lover meets Data Science

I love Data Science and the TV show Friends! This is my data project focused on learning more about the TV show through the scripts - I hope that you enjoy reading this post!!

The date: September 22nd, 1994. The scene: A coffee house in New York named “Central Perk”. The people: 6 people in their mid-twenties, all at different points of their lives, facing the roller coaster that is adulthood. Welcome to Friends - a show that many would come to love over its crazy ten seasons.

Friends has an ensemble cast of six main characters: Rachel Green, Monica Geller, Phoebe Buffay, Joey Tribbiani, Chandler Bing, and Ross Geller. Although it's riddled with issues, at its core, Friends is about the relationships among its six main characters as they navigate chaotic romances and career troubles, rewarded and burned in equal measure along those crazy paths. These relationships define friendship, as their loyalty and affection for one another always pull them closer together.

I have watched the show many times, but recently, I thought of putting it under a microscope and questioning what the breakdown of the show looks like. Who spoke the most? What are the general attitudes of the friends? These were some of the questions that crossed my mind in pursuit of learning more about my favorite TV show - but instead of coming at it as a teenager lazing across the couch, I came at it as a data analyst, digging into the scripts of the show to see what takeaways I could get from them.

Step 1: Finding The Data

The first step was to find scripts of each episode online to collect the data from. I eventually came across a website that hosted every episode's script. I looked through a few of the scripts and took note of the common HTML tags used and the pattern that every link followed, so that looping through them when parsing would be easier.
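Since every link followed the same pattern, generating the full list of episode URLs is just a nested loop. The sketch below shows the idea; the real site's address and URL format aren't reproduced here, so `BASE` is a stand-in, and only the first two seasons (24 episodes each) are listed.

```python
# Stand-in URL pattern - the real site's address and format differ.
BASE = "https://example.com/friends/{season:02d}{episode:02d}.html"

# Episodes per season; seasons 1 and 2 each had 24.
EPISODES = {1: 24, 2: 24}

def episode_urls(seasons):
    """Yield one URL per (season, episode) pair, following the site's uniform pattern."""
    for season, n_eps in seasons.items():
        for ep in range(1, n_eps + 1):
            yield BASE.format(season=season, episode=ep)

urls = list(episode_urls(EPISODES))
print(urls[0])  # first episode of season 1
```

Each URL can then be fetched and handed to the parser in a single loop.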

Step 2: Parsing the Data

Parsing the data was definitely the hardest part of this project: the scripts were transcribed by different people, so there was no single standard to parse against. All of the code for this project was written in Python - I tried Java initially, but Python was simply better suited to the requirements. To parse the pages, I used BeautifulSoup, a Python library designed for parsing HTML directly from a website. The process was roughly:

- Loop through all the pages and collect the elements with the 'p' tag, since that was (generally) how each line was marked.
- Keep only the lines that start with a friend's name (Joey, Monica, etc.) or a nickname (Pheebs for Phoebe, Rach for Rachel, etc.).
- Given a line in the form "name: line", split it on the colon, then remove all brackets, parentheses, newlines, tabs, and carriage returns, so that stage directions and stray whitespace wouldn't inflate the counts.
- Count the words in the cleaned line and write four fields to a new file, divided by pipes: the season and episode (written as four digits), the name of the character who spoke, the line they said, and the number of words in that line.

There were some files that sadly couldn't be parsed, which I will explain later, but the majority parsed successfully, and data was collected from them.
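The steps above can be sketched as a small function. This is a minimal version, not the exact code from the project: the nickname map covers only a few names, and the real scripts have messier markup than the sample string used here.

```python
import re
from bs4 import BeautifulSoup

# Partial name/nickname map for illustration - the real code covers all
# six friends and more nickname variants.
NAMES = {"monica": "Monica", "pheebs": "Phoebe", "phoebe": "Phoebe",
         "rach": "Rachel", "rachel": "Rachel", "joey": "Joey"}

def parse_script(html, season_episode):
    """Return pipe-delimited records: season-episode|name|line|word count."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for p in soup.find_all("p"):          # each spoken line is (generally) a <p>
        text = p.get_text()
        if ":" not in text:
            continue
        name, _, line = text.partition(":")
        name = name.strip().lower()
        if name not in NAMES:             # skip lines not spoken by a friend
            continue
        # Drop stage directions in [brackets]/(parentheses), then collapse
        # newlines, tabs, carriage returns, and extra spaces.
        line = re.sub(r"\[.*?\]|\(.*?\)", "", line)
        line = re.sub(r"\s+", " ", line).strip()
        records.append(f"{season_episode}|{NAMES[name]}|{line}|{len(line.split())}")
    return records

sample = "<p>Monica: There's nothing to tell! (pause) He's just some guy.</p>"
print(parse_script(sample, "0101"))
```

Note that the stage direction "(pause)" is stripped out before the word count is taken, so it never skews the numbers.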

Step 3: Reading the Data and Getting Information

To get the data, I first loaded the pandas library in order to use the DataFrame object to store everything. The parsed file had four pipe-delimited columns: season/episode, character name, line spoken, and number of words in that line. I looped through every line of the data file, split on the pipes again, and kept two running tallies per character: the number of words spoken (summing the counts calculated earlier) and the number of lines spoken (adding one every time a character's name appeared). In addition, I added checks to account for the nicknames that were still present (this could have been done at the parsing stage, but it also works here). At the end, I collected the word and line counts into a dictionary and converted that into a clean, usable DataFrame.
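A condensed sketch of that aggregation step, assuming the pipe-delimited records from the parsing stage (the nickname map here is deliberately abbreviated, and the sample records are made up for illustration):

```python
import pandas as pd
from collections import defaultdict

# Abbreviated nickname map - the real code checks many more variants.
CANON = {"pheebs": "Phoebe", "rach": "Rachel"}

def build_counts(records):
    """Tally lines and words per character from 'se|name|line|words' records."""
    words = defaultdict(int)
    lines = defaultdict(int)
    for rec in records:
        _, name, _, n = rec.split("|")
        name = CANON.get(name.lower(), name)  # fold nicknames into canonical names
        lines[name] += 1                       # one more line spoken
        words[name] += int(n)                  # add the pre-computed word count
    return pd.DataFrame({"lines": dict(lines), "words": dict(words)})

data = ["0101|Monica|There's nothing to tell!|4",
        "0101|Pheebs|Wait, does he eat chalk?|5",
        "0101|Monica|Okay, everybody relax.|3"]
df = build_counts(data)
print(df)
```

Handling the nickname folding here means "Pheebs" and "Phoebe" end up in the same row of the final DataFrame.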

This was the initial process of gathering data - for other script related stuff, I would use a similar process, but would change specific variables in order to gather new data.

Step 4: Making Data Visualizations

Finally, it was time to visualize. I copied the DataFrame out and did some research on different charting tools, such as Tableau, before opting for Google Sheets, due to my familiarity with it and its simplicity. I plugged in the data and created charts with all of the information, which can be seen below.

Conclusion and Final Thoughts

On the whole, this project was extremely demanding but rewarding, as I was able to successfully create these visualizations. It mainly came down to parsing: pulling the words out directly, watching for odd syntax such as stray newline and carriage-return characters, and splitting everything apart to build a dataset at the very end. Although it was hard, it was definitely worth it.

Some trivia from the data:

Because a ready-made dataset was not used, there are some discrepancies in the data, caused by bugs that couldn't be easily solved within the context of the problem. The two main ones were:

As a result, the numbers are slightly under the true values of how many lines and words were spoken, but in theory, the percentages should stay the same, as that is reflective of every episode.

Overall, this was an extremely fun project, and it greatly improved my parsing skills - I hope that you enjoy it!