Structured data data from a semi structured PDF

This article sprung out of a tutorial about getting data into a scatterplot.

Movies with the highest number of Academy awards is a good example of data with a time variable that does not have continuous datapoints. For example, Ben-Hur only won 11 Oscars in 1959 but obviously not before or after, so the Movie only ever has one value for time, as opposed to polls about political parties where parties have several values spread across a time range. That means a scatterplot with a time axis is a suitable visualization for this data, if we can find it..

Googling resulted in a few results:

  • a structured dataset from Statista which cost money, or you could punch the numbers from the chart they provide which only has the highest range of the data, the movies with 8 or more Oscars.
  • The official Oscar awards database, where a semi structured PDF is available for free. Choose ‘Film facts- 5 or more competetive awards’ and download the PDF is you want to follow every step of this.

The data source is in PDF format and semi structured and is a good example of a data source that can be turned in to structured data with some simple steps. This article will guide you through a process to convert the data into something the scatterplot component can display.

Open the PDF in a PDF reader, highlight all the text and copy paste it into a blank document in your code editor. You should see exactly the same list as in the PDF, but in plain text. This text is well suited for a quick manual conversion to CSV baecause each line is structured in the same way with commas separating datafields on each line:

Ben-Hur, Metro-Goldwyn-Mayer, 1959 (12 nominations)
Titanic, 20th Century Fox and Paramount, 1997 (14 nominations)
The Lord of the Rings: The Return of the King, New Line, 2003 (11 nominations)

so a header for this data could look like this:

title, productionCompany, yearAndNominations

I’m not really interested in nominations, but we might as well keep it.

First we need to remove all lines that do not follow the structure above. The list in the PDF document is sorted after number of wins so even though we have to punch this data it will be quick and easy. Here are the lines I removed from the plain text document in my code editor:

© Academy of Motion Picture Arts and Sciences. Document may not be republished without permission. 1.2 - 1
FILM FACTS
5 OR MORE COMPETITIVE AWARDS
[Updated thru 93rd Awards (4/21)]
11 AWARDS
10 AWARDS
9 AWARDS
8 AWARDS
7 AWARDS
6 AWARDS
1.2 - 2
Film Facts - Five or More Awards
© Academy of Motion Picture Arts and Sciences. Document may not be republished without permission.
5 AWARDS

Now we have a file with only lines containing the datafields we want so a few rows of the data looks like this:

A Man for All Seasons, Columbia, 1966 (8 nominations)
The Godfather Part II, Paramount, 1974 (11 nominations)
Star Wars, 20th Century-Fox, 1977 (10 nominations) (plus 1 Special Achievement Oscar)

Because of the mentioned comma separating the datafields we can easily make a CSV file. First we have to define the separator used in the file so our spreadsheet can separate the data in columns. First make sure to remove every blank line in the document. Then place the text cursor in the first position on the first line and press Enter to make a new line. On this new line write sep=, like in the example below.

sep=,
Ben-Hur, Metro-Goldwyn-Mayer, 1959 (12 nominations)
Titanic, 20th Century Fox and Paramount, 1997 (14 nominations)

Go to File > Save as and save it as a .CSV file. Now you can open the .CSV file in as a spreadsheet. We will use Google sheets for the rest of this article. Below is a link to a dataset you can use, but you will need ot make your own copy so the dataset in the following link remains as it is:

Get your own dataset

https://docs.google.com/spreadsheets/d/1wpWb_-VJMG_cfBVAWLveJnTaiX0klznsGhcUILj8quA/edit?usp=sharing

Go File > Make a copy

You now have your own dataset to mess around with. The top rows of data look like this in the spreadsheet:

variableName
Ben-HurMetro-Goldwyn-Mayer1959 (12 nominations)
Titanic20th Century Fox and Paramount1997 (14 nominations)
The Lord of the Rings: The Return of the KingNew Line2003 (11 nominations)
West Side StoryUnited Artists1961 (11 nominations)

Splitting datafields

In this dataset the title (mapped to variableName) is of course important and so is the year. The production company might be interesting but I am not interested in nominations right now, just wins. From the third column we want a clean year number we can convert to a date object in the scatterplot component. That means the third column needs a little work so it only has the year number and not nominations in brackets. In order to achieve install an add-on to Google sheets called ‘Power tools’. The Power tools add on lets us slice up text strings in cells. Go to:

Add ons > Get add-ons

Search for ‘power tools’ and install.

Power tools should be visible on the right side of google docs. Highlight column C (click on the C) which has year and nominations, then click on the split function with the scissor icon in Power tools. You will then get some more options, choose ‘Split text’.

Use ‘Split values by characters’ and then choose custom and insert a normal bracket as a separator since we want to split the cell right after the year number, and in every one of these cells the year number is followed by a (. You can also leave Space as a separator resulting in a few more columns, one of which is nominations as a number. This might still be of the string data type though, so that is something to be aware of later. The operations we just did should result in a table looking like this, plus som extra mess in the far right columns that I will leave unused. The column with ‘nominations)’ will be deleted but is shown here just to show what the dataset looks like after the split.

variableName
Ben-HurMetro-Goldwyn-Mayer195912nominations)
Titanic20th Century Fox and Paramount199714nominations)
The Lord of the Rings: The Return of the KingNew Line200311nominations)
West Side StoryUnited Artists196111nominations)

Again, this only shows the top rows for the purpose of showing the structure. What we need now is a complete header row, and we can also add a descriptive header above the one we need in the scatterplot. Remember that the scatterplot is made to be independent from the data and therefore it doesn not have any variable names that describeto the data. Instead the variable names describe the axis the data will be drawn along in the scatterplot component such as Yval, xVal, areaVal and colorVal. Therefore it is practical to have a descriptive header row in our spreadsheets so we can remap columns of data to different axis in the scatterplot and then the descriptive header lets us make sense of the data while working with it. Later we will have to make sense of it for a general audience inside the scatterplot, but that’s a different matter.

First let’s add the most important metric, the number of acamdemy awards per movie. This is simple, we just need to compare our spreadsheet with the PDF it is based on and fill in a column with a lot of repeating numbers. Both should have the same order and they are sorted so the movies with the most wins are on top and it is a simple task of punching in numbers in a column. I disovered two lines where the data was a bit messed up after copy pasting the PDF text into plain text, Mad Max fury road and the Aviator. I cleaned this up manually.

The data now looks like this (only top rows):

titleproduction companyyearnominationsawards
variableNameprodCompanyxValnominationsyVal
Ben-HurMetro-Goldwyn-Mayer19591211
Titanic20th Century Fox and Paramount19971411
The Lord of the Rings: The Return of the KingNew Line20031111

The top row describes the data and should not be included when copy pasting the table into Mr. Data Converter. That is also why it is marked red in the google sheets. You can see the cleaned dataset with the added awards column here. You can also try copy pasting straight from this dataset. Do not include the red top row.

https://shancarter.github.io/mr-data-converter/

Choose JSON properties as output in the converter, copy all the converted json data into a new document in your code editor. It should look like this (top rows only as usual).

[{"variableName":"Ben-Hur","prodCompany":"Metro-Goldwyn-Mayer","xVal":1959,"nominations":12,"yVal":11},
{"variableName":"Titanic","prodCompany":"20th Century Fox and Paramount","xVal":1997,"nominations":14,"yVal":11},
{"variableName":"The Lord of the Rings: The Return of the King","prodCompany":"New Line","xVal":2003,"nominations":11,"yVal":11},
{"variableName":"West Side Story","prodCompany":"United Artists","xVal":1961,"nominations":11,"yVal":10},

This data should work in the scatterplot if a timeScale is activated on the scatterplots x-axis and the year number is interpreted as a Date object. Let’s try this data in the scatterplot.