Making a graph based on an html table can be useful in many situations where you need to give your reader an easy way to perceive more complex data.
This is a step by step guide on how to make a graph based on an html table. We will use google sheets to scrape a table from atpworldtour.com. The table contains an professional tennisplayers ranking history. The google sheets spreadsheet will be used a as a datasource for the online tool datawrapper which will be used to make the graph.
It is not terribly difficult, but there are a number of things that can go wrong along the way. This article is based on trial and error with former students and aims to guide you through thte graph creation.
First step: Scrape a html table with Google sheets
Before we start we need to set the language for your google account to english. Previous workshops with norwegian language students showed that the scraping function does not work with norwegian languar, and possibly other languages. I reccomend setting the account language to english for this workshop. In order to change this you have to go to the settings of your google account, seach for language and set it to english. You need to to this before you create a new spreadsheet.
It is important to set the language to english before you even creat the spreadsheet, or the scraping might not work.
Once your language is set to english you can start google sheets and create a new blank spreadsheet.
https://docs.google.com/spreadsheets/u/0/
You don’t have to scrape a table to get data into a graph. Smaller tables with only 10-30 rows you can highlight and then copy paste in to a spreadsheet. If you do this remember to use paste special (Shift + ctrl + V on Windows. Cmd + shift + V on Mac?) to avoid getting formatting into your data table.
Before we evens start we need to find a table to import. Here is a link Roger Federer’s ATP world tour page which has his ranking history and lots of other nformation. Federer’s table is very long though and since we finally have a good norwegian tennis player (this is very rare!) I recommen Casper Ruud this time. You can use any player you like, or even experiment with other tables from other sports but that might not work with the rest of this tutorial.
Once you have found an ATP player whos ranking history you want to visualize click ‘ranking history’ to navigate to the table we will scrape.
Scraping the data
Copy the url to this page from your browsers adress field. The url to Casper Ruud’s ranking history looks like this:
https://www.atptour.com/en/players/casper-ruud/rh16/rankings-history
We will use this URL in google sheets to scrape the table.
Click on cell A1 in the google spreadsheet. We are about to write the function that will scarpe the data. The function is called importHTML and it can import two types of html elements, lists and tables. We will be importing a table. Here is a function that will scrape the contents of the first html table on the page about Casper Ruud. You can copy paste the function below straight into cell A1 in your spreadsheet.
=importHTML("https://www.atptour.com/en/players/casper-ruud/rh16/rankings-history", "table", 1)
In the spreadsheet you can now see the contents of the table we scraped. But is this the data we want?
Age 22 (1998.12.22) | Turned Pro 2015 | Weight 170lbs(77kg) | Height 6’0″(183cm) |
Birthplace Oslo, Norway | Residence Oslo, Norway | Plays Right-Handed, Two-Handed Backhand | Coach Christian Ruud |
Before we tried the function I wrote that this function will import the first html table on the page. The rankings history is the 2nd html table on the page sp we need to alter our function a bit. Take a goog look at the importHTML function above. You can see that the function has its own name, importHTML, and that there is some more stuff happening between the brackets. Between these brackets are three parameters separated by commas. The three parameters are the url to the page we will scrape the html element from, what type of html element we will scrape (table), and finally a number that represents what html table on that page we want. We want table number 2 on the atp page so change the 1 to a 2.
This should cause the spreadsheet to scrape the table containing the rankings history.
Refining the data
This table now shows the entire rankings history but we still need to work with the data to get a nice graph. The first problem with our spreadsheet is that we can not edit the contents. That is because the spreadsheet is loading the data from an external source and all the cells showing data can not be edited. We can fix this by creating a new blank google spreadsheet. We will now copy paste the contents of the cells in the first spreadsheet in a way that converts it to data we can edit.
Go to your spreadsheet that has the importHTML function. Click anywhere in the spreadsheet and use the shortkey Win: Ctrl + A or Mac: Cmd + A. This should highlight all cells containing data. Now select copy (Win: Ctrl + C or Mac: Cmd + C) to copy the data to your clipboard.
Go to the new blank spreadsheet and click in cell A1. Now we ant to paste the data in without formatting that can be included in the table. In order to do this we use a command called Paste special > Paste values only (Win: Ctrl + Shift + V Mac: Cmd + Shift + V?)
This should paste the data without formatting such as fonts and colours.
We can now edit the contents of this spreadseet. In order to make my graph I don’t want all the data in the spreadsheet and there are also some numerical values that have a ‘T’ next to the whichmay cause the graph to fail. There is also a ranking column for ‘Doubles’ which I am not interested in. Right click the top of a column and choose ‘Delete column’ to remove an entire column.
The next problem is some ‘T’ characters that are next to some of the ranking numbers. I want to get rid of all these T’s. We will use the search and replace command to do this. Search replace is a comon command in code editors and spread sheets and in most applications it has the shortkey Ctrl + H (Win) or probably Cmd + H (Mac).
Click on the column label above the column to highlight all data in that column, then select Edit > search and replace (Win: Ctrl + H or Mac: Cmd + H?). This should open a dialog box that says ‘Find and replace’. Type a uppercase ‘T’ in the ‘find’ input box and leave the ‘replace with’ input box blank. This should remove all ‘T’ characters from this column and replace it with nothing. Before doing this step always ensure that you won’t remove characters that are necessary.
The last step we will do in refining the data is to simply remove some of it. When we made this graph in class I chose start the graph at ranking position 200 about. You can choose whatever you want to highlight a period in the players history, or you can choose to show it all.
The rest of the work is done in datawrapper but that is not yet covered in this article. Datawrapper is fairly easy to use so I am sure you can figure it out. If you are using your refined google sheet as a direct data source for datawrapper you first need to enable sharing for the google sheets document.
This is the iframe for a graph made with datawrapper: