Scatterplot XYSC

XYSC stands for X-position, Y-position, Scale and Colour.

This scatterplot can show 2-4 numerical variables via x-position, y-position, bubble scale (area) and colour, as well as a text string for each data point. It must have an x-position (xVal), a y-position (yVal) and a variableName (descriptive text string), while the data columns for areaVal and colorVal are optional. The input has been simplified and works well for small to medium-sized datasets, up to maybe a couple of hundred lines. Experiment and have fun!

This example shows data on 3 of the 4 possible scales: x and y positions and scaled bubbles. The colour scale is not used here. This scatterplot also highlights one of the data points on start.

Scatterplot with time scale on X axis

This shows the movies with the most Academy Award wins, with year numbers drawn on the x-axis. It is also possible to use more precise dates. Contact me for a custom solution.

Both components are available from the link below.

Download Scatterplot V1

The first thing you should check in this file is the function loading the data at the very end of the script. In the first file provided (which you may have downloaded) it was set to load the datafile from the media library on my blog, so the load function looked like this:

loadJsonData('https://gauteheggen.com/wp-content/uploads/2021/10/employeesIncomeIndustry.txt');

This won’t work unless the component showing the data is on gauteheggen.com. Also notice that the file extension is .txt and not .json. This is because WordPress rejects .json files. The .txt file is still formatted as json and will work as a json file inside the component. It is just a simple fix to get the file into the media library.

So when you work with the file locally, loadJsonData should have a parameter pointing at a local file. You can simply reactivate the line above (remove the //) so it says:

 loadJsonData("json/employeesIncomeIndustry.txt");

When working with a local file it can be either .txt or .json. The advantage of having a .json file extension is that Visual Studio Code will colour the content to make it more readable (property names, text strings and numbers will have different colours).

The data format

For the data to work in the component, the input must follow a simple structure. It is a table with a minimum of 3 columns, with variable names as headings. The following table has one more column than the minimum. In this dataset you can delete the areaVal column and get a scatterplot with same-sized bubbles. The number of lines is not important. The following example data is the source of the scatterplot above, except that only a few rows are included.

industry (SIC2007) | Number of employees 2021K2 | Number of jobs (employments) 2021K2 | Average monthly earnings (NOK) 2021K2
variableName | areaVal | xVal | yVal
01-03 Agriculture, forestry and fishing | 31217 | 36358 | 43480
05-09 Mining and quarrying | 61044 | 61423 | 78210
10-33 Manufacture | 210321 | 215008 | 51590

For datasets with up to a few hundred lines (maybe more) I recommend a spreadsheet such as Google Sheets or Microsoft Excel. We will use a Google sheet in this example.

In the table above the top row has the original headers from SSB. You have to add the row below it in your own dataset. While exploring the dataset it is a good idea to keep the original headers in the spreadsheet, but they should not be part of the data that is converted to a json file later on. So when you copy cells out of your spreadsheet make sure NOT to highlight the top row. Start highlighting at the variableName cell and include everything in the rows below.

The reason I recommend this is that you can then easily re-map a column to a different visual property in the scatterplot while remembering the meaning of the numbers in that column. You can swap any columns by changing the variable names, so if you want Number of employees to go on the x-axis you change its variable name from areaVal to xVal (and give the column that used to be xVal another name). You can play around with this as much as you want. You can for example have datasets with more columns than we can display in the scatterplot, and then by simply renaming the headers you can quickly look at different parts of the data visually.
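The same header swap can also be done on rows that have already been converted to objects. The following is a minimal sketch, not part of the component: remapColumns is a hypothetical helper name, and the single row is taken from the example dataset in this article.

```javascript
// Hypothetical helper (not part of the component): returns a copy of the rows
// with property names swapped according to `mapping`.
function remapColumns(rows, mapping) {
  return rows.map(function (row) {
    var out = {};
    Object.keys(row).forEach(function (key) {
      // Use the new name if the mapping has one, otherwise keep the old name.
      var newKey = Object.prototype.hasOwnProperty.call(mapping, key) ? mapping[key] : key;
      out[newKey] = row[key];
    });
    return out;
  });
}

// One row from the example dataset in this article.
var rows = [
  { variableName: "10-33 Manufacture", areaVal: 210321, xVal: 215008, yVal: 51590 }
];

// Put "Number of employees" on the x-axis and "Number of jobs" on the bubble area.
var swapped = remapColumns(rows, { areaVal: "xVal", xVal: "areaVal" });
console.log(swapped[0]);
```

The original rows are left untouched, so you can try several mappings of the same data without re-exporting from the spreadsheet.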

Update 2023:

Here is another dataset from kaggle.com that has data for some Spotify tracks that were popular in 2017. This is a more fun dataset to practice with than the previous one. The methods are the same:

https://www.kaggle.com/code/alankarmahajan/exploring-spotify-dataset

This is the previous dataset and it is still fine:

The dataset is here.

This dataset is not editable, by the way. That is to ensure it is in OK form when the next student comes along. You should make a copy of it so you can mess around and change properties. In Google Sheets go to:

File > Make a copy

Mess around with that one. Change xVal and yVal to get started. Once you want to try it in the component, highlight everything except the top descriptive row. Choose Copy (Ctrl + C) and then, in Mr. Data Converter, simply paste into the top text box where it says input CSV or tab delimited. Table data copied from a spreadsheet also works, so just paste it in.

Here is the link to Mr. Data Converter

Choose JSON properties as your output format. The converted data should look like this, unless you have altered it:

[{"variableName":"01-03 Agriculture, forestry and fishing","areaVal":31217,"xVal":36358,"yVal":43480},
{"variableName":"05-09 Mining and quarrying","areaVal":61044,"xVal":61423,"yVal":78210},
{"variableName":"10-33 Manufacture","areaVal":210321,"xVal":215008,"yVal":51590},
{"variableName":"35-39 Electricity, water supply, sewerage, waste management","areaVal":33418,"xVal":34575,"yVal":57510},
{"variableName":"41-43 Construction","areaVal":230754,"xVal":238849,"yVal":47960},
{"variableName":"45-47 Wholesale and retail trade: repair of motor vehicles and motorcycles","areaVal":341649,"xVal":368960,"yVal":47080},
{"variableName":"49-53 Transportation and storage","areaVal":121188,"xVal":129920,"yVal":49310},
{"variableName":"55-56 Accommodation and food service activities","areaVal":65928,"xVal":74409,"yVal":34490},
{"variableName":"58-63 Information and communication","areaVal":101097,"xVal":104456,"yVal":66320},
{"variableName":"64-66 Financial and insurance activities","areaVal":47952,"xVal":48534,"yVal":79430},
{"variableName":"68-75 Real estate, professional, scientific and technical activities","areaVal":162695,"xVal":174635,"yVal":63320},
{"variableName":"77-82 Administrative and support service activities","areaVal":135735,"xVal":148670,"yVal":42800},
{"variableName":"84 Public adm., defence, soc. security","areaVal":174298,"xVal":188924,"yVal":52580},
{"variableName":"85 Education","areaVal":224967,"xVal":253500,"yVal":46860},
{"variableName":"86-88 Human health and social work activities","areaVal":558936,"xVal":651089,"yVal":45620},
{"variableName":"90-99 Other service activities","areaVal":90038,"xVal":111104,"yVal":44890}]
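If you are curious what the conversion step does, it can be sketched in a few lines of JavaScript. This is not Mr. Data Converter's code, just a minimal sketch under stated assumptions: tsvToRows is a hypothetical helper, the pasted text is tab-delimited with the variable names on the first line, and cells that look like plain numbers are turned into numbers.

```javascript
// Hypothetical helper: turns tab-delimited text (first line = headers)
// into an array of objects, one object per data line.
function tsvToRows(text) {
  var lines = text.trim().split(/\r?\n/);
  var headers = lines[0].split("\t");
  return lines.slice(1).map(function (line) {
    var cells = line.split("\t");
    var row = {};
    headers.forEach(function (header, i) {
      var cell = cells[i];
      // Assumption: cells that parse as a number become numbers, everything else stays text.
      row[header] = (cell !== "" && !isNaN(Number(cell))) ? Number(cell) : cell;
    });
    return row;
  });
}

// Two lines from the example dataset, as they would be copied from the spreadsheet.
var pasted =
  "variableName\tareaVal\txVal\tyVal\n" +
  "01-03 Agriculture, forestry and fishing\t31217\t36358\t43480\n" +
  "05-09 Mining and quarrying\t61044\t61423\t78210";

var rows = tsvToRows(pasted);
console.log(JSON.stringify(rows));
```

Note that the numbers end up without quotation marks, just like in the converter output above. That matters, because the scales need numbers and not text.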

Copy all the output json data from the converter, or just grab the data in the window above, and then paste it into a new file in Visual Studio Code. Save the data as a .json file in the json folder that came along with the project. This data should now work with the component, but you first have to point the loadJsonData function at it, at about line 440-450. You will see several commented-out loadData and loadJsonData function calls here. Comment out the ones you don’t use and make sure you use the loadJsonData function and not loadData at this stage. The function calls look like this:

 //loadData("csv/municipalititesEnglish.csv");
  loadJsonData("json/employeesIncomeIndustry.json");
  //loadJsonData("json/municipalitiesArea.json");

The active one (without //) is the one that loads the data from this article.
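If the component shows nothing after you swap in your own file, the data is a likely suspect. Here is a minimal sketch of a check you could run on the rows first. findRowProblems is a hypothetical helper and not part of the component, and the second row is deliberately broken for the sake of the example.

```javascript
// Hypothetical helper: lists problems in rows meant for the scatterplot.
// The required properties are variableName, xVal and yVal.
function findRowProblems(rows) {
  var problems = [];
  rows.forEach(function (row, i) {
    if (typeof row.variableName !== "string" || row.variableName === "") {
      problems.push("Row " + i + ": variableName is missing or not text");
    }
    ["xVal", "yVal"].forEach(function (key) {
      if (typeof row[key] !== "number" || isNaN(row[key])) {
        problems.push("Row " + i + ": " + key + " is missing or not a number");
      }
    });
  });
  return problems;
}

var rows = [
  // A correct row from the example dataset.
  { variableName: "85 Education", areaVal: 224967, xVal: 253500, yVal: 46860 },
  // Deliberately broken for illustration: yVal is text with a space in it.
  { variableName: "05-09 Mining and quarrying", areaVal: 61044, xVal: 61423, yVal: "78 210" }
];

var problems = findRowProblems(rows);
console.log(problems);
```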

Here are some properties in the code that control certain aspects you might want to change:

Bubble colour

Around line 198 you will find the colorScale. This is a cool thing that can output data values as colour values in a range. The colours in the range are determined by the three colour names you see on line 201. It looks like this:

.range(['darkblue', 'darkmagenta', 'fuchsia']);

If your scatterplot does not use the range, the first color name (darkblue in this example) will be the color of all the bubbles.

If you do use the range, the first color name represents the low values in your data range, while the last one (‘fuchsia’ in this example) represents the high values.
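To get a feel for what a three-colour range means, here is a minimal sketch in plain JavaScript. It is not how the component or d3 is implemented. threeColour is a hypothetical helper, and it assumes the middle colour sits exactly halfway between the lowest and highest data value.

```javascript
// RGB values of the three CSS colour names used in the range above.
var stops = [
  [0, 0, 139],    // darkblue
  [139, 0, 139],  // darkmagenta
  [255, 0, 255]   // fuchsia
];

// Hypothetical helper: maps a value between min and max to a colour
// by blending between neighbouring colour stops.
function threeColour(value, min, max) {
  var t = (value - min) / (max - min);   // 0 at min, 1 at max
  t = Math.max(0, Math.min(1, t));       // keep values outside the range at the end colours
  var scaled = t * (stops.length - 1);   // 0..2 for three stops
  var i = Math.min(Math.floor(scaled), stops.length - 2); // index of the lower stop
  var f = scaled - i;                    // how far we are towards the next stop
  var rgb = stops[i].map(function (c, k) {
    return Math.round(c + (stops[i + 1][k] - c) * f);
  });
  return "rgb(" + rgb.join(", ") + ")";
}

console.log(threeColour(0, 0, 100));   // lowest value: darkblue
console.log(threeColour(50, 0, 100));  // middle value: darkmagenta
console.log(threeColour(100, 0, 100)); // highest value: fuchsia
```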

Around line 218 you can change the hover colour, stroke (outline) and stroke width. Here is what it looks like for the example at the top of the page:

            d3.select(this).attr("stroke",  "black");
            d3.select(this).attr("fill",  "orange");
            d3.select(this).attr("stroke-width",  "4");

Change it to whatever you like.

Alter the tooltip

The tooltip is important to help your reader understand the data. You need to know your own dataset so you can either label the data or construct sentences containing bits of data. Here is what the tooltip code looks like for the component above. It has been formatted to be more intuitive to adjust:

NOTE! Depending on what version of the component you are using, the code below might look different, but the principle is the same: we should construct sentences around our data that improve our readers’ understanding.

.html("<b><u>"+(d.variableName) + "</u></b><br>"
                    +"Average monthly salary in NOK : <b>"+(d.yVal) + "</b><br>"
                    +"employees : <b>"+(d.areaVal)+ "</b><br>"
                    +"Number of Jobs : <b>"+((d.numJobs)) + "</b><br>"
                    //The line below multiplies the variable by 100 and then rounds it to make a more relatable number, for example 88 per 100 rather than 0.88 per 1
                    //This is an interpretation that might fit this dataset, but maybe not another one. Try to make sense of the numbers for the readers. This could be one way to do that. Use if it fits.
                    +"Employees per 100 jobs: <b>"+(Math.round(100 * (d.xVal))) + "</b><br>"
                    );

In order to adjust it you need to write words about your data that make sense to your readers, and you need to map the right variables from your dataset. This is where you should look at the headers in the spreadsheet again. They can be used as labels for the data. You can see that Average monthly salary describes the yVal data. By keeping the top row in the Google spreadsheet you can keep track of which data you mapped to which variable name (xVal, yVal, areaVal or colorVal).

One line in the code above represents one line in the tooltip, so if you want to remove one label and one piece of data, simply remove the entire line, including everything from the first ‘+’ to the last quotation mark.
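The one-line-per-label idea can also be sketched as a small function that builds the tooltip text from a list of labels and values. This is an illustration only: buildTooltipHtml is a hypothetical helper, not part of the component, and the labels assume the mapping from the example dataset (yVal is earnings, areaVal is employees, xVal is jobs).

```javascript
// Hypothetical helper: builds the tooltip HTML from a title and a list of
// [label, value] pairs. Each pair becomes one line in the tooltip.
function buildTooltipHtml(title, lines) {
  var html = "<b><u>" + title + "</u></b><br>";
  lines.forEach(function (line) {
    html += line[0] + " : <b>" + line[1] + "</b><br>";
  });
  return html;
}

// One row from the example dataset in this article.
var d = { variableName: "85 Education", areaVal: 224967, xVal: 253500, yVal: 46860 };

// Remove or add a pair here to remove or add a line in the tooltip.
var html = buildTooltipHtml(d.variableName, [
  ["Average monthly earnings in NOK", d.yVal],
  ["Employees", d.areaVal],
  ["Number of jobs", d.xVal]
]);
console.log(html);
```

In the component, the resulting string would go where the long concatenation is today, inside the .html(...) call.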

Another issue is that, depending on the number of lines you have in your tooltip, you might want to adjust the position of the tooltip relative to the data marker. A large tooltip might need a bigger offset from the marker origin to stay inside the borders. This can be achieved by adjusting the ySwitch value in two places. The first sets the y-position of the tooltip so it appears above the data marker:

var ySwitch = -110;

This works fine, except if the marker is near the top of the frame. Then we need the tooltip to appear below the marker, so we need an if statement.

if (d3.event.pageY < 150){
    ySwitch = 20;
}

The code above checks the y-position of the mouse pointer on the page when the event fired (d3.event.pageY), and if that y-position is less than 150 it sets ySwitch to a positive number, in turn placing the tooltip below the data marker. Try altering the values above until the tooltip appears where you want it. If the tooltip appears too close to or overlaps the data marker, it needs a larger negative number in the first line or a larger positive number inside the if check. If the tooltip is too far away you have to do the opposite.
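If you want all three numbers in one place while you experiment, the same logic can be sketched as a small function. tooltipYOffset is a hypothetical helper, not part of the component. The values -110, 20 and 150 are the ones from the snippets above.

```javascript
// Hypothetical helper: returns the vertical offset for the tooltip.
// pageY       - the y-position of the mouse pointer on the page
// aboveOffset - negative number, places the tooltip above the marker
// belowOffset - positive number, places the tooltip below the marker
// topLimit    - if pageY is smaller than this, the tooltip goes below
function tooltipYOffset(pageY, aboveOffset, belowOffset, topLimit) {
  return pageY < topLimit ? belowOffset : aboveOffset;
}

console.log(tooltipYOffset(400, -110, 20, 150)); // marker far down the page: tooltip above
console.log(tooltipYOffset(100, -110, 20, 150)); // marker near the top: tooltip below
```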

Time scale

If we want to use time on a scale we need to set up the scale and axis for this, but even more important, we have to make Date objects of whatever variable represents time. Time formats in datasets can vary greatly and it is normal to have to parse (interpret) this data so the app can display the time variables as time. The good thing about the Date object is that it allows us to get any date-related information from it, such as year, month, date, hours, seconds, time (in milliseconds since January 1st 1970), weekday and a lot more, and you can turn the Date object into sentences in your own language as described here. You can read more about the Date object on MDN.

As a first example we can try converting data that only has a year number into Date objects. I googled movies with the most Oscar wins to use in this scatterplot. That led me to this dataset on Statista.

https://www.statista.com/statistics/1002740/movies-with-the-most-oscar-wins/

After looking at this I realized I can’t get much from Statista without paying. I checked the statistics page on the official Oscars site and found a statistic called ‘Film Facts - 5 or More Competitive Awards’. The statistic is available as a PDF and has now resulted in a separate article on how to extract semi-structured data from that PDF.

That article covers extracting structured lines, turning them into a csv and then converting it to a json that our scatterplot can read. The result is this json datafile, which has a .txt extension due to WordPress limitations.

In order to display time data in the scatterplot we have to activate a time scale on the x-axis, rather than a linear scale. Before doing that we have to make sure our data for the scaleTime is in the right format, a Date object. To do this we need a function that parses a number or a string into a Date object. Here it is:

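function parseYear(myYear){
  var newDate = new Date(0); // 0 as argument gives Jan 1st 1970 (UTC) as a starting point.
  // Set the year and pin the date to Jan 1st in local time.
  // Without the month and day arguments, timezones west of UTC would end up on Dec 31st.
  newDate.setFullYear(myYear, 0, 1);
  return newDate;
}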

Then we need to set our xVal to equal the result of the function above, like this:

 dt[i].xVal = parseYear(dt[i].xVal);

This sets the xVal to a Date object based on the year number. This value can now be placed on a time scale.
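Put together, the conversion of a whole dataset can be sketched like this. parseYear is repeated here, in a form that pins the date to January 1st, so the sketch runs on its own. The two rows are sample data for illustration, and using yVal for the number of wins is an assumption about how you map the columns.

```javascript
// Turns a year number (or a string holding a year) into a Date object
// for January 1st of that year, in local time.
function parseYear(myYear) {
  var newDate = new Date(0);
  newDate.setFullYear(Number(myYear), 0, 1);
  return newDate;
}

// Sample rows for illustration: release year as xVal, Oscar wins as yVal (assumed mapping).
var dt = [
  { variableName: "Ben-Hur", xVal: 1959, yVal: 11 },
  { variableName: "Titanic", xVal: "1997", yVal: 11 }
];

// Replace every year number with a Date object.
for (var i = 0; i < dt.length; i++) {
  dt[i].xVal = parseYear(dt[i].xVal);
}

console.log(dt[0].xVal.getFullYear(), dt[1].xVal.getFullYear());
```

The second row has its year as a text string on purpose. Converted files sometimes contain years in quotation marks, and the Number() call handles both cases.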

Since we have the data in the right format we can activate the time scale. You need to activate and deactivate some code with comments for this to work. First, on line 106, the xScale is declared as a scaleTime:

var xScale = d3.scaleTime();
//var xScale = d3.scaleLinear();

Then the scale is set up (given its domain and range) further down. In the snippet below it is the scaleTime that is commented out, so the scaleLinear is the active one:

/*
xScale = d3.scaleTime()
.domain(extentX)
.range([0, width])
.nice();
*/
    
xScale = d3.scaleLinear()
.domain(extentX)
//.range([60, (width - 80)])
.range([0, width])
.nice();
    

As you can see above, the time scale is commented out with /* before and */ after that chunk of code. This makes it appear green in Visual Studio Code, meaning it is a comment. Activate the time scale and deactivate the linear scale. You can also try to run the component with each of the scales (not at the same time) just to see the difference.

Adjusting the domain to get data markers within the axis

This question was asked in class. How can we make all the data markers appear well within the axes and not on them?

The short answer is to use a d3 function called nice() on our domain. This extends the domain to round units so we don’t get weird extreme values. It is already in use in this component. However, this was not sufficient with some time-based data with full year values. Maybe this is because a Date object with only a year value has a date of January 1st, so the lowest value can already be a round value that nice() leaves alone, which puts the data marker on the axis. So how do we then extend the domain further?

Maths with variables and small multipliers is one way. We want the domain to go a bit lower so our first year number appears inside the scatterplot. I want to change the lowest value of the domain to a value that is lower by about 5% of the domain range. This example is with a time value, but any numerical value can be treated the same way. Just drop the Date object when dealing with plain numbers. When doing calculations with time it is best to use milliseconds, as that is a nice clean (although long) number.

First we add a custom extent for the x-axis around line 90, next to the other extents:

var customExtentX = d3.extent([0,1]);//values 0 and 1 are just placeholder values

Here is some code for setting a lower value (lower by 5% of the range) in a new extent, effectively expanding the x-axis:

// The highest time value minus the lowest gives the number of milliseconds within the range.
var scatterXMargin = extentX[1].getTime() - extentX[0].getTime();
// Multiply by 0.05 to get 5% of the range, the amount to subtract from the lowest x value.
scatterXMargin *= 0.05;
// Subtract scatterXMargin from the lowest value and use the result as the argument for a new Date object.
var customXLow = new Date(extentX[0].getTime() - scatterXMargin);
customExtentX = d3.extent([customXLow, extentX[1]]);
xScale = d3.scaleTime()
    .domain(customExtentX)
    //.domain(extentX)
    .range([0, width])
    .nice();

You can copy and paste the above code into your project and try to make it work. If you want to quickly swap between the different extents (the extent determines the range the x-axis should cover), just change which of these two lines from above is commented out:

//.domain(customExtentX)
.domain(extentX)
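For plain numbers the same maths works without the Date object. Here is a minimal sketch. padExtentLow is a hypothetical helper, not part of the component, and the extent used is the lowest and highest xVal from the example dataset earlier in this article.

```javascript
// Hypothetical helper: lowers the low end of a [low, high] extent
// by a fraction of the range, leaving the high end as it is.
function padExtentLow(extent, fraction) {
  var margin = (extent[1] - extent[0]) * fraction;
  return [extent[0] - margin, extent[1]];
}

// Lowest and highest xVal (number of jobs) in the example dataset.
var extentX = [34575, 651089];

// Lower the start of the domain by 5% of the range.
var customExtentX = padExtentLow(extentX, 0.05);
console.log(customExtentX);
```

The returned array can be used as the domain in the same way as customExtentX in the time scale example.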

Labels and source attribution

To give the reader more info we need labels on the axes and a place to put links to sources and creators. The labels are added to the svg element (the vector graphic element), while the attribution is added as a standard html element containing some text and some links. This element is placed below the svg element.

The axis labels are added around line 239 to 253 as text elements that are attached (appended) to the svg. These labels have some attributes, and some of them you need to modify to make them look their best in your project. First of all you need to change the .text properties to whatever makes sense for your data. Once you have done this you will also need to alter the x-values on both labels to a number that places your axis label in a way that looks good. Please note that for the y label the coordinate system is rotated, because the text element is rotated 90 degrees. That means the x-value moves the text up and down on the screen, along the y-axis. You can adjust how close the label is to the graphic with the y property. The code for the x-label looks like this:

svg.append("text")
    .attr("class", "x label")
    .attr("text-anchor", "end")
    .attr("x", 280)
    .attr("y", height - 4)
    .text("income per capita, inflation-adjusted (dollars)");
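The rotated y label is not shown in this article, so here is a minimal sketch of what such a label could look like. It is written as a function that takes the svg selection as a parameter. addYLabel is a hypothetical helper, and the class name, anchor and numbers are assumptions, so the code in your version of the component may differ.

```javascript
// Hypothetical helper: appends a rotated y-axis label to the svg selection.
function addYLabel(svg, labelText, height) {
  return svg.append("text")
    .attr("class", "y label")
    .attr("text-anchor", "middle")
    // Rotate the text 90 degrees counter-clockwise so it reads bottom to top.
    .attr("transform", "rotate(-90)")
    // Because of the rotation, x now moves the label up and down on the screen.
    // A negative x moves it down, so -height / 2 centres it on the y-axis.
    .attr("x", -height / 2)
    // y now moves the label left and right, closer to or further from the graphic.
    .attr("y", 14)
    .text(labelText);
}

// Usage inside the component, where svg and height already exist:
// addYLabel(svg, "Average monthly earnings (NOK)", height);
```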

The source attribution is different. It is an html div element containing a couple of links. Try modifying them to suit your project. You will find them around line 75 in the html files linked at the top of the article.

  <div id="credits">Source 1:<a href="https://gauteheggen.com">gauteheggen.com</a> Source 2:<a href="https://nrk.no">nrk.no</a></div>

More to come

This should hopefully be enough to get you going exploring some data. There will be updates to this article. Here is a list of what will be covered soon:

  • Sort the bubbles’ z-index by their size so big bubbles go to the back.
  • Labeled axis
  • html element to add links to sources etc.
  • Implementing d3 statistical methods and visualizing them (This might be a lot of work, or not, let’s find out!)
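Until the update on bubble order arrives, here is a minimal sketch of one way to approach it. It is not necessarily the method the component will end up using. An svg has no z-index. Elements are drawn in the order they are added, so sorting the data with the biggest areaVal first puts the big bubbles at the back. sortBigFirst is a hypothetical helper and the rows are from the example dataset.

```javascript
// Hypothetical helper: returns a copy of the rows sorted with the largest
// areaVal first, so the largest bubbles are drawn first and end up at the back.
function sortBigFirst(rows) {
  return rows.slice().sort(function (a, b) {
    return b.areaVal - a.areaVal;
  });
}

// Three rows from the example dataset in this article.
var rows = [
  { variableName: "05-09 Mining and quarrying", areaVal: 61044, xVal: 61423, yVal: 78210 },
  { variableName: "86-88 Human health and social work activities", areaVal: 558936, xVal: 651089, yVal: 45620 },
  { variableName: "85 Education", areaVal: 224967, xVal: 253500, yVal: 46860 }
];

var sorted = sortBigFirst(rows);
console.log(sorted.map(function (r) { return r.variableName; }));
```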