Using AI, Python and Pandas for larger datasets (it’s easier than it sounds!)

In this exercise we will use world population data to practice data extraction with Python & Pandas. Here is the file:

https://gauteheggen.com/wp-content/uploads/2025/10/population.zip

https://gauteheggen.com/wp-content/uploads/2025/10/GPD_countries.zip

Here is an AI summary on how to get anaconda and VSCode set up:

Using Anaconda Python in VS Code

This is an AI generated guide to make Python work with VS Code. It seemed ot work in class so we will leave it as is for now:

Prerequisites: Anaconda and VS Code already installed

Quick Setup (All Platforms)

1. Install Python Extension in VS Code

  1. Open VS Code
  2. Click Extensions (Ctrl+Shift+X or Cmd+Shift+X)
  3. Search “Python”
  4. Install the official Microsoft Python extension

2. Select Anaconda Interpreter

  1. Open VS Code
  2. Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (Mac)
  3. Type: Python: Select Interpreter
  4. Choose the Anaconda interpreter (looks like):
    • Windows: C:\Users\YourName\anaconda3\python.exe
    • macOS: /Users/YourName/anaconda3/bin/python
    • Linux: /home/YourName/anaconda3/bin/python

That’s it! Python should now work in VS Code and with Copilot.

Ok, with that working and VS Code should open a folder containing population.csv and then we can make prompts from there.

AI prompting for this dataset

A few things to be aware of before AI prompting:

  • Check if copilot is set to ‘agent’ or ‘ask’. If it is set to ask it can only reply, not actually create or modify files. Set this to ‘Agent’ for now.
  • Also check what model is being used. I have had most success using Calude Sonnet, especially when projects grow complex, but chatGPT might also worki just fine.

Here is a prompt I made to get only the last ten year for Norway. This prompt also had typos (I mispelled Norway as ‘Nowray’ for the filename but the Ai corrected it.) I left typos in this case:

I need a python script to extract the last ten years of data for norway from the file population.csv. The script must also reorganize the data so the first column has 'year' as a header and then the years as values in that column. The 2nd column should have country name as header (so norway in this case) and then the population value as content in that column. Save the data to a file called nowrayLastTenYears.csv. The pythin script can have the same filename with a .py filetype.

This resulted in the followin little dataset:

year,Norway
2014,5137232.0
2015,5188607.0
2016,5234519.0
2017,5276968.0
2018,5311916.0
2019,5347896.0
2020,5379475.0
2021,5408320.0
2022,5457127.0
2023,5519594.0

Which could be used for this graph:

Extract the ten largest countries

Here is the prompt I used:

I want to make a python script for extracting data from the larger dataset population.csv. From this dataset I want only data for the last year and only the ten countries with the largest populations in the last year. Make a csv file called tenLargestCountriesPopulationLastYear.csv. Name the python script the same but with .py filetype.

Copilot then makes the script, checks a few thing (you might need to click ‘allow’) and in the end runs the python script that generates the data file. Here is the resulting data:

Country Name,Country Code,Year,Population
India,IND,2023,1438069596
China,CHN,2023,1410710000
United States,USA,2023,334914895
Indonesia,IDN,2023,281190067
Pakistan,PAK,2023,247504495
Nigeria,NGA,2023,227882945
Brazil,BRA,2023,211140729
Bangladesh,BGD,2023,171466990
Russian Federation,RUS,2023,143826130
Mexico,MEX,2023,129739759

That can be turned into this datawrapper bar chart:

In the dataset there are some entities that are not countries. Depending on the AI model in use these might filtered out, or not. In class I used Claude Sonnet and on its own initiative it filtered out the entities that are no countries, while a student using chatGPT experienced that the non country entitites were included in the dataset.

The smallest countries

Here is the prompt. Note that this is simpler as the AI is instructed to do the same as last time, bu topposite. This works weel when the AI maintains its context window.

Make another similar python script, but with the ten smallest populations from last year.

The resulted in this dataset:

Country Name,Country Code,Year,Population
Tuvalu,TUV,2023,9816
Nauru,NRU,2023,11875
Palau,PLW,2023,17727
St. Martin (French part),MAF,2023,27515
San Marino,SMR,2023,33860
Gibraltar,GIB,2023,38471
Marshall Islands,MHL,2023,38827
Monaco,MCO,2023,38956
British Virgin Islands,VGB,2023,38985
Liechtenstein,LIE,2023,39850

Which gave us this graph:

Population changes over time

The datasets above only had data for the last year in the series. The complete dataset goes back to 1960 so that should give us a total of 64 years of data if my head math was correct. We can assume most countries show a steady growth but looking at all countries it might be hard to see who stand out. Let’s look at what countries populations have had the largest growth, and lets not look at total growth but relative growth.

Make another script. This script should detect the 5 countries with the largest relative population growth from the beginning of the dataset until the end of the dataset. For these 5 countries extract values fom every year and export that to a new csv file. 

The problem with the output from this is that it will still be one column for country and one for population so the structure will look something like this:

We need one column per country for this to make sense. This additional prompt makes the python script reorganize the data:

Change the python script to reorganize the data. I want one column per country with the country name as the column header. Also, drop the country code as I dont need it. The first column should have year as header and the year value sin the column.

Now we get a dataset like this (only showing first fivr years):

Year,Kuwait,Qatar,Saudi Arabia,Sint Maarten (Dutch part),United Arab Emirates
1960,311060,36010,2435403,2715,131334
1961,345945,39680,2526347,2970,137989
1962,384995,44630,2620910,3264,144946
1963,427694,50380,2719341,3584,152211
1964,473115,56867,2821968,3922,159692

That can give us a line graph like this:

This works, but it is impossible to see the growth of the smallest country since the largest country in this dataset has about 1000x larger population. We can use chart type ‘Multiple lines’:

This is still using the same y axis for all lines so the smallest country is still just a flat line. We can choose to ‘use independent scales for each panel’ under visualize > refine.

You can also play around with the panel layout. As mentioned in class choosing 5 countries might have been an odd choice as 5 is not a very dividable number, so maybe 6 countries would ahve been better. As a result of having 5 countries I decided in this case to position them side by side resulting in a wide but not very tall chart with 5 panels per row.

That then gives us this graph:

This layout might be sub optimal for mobile views. Take that into cosideration even when choosing the number of datapoints to ecstract.

Then the opposite:

Make another extraction with the 5 countries with the least population growth or even decline. Otherwiser make it as the previous dataset.

Again, this prompt is simpler because it leans on the previous prompt. Note that this only works as long as the AI maintains its context window. If you try this prompt from scratch it wont’s work well. This prompt and the resulting python script resulted in a dataset like this (5 first lines only):

Year,Bulgaria,Croatia,Latvia,St. Kitts and Nevis,Ukraine
1960,7867374,4140181,2120979,55810,42824093
1961,7943118,4167292,2152681,55334,43399766
1962,8012946,4196712,2181586,54461,43943674
1963,8078145,4225675,2210919,53441,44462372
1964,8144340,4252876,2240623,52326,44951156

This result in this slightly sad graph:

Since I had a gripe with the layout of 5 countries I made an additional prompt to add a 6th country to the dataset. This is simply because 6 is a more dividable number, resulting in more layout flexibility. Here is the follow up prompt:

Note that this has the 5 charts stacked on top of each other. NOTE! This is in no way a layout recommendation Neither the side by side layout above or this stacked one has been tested well for mobile use. This is simply a result of choosing 5 countries and placing them in this kind of view.

Largest total growth

This will show the countries that have had the largest growth in population over the period in the dataset. This will most likely show the largest countries. This is another short prompt that relies on earlier prompts and that the AI has thos in its context window:

Make another script getting the 6 countries with the largest total population growth from start to end of the dataset.

Note that I chose 6 countries this time, for the sake of the layout if we still choose the multiple line chart type. Here are the first five lines of data:

Year,China,India,Indonesia,Nigeria,Pakistan,United States
1960,667070000,435990338,88296070,45053782,45709310,180671000
1961,660330000,446564729,90791249,45989310,46921277,183691000
1962,665770000,457283090,93375850,46965292,48156128,186538000
1963,682335000,468138575,96051424,47973575,49447776,189242000
1964,698355000,479229598,98833749,49012016,50799999,191889000

So, this dataset might not require the mulitple line type since the six largest countries can be shown in the same line chart. I am still going to use the multiple line in the end to test the 6 dataset layout, but here is how it looks in on single line chart (screendump, not interactive):

This is intereseting because India overtake China. I think I would go with this view if I was serious about this, but I am not. I want to test the six layout with the other view. So here it is:

This is another screendump because I saw one thing that bothered me. Every line except the United States on has a baseline of zero. With a line chart we dont have to use 0 as a baseline, but when 5 out of 6 charts has it I would like it to be consistent. This is with the ‘Use independent scales for each panel’ selection on which removes some options for a custom scale. There is however one option that seems to force a zero baseline with line charts and that is the option use Area fill.

Because the area then becomes a visual factor Datawrapper seems to force a zero baseline. Here is the final graph with interactivity:

Merging datasets to use with scatterplot

The last thing we try with this dataset is to merg it with data for GDP. This will use both datasets linked in the top of the article. Make sure they exist in the same folder and that this folder has been opened by VS code.

These two datasets have different formatting. If I was more serious about this (this was a quick improvised stunt in class) I probably would have tested the intergrity of both datasets by testing what country names and what country names only exist in one of the datasets. This would give valuable information about each dataset and could have highlighted shortcomings or other problems.

Another issue is that one of the datasets should probably have been transposed (rotated) to have a more similar structure. We did not do that wither but just went with what we had and asked the AI (Claude Sonnet in my case) to make a script merging the two datasets anyway. I believe it went ok but the process was abit on the cowboys side with this last script.

Here is the initial prompt I tried:

I want to merge two datasets. Make a python script for this. The datasets are GDP_countries.csv and population.csv. Use the country name as the key variable. The datasets might have different orientation. Write to a csv file called population_gpd.csv

By orientation I meant that one dataset has years progressing left to right the other from top to bottom. The AI noticed that the the datasets were formatted differently and din’t seem to catch my ‘orientation’ hint in the prompt, so that was probably a bit too vague.

Ok, so it handled that ok, but here is another pieve of feedback from the AI that should trigger your data intergrity awareness:

This is a great example of what can happen when going a bit fast and giving the AI a flexible prompt. I guessed right here and assumed that the AI tried to generate GDP per capita (dividing total GPD by population) and that maybe the GRP data was per capita to begin with? We went really fast at this stage so still not 100% sure about the original GDP data, what is interesting though is that the AI did this on its own initiative. This is a good reminder to give the AI strict instructions with a narrow scope and to test the result.

So you can see the AI did a few more tests and also checked a third party source for reference:

Ok, so can we trust the AI? NO! Can we query the AI again to check its own mistakes and assumptions? Yes, and we really have to.

So, that gave us a new merged dataset that looks like this (filename polpulation_GDP.csv)

This is another big datafile. We wanted only data from 2023 and in this case I did specify 2023 because the GDP data included 2024 while the population data did not. To avoid any AI confusion I instructed it to pick the last year where I knew there was data in both original datasets. This means the python script might need modification if it is to be used for 2024 data. Anyway, here is the prompt to get 2023 data:

I want to make anoterh pythoch script to extract data from population_gdp.csv. I want only data from the year 2023. I walso want on more data column that should have total GDP. For simplicity we can calculate total GDP by multiplying population and gpd per capita. Name the csv file gpd_population_2023.csv

This also does another aggregation that might be shaky, but for the sake of speed we calculated the total GDP by multiplying GDP per capita with population. This led to an interger overflow issue where the numbers were to big to be represented by a 32 bit integer (a data type). The AI fixed it by using a different data type, probably a 64 bit integer but it did not say.

The resulting file worked in Datawrapper as we wanted to use a scatterplot.

NOTE! In making the merged dataset we did make several shortcuts and if this was a ‘real’ project we would have to do more tests to ensure the data integrity of that file. It is likely correct but we did not test that to a large extent.

Anyway, here is a scatterplot of the last data:

NOTE! This scatterplot is not very complete. It has unfortunate units in the tooltip such as too many digits on some numbers and also with populations of less than one million oyu get a silly decimal number in millions with too many decmial spaces. This should of course be fixed but we were speedrunning in class so that is why.