Due to changes in the EU Open Data Portal - API, the API access demonstrated in this tutorial is currently not working. We will try to fix it asap. If you have any questions, please contact us: info@datacareer.de.
The EU Open Data Portal gives access to open data published by EU institutions, agencies and other bodies. Around 70 EU institutions, bodies or departments use the platform to make over 12,500 datasets available.
In this Jupyter Notebook we will retrieve data from open data portal "http://data.europa.eu/euodp/en/home". The portal is based on the open source project CKAN. CKAN stands for Comprehensive Knowledge Archive Network. It provides an extensive API for the metadata of the open data catalog. This means that the information about the datasets can be retrieved from CKAN, but the data itself will have to be downloaded from the servers of the contributors.
In this tutorial we will take a look at the unemployment numbers for Europe, using Python 3. Let's start with importing some packages we will use for this exercise.
import pprint
import requests # 2.18.4
import json # 2.0.9
import pandas as pd # 0.23.0
The EU Open Data Portal's website is: http://data.europa.eu/euodp/en/home. Please take a look to get a feel of what this portal has to offer.
You can actually take a look at all the datasets listed in the EU Open Data Portal: https://data.europa.eu/euodp/data/api/3/action/package_list.
As you can see, the majority of the datasets have names which consist of a random combination of letters and numbers. It does not really give us any useful insight into what's available. The CKAN API also offers another option, the tag list: https://data.europa.eu/euodp/data/api/3/action/tag_list.
The tag list gives us an idea of what kind of datasets are available. For this tutorial, we will check out what 'unemployment' numbers are available. We will use the CKAN's 'package_search' functionality, to get the related datasets.
# Topic of interest
query = 'unemployement'
# The base url for search queries
base_url = 'https://data.europa.eu/euodp/data/api/3/action/package_search?q='
# Limit the number of results (max 1,000)
limit = 20
# Build the HTTP request, with the 'base url', 'query' & 'limit'
response = requests.get(base_url + query + '&rows=' + str(limit))
# Use the json module to load CKAN's response into a dictionary
response_dict = json.loads(response.content)
# Check the contents of the response
assert response_dict['success'] is True # make sure if response is OK
search_result = response_dict['result'] # extract all the packages from the response
pprint.pprint(search_result) # pretty print the list to the screen
{'count': 164, 'facets': {}, 'results': [{'author': None, 'author_email': None, 'capacity': None, 'concepts_eurovoc': [{'display_name': 'http://eurovoc.europa.eu/7932', 'id': '8fe979a0-7834-4061-be7b-286fed5529cf', 'name': 'http://eurovoc.europa.eu/7932', 'revision_timestamp': '2015-10-16T11:27:19.505880', 'state': 'active', 'vocabulary_id': 'dbfd15ac-a514-4fa0-b3ef-1167c493b831'}], 'contact_address': 'Joseph Bech building, 5 Rue Alphonse ' 'Weicker, L-2721 Luxembourg', 'contact_name': 'Eurostat, the statistical office of the ' 'European Union', 'contact_telephone': '+352430136789', 'contact_webpage': 'http://ec.europa.eu/eurostat/help/support', 'creator_user_id': '75d0190a-2f12-40ed-a863-b5bf30990d14', 'description': 'Unemployment rate', 'geographical_coverage': [], 'groups': [{'description': '', 'display_name': 'http://eurovoc.europa.eu/100144', 'id': '9ea8d460-827f-4f46-b5d0-3d16605878d7', 'image_display_url': '', 'name': 'eurovoc_domain_100144', 'title': 'http://eurovoc.europa.eu/100144', 'type': 'eurovoc_domain'}], 'id': 'ad9d4738-b1aa-44f1-b53e-beb65865ec3a', 'identifier': 'med_ps421', 'interoperability_level': [], 'isopen': True, 'language': [], 'license_id': 'http://data.europa.eu/euodp/kos/licence/EuropeanCommission', 'license_title': 'Legal Notice', 'license_url': 'http://ec.europa.eu/geninfo/legal_notices_en.htm', 'maintainer': None, 'maintainer_email': None, 'metadata_created': '2015-10-16T11:27:19.505880', 'metadata_modified': '2018-10-05T06:42:47.247130', 'modified_date': '2018-06-28', 'name': 'GnZ8n38zuEu8oJzmmII3WQ', 'num_resources': 6, 'num_tags': 3, 'organization': {'approval_status': 'approved', 'created': '2012-12-12T13:50:52.62
There is a total of 164 results ('count': 164) for the search term 'unemployment'. This is the same as when you search on the website of the data portal: http://data.europa.eu/euodp/en/data/dataset?q=unemployment&ext_boolean=all
Even though we "pretty print" our search results in this Jupyter Notebook, it is still not very readable. Let's build a small loop that prints the 'title' of the datasets. Please note we have limited our search query to 20 results, so with index 0 to 19.
for i, x in enumerate(search_result['results']):
print(i, x['title'])
0 Unemployment rate 1 Unemployment by sex, age and duration of unemployment (1 000) 2 Unemployment by sex, age and duration of unemployment (1 000) 3 [DISCONTINUED] Total unemployment rate 4 Change in Unemployment 2008-2014 5 Total unemployment rate 6 Unemployment Rate 2014 7 Unemployment rate by age 8 Harmonised unemployment rate by sex 9 Unemployment rate by sex 10 Unemployment rate - annual data 11 Youth unemployment rate by sex 12 Tables by benefits - unemployment function 13 Unemployment by sex, age and other typologies 14 Unemployment rate by education level 15 Harmonised unemployment by sex 16 Unemployment by sex, age, duration of unemployment and distinction registration/benefits (%) 17 Long-term unemployment (12 months or more) as a percentage of the total unemployment, by sex and age (%) 18 Youth unemployment ratio (15-24) 19 Unemployment rate by NUTS 2 regions
You can find some more details about the datasets in the JSON response we printed earlier. We will continue with the 'Total unemployment rate' dataset, listed at index 5. We will pretty print the details about this particular dataset again.
pprint.pprint(search_result['results'][5])
{'author': None, 'author_email': None, 'capacity': None, 'concepts_eurovoc': [], 'contact_address': 'Joseph Bech building, 5 Rue Alphonse Weicker, L-2721 ' 'Luxembourg', 'contact_name': 'Eurostat, the statistical office of the European Union', 'contact_telephone': '+352430136789', 'contact_webpage': 'http://ec.europa.eu/eurostat/help/support', 'creator_user_id': '75d0190a-2f12-40ed-a863-b5bf30990d14', 'description': 'Unemployment rates represent unemployed persons as a ' 'percentage of the labour force. The labour force is the total ' 'number of people employed and unemployed. Unemployed persons ' 'comprise persons aged 15 to 74 who were: a. without work ' 'during the reference week, b. currently available for work, ' 'i.e. were available for paid employment or self-employment ' 'before the end of the two weeks following the reference week, ' 'c. actively seeking work, i.e. had taken specific steps in ' 'the four weeks period ending with the reference week to seek ' 'paid employment or self-employment or who found a job to ' 'start later, i.e. within a period of, at most, three months. ' 'This table does not only show unemployment rates but also ' 'unemployed in 1000 and as % of the total population.', 'geographical_coverage': [], 'groups': [], 'id': '01d07089-626d-4216-b947-906bf94c9f84', 'identifier': 'tps00203', 'interoperability_level': [], 'isopen': True, 'language': [], 'license_id': 'http://data.europa.eu/euodp/kos/licence/EuropeanCommission', 'license_title': 'Legal Notice', 'license_url': 'http://ec.europa.eu/geninfo/legal_notices_en.htm', 'maintainer': None, 'maintainer_email': None, 'metadata_created': '2017-12-06T06:32:40.667781', 'metadata_modified': '2018-10-05T05:57:13.592821', 'modified_date': '2018-10-03', 'name': '5wwzZYnxIK3aSTAZdXaGg', 'num_resources': 7, 'num_tags': 2, 'organization': {'approval_status': 'approved', 'created': '2012-12-12T13:50:52.627310', 'description': '', 'id': '9c0d04b1-a494-4f70-80b2-54024b6f8b0c', 'image_url': '', 'is_organization': True, 'name': 'estat', 'revision_id': 'c9dbca7b-695f-4247-a5b5-0ae1818f9d53', 'revision_timestamp': '2012
Let's take a look first at the web page of the dataset at the Open Data Portal. The URL is:
search_result['results'][5]['url']
'http://ec.europa.eu/eurostat/web/products-datasets/-/tps00203'
There are various ways to look at the data on the portal. You can see it in table format, in a graph or on a map. We are more interested in the raw data itself. The number of resources (data formats) available is:
len(search_result['results'][5]['resources'])
7
Browsing through the dataset details above, it seems there is a "tap separated (TSV)" / "unzipped" version of the data amongst the 7 resources. Its index location is 1.
search_result['results'][5]['resources'][1]
{'mimetype': None, 'cache_url': None, 'hash': '', 'description': 'Download dataset in TSV format (unzipped)', 'name': None, 'format': 'text/tab-separated-values', 'url': 'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tps00203.tsv.gz&unzip=true', 'created': '2018-10-05T07:57:13.648521', 'state': 'active', 'webstore_url': None, 'revision_timestamp': '2018-10-05T05:57:13.587294', 'tracking_summary': {'total': 0, 'recent': 0}, 'mimetype_inner': None, 'download_total_resource': 0, 'url_type': None, 'position': 5, 'resource_group_id': '860bce01-077e-47d4-9851-837127343dea', 'revision_id': '34e4492b-0134-4417-93ee-28c83c190942', 'id': '0bebdf99-68a6-4d2c-8090-ced6bcb24b56', 'resource_type': 'http://www.w3.org/TR/vocab-dcat#Download', 'size': None}
The URL to the raw data is:
search_result['results'][5]['resources'][1]['url']
'http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tps00203.tsv.gz&unzip=true'
For readability lateron, we can put the URL in a variable called 'unzipped'
unzipped = search_result['results'][5]['resources'][1]['url']
# http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tps00203.tsv.gz&unzip=true
And we can load the raw data into a Pandas DataFrame:
df = pd.read_csv(unzipped, sep='\t') # use separator '\t' for tab-separated values
# Print the first 5 rows of the DataFrame for inspection
df.head()
age,unit,sex,geo\time | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TOTAL,PC_ACT,T,AT | 5.3 | 4.9 | 4.1 | 5.3 | 4.8 | 4.6 | 4.9 | 5.4 | 5.6 | 5.7 | 6.0 | 5.5 |
1 | TOTAL,PC_ACT,T,BE | 8.3 | 7.5 | 7.0 | 7.9 | 8.3 | 7.2 | 7.6 | 8.4 | 8.5 | 8.5 | 7.8 | 7.1 b |
2 | TOTAL,PC_ACT,T,BG | 9.0 | 6.9 | 5.6 | 6.8 | 10.3 d | 11.3 | 12.3 | 13.0 | 11.4 | 9.2 | 7.6 | 6.2 |
3 | TOTAL,PC_ACT,T,CY | 4.6 | 3.9 | 3.7 | 5.4 | 6.3 | 7.9 | 11.9 | 15.9 | 16.1 | 15.0 | 13.0 | 11.1 |
4 | TOTAL,PC_ACT,T,CZ | 7.1 | 5.3 | 4.4 | 6.7 | 7.3 | 6.7 | 7.0 | 7.0 | 6.1 | 5.1 | 4.0 | 2.9 |
The column names are numbers and the first column has a weird header. It is best to inspect the column headers a bit more before we continue.
print(df.columns)
Index(['age,unit,sex,geo\time', '2006 ', '2007 ', '2008 ', '2009 ', '2010 ', '2011 ', '2012 ', '2013 ', '2014 ', '2015 ', '2016 ', '2017 '], dtype='object')
If you look closely, all the column headers with years in them contain spaces at the end. To remove them:
df.columns = df.columns.str.strip()
Also, the data is not entirely ready for use yet and needs some cleaing first. The first column is called 'age,unit,sex,geo\time' and has the country codes as two letters at the end of each value.
# Create a new column at the end of the DataFrame called 'country', with the values of the first column
df['country'] = df[df.columns[0]]
# Use only the last two letters of the value, to filter out the country codes
df['country'] = df['country'].str[-2:]
# Inspect the DataFrame. What are the types of the columns? Float or integers we can use directly? Probably not...
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 35 entries, 0 to 34 Data columns (total 14 columns): age,unit,sex,geo\time 35 non-null object 2006 35 non-null object 2007 34 non-null float64 2008 34 non-null object 2009 34 non-null object 2010 34 non-null object 2011 34 non-null object 2012 34 non-null float64 2013 34 non-null float64 2014 34 non-null float64 2015 34 non-null float64 2016 34 non-null float64 2017 34 non-null object country 35 non-null object dtypes: float64(6), object(8) memory usage: 3.9+ KB
As we will look at the year '2017' only, we would need the values of this column to be numeric. We can transform them while erasing the errors from the DataFrame (NaN's or non-numeric).
df = df[pd.to_numeric(df['2017'], errors='coerce').notnull()]
df['2017'] = df['2017'].astype(float)
# Check the first rows of the DataFrame again
df.head()
age,unit,sex,geo\time | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | TOTAL,PC_ACT,T,AT | 5.3 | 4.9 | 4.1 | 5.3 | 4.8 | 4.6 | 4.9 | 5.4 | 5.6 | 5.7 | 6.0 | 5.5 | AT |
2 | TOTAL,PC_ACT,T,BG | 9.0 | 6.9 | 5.6 | 6.8 | 10.3 d | 11.3 | 12.3 | 13.0 | 11.4 | 9.2 | 7.6 | 6.2 | BG |
3 | TOTAL,PC_ACT,T,CY | 4.6 | 3.9 | 3.7 | 5.4 | 6.3 | 7.9 | 11.9 | 15.9 | 16.1 | 15.0 | 13.0 | 11.1 | CY |
4 | TOTAL,PC_ACT,T,CZ | 7.1 | 5.3 | 4.4 | 6.7 | 7.3 | 6.7 | 7.0 | 7.0 | 6.1 | 5.1 | 4.0 | 2.9 | CZ |
5 | TOTAL,PC_ACT,T,DE | 10.1 | 8.5 | 7.4 | 7.6 | 7.0 | 5.8 | 5.4 | 5.2 | 5.0 | 4.6 | 4.1 | 3.8 | DE |
Finally, we will plot the values for the year '2017' for the various countries. For readability, let's create a new DataFrame called 'last_year' with only the columns '2017' and 'country'.
# Use only the last two columns of the DataFrame
last_year = df[df.columns[-2:]]
# Check again the first 5 rows
last_year.head()
2017 | country | |
---|---|---|
0 | 5.5 | AT |
2 | 6.2 | BG |
3 | 11.1 | CY |
4 | 2.9 | CZ |
5 | 3.8 | DE |
The data is ready to be visualised now.
# For charting, use IPython's "magic" in Jupyter Notebook to directly show the plot on the screen.
%matplotlib inline
# Plot the values in a chart, with ascending values
last_year.sort_values('2017', ascending=True).plot( x='country', y='2017', kind='bar')
<matplotlib.axes._subplots.AxesSubplot at 0x19ffefce2e8>
It is quite a hassle to search through all the datasets to find raw data. Below is a small loop which only prints the datasets which have a link to the raw data available.
for x in search_result['results']: # Loop through every dataset in the search results
for i in x['resources']: # Check for every data set if the word 'unzip' is in the URL
if 'unzip' in i['url']: # If so, print the 'title' and the URL to the screen
print(x['title'])
print(i['url'])
Unemployment rate http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/med_ps421.tsv.gz&unzip=true Unemployment by sex, age and duration of unemployment (1 000) http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/lfsa_ugad.tsv.gz&unzip=true Unemployment by sex, age and duration of unemployment (1 000) http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/lfsq_ugad.tsv.gz&unzip=true Total unemployment rate http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tps00203.tsv.gz&unzip=true Unemployment rate by age http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tepsr_wc170.tsv.gz&unzip=true Harmonised unemployment rate by sex http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/teilm020.tsv.gz&unzip=true Unemployment rate by sex http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tesem120.tsv.gz&unzip=true Unemployment rate - annual data http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tipsun20.tsv.gz&unzip=true Youth unemployment rate by sex http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/tesem140.tsv.gz&unzip=true Tables by benefits - unemployment function http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/spr_exp_fun.tsv.gz&unzip=true Unemployment by sex, age and other typologies http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/urt_lfu3pers.tsv.gz&unzip=true Unemployment rate by education level http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/med_ps422.tsv.gz&unzip=true Harmonised unemployment by sex http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/teilm010.tsv.gz&unzip=true Unemployment by sex, age, duration of unemployment and distinction registration/benefits (%) http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/lfsa_ugadra.tsv.gz&unzip=true Long-term unemployment (12 months or more) as a percentage of the total unemployment, by sex and age (%) http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/lfsq_upgal.tsv.gz&unzip=true Youth unemployment ratio (15-24) http://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=da
About the author:
Joris H., Python & open source enthusiast. Entrepreneur @ PythonSherpa - https://www.pythonsherpa.com