University Library, photographer Wouter van der Wolk
The tutorial was created by staff of the Library of the University of Amsterdam and the Amsterdam University of Applied Sciences. It is intended for lecturers working there, but can also be used elsewhere.
Usage in class
As a teacher, why would you want to find and use open data in education?
Data skills are important in more and more courses and for more and more professions.
Of course, you can use self-created example datasets, but for students it is valuable to get to know datasets from real life. By using real life data, you can make your education more attuned to reality.
Higher education prepares students to do research independently, among other things. By working with datasets in class, students become more aware of the importance of data for their own research.
And finally, the use of open data in education promotes the principle of open science: the principle that the results of science should be transparent and accessible to society as a whole.
"Open Science" by NWO Wetenschap, 2020
Data
What is data?
"What is data?" by University of Guelph McLaughlin Library, 2019, CC BY-NC-SA
Data can be described as the raw material for information. By itself, data has no meaning (see the note below); context and interpretation are needed to answer the questions who, what, where and when, in other words: to transform the data into information. That information can then be used to support an argument and thus serve science, public administration or business.
Data collected by researchers themselves is called primary data. Data from other sources is called secondary data: this is already existing data, for instance found in a government database or a scientific publication.
Primary data can be created in many ways:
observation
measurement
interviews
case studies
surveys
crowdsourcing (contributions from interested lay people to research).
Another classification for data is qualitative and quantitative: qualitative data are not numerical; quantitative data are.
A dataset is a collection of related data. When data are published openly, this is usually in the form of a dataset.
A data paper describes, according to custom within a scientific discipline, how a particular dataset available online should be interpreted.
Metadata are data about data.
__________________
Note
1. In the traditional sense, data is the plural of datum. A datum is 'something that is given' and can be counted (1 datum, 2 data, 3 data etc.).
In science, however, it is not common to count data. Any quantity can be referred to as data, with either a singular or a plural verb. For example, the New York Times uses the singular and plural forms side by side:
"the survey data are still being analysed"
and
"the first year for which data is available".
This course also uses singular and plural side by side for data.
Open
Open science and FAIR
Openness of research data fits into the global Open Science movement, in which more and more scientists realise that the results of their publicly funded research, including the underlying data, should be made available to a wide audience to promote transparency.
In 2014, scientists agreed internationally to describe, store and publish scientific data according to the FAIR principles. FAIR is an acronym for findable, accessible, interoperable and reusable.
The Netherlands Organisation for Scientific Research (NWO) has also embraced Open Science, making similar conditions mandatory for scientists funded by NWO.
"Open Data - explained in a nutshell" by Simpleshow Foundation
Open data outside academia
Open Government
Many government organisations are publishing some of the data they acquire openly for the sake of transparency and accountability, the ideals of the Open Government movement.
Open GLAM
Also in the heritage sector (GLAM stands for galleries, libraries, archives and museums), there have been initiatives since 2010 to make data on collections openly available to the public.
Data is open when the following conditions are met:
machine-readability
presence of metadata
presence of an open licence
Machine-readable
It is important that data is machine-readable. This does not mean the same as "digital".
Many document formats we create every day on our laptops and computers are not machine-readable: their content can only be used properly with the right software package.
Examples:
A PDF file is designed to be displayed by a PDF reader such as Adobe Reader; other software cannot reliably extract the tables and figures it contains.
An Excel file can be opened and displayed by MS Excel and similar spreadsheet programmes, but other programmes and programming environments need additional software to read it.
"But surely you can easily save an xlsx file in the csv format?" Yes, but then information is lost, such as formulas and meaningful use of colours.
Open data should be extractable and processable by the computer. The guarantee for this is the file format.
Machine-readable formats include:
.tsv or tab separated values: a table in which each row starts on a new line and a horizontal tab (white space) separates every two columns
.csv or comma separated values: a table in which each row starts on a new line and a comma or semicolon separates every two columns
.txt or plain text: this is a text stripped of all formatting, font and images
.json: in this format, relationships between objects can be described.
Files in these formats can always be opened and read by any computer, regardless of the software installed.
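As an illustration of what "processable by the computer" means, the sketch below reads csv, tsv and json content using only Python's standard library. The sample values are invented for the example and do not come from a real dataset; in practice the text would be read from a downloaded file.

```python
import csv
import io
import json

# Hypothetical sample content (made up for this example).
CSV_TEXT = "city,year,rainfall_mm\nAmsterdam,2020,850\nUtrecht,2020,900\n"
TSV_TEXT = "city\tyear\trainfall_mm\nAmsterdam\t2020\t850\n"
JSON_TEXT = '{"city": "Amsterdam", "measurements": [{"year": 2020, "rainfall_mm": 850}]}'


def read_table(text, delimiter):
    """Parse delimited text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_table(CSV_TEXT, ",")    # comma between columns
tsv_rows = read_table(TSV_TEXT, "\t")   # tab between columns
record = json.loads(JSON_TEXT)          # nested objects and lists

print(csv_rows[1]["city"])
print(tsv_rows[0]["rainfall_mm"])
print(record["measurements"][0]["year"])
```

Note that values read from csv and tsv arrive as text; converting them to numbers is a separate step. For a csv file that uses semicolons, the same function can be called with ";" as the delimiter.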
Metadata
A table filled with numbers and/or text needs basic explanations:
what are these data about?
how were they obtained?
where and when were they obtained?
what unit of measurement was used?
what do the abbreviations used mean?
etc.
The answer to those questions should become clear using metadata ("data about data"). It is common practice to add these to the dataset in a separate text file.
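A separate metadata text file can be as simple as a list of "field: value" lines. The sketch below shows one possible way to produce such a file's content. The field names and all values are hypothetical and chosen only to mirror the questions above; they do not follow any particular metadata standard.

```python
def render_metadata(metadata):
    """Turn a dictionary of metadata fields into plain 'field: value' lines."""
    return "\n".join(f"{field}: {value}" for field, value in metadata.items()) + "\n"


# Hypothetical example values, one per question a user of the data will ask.
example_metadata = {
    "title": "Example rainfall measurements",       # what are these data about?
    "method": "automatic rain gauge",               # how were they obtained?
    "place and period": "Amsterdam, 2020",          # where and when?
    "unit": "mm",                                   # unit of measurement
    "abbreviations": "rainfall_mm = rainfall in millimetres",
}

readme_text = render_metadata(example_metadata)
print(readme_text)
```

The resulting text could be saved next to the dataset, for instance in a file named README.txt.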
Open licence
Publishing a dataset on the internet does not in itself make it "open". This is because its creator has copyright! If they have not explicitly stated that the dataset is open for re-use, then the data is not open.
As with other "works of art, science or literature" (as the Copyright Act puts it), the creator of a dataset also has copyright on it: only the creator has the right to reproduce or distribute the dataset.
This means that, in principle, a user may only view and download someone else's dataset for their own use. Making copies for a group of students, combining the data with other data, and then republishing or distributing the result are infringements of the creator's copyright. Even in education!
Unless... the creator has given prior conditional permission for use and re-use, in the form of a licence attached to the work. This preserves the creator's copyright, but creates opportunities for others to distribute the dataset.
Without a licence, the dataset cannot be open!
How do you find out if a dataset has a licence attached to it?
The creator could write out the licence terms themselves. However, this rarely, if ever, happens.
Creators almost always use an existing licensing system. Not having to invent and write out terms saves them time, and not having to read them saves you time.
Creative Commons (CC) is the most widely used licensing system worldwide. In this system, logos and abbreviations are used for terms and conditions. The creator selects one or more logos and/or abbreviations.
CC-BY
Re-use is allowed on condition that a correct source citation is added.
CC-ND
Re-use is permitted provided no derivative works are published.
CC-SA
Re-use is permitted, provided derivative works are published under the same licence (share alike).
CC-NC
Re-use permitted, but only for non-commercial purposes.
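Because a creator can combine several abbreviations, a licence indication such as "CC BY-NC-SA" can be read as a list of conditions. The sketch below spells out the conditions for such a code, using the four abbreviations described above. It is a simplified illustration: the function name is hypothetical, and it does not handle version numbers or other indications such as CC0.

```python
# Conditions per abbreviation, paraphrased from the descriptions above.
CC_ELEMENTS = {
    "BY": "a correct source citation must be added",
    "ND": "no derivative works may be published",
    "SA": "derivative works must be published under the same licence",
    "NC": "re-use is for non-commercial purposes only",
}


def explain_licence(code):
    """Return the conditions for a code such as 'CC BY-NC-SA' or 'CC-BY'.

    Raises KeyError for abbreviations that are not in CC_ELEMENTS.
    """
    # Drop the leading "CC" and any separator, then split the abbreviations.
    parts = code.upper().replace("CC", "", 1).strip(" -").split("-")
    return [CC_ELEMENTS[part] for part in parts if part]


for condition in explain_licence("CC BY-NC-SA"):
    print(condition)
```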
"What is Creative Commons? Creative Commons License Types Basics Explained" by Creative Common Studio, 2020
Assignments
Examples
Websites
Many open datasets can be found on the internet. Below are some examples of websites offering them.
Science
Governments
Such as the city of Amsterdam:
The cultural sector
Financial markets
Collector of statistical data
Such as the CBS.
Weather and climate
Datasets
Below are some examples of datasets.
In the left-hand column, you see the web page introducing the dataset. Each website obviously has its own way of presenting it, but recurring elements are:
title
creator
(link to) description (metadata)
size
date/year of publication
a button to download the dataset to your computer.
Usually, you will also find a licence indication, which tells you whether and how you may reuse this dataset. If such an indication is not there, check the general information of the data collector, usually in the About section of the site: the same licence may apply to all sets. Again: without a licence, the data is not open.
In the right-hand column is an example of the data, as you can download it to your own computer, viewed with a "plain" text editor.
web page
view of the data after downloading
Quality and usability
To determine which open datasets (i.e. licensed) are suitable for use in education, you can use various assessment criteria.
Look especially for downloadable datasets and not for real-time data that is continuously updated via an API ("application programming interface"), as is the case for stock market data, for example. Because such data keeps changing, it is difficult to give a group of students access to exactly the same data.
Metadata
Is metadata present, so you can see how this data has been/is being collected?
Source
Who (person or body) is the creator of the dataset and to what extent does it inspire trust?
Can you be confident that it will be stable for the duration of the teaching block?
Size
Is the dataset not too large?
Keep in mind that students do not all have very modern computers. Many tools load a dataset entirely into working memory (RAM), which is also needed by the operating system and other programmes. On a computer with 4GB of RAM, a dataset approaching 4GB will therefore be hard or impossible to work with.
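Whether a downloaded file is "not too large" can be checked before handing it to students. The sketch below compares a file's size with a size budget. The 500 MB budget and the function names are assumptions made for the example, not a recommendation from this tutorial; the demo file is created on the spot so the example runs without a download.

```python
import os
import tempfile


def size_in_mb(path):
    """Return the size of a file in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def fits_budget(path, budget_mb):
    """True if the file is no larger than the given budget in megabytes."""
    return size_in_mb(path) <= budget_mb


# Create a 2 MB demo file in a temporary folder (removed again afterwards).
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.csv")
    with open(demo_path, "wb") as handle:
        handle.write(b"x" * 2 * 1024 * 1024)
    demo_size = size_in_mb(demo_path)
    small_enough = fits_budget(demo_path, budget_mb=500)  # hypothetical budget

print(demo_size, small_enough)
```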
File format
Is the file format suitable for processing by students?
The formats .csv, .tsv and .txt can be read by any computer without any problems.
The formats .zip and .gz indicate compressed ("packed") files or folders; what the actual format is becomes clear only after unpacking.
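To see what a packed file actually contains, it can be inspected with Python's standard zipfile and gzip modules. The sketch below builds a small archive in memory, so that it runs without a download, and then lists and reads its contents. The file names and contents are hypothetical.

```python
import gzip
import io
import zipfile

# Build a small .zip archive in memory (stand-in for a downloaded file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("data/measurements.csv", "city,year\nAmsterdam,2020\n")
    archive.writestr("README.txt", "Hypothetical metadata file\n")

# List the packed files and read the first line of one of them.
with zipfile.ZipFile(buffer) as archive:
    names = archive.namelist()
    first_line = archive.read("data/measurements.csv").decode("utf-8").splitlines()[0]

# A .gz file packs a single file; here a small tsv table.
packed = gzip.compress("city\tyear\nAmsterdam\t2020\n".encode("utf-8"))
unpacked = gzip.decompress(packed).decode("utf-8")

print(names)
print(first_line)
print(unpacked)
```

For a real download, zipfile.ZipFile and gzip.open accept a file path instead of in-memory data.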
Finding
Starting points
Datasets are collected and offered on all kinds of websites. We provide some key starting points here.
Science
Researchers from many scientific institutes store their data in one of these databases:
National repositories: in these, research results including datasets from several universities in a country are made accessible, often by "harvesting" ( = retrieving information) from university repositories. In the Netherlands, this is
There are also all kinds of subject-specific data search engines. On the websites of many university libraries, information specialists offer an anthology of these for their specific field.
See for example the data management pages per discipline of the
There are also the metacatalogues, or "repositories of repositories". These inventory not the datasets themselves, but the collecting repositories. To be successful with this, it is wise to use large subject categories.
Example: if you are looking for datasets on precipitation in a particular year in Europe, search first on the larger topic 'weather'. The metacatalogue links to various repositories. Once there, only use the more specific search terms 'rainfall' etc.
You can also use the general search engine Google.com to search for datasets. To avoid drowning in the number of irrelevant results, we offer the following tips:
- in addition to the subject, type
data OR dataset OR "data set"
in the search query.
- You can search specifically for a certain file format with, for example
filetype:csv
and for data from a particular site or internet domain with, for example
site:.gov
- Before words that should NOT appear in the search result, place a - (minus sign).
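The tips above can be combined into a single search query. The sketch below assembles such a query as text that can be pasted into the search box. The function name and the example subject are hypothetical.

```python
def build_dataset_query(subject, filetype=None, site=None, exclude=()):
    """Combine a subject with the search tips above into one query string."""
    parts = [subject, 'data OR dataset OR "data set"']
    if filetype:
        parts.append(f"filetype:{filetype}")   # restrict to one file format
    if site:
        parts.append(f"site:{site}")           # restrict to a site or domain
    parts.extend(f"-{word}" for word in exclude)  # words that must not appear
    return " ".join(parts)


query = build_dataset_query("rainfall", filetype="csv", site=".gov", exclude=["forecast"])
print(query)
# rainfall data OR dataset OR "data set" filetype:csv site:.gov -forecast
```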
Google also offers a dataset search engine, launched in 2020:
The online encyclopaedia Wikipedia and other Wikimedia projects such as Wiktionary (dictionary) and Wikivoyage (travel guide), have an underlying database cum classification system: Wikidata.
Like the content of these reference works, Wikidata is also a product of crowdsourcing.
Wikidata has an open licence (CC0) and is special because it is not merely about searching for existing datasets. It allows you to generate your own datasets based on your own search. These can be downloaded in csv, tsv, and json format and used for any purpose.
Older versions are also available for download.
Please note that Wikidata is constantly changing!
Searches in Wikidata require knowledge of the structure of Wikidata and of the SPARQL query language, but all kinds of help is available, including the Wikidata Query Builder and the Query Helper.
Like many knowledge databases, Wikidata is composed of so-called triples. A triple is a set of subject, predicate and object. The predicate establishes the relationship between subject and object.
Example
A triple can be formed by:
Subject: "Cristiano Ronaldo"
Predicate: "has been awarded with"
Object: the "Bravo Award 2004"
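For readers who think in code, a triple can be pictured as a small record with three fields. The sketch below stores the triple above and looks up all objects for a given subject and predicate. It only illustrates the idea and is not how Wikidata itself stores its data; the class and function names are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: a subject, a predicate and an object."""
    subject: str
    predicate: str
    obj: str


# The example triple from the text above.
triples = [
    Triple("Cristiano Ronaldo", "has been awarded with", "Bravo Award 2004"),
]


def objects_for(all_triples, subject, predicate):
    """Return every object linked to the subject by the given predicate."""
    return [t.obj for t in all_triples
            if t.subject == subject and t.predicate == predicate]


awards = objects_for(triples, "Cristiano Ronaldo", "has been awarded with")
print(awards)
```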
Suppose you want a dataset containing all Cristiano Ronaldo's awards and their corresponding years.
The resulting set can be downloaded in tsv, csv and json format.
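A query for the awards example could look roughly like the sketch below, which builds the SPARQL text and a request address for the public Wikidata Query Service. It relies on a few assumptions that should be verified on wikidata.org before use: that Q11571 is the identifier of Cristiano Ronaldo, that P166 is the property "award received" and that P585 is the qualifier "point in time". The function names are hypothetical, and the sketch only prepares the request without sending it.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"


def build_awards_query(person_id):
    """Return a SPARQL query listing a person's awards and the year received."""
    # P166 (assumed: "award received") links the person to an award;
    # P585 (assumed: "point in time") is the optional date on that statement.
    return f"""
SELECT ?awardLabel ?year WHERE {{
  wd:{person_id} p:P166 ?statement .
  ?statement ps:P166 ?award .
  OPTIONAL {{ ?statement pq:P585 ?date . }}
  BIND(YEAR(?date) AS ?year)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
ORDER BY ?year
""".strip()


def build_request_url(query):
    """Return the address that asks the endpoint for results in json format."""
    return f"{WIKIDATA_ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


awards_query = build_awards_query("Q11571")  # assumed ID for Cristiano Ronaldo
request_url = build_request_url(awards_query)
print(awards_query)
```

The query text can also be pasted directly into the query editor at query.wikidata.org, where the results can be inspected and downloaded.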
Video
In the following video, the whole process is explained, now with reference to the residences of all women who studied at a particular university.
Only the first 10 minutes are relevant to our topic.
"Wikidata SPARQL Query Tutorial", by Wikimedian in Residence - University of Edinburgh
Assignments
How to proceed
Coming to the end of this tutorial, we give some suggestions on how to view, analyse and process the datasets found. The suggestions are very general, as ultimately your choices will largely be determined by your specific field and educational goal.
View
Txt files contain "plain text". They can be viewed in any "plain" text editor, such as Notepad. If you want to compare several text files, e.g. on style features, the programme AntConc is suitable.
Csv and tsv files contain tabular data, which can be viewed in Excel with an intermediate step, demonstrated in the next video:
"How to convert txt file to csv or excel file" by Krishna Ojha, 2020
Analyse
Files in both csv and tsv format can be read in OpenRefine (free). That programme is useful for those without programming knowledge and is suitable for analysis tasks, such as displaying frequency of unique values. OpenRefine can also be used to enrich the data with data from other sources.
The next video demonstrates the use of OpenRefine.
"OpenRefine demo" by Henaramay, 2020
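For those who do have some programming knowledge, the frequency count mentioned above can also be sketched in a few lines of Python, using only the standard library. The sample table is invented for the example; in practice the text would be read from a csv or tsv file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample table (made up for this example).
SAMPLE = "title,licence\nA,CC BY\nB,CC0\nC,CC BY\n"


def value_frequencies(text, column, delimiter=","):
    """Count how often each unique value occurs in one column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return Counter(row[column] for row in reader)


frequencies = value_frequencies(SAMPLE, "licence")
for value, count in frequencies.most_common():
    print(value, count)
```

For a tsv file, the same function can be called with "\t" as the delimiter.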
Processing
More advanced analysis, processing, manipulation and visualisation of the data requires programming knowledge and a programming environment, e.g. Python with the pandas library. This is beyond the scope of what we cover here.
______________
Thanks
You are at the end of the tutorial.
Thanks for your participation and best of luck in finding and using open data in your teaching.
Any comments, suggestions or questions? You can contact Alice Doek.
The arrangement Open data for education was made with Wikiwijs by Kennisnet. Wikiwijs is the educational platform where you find, create and share learning materials.
This learning material is published under the Creative Commons Attribution-ShareAlike 4.0 International licence. This means that, on condition of attribution and publication under the same licence, you are free to:
share the work - copy, distribute and pass it on in any medium or file format
adapt the work - remix, change and create derivative works
for any purpose, including commercial purposes.