Introduction

This tutorial is intended for teachers who (want to) use open data in their teaching.

Upon completion:

you will be able to determine which data are open and which are not
you will know what to look out for when you use, or want to use, datasets in education
you will know about various websites where you can search for open data

Reading the text, watching the videos and doing the assignments will take about an hour and a half.

University Library, photographer Wouter van der Wolk

The tutorial was created by staff of the Library of the University of Amsterdam and the Amsterdam University of Applied Sciences. It is intended for lecturers working there, but can also be used elsewhere.

Usage in class

As a teacher, why would you want to find and use open data in education?

Data skills are important in more and more courses and for more and more professions.

Of course, you can use self-created example datasets, but for students it is valuable to get to know datasets from real life. By using real life data, you can make your education more attuned to reality.

Higher education prepares students to do research independently, among other things. By working with datasets during their studies, students become more aware of the importance of data for their own research.

And finally, the use of open data in education promotes the principle of open science: the principle that the results of science should be transparent and accessible to society as a whole.

"Open Science" by NWO Wetenschap, 2020

Data

What is data?

"What is data?" by University of Guelph McLaughlin Library, 2019, CC BY-NC-SA

Data can be described as the raw material for information. By itself, data has/have¹ no meaning; context and interpretation are needed to answer the questions who, what, where and when, in other words: to transform the data into information. That information can then be used to support an argument and thus serve science, public administration or business.

Data collected by researchers themselves are called primary data. Data from other sources are called secondary data: these are already existing data, for instance found in a government database or a scientific publication.

Primary data can be created in many ways:

observation
measurement
interviews
case studies
surveys
crowdsourcing (contributions from interested lay people to research).

Another classification for data is qualitative and quantitative: qualitative data are not numerical; quantitative data are.

A dataset is a collection of related data. When data are published openly, this is usually in the form of a dataset.

A data paper describes, according to custom within a scientific discipline, how a particular dataset available online should be interpreted.

Metadata are data about data.

__________________

Note

In the traditional sense, data is the plural of datum. A datum is 'something that is given' and can be counted (1 datum, 2 data, 3 data etc). In Dutch, the language in which this tutorial was first written, the same plural is also used for calendar dates. Example:
"are these dates suitable for everyone?"
When talking about data in science, it is not common to count them. Any quantity can be referred to as data, treated grammatically as either singular or plural. For example, the New York Times uses the singular and plural forms side by side:

"the survey data are still being analysed"

and

"the first year for which data is available".

This course also uses singular and plural side by side for data.

Open

Open science and FAIR

Openness of research data fits into the global Open Science movement, in which more and more scientists realise that the results of their publicly funded research, including the underlying data, should be made available to a wide audience to promote transparency.

In 2014, scientists internationally agreed to describe, store and publish scientific data according to the FAIR principles from then on. FAIR is an acronym for findable, accessible, interoperable and reusable.

The Netherlands Organisation for Scientific Research (NWO) has also embraced Open Science, making similar conditions mandatory for scientists funded by NWO.

"Open Data - explained in a nutshell" by Simpleshow Foundation

Open data outside academia

Open Government

Many government organisations are publishing some of the data they acquire openly for the sake of transparency and accountability, the ideals of the Open Government movement.

Open GLAM

Also in the heritage sector (GLAM stands for galleries, libraries, archives and museums), there have been initiatives since 2010 to make data on collections openly available to the public.

The Library UvA/HvA is doing just that, for example.

Conditions for "open"

Data is open when the following conditions are met:

machine-readability
presence of metadata
presence of an open licence

Machine-readable

It is important that data is machine-readable. This does not mean the same as "digital".

Many document formats we create every day using our laptops and computers are not machine-readable. They are unreadable unless you have the right software package.

Examples:

A PDF file can be displayed by a PDF viewer such as Adobe Reader, but the data in it (a table, for example) cannot easily be extracted by other software.
An Excel file can be opened and displayed by MS Excel, but other programmes and programming environments can usually read it only with additional software.

"But surely you can easily save an xlsx file in the csv format?" Yes, but then information is lost, such as formulas and meaningful use of colours.

Open data should be extractable and processable by the computer. The guarantee for this is the file format.

Machine-readable formats include:

.tsv or tab separated values: a table in which every row starts on a new line and there is a horizontal tab (white space) between every 2 columns
.csv or comma separated values: a table in which each row starts on a new line and a comma or semicolon appears between each 2 columns
.txt or plain text: this is a text stripped of all formatting, font and images
.json: in this format, relationships between objects can be described.

Files in these formats can always be opened and read by any computer, regardless of the software installed.
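To make "extractable and processable by the computer" concrete, here is a minimal sketch in Python that reads csv, tsv and json content using only the standard library. The sample values (city names, rainfall figures) are made up for illustration and do not come from a real dataset.

```python
import csv
import io
import json

# Made-up sample content; a real dataset would be read from a downloaded file.
CSV_TEXT = "city,year,rainfall_mm\nAmsterdam,2020,850\nUtrecht,2020,870\n"
TSV_TEXT = "city\tyear\trainfall_mm\nAmsterdam\t2020\t850\n"
JSON_TEXT = '{"city": "Amsterdam", "measurements": [{"year": 2020, "rainfall_mm": 850}]}'


def read_table(text, delimiter):
    """Parse delimited text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_table(CSV_TEXT, ",")    # comma separated values
tsv_rows = read_table(TSV_TEXT, "\t")   # tab separated values
record = json.loads(JSON_TEXT)          # nested objects and lists

print(csv_rows[1]["city"])
print(tsv_rows[0]["rainfall_mm"])
print(record["measurements"][0]["year"])
```

Note that csv and tsv values arrive as text, so numbers still have to be converted before calculating with them; json keeps the distinction between text and numbers.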

More information on file formats

Presence of metadata

A table filled with numbers and/or text needs basic explanations:

what are these data about?
how were they obtained?
where and when were they obtained?
what unit of measurement was used?
what do the used abbreviations mean?
etc.

The answer to those questions should become clear using metadata ("data about data"). It is common practice to add these to the dataset in a separate text file.
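As an illustration only, the questions above could be captured in a small metadata record and written out as a plain text file. The field names and the example values in this Python sketch are hypothetical; real repositories use their own metadata schemes.

```python
# Hypothetical metadata for an invented rainfall dataset.
metadata = {
    "subject": "Daily rainfall in an example city",
    "method": "Rain gauge readings",
    "place_and_time": "Example city, 2020",
    "units": "millimetres per day",
    "abbreviations": "mm = millimetres",
}

# One field per basic question from the list above (names are assumptions).
REQUIRED = ["subject", "method", "place_and_time", "units", "abbreviations"]


def missing_fields(record):
    """Return the basic questions that the metadata leaves unanswered."""
    return [key for key in REQUIRED if not record.get(key, "").strip()]


def to_readme(record):
    """Format the metadata as plain text, one 'key: value' line per field."""
    lines = [f"{key}: {record[key]}" for key in REQUIRED if key in record]
    return "\n".join(lines) + "\n"


print(missing_fields(metadata))
print(to_readme(metadata))
```

A check like `missing_fields` can help when assessing a dataset: if the accompanying text file leaves several of these questions open, the data will be hard for students to interpret.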

Open licence

Publishing a dataset on the internet does not make it "open". This is because its creator has copyright! If the creator has not explicitly stated that the dataset is open for re-use, then the data is not open.

As with other "works of art, science or literature" (as the Copyright Act puts it), the creator of a dataset has copyright on it: only the creator has the right to reproduce or distribute the dataset.

Copyright arises automatically, i.e. not only because the creator has placed a © sign.
And it persists even after the creator has published or allowed the dataset to be published on a website.

This means that, in principle, a user may only view and download someone else's dataset for their own use. Making copies for a group of students, combining the data with others, and then republishing or distributing them are infringements of the creator's copyright. Even in education!

Unless... the creator has given prior conditional permission for use and re-use, i.e. has attached a licence to the work. A licence preserves the creator's copyright, but creates opportunities for others to re-use and distribute the dataset.

Without a licence, the dataset cannot be open!

How do you find out if a dataset has a licence attached to it?

The creator could write out the licence terms themselves. However, this rarely, if ever, happens.

Creators almost always use an existing licensing system. Not having to invent and write out terms saves them time, and not having to read them saves you time.

Creative Commons (CC) is the most widely used licensing system worldwide. In this system, logos and abbreviations are used for terms and conditions. The creator selects one or more logos and/or abbreviations.

abbreviation	condition
CC-BY	Re-use is permitted provided a correct source citation is added.
CC-ND	Re-use is permitted provided no derivative works are published.
CC-SA	Re-use is permitted provided derivative works are published under the same licence (share alike).
CC-NC	Re-use is permitted, but only for non-commercial purposes.
CC0	No conditions; public domain.
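Creators often combine these abbreviations, for example CC BY-NC-SA. The following Python sketch shows one way to spell out what such a combination means, using the conditions from the table. The function name `explain_licence` is hypothetical, and the sketch simply ignores anything it does not recognise, such as a version number.

```python
# Conditions per abbreviation, paraphrased from the table above.
CONDITIONS = {
    "BY": "a correct source citation must be added",
    "ND": "no derivative works may be published",
    "SA": "derivative works must be published under the same licence",
    "NC": "re-use is only allowed for non-commercial purposes",
}


def explain_licence(code):
    """Return the conditions for a code such as 'CC BY-NC-SA' or 'CC0'."""
    # Drop the leading 'CC' and any separators around the remainder.
    remainder = code.upper().replace("CC", "", 1).strip(" -")
    if remainder == "0":
        return []  # CC0: no conditions, public domain
    parts = remainder.replace(" ", "-").split("-")
    # Unrecognised parts (e.g. a version number like '4.0') are skipped.
    return [CONDITIONS[part] for part in parts if part in CONDITIONS]


for condition in explain_licence("CC BY-NC-SA"):
    print("-", condition)
```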

Other licence forms

Some governments and international organisations do not use Creative Commons but have created their own licence forms, such as the UK Open Government Licence, the World Bank Terms of Use and the French government's Licence Ouverte.

"What is Creative Commons? Creative Commons License Types Basics Explained" by Creative Common Studio, 2020

Assignments

Examples

Websites

Many open datasets can be found on the internet. Below are some examples of websites offering them.

Science

Governments

Such as the city of Amsterdam:

https://data.amsterdam.nl/datasets/zoek/

The cultural sector

https://www.opencultuurdata.nl/datasets/

Financial markets

https://www.sec.gov/dera/data/financial-statement-data-sets.html

Collectors of statistical data

Such as Statistics Netherlands (CBS):

https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS

Weather and climate

https://data.world/fivethirtyeight/us-weather-history

Datasets

Below are some examples of datasets.

In the left-hand column, you see the web page introducing the dataset. Each website obviously has its own way of presenting it, but constants are:

title
creator
(link to) description (metadata)
size
date/year of publication
a button to download the dataset to your computer.

Usually, you will also find a licence indication, which tells you whether and how you may reuse this dataset. If such an indication is not there, check the general information of the data collector, usually in the About section of the site: the same licence may apply to all sets. Again: without a licence, the data is not open.

In the right-hand column is an example of the data, as you can download it to your own computer, viewed with a "plain" text editor.

web page	view of the data after downloading
https://data.4tu.nl/articles/dataset/Participatory_Value_Evaluation_for_relaxation_of_COVID-19_measures/14413958	https://data.4tu.nl/articles/dataset/Participatory_Value_Evaluation_for_relaxation_of_COVID-19_measures/14413958?file=27556598
https://www.kaggle.com/datasets/elgunisgandarli/active-and-awarded-grants-usa	https://www.kaggle.com/datasets/elgunisgandarli/active-and-awarded-grants-usa?resource=download
https://ec.europa.eu/eurostat/databrowser/view/tag00081/default/table?lang=en	https://ec.europa.eu/eurostat/databrowser/product/view/FISH_CA_ATL37

Quality and usability

To determine which open datasets (i.e. licensed) are suitable for use in education, you can use various assessment criteria.

Look especially for downloadable datasets and not for real-time data that is continuously updated via an API ("application programming interface"), as is the case for stock market data, for example. With continuously changing data, it is difficult to ensure that all students in a group work with the same data.

Metadata

Is metadata present, so you can see how this data has been/is being collected?

Source

Who (person or body) is the creator of the dataset and to what extent does it inspire trust?
Can you be confident that it will be stable for the duration of the teaching block?

Size

Is the dataset not too large?
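Keep in mind that students do not all have very modern computers. Many programmes load a dataset entirely into working memory (RAM), which is shared with the operating system and other programmes. On a computer with 4GB of RAM, a dataset approaching that size will therefore be hard or impossible to work with; in practice, choose datasets that are considerably smaller.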
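Before handing out a dataset, it can help to check its size and look at the first few lines without loading the whole file. The Python sketch below does this with the standard library; it creates a small temporary file with made-up content so that it can run on its own, and the helper names are hypothetical.

```python
import os
import tempfile
from itertools import islice


def size_in_mb(path):
    """Return the file size in megabytes (1 MB = 1,000,000 bytes)."""
    return os.path.getsize(path) / 1_000_000


def preview(path, n_lines=5):
    """Return the first n_lines of a text file without reading the rest."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in islice(handle, n_lines)]


# Demonstration on a small temporary file with made-up content.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000)))
    sample_path = tmp.name

sample_size = size_in_mb(sample_path)
first_lines = preview(sample_path, 3)
os.remove(sample_path)  # clean up the temporary file

print(round(sample_size, 3), "MB")
print(first_lines)
```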

File format

Is the file format suitable for processing by students?
The formats .csv, .tsv and .txt can be read by any computer without any problems.
The formats .zip and .gz indicate "packed" (compressed) files or folders; what the actual format is becomes clear only after unpacking.
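For a .zip file, the names of the packed files can be listed without unpacking anything, which already reveals the actual formats. A minimal Python sketch, assuming a small archive built in memory with made-up file names so that the example is self-contained:

```python
import io
import zipfile

# Build a small archive in memory; file names and contents are made up.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("measurements.csv", "id,value\n1,10\n")
    archive.writestr("README.txt", "Made-up example data.\n")


def list_formats(zip_bytes):
    """Return {file name: extension} for every file inside a .zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: name.rsplit(".", 1)[-1].lower()
            for name in archive.namelist()
        }


formats = list_formats(buffer.getvalue())
print(formats)
```

For a downloaded archive, `zipfile.ZipFile` can be given the path of the file on disk instead of an in-memory buffer.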

Finding

Starting points

Datasets are collected and offered on all kinds of websites. We provide some key starting points here.

Science

Researchers from many scientific institutes store their data in one of these databases:

Figshare: http://figshare.com
Dataverse: http://dataverse.org
OSF: https://osf.io/

University repositories are limited to the output of one or a few institutions. At the UvA and HvA this is

UvA/HvA-Figshare: https://uvaauas.figshare.com/

National repositories: these make research results, including datasets, from several universities in a country accessible, often by "harvesting" (= retrieving information) from university repositories. In the Netherlands, this is

Netherlands Research Portal https://netherlands.openaire.eu/

in which mainly output from humanities and social sciences can be found.

For datasets related to exact science, engineering and medicine, the best places to look are

DANS Easy: https://easy.dans.knaw.nl/ and
4TU data centre: https://data.4tu.nl/portal

Public sector

The City of Amsterdam: https://data.amsterdam.nl/
The European Union: https://data.europa.eu/euodp/nl/home
Dutch government: https://data.overheid.nl/

Subject areas

There are also all kinds of subject-specific data search engines. On the websites of many university libraries, information specialists offer a selection of these for their specific field.
See for example the data management pages per discipline of the

UvA Library: https://uba.uva.nl/en/search-the-collection/search-by-discipline
(choose a discipline and then click on Data Management; this is not yet available for all disciplines)

Overarching

There are also metacatalogues, or "repositories of repositories". These do not list the datasets themselves, but the repositories that collect them. To search them successfully, it is wise to use broad subject categories.

Example: if you are looking for datasets on precipitation in Europe in a particular year, first search on the broader topic 'weather'. The metacatalogue links to various repositories. Only once you are in one of those, use more specific search terms such as 'rainfall'.

Registry of Research Data Repositories: https://www.re3data.org/
OpenDOAR: https://v2.sherpa.ac.uk/opendoar
Dataportals: https://dataportals.org (geographically ordered)
Datahub: https://datahub.io

Google

You can also use the general search engine Google.com to search for datasets. To avoid drowning in the number of irrelevant results, we offer the following tips:

- in addition to the subject, type

data OR dataset OR "data set"

in the search query.

- You can search specifically for a certain file format with, for example

filetype:csv

and for data from a particular site or internet domain with, for example

site:.gov

- Before words that should NOT appear in the search result, place a - (minus sign).
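The tips above can be combined into one search query. The Python sketch below only assembles the text to type into the search box; the function name `build_query` and the example subject are hypothetical.

```python
def build_query(subject, filetype=None, site=None, exclude=()):
    """Combine a subject with the search operators described above."""
    parts = [subject, 'data OR dataset OR "data set"']
    if filetype:
        parts.append(f"filetype:{filetype}")  # e.g. filetype:csv
    if site:
        parts.append(f"site:{site}")          # e.g. site:.gov
    # A minus sign in front of a word excludes it from the results.
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


query = build_query("rainfall", filetype="csv", site=".gov", exclude=["forecast"])
print(query)
# rainfall data OR dataset OR "data set" filetype:csv site:.gov -forecast
```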

Google also offers a dataset search engine, launched in 2020:

Google Dataset Search: https://datasetsearch.research.google.com/

Wikidata

The online encyclopaedia Wikipedia and other Wikimedia projects such as Wiktionary (dictionary) and Wikivoyage (travel guide), have an underlying database cum classification system: Wikidata.

Like the content of these reference works, Wikidata is also a product of crowdsourcing.

Wikidata has an open licence (CC0) and is special because it is not merely about searching for existing datasets. It allows you to generate your own datasets based on your own search. These can be downloaded in csv, tsv, and json format and used for any purpose.
Older versions are also available for download.

Please note that Wikidata is constantly changing!

Searches in Wikidata require knowledge of the structure of Wikidata and of the SPARQL query language, but all kinds of help are available, including the Wikidata Query Builder and the Query Helper.

Wikidata: http://wikidata.org
Wikidata Query Builder: http://query.wikidata.org/querybuilder/?uselang=en

Triples

Like many knowledge databases, Wikidata is composed of so-called triples. A triple is a set of subject, predicate and object. The predicate establishes the relationship between subject and object.

By CmplstofB - Own work, WTFPL, https://commons.wikimedia.org/w/index.php?curid=82141957

Example

A triple can be formed by:

Subject: "Cristiano Ronaldo"
Predicate: "has been awarded with"
Object: the "Bravo Award 2004"

Suppose you want a dataset containing all Cristiano Ronaldo's awards and their corresponding years.

Wikidata: query for awards for Cristiano Ronaldo, ordered by year

Explanation:

wd:Q11571 means the subject / item: 'Cristiano Ronaldo', see https://www.wikidata.org/wiki/Q11571
p:P166 means the predicate / the property: 'awarded with', see https://www.wikidata.org/wiki/Property:P166
pq:P585 is a qualifier, to be precise: 'point in time', see https://www.wikidata.org/wiki/Property:P585

After clicking the blue arrow, the search is started and the dataset is created.
The dataset shows, among other things:

wd:Q554495, an object / value that belongs to this subject-predicate relation, namely the Bravo Award, see https://www.wikidata.org/wiki/Q554495

The set may be downloaded in tsv, csv and json format.
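For readers who want to see how the identifiers explained above fit together, here is a sketch of a SPARQL query held in a Python string. It is an assumption about what such a query could look like, not a copy of the query used in this tutorial. The sketch only builds the request URL for the public Wikidata query service and does not send anything over the network.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query: awards of Cristiano Ronaldo with their years.
QUERY = """
SELECT ?award ?awardLabel ?year WHERE {
  wd:Q11571 p:P166 ?statement .            # subject and predicate
  ?statement ps:P166 ?award .              # object: the award itself
  OPTIONAL { ?statement pq:P585 ?date . }  # qualifier: point in time
  BIND(YEAR(?date) AS ?year)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
ORDER BY ?year
"""


def request_url(query, result_format="json"):
    """Build the URL that asks the query service to run the query."""
    return ENDPOINT + "?" + urlencode({"query": query, "format": result_format})


url = request_url(QUERY)
print(url[:60] + "...")
```

The same query text can be pasted into the query editor at https://query.wikidata.org/ and run there; because Wikidata is constantly changing, the results may differ from one day to the next.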

Video

In the following video, the whole process is explained, now with reference to the residences of all women who studied at a particular university.
Only the first 10 minutes are relevant to our topic.

"Wikidata SPARQL Query Tutorial", by Wikimedian in Residence - University of Edinburgh

Assignments

How to proceed

Coming to the end of this tutorial, we give some suggestions on how to view, analyse and process the datasets found. The suggestions are very general, as ultimately your choices will largely be determined by your specific field and educational goal.

View

Txt files contain "plain text". They can be viewed in any "plain" text editor, such as Notepad. If you want to compare several text files, e.g. on style features, the programme AntConc is suitable.

Csv and tsv files contain tabular data, which can be viewed in Excel with an intermediate step, demonstrated in the next video:

"How to convert txt file to csv or excel file" by Krishna Ojha, 2020

Analyse

Files in both csv and tsv format can be read in OpenRefine (free). That programme is useful for those without programming knowledge and is suitable for analysis tasks, such as displaying frequency of unique values. OpenRefine can also be used to enrich the data with data from other sources.
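For those who do have some programming knowledge, a frequency count of unique values can also be sketched in a few lines of Python with the standard library. The sample data and the column name below are made up for illustration.

```python
import csv
import io
from collections import Counter

# Made-up sample data; in practice this would be a downloaded csv file.
SAMPLE = "name,category\na,museum\nb,library\nc,museum\nd,archive\n"


def value_counts(text, column, delimiter=","):
    """Count how often each unique value occurs in one column."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return Counter(row[column] for row in reader)


counts = value_counts(SAMPLE, "category")
for value, frequency in counts.most_common():
    print(value, frequency)
```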

The next video demonstrates the use of OpenRefine.

"OpenRefine demo" by Henaramay, 2020

Processing

More advanced analysis, processing, manipulation and visualisation of the data requires programming knowledge and a programming environment, e.g. Python Pandas. This is beyond the scope of what we cover here.
______________

Thanks

You are at the end of the tutorial.
Thanks for your participation and best of luck in finding and using open data in your teaching.

Any comments, suggestions or questions? You can contact Alice Doek.

The arrangement Open data for education was made with Wikiwijs by Kennisnet. Wikiwijs is the educational platform where you can find, create and share learning materials.

Author: Team Informatievaardigheid, Bibliotheek UvA
Last modified: 2023-08-07 11:41:16
Licence: This learning material is published under the Creative Commons Attribution-ShareAlike 4.0 International licence. This means that, on condition of attribution and publication under the same licence, you are free to:

share the work - copy, distribute and transmit it in any medium or file format

adapt the work - remix, transform and create derivative works

for any purpose, including commercial purposes.

More information about the CC Attribution-ShareAlike 4.0 International licence.

Additional information about this learning material

The following additional information is available about this learning material:

Description: On finding and using open data in higher education
End user: teacher
Difficulty: average
Study load: 1 hour 30 minutes
Keywords: use, education, open data, search

Sources

Source	Type
"Open Science" by NWO Wetenschap, 2020 https://youtu.be/BIHuPGg0YT0	Video
"What is data?" by University of Guelph McLaughlin Library, 2019, CC BY-NC-SA https://youtu.be/pg12U1BAnoA	Video
"Open Data - explained in a nutshell" by Simpleshow Foundation https://youtu.be/c42QNa-rccw	Video
"What is Creative Commons? Creative Commons License Types Basics Explained" by Creative Common Studio, 2020 https://youtu.be/4MYSVhKcnaA	Video
"Wikidata SPARQL Query Tutorial", by Wikimedian in Residence - University of Edinburgh https://youtu.be/1jHoUkj_mKw	Video
"How to convert txt file to csv or excel file" by Krishna Ojha, 2020 https://youtu.be/d9i2nBhg3aM	Video
"OpenRefine demo" by Henaramay, 2020 https://youtu.be/yjLIRNpc2RQ	Video

Wikiwijs arrangements used

Team Informatievaardigheid, Bibliotheek UvA. (2022).

Open data voor onderwijs

https://maken.wikiwijs.nl/186751/Open_data_voor_onderwijs
