Open data for education

Introduction

This tutorial is intended for teachers who (want to) use open data in their teaching.

Upon completion:

  •   you will be able to determine which data are open and which are not
  •   you will know what to look out for when using datasets in education
  •   you will know about various websites where you can search for open data

Reading the text, watching the videos and doing the assignments will take about an hour and a half.

 

 

 

Dutch version / Nederlandstalige versie

 

 

University Library, photographer Wouter van der Wolk

The tutorial was created by staff of the Library of the University of Amsterdam and the Amsterdam University of Applied Sciences. It is intended for lecturers working there, but can also be used elsewhere.

Usage in class

As a teacher, why would you want to find and use open data in education?

Data skills are important in more and more courses and for more and more professions.

Of course, you can use self-created example datasets, but for students it is valuable to get to know datasets from real life. By using real life data, you can make your education more attuned to reality.

Higher education prepares students to do research independently, among other things. By working with datasets during their studies, students become more aware of the importance of data for their own research.

And finally, the use of open data in education promotes the principle of open science: the principle that the results of science should be transparent and accessible to society as a whole.

"Open Science" by NWO Wetenschap, 2020

Data

What is data?

"What is data?" by University of Guelph McLaughlin Library, 2019, CC BY-NC-SA

Data can be described as the raw material for information. By itself, data has (or have, see Note 1) no meaning; context and interpretation are needed to answer the questions who, what, where and when, in other words: to transform the data into information. That information can then be used to support an argument and thus serve science, public administration or business.

Data collected by researchers themselves are called primary data. Data from other sources are called secondary data: these are already existing data, for instance found in a government database or a scientific publication.

Primary data can be created in many ways:

  •    observation
  •    measurement
  •    interviews
  •    case studies
  •    surveys
  •    crowdsourcing (contributions from interested lay people to research).

Another classification for data is qualitative and quantitative: qualitative data are not numerical; quantitative data are.

A dataset is a collection of related data. When data are published openly, it is usually in the form of a dataset.

A data paper describes, according to custom within a scientific discipline, how a particular dataset available online should be interpreted.

Metadata are data about data.


__________________


Note

1.

In the traditional sense, data is the plural of datum. A datum is 'something that is given' and can be counted (1 datum, 2 data, 3 data, etc.). In Dutch, where datum also means calendar date, this countable use is still common. Example:
"are these dates suitable for everyone?"
When talking about data in science, it is not common to count them. Any quantity can be referred to as data, and the word is treated as either singular or plural. For example, the New York Times uses the singular and plural forms side by side:

"the survey data are still being analysed"

and

"the first year for which data is available".

This course also uses singular and plural side by side for data.

 

 

Open

Open science and FAIR

Openness of research data fits into the global Open Science movement, in which more and more scientists realise that the results of their publicly funded research, including the underlying data, should be made available to a wide audience to promote transparency.

In 2014, scientists internationally agreed to describe, store and publish scientific data according to the FAIR principles. FAIR is an acronym for findable, accessible, interoperable and reusable.

The Netherlands Organisation for Scientific Research (NWO) has also embraced Open Science, making similar conditions mandatory for scientists funded by NWO.

 

 

 

 

 

"Open Data - explained in a nutshell" by Simpleshow Foundation

Open data outside academia

Open Government

Many government organisations openly publish some of the data they collect, for the sake of transparency and accountability, the ideals of the Open Government movement.

Open GLAM

Also in the heritage sector (GLAM stands for galleries, libraries, archives and museums), there have been initiatives since 2010 to make data on collections openly available to the public.

The Library UvA/HvA is doing just that, for example.

 

Conditions for "open"


Data is open when the following conditions are met:

  •    comprehensibility for machines
  •    presence of metadata
  •    presence of an open licence

Machine-readable

It is important that data is machine-readable. This does not mean the same as "digital".

Many document formats we create every day using our laptops and computers are not machine-readable. They are unreadable unless you have the right software package.

Examples:

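  •    A PDF file can be displayed by Adobe Reader and other PDF viewers, but it is designed for human readers: software cannot reliably extract the tables and values it contains.
  •    An Excel file (.xlsx) can be opened in MS Excel and similar spreadsheet programmes, but programming environments can only read it with the help of additional libraries, and formulas, colours and merged cells get in the way of automatic processing.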

"But surely you can easily save an xlsx file in the csv format?" Yes, but then information is lost, such as formulas and meaningful use of colours.

Open data should be extractable and processable by the computer. The guarantee for this is the file format.

Machine-readable formats include:

  •    .tsv or tab separated values: a table in which each row starts on a new line and a horizontal tab (white space) separates adjacent columns
  •    .csv or comma separated values: a table in which each row starts on a new line and a comma or semicolon separates adjacent columns
  •    .txt or plain text: this is a text stripped of all formatting, font and images
  •    .json: in this format, relationships between objects can be described.

Files in these formats are plain text: they can be opened and read on any computer, with a simple text editor or a few lines of code, regardless of the other software installed.
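For teachers who want to see what "machine-readable" means in practice, the following is a minimal sketch in Python that reads the same small table in csv, tsv and json form using only the standard library. The sample content (stations A and B and their values) is made up for illustration; a real dataset would be read from a downloaded file.

```python
import csv
import io
import json

# Made-up sample content; station names and values are purely illustrative.
CSV_TEXT = "station,year,rainfall_mm\nA,2020,850\nB,2020,790\n"
TSV_TEXT = "station\tyear\trainfall_mm\nA\t2020\t850\nB\t2020\t790\n"
JSON_TEXT = '[{"station": "A", "year": 2020, "rainfall_mm": 850}]'


def read_table(text, delimiter):
    """Parse delimited text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


csv_rows = read_table(CSV_TEXT, ",")
tsv_rows = read_table(TSV_TEXT, "\t")
json_rows = json.loads(JSON_TEXT)

# csv and tsv store everything as text; json keeps numbers as numbers.
print(csv_rows[0]["rainfall_mm"], json_rows[0]["rainfall_mm"])
```

Note that the csv and tsv versions yield identical rows: only the separator differs.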

More information on file formats


Presence of metadata

A table filled with numbers and/or text needs basic explanations:

  •    what are these data about?
  •    how were they obtained?
  •    where and when were they obtained?
  •    what unit of measurement was used?
  •    what do the used abbreviations mean?
  •    etc.

The answer to those questions should become clear using metadata ("data about data"). It is common practice to add these to the dataset in a separate text file.
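As an illustration of such a separate text file, the sketch below turns a handful of answers to the questions above into the text of a simple README. The field names and all values are hypothetical; repositories and disciplines often prescribe their own metadata fields.

```python
# Hypothetical metadata for an imaginary rainfall dataset; all values are examples.
metadata = {
    "Subject": "daily rainfall per weather station",
    "Method": "automatic rain gauges",
    "Place and period": "the Netherlands, 2020",
    "Unit of measurement": "mm",
    "Abbreviations": "st = station, rf = rainfall",
}


def render_readme(fields):
    """Turn a dictionary of metadata into the text of a plain README file."""
    return "\n".join(f"{label}: {value}" for label, value in fields.items()) + "\n"


readme_text = render_readme(metadata)
print(readme_text)
```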

 

Open licence

Publishing a dataset on the internet does not in itself make it "open", because its creator holds the copyright! If the creator has not explicitly stated that the dataset is open for re-use, then the data is not open.

As with other "works of art, science or literature" (as the Copyright Act puts it), the creator of a dataset also has copyright on it: only the creator has the right to reproduce or distribute the dataset.

Copyright arises automatically, i.e. not only because the creator has placed a © sign.
And it persists even after the creator has published or allowed the dataset to be published on a website.

This means that, in principle, a user may only view and download someone else's dataset for their own use. Making copies for a group of students, combining the data with others, and then republishing or distributing them are infringements of the creator's copyright. Even in education!

Unless... the creator has given prior conditional permission for use and re-use, i.e. has attached a licence (US spelling: license) to the work. This preserves the creator's copyright, but creates opportunities for others to use and distribute the dataset.

Without a licence, the dataset cannot be open!


How do you find out if a dataset has a licence attached to it?

The creator could write out the licence terms themselves. However, this rarely, if ever, happens.

Creators almost always use an existing licensing system. Not having to invent and write out terms saves them time, and not having to read them saves you time.

Creative Commons (CC) is the most widely used licensing system worldwide. In this system, logos and abbreviations are used for terms and conditions. The creator selects one or more logos and/or abbreviations.

  •    CC-BY: re-use is allowed on condition that a correct source citation is added.
  •    CC-ND: re-use is permitted provided no derivative works are published.
  •    CC-SA: re-use is permitted, provided derivative works are published under the same licence (share alike).
  •    CC-NC: re-use is permitted, but only for non-commercial purposes.
  •    CC-0: no conditions; public domain.
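In practice the abbreviations are combined, as in "CC BY-NC-SA". The sketch below shows one way to spell out such a combination in Python, using the conditions listed above. The function name explain_licence is hypothetical, and the sketch assumes the code contains no version number (such as 4.0); it is an aid for reading licence codes, not legal advice.

```python
# Conditions per abbreviation, paraphrased from the list above.
CONDITIONS = {
    "BY": "a correct source citation must be added",
    "ND": "no derivative works may be published",
    "SA": "derivative works must be published under the same licence",
    "NC": "re-use is for non-commercial purposes only",
}


def explain_licence(code):
    """Return the list of conditions for a code such as 'CC BY-NC-SA'."""
    parts = code.upper().replace("-", " ").split()
    elements = [part for part in parts if part != "CC"]
    if elements in (["0"], ["CC0"]):
        return []  # CC0: no conditions
    return [CONDITIONS[element] for element in elements]


for condition in explain_licence("CC BY-NC-SA"):
    print(condition)
```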


 

Other licence forms

Some governments and international organisations do not use Creative Commons but have created their own licences, such as the UK Open Government Licence, the World Bank Terms of Use and the French government's Licence Ouverte.

 

"What is Creative Commons? Creative Commons License Types Basics Explained" by Creative Common Studio, 2020

Assignments

Examples

Websites

Many open datasets can be found on the internet. Below are some examples of websites offering them.

 

Science

 
https://data.4tu.nl/portal

 

 

Governments

Such as the city of Amsterdam:

 

Datasets

Below are some examples of datasets.

In the left-hand column, you see the web page introducing the dataset. Each website obviously has its own way of presenting it, but constants are:

  •    title
  •    creator
  •    (link to) description (metadata)
  •    size
  •    date/year of publication
  •    a button to download the dataset to your computer.

Usually, you will also find a licence indication, which tells you whether and how you may reuse this dataset. If such an indication is not there, check the general information of the data collector, usually in the About section of the site: the same licence may apply to all sets. Again: without a licence, the data is not open.

In the right-hand column is an example of the data, as you can download it to your own computer, viewed with a "plain" text editor.

 

 

Web page: https://data.4tu.nl/articles/dataset/Participatory_Value_Evaluation_for_relaxation_of_COVID-19_measures/14413958
View of the data after downloading: https://data.4tu.nl/articles/dataset/Participatory_Value_Evaluation_for_relaxation_of_COVID-19_measures/14413958?file=27556598

Web page: https://www.kaggle.com/datasets/elgunisgandarli/active-and-awarded-grants-usa
View of the data after downloading: https://www.kaggle.com/datasets/elgunisgandarli/active-and-awarded-grants-usa?resource=download

Web page: https://ec.europa.eu/eurostat/databrowser/view/tag00081/default/table?lang=en
View of the data after downloading: https://ec.europa.eu/eurostat/databrowser/product/view/FISH_CA_ATL37

 

 

 

Quality and usability

To determine which open datasets (i.e. licensed) are suitable for use in education, you can use various assessment criteria.

Look especially for downloadable datasets and not for real-time data that is continuously updated via an API ("application programming interface"), as is the case for stock market data, for example. Because such data keeps changing, it is difficult to give a group of students access to the same data.

Metadata

Is metadata present, so you can see how this data has been/is being collected?

Source

Who (person or body) is the creator of the dataset and to what extent does it inspire trust?
Can you be confident that it will be stable for the duration of the teaching block?

Size

Is the dataset not too large?
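Keep in mind that students do not all have very modern computers. A dataset has to fit into the computer's working memory (RAM) together with the operating system and any other programmes that are running. On a computer with 4 GB of RAM, a dataset of several GB will therefore be difficult or impossible to process.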
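For those who want to check a file before handing it to students, the following minimal Python sketch reports the size of a csv file and counts its rows by reading one row at a time, so that the whole file never has to fit into memory. The file sample.csv is a tiny made-up example created on the spot; the function names are hypothetical.

```python
import csv
import os
import tempfile


def size_in_mb(path):
    """Return the file size in megabytes."""
    return os.path.getsize(path) / (1024 * 1024)


def count_rows(path):
    """Count data rows by streaming, so only one row is in memory at a time."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


# Demonstration on a tiny, made-up file in a temporary folder.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.csv")
    with open(sample_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("id,value\n1,a\n2,b\n3,c\n")
    sample_size = size_in_mb(sample_path)
    sample_rows = count_rows(sample_path)

print(f"{sample_rows} rows, {sample_size:.6f} MB")
```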

File format

Is the file format suitable for processing by students?
The formats .csv, .tsv and .txt can be read by any computer without any problems.
The extensions .zip and .gz indicate compressed ("packed") files: a .zip file can contain a whole folder of files, a .gz file usually contains a single file. What the actual format is becomes clear only after unpacking.
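The sketch below shows, under made-up example content, how Python's standard library can look inside packed files without unpacking them by hand: it lists the file names in a .zip archive and reads the first line of a .gz file. The archive is built in memory here only so that the example is self-contained; with a real download you would pass the file path instead.

```python
import gzip
import io
import zipfile

# Build a small zip archive in memory (made-up file names and content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("measurements.csv", "id,value\n1,a\n")
    archive.writestr("README.txt", "Hypothetical description of the data")

# List the contents: the extensions reveal the actual formats.
with zipfile.ZipFile(buffer) as archive:
    packed_names = archive.namelist()

# A .gz file holds one compressed file; read its first line as text.
packed = gzip.compress("id\tvalue\n1\ta\n".encode("utf-8"))
with gzip.open(io.BytesIO(packed), "rt", encoding="utf-8") as handle:
    first_line = handle.readline()

print(packed_names, repr(first_line))
```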

Finding

Starting points

Datasets are collected and offered on all kinds of websites. We provide some key starting points here.


Science

Researchers from many scientific institutes store their data in one of these repositories:

  Figshare: http://figshare.com
  Dataverse: http://dataverse.org
  OSF: https://osf.io/

The content of university repositories is limited to the output of one or a few institutions. At the UvA and HvA this is

  UvA/HvA-Figshare: https://uvaauas.figshare.com/

National repositories: in these, research results including datasets from several universities in a country are made accessible, often by "harvesting" ( = retrieving information) from university repositories. In the Netherlands, this is 

in which mainly output from humanities and social sciences can be found.

For datasets related to exact science, engineering and medicine, the best places to look are

  DANS Easy: https://easy.dans.knaw.nl/ and
  4TU data centre: https://data.4tu.nl/portal

Public sector

  The City of Amsterdam: https://data.amsterdam.nl/
  The European Union: https://data.europa.eu/euodp/nl/home
  Dutch government: https://data.overheid.nl/

Subject areas

There are also all kinds of subject-specific data search engines. On the websites of many university libraries, information specialists offer an anthology of these for their specific field.
See for example the data management pages per discipline of the

  UvA Library: https://uba.uva.nl/en/search-the-collection/search-by-discipline
  (choose a discipline and then click on Data Management; this is not yet available for all disciplines)

Overarching

There are also metacatalogues, or "repositories of repositories". These do not list the datasets themselves, but the repositories that collect them. To search them successfully, it is wise to use broad subject categories.

Example: if you are looking for datasets on precipitation in a particular year in Europe, first search on the broader topic 'weather'. The metacatalogue links to various repositories. Only once you are there, use more specific search terms such as 'rainfall'.

  Registry of Research Data Repositories: https://www.re3data.org/
  OpenDOAR: https://v2.sherpa.ac.uk/opendoar
  Dataportals: https://dataportals.org (geographically ordered)
  Datahub: https://datahub.io

Google

You can also use the general search engine Google.com to search for datasets. To avoid drowning in the number of irrelevant results, we offer the following tips:

- in addition to the subject, type

data OR dataset OR "data set"

in the search query.

- You can search specifically for a certain file format with, for example

filetype:csv

and for data from a particular site or internet domain with, for example

site:.gov

- Before words that should NOT appear in the search result, place a - (minus sign).
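The tips above can be combined into one search query. The following Python sketch assembles such a query and the corresponding search URL; the function name build_query and the example subject, site and excluded word are hypothetical.

```python
from urllib.parse import urlencode


def build_query(subject, filetype=None, site=None, exclude=()):
    """Combine a subject with the dataset keywords and optional operators."""
    parts = [subject, 'data OR dataset OR "data set"']
    if filetype:
        parts.append(f"filetype:{filetype}")
    if site:
        parts.append(f"site:{site}")
    # A minus sign in front of a word excludes it from the results.
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


query = build_query("rainfall Europe", filetype="csv", site=".gov", exclude=["forecast"])
search_url = "https://www.google.com/search?" + urlencode({"q": query})
print(query)
print(search_url)
```

Pasting the printed query into the search box has the same effect as opening the URL.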


Google also offers a dataset search engine, which left its beta phase in 2020:

  Google Dataset Search: https://datasetsearch.research.google.com/

Wikidata

The online encyclopaedia Wikipedia and other Wikimedia projects such as Wiktionary (dictionary) and Wikivoyage (travel guide), have an underlying database cum classification system: Wikidata.

Like the content of these reference works, Wikidata is also a product of crowdsourcing.

Wikidata has an open licence (CC0) and is special because it is not merely about searching for existing datasets. It allows you to generate your own datasets based on your own search. These can be downloaded in csv, tsv, and json format and used for any purpose.
Older versions are also available for download.

Please note that Wikidata is constantly changing!

Searches in Wikidata require knowledge of the structure of Wikidata and of the SPARQL query language, but all kinds of help are available, including the Wikidata Query Builder and the Query Helper.

 

Triples

Like many knowledge databases, Wikidata is composed of so-called triples. A triple is a set of subject, predicate and object. The predicate establishes the relationship between subject and object.

By CmplstofB - Own work, WTFPL, https://commons.wikimedia.org/w/index.php?curid=82141957

 

Example

A triple can be formed by:

Subject: "Cristiano Ronaldo"
Predicate: "has been awarded with"
Object: the "Bravo Award 2004"
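To make the idea concrete, here is a minimal Python sketch that stores a few triples and looks up all objects for a given subject and predicate. Only the first triple comes from the example above; the other two are added purely for illustration, and the class and function names are hypothetical. This is not how Wikidata stores its data internally.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


triples = [
    Triple("Cristiano Ronaldo", "has been awarded with", "Bravo Award 2004"),
    # The next two triples are illustrative additions, not taken from the text.
    Triple("Cristiano Ronaldo", "is a", "football player"),
    Triple("Bravo Award 2004", "is a", "award"),
]


def objects_of(subject, predicate):
    """Return every object linked to the subject by the given predicate."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


awards = objects_of("Cristiano Ronaldo", "has been awarded with")
print(awards)
```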

Suppose you want a dataset containing all Cristiano Ronaldo's awards and their corresponding years.

Wikidata: query for awards for Cristiano Ronaldo, ordered by year

 

Explanation:

After clicking the blue arrow, the search is started and the dataset is created.
The dataset shows, among other things:

The set may be downloaded in tsv, csv and json format.
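For readers curious about what such a query looks like, the following Python sketch builds a SPARQL query for the awards of one person, plus a URL that requests the result as json from the public Wikidata query service. It is a sketch under assumptions, not the query behind the link above: the identifiers Q11571 (Cristiano Ronaldo), P166 (award received) and P585 (point in time) are given to the best of our knowledge and should be verified on wikidata.org, and the function names are hypothetical. The code only builds the query and the URL; it does not contact the service.

```python
from urllib.parse import urlencode

ENDPOINT = "https://query.wikidata.org/sparql"


def awards_query(entity_id):
    """Build a SPARQL query listing awards and years for one Wikidata item."""
    return f"""
SELECT ?awardLabel ?year WHERE {{
  wd:{entity_id} p:P166 ?statement .
  ?statement ps:P166 ?award .
  OPTIONAL {{ ?statement pq:P585 ?date . }}
  BIND(YEAR(?date) AS ?year)
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
ORDER BY ?year
""".strip()


def download_url(entity_id):
    """Return a URL that asks the query service for the result in json format."""
    return ENDPOINT + "?" + urlencode({"query": awards_query(entity_id), "format": "json"})


# Q11571 is assumed to be the Wikidata identifier for Cristiano Ronaldo.
sparql = awards_query("Q11571")
url = download_url("Q11571")
print(sparql)
```

The printed query can also be pasted into the Wikidata Query Service in the browser, where the result can be downloaded in the formats mentioned above.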

 

Video

In the following video, the whole process is explained, now with reference to the residences of all women who studied at a particular university.
Only the first 10 minutes are relevant to our topic.

"Wikidata SPARQL Query Tutorial", by Wikimedian in Residence - University of Edinburgh

Assignments

How to proceed

Coming to the end of this tutorial, we give some suggestions on how to view, analyse and process the datasets found. The suggestions are very general, as ultimately your choices will largely be determined by your specific field and educational goal.


View

Txt files contain "plain text". They can be viewed in any "plain" text editor, such as Notepad, among others. If you want to compare several text files, e.g. on style features, the programme AntConc is suitable.

Csv and tsv files contain tabular data, which can be viewed in Excel after an intermediate import step, demonstrated in the next video:

 

"How to convert txt file to csv or excel file" by Krishna Ojha, 2020

Analyse

Files in both csv and tsv format can be read in OpenRefine (free). That programme is useful for those without programming knowledge and is suitable for analysis tasks, such as displaying frequency of unique values. OpenRefine can also be used to enrich the data with data from other sources.
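For those who do have some programming knowledge, a frequency count of unique values can also be sketched in a few lines of Python. This is an alternative to OpenRefine, not a description of how OpenRefine works; the sample table and its column names are made up.

```python
import csv
import io
from collections import Counter

# Made-up sample table; in practice this would be a downloaded csv file.
SAMPLE = "name,category\nA,museum\nB,library\nC,museum\n"

rows = csv.DictReader(io.StringIO(SAMPLE))

# Count how often each unique value occurs in the "category" column.
frequencies = Counter(row["category"] for row in rows)
most_common = frequencies.most_common()
print(most_common)
```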

The next video demonstrates the use of OpenRefine.

"OpenRefine demo" by Henaramay, 2020

Processing

More advanced analysis, processing, manipulation and visualisation of the data requires programming knowledge and a programming environment, e.g. Python with the pandas library. This is beyond the scope of what we cover here.
______________


Thanks

You are at the end of the tutorial.
Thanks for your participation and best of luck in finding and using open data in your teaching.

Any comments, suggestions or questions? You can contact Alice Doek.

 

 

 

  • The arrangement Open data for education was made with Wikiwijs by Kennisnet. Wikiwijs is the educational platform where you can find, create and share learning materials.

    Last modified
    2023-08-07 11:41:16
    Licence
    CC Attribution-ShareAlike 4.0 International licence

    This teaching material is published under the Creative Commons Attribution-ShareAlike 4.0 International licence. This means that, on condition of attribution and publication under the same licence, you are free to:

    • share the work - copy, distribute and transmit it in any medium or file format
    • adapt the work - remix, change and make derivative works
    • for any purpose, including commercial purposes.

    More information about the CC Attribution-ShareAlike 4.0 International licence.

    Additional information about this teaching material

    The following additional information is available for this teaching material:

    Description
    On finding and using open data in higher education
    End user
    teacher
    Difficulty level
    average
    Study load
    1 hour and 30 minutes
    Keywords
    use, education, open data, searching

    Sources

    Source Type
    "Open Science" by NWO Wetenschap, 2020
    https://youtu.be/BIHuPGg0YT0
    Video
    "What is data?" by University of Guelph McLaughlin Library, 2019, CC BY-NC-SA
    https://youtu.be/pg12U1BAnoA
    Video
    "Open Data - explained in a nutshell" by Simpleshow Foundation
    https://youtu.be/c42QNa-rccw
    Video
    "What is Creative Commons? Creative Commons License Types Basics Explained" by Creative Common Studio, 2020
    https://youtu.be/4MYSVhKcnaA
    Video
    "Wikidata SPARQL Query Tutorial", by Wikimedian in Residence - University of Edinburgh
    https://youtu.be/1jHoUkj_mKw
    Video
    "How to convert txt file to csv or excel file" by Krishna Ojha, 2020
    https://youtu.be/d9i2nBhg3aM
    Video
    "OpenRefine demo" by Henaramay, 2020
    https://youtu.be/yjLIRNpc2RQ
    Video

    Wikiwijs arrangements used

    Team Informatievaardigheid, Bibliotheek UvA. (2022).

    Open data voor onderwijs

    https://maken.wikiwijs.nl/186751/Open_data_voor_onderwijs