write a new fetcher

=> content being moved to https://dbnomics-fetcher-toolbox.readthedocs.io/en/latest/writing_a_fetcher.html

About

Step by step, this page explains how to:

  • set up your working environment
  • write a new fetcher for the DBnomics platform, or contribute to an existing one
  • test this fetcher against the DBnomics validation script

What is a fetcher?

A fetcher is a software package dedicated to fetching (downloading) data from a given provider and converting it to the DBnomics format (cf. dbnomics-data-model). Once developed and tested in a dev environment, the fetcher is run by GitLab CI on a regular basis (often daily) to collect fresh data from the provider.

A fetcher is composed of two scripts, which can be written in any programming language (in practice, Python is the language promoted by DBnomics for writing fetchers).

The two scripts are:

  • the downloader: download.py is responsible for downloading data from the provider (the place where the data is available); it takes the download directory as a parameter (an example invocation follows this list)
    • in the DBnomics GitLab CI context, the downloaded data is committed to the corresponding source-data repository
  • the converter: convert.py is responsible for converting the downloaded data from the provider's format to the DBnomics format; it takes the source directory (where the downloaded files reside) and the target directory (where the converted data is written) as parameters
    • in the DBnomics GitLab CI context, the converted data is committed to the corresponding json-data repository
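
In practice, the two scripts are run in sequence, each taking its directories as positional parameters. For example, with the Worldbank slug used later on this page (the exact command line may vary between fetchers):

python download.py wb-source-data
python convert.py wb-source-data wb-json-data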

Git structure

For a given provider (e.g. FED), three git repositories are defined:

  • {provider}-fetcher (e.g. fed-fetcher) contains the download and convert scripts
  • {provider}-source-data (e.g. fed-source-data) is where the download script writes the downloaded data
  • {provider}-json-data (e.g. fed-json-data) is where the convert script writes the converted data, ready to be ingested by the DBnomics platform

Download process

  • download.py downloads the data from the provider and puts it, without changing its format, in a directory given as a script parameter (a minimal sketch follows this list)
  • if the source data is a zip file, the downloader unzips it but keeps the original format of the extracted files
  • when the script is executed in the context of a GitLab CI job, i.e. download.py is executed by the bash script in the .gitlab-ci.yml file of the fetcher, the downloaded data is committed to the corresponding source-data git repository. Example: the Worldbank fetcher puts Worldbank source data in the wb-source-data repository.
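
Here is a minimal sketch of such a download script, assuming a single file fetched over HTTP (the URL and file name are hypothetical; real fetchers usually download many files and handle errors and retries):

#!/usr/bin/env python3
"""Minimal download script sketch: fetch provider data into a target directory."""

import argparse
from pathlib import Path

import requests  # third-party library, listed in the fetcher's requirements.txt

# Hypothetical provider URL, for illustration only.
SOURCE_URL = "https://example.org/data/series.csv"


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("target_dir", type=Path, help="directory where downloaded data is written")
    args = parser.parse_args()

    args.target_dir.mkdir(parents=True, exist_ok=True)
    response = requests.get(SOURCE_URL)
    response.raise_for_status()
    # Keep the provider's original format: write the bytes as-is.
    (args.target_dir / "series.csv").write_bytes(response.content)


if __name__ == "__main__":
    main()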

Conversion process

  • convert.py converts the data found in a directory into the DBnomics format and puts the resulting data in another directory; both directories are given as script parameters (see the sketch after this list)
  • the format of the converted data is specified in dbnomics-data-model (this is described later)
  • when the script is executed in the context of a GitLab CI job, i.e. convert.py is executed by the bash script in the .gitlab-ci.yml file of the fetcher, the converted data is committed to the corresponding json-data git repository. Example: the Worldbank fetcher puts Worldbank json data in the wb-json-data repository.
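
And a matching sketch of a convert script, reusing the hypothetical source file from the download sketch above (the CSV column names and the metadata fields are hypothetical and abridged; dbnomics-data-model specifies the real formats):

#!/usr/bin/env python3
"""Minimal convert script sketch: read source data, write DBnomics-formatted json-data."""

import argparse
import csv
import json
from pathlib import Path


def main():
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("source_dir", type=Path, help="directory containing downloaded data")
    parser.add_argument("target_dir", type=Path, help="directory where converted data is written")
    args = parser.parse_args()

    args.target_dir.mkdir(parents=True, exist_ok=True)

    # Provider metadata; abridged, see dbnomics-data-model for the required fields.
    (args.target_dir / "provider.json").write_text(
        json.dumps({"code": "WB", "name": "World Bank"}, indent=2)
    )

    # One directory per dataset, each with a dataset.json and TSV series files.
    dataset_dir = args.target_dir / "dataset1"
    dataset_dir.mkdir(exist_ok=True)
    (dataset_dir / "dataset.json").write_text(
        json.dumps({"code": "dataset1", "name": "Example dataset"}, indent=2)
    )

    # Hypothetical source columns "period" and "value", written as one TSV series.
    with (args.source_dir / "series.csv").open() as source_file:
        rows = list(csv.DictReader(source_file))
    with (dataset_dir / "A1.B1.C1.tsv").open("w") as series_file:
        series_file.write("PERIOD\tVALUE\n")
        for row in rows:
            series_file.write(f"{row['period']}\t{row['value']}\n")


if __name__ == "__main__":
    main()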

Steps to write/contribute to a fetcher

  • clone an existing fetcher (it contains both the download and the convert scripts), or:
    • write or clone the download script
    • write or clone the convert script (which uses the data produced by the download script)
  • make some changes
  • validate the converted data (using a script available in the dbnomics-data-model project)

All these steps are described below.

Prepare your environment

[Optional] - Use a Python virtual env

Inside your working directory

  • Create a virtualenv for DBnomics with Python 3:

    virtualenv --python=python3 dbnomics_env
  • Activate the virtualenv:

    source dbnomics_env/bin/activate

Prepare destination folders

The download script needs an existing folder to put source data in, and the converter script needs an existing folder to put json data in. We could name these two directories freely, but later we will use the validation script to check that our json-data conforms to the DBnomics data model, and this script expects the json-data directory to be named [provider_slug]-json-data.

So by convention we name these folders [provider_slug]-source-data and [provider_slug]-json-data. In our example, the slug used for Worldbank is wb, so:

Inside your working directory

mkdir wb-source-data
mkdir wb-json-data

(if you're creating a fetcher from scratch, you can choose the provider slug yourself, but have a look at the official fetchers list beforehand to check that the slug is still available)

Clone or create a fetcher

Clone an existing fetcher

In this example we'll clone the Worldbank fetcher.

Inside your working directory

git clone https://git.nomics.world/dbnomics-fetchers/wb-fetcher.git
Install fetcher dependencies

Yes, because fetchers often depend on some third-party libraries. Inside the cloned fetcher directory:

pip install -r requirements.txt
Create a new fetcher

A small part of the code is common to every fetcher, so to avoid starting from scratch we created a cookiecutter (i.e. a template).

Follow the README.md of the cookiecutter repo to get started.

Now you should be ready

Your working directory may look like:

.
├── wb-fetcher
├── wb-source-data
└── wb-json-data

You're ready to start modifying the cloned fetcher, or editing the cookiecutter output to start a new fetcher.

Generate json-data

Here is the general layout of the files and directories that constitute a json-data directory (what the convert script will create):

[my_provider]-json-data/
|- category_tree.json       <-- [optional] metadata about the categorization of datasets (as a tree)
|- provider.json            <-- metadata about this provider
|- dataset1                 <-- a dataset folder
|  |- dataset.json          <-- the file containing this dataset's metadata
|  |- A1.B1.C1.tsv          <-- one series of the dataset
|  |- A1.B1.C2.tsv
|  |- A1.B2.C1.tsv
|  |- A1.B2.C2.tsv
|  |- etc.
|
|- dataset2
|  |- dataset.json
|  |- I1.J1.tsv
|  |- I1.J2.tsv
|  |- etc.

Notes:

  • you can have a look at existing json-data repos for real-world examples
  • you can also have a look at the dbnomics-data-model fixtures folder for fake examples (used to test the data model)
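
For instance, a series TSV file is simply a table of observations with tab-separated columns; schematically (check the data model documentation for the exact header and value conventions):

PERIOD	VALUE
2010	21.6
2011	22.3
2012	22.9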

Using jsonl files

When a dataset contains a huge number of time series (on the order of 1000 or more), the dataset.json file grows drastically. In this case, using a series.jsonl file (JSON Lines format) is recommended, because a JSON Lines file can be parsed line by line, which consumes far less memory than loading a whole JSON file at once.
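
For example, a converter (or any consumer) can stream such a file one series at a time; a minimal sketch, assuming the dataset1 path from the layout above:

import json

# Each line of series.jsonl is one standalone JSON document (one series),
# so the file can be processed without loading it entirely into memory.
with open("dataset1/series.jsonl") as f:
    for line in f:
        series = json.loads(line)
        print(series["code"])  # the code identifying this series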

Going into further detail

For complete documentation about the structure of these files, please refer to the Storing time series section of the README of the data model project.

Validate your json-data

Generating valid data is essential for the data to be understood by the DBnomics API, and thus displayed on the DBnomics website.

Some general rules (expressed in the data model) define a set of constraints:

  • The repository directory name MUST be equal to the provider code + "-json-data"
  • Each dataset directory name MUST be equal to the corresponding dataset code
  • Conversions MUST be stable: running the conversion script twice MUST produce the same result as running it once
  • (and many others!)

Fortunately, a validation script exists to help you check all of these constraints.

In the next section, we'll explain in detail how to install the DBnomics data model and use this script.

Install data-model and use validation script

The data model defines the JSON data model of DBnomics. All json-data produced by a fetcher must be compliant with this model.

First, let's install it inside the dbnomics virtual env

  • clone the data-model repo:

    (dbnomics_env) git clone https://git.nomics.world/dbnomics/dbnomics-data-model.git
  • Install the package

    (dbnomics_env) pip install -e dbnomics-data-model/

-> this will install the dbnomics-data-model lib and, in particular, make the validation script available in the virtual env as the dbnomics-validate command (this trick is done via an entry point declared in the setup.py file of the data-model package)
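
Schematically, that trick is a standard setuptools console entry point (this is an illustrative excerpt, not the actual setup.py of the package; the module and function names are made up):

from setuptools import setup

setup(
    name="dbnomics-data-model",
    # ...
    entry_points={
        "console_scripts": [
            # Installs a `dbnomics-validate` shell command that calls the
            # package's validation entry function.
            "dbnomics-validate = dbnomics_data_model.validate:main",
        ],
    },
)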

Use validation script to validate your data

So this is the big moment! You wrote a brand new fetcher, or you fixed a bug in an existing one (what?! no way!), you ran the convert.py script on previously downloaded source-data, and you want to know whether the generated data is valid.

As we saw in the previous section, the script is available in the virtual env as a shell command: dbnomics-validate.

So, to run the validation script on your json-data:

dbnomics-validate [my_provider]-json-data

💡 replace [my_provider] by the slug of the provider you're working on. See the "Now you should be ready" section above for details.

Often a bunch of errors show up at first. Don't panic! A single small fix in the converter's code often fixes a whole batch of errors.

Here's the algorithm of a validation process:

while True:
    fix_your_code()
    if run_the_validator():
        break  # you're done!
    # don't panic
    # take a breath

When the validation script passes on your json-data, you're good to go for a merge request with the DBnomics team :)
