## About

Step by step, this page explains how to:

* set up your working environment
* write a new fetcher for the **DBnomics platform**, or contribute to an existing one
* test this fetcher against the "DBnomics validation script"

## Changelog
... | ... | @@ -13,22 +13,32 @@ Step by step, this page explains how to: |

## What is a fetcher?

A fetcher is a software package dedicated to fetching (downloading) data from a given provider and converting it to the DBnomics format (cf. [dbnomics-data-model](https://git.nomics.world/dbnomics/dbnomics-data-model)). Once developed and tested in a dev environment, the fetcher is run by GitLab CI on a regular basis (often daily) to collect fresh data from providers.

A fetcher is composed of 2 scripts, which can be written in any programming language. In practice, Python is the language promoted by DBnomics to write its [fetchers](https://git.nomics.world/dbnomics-fetchers), and it is the language used in this doc.

The two scripts are:

- the downloader: `download.py` is responsible for downloading data from the provider (the place where data is available); it takes the download directory as a parameter
  - in the DBnomics GitLab CI context, downloaded data is committed to the corresponding `source-data` repository
- the converter: `convert.py` is responsible for converting downloaded data from the provider's format to the DBnomics format; it takes the source directory (where the downloaded files reside) and the target directory (where the converted data is written) as parameters
  - in the DBnomics GitLab CI context, converted data is committed to the corresponding `json-data` repository

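
The parameters described above could be parsed as in the following minimal sketch. The argument names `target_dir` and `source_dir` and the helper names `parse_download_args` and `parse_convert_args` are assumptions made for illustration; the actual DBnomics fetchers may name and organize them differently.

```python
import argparse
from pathlib import Path


def parse_download_args(argv):
    """Parse the single parameter of a download script: the download directory."""
    parser = argparse.ArgumentParser(description="Download data from the provider")
    # "target_dir" is a hypothetical argument name.
    parser.add_argument("target_dir", type=Path,
                        help="directory where downloaded source data is written")
    return parser.parse_args(argv)


def parse_convert_args(argv):
    """Parse the two parameters of a convert script: source and target directories."""
    parser = argparse.ArgumentParser(description="Convert source data to DBnomics format")
    # Both argument names are hypothetical.
    parser.add_argument("source_dir", type=Path,
                        help="directory containing the downloaded files")
    parser.add_argument("target_dir", type=Path,
                        help="directory where converted data is written")
    return parser.parse_args(argv)
```

In a real script, `argv` would typically be `sys.argv[1:]`, so that the converter could be invoked as `python convert.py wb-source-data wb-json-data`.
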
### Git structure

For a given provider (e.g. FED), 3 Git repositories are defined:

- `{provider}-fetcher` (e.g. `fed-fetcher`) contains the download and convert scripts
- `{provider}-source-data` (e.g. `fed-source-data`), where the download script writes downloaded data
- `{provider}-json-data` (e.g. `fed-json-data`), where the convert script writes converted data, ready to be ingested by the DBnomics platform

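
The naming convention can be expressed as a small helper. This is only an illustration: the function name `repository_names` and the dictionary keys are hypothetical, and lowercasing the provider code is inferred from the FED / `fed-fetcher` example.

```python
def repository_names(provider_slug: str) -> dict:
    """Return the three repository names used for a given provider."""
    slug = provider_slug.lower()  # e.g. "FED" -> "fed", as in the example above
    return {
        "fetcher": f"{slug}-fetcher",          # download and convert scripts
        "source_data": f"{slug}-source-data",  # written by the download script
        "json_data": f"{slug}-json-data",      # written by the convert script
    }
```

For instance, `repository_names("FED")["json_data"]` evaluates to `"fed-json-data"`.
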
### Download process

- `download.py` downloads the data from the provider and puts it, **without changing the format**, in a directory given as a script parameter
  - if the source data is a zip file, the downloader unzips it but keeps the original format of the extracted files
- when the script is executed in the context of a [GitLab CI job](https://docs.gitlab.com/ee/ci/introduction/), i.e. when `download.py` is executed by the bash script in the [`.gitlab-ci.yml` file of the fetcher](https://git.nomics.world/dbnomics-fetchers/wb-fetcher/blob/master/.gitlab-ci.yml), the downloaded data is **committed** to the corresponding *source-data* Git repository. Example: the [World Bank fetcher](https://git.nomics.world/dbnomics-fetchers/wb-fetcher) puts **World Bank source data** in the [wb-source-data](https://git.nomics.world/dbnomics-source-data/wb-source-data) repository.

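
The download step could look like the following minimal sketch, using only the Python standard library. The function names `fetch_bytes` and `store_source_data` are hypothetical, and detecting archives by the `.zip` file extension is an assumption made for this example; it is not taken from an existing fetcher.

```python
import io
import zipfile
from pathlib import Path
from urllib.request import urlopen


def fetch_bytes(url: str) -> bytes:
    """Download a resource and return its raw bytes, without decoding or reformatting."""
    with urlopen(url) as response:
        return response.read()


def store_source_data(content: bytes, filename: str, target_dir: Path) -> list:
    """Write downloaded bytes into target_dir, keeping the provider's format.

    Zip archives (detected here by file extension, an assumption) are unzipped;
    the extracted files are left untouched. Returns the names of the files written.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    if filename.lower().endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(content)) as archive:
            archive.extractall(target_dir)
            # Skip directory entries, which end with "/" in zip listings.
            return sorted(n for n in archive.namelist() if not n.endswith("/"))
    (target_dir / filename).write_bytes(content)
    return [filename]
```

Keeping the network call (`fetch_bytes`) separate from the storage logic makes the latter easy to test offline with in-memory data.
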
### Conversion process

- `convert.py` converts the data found in a source directory to the DBnomics format and puts the resulting data in a target directory; both directories are given as script parameters
  - the format of converted data is specified in [dbnomics-data-model](https://git.nomics.world/dbnomics/dbnomics-data-model) (this will be described later)
- when the script is executed in the context of a [GitLab CI job](https://docs.gitlab.com/ee/ci/introduction/), i.e. when `convert.py` is executed by the bash script in the [`.gitlab-ci.yml` file of the fetcher](https://git.nomics.world/dbnomics-fetchers/wb-fetcher/blob/master/.gitlab-ci.yml), the converted data is **committed** to the corresponding *json-data* Git repository. Example: the [World Bank fetcher](https://git.nomics.world/dbnomics-fetchers/wb-fetcher) puts **World Bank json data** in the [wb-json-data](https://git.nomics.world/dbnomics-json-data/wb-json-data) repository.

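
The overall shape of a converter (read from one directory, write to another) can be sketched as follows. This is a skeleton only: the function name `convert_directory` is hypothetical, and the JSON written here is a placeholder that does **not** follow the real format, which is the one specified in dbnomics-data-model.

```python
import json
from pathlib import Path


def convert_directory(source_dir: Path, target_dir: Path) -> int:
    """Write one placeholder JSON file per source file; return the number of files written."""
    if not source_dir.is_dir():
        raise NotADirectoryError(f"source directory not found: {source_dir}")
    target_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for source_file in sorted(source_dir.iterdir()):
        if not source_file.is_file():
            continue  # skip sub-directories such as .git
        # Placeholder content only: a real converter would parse the provider's
        # format here and emit data that follows dbnomics-data-model.
        placeholder = {
            "source_file": source_file.name,
            "size_bytes": source_file.stat().st_size,
        }
        output_file = target_dir / f"{source_file.stem}.json"
        output_file.write_text(json.dumps(placeholder, indent=2), encoding="utf-8")
        count += 1
    return count
```
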
## Steps to write/contribute to a fetcher
... | ... | @@ -62,7 +72,7 @@ Inside your working directory |

#### Prepare destination folders

The download script needs an existing folder to put *source data* in, and the converter script needs an existing folder to put *json data* in. We could name these two directories freely, but later we will use the *validation script* to check that our json-data conforms to the DBnomics data model, and this script expects the json-data directory to be named like `[provider_slug]-json-data`.

So we usually name those folders `[provider_slug]-source-data` and `[provider_slug]-json-data`. In our example, the slug used for the World Bank is `wb`, so the folders are named `wb-source-data` and `wb-json-data`.

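
As an illustration, both folders could also be created from Python. This is a minimal sketch under stated assumptions: the helper name `prepare_destination_folders` is hypothetical and is not part of any DBnomics tooling.

```python
from pathlib import Path


def prepare_destination_folders(provider_slug: str, working_dir: Path) -> tuple:
    """Create the source-data and json-data folders for a provider and return their paths."""
    source_data_dir = working_dir / f"{provider_slug}-source-data"
    json_data_dir = working_dir / f"{provider_slug}-json-data"
    for directory in (source_data_dir, json_data_dir):
        # exist_ok=True makes the call safe to repeat on an existing folder.
        directory.mkdir(parents=True, exist_ok=True)
    return source_data_dir, json_data_dir
```

Calling `prepare_destination_folders("wb", Path("."))` creates the two folders in the current working directory if they do not exist yet.
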
... | ... | |