Harmonization of CI jobs between fetchers

Epic: #508

Description

.gitlab-ci.yml file contents vary from one fetcher to another. In some cases, this can lead to data being deleted that should not have been deleted. Example: #503 (closed)

We have to discuss and decide how to handle cases like:

  • what to do when a download script fails to download some datasets due to a temporary error?
    • if the .gitlab-ci.yml script does a git add -A, the failed datasets will be deleted from the source-data repo
    • if we want to delete those datasets from source-data, a way to do it has been suggested here
    • if we want to keep those datasets (this is probably the way to go, as it could be useful to propose them to end users), we already had the idea of adding a deleted attribute to dataset.json files in json-data
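For the keep-and-flag option, the deleted attribute could look like this (a sketch; the exact key name, the other fields shown, and its placement in dataset.json are not decided):

```json
{
  "code": "EXR",
  "name": "Exchange Rates",
  "deleted": true
}
```

Consumers of json-data could then filter on this attribute instead of relying on the dataset's absence from the repo.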

Notes:

  • kill ssh-agent at the end of the script, even on failure, because the processes accumulate and we have to run killall ssh-agent manually
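The cleanup idea above can be sketched with an EXIT trap, which runs even when the script fails. This is a sketch, not the actual fetcher script; SSH_AGENT_PID is exported by ssh-agent itself, everything else is illustrative:

```shell
#!/bin/sh
# Sketch: guarantee ssh-agent is killed when the CI script exits,
# even on failure under `set -e`.
set -e

stop_agent() {
    # Guarded so the trap is a no-op if the agent never started.
    if [ -n "${SSH_AGENT_PID:-}" ]; then
        kill "$SSH_AGENT_PID" 2>/dev/null || true
    fi
}

# An EXIT trap fires on normal exit AND when `set -e` aborts the script,
# so stray agents cannot accumulate across failed jobs.
trap stop_agent EXIT

# Start the agent only where it is available, so the sketch stays runnable.
if command -v ssh-agent >/dev/null 2>&1; then
    eval "$(ssh-agent -s)" >/dev/null
fi

echo "job steps run here"
```

With this in place, `killall ssh-agent` by hand should no longer be needed.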

.gitlab-ci.yml examples

Some fetchers have "advanced" .gitlab-ci.yml files. A few examples:

  • ECB, which allows testing on branches of source-data and json-data
  • BOE, which adds the ID of the corresponding source-data revision to json-data commit messages

config

(to be designed...)

CI jobs would use these environment variables:

  • DATASETS (comma-separated list): the datasets to process
  • FULL: toggle full (vs. incremental) mode...

Then map those env vars to the scripts' --xxx options.
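The mapping could be as simple as the following sketch. DATASETS and FULL are the variables proposed above; the --datasets/--full option names and the download.py entry point are assumptions:

```shell
#!/bin/sh
# Sketch: translate CI environment variables into script options.
set -e

: "${DATASETS:=ECB-sample}"   # example default, just for the sketch
: "${FULL:=true}"

OPTS=""
if [ -n "${DATASETS:-}" ]; then
    OPTS="$OPTS --datasets $DATASETS"
fi
if [ "${FULL:-}" = "true" ]; then
    OPTS="$OPTS --full"
fi

# A real job would exec the script; here we just show the resulting call.
echo "python download.py$OPTS"
```

Keeping the mapping in one shared place would avoid each fetcher reimplementing it slightly differently.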

Proposals

Using GitLab's new YAML include functionality, we could isolate the whole [call [convert/download].py, commit to [source/json] data] process into a single script kept in its own Git repo
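As a sketch of that proposal (the project and file paths below are hypothetical), each fetcher's .gitlab-ci.yml could shrink to an include plus fetcher-specific variables:

```yaml
# .gitlab-ci.yml of one fetcher (sketch)
include:
  - project: "dbnomics/fetcher-pipeline"   # hypothetical shared repo holding the common jobs
    file: "/pipeline.yml"

# Only fetcher-specific configuration remains here.
variables:
  DATASETS: ""      # empty = process all datasets
  FULL: "false"
```

A fix in the shared pipeline would then apply to all fetchers at once, instead of being copy-pasted into each repo.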

Edited Oct 31, 2019 by Christophe Benz