Harmonization of CI jobs between fetchers
Epic: #508
Description
`.gitlab-ci.yml` file contents vary from one fetcher to another. In some cases, this can lead to data being deleted that should not have been deleted. Example: #503 (closed)
We have to discuss and decide how to handle cases like:
- what to do when a download script doesn't download some datasets due to a temporary error?
  - if the `.gitlab-ci.yml` script does a `git add -A`, the datasets that failed will be deleted from the source-data repo (a safer staging step is sketched after this list)
  - if we want to delete those datasets from source-data, a way to do it has been suggested here
  - if we want to keep those datasets (this is probably the way to go, as it could be useful to propose them to end users), we already had the idea of adding a `deleted` attribute in `dataset.json` files in json-data
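A minimal sketch of a commit step that avoids the `git add -A` pitfall by never staging deletions; the job name, repository layout and `download.py` invocation are assumptions to be adapted per fetcher:

```yaml
# Hypothetical job; paths and script name are assumptions.
download:
  script:
    - python download.py source-data/
    - cd source-data
    # Stage additions and modifications only, never deletions, so datasets
    # skipped because of a temporary error are not removed from source-data.
    - git add --ignore-removal .
    - git commit -m "New download" || echo "Nothing to commit"
    - git push origin master
```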
Notes:
- kill `ssh-agent` at the end of the script, even on failure, because the processes accumulate and we have to run `killall ssh-agent` manually (see the sketch below)
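A possible way to automate this relies on the fact that `after_script` runs whatever the job status is. The job name and the `SSH_PRIVATE_KEY` variable are assumptions:

```yaml
download:
  before_script:
    - eval $(ssh-agent -s)
    - echo "$SSH_PRIVATE_KEY" | ssh-add -
  script:
    - python download.py source-data/
  after_script:
    # Runs even when the script failed, but in a separate shell, so use
    # killall rather than relying on SSH_AGENT_PID.
    - killall ssh-agent || true
```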
.gitlab-ci.yml examples
Some fetchers have "advanced" `.gitlab-ci.yml` files. Here is a list of some:
- ECB, which allows testing on branches of source-data and json-data
- BOE, which adds the ID of the corresponding source-data revision to json-data commit messages (see the sketch after this list)
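A sketch of the BOE-style idea, recording which source-data revision a json-data commit was converted from; the job name and repository paths are assumptions:

```yaml
convert:
  script:
    - SOURCE_REV=$(git -C source-data rev-parse HEAD)
    - python convert.py source-data/ json-data/
    - cd json-data
    - git add -A
    # The commit message keeps a pointer back to the source-data revision.
    - git commit -m "Convert source-data revision $SOURCE_REV" || echo "Nothing to commit"
    - git push origin master
```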
config
(to be designed...)
CI jobs would use those env variables:
- `DATASETS` (CSV): the datasets to process
- `FULL`: for incremental mode...
Then map those env vars to `--xxx` script options.
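A sketch of such a mapping; the option names (`--datasets`, `--full`) and the exact semantics of `FULL` are assumptions that would have to match the actual scripts:

```yaml
variables:
  DATASETS: ""   # comma-separated dataset codes; empty means all datasets
  FULL: "false"  # assumed: "true" disables incremental mode

download:
  script:
    - OPTS=""
    - if [ -n "$DATASETS" ]; then OPTS="$OPTS --datasets $DATASETS"; fi
    - if [ "$FULL" = "true" ]; then OPTS="$OPTS --full"; fi
    - python download.py $OPTS source-data/
```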
Proposals
Using GitLab's new YAML `include` functionality, we could isolate the whole process (call `[convert/download].py`, commit to [source/json]-data) into a single script kept in its own git repo.
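For example, each fetcher's `.gitlab-ci.yml` could be reduced to an `include` of a shared definition; the project path, file name and `PROVIDER_SLUG` variable below are hypothetical:

```yaml
include:
  - project: "dbnomics-fetchers/fetcher-pipeline"   # hypothetical shared repo
    ref: master
    file: "/fetcher-pipeline.yml"

variables:
  PROVIDER_SLUG: "ecb"   # the only fetcher-specific part (hypothetical)
```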