Some fetchers generate too many commits
EPIC: #508
See also:
Description
Some fetchers generate too many commits in xxx-json-data simply by inverting the order of the lines in the JSON. This may also happen in the *.jsonl and *.tsv files, but it seems more frequent in dataset.json.
This makes it impossible to identify a commit with a revision of the data. In addition, it needlessly increases the data flow for an institution that mirrors DBnomics.
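To illustrate what a "false commit" looks like, here is a minimal sketch that checks whether two versions of a file differ only by ordering. The function names are hypothetical and not part of any DBnomics tool. It assumes dataset.json is a JSON object, so key order is ignored when comparing parsed values, and that *.jsonl and *.tsv files can be compared line by line.

```python
import json


def same_json_content(old_text, new_text):
    """True when both JSON documents parse to equal values.

    Python dict comparison ignores key order, so a commit that only
    reorders object keys is detected as "no real change".
    Reordered arrays still count as a difference.
    """
    return json.loads(old_text) == json.loads(new_text)


def same_lines_any_order(old_text, new_text):
    """True when both line-based files (*.jsonl, *.tsv) hold the same lines in any order."""
    return sorted(old_text.splitlines()) == sorted(new_text.splitlines())
```

A fetcher could run such a check before committing and skip the commit when it returns True. Whether this belongs in the fetchers or in a shared layer is left open here.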
I suggest the following procedure to identify the extent of the problem:
For each dataset in dbnomics-json-data:
- get the frequencies used in the dataset. If a daily frequency is present, there is nothing to test; go to the next dataset
- get the dates of the last 3 commits in the corresponding directory
- if all 3 commits took place in the last 7 days, report an error with provider, dataset code, frequencies, dates of last 3 commits
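The procedure above could be sketched as follows. This is not the check_commits script, only an illustration under stated assumptions: each provider's json-data is a non-bare Git working copy, the dataset frequencies have already been extracted by the caller, and a daily frequency is labeled "daily". All function names are hypothetical.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def last_commit_dates(repo_dir, dataset_path, n=3):
    """Return the committer dates (UTC) of the last n commits touching dataset_path."""
    out = subprocess.run(
        ["git", "-C", str(repo_dir), "log", f"--max-count={n}",
         "--format=%ct", "--", str(dataset_path)],
        check=True, capture_output=True, text=True,
    ).stdout
    # %ct is the committer date as a Unix timestamp, one per line.
    return [datetime.fromtimestamp(int(ts), tz=timezone.utc) for ts in out.split()]


def is_suspicious(frequencies, commit_dates, now, window_days=7, n=3):
    """True when a non-daily dataset had its last n commits within the window."""
    if "daily" in frequencies:  # assumed label for the daily frequency
        return False  # frequent commits are expected, nothing to test
    if len(commit_dates) < n:
        return False  # not enough history to decide
    cutoff = now - timedelta(days=window_days)
    return all(d >= cutoff for d in commit_dates[:n])


def report_line(provider, dataset_code, frequencies, commit_dates):
    """Format the error report: provider, dataset code, frequencies, commit dates."""
    dates = ", ".join(d.date().isoformat() for d in commit_dates)
    return f"{provider}/{dataset_code}: frequencies={sorted(frequencies)}, last commits: {dates}"
```

Keeping the Git call separate from the decision logic means the date source can later be swapped for another storage backend without touching is_suspicious.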
Tasks
- @cbenz review https://git.nomics.world/MichelJuillard/check_commits
- ensure the pre-production environment has json-data of all fetchers available somewhere
- port the script to dbnomics-data-model storage layer instead of opening files directly, to support json-lines and bare repositories
- start a meta-issue of fetchers having problems, and indicate those with false commits
- fix fetchers having the problem
Edited by Christophe Benz