Some fetchers generate too many commits

EPIC: #508

See also:

  • https://git.nomics.world/MichelJuillard/check_commits

Description

Some fetchers generate too many commits in xxx-json-data simply by inverting the order of the lines in the JSON, without changing the data itself. This probably also happens in the *.jsonl and *.tsv files, but it seems more frequent in dataset.json.

This makes it impossible to match a commit to a revision of the data. In addition, it needlessly increases the data transfer for any institution that mirrors DBnomics.
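To illustrate what such a spurious commit looks like, here is a minimal sketch that tells a pure reordering apart from a real data change. The function names are hypothetical and not part of any DBnomics package. It assumes the reordering affects object keys in JSON files and whole lines in *.jsonl or *.tsv files; arrays whose element order changed would need extra handling.

```python
import json


def canonical_json(text):
    """Re-serialize JSON with sorted keys, so key order no longer affects the bytes."""
    return json.dumps(json.loads(text), sort_keys=True, ensure_ascii=False, indent=2)


def is_json_reorder_only(old_text, new_text):
    """True if the two texts differ as bytes but parse to equal JSON values.

    Python dict equality ignores key order, but list order still matters.
    """
    return old_text != new_text and json.loads(old_text) == json.loads(new_text)


def is_line_reorder_only(old_text, new_text):
    """True if the two texts differ but contain the same lines in another order.

    Intended for line-oriented files such as *.jsonl or *.tsv (assumption).
    """
    return old_text != new_text and sorted(old_text.splitlines()) == sorted(
        new_text.splitlines()
    )
```

If fetchers wrote their JSON through something like `canonical_json`, two runs over unchanged data would produce identical files and therefore no commit. This is one possible fix, not necessarily the one the fetchers should adopt.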

I suggest the following procedure to identify the extent of the problem: For each dataset in dbnomics-json-data:

  1. get the frequencies used in the dataset. If a daily frequency is present, there is nothing to test: go to the next dataset
  2. get the dates of the last 3 commits in the corresponding directory
  3. if all 3 commits took place in the last 7 days, report an error with the provider, dataset code, frequencies, and dates of the last 3 commits
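The procedure above could be sketched roughly as follows. This is not the check_commits script itself, only an illustration under assumptions: the function names are hypothetical, the frequencies are passed in by the caller (how to read them from a dataset is left open here), a daily frequency is assumed to be spelled "daily", and the repository is assumed to be a regular git checkout reachable with the `git` command line.

```python
import subprocess
from datetime import datetime, timedelta, timezone


def last_commit_dates(repo_dir, dataset_dir, n=3):
    """Return the committer dates of the last n commits touching dataset_dir."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", f"-{n}", "--format=%ct", "--", dataset_dir],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    # %ct prints the committer date as a Unix timestamp, one per line.
    return [datetime.fromtimestamp(int(ts), tz=timezone.utc) for ts in out.split()]


def check_dataset(provider, dataset_code, frequencies, commit_dates, now,
                  window=timedelta(days=7), n=3):
    """Return an error report if the last n commits all fall within the window.

    Returns None when the dataset is skipped or looks fine.
    """
    # Step 1: daily data can legitimately change every day, so skip it.
    if "daily" in frequencies:  # assumed spelling of the daily frequency
        return None
    # Step 2: keep the n most recent commit dates.
    recent = sorted(commit_dates, reverse=True)[:n]
    if len(recent) < n:
        return None
    # Step 3: report when all n commits happened inside the window.
    if all(now - date <= window for date in recent):
        return {
            "provider": provider,
            "dataset": dataset_code,
            "frequencies": sorted(frequencies),
            "commit_dates": [date.isoformat() for date in recent],
        }
    return None
```

A caller would loop over the dataset directories, pass `last_commit_dates(repo_dir, dataset_dir)` and `datetime.now(timezone.utc)` to `check_dataset`, and collect the non-None results.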

Tasks

  • @cbenz review https://git.nomics.world/MichelJuillard/check_commits
  • ensure the pre-production environment has json-data of all fetchers available somewhere
  • port the script to dbnomics-data-model storage layer instead of opening files directly, to support json-lines and bare repositories
  • start a meta-issue listing the fetchers that have problems, and indicate those producing false commits
  • fix fetchers having the problem
Edited Oct 28, 2019 by Christophe Benz