# Simplify how to write fetchers

## Description
When writing a new fetcher, the developer should focus on the specifics of the provider – its data or infrastructure. Common tools should be provided to avoid solving the same problems over and over.
In particular, she should not have to deal with:
- data serialization and storage organization (e.g. sorted JSONL files...)
- implementing the iteration loop over all datasets
- handling common script arguments
- producing metrics (error status and tracing)
Instead, she should rely on (see the sketch after this list):
- a shared toolbox of functions: #622
- a data model of domain-level entities: #818
- a data storage: #819 (closed)
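As a sketch of what this could reduce a fetcher to – assuming hypothetical names: `process_datasets`, `Dataset` and `Series` below are illustrative stand-ins for the toolbox (#622), the data model (#818) and the storage (#819), not actual APIs:

```python
import json
import sys
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass
class Series:
    code: str
    periods: List[str]
    values: List[float]


@dataclass
class Dataset:
    code: str
    series: List[Series]


def process_datasets(convert: Callable[[str], Dataset], dataset_codes: List[str]) -> None:
    """Toolbox stand-in (#622): the iteration loop, serialization and error
    reporting live here, not in each fetcher."""
    for dataset_code in dataset_codes:
        try:
            dataset = convert(dataset_code)
        except Exception as exc:
            # The error status would feed the metrics / error report.
            print(f"dataset {dataset_code} skipped: {exc}", file=sys.stderr)
            continue
        # The storage layer (#819) would replace this naive, sorted JSON dump.
        print(json.dumps(asdict(dataset), sort_keys=True))


def convert(dataset_code: str) -> Dataset:
    """Provider-specific part: the only code the fetcher author would write."""
    return Dataset(
        code=dataset_code,
        series=[Series(code="FR", periods=["2019"], values=[1.23])],
    )


if __name__ == "__main__":
    process_datasets(convert, dataset_codes=["DS1", "DS2"])
```

Here the provider-specific `convert` function is the only part the fetcher author writes; everything else comes from shared code.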
## Areas of work
- the storage module must now handle writes:
  - read: API, indexer, clients (from Git, cf #546)
  - write: fetchers (convert step)
- port the ~60 existing fetchers
- factor the download and convert arguments shared between fetchers, like `--datasets` and `--full` (for incremental mode)
- factor the arguments parser, while keeping it extensible
  - in order to allow running a CI pipeline with env variables mapped to these options (cf #507 (closed) for defining those env vars); see the parser sketch after this list
- generalize the incremental vs. full mode of fetchers (currently each fetcher has to deal with this)
  - leading to MRs like eurostat-fetcher!2 (merged)
- validate data before writing (and produce an error report as an artifact?)
- declare a download frequency per dataset, so that important datasets can be downloaded more often than others for providers which don't announce their update dates (like OECD) and thus can't be downloaded incrementally. Moreover, download "daily datasets" (i.e. datasets with at least one daily series?) every day. Cf #567 (closed)
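A minimal sketch of what a write-capable storage module could look like; the `Storage` class and its method names are hypothetical, not the actual #819 API:

```python
import json
from pathlib import Path


class Storage:
    """Hypothetical storage API (#819): reads and writes datasets on disk."""

    def __init__(self, root: Path):
        self.root = root

    # Read side: used by the API, the indexer and the clients (cf #546).
    def load_dataset(self, dataset_code: str) -> dict:
        path = self.root / dataset_code / "dataset.json"
        return json.loads(path.read_text())

    # Write side: used by the convert step of fetchers.
    def save_dataset(self, dataset_code: str, dataset: dict) -> None:
        dataset_dir = self.root / dataset_code
        dataset_dir.mkdir(parents=True, exist_ok=True)
        # Sorted keys keep the JSON files stable and diff-friendly in Git.
        (dataset_dir / "dataset.json").write_text(
            json.dumps(dataset, sort_keys=True, indent=2)
        )
```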
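For the factored arguments parser, a minimal sketch using Python's standard `argparse`; the option names come from the list above, but the env variable names (`DATASETS`, `FULL`) are assumptions pending #507:

```python
import argparse
import os


def base_parser() -> argparse.ArgumentParser:
    """Shared parser factored out of individual fetchers, kept extensible."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--datasets",
        nargs="*",
        # A CI pipeline could map an env variable to this option (cf #507);
        # the variable name DATASETS is an assumption.
        default=os.environ.get("DATASETS", "").split() or None,
        help="restrict processing to these dataset codes",
    )
    parser.add_argument(
        "--full",
        action="store_true",
        # The FULL variable name is an assumption too.
        default=os.environ.get("FULL") == "1",
        help="run in full mode instead of incremental mode",
    )
    return parser


# A fetcher extends the shared parser with its provider-specific options:
parser = base_parser()
parser.add_argument("--api-key", help="hypothetical provider-specific option")
args = parser.parse_args()
```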
## Fetcher common tasks
Common and repetitive tasks performed by fetchers should be factored into dbnomics-data-model helpers. The following list should guide the design of those helpers:
- convert periods from the source format to the DBnomics format (see the sketch after this list)
- declare a dataset skipped because of an error in a series (cf #520)
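As an example of such a helper, a minimal period-conversion sketch; both the source format (`2019Q1`) and the exact target notation are assumptions for illustration, not the definitive dbnomics-data-model API:

```python
import re


def convert_quarterly_period(source_period: str) -> str:
    """Convert a source quarterly period like "2019Q1" to "2019-Q1"."""
    match = re.fullmatch(r"(\d{4})Q([1-4])", source_period)
    if match is None:
        # Unparseable periods should surface as errors, not silent garbage.
        raise ValueError(f"unsupported period: {source_period!r}")
    year, quarter = match.groups()
    return f"{year}-Q{quarter}"


assert convert_quarterly_period("2019Q1") == "2019-Q1"
```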