Merge strategies for data rolling windows
Replaces #542 (closed)
Designed with @MichelJuillard
Description
As a general design goal, we want DBnomics to accumulate data instead of strictly reflecting the current state of the provider database.
Sometimes datasets stop being distributed: we should keep them.
For each dataset, providers distribute either:
- the full historical data (i.e. from the first known period to the latest known period)
- a rolling window of the most recent known periods (e.g. the last 6 months)
Post-convert strategies
Convert scripts should produce data starting with an empty directory. They should not know about DBnomics database state.
A post-convert step external to the fetcher should merge data produced by convert with DBnomics database.
There are many possible strategies; each is described below.
Legacy
Currently we use git add --ignore-removal to ensure no file is deleted.
However, this works only at the file and directory level, and it does not work well with a series.jsonl file containing many series.
This strategy should be abandoned.
Merge provider datasets
Works at the provider level.
To be used when the provider distributes a subset of its datasets.
This strategy should always be used.
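A minimal sketch of this strategy, assuming the converted data and the DBnomics database are both plain directory trees with one subdirectory per dataset (the function name and layout are illustrative, not the actual implementation): each dataset produced by convert replaces its counterpart in the database, while datasets absent from the converted output are kept.

```python
import shutil
from pathlib import Path

def merge_provider_datasets(converted_dir: Path, db_dir: Path) -> None:
    """Replace each converted dataset in db_dir; keep the others untouched."""
    for dataset_dir in converted_dir.iterdir():
        if not dataset_dir.is_dir():
            continue
        target = db_dir / dataset_dir.name
        if target.exists():
            # The converted dataset wins entirely over the stored one.
            shutil.rmtree(target)
        shutil.copytree(dataset_dir, target)
```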
Merge dataset series
Works at the dataset level.
To be used when a dataset is distributed with a subset of its series.
This strategy should be used when the fetcher developer knows that a dataset is distributed with a rolling window for its series.
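A sketch of merging at the series level of a dataset, assuming series are stored as JSON lines keyed by a "code" field (the field name and function are assumptions for illustration): series in the converted file replace those with the same code in the database file, and series only present in the database are preserved.

```python
import json
from pathlib import Path

def merge_dataset_series(converted_jsonl: Path, db_jsonl: Path) -> None:
    """Merge converted series into the database series.jsonl, keyed by code."""
    series = {}
    # Read the database first, then the converted file, so that
    # converted series overwrite stored ones with the same code.
    for path in (db_jsonl, converted_jsonl):
        if path.exists():
            with path.open() as f:
                for line in f:
                    s = json.loads(line)
                    series[s["code"]] = s
    with db_jsonl.open("w") as f:
        for s in series.values():
            f.write(json.dumps(s) + "\n")
```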
Merge series observations
Works at the series level.
To be used when a series is distributed with a subset of its observations.
This strategy should be used when the fetcher developer knows that a series is distributed with a rolling window for its observations.
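A sketch of merging at the observation level, assuming observations are (period, value) pairs (the representation is an assumption for illustration): periods inside the rolling window overwrite the stored values, while older periods outside the window are preserved.

```python
def merge_series_observations(db_obs, converted_obs):
    """Merge observations keyed by period; converted values win per period."""
    merged = dict(db_obs)
    merged.update(dict(converted_obs))
    return sorted(merged.items())
```

For example, if the database holds 2023-01 through 2023-03 and the provider now distributes only 2023-03 and 2023-04, the merge keeps the older periods and updates the overlapping one.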
How fetchers declare strategies
- in a fetcher.yml file in the Git repository of each fetcher (TODO)
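Since the fetcher.yml schema is still TODO, here is a purely hypothetical sketch of what the declaration could look like (all key and strategy names are invented, not an actual schema):

```yaml
# Hypothetical fetcher.yml fragment; the real schema is not defined yet.
merge_strategies:
  default: merge-provider-datasets
  datasets:
    DATASET_A: merge-dataset-series        # series rolling window
    DATASET_B: merge-series-observations   # observations rolling window
```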
Questions
- Should there be one or many strategies per provider? per dataset?