Merge strategies for data rolling windows
Replaces #542 (closed)
Designed with @MichelJuillard
Description
As a general design goal, we want DBnomics to accumulate data instead of strictly reflecting the current state of the provider database.
Sometimes datasets stop being distributed: we should keep them.
For each dataset, providers distribute either:
- the full historical data (i.e. from the first known period to the latest known period)
- a rolling window of the most recent known periods (e.g. the last 6 months)
Post-convert strategies
Convert scripts should produce data starting with an empty directory. They should not know about DBnomics database state.
A post-convert step external to the fetcher should merge data produced by convert with DBnomics database.
There are many possible strategies; each is described below.
Legacy
Currently we use git add --ignore-removal to ensure no file is deleted.
However, this works only at the file and directory level, and it does not work well with a series.jsonl file containing many series.
This strategy should be abandoned.
Merge provider datasets
Works at the provider level.
To be used when the provider distributes a subset of its datasets.
This strategy should always be used.
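A minimal sketch of this strategy, assuming the converted data and the DBnomics database are both plain directory trees with one subdirectory per dataset (the function name and layout are illustrative, not the actual implementation): each dataset produced by convert replaces its counterpart in the database, while datasets absent from the converted output are kept.

```python
import shutil
from pathlib import Path

def merge_provider_datasets(converted_dir: Path, db_dir: Path) -> None:
    """Replace each converted dataset in db_dir; keep the others untouched."""
    for dataset_dir in converted_dir.iterdir():
        if not dataset_dir.is_dir():
            continue
        target = db_dir / dataset_dir.name
        if target.exists():
            # The converted dataset wins entirely over the stored one.
            shutil.rmtree(target)
        shutil.copytree(dataset_dir, target)
```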
Merge dataset series
Works at the dataset level.
To be used when a dataset is distributed with a subset of its series.
This strategy should be used when the fetcher developer knows that a dataset is distributed with a rolling window for its series.
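A sketch of merging at the series level of a dataset, assuming series are stored as JSON lines keyed by a "code" field (the field name and function are assumptions for illustration): series in the converted file replace those with the same code in the database file, and series only present in the database are preserved.

```python
import json
from pathlib import Path

def merge_dataset_series(converted_jsonl: Path, db_jsonl: Path) -> None:
    """Merge converted series into the database series.jsonl, keyed by code."""
    series = {}
    # Read the database first, then the converted file, so that
    # converted series overwrite stored ones with the same code.
    for path in (db_jsonl, converted_jsonl):
        if path.exists():
            with path.open() as f:
                for line in f:
                    s = json.loads(line)
                    series[s["code"]] = s
    with db_jsonl.open("w") as f:
        for s in series.values():
            f.write(json.dumps(s) + "\n")
```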
Merge series observations
Works at the series level.
To be used when a series is distributed with a subset of its observations.
This strategy should be used when the fetcher developer knows that a series is distributed with a rolling window for its observations.
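A sketch of merging at the observation level, assuming observations are (period, value) pairs (the representation is an assumption for illustration): periods inside the rolling window overwrite the stored values, while older periods outside the window are preserved.

```python
def merge_series_observations(db_obs, converted_obs):
    """Merge observations keyed by period; converted values win per period."""
    merged = dict(db_obs)
    merged.update(dict(converted_obs))
    return sorted(merged.items())
```

For example, if the database holds 2023-01 through 2023-03 and the provider now distributes only 2023-03 and 2023-04, the merge keeps the older periods and updates the overlapping one.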
How fetchers declare strategies
- in a fetcher.yml file in the Git repository of each fetcher (TODO)
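Since the fetcher.yml schema is still TODO, here is a purely hypothetical sketch of what the declaration could look like (all key and strategy names are invented, not an actual schema):

```yaml
# Hypothetical fetcher.yml fragment; the real schema is not defined yet.
merge_strategies:
  default: merge-provider-datasets
  datasets:
    DATASET_A: merge-dataset-series        # series rolling window
    DATASET_B: merge-series-observations   # observations rolling window
```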
Questions
- Should there be one or many strategies per provider? per dataset?