Accumulating data across revisions

Note: this issue is a work-in-progress thought process

The new version of the NBS fetcher is based on scraping a single HTML page that displays the current period value of each time series. Time series have a name but no ID.

The current download script downloads the HTML page from the NBS website, overwriting the previously downloaded page. The current convert script parses the HTML page and produces a list of datasets and series. Each series contains only one observation: the current period value. See on preprod

If we keep this way of downloading and converting, time series will always contain only one observation value, for the last known period. Each iteration overwrites the previous one.

@MichelJuillard suggests accumulating data across revisions (cf. NBS issue comment)

What if we wanted to accumulate observation data over time and produce consolidated time series?

Caveats

Some first questions standing in our way (mostly specific to NBS), potentially preventing us from connecting observation values into stable time series:

  • Will the updated HTML keep the same name over time?
  • Will the design of the page stay stable?
  • Will the code and label of the dataset remain the same?
  • Will the code and label of the series remain the same?
  • Will the frequency of the time series stay stable?
  • Will the dimensions of the time series stay stable?
  • Will the relation between datasets and time series stay the same?
  • What about the category tree?

Approaches

  • accumulate downloaded files instead of overwriting them
    • keep the original format (HTML)
    • option: add an intermediate format like CSV (or a serialized DataFrame...)
      • so that the fetcher does not have to remain compatible with every version of the source format
    • good: does not depend on the current database state
  • post-process JSON-data: keep the download/convert scripts as-is and add an additional step to the CI (for example) that would merge each new partial time series with the existing one, fetched either from the DBnomics database (via the API) or from the previous commit in the JSON-data repo (to be determined).
    • good: simplifies fetcher authoring
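With the post-processing approach, the merge step could look like the minimal sketch below. The `{period: value}` observation mapping and the rule "newer value wins on an already-known period" are assumptions for illustration, not the actual DBnomics series schema.

```python
# Hypothetical merge of a newly-converted partial series (one observation)
# into the consolidated series already stored in JSON-data.
# Observation format {period: value} is an assumption for this sketch.

def merge_series(existing: dict, partial: dict) -> dict:
    """Merge the observations of `partial` into `existing`.

    A period already present in `existing` is overwritten by the newer
    value (treated as a revision); new periods are appended.
    """
    merged = dict(existing)
    merged.update(partial)
    # Keep observations sorted by period for a stable output.
    return dict(sorted(merged.items()))

existing = {"2020-Q1": 1.2, "2020-Q2": 1.4}
partial = {"2020-Q3": 1.5}  # current period value scraped from the page
print(merge_series(existing, partial))
# {'2020-Q1': 1.2, '2020-Q2': 1.4, '2020-Q3': 1.5}
```

Whether the "existing" side comes from the DBnomics API or from the previous JSON-data commit only changes how `existing` is loaded, not the merge itself.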

Accumulate downloaded files

We should find a way to automate the accumulation of source files in source-data.
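One possible sketch of such an accumulating download step: each run saves the page under a timestamped name instead of overwriting the previous snapshot. The URL and the file-name pattern are assumptions for illustration, not the actual fetcher code.

```python
# Hypothetical accumulating download step for source-data.
import datetime
import urllib.request
from pathlib import Path

NBS_PAGE_URL = "https://www.example.org/nbs-indicators.html"  # placeholder URL

def snapshot_name(now: datetime.datetime) -> str:
    """Build a unique, chronologically sortable file name for one snapshot."""
    return "nbs-" + now.strftime("%Y%m%dT%H%M%SZ") + ".html"

def download_snapshot(source_dir: Path) -> Path:
    """Fetch the page and store it next to the earlier snapshots."""
    target = source_dir / snapshot_name(datetime.datetime.utcnow())
    with urllib.request.urlopen(NBS_PAGE_URL) as response:
        target.write_bytes(response.read())
    return target
```

Sortable names keep the snapshots in download order, which matters later when revisions must override older values.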

problem:

```mermaid
graph LR
  html[partial series HTML]
  tsv[partial series TSV]
  provider -- download --> html
  html -- convert --> tsv
```

accumulate downloaded files:

```mermaid
graph LR
  html1[partial series HTML 1]
  html2[partial series HTML 2]
  tsv[full series TSV]
  provider --> download
  download --> html1
  download --> html2
  html1 --> convert
  html2 --> convert
  convert --> tsv
```
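The convert step of this approach could, as a sketch, fold the single observation of every accumulated snapshot into one full series. Here `parse_snapshot` stands in for the real HTML parser and uses a fake `period;value` format purely for illustration.

```python
# Hypothetical convert step reading every accumulated snapshot
# (oldest first) and merging single observations into one full series.

def parse_snapshot(content: str) -> tuple:
    """Stand-in parser: return (period, value) from one snapshot."""
    period, value = content.strip().split(";")
    return period, float(value)

def convert_all(snapshots: list) -> dict:
    """Fold snapshots into one series; later snapshots override revisions."""
    series = {}
    for content in snapshots:
        period, value = parse_snapshot(content)
        series[period] = value
    return dict(sorted(series.items()))

print(convert_all(["2020-Q1;1.2", "2020-Q2;1.4", "2020-Q1;1.3"]))
# {'2020-Q1': 1.3, '2020-Q2': 1.4}
```

Iterating oldest-first means a revised value in a newer snapshot silently replaces the older one; whether revisions should instead be kept would be a design decision for the fetcher.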

Post-process JSON-data:

```mermaid
graph LR
  html1[partial series HTML 1]
  html2[partial series HTML 2]
  tsv[full series TSV]
  provider --> download
  download --> html1
  html1 -- previous commit --> html2
  html1 --> convert
  convert --> tsv
  json-data -- read --> convert
```
Edited Dec 02, 2020 by Christophe Benz