Accumulating data across revisions
Note: this issue is a work-in-progress thought process
The new version of the NBS fetcher is based on scraping a single HTML page that displays the current-period value of each time series. Time series have a name but no ID.
The current download script downloads the HTML page from the NBS website, overwriting the previously downloaded page. The current convert script parses the HTML page and produces a list of datasets and series. Each series contains only one observation value, matching the current-period value. See on preprod.
If we keep this way of downloading and converting, time series will always contain a single observation value, for the last known period. Each iteration overwrites the previous one.
@MichelJuillard suggests accumulating data across revisions (cf. NBS issue comment).
What if we wanted to accumulate observation data over time and produce consolidated time series?
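To make the goal concrete, here is a minimal sketch (with hypothetical series names and values, and assuming monthly periods) of how successive revisions, each carrying a single observation per series, would accumulate into one consolidated series:

```python
# Each revision of the NBS page yields one observation per series
# (series name "CPI" and all values are hypothetical, for illustration).
revisions = [
    {"CPI": ("2023-01", 102.1)},  # snapshot downloaded in January
    {"CPI": ("2023-02", 102.4)},  # snapshot downloaded in February
    {"CPI": ("2023-03", 102.9)},  # snapshot downloaded in March
]

# Accumulating across revisions produces consolidated series:
# series name -> {period -> value}.
consolidated = {}
for revision in revisions:
    for series_name, (period, value) in revision.items():
        consolidated.setdefault(series_name, {})[period] = value

print(consolidated)
# {'CPI': {'2023-01': 102.1, '2023-02': 102.4, '2023-03': 102.9}}
```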
Caveats
The first questions standing in our way (mostly specific to NBS), potentially preventing us from connecting observation values into stable time series:
- Will the updated HTML keep the same name over time?
- Will the design of the page stay stable?
- Will the code and label of the dataset remain the same?
- Will the code and label of the series remain the same?
- Will the frequency of the time series stay stable?
- Will the dimensions of the time series stay stable?
- Will the dataset - time series relation stay the same?
- What about the category tree?
Approaches
- accumulate downloaded files instead of overwriting them
  - keep the original format (HTML)
  - option: add an intermediate format like CSV (or a serialized DataFrame...)
    - so that the fetcher does not have to remain compatible with all the versions of the source format
  - good: does not depend on the current database state
- post-process JSON-data: keep the download/convert scripts as-is and add an additional step to the CI (for example) which would merge each new partial time series with the existing one, fetched either from the DBnomics database (via the API) or from the previous commit in the JSON-data repo (to be determined)
  - good: simplifies fetcher authoring
Accumulate downloaded files
We should find a way to automate the accumulation of source files in source-data.
Problem:

```mermaid
graph LR
html[partial series HTML]
tsv[partial series TSV]
provider -- download --> html
html -- convert --> tsv
```
Accumulate downloaded files:

```mermaid
graph LR
html1[partial series HTML 1]
html2[partial series HTML 2]
tsv[full series TSV]
provider --> download
download --> html1
download --> html2
html1 --> convert
html2 --> convert
convert --> tsv
```
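The accumulation flow could be sketched as follows. This is a minimal sketch, not the actual fetcher scripts: the `download` function and the `parse_observations` callback are hypothetical stand-ins. The idea is that the download step writes each snapshot under a new dated name instead of overwriting, and the convert step folds every accumulated file into full series.

```python
import datetime as dt
from pathlib import Path

SOURCE_DATA = Path("source-data")

def download(html: str) -> Path:
    """Save the page under a dated name instead of overwriting.

    One file per day here; a finer-grained timestamp could be used
    if the source page is updated more often.
    """
    SOURCE_DATA.mkdir(exist_ok=True)
    path = SOURCE_DATA / f"nbs-{dt.date.today().isoformat()}.html"
    path.write_text(html)
    return path

def convert(parse_observations) -> dict:
    """Fold every accumulated HTML file into full series.

    `parse_observations` is a hypothetical parser yielding
    (series_name, period, value) tuples from one HTML snapshot.
    Returns series name -> {period -> value}; later files win on conflict.
    """
    series: dict[str, dict[str, float]] = {}
    for path in sorted(SOURCE_DATA.glob("nbs-*.html")):
        for name, period, value in parse_observations(path.read_text()):
            series.setdefault(name, {})[period] = value
    return series
```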
Post-process JSON-data:

```mermaid
graph LR
html1[partial series HTML 1]
html2[partial series HTML 2]
tsv[full series TSV]
provider --> download
download --> html1
html1 -- previous commit --> html2
html1 --> convert
convert --> tsv
json-data -- read --> convert
```
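The post-processing merge step could be sketched as below. This is a minimal sketch under an assumed series layout (`{"code": ..., "observations": [[period, value], ...]}`); the real JSON-data layout may differ. Each new partial series, carrying one observation, is merged into the full series fetched from the previous commit (or from the DBnomics API):

```python
def merge_series(previous: dict, partial: dict) -> dict:
    """Merge a partial series (e.g. one observation) into the previous full series.

    Both dicts are assumed to look like (hypothetical layout):
    {"code": "...", "observations": [["2023-01", 102.1], ...]}
    The newer revision wins when a period appears in both.
    """
    merged = dict(previous)
    observations = {period: value for period, value in previous.get("observations", [])}
    for period, value in partial["observations"]:
        observations[period] = value
    merged["observations"] = sorted(observations.items())
    return merged
```

A design note on this approach: because it reads the previous state (from the API or from the previous JSON-data commit), it is not idempotent with respect to the source alone, unlike the "accumulate downloaded files" approach which can rebuild the full series from source-data at any time.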