Handle dataset releases
Related to #718 (closed)
- As an economist
- I want to access each release (past and latest) of a particular dataset, when it is published with releases
- in order to do reproducible data processing.
Acceptance criteria
- the fetcher authors MUST be able to declare the releases of a dataset in a metadata file (`releases.json`)
- the web API MUST accept the `:latest` suffix for the "dataset" endpoint, and redirect to the latest release
Description
This issue describes an evolution of DBnomics conceptual data model, introducing a new concept of dataset release.
When a provider distributes datasets with releases, allowing the user to download many previous releases, each having its own name, DBnomics can integrate those releases and propose them to its users.
Each dataset can have many releases. Each release is just a normal dataset named after the pattern `{dataset_code_prefix}:{release_name}` (e.g. `WEO:2020-04`). As a consequence, all the existing components of DBnomics continue to work without needing evolutions.
There are actual release names, which can be any string (`2020-01`, `1.1`, `before_trump`), and a special release name `latest` referencing the latest known release.
To encode the relationship between a dataset and its releases, a new `releases.json` file is introduced:
```jsonc
// releases.json
{
  "dataset_releases": [
    {
      "dataset_code_prefix": "WEO",
      "name": "WEO by countries", // optional
      "releases": [
        {"code": "2020-04"},
        {"code": "2020-10"} // latest release at this time
      ]
    }
  ]
}
```
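To illustrate, here is a minimal sketch (in Python, with a hypothetical helper name) of how a `:latest` dataset code could be resolved from this metadata, the latest release being the last item of the `releases` array:

```python
import json

def resolve_dataset_code(dataset_code: str, releases_json: dict) -> str:
    """Resolve a dataset code: replace a ':latest' suffix by the code of
    the latest release (the last item of the 'releases' array).
    Codes without the ':latest' suffix are returned unchanged.
    """
    prefix, sep, release = dataset_code.partition(":")
    if sep != ":" or release != "latest":
        return dataset_code
    for dataset_release in releases_json["dataset_releases"]:
        if dataset_release["dataset_code_prefix"] == prefix:
            latest = dataset_release["releases"][-1]["code"]
            return f"{prefix}:{latest}"
    raise ValueError(f"No releases declared for {prefix!r}")

releases_json = json.loads("""
{
  "dataset_releases": [
    {
      "dataset_code_prefix": "WEO",
      "releases": [{"code": "2020-04"}, {"code": "2020-10"}]
    }
  ]
}
""")

print(resolve_dataset_code("WEO:latest", releases_json))   # WEO:2020-10
print(resolve_dataset_code("WEO:2020-04", releases_json))  # WEO:2020-04
```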
The latest release corresponds to the last item of the `releases` array. If that array evolves, the `latest` release will always correspond to its last item.
When the user asks for `{dataset_code_prefix}:latest`:
- in the API: the HTTP request is redirected (HTTP 302) to the actual latest release name
- in the UI: the HTTP request is redirected (HTTP 302) to the actual latest release name
- from language modules (Python, R, etc.): because the `fetch` function calls the API, it just has to follow the redirection.
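The redirection behavior can be sketched as a pure function, assuming a precomputed latest-release mapping; the endpoint path and helper names below are assumptions, not the actual dbnomics-api routes:

```python
# Mapping from dataset_code_prefix to latest release code,
# built from releases.json (hypothetical; for illustration only).
LATEST_BY_PREFIX = {"WEO": "2020-10"}

def handle_dataset_request(provider_code: str, dataset_code: str):
    """Return (status, headers) for a dataset endpoint request.
    A ':latest' suffix triggers a temporary (302) redirect to the
    actual latest release; other codes are served normally (200).
    """
    prefix, sep, release = dataset_code.partition(":")
    if sep == ":" and release == "latest":
        latest = LATEST_BY_PREFIX.get(prefix)
        if latest is None:
            return 404, {}
        # 302, not 301: the latest release changes over time,
        # so the redirect must never be cached as permanent.
        location = f"/series/{provider_code}/{prefix}:{latest}"
        return 302, {"Location": location}
    return 200, {}

print(handle_dataset_request("IMF", "WEO:latest"))
# (302, {'Location': '/series/IMF/WEO:2020-10'})
```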
An HTTP redirection is a good way to make the user understand that the latest release depends on the current time, and to encourage them to use an actual release name. However, in language modules, the user will not see the redirection and will have to accept the risk of using the `latest` release: they must choose between always having the latest data and potentially breaking their source code, in particular if it is executed automatically (every day, for example).
To simplify fetcher development, when the category tree is just a flat list of datasets, it is conceptually possible to generate it, taking the releases metadata into account. The `dataset_code_prefix` would become a category having one node per release. This will be possible with dbnomics-fetcher-toolbox (cf #622). Meanwhile, fetcher authors can write `category_tree.json` manually.
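The category-tree generation described above could be sketched as follows; this is a minimal sketch, the node shapes are assumptions, and dbnomics-fetcher-toolbox may do it differently:

```python
def build_category_tree(releases_json: dict) -> list:
    """Generate a flat category tree from releases metadata: one
    category per dataset_code_prefix, containing one dataset node
    per release.
    """
    tree = []
    for dataset_release in releases_json["dataset_releases"]:
        prefix = dataset_release["dataset_code_prefix"]
        tree.append({
            "code": prefix,
            "name": dataset_release.get("name", prefix),
            "children": [
                {"code": f"{prefix}:{release['code']}"}
                for release in dataset_release["releases"]
            ],
        })
    return tree

releases_json = {
    "dataset_releases": [
        {
            "dataset_code_prefix": "WEO",
            "name": "WEO by countries",
            "releases": [{"code": "2020-04"}, {"code": "2020-10"}],
        }
    ]
}
print(build_category_tree(releases_json))
```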
Details
- We preferred introducing a new file named `releases.json`, instead of adding a new property to `provider.json`, to avoid data model changes.
- Data validation: each release declared in `releases.json` MUST correspond to a dataset code such as `{dataset_code_prefix}:{release_name}`
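The validation rule could be sketched like this (a minimal sketch; the function name and error format are assumptions, not the actual dbnomics-data-model implementation):

```python
def validate_releases(releases_json: dict, dataset_codes: set) -> list:
    """Check that each release declared in releases.json corresponds
    to an existing dataset code '{dataset_code_prefix}:{release_name}'.
    Return a list of error messages (empty if everything is valid).
    """
    errors = []
    for dataset_release in releases_json["dataset_releases"]:
        prefix = dataset_release["dataset_code_prefix"]
        for release in dataset_release["releases"]:
            expected = f"{prefix}:{release['code']}"
            if expected not in dataset_codes:
                errors.append(f"Missing dataset for declared release: {expected}")
    return errors

releases_json = {
    "dataset_releases": [
        {
            "dataset_code_prefix": "WEO",
            "releases": [{"code": "2020-04"}, {"code": "2020-10"}],
        }
    ]
}
print(validate_releases(releases_json, {"WEO:2020-04", "WEO:2020-10"}))  # []
print(validate_releases(releases_json, {"WEO:2020-04"}))
```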
Tasks
- dbnomics-data-model: add a schema for `releases.json` (cf dbnomics/dbnomics-data-model!44 (merged))
- dbnomics-data-model: validate `releases.json` if it exists in the json-data directory
- dbnomics-api: detect when `{dataset_code_prefix}:latest` is asked in each endpoint accepting a dataset code, and redirect to the latest release (HTTP 302 temporary, very important, do not use a permanent redirect) (cf dbnomics/dbnomics-api!9 (merged))
- dbnomics-website: detect when `{dataset_code_prefix}:latest` is asked in each route accepting a dataset code, and redirect to the latest release (HTTP 302 temporary, very important, do not use a permanent redirect) (cf dbnomics/dbnomics-website!7 (merged))
- dbnomics-docs: document the feature (cf dbnomics/dbnomics-docs!1 (merged))
- declare `WEO` and `WEOAGG` releases in the IMF fetcher (cf #718 (closed) and imf-fetcher!1 (merged))
Questions
- What happens if a dataset code has a `:` as published by the provider?
- Is Solr impacted by this issue? I don't think so for now...