Handle dataset releases

Related to #718 (closed)

  • As an economist
  • I want to access each release (past and latest) of a particular dataset, when it is published with releases
  • in order to do reproducible data processing.

Acceptance criteria

  • the fetcher authors MUST be able to declare the releases of a dataset in a meta-data file (releases.json)
  • the web API MUST accept :latest suffix for the "dataset" endpoint, and redirect to the latest release

Description

This issue describes an evolution of the DBnomics conceptual data model: it introduces the new concept of a dataset release.

When a provider distributes a dataset as a series of releases, letting users download previous releases that each have their own name, DBnomics can integrate those releases and offer them to its users.

Each dataset can have many releases. Each release is just a normal dataset whose code follows the pattern {dataset_code_prefix}:{release_name} (e.g. WEO:2020-04). As a consequence, all the existing components of DBnomics continue to work without modification.

An actual release name can be any string (2020-01, 1.1, before_trump). In addition, the special release name latest references the latest known release.

To encode the relationship between a dataset and its releases, a new releases.json file is introduced:

// releases.json
{
  "dataset_releases": [
    {
      "dataset_code_prefix": "WEO",
      "name": "WEO by countries", // optional
      "releases": [
        {"code": "2020-04"},
        {"code": "2020-10"} // latest release at this time
      ]
    }
  ]
}

The latest release corresponds to the last item of the releases array. If that array changes, the latest release is always whichever item is last.
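The "last item wins" rule can be sketched in a few lines of Python. This is only an illustration: the function name find_latest_release_code is hypothetical, and the comments of the example above are dropped because standard JSON does not allow them.

```python
import json

# Same structure as the releases.json example above, without comments.
RELEASES_JSON = """
{
  "dataset_releases": [
    {
      "dataset_code_prefix": "WEO",
      "name": "WEO by countries",
      "releases": [{"code": "2020-04"}, {"code": "2020-10"}]
    }
  ]
}
"""


def find_latest_release_code(releases_data: dict, dataset_code_prefix: str) -> str:
    """Return the code of the last declared release (hypothetical helper)."""
    for item in releases_data["dataset_releases"]:
        if item["dataset_code_prefix"] == dataset_code_prefix:
            releases = item["releases"]
            if not releases:
                raise ValueError(f"no release declared for {dataset_code_prefix!r}")
            # The latest release is, by convention, the last array item.
            return releases[-1]["code"]
    raise KeyError(dataset_code_prefix)


releases_data = json.loads(RELEASES_JSON)
latest_code = find_latest_release_code(releases_data, "WEO")
print(f"WEO:{latest_code}")  # WEO:2020-10
```

Because the rule depends only on array order, appending a new release to the array is enough to change what latest points to.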

When the user asks for {dataset_code_prefix}:latest:

  • in the API: the HTTP request is redirected (HTTP 302) to the actual latest release name
  • in the UI: the HTTP request is redirected (HTTP 302) to the actual latest release name
  • from language modules (Python, R, etc.): because the fetch function calls the API, it just has to follow the redirection.
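As a rough sketch of the redirection logic described above, the following Python computes the redirect target for a {dataset_code_prefix}:latest request. Everything here is an assumption made for illustration: the function names, the latest_by_prefix mapping (which would be derived from releases.json) and the "/{provider_code}/{dataset_code}" path are hypothetical and do not describe the actual dbnomics-api or dbnomics-website code.

```python
from typing import Dict, Optional, Tuple

LATEST = "latest"  # special release name


def resolve_latest(dataset_code: str, latest_by_prefix: Dict[str, str]) -> Optional[str]:
    """Return the concrete dataset code for "{prefix}:latest", else None."""
    prefix, sep, release_name = dataset_code.rpartition(":")
    if not sep or release_name != LATEST:
        return None  # not a ":latest" request, nothing to resolve
    if prefix not in latest_by_prefix:
        return None  # no releases declared for this prefix
    return f"{prefix}:{latest_by_prefix[prefix]}"


def redirect_for(
    provider_code: str, dataset_code: str, latest_by_prefix: Dict[str, str]
) -> Optional[Tuple[int, str]]:
    """Return (status, location) when a redirect is needed, else None."""
    target = resolve_latest(dataset_code, latest_by_prefix)
    if target is None:
        return None
    # 302 is a temporary redirect: the target changes with each new release.
    return 302, f"/{provider_code}/{target}"  # hypothetical path layout


latest_by_prefix = {"WEO": "2020-10"}
print(redirect_for("IMF", "WEO:latest", latest_by_prefix))
print(redirect_for("IMF", "WEO:2020-04", latest_by_prefix))
```

A request for an actual release name yields no redirect, so only callers using latest are exposed to a moving target.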

An HTTP redirection is a good way to let users understand that the latest release depends on the current time, and to encourage them to use an actual release name. However, in language modules the user will not see the redirection and will have to accept the risk of using the latest release. The choice is between always having the latest data and potentially breaking the source code, in particular if it is executed automatically, for example every day.

To simplify fetcher development, when the category tree is just a flat list of datasets, it is conceptually possible to generate it from the releases meta-data. Each dataset_code_prefix would be a category having one child node per release. This will be possible with dbnomics-fetcher-toolbox (cf #622). Meanwhile, fetcher authors can write category_tree.json manually.
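A minimal sketch of such a generation step is shown below. It assumes that category_tree.json is a list of nodes with "code", optional "name" and "children" keys; the function build_category_tree is hypothetical and is not part of dbnomics-fetcher-toolbox.

```python
import json


def build_category_tree(releases_data: dict) -> list:
    """Build one category per dataset_code_prefix, one child node per release."""
    tree = []
    for item in releases_data["dataset_releases"]:
        prefix = item["dataset_code_prefix"]
        node = {"code": prefix}
        if "name" in item:  # "name" is optional in releases.json
            node["name"] = item["name"]
        # Each release is a normal dataset named {prefix}:{release_name}.
        node["children"] = [
            {"code": f"{prefix}:{release['code']}"} for release in item["releases"]
        ]
        tree.append(node)
    return tree


releases_data = {
    "dataset_releases": [
        {
            "dataset_code_prefix": "WEO",
            "name": "WEO by countries",
            "releases": [{"code": "2020-04"}, {"code": "2020-10"}],
        }
    ]
}
category_tree = build_category_tree(releases_data)
print(json.dumps(category_tree, indent=2))
```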

Details

  • We preferred introducing a new file named releases.json, instead of adding a new property to provider.json, to avoid data model changes.
  • Data validation: each release declared in releases.json MUST correspond to a dataset whose code is {dataset_code_prefix}:{release_name}
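The validation rule above could be checked along these lines. This is a sketch under assumptions: find_missing_release_datasets is a hypothetical name, and dataset_codes stands for the set of dataset codes actually present in the json-data directory, however the validator collects them.

```python
from typing import Iterable, List


def find_missing_release_datasets(
    releases_data: dict, dataset_codes: Iterable[str]
) -> List[str]:
    """Return the expected {prefix}:{release} codes that have no dataset."""
    existing = set(dataset_codes)
    missing = []
    for item in releases_data["dataset_releases"]:
        prefix = item["dataset_code_prefix"]
        for release in item["releases"]:
            expected_code = f"{prefix}:{release['code']}"
            if expected_code not in existing:
                missing.append(expected_code)
    return missing


releases_data = {
    "dataset_releases": [
        {
            "dataset_code_prefix": "WEO",
            "releases": [{"code": "2020-04"}, {"code": "2020-10"}],
        }
    ]
}
# Only one of the two declared releases exists as a dataset here.
missing = find_missing_release_datasets(releases_data, ["WEO:2020-04"])
print(missing)  # ['WEO:2020-10']
```

An empty result means every declared release has a matching dataset; any other result lists the codes to report as validation errors.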

Tasks

  • dbnomics-data-model: add a schema model for releases.json (cf dbnomics/dbnomics-data-model!44 (merged))
  • dbnomics-data-model: validate releases.json if it exists in json-data directory
  • dbnomics-api: detect requests for {dataset_code_prefix}:latest in each endpoint accepting a dataset code, and redirect to the latest release with HTTP 302 (temporary). This is very important: do not use a permanent redirect (cf dbnomics/dbnomics-api!9 (merged))
  • dbnomics-website: detect requests for {dataset_code_prefix}:latest in each route accepting a dataset code, and redirect to the latest release with HTTP 302 (temporary). This is very important: do not use a permanent redirect (cf dbnomics/dbnomics-website!7 (merged))
  • dbnomics-docs: document feature (cf dbnomics/dbnomics-docs!1 (merged))
  • declare WEO and WEOAGG releases in IMF fetcher (cf #718 (closed) and imf-fetcher!1 (merged))

Questions

  • What happens if a dataset code has a : as published by the provider?
  • Is Solr impacted by this issue? I don't think so for now...
Edited Oct 14, 2020 by Christophe Benz