# DBnomics data model

This repository defines the data model of DBnomics.

For a quick schematic look at the data model, please read the [cheat_sheet.md](./cheat_sheet.md) file.
If you are a developer working on fetchers, you can print it!

See also [these sample directories](./tests/fixtures).

Note: The `✓` symbol means that a constraint is validated by the [validation script](./scripts/validate_storage_dir.py).

## Entities and relationships
`provider -> dataset -> time series -> observations`

- Each provider contains datasets
- Each dataset contains time series
- Each time series contains observations
- Each observation is a tuple like `(period, value, attribute1, attribute2, ..., attributeN)`, where attributes are optional

Note: the singular and plural forms of "time series" are identical (cf. [Wiktionary](https://en.wiktionary.org/wiki/time_series)).

## Storage

DBnomics data is stored in regular directories of the file-system.

Each storage directory contains the data of one provider, as converted by a fetcher.

- ✓ The directory name MUST be `{provider_code}-json-data`.
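
As an illustration, the naming rule above can be checked in a few lines of Python. This is a minimal sketch and not the logic of the validation script; the function name `provider_code_from_dir` is hypothetical.

```python
from pathlib import Path
from typing import Optional

SUFFIX = "-json-data"


def provider_code_from_dir(storage_dir: Path) -> Optional[str]:
    """Return the provider code encoded in the directory name, or None.

    Hypothetical helper: it only looks at the last path component.
    """
    name = storage_dir.name
    # The name must end with the suffix and have a non-empty provider code.
    if name.endswith(SUFFIX) and len(name) > len(SUFFIX):
        return name[: -len(SUFFIX)]
    return None
```

For a directory named `wto-json-data`, the helper returns `wto`; for any name that does not follow the pattern, it returns `None`.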

## Revisions

Each storage directory is versioned using Git in order to track revisions.

## General constraints

### Minimal data

Data MUST NOT be stored if it adds no value or if it can be computed from any other data.

As a consequence:
- series names MUST NOT be generated when they are not provided by source data;
  DBnomics can generate a name from the dimensions values codes.

### Data stability

Any commit in the storage directory of a provider MUST reflect a change on the provider's side.

Data conversions MUST be stable: running a conversion script on the same source-data MUST NOT change converted data.

As a consequence:
- when series codes are generated from a dimensions `dict`, always use the same order;
- properties of JSON objects MUST be sorted alphabetically.
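
One way a fetcher could satisfy the sorting constraint is to let the standard `json` module sort the keys when writing files. The sketch below is an assumption about how a fetcher might do it, not a prescribed implementation; the helper name `dump_stable` and the indentation of 2 are illustrative choices.

```python
import json


def dump_stable(data: dict) -> str:
    """Serialize `data` with sorted keys so that equal data gives identical text."""
    # sort_keys=True sorts the properties of every nested object as well.
    return json.dumps(data, ensure_ascii=False, indent=2, sort_keys=True)


# The same content built in two different insertion orders.
first = dump_stable({"name": "Dataset 1", "code": "dataset1"})
second = dump_stable({"code": "dataset1", "name": "Dataset 1"})
assert first == second
```

Because the output no longer depends on the insertion order of the keys, running the conversion twice on the same source data produces byte-identical files and therefore no spurious Git commit.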

## `/provider.json`

This JSON file contains meta-data about the provider.

See [its JSON schema](./dbnomics_data_model/schemas/v0.8/provider.json).

## `/category_tree.json`

This JSON file contains a tree of categories whose leaves are datasets and whose internal nodes are categories.

This file is optional:
- if categories are provided by source data, it SHOULD exist;
- if it's missing, DBnomics will generate the tree as a list of datasets ordered lexicographically;
- it MUST NOT be written if it is identical to the generated list mentioned above (due to the general constraint about minimal data)

See [its JSON schema](./dbnomics_data_model/schemas/v0.8/category_tree.json).

## `/{dataset_code}/`

This directory contains data about a dataset of the provider.

- The directory name MUST be equal to the dataset code.

## `/{dataset_code}/dataset.json`

This JSON file contains meta-data about a dataset of the provider.

See [its JSON schema](./dbnomics_data_model/schemas/v0.8/dataset.json).

The `series` property is optional: see the [Storing time series](#storing-time-series) section.

## `/{dataset_code}/series.jsonl`

This [JSON-lines](http://jsonlines.org/) file contains meta-data about time series of a dataset of a provider.

Each line is a JSON object validated against [this JSON schema](./dbnomics_data_model/schemas/v0.8/series.json).

This file is optional: see the [Storing time series](#storing-time-series) section.

## `/{dataset_code}/{series_code}.tsv`

This [TSV](https://en.wikipedia.org/wiki/Tab-separated_values) file contains observations of a time series of a dataset of a provider.

These files are optional: see the [Storing time series](#storing-time-series) section.

## Constraints on time series

- With providers using series codes composed of dimensions values codes:
  - The separator MUST be '.' to be compatible with series codes masks. It is allowed to change the separator used originally by the provider. Example: [this commit on BIS](https://git.nomics.world/dbnomics-fetchers/bis-fetcher/commit/dce6f0caf32762aa859f657467161a397a9b60f6).
  - The parts of the series code MUST follow the order defined by `dimensions_codes_order`. Example: if `dimensions_codes_order = ["FREQ", "COUNTRY"]`, the series code MUST be `A.FR` and not `FR.A`.
  - When dimensions codes order is not defined by the provider, the lexicographic order of the dimensions codes SHOULD be used, and the `dimensions_codes_order` key MUST NOT be written. Example: if dimensions are `FREQ` and `COUNTRY`, the series code is `FR.A` because dimensions codes are sorted alphabetically: `["COUNTRY", "FREQ"]`.
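
The ordering rules above can be sketched as a small function. This is an illustration under the assumption that the fetcher holds the dimensions of a series as a `dict` mapping dimension codes to value codes; the function name `build_series_code` is hypothetical.

```python
from typing import Dict, List, Optional


def build_series_code(
    dimensions: Dict[str, str],
    dimensions_codes_order: Optional[List[str]] = None,
) -> str:
    """Join dimension value codes with '.' in a deterministic order."""
    # Without a provider-defined order, fall back to the lexicographic
    # order of the dimension codes.
    order = dimensions_codes_order or sorted(dimensions)
    return ".".join(dimensions[code] for code in order)


dimensions = {"FREQ": "A", "COUNTRY": "FR"}
print(build_series_code(dimensions, ["FREQ", "COUNTRY"]))  # A.FR
print(build_series_code(dimensions))  # FR.A
```

Deriving the order from the dimension codes rather than from the insertion order of the `dict` keeps the generated codes stable between two runs of the fetcher.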

## Constraints on TSV files

Note: The `✓` symbol means that a constraint is validated by the [validation script](./scripts/validate_storage_dir.py).

- TSV files MUST be encoded in UTF-8.
- ✓ The first two columns of the header MUST be named `PERIOD` and `VALUE`.
- ✓ Each row MUST have the same number of columns as the header.
- The values of the `PERIOD` column:
  - ✓ MUST respect a specific format:
    - `YYYY` for years
    - `YYYY-MM` for months (MUST be padded for `MM`)
    - `YYYY-MM-DD` for days (MUST be padded for `MM` and `DD`)
    - `YYYY-Q[1-4]` for year quarters
    - `YYYY-S[1-2]` for year semesters
    - `YYYY-W[01-53]` for year weeks (MUST be padded)
  - ✓ MUST NOT include average values using `M13` or `Q5` periods
  - MUST be consistent with the frequency (i.e. use `YYYY-Q[1-4]` for quarterly observations, not `YYYY-MM-DD`, even if those daily periods are 3 months apart)
- ✓ The `PERIOD` column MUST be sorted in ascending order.
- ✓ The values of the `VALUE` column MUST either:
  - follow the lexical representation of `decimal` in [XML Schema](https://www.w3.org/TR/xmlschema-2/#decimal): a non-empty finite-length sequence of decimal digits, optionally containing a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, "+" is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(es) can be omitted. For example: `-1.23`, `12678967.543233`, `+100000.00`, `210`.
  - OR be `NA` meaning "not available".
- TSV files MAY have supplementary columns in order to tag some observation values.
  - The values of these columns are free; an empty string `""` means no tag.
  - Reuse the values defined by the provider if possible; otherwise define values with the DBnomics team.
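
To make the constraints above concrete, here is a minimal sketch of how the header, period and value rules could be checked in Python. It is not the code of the validation script: the names `PERIOD_RE`, `VALUE_RE` and `check_tsv` are hypothetical, calendar validity (for example `2010-02-31`) is not checked, and the consistency between periods and frequency is out of scope.

```python
import re
from typing import List

# One alternative per period format listed above.
PERIOD_RE = re.compile(
    "|".join(
        [
            r"[0-9]{4}",  # year
            r"[0-9]{4}-(?:0[1-9]|1[0-2])",  # month, zero-padded
            r"[0-9]{4}-(?:0[1-9]|1[0-2])-(?:0[1-9]|[12][0-9]|3[01])",  # day
            r"[0-9]{4}-Q[1-4]",  # quarter
            r"[0-9]{4}-S[12]",  # semester
            r"[0-9]{4}-W(?:0[1-9]|[1-4][0-9]|5[0-3])",  # week, zero-padded
        ]
    )
)
# Lexical form of an XML Schema decimal: optional sign, digits, optional fraction.
VALUE_RE = re.compile(r"[+-]?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")


def check_tsv(text: str) -> List[str]:
    """Return error messages for TSV content; an empty list means no error found."""
    lines = text.splitlines()
    if not lines:
        return ["empty file"]
    header = lines[0].split("\t")
    errors = []
    if header[:2] != ["PERIOD", "VALUE"]:
        errors.append("line 1: first two columns must be PERIOD and VALUE")
    previous_period = None
    for number, line in enumerate(lines[1:], start=2):
        row = line.split("\t")
        if len(row) != len(header) or len(row) < 2:
            errors.append(f"line {number}: unexpected number of columns")
            continue
        period, value = row[0], row[1]
        if not PERIOD_RE.fullmatch(period):
            errors.append(f"line {number}: invalid period {period!r}")
        # String comparison orders periods correctly within a single format.
        if previous_period is not None and period < previous_period:
            errors.append(f"line {number}: periods are not in ascending order")
        previous_period = period
        if value != "NA" and not VALUE_RE.fullmatch(value):
            errors.append(f"line {number}: invalid value {value!r}")
    return errors
```

Periods such as `2011-M13` or `2010-Q5` match none of the alternatives, so they are reported as invalid without needing a dedicated rule.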
## Storing time series

### Meta-data

Time series meta-data can be stored either:
- in `{dataset_code}/dataset.json` under the `series` property as a JSON array of objects
- in `{dataset_code}/series.jsonl`, a [JSON-lines](http://jsonlines.org/) file, each line being a (non-indented) JSON object

When a dataset contains a huge number of time series, the `dataset.json` file grows drastically. In this case, the `series.jsonl` format is recommended because parsing a JSON-lines file line by line consumes less memory than loading a whole JSON file. A maximum of 1000 time series in `dataset.json` is recommended.

Whatever format you choose, the JSON objects are validated against [this JSON schema](./dbnomics_data_model/schemas/v0.8/series.json).

Examples:
- [this dataset](./tests/fixtures/provider1-json-data/dataset1) stores time series meta-data in `dataset.json` under the `series` property
- [this dataset](./tests/fixtures/provider2-json-data/dataset1) stores time series meta-data in `series.jsonl`
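
The memory argument can be illustrated with a generator that reads `series.jsonl` one line at a time, so that only one series object is held in memory at once. This is a sketch, not part of the DBnomics code base; the function name `iter_series` is hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_series(dataset_dir: Path) -> Iterator[dict]:
    """Yield series objects one at a time from `series.jsonl`."""
    with (dataset_dir / "series.jsonl").open(encoding="utf-8") as lines:
        for line in lines:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)
```

By contrast, reading the `series` property of `dataset.json` requires parsing the whole file before the first series becomes available.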

### Observations

Time series observations can be stored either:
- in `{dataset_code}/{series_code}.tsv` [TSV](https://en.wikipedia.org/wiki/Tab-separated_values) files
- in `{dataset_code}/series.jsonl`, a [JSON-lines](http://jsonlines.org/) file, each line being a (non-indented) JSON object, under the `observations` property of each object.

When a dataset contains a huge number of time series, the number of TSV files grows drastically. In this case, the `series.jsonl` format is recommended because a single file consumes less disk space than thousands of files (each file takes a few kilobytes in the file-system table of contents), and because Git gets slower as the number of committed files increases. A maximum of 1000 TSV files is recommended.

Whatever format you choose, the JSON objects are validated against [this JSON schema](./dbnomics_data_model/schemas/v0.8/series.json).

Examples:
- [this dataset](./tests/fixtures/provider2-json-data/dataset1) stores observations in TSV files
- [this dataset](./tests/fixtures/provider2-json-data/dataset2) stores observations in `series.jsonl`

## Data validation

DBnomics-data-model comes with a validation script.

Validate a JSON data Git repository:

```sh
./scripts/validate_storage_dir.py <storage_dir>

# for example:
./scripts/validate_storage_dir.py wto-json-data
```

Note that some of the constraints expressed above are not yet checked by the validation script.

## Testing

Run unit tests:

```sh
python setup.py test
```

Run validation script against dummy providers:

    ./scripts/validate_storage_dir.py tests/fixtures/provider1-json-data
    ./scripts/validate_storage_dir.py tests/fixtures/provider2-json-data

See [CHANGELOG.md](./CHANGELOG.md). It contains an upgrade guide explaining how to modify the source code of your fetcher, if the data model changes in unexpected ways.