stages:
  - test
  - deploy
  - validate

Test:
  stage: test
  image: python:3.7
  only:
    - pushes
  before_script:
    - pip install --editable .
    - pip install pytest
  script:
    - pytest

Publish on PyPI:
  stage: deploy
  image: python:3.7
@@ -18,7 +34,7 @@ Publish on PyPI:
    url: https://pypi.org/project/dbnomics-data-model/$CI_COMMIT_TAG

Validate:
-  stage: test
+  stage: validate
  except:
    - pushes
  tags:
......
# Changelog

### 0.13.9

Non-breaking changes:

- Fix weeks handling (cf https://git.nomics.world/dbnomics-fetchers/management/issues/635)

### 0.13.8

Non-breaking changes in validation script:

- Make `dataset-not-found-in-category-tree` error a warning

### 0.13.7

Non-breaking changes in validation script:

- Make `no-observations` error a warning

### 0.13.6

Non-breaking changes in validation script:

- Add `--developer-mode` option (disabled by default); when it is not set, some errors like `duplicated-series-name` are ignored
- Fix a crash when the dataset code is not defined in the data
- Add a new `duplicated-observations-period` check

### 0.13.5

Non-breaking changes in validation script:
......
@@ -149,7 +149,7 @@ Time series meta-data can be stored either:

- in `{dataset_code}/dataset.json` under the `series` property as a JSON array of objects
- in `{dataset_code}/series.jsonl`, a [JSON-lines](http://jsonlines.org/) file, each line being a (non-indented) JSON object

-When a dataset contains a huge number of time series, the `dataset.json` file grows drastically. In this case, the `series.jsonl` format is recommended because parsing a JSON-lines file line-by-line consumes less memory than opening a whole JSON file. A maximum limit of 1000 time series in `dataset.json` is recommended.
+When a dataset contains a huge number of time series, the `dataset.json` file grows drastically. In this case, the `series.jsonl` format is recommended, because parsing a JSON-lines file line by line consumes less memory than loading a whole JSON file. A maximum of 1000 time series in `dataset.json` is recommended. In this case, the `series` property of the `dataset.json` file should be `{"path": "series.jsonl"}`.

Whatever format you choose, the JSON objects are validated against [this JSON schema](./dbnomics_data_model/schemas/v0.8/series.json).
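For instance (a minimal sketch: the `code` and `name` values are illustrative, only the `series` value is prescribed above), a `dataset.json` that delegates its series metadata to `series.jsonl` would contain:

```json
{
    "code": "dataset2",
    "name": "Example dataset",
    "series": {"path": "series.jsonl"}
}
```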
@@ -186,6 +186,31 @@ It is possible to encode this order in `dataset.json` like this:
Another case is when the dimension values represent units, and we want to order them from the smallest to the largest. For example: "millimeter", "centimeter", "meter", "kilometer".
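As a sketch of one possible encoding (hypothetical: the exact property layout is defined by the JSON schema referenced above), an ordered array of `[code, label]` pairs preserves the intended order, which a plain JSON object does not guarantee:

```json
"dimensions_values_labels": {
    "unit": [
        ["mm", "millimeter"],
        ["cm", "centimeter"],
        ["m", "meter"],
        ["km", "kilometer"]
    ]
}
```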
### Series attributes
In conjunction with dimensions, series can have `attributes`. They behave like dimensions: both are defined by codes and labels.

Example (from `provider1-json-data/dataset2/dataset.json`):
- in `dataset.json`:
```json
"attributes_labels": {
"UNIT_MULT": "Unit of multiplier"
},
"attributes_values_labels": {
"UNIT_MULT": {
"9": "× 10^9"
}
},
```
- then, for each series (in `dataset.json` or `series.jsonl` files):
```json
"attributes": {
"UNIT_MULT": "9"
},
```
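Reading these back, a consumer joins a series' attribute codes with the labels declared at the dataset level. A minimal sketch (hypothetical helper, not part of this package's API):

```python
def attribute_label(dataset_json, series_json, attribute_code):
    """Return (attribute label, value label) for one series attribute,
    falling back to the raw codes when no label is declared."""
    value_code = series_json["attributes"][attribute_code]
    label = dataset_json.get("attributes_labels", {}).get(attribute_code, attribute_code)
    value_label = (dataset_json.get("attributes_values_labels", {})
                   .get(attribute_code, {})
                   .get(value_code, value_code))
    return label, value_label


# With the example above:
dataset_json = {
    "attributes_labels": {"UNIT_MULT": "Unit of multiplier"},
    "attributes_values_labels": {"UNIT_MULT": {"9": "× 10^9"}},
}
series_json = {"attributes": {"UNIT_MULT": "9"}}
print(attribute_label(dataset_json, series_json, "UNIT_MULT"))
# -> ('Unit of multiplier', '× 10^9')
```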
### Observations
Time-series observations can be stored either:
@@ -224,6 +249,8 @@ dbnomics-validate wto-json-data

Note that some of the constraints expressed above are not yet checked by the validation script.

+Some errors are demoted to warnings and are not displayed by default. Use the `--developer-mode` option to display all errors.
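For example, reusing the storage directory from the command above:

```bash
# default run: warning-level errors (e.g. no-observations) are hidden
dbnomics-validate wto-json-data

# developer mode: report all errors, including warnings
dbnomics-validate --developer-mode wto-json-data
```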
## Testing
Run unit tests:
......
@@ -39,6 +39,8 @@ from dbnomics_data_model.observations import NOT_AVAILABLE, detect_period_format

log = logging.getLogger(__name__)

+# List of codes to ignore when not in "developer mode" (--developer-mode) (#646)
+WARNING_CODES = {"dataset-not-found-in-category-tree", "duplicated-series-name", "no-observations"}


def main():
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
@@ -56,6 +58,7 @@ def main():
    parser.add_argument('--all-observations', action='store_true', help='validate all observations')
    parser.add_argument('--max-observations', type=int, default=100,
                        help='Max number of observations to validate per series')
+    parser.add_argument('--developer-mode', action='store_true', help='check all possible errors')
    args = parser.parse_args()

    numeric_level = getattr(logging, args.log.upper(), None)
@@ -100,17 +103,20 @@ def main():
        return -1

    errors_codes = defaultdict(int)
+    ignore_errors = args.ignore_errors
+    if not args.developer_mode:
+        ignore_errors += WARNING_CODES
    try:
        log.debug("Validating provider...")
-        _, provider_errors = validate_provider(storage, ignore_errors=args.ignore_errors,
+        _, provider_errors = validate_provider(storage, ignore_errors=ignore_errors,
                                                storage_dir_name=args.storage_dir.name)
        for error in provider_errors:
            errors_codes[error['error_code']] += 1
            print(format_error(error, output_format=args.format))

        log.debug("Validating category tree...")
-        category_tree_errors = validate_category_tree(storage, ignore_errors=args.ignore_errors)
+        category_tree_errors = validate_category_tree(storage, ignore_errors=ignore_errors)
        for error in category_tree_errors:
            errors_codes[error['error_code']] += 1
            print(format_error(error, output_format=args.format))
@@ -123,13 +129,13 @@ def main():
                continue

            log.debug("Validating dataset %s (%d/%d) (except its series)...", dataset_code, dataset_index, nb_datasets)
-            _, dataset_series, dataset_errors = validate_dataset(dataset_dir, ignore_errors=args.ignore_errors)
+            _, dataset_series, dataset_errors = validate_dataset(dataset_dir, ignore_errors=ignore_errors)
            for error in dataset_errors:
                errors_codes[error['error_code']] += 1
                print(format_error(error, output_format=args.format))

            log.debug("Validating series of dataset %r...", dataset_code)
-            series_errors = validate_series(dataset_dir, dataset_series, ignore_errors=args.ignore_errors,
+            series_errors = validate_series(dataset_dir, dataset_series, ignore_errors=ignore_errors,
                                            max_series=args.max_series, max_observations=args.max_observations)
            for error in series_errors:
                errors_codes[error['error_code']] += 1
@@ -316,7 +322,7 @@ def validate_dataset(dataset_dir, ignore_errors=[]):

    # Dataset directory name MUST be the dataset code.
    error_code = "invalid-dataset-directory-name"
-    if error_code not in ignore_errors and dataset_json["code"] != dataset_code:
+    if error_code not in ignore_errors and "code" in dataset_json and dataset_json["code"] != dataset_code:
        errors.append({
            "error_code": error_code,
            "message": "Dataset code from dataset.json is different than the directory name",
......
@@ -45,7 +45,7 @@ with readme_filepath.open('rt', encoding='utf-8') as fd:

setup(
    name='dbnomics-data-model',
-    version='0.13.5',
+    version='0.13.9',
    author='DBnomics Team',
    author_email='contact@nomics.world',
......