Outdated!
Warning: this is old documentation. Some parts may still be valid, some not. The only reference for the data format is the sample-json-data-tree directory in the dbnomics-data-model repo.
This means that the format of the files a converter must write has changed since this document was written. Overall, the information in the files is the same, but it is organized differently across files and directories. That said, this page is still a good introduction to the basics of the DBnomics vocabulary.
Write a new converter
The aim of this page is to describe a conversion process from source_data to json_data free, starting from a dummy dataset TSV file.
Categories won't be covered here.
Source data
Let's consider the following data, in a TSV file:
Country ccode Flow fcode year total
France FR Import I 2010 83791
France FR Import I 2011 83332
France FR Import I 2012 82001
Belgium BE Import I 2010 33290
Belgium BE Import I 2011 36002
Belgium BE Import I 2012 39332
Italy IT Import I 2009 ...
Italy IT Import I 2010 ...
Italy IT Import I 2011 77266
Italy IT Import I 2012 89022
France FR Export E 2010 23982
France FR Export E 2011 23777
France FR Export E 2012 24000
Belgium BE Export E 2010 ...
Belgium BE Export E 2011 13277
Belgium BE Export E 2012 14002
Italy IT Export E 2009 ...
Italy IT Export E 2010 ...
Italy IT Export E 2011 59288
Italy IT Export E 2012 61300
Note: some values are unknown for Italy and Belgium, i.e. the provider does not know them. In this dataset, unknown values are represented by the string "...".
This dataset contains only 2 dimensions:
- Country
- Flow
Fixing a value for each of those dimensions makes it possible to extract a series. For example, the series corresponding to Country = 'Belgium' and Flow = 'Export' is:
2010 ...
2011 13277
2012 14002
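The extraction described above can be sketched in Python as a grouping of rows by their dimension values. This is only an illustration: the function name `group_series` and the shortened inline sample are assumptions made for this sketch, not part of any DBnomics library.

```python
import csv
import io
from collections import defaultdict

# A shortened, tab-separated copy of the sample data (assumed layout).
SOURCE_TSV = (
    "Country\tccode\tFlow\tfcode\tyear\ttotal\n"
    "Belgium\tBE\tExport\tE\t2010\t...\n"
    "Belgium\tBE\tExport\tE\t2011\t13277\n"
    "Belgium\tBE\tExport\tE\t2012\t14002\n"
    "France\tFR\tExport\tE\t2010\t23982\n"
)


def group_series(tsv_text):
    """Group observations by their (Country, Flow) dimension values."""
    series = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        key = (row["Country"], row["Flow"])
        # Values stay strings so that the unknown marker "..." is preserved.
        series[key].append((row["year"], row["total"]))
    return dict(series)


series_by_key = group_series(SOURCE_TSV)
print(series_by_key[("Belgium", "Export")])
# [('2010', '...'), ('2011', '13277'), ('2012', '14002')]
```

Each key of the resulting dict identifies one series, and each value holds its observations in file order.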
Files tree
The output tree for this dataset should be:
category_name
├── dataset.json
├── Export-Belgium
│ ├── observations.tsv
│ └── series.json
├── Export-France
│ ├── observations.tsv
│ └── series.json
├── Export-Italy
│ ├── observations.tsv
│ └── series.json
├── Import-Belgium
│ ├── observations.tsv
│ └── series.json
├── Import-France
│ ├── observations.tsv
│ └── series.json
└── Import-Italy
  ├── observations.tsv
  └── series.json
=> Remember that this is only a part of the full tree produced by a parser; categories are not covered here.
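As a rough sketch of how this layout follows from the two dimensions, the snippet below lists the relative paths of the tree. The helper `expected_paths` is hypothetical, and naming each directory "Flow-Country" simply mirrors the tree shown above; it is not a rule stated by DBnomics.

```python
from itertools import product
from pathlib import PurePosixPath

FLOWS = ["Export", "Import"]
COUNTRIES = ["Belgium", "France", "Italy"]


def expected_paths(dataset_dir_name):
    """List the relative paths a converter would write for this dataset."""
    root = PurePosixPath(dataset_dir_name)
    paths = [root / "dataset.json"]
    # One directory per combination of dimension values, i.e. per series.
    for flow, country in product(FLOWS, COUNTRIES):
        series_dir = root / f"{flow}-{country}"
        paths.append(series_dir / "observations.tsv")
        paths.append(series_dir / "series.json")
    return [str(path) for path in paths]


for path in expected_paths("category_name"):
    print(path)
```

Two flows and three countries give six series directories, hence thirteen files including dataset.json.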
DBnomics vocabulary
For dimensions and dimension values ("France" is a value of the dimension "Country"), we use the terms label and code.
- label: human-readable version
- code: used for indexing (slugified label if the provider gives no code)
Note: in DBnomics, we use geo as the code for the "Country" dimension. So the "Country" dimension has:
- dimension_label: "Country"
- dimension_code: "geo"
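The fallback "slugified label if no code given by provider" could look like the sketch below. Both `slugify` and `code_for` are hypothetical helpers written for this page; the slugification rules actually used by DBnomics converters may differ.

```python
import re
import unicodedata


def slugify(label):
    """Turn a human-readable label into a lowercase, hyphen-separated code."""
    # Drop accents, then replace each run of non-alphanumerics with one hyphen.
    ascii_label = (
        unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_label.lower()).strip("-")


def code_for(label, provider_code=None):
    """Use the provider's code when there is one, else fall back to a slug."""
    return provider_code if provider_code else slugify(label)


print(code_for("Import", "I"))          # I
print(code_for("Full-time Employees"))  # full-time-employees
```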
series.json
The series.json file for Country = 'Belgium' and Flow = 'Export' should be:
{
  "dimensions": {
    "geo": "bel",
    "flow": "E"
  },
  "frequency": "A",
  "code": "ITA.1.0.0.0.ZNAWRU",
  "name": "Belgium exports",
  "unknown_value": "..."
}
Notes:
- "dimensions" is a dict of dimension_code: dimension_value_code; its aim is to give the dimension values of this series
- the series directory name does not have to be equal to the series key
- the unknown_value key gives the representation of unknown values in this series
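A minimal sketch of how a converter might assemble such a file is given below. The function `build_series_json` and the series code "E.bel" are assumptions made for illustration; only the key names come from the example above.

```python
import json


def build_series_json(code, name, dimensions, frequency="A", unknown_value="..."):
    """Assemble the dict that would be serialized as series.json."""
    return {
        "dimensions": dimensions,  # dimension_code -> dimension_value_code
        "frequency": frequency,
        "code": code,
        "name": name,
        "unknown_value": unknown_value,
    }


series_json = build_series_json(
    code="E.bel",  # hypothetical series code, chosen for this sketch only
    name="Belgium exports",
    dimensions={"geo": "bel", "flow": "E"},
)
print(json.dumps(series_json, indent=2))
```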
observations.tsv
The observations.tsv file for this series should be:
YEAR EUR
2010 ...
2011 13277
2012 14002
The file must have a header (its first line). The exact header values are left to the developer of the fetcher, as decided in this technical committee.
See a real-world example here.
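Writing this file can be sketched with the standard csv module. The function name `observations_tsv` is an assumption, and the header ("YEAR", "EUR") is just the one used in the example above, since the exact values are left to the developer.

```python
import csv
import io


def observations_tsv(observations, header=("YEAR", "EUR")):
    """Render (period, value) pairs as TSV text, header line first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    # Unknown values such as "..." are written through unchanged.
    writer.writerows(observations)
    return buffer.getvalue()


text = observations_tsv([("2010", "..."), ("2011", "13277"), ("2012", "14002")])
print(text, end="")
```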
dataset.json
The dataset's dataset.json should be as follows. Comments have been added for understanding, even though they make the JSON invalid.
{
  "dimensions_values_labels": {
    // dimension_code: {dimension_value_code: dimension_value_label}
    "flow": {
      // dimension_value_code: dimension_value_label
      "I": "Import",
      "E": "Export"
    },
    "geo": {
      // dimension_value_code: dimension_value_label
      "fra": "France",
      "ita": "Italy",
      "bel": "Belgium"
    }
  },
  "dimensions_labels": {
    // dimension_code: dimension_label
    "freq": "Frequency",
    "geo": "Country",
    "unit": "Unit"
  },
  "dimensions_codes_order": [
    // dimensions_codes
    "freq",
    "geo",
    "unit"
  ],
  "code": "FWTD",
  // Human-readable dataset name
  "name": "Employees, full-time equivalents: total economy (National accounts)"
}
Notes:
- the terms codelists and concepts come from the SDMX standard
- We didn't use the dimensions_codes given by the provider for the dimension "Country" (i.e. the dimension with dimension_label = "Country"): we used "geo" in DBnomics
- We didn't use the dimensions_values_codes given by the provider for the dimension_values_labels "France", "Italy" and "Belgium": we used "fra", "ita" and "bel" (not "FR", "IT" and "BE" as given in the source file)
- We used the dimensions_values_codes given by the provider for the "flow" dimension ("I" and "E"). We could have chosen something else.
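The choices listed in these notes can be sketched as follows: provider codes are kept for "flow" and replaced for "geo". The lookup table `GEO_CODES` and the function below are hypothetical and exist only to illustrate the mapping; the row keys match the column names of the source file.

```python
# Hypothetical lookup from the provider's country codes to the codes used in DBnomics.
GEO_CODES = {"FR": "fra", "IT": "ita", "BE": "bel"}

# One row per (country, flow) pair is enough to collect the labels.
ROWS = [
    {"Country": "France", "ccode": "FR", "Flow": "Import", "fcode": "I"},
    {"Country": "Belgium", "ccode": "BE", "Flow": "Export", "fcode": "E"},
    {"Country": "Italy", "ccode": "IT", "Flow": "Export", "fcode": "E"},
]


def dimensions_values_labels(rows):
    """Collect {dimension_code: {dimension_value_code: dimension_value_label}}."""
    labels = {"flow": {}, "geo": {}}
    for row in rows:
        labels["flow"][row["fcode"]] = row["Flow"]  # provider code kept as-is
        labels["geo"][GEO_CODES[row["ccode"]]] = row["Country"]  # provider code replaced
    return labels


print(dimensions_values_labels(ROWS))
```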
In a nutshell
To summarize the terms introduced here:
- a dimension (example: "Country") has:
  - a dimension_code: "geo"
  - a dimension_label: "Country"
- a dimension_value (example: "France") has:
  - a dimension_value_code: "fra" ("FR" in the source file)
  - a dimension_value_label: "France"