
Last edited by Christophe Benz Feb 12, 2018

write a new converter

Outdated!

** Warning: this documentation is old. Some parts may still be valid, others not. The only reference for the data format is the sample-json-data-tree directory in the dbnomics-data-model repo. **

This means that the format of the files written by a converter has changed since this document was written. Overall, the files contain the same information, but it is organized differently across files and directories. That said, this page is still a good introduction to the basics of the DBnomics vocabulary.

Write a new converter

The aim of this page is to describe the conversion process from source_data to a json_data tree, starting from a dummy dataset stored in a TSV file.

Categories won't be covered here.

Source data

Let's consider the following data, in a TSV file:

Country	ccode	Flow	fcode	year	total
France	FR	Import	I	2010	83791
France	FR	Import	I	2011	83332
France	FR	Import	I	2012	82001
Belgium	BE	Import	I	2010	33290
Belgium	BE	Import	I	2011	36002
Belgium	BE	Import	I	2012	39332
Italy	IT	Import	I	2009	...
Italy	IT	Import	I	2010	...
Italy	IT	Import	I	2011	77266
Italy	IT	Import	I	2012	89022
France	FR	Export	E	2010	23982
France	FR	Export	E	2011	23777
France	FR	Export	E	2012	24000
Belgium	BE	Export	E	2010	...
Belgium	BE	Export	E	2011	13277
Belgium	BE	Export	E	2012	14002
Italy	IT	Export	E	2009	...
Italy	IT	Export	E	2010	...
Italy	IT	Export	E	2011	59288
Italy	IT	Export	E	2012	61300

Note: some values are unknown (for Italy, and for Belgium's 2010 exports), i.e. the provider does not know them. In this dataset, unknown values are represented by the string "...".

This dataset contains only 2 dimensions:

  • Country
  • Flow

Fixing one value for each of these dimensions selects a single series. For example, the series corresponding to Country = 'Belgium' and Flow = 'Export' is:

2010	...
2011	13277
2012	14002
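
As an illustration, the extraction described above can be sketched in Python by grouping the rows of the TSV on the two dimension columns. This is only a minimal sketch: the helper name group_series is hypothetical (it is not part of any DBnomics library), and only an excerpt of the source data is embedded so that the example is self-contained.

```python
import csv
import io
from collections import defaultdict

# Excerpt of the source data above (tab-separated), embedded for self-containment.
SOURCE_TSV = (
    "Country\tccode\tFlow\tfcode\tyear\ttotal\n"
    "France\tFR\tExport\tE\t2010\t23982\n"
    "Belgium\tBE\tExport\tE\t2010\t...\n"
    "Belgium\tBE\tExport\tE\t2011\t13277\n"
    "Belgium\tBE\tExport\tE\t2012\t14002\n"
)


def group_series(tsv_text):
    """Group (year, total) pairs by (Country, Flow). Hypothetical helper."""
    series = defaultdict(list)
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # One series per combination of dimension values.
        series[(row["Country"], row["Flow"])].append((row["year"], row["total"]))
    return dict(series)


series_by_dims = group_series(SOURCE_TSV)
print(series_by_dims[("Belgium", "Export")])
# → [('2010', '...'), ('2011', '13277'), ('2012', '14002')]
```

Values are kept as strings on purpose, so that the unknown-value marker "..." passes through untouched.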

Files tree

The output tree for this dataset should be:

category_name
├── dataset.json
├── Export-Belgium
│   ├── observations.tsv
│   └── series.json
├── Export-France
│   ├── observations.tsv
│   └── series.json
├── Export-Italy
│   ├── observations.tsv
│   └── series.json
├── Import-Belgium
│   ├── observations.tsv
│   └── series.json
├── Import-France
│   ├── observations.tsv
│   └── series.json
└── Import-Italy
    ├── observations.tsv
    └── series.json

=> Remember that this is only a part of the full tree produced by a parser; categories are not covered here.
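
One possible way to create the series directories of this tree is sketched below. The "Flow-Country" naming simply mirrors the tree shown above; the function name create_series_dirs and the two constant lists are assumptions made for the example, and the sketch writes into a temporary directory rather than into a real json_data repository.

```python
import tempfile
from itertools import product
from pathlib import Path

# Dimension value labels taken from the source data above.
FLOWS = ["Export", "Import"]
COUNTRIES = ["Belgium", "France", "Italy"]


def create_series_dirs(dataset_dir):
    """Create one directory per (flow, country) pair. Hypothetical helper."""
    dataset_dir = Path(dataset_dir)
    created = []
    for flow, country in product(FLOWS, COUNTRIES):
        series_dir = dataset_dir / f"{flow}-{country}"
        series_dir.mkdir(parents=True, exist_ok=True)
        created.append(series_dir.name)
    return created


with tempfile.TemporaryDirectory() as tmp:
    names = create_series_dirs(Path(tmp) / "category_name")

print(names)
```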

DBnomics vocabulary

For dimensions and dimension values ("France" is a value of the "Country" dimension), we use the terms label and code.

  • label: human-readable version
  • code: used for indexing (slugified label if the provider gives no code)

Note: in DBnomics, we use geo as the code for the "Country" dimension. So the "Country" dimension has:

  • dimension_label: "Country"
  • dimension_code: "geo"
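
The fallback mentioned above (a slugified label when the provider gives no code) could look like the following sketch. The exact slug rules used by DBnomics are not specified on this page, so the slugify function here is an assumption: it lower-cases the label, strips accents and joins words with hyphens.

```python
import re
import unicodedata


def slugify(label):
    """Hypothetical slug rule: ASCII, lower case, hyphen-separated."""
    ascii_label = (
        unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_label.lower()).strip("-")


def dimension_value_code(label, provider_code=None):
    """Use the provider's code when there is one, else fall back to a slug."""
    return provider_code if provider_code else slugify(label)


print(dimension_value_code("Export", "E"))  # → E
print(dimension_value_code("Import"))       # → import
```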

series.json

The series.json file for Country = 'Belgium' and Flow = 'Export' should be:

{
  "dimensions": {
    "geo": "bel",
    "flow": "E"
  },
  "frequency": "A",
  "code": "ITA.1.0.0.0.ZNAWRU",
  "name": "Belgium exports",
  "unknown_value": "..."
}

Notes:

  • "dimensions" is a dict mapping each dimension_code to a dimension_value_code; it lists the dimension values that define this series
  • the series directory name does not have to be equal to the series key
  • the unknown_value key gives the representation of unknown values in this series
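
A minimal sketch of how a converter might write this file is given below. The dict values are copied from the example above; the helper name write_series_json is hypothetical, and the sketch writes into a temporary directory so that it can be run anywhere.

```python
import json
import tempfile
from pathlib import Path

# Values copied from the series.json example above.
BELGIUM_EXPORTS = {
    "dimensions": {"geo": "bel", "flow": "E"},
    "frequency": "A",
    "code": "ITA.1.0.0.0.ZNAWRU",
    "name": "Belgium exports",
    "unknown_value": "...",
}


def write_series_json(series_dir, series):
    """Write `series` as series.json inside `series_dir`. Hypothetical helper."""
    series_dir = Path(series_dir)
    series_dir.mkdir(parents=True, exist_ok=True)
    path = series_dir / "series.json"
    path.write_text(json.dumps(series, indent=2) + "\n", encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = write_series_json(Path(tmp) / "Export-Belgium", BELGIUM_EXPORTS)
    reloaded = json.loads(written.read_text(encoding="utf-8"))
```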

observations.tsv

The observations.tsv file for this series should be:

YEAR	EUR
2010	...
2011	13277
2012	14002

The file must start with a header line. The exact header values are left to the developer of the fetcher, as decided by the technical committee.

See a real-world example here.
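
For illustration, the observations file above could be produced with the standard csv module, as in this sketch. The function name observations_tsv is hypothetical, and the default header values ("YEAR", "EUR") are only those of the example, since the header content is left to the fetcher developer.

```python
import csv
import io


def observations_tsv(observations, header=("YEAR", "EUR")):
    """Return the content of observations.tsv as text. Hypothetical helper."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)          # the header is always the first line
    writer.writerows(observations)   # unknown values are written as-is ("...")
    return buffer.getvalue()


text = observations_tsv([("2010", "..."), ("2011", "13277"), ("2012", "14002")])
print(text, end="")
```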

dataset.json

The dataset's dataset.json should be:

Comments have been added for clarity, although they are not valid JSON.

{
  "dimensions_values_labels": {
    // dimensions_codes: {dimension_value_code, dimension_value_label}
    "flow": {
      // dimension_value_code: dimension_value_label
      "I": "Import",
      "E": "Export"
    },
    "geo": {
      // dimension_value_code: dimension_value_label
      "fra": "France",
      "ita": "Italy",
      "bel": "Belgium"
    }
  },
  "dimensions_labels": {
    // dimension_code: dimension_label
    "flow": "Flow",
    "geo": "Country"
  },
  "dimensions_codes_order": [
    // dimensions_codes
    "flow",
    "geo"
  ],
  // dataset code
  "code": "FWTD",
  // human-readable dataset name
  "name": "Employees, full-time equivalents: total economy (National accounts)"
}

Notes:

  • the terms codelists and concepts come from the SDMX standard
  • We didn't use the dimension code given by the provider for the "Country" dimension (i.e. the dimension with dimension_label="Country"): we used "geo" in DBnomics
  • We didn't use the dimension value codes given by the provider for the dimension value labels "France", "Italy" and "Belgium": we used "fra", "ita" and "bel" (not "FR", "IT" and "BE" as given in the source file)
  • We used the dimension value codes given by the provider for the "flow" dimension ("I" and "E"). We could have chosen something else.
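
The mapping choices listed in these notes can be sketched as follows. This is an illustrative assumption about how a converter might build the "dimensions_values_labels" part of dataset.json from the source rows; the names GEO_CODES and build_dimensions_values_labels are hypothetical.

```python
# Provider country codes mapped to the codes chosen for DBnomics on this page.
GEO_CODES = {"FR": "fra", "IT": "ita", "BE": "bel"}

# A few rows of the source file, reduced to the dimension columns.
rows = [
    {"Country": "France", "ccode": "FR", "Flow": "Import", "fcode": "I"},
    {"Country": "Italy", "ccode": "IT", "Flow": "Import", "fcode": "I"},
    {"Country": "Belgium", "ccode": "BE", "Flow": "Export", "fcode": "E"},
]


def build_dimensions_values_labels(rows):
    """Collect {dimension_code: {value_code: value_label}}. Hypothetical helper."""
    labels = {"flow": {}, "geo": {}}
    for row in rows:
        # "flow" keeps the provider's codes; "geo" uses the DBnomics codes.
        labels["flow"][row["fcode"]] = row["Flow"]
        labels["geo"][GEO_CODES[row["ccode"]]] = row["Country"]
    return labels


dimensions_values_labels = build_dimensions_values_labels(rows)
print(dimensions_values_labels)
```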

In a nutshell

To summarize terms introduced here:

  • a dimension (example: "Country") has:
    • a dimension_code: "geo"
    • a dimension_label: "Country"
  • a dimension_value (example: "France") has:
    • a dimension_value_code: "fra" (the provider's code is "FR")
    • a dimension_value_label: "France"