DBnomics Technical Committee meeting
September 29, 2017 16:00-17:00
Christophe Benz (DBnomics), Thomas Brand (Cepremap), Michel Juillard (Banque de France), Julien Lasselot (Banque de France), Constance de Quatrebarbes (DBnomics), Johan Richer (DBnomics)
Decisions or propositions of solutions
Can we factorize code for Excel file parsing?
The developer has the last word on this issue. Not a matter treated during Analysis.
To be discussed during the next Technical Committee
API: 50€ per year to get access to tables; 500€ per year to get access to linear files. Metadata is not guaranteed to be in English. Question: do we want to spend this kind of money and end up with a segment of the database in German? Decisions:
- Look into the feasibility of using just the website (scraping)
- Contact Destatis to know exactly what we get access to by paying 500€ per year.
What to do with missing and unknown values?
Problem: How does the API know which values should be interpreted as missing or unknown? Propositions:
- Store in the metadata of a series the values that should be interpreted as 'missing' or 'unknown' (e.g. NaN, N/A, Null, -1, 9999, etc.)
- Keep as is (period and symbol of value)
- Convert to a standard value for unknown (e.g. NaN)
- Give value of your choice
- Remove the missing value
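To make the first and third propositions concrete, here is a minimal Python sketch that combines them: the markers are read from the series metadata and converted to NaN. The `missing_values` metadata field and the `normalize_observations` function are hypothetical names used for illustration only; nothing was decided on this point.

```python
import math


def normalize_observations(observations, series_metadata):
    """Replace provider-specific missing markers with NaN.

    `observations` is a list of (period, raw_value) pairs, raw values as strings.
    `series_metadata["missing_values"]` is a hypothetical field listing the
    raw values that the provider uses for 'missing' or 'unknown'.
    """
    markers = set(series_metadata.get("missing_values", []))
    normalized = []
    for period, raw in observations:
        if raw in markers:
            # Standard value for unknown, as in the third proposition.
            normalized.append((period, math.nan))
        else:
            normalized.append((period, float(raw)))
    return normalized


series_metadata = {"missing_values": ["N/A", "-1", "9999"]}
observations = [("2015", "1.5"), ("2016", "N/A"), ("2017", "9999")]
result = normalize_observations(observations, series_metadata)
```

Keeping the markers per series avoids treating a legitimate value such as -1 as missing in a series where the provider does not use it that way.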
Should we store web pages for categories?
To be decided fetcher by fetcher. The developer has the last word on this issue. Either hard-code the information or extract it from the source (HTML or file).
Should we use a numbering for categories_code like AMECO or let each fetcher choose?
Take the number given by the provider if one exists; otherwise make one up or use the label slug.
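One possible reading of this rule, sketched in Python. The function names, the `provider_number` and `fallback_index` parameters, and the slug format are assumptions made for the example; each fetcher remains free to choose.

```python
import re
import unicodedata


def slugify(label):
    """Build an ASCII, lowercase, dash-separated slug from a label."""
    ascii_label = (
        unicodedata.normalize("NFKD", label)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_label.lower()).strip("-")


def category_code(label, provider_number=None, fallback_index=None):
    """Pick a category code for a fetcher (hypothetical helper).

    1. Use the number given by the provider when there is one.
    2. Otherwise use a made-up number, if the fetcher supplies one.
    3. Otherwise fall back to the slug of the label.
    """
    if provider_number is not None:
        return str(provider_number)
    if fallback_index is not None:
        return str(fallback_index)
    return slugify(label)


code_from_provider = category_code("National accounts", provider_number=3)
code_from_label = category_code("Économie générale")
```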
Do users read the JSON repositories?
The answer is mainly no, and we shouldn't worry about the presentation for now.
README.md in the file tree
- When directory names (category, dataset and series) are too long, go for the simplest solution: if shortening them is harder than using codes, use codes.
How to standardize the observations.tsv file header?
The header is up to the developer for now.
Is it relevant to store an order for dimensions (i.e. the dataset.json property dimension_keys)? Or should we forget about it and display dimensions in lexicographic order in the UI?
Keep the order of dimensions. When a key exists in a series, use the same order as in the key. If not, put the most significant dimensions first. The order of dimensions should always be the order of the key. A last possibility is to have no specific dimension order.
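The ordering rule above can be sketched in Python as follows. The `order_dimensions` function, its `key_order` and `significance` parameters, and the lexicographic fallback for the "no specific order" case are assumptions made for illustration, not an agreed implementation.

```python
def order_dimensions(dimension_codes, key_order=None, significance=None):
    """Return the dimension codes in the order to store in dimension_keys.

    key_order: order of the dimensions in the series key, when a key exists.
    significance: hypothetical mapping from dimension code to rank
                  (lower rank means more significant).
    """
    if key_order:
        # A key exists: follow its order, then append any leftover dimension.
        in_key = [d for d in key_order if d in dimension_codes]
        rest = [d for d in dimension_codes if d not in key_order]
        return in_key + rest
    if significance:
        # No key: most significant dimensions first, unranked ones last.
        return sorted(
            dimension_codes,
            key=lambda d: significance.get(d, len(significance)),
        )
    # No specific order: fall back to lexicographic order.
    return sorted(dimension_codes)


from_key = order_dimensions(["geo", "freq", "unit"], key_order=["freq", "unit", "geo"])
from_rank = order_dimensions(["unit", "geo", "freq"], significance={"freq": 0, "geo": 1})
default = order_dimensions(["unit", "freq"])
```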