DBnomics Technical Committee meeting

September 29, 2017 16:00-17:00

Attendees

Christophe Benz, DBnomics
Thomas Brand, Cepremap
Michel Juillard, Banque de France
Julien Lasselot, Banque de France
Constance de Quatrebarbes, DBnomics
Johan Richer, DBnomics

Meeting preparations

Outstanding issues

Decisions or propositions of solutions

Can we factorize code for Excel file parsing?

The developer has the last word on this issue. It is not a matter to be treated during Analysis.

ONS

To be discussed during the next Technical Committee

Destatis

API: 50€ per year to get access to tables; 500€ to get access to linear files. Metadata is not guaranteed to be in English.

Question: do we want to spend this kind of money and, in the end, have a segment of the database in German?

Decisions:

Look into the feasibility of using just the website (scraping)
Contact Destatis to know exactly what we get access to by paying 500€ per year.

What to do with missing and unknown values?

Problem: How does the API know which values should be interpreted as missing or unknown?

Propositions:

Store in the metadata of a series the values that should be interpreted as 'missing' or 'unknown' (e.g. NaN, N/A, Null, -1, 9999, etc.)
Keep as is (period and symbol of value)
Convert to a standard value for unknown (e.g. NaN)
Give value of your choice
Remove the missing value
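
As an illustration of the first and third propositions combined, here is a minimal Python sketch. It assumes a hypothetical `missing_values` property in the series metadata (this name is not an existing DBnomics property) and converts every declared marker to NaN.

```python
import math

# Hypothetical series metadata: the "missing_values" key is an assumption
# made for this sketch, not an existing DBnomics property.
series_metadata = {
    "code": "EXAMPLE.SERIES",
    "missing_values": ["NaN", "N/A", "Null", "-1", "9999"],
}


def normalize_observation(raw_value, metadata):
    """Return a float, or NaN when the raw value is declared missing or unknown."""
    text = raw_value.strip()
    if text in metadata.get("missing_values", []):
        return math.nan
    return float(text)


# Made-up observations as (period, raw value) pairs.
raw_observations = [("2015", "1.2"), ("2016", "N/A"), ("2017", "9999")]
normalized = [
    (period, normalize_observation(value, series_metadata))
    for period, value in raw_observations
]
```

Declaring the markers per series matters because a value such as -1 or 9999 can be a legitimate observation in one series and a placeholder in another.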

Should we store web pages for categories?

To be decided fetcher by fetcher. The developer has the last word on this issue. Either hard-code the information or extract it from the source (HTML or file).

Should we use a numbering for `categories_code` like AMECO or let each fetcher choose?

Take the number given by the provider if one exists; otherwise make one up or use a slug of the label.
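
A possible reading of this rule, sketched in Python. The `provider_code` and `label` keys and both function names are hypothetical, chosen only for the example; the "make one up" branch is left out because the minutes do not say how such a number would be generated.

```python
import re
import unicodedata


def slugify(label):
    """Turn a human-readable label into a lowercase ASCII slug."""
    ascii_label = (
        unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_label.lower()).strip("-")


def category_code(category):
    """Prefer the provider's own number, else fall back to a slug of the label."""
    provider_code = category.get("provider_code")  # hypothetical key
    if provider_code:
        return str(provider_code)
    return slugify(category["label"])
```

For instance, a category with no provider number and the label "Population and Employment" would get the code `population-and-employment`.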

Do users read the JSON repositories?

The answer is mainly no, and we shouldn't worry about presentation for now.

Decisions:

Abandon README.md in the file tree
When directory names (category, dataset and series) are too long, go for the simpler solution: if shortening them is harder than using codes, use codes

How to standardize `observations.tsv` file header (`YEAR\t???`)?

The header is up to the developer for now.
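
While header names are not standardized, a consumer of `observations.tsv` could rely on column position rather than on names. The Python sketch below assumes that the first column holds the period and the second the value; the `VALUE` header and the sample rows are made up for the example.

```python
import csv
import io

# Made-up file content; "VALUE" is a placeholder header, not a decided name.
sample_tsv = "YEAR\tVALUE\n2015\t1.2\n2016\t1.5\n"


def read_observations(tsv_file):
    """Return the header row and the (period, value) pairs, read by position."""
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    rows = [(row[0], row[1]) for row in reader if row]
    return header, rows


header, rows = read_observations(io.StringIO(sample_tsv))
```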

Dimensions order

Is it relevant to store an order for dimensions (i.e. the `dimension_keys` property of `dataset.json`)? Or should we forget about it and display dimensions in lexicographic order in the UI?

Keep the order of dimensions. When a key exists in the series, use the same order as in the key. If not, put the most significant dimensions first. The order of dimensions should always be the order of the key. A last possibility is to have no specific dimension order.
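
The rule that the order of dimensions follows the order of the key can be sketched in Python as follows. Only `dimension_keys` comes from the discussion above; the dimension names, the sample values, and the dot separator are assumptions made for the example.

```python
# Hypothetical dataset.json content; only "dimension_keys" is named in the minutes.
dataset = {
    "code": "EXAMPLE",
    "dimension_keys": ["FREQ", "COUNTRY", "INDICATOR"],
}

# Dimensions of one series, deliberately given in a different order.
series_dimensions = {"INDICATOR": "GDP", "FREQ": "A", "COUNTRY": "FR"}


def ordered_dimensions(dataset, dimensions):
    """List (dimension, value) pairs in the order stored in dimension_keys."""
    return [
        (key, dimensions[key])
        for key in dataset["dimension_keys"]
        if key in dimensions
    ]


def series_key(dataset, dimensions, separator="."):
    """Build a series key whose parts follow the same order as the dimensions."""
    return separator.join(value for _, value in ordered_dimensions(dataset, dimensions))
```

With the sample data, the dimensions come out as FREQ, COUNTRY, INDICATOR and the key as `A.FR.GDP`, so the UI and the key stay consistent.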
