# DESTATIS fetcher
Federal Statistical Office Germany
## Source data
DESTATIS provides SDMX data via an [API](https://www-genesis.destatis.de/genesisWS/web) with a login/password dedicated to DBnomics.
Data is grouped by hierarchical themes (see the `Themes` link on the [home page](https://www-genesis.destatis.de/genesis/online)). Only data belonging to a set of chosen themes (listed in `destatis_util.py`) is downloaded.
- The source-data repo contains one subdirectory per theme code (`42`, `45`, ..., `81`)
- Each subdir contains:
  - one datacubes file (named `{theme_id}.datacubes.xml`) listing (code, name) information for each datacube
  - pairs of files per datacube:
    - a datacube file (named `{dataset_id}.xml`) containing the time series data
    - a datacube structure file (named `{dataset_id}.structure.xml`) containing the dimension information
### Source format oddities
- datacube SDMX files embed CSV content (';'-delimited) in `<quaderDaten>` tags. The number of columns is variable because several CSV sections co-exist, each with its own header line.
- the time series data itself is preceded by metadata (also encoded as CSV)
- each row of time series data contains several observation values. Each observation value relates to an indicator (stored in the datacube structure file).
- each observation value consists of 4 columns: the first one contains the observation value unless the second one is a dash ('-'), in which case the observation value is treated as N/A.
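The observation-value rule above can be sketched as follows. This is an illustration, not the fetcher's actual code: the helper name, the meaning of the remaining columns, and the German decimal comma are assumptions.

```python
def parse_observation(columns):
    """Parse one 4-column observation group from a quaderDaten CSV row.

    The first column holds the observation value unless the second
    column is a dash ('-'), in which case the value is N/A (here: None).
    The semantics of columns 3 and 4 are not documented above.
    """
    value, flag = columns[0], columns[1]
    if flag == '-':
        return None
    # German decimal comma is an assumption about the source format
    return float(value.replace(',', '.'))

print(parse_observation(['12345', 'e', '0', '0']))  # → 12345.0
print(parse_observation(['0', '-', '0', '0']))      # → None
```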
## Download
The download script checks the [new data RSS feed](https://www-genesis.destatis.de/genesis/online/news?language=en) to find out which datacubes have been updated.
- If there are some, the download script fetches not only the modified datacubes but all the datacubes belonging to their theme.
- the download script can ignore the RSS feed and download all datacubes when given the `--all-datasets` parameter
- datacube files are post-processed after download: the download timestamp in the content is erased (to avoid spurious git commits)
- TODO: datacube files contain download URLs (including login and password); just replace the password value with "XXXX"
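The theme selection driven by the RSS feed can be sketched roughly as below. The function name is made up, and the "theme id = first two characters of the datacube code" mapping is an assumption based on codes like `81000BJ004` living under theme `81`.

```python
def themes_to_refresh(updated_datacube_codes, known_themes):
    """Return the themes whose datacubes should all be re-downloaded,
    given the datacube codes seen in the RSS feed.

    Assumption: the theme id is the first two characters of a datacube code.
    """
    themes = {code[:2] for code in updated_datacube_codes}
    # Only themes the fetcher is configured to handle are downloaded
    return sorted(themes & set(known_themes))

print(themes_to_refresh(['81000BJ004', '62111BJ001', '45412KJ001'],
                        {'42', '45', '81'}))
# → ['45', '81']
```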
## Convert
The convert process scans the source-data repo to generate json-data (1 datacube == 1 dataset). The category tree is based on the theme hierarchy (see the skeleton in `destatis_util.py`) and filled with the generated datasets.
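A minimal sketch of filling a category-tree skeleton with generated datasets. The real skeleton lives in `destatis_util.py`; the node structure, function name, and code-prefix matching below are assumptions for illustration only.

```python
# Hypothetical skeleton: one node per theme, to be filled with datasets
skeleton = [
    {'code': '45', 'name': 'Trade, hotels and restaurants', 'children': []},
    {'code': '81', 'name': 'National accounts', 'children': []},
]

def fill_category_tree(skeleton, datasets):
    """Attach each (code, name) dataset to the theme node whose code
    matches the first two characters of the dataset code (assumption)."""
    by_code = {node['code']: node for node in skeleton}
    for dataset_code, dataset_name in datasets:
        node = by_code.get(dataset_code[:2])
        if node is not None:
            node['children'].append({'code': dataset_code, 'name': dataset_name})
    return skeleton

tree = fill_category_tree(skeleton, [('81000BJ004', 'Some datacube')])
print(tree[1]['children'][0]['code'])  # → 81000BJ004
```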
```diff
@@ -199,13 +199,32 @@ def untimestamp_content(content):
         if line.lstrip().startswith('<quaderDaten>* ') and line.rstrip().endswith('angestossen.'):
             line = re.sub(r' am \d{2}\.\d{2}\.\d{4} um \d{2}:\d{2}:\d{2} ',
                           ' am dd.mm.yyyy um hh:MM:ss ', line)
             fixed = True
     return '\n'.join(output)
```
```python
def hide_password(content):
    """Change the password passed as URL parameter PASSWORT in <href> tags into XXXX"""
    from urllib.parse import urlparse, parse_qs, urlencode, urlunparse
    from lxml import etree

    root = etree.fromstring(content)
    for href_elt in root.iter('href'):
        url = href_elt.text
        scheme, netloc, path, params, query, fragment = urlparse(url)
        params_dict = parse_qs(query)
        if 'PASSWORT' in params_dict:
            params_dict['PASSWORT'] = ['XXXX']
            # doseq=True because parse_qs returns a {key: [values]} mapping
            query = urlencode(params_dict, doseq=True)
            href_elt.text = urlunparse((scheme, netloc, path, params, query, fragment))
    return etree.tostring(root, pretty_print=True, encoding='unicode')
```
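The masking step can be illustrated standalone with the standard library only (`lxml` is not needed for this sketch; the URL, credentials, and helper name below are made up):

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def mask_password(url, param='PASSWORT'):
    """Replace the value of the password query parameter with XXXX."""
    scheme, netloc, path, params, query, fragment = urlparse(url)
    query_dict = parse_qs(query)
    if param in query_dict:
        query_dict[param] = ['XXXX']
    # doseq=True re-encodes the {key: [values]} mapping returned by parse_qs
    query = urlencode(query_dict, doseq=True)
    return urlunparse((scheme, netloc, path, params, query, fragment))

print(mask_password(
    'https://example.org/genesisWS?KENNUNG=user&PASSWORT=secret&FORMAT=xml'))
# → https://example.org/genesisWS?KENNUNG=user&PASSWORT=XXXX&FORMAT=xml
```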
```python
def main():
    """Downloads data from destatis website"""
    parser = argparse.ArgumentParser(description=__doc__,
```
```diff
@@ -276,7 +295,8 @@ def main():
         category_url = CATEGORY_URL_TPL.format(**tpl_dict)
         category_filepath = cat_dir / '{}.datacubes.xml'.format(cat_id)
         log.info('Downloading category %s info...', cat_id)
-        download_url_if_needed(category_url, category_filepath, xml_content=True)
+        download_url_if_needed(category_url, category_filepath, xml_content=True,
         tpl_dict['format'] = 'csv'
         for datacube in du.extract_datacubes_info(str(category_filepath.resolve())):
```
```diff
@@ -19,7 +19,7 @@ def get_structure_filepath(dataset_code):
 def test_build_dimension_info_1():
-    dim_info = conv.extract_dimensions_info(get_csv_data('62111BJ001'))
+    dim_info = conv.extract_dimensions_info(get_csv_data('62111BJ001'), {})
     assert 'AUSB7' in dim_info['dimensions_labels']
     assert 'GES' in dim_info['dimensions_labels']
```
```diff
@@ -30,7 +30,7 @@ def test_build_dimension_info_1():
 def test_build_dimension_info_2():
-    dim_info = conv.extract_dimensions_info(get_csv_data('81000BJ004'))
+    dim_info = conv.extract_dimensions_info(get_csv_data('81000BJ004'), {})
     assert len(dim_info['dimensions_labels']) == 1
     assert 'DINSG' in dim_info['dimensions_labels']
```
```python
#!/usr/bin/env python3

import re
from pathlib import Path

import download as dwnd

FIXTURE_DIR = Path(__file__).parent / 'fixtures'
PASSWD_URL_PARAM_RE = re.compile(r'&amp;PASSWORT=([^&]+)&amp;')


def get_datacubes_file_content(theme_id):
    datacubes_filepath = FIXTURE_DIR / '{}.datacubes.xml'.format(theme_id)
    with datacubes_filepath.open('rt', encoding='utf-8') as fd:
        return fd.read()


def test_remove_password():
    content = get_datacubes_file_content('51')
    for m in PASSWD_URL_PARAM_RE.finditer(content):
        assert m.group(1) != 'XXXX'
    anonymous_content = dwnd.hide_password(content)
    for m in PASSWD_URL_PARAM_RE.finditer(anonymous_content):
        assert m.group(1) == 'XXXX'
```