fix #448 - Download and parse tables in TOC

mentioned in issue management#448 (closed)

added 5 commits

0e920a02...10e4ce36 - 4 commits from branch master
70292bcc - Fix #448: missing datasets - download and parse TOC's leaf elements of type...

The code seems good to merge for me. But are tables supported by the conversion code?

This MR adds many error messages to the script output:

INFO:Mode: full                                                                                                                               
ERROR:Skipping dataset DS-008573 because source directory ../eurostat-source-data/data/DS-008573 is missing                                   
ERROR:Skipping dataset DS-043408 because source directory ../eurostat-source-data/data/DS-043408 is missing                                   
ERROR:Skipping dataset DS-043409 because source directory ../eurostat-source-data/data/DS-043409 is missing                                   
ERROR:Skipping dataset DS-066341 because source directory ../eurostat-source-data/data/DS-066341 is missing                                   
ERROR:Skipping dataset DS-066342 because source directory ../eurostat-source-data/data/DS-066342 is missing                                   
ERROR:Skipping dataset cei_cie010 because source directory ../eurostat-source-data/data/cei_cie010 is missing                                 
ERROR:Skipping dataset cei_cie020 because source directory ../eurostat-source-data/data/cei_cie020 is missing                                 
ERROR:Skipping dataset cei_pc010 because source directory ../eurostat-source-data/data/cei_pc010 is missing                                   
ERROR:Skipping dataset cei_pc031 because source directory ../eurostat-source-data/data/cei_pc031 is missing                                   
ERROR:Skipping dataset cei_pc032 because source directory ../eurostat-source-data/data/cei_pc032 is missing                                   
ERROR:Skipping dataset cei_pc033 because source directory ../eurostat-source-data/data/cei_pc033 is missing                                   
ERROR:Skipping dataset cei_srm010 because source directory ../eurostat-source-data/data/cei_srm010 is missing                                 
ERROR:Skipping dataset cei_srm020 because source directory ../eurostat-source-data/data/cei_srm020 is missing                                 
ERROR:Skipping dataset cei_srm030 because source directory ../eurostat-source-data/data/cei_srm030 is missing                                 
ERROR:Skipping dataset cei_wm010 because source directory ../eurostat-source-data/data/cei_wm010 is missing                                   
[...]

I think that's because these entries do not have any downloadLink, even if their type is dataset.

Example:

                  <nt:leaf type="dataset">
                    <nt:title language="en">Sold production, exports and imports for steel by PRODCOM list (NACE Rev. 1.1) - monthly data</nt:
title>
                    <nt:title language="fr">Production vendue, exportations et importations d'acier par liste PRODCOM (NACE Rév. 1.1) - donnée
s mensuelles</nt:title>
                    <nt:title language="de">Verkaufte Produktion, Exporte und Importe für Stahl je PRODCOM Liste (NACE Rev. 1.1) - Monatliche
Daten</nt:title>
                    <nt:code>DS-008573</nt:code>
                    <nt:lastUpdate />
                    <nt:lastModified />
                    <nt:dataStart />
                    <nt:dataEnd />
                    <nt:values />
                    <nt:unit language="en" />
                    <nt:unit language="fr" />
                    <nt:unit language="de" />
                    <nt:shortDescription language="en" />
                    <nt:shortDescription language="fr" />
                    <nt:shortDescription language="de" />
                    <nt:metadata format="html">https://ec.europa.eu/eurostat/cache/metadata/en/prom_esms.htm</nt:metadata>
                    <nt:metadata format="sdmx">https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=metadata/prom_esms.sdmx.zip</nt:metadata>
                  </nt:leaf>

The "skip" part of the script should take that into account.

changed title from Fix #448: missing datasets - download and parse TOC's leaf elements of type... to Download and parse tables in TOC

changed the description

resolved all discussions

@bduye this MR seems good work

I'd like to test that MR in pre-production. Could you do that and post the results?

I think about that plan:

connect to pre-production server
fetch eurostat-source-data (using "git clone" or rsync)
extract the list of the datasets with type=table in the table_of_contents.xml
run the download script in with --datasets option to pass only the table datasets
run the convert script in with --datasets option to pass only the table datasets

If you need to modifiy that plan go ahead then ping me.

Note: it would be very long to run download and convert in full mode, so it thought about passing a --datasets option.

Depending on the results of pre-prod execution, I will be able to merge this MR.

What I did:

extract the list of the datasets with type=table in the table_of_contents.xml

from pathlib import Path
import lxml
from lxml import etree

xml_tree = lxml.etree.parse("/home/bruno/dev/jailbreak/dbnomics/fetchers/eurostat/eurostat-source-data-local/table_of_contents.xml")
nodes = xml_tree.findall(".//{*}leaf")
tables = [node.find("./{*}code").text for node in nodes if node.attrib['type'] == 'table']
Path('/home/bruno/dev/jailbreak/dbnomics/fetchers/eurostat/eurostat-source-data-local/tmp-datasets').write_text(' '.join(tables))

this gave 1503 datasets
put it in a file (because it's too loarge for copy/paste on command line)

try to run the converter passing the content of this file to --datasets options:

(eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm  -rf ../${PROVIDER_SLUG}-json-data/*; ./convert.py ../${PROVIDER_SLUG}-source-data/ ../${PROVIDER_SLUG}-json-data/ --datasets $( cat tmp-datasets ) --full

Current questions:

why the script says that 0 datasets are converted ? INFO:Mode: full
INFO:Converting 0 datasets...

Next step: investigate to see if this is a script bug or command line problem.

I found why the script didn't understood the arguments: argparse only accepts space separated arguments (not comma separated as usually seen):

$ ./convert.py ../eurostat-source-data ../eurostat-json-data --datasets teicp290,demo_nmsta2
args.datasets ['teicp290,demo_nmsta2']
$ ./convert.py ../eurostat-source-data ../eurostat-json-data --datasets teicp290 demo_nmsta2
args.datasets ['teicp290', 'demo_nmsta2']

changed title from Download and parse tables in TOC to fix #448 - Download and parse tables in TOC

assigned to @bduye and unassigned @cbenz

added 2h of time spent at 2019-08-29

I regenerated datasets list, started download script but ...

(eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-source-data/*; ./download.py ../${PROVIDER_SLUG}-source-data/ --datasets $( cat tmp-datasets )
INFO:Mode: incremental
ERROR:../eurostat-source-data is not a Git repository. Use --full option to download all the files of the provider.

@cbenz I'm confused Should I download all datasets ? (I suppose it's super long) or should I copy the source-data bare repo and test with it ? Or ... I don't know

@bduye oh yes, the incremental mode works only when source-data uses Git, just to read the previous version of the table_of_contents.xml, I think.

There are 2 approches: either git init artificially your eurostat-source-data dir, or rsync a dir which already uses Git.

`git init` dir

Just git init the eurostat-source-data dir, create a commit, then run the download script. It may just work but I'm not sure.

rsync from prod

Cloning eurostat-source-data is very long, so I would recommend to rsync it exceptionally from the production fetcher stateful dir (so gitlab-runner@dolos:fetcher-envs/eurostat/eurostat-source-data).

I would do rsync -avz gitlab-runner@dolos:fetcher-envs/eurostat/eurostat-source-data eurostat-source-data

If you have any doubt, let me review the command you are about to type :)

This would bring you the eurostat-source-data with .git inside it.

@bduye oh yes, the incremental mode works only when source-data uses Git, just to read the previous version of the table_of_contents.xml, I think.

I miss something is this logic to look to table_of_contents.xml history (to get the list of datasets that changed since the last script run) when --datasets option is passed to the script (so the user fixes himself the list of datasets to treat) ? I think that in the case when user passes --datasets option, the script shouldn't look to table_of_contents.xml git history right ?

mentioned in merge request !2 (merged)

So as I'm a rebel I changed the way download script works to bypass the need to have a Git context when --datasets option is used.

I temporary manually edited download script to be able to launch it without git context
- I created PR #2 to integrate this change in the long term

I managed to download of wanted datasets:

(eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-source-data/*; ./download.py ../${PROVIDER_SLUG}-source-data/ --datasets $( cat tmp-datasets )

Then tried to convert data

(eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-json-data/*; ./convert.py  ../${PROVIDER_SLUG}-source-data/  ../${PROVIDER_SLUG}-json-data/ --datasets $( cat tmp-datasets )

But still 0 datasets converted

Conclusion: The download was okay, but I'm still not able to close this PR, I've to investigate to find why convert doesn't run.

added 2h 42m of time spent at 2019-08-30

@bduye thanks for the very clear explanations, I'm looking at your MR !2 (merged)

By the way I remember now that in order to solve this issue, I just pass the --full option, so this would do: ./download.py ../${PROVIDER_SLUG}-source-data/ --datasets $( cat tmp-datasets ) --full

But I think that this is counter-intuitive and that --datasets option should imply --full as you proposed.

added 4 commits

70292bcc...4fef5732 - 3 commits from branch master
dbc8dbaa - Fix #448: missing datasets - download and parse TOC's leaf elements of type...

Compare with previous version

Remark:

cepremap@eros:~/fetchers-envs/eurostat/eurostat-source-data/data$ /bin/ls -1 | wc -l
1064

But the size of previously build list of datasets corresponding to <nt:leaf type="table"> in table_of_contents.xml is 1503.

After some tests I realized that's because there are duplicates in previously built list:

len(tables)  # => 1503
len(set(tables))  # => 1064

So all wanted datasets for test are downloaded and available on eros's fetchers-envs directory.

fix #448 - Download and parse tables in TOC

Merge request reports

Activity

git init dir

rsync from prod

`git init` dir