fix #448 - Download and parse tables in TOC
Solves management#448 (closed)
This merge request makes the download script to download and parse type="table"
leaf elements of table_of_contents.xml
.
Merge request reports
Activity
mentioned in issue management#448 (closed)
added 5 commits
-
0e920a02...10e4ce36 - 4 commits from branch
master
- 70292bcc - Fix #448: missing datasets - download and parse TOC's leaf elements of type...
-
0e920a02...10e4ce36 - 4 commits from branch
- Resolved by Christophe Benz
The code seems good to merge for me. But are tables supported by the conversion code?
- Resolved by Christophe Benz
This MR adds many error messages to the script output:
INFO:Mode: full ERROR:Skipping dataset DS-008573 because source directory ../eurostat-source-data/data/DS-008573 is missing ERROR:Skipping dataset DS-043408 because source directory ../eurostat-source-data/data/DS-043408 is missing ERROR:Skipping dataset DS-043409 because source directory ../eurostat-source-data/data/DS-043409 is missing ERROR:Skipping dataset DS-066341 because source directory ../eurostat-source-data/data/DS-066341 is missing ERROR:Skipping dataset DS-066342 because source directory ../eurostat-source-data/data/DS-066342 is missing ERROR:Skipping dataset cei_cie010 because source directory ../eurostat-source-data/data/cei_cie010 is missing ERROR:Skipping dataset cei_cie020 because source directory ../eurostat-source-data/data/cei_cie020 is missing ERROR:Skipping dataset cei_pc010 because source directory ../eurostat-source-data/data/cei_pc010 is missing ERROR:Skipping dataset cei_pc031 because source directory ../eurostat-source-data/data/cei_pc031 is missing ERROR:Skipping dataset cei_pc032 because source directory ../eurostat-source-data/data/cei_pc032 is missing ERROR:Skipping dataset cei_pc033 because source directory ../eurostat-source-data/data/cei_pc033 is missing ERROR:Skipping dataset cei_srm010 because source directory ../eurostat-source-data/data/cei_srm010 is missing ERROR:Skipping dataset cei_srm020 because source directory ../eurostat-source-data/data/cei_srm020 is missing ERROR:Skipping dataset cei_srm030 because source directory ../eurostat-source-data/data/cei_srm030 is missing ERROR:Skipping dataset cei_wm010 because source directory ../eurostat-source-data/data/cei_wm010 is missing [...]
I think that's because these entries do not have any
downloadLink
, even if their type isdataset
.Example:
<nt:leaf type="dataset"> <nt:title language="en">Sold production, exports and imports for steel by PRODCOM list (NACE Rev. 1.1) - monthly data</nt: title> <nt:title language="fr">Production vendue, exportations et importations d'acier par liste PRODCOM (NACE Rév. 1.1) - donnée s mensuelles</nt:title> <nt:title language="de">Verkaufte Produktion, Exporte und Importe für Stahl je PRODCOM Liste (NACE Rev. 1.1) - Monatliche Daten</nt:title> <nt:code>DS-008573</nt:code> <nt:lastUpdate /> <nt:lastModified /> <nt:dataStart /> <nt:dataEnd /> <nt:values /> <nt:unit language="en" /> <nt:unit language="fr" /> <nt:unit language="de" /> <nt:shortDescription language="en" /> <nt:shortDescription language="fr" /> <nt:shortDescription language="de" /> <nt:metadata format="html">https://ec.europa.eu/eurostat/cache/metadata/en/prom_esms.htm</nt:metadata> <nt:metadata format="sdmx">https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=metadata/prom_esms.sdmx.zip</nt:metadata> </nt:leaf>
The "skip" part of the script should take that into account.
@bduye this MR seems good work
I'd like to test that MR in pre-production. Could you do that and post the results?
I think about that plan:
- connect to pre-production server
- fetch eurostat-source-data (using "git clone" or rsync)
-
extract the list of the datasets with
type=table
in thetable_of_contents.xml
-
run the download script in with
--datasets
option to pass only the table datasets -
run the convert script in with
--datasets
option to pass only the table datasets
If you need to modifiy that plan go ahead then ping me.
Note: it would be very long to run download and convert in full mode, so it thought about passing a
--datasets
option.Depending on the results of pre-prod execution, I will be able to merge this MR.
Edited by Bruno DuyéWhat I did:
- extract the list of the datasets with
type=table
in thetable_of_contents.xml
from pathlib import Path import lxml from lxml import etree xml_tree = lxml.etree.parse("/home/bruno/dev/jailbreak/dbnomics/fetchers/eurostat/eurostat-source-data-local/table_of_contents.xml") nodes = xml_tree.findall(".//{*}leaf") tables = [node.find("./{*}code").text for node in nodes if node.attrib['type'] == 'table'] Path('/home/bruno/dev/jailbreak/dbnomics/fetchers/eurostat/eurostat-source-data-local/tmp-datasets').write_text(' '.join(tables))
- this gave 1503 datasets
- put it in a file (because it's too loarge for copy/paste on command line)
- try to run the converter passing the content of this file to
--datasets
options:(eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-json-data/*; ./convert.py ../${PROVIDER_SLUG}-source-data/ ../${PROVIDER_SLUG}-json-data/ --datasets $( cat tmp-datasets ) --full
Current questions:
- why the script says that 0 datasets are converted ?
INFO:Mode: full
INFO:Converting 0 datasets...
Next step: investigate to see if this is a script bug or command line problem.
Edited by Bruno Duyé- extract the list of the datasets with
I found why the script didn't understood the arguments: argparse only accepts space separated arguments (not comma separated as usually seen):
$ ./convert.py ../eurostat-source-data ../eurostat-json-data --datasets teicp290,demo_nmsta2 args.datasets ['teicp290,demo_nmsta2'] $ ./convert.py ../eurostat-source-data ../eurostat-json-data --datasets teicp290 demo_nmsta2 args.datasets ['teicp290', 'demo_nmsta2']
I regenerated datasets list, started download script but ...
(eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-source-data/*; ./download.py ../${PROVIDER_SLUG}-source-data/ --datasets $( cat tmp-datasets ) INFO:Mode: incremental ERROR:../eurostat-source-data is not a Git repository. Use --full option to download all the files of the provider.
@cbenz I'm confused
Should I download all datasets ? (I suppose it's super long) or should I copy the source-data bare repo and test with it ? Or ... I don't know@bduye oh yes, the incremental mode works only when source-data uses Git, just to read the previous version of the
table_of_contents.xml
, I think.There are 2 approches: either
git init
artificially youreurostat-source-data
dir, orrsync
a dir which already uses Git.git init
dirJust
git init
theeurostat-source-data
dir, create a commit, then run the download script. It may just work but I'm not sure.rsync from prod
Cloning
eurostat-source-data
is very long, so I would recommend torsync
it exceptionally from the production fetcher stateful dir (sogitlab-runner@dolos:fetcher-envs/eurostat/eurostat-source-data
).I would do
rsync -avz gitlab-runner@dolos:fetcher-envs/eurostat/eurostat-source-data eurostat-source-data
If you have any doubt, let me review the command you are about to type :)
This would bring you the
eurostat-source-data
with.git
inside it.@bduye oh yes, the incremental mode works only when source-data uses Git, just to read the previous version of the
table_of_contents.xml
, I think.I miss something
is this logic to look totable_of_contents.xml
history (to get the list of datasets that changed since the last script run) when--datasets
option is passed to the script (so the user fixes himself the list of datasets to treat) ? I think that in the case when user passes--datasets
option, the script shouldn't look totable_of_contents.xml
git history right ?mentioned in merge request !2 (merged)
So as I'm a rebel
I changed the way download script works to bypass the need to have a Git context when--datasets
option is used.- I temporary manually edited download script to be able to launch it without git context
- I created PR #2 to integrate this change in the long term
- I managed to download of wanted datasets:
(eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-source-data/*; ./download.py ../${PROVIDER_SLUG}-source-data/ --datasets $( cat tmp-datasets )
- Then tried to convert data
(eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-json-data/*; ./convert.py ../${PROVIDER_SLUG}-source-data/ ../${PROVIDER_SLUG}-json-data/ --datasets $( cat tmp-datasets )
But still 0 datasets converted
Conclusion: The download was okay, but I'm still not able to close this PR, I've to investigate to find why convert doesn't run.
Edited by Bruno Duyé- I temporary manually edited download script to be able to launch it without git context
@bduye thanks for the very clear explanations, I'm looking at your MR !2 (merged)
By the way I remember now that in order to solve this issue, I just pass the
--full
option, so this would do:./download.py ../${PROVIDER_SLUG}-source-data/ --datasets $( cat tmp-datasets ) --full
But I think that this is counter-intuitive and that
--datasets
option should imply--full
as you proposed.added 4 commits
-
70292bcc...4fef5732 - 3 commits from branch
master
- dbc8dbaa - Fix #448: missing datasets - download and parse TOC's leaf elements of type...
-
70292bcc...4fef5732 - 3 commits from branch
Remark:
cepremap@eros:~/fetchers-envs/eurostat/eurostat-source-data/data$ /bin/ls -1 | wc -l 1064
But the size of previously build list of datasets corresponding to
<nt:leaf type="table">
intable_of_contents.xml
is 1503.After some tests I realized that's because there are duplicates in previously built list:
len(tables) # => 1503 len(set(tables)) # => 1064
So all wanted datasets for test are downloaded and available on eros's
fetchers-envs
directory.