Skip to content
Snippets Groups Projects

fix #448 - Download and parse tables in TOC

Merged Bruno Duyé requested to merge bduye into master

Solves management#448 (closed)

This merge request makes the download script to download and parse type="table" leaf elements of table_of_contents.xml.

Edited by Bruno Duyé

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
  • Bruno Duyé added 5 commits

    added 5 commits

    • 0e920a02...10e4ce36 - 4 commits from branch master
    • 70292bcc - Fix #448: missing datasets - download and parse TOC's leaf elements of type...

    Compare with previous version

    • Resolved by Christophe Benz

      This MR adds many error messages to the script output:

      INFO:Mode: full                                                                                                                               
      ERROR:Skipping dataset DS-008573 because source directory ../eurostat-source-data/data/DS-008573 is missing                                   
      ERROR:Skipping dataset DS-043408 because source directory ../eurostat-source-data/data/DS-043408 is missing                                   
      ERROR:Skipping dataset DS-043409 because source directory ../eurostat-source-data/data/DS-043409 is missing                                   
      ERROR:Skipping dataset DS-066341 because source directory ../eurostat-source-data/data/DS-066341 is missing                                   
      ERROR:Skipping dataset DS-066342 because source directory ../eurostat-source-data/data/DS-066342 is missing                                   
      ERROR:Skipping dataset cei_cie010 because source directory ../eurostat-source-data/data/cei_cie010 is missing                                 
      ERROR:Skipping dataset cei_cie020 because source directory ../eurostat-source-data/data/cei_cie020 is missing                                 
      ERROR:Skipping dataset cei_pc010 because source directory ../eurostat-source-data/data/cei_pc010 is missing                                   
      ERROR:Skipping dataset cei_pc031 because source directory ../eurostat-source-data/data/cei_pc031 is missing                                   
      ERROR:Skipping dataset cei_pc032 because source directory ../eurostat-source-data/data/cei_pc032 is missing                                   
      ERROR:Skipping dataset cei_pc033 because source directory ../eurostat-source-data/data/cei_pc033 is missing                                   
      ERROR:Skipping dataset cei_srm010 because source directory ../eurostat-source-data/data/cei_srm010 is missing                                 
      ERROR:Skipping dataset cei_srm020 because source directory ../eurostat-source-data/data/cei_srm020 is missing                                 
      ERROR:Skipping dataset cei_srm030 because source directory ../eurostat-source-data/data/cei_srm030 is missing                                 
      ERROR:Skipping dataset cei_wm010 because source directory ../eurostat-source-data/data/cei_wm010 is missing                                   
      [...]

      I think that's because these entries do not have any downloadLink, even if their type is dataset.

      Example:

                        <nt:leaf type="dataset">
                          <nt:title language="en">Sold production, exports and imports for steel by PRODCOM list (NACE Rev. 1.1) - monthly data</nt:
      title>
                          <nt:title language="fr">Production vendue, exportations et importations d'acier par liste PRODCOM (NACE Rév. 1.1) - donnée
      s mensuelles</nt:title>
                          <nt:title language="de">Verkaufte Produktion, Exporte und Importe für Stahl je PRODCOM Liste (NACE Rev. 1.1) - Monatliche
      Daten</nt:title>
                          <nt:code>DS-008573</nt:code>
                          <nt:lastUpdate />
                          <nt:lastModified />
                          <nt:dataStart />
                          <nt:dataEnd />
                          <nt:values />
                          <nt:unit language="en" />
                          <nt:unit language="fr" />
                          <nt:unit language="de" />
                          <nt:shortDescription language="en" />
                          <nt:shortDescription language="fr" />
                          <nt:shortDescription language="de" />
                          <nt:metadata format="html">https://ec.europa.eu/eurostat/cache/metadata/en/prom_esms.htm</nt:metadata>
                          <nt:metadata format="sdmx">https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=metadata/prom_esms.sdmx.zip</nt:metadata>
                        </nt:leaf>

      The "skip" part of the script should take that into account.

  • Christophe Benz changed title from Fix #448: missing datasets - download and parse TOC's leaf elements of type... to Download and parse tables in TOC

    changed title from Fix #448: missing datasets - download and parse TOC's leaf elements of type... to Download and parse tables in TOC

  • Christophe Benz changed the description

    changed the description

  • Christophe Benz resolved all discussions

    resolved all discussions

    • @bduye this MR seems good work :clap:

      I'd like to test that MR in pre-production. Could you do that and post the results?

      I think about that plan:

      • connect to pre-production server
      • fetch eurostat-source-data (using "git clone" or rsync)
      • extract the list of the datasets with type=table in the table_of_contents.xml
      • run the download script in with --datasets option to pass only the table datasets
      • run the convert script in with --datasets option to pass only the table datasets

      If you need to modifiy that plan go ahead then ping me.

      Note: it would be very long to run download and convert in full mode, so it thought about passing a --datasets option.

      Depending on the results of pre-prod execution, I will be able to merge this MR.

      Edited by Bruno Duyé
    • What I did:

      • extract the list of the datasets with type=table in the table_of_contents.xml
      from pathlib import Path
      import lxml
      from lxml import etree
      
      xml_tree = lxml.etree.parse("/home/bruno/dev/jailbreak/dbnomics/fetchers/eurostat/eurostat-source-data-local/table_of_contents.xml")
      nodes = xml_tree.findall(".//{*}leaf")
      tables = [node.find("./{*}code").text for node in nodes if node.attrib['type'] == 'table']
      Path('/home/bruno/dev/jailbreak/dbnomics/fetchers/eurostat/eurostat-source-data-local/tmp-datasets').write_text(' '.join(tables))
      • this gave 1503 datasets
      • put it in a file (because it's too loarge for copy/paste on command line)
      • try to run the converter passing the content of this file to --datasets options:
        (eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm  -rf ../${PROVIDER_SLUG}-json-data/*; ./convert.py ../${PROVIDER_SLUG}-source-data/ ../${PROVIDER_SLUG}-json-data/ --datasets $( cat tmp-datasets ) --full

      Current questions:

      • why the script says that 0 datasets are converted ? INFO:Mode: full
        INFO:Converting 0 datasets...

      Next step: investigate to see if this is a script bug or command line problem.

      Edited by Bruno Duyé
    • I found why the script didn't understood the arguments: argparse only accepts space separated arguments (not comma separated as usually seen):

      $ ./convert.py ../eurostat-source-data ../eurostat-json-data --datasets teicp290,demo_nmsta2
      args.datasets ['teicp290,demo_nmsta2']
      $ ./convert.py ../eurostat-source-data ../eurostat-json-data --datasets teicp290 demo_nmsta2
      args.datasets ['teicp290', 'demo_nmsta2']

      :unamused:

    • Please register or sign in to reply
  • Bruno Duyé changed title from Download and parse tables in TOC to fix #448 - Download and parse tables in TOC

    changed title from Download and parse tables in TOC to fix #448 - Download and parse tables in TOC

  • Bruno Duyé assigned to @bduye and unassigned @cbenz

    assigned to @bduye and unassigned @cbenz

  • Bruno Duyé added 2h of time spent at 2019-08-29

    added 2h of time spent at 2019-08-29

  • I regenerated datasets list, started download script but ...

    (eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-source-data/*; ./download.py ../${PROVIDER_SLUG}-source-data/ --datasets $( cat tmp-datasets )
    INFO:Mode: incremental
    ERROR:../eurostat-source-data is not a Git repository. Use --full option to download all the files of the provider.

    @cbenz I'm confused :thinking: Should I download all datasets ? (I suppose it's super long) or should I copy the source-data bare repo and test with it ? Or ... I don't know

  • @bduye oh yes, the incremental mode works only when source-data uses Git, just to read the previous version of the table_of_contents.xml, I think.

    There are 2 approches: either git init artificially your eurostat-source-data dir, or rsync a dir which already uses Git.

    git init dir

    Just git init the eurostat-source-data dir, create a commit, then run the download script. It may just work but I'm not sure.

    rsync from prod

    Cloning eurostat-source-data is very long, so I would recommend to rsync it exceptionally from the production fetcher stateful dir (so gitlab-runner@dolos:fetcher-envs/eurostat/eurostat-source-data).

    I would do rsync -avz gitlab-runner@dolos:fetcher-envs/eurostat/eurostat-source-data eurostat-source-data

    If you have any doubt, let me review the command you are about to type :)

    This would bring you the eurostat-source-data with .git inside it.

  • @bduye oh yes, the incremental mode works only when source-data uses Git, just to read the previous version of the table_of_contents.xml, I think.

    I miss something :thinking: is this logic to look to table_of_contents.xml history (to get the list of datasets that changed since the last script run) when --datasets option is passed to the script (so the user fixes himself the list of datasets to treat) ? I think that in the case when user passes --datasets option, the script shouldn't look to table_of_contents.xml git history right ?

  • Bruno Duyé mentioned in merge request !2 (merged)

    mentioned in merge request !2 (merged)

  • So as I'm a rebel :gun: I changed the way download script works to bypass the need to have a Git context when --datasets option is used.

    • I temporary manually edited download script to be able to launch it without git context
      • I created PR #2 to integrate this change in the long term
    • I managed to download of wanted datasets:
      (eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-source-data/*; ./download.py ../${PROVIDER_SLUG}-source-data/ --datasets $( cat tmp-datasets )
    • Then tried to convert data
      (eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ rm -rf ../${PROVIDER_SLUG}-json-data/*; ./convert.py  ../${PROVIDER_SLUG}-source-data/  ../${PROVIDER_SLUG}-json-data/ --datasets $( cat tmp-datasets )

    But still 0 datasets converted :thinking:

    Conclusion: The download was okay, but I'm still not able to close this PR, I've to investigate to find why convert doesn't run.

    Edited by Bruno Duyé
  • Bruno Duyé added 2h 42m of time spent at 2019-08-30

    added 2h 42m of time spent at 2019-08-30

  • @bduye thanks for the very clear explanations, I'm looking at your MR !2 (merged)

    By the way I remember now that in order to solve this issue, I just pass the --full option, so this would do: ./download.py ../${PROVIDER_SLUG}-source-data/ --datasets $( cat tmp-datasets ) --full

    But I think that this is counter-intuitive and that --datasets option should imply --full as you proposed.

  • Bruno Duyé added 4 commits

    added 4 commits

    • 70292bcc...4fef5732 - 3 commits from branch master
    • dbc8dbaa - Fix #448: missing datasets - download and parse TOC's leaf elements of type...

    Compare with previous version

  • Remark:

    cepremap@eros:~/fetchers-envs/eurostat/eurostat-source-data/data$ /bin/ls -1 | wc -l
    1064

    But the size of previously build list of datasets corresponding to <nt:leaf type="table"> in table_of_contents.xml is 1503.

    After some tests I realized that's because there are duplicates in previously built list:

    len(tables)  # => 1503
    len(set(tables))  # => 1064

    So all wanted datasets for test are downloaded and available on eros's fetchers-envs directory.

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
Please register or sign in to reply