Eurostat: category tree is different than provider's
As reported by user
On DBnomics
https://db.nomics.world/Eurostat
On Eurostat
Activity
- Christophe Benz added 1 deleted label
- Christophe Benz added Fetcher label
- Christophe Benz changed the description
- Christophe Benz added Doing label
- Johan Richer removed Doing label
- Bruno Duyé added Doing label
- Bruno Duyé assigned to @bduye
- Bruno Duyé mentioned in merge request eurostat-fetcher!1 (merged)
This is correct: some categories and datasets are currently missing from Eurostat on DBnomics.
Technical explanation:
For now, the download and convert scripts only take into account XML leaves of type "dataset" in the table_of_contents.xml file. But some datasets have type "table" (even though they do not contain only one table, as the name would suggest).
Proposed solution:
I adapted the downloader and converter code to take those datasets into account too. The merge request is here: eurostat-fetcher!1 (merged)
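For readers following along, here is a minimal sketch of the leaf filtering described above. It is not the code from the merge request: the XML sample is a simplified, hypothetical stand-in for table_of_contents.xml (the real file's element names and namespaces may differ), and SAMPLE_TOC, ACCEPTED_LEAF_TYPES and iter_dataset_codes are names made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for table_of_contents.xml.
# The real file's layout (namespaces, element names) may differ.
SAMPLE_TOC = """
<tree>
  <branch>
    <code>economy</code>
    <children>
      <leaf type="dataset"><code>ds_a</code></leaf>
      <leaf type="table"><code>tb_b</code></leaf>
      <leaf type="other"><code>x_c</code></leaf>
    </children>
  </branch>
</tree>
"""

# Assumption: accepting "table" next to "dataset" is the gist of the fix.
ACCEPTED_LEAF_TYPES = {"dataset", "table"}


def iter_dataset_codes(toc_xml, accepted_types=ACCEPTED_LEAF_TYPES):
    """Yield the code of every leaf whose type is in accepted_types."""
    root = ET.fromstring(toc_xml)
    for leaf in root.iter("leaf"):
        if leaf.get("type") in accepted_types:
            yield leaf.findtext("code")


# Old behaviour: only leaves of type "dataset" are seen.
codes_before_fix = list(iter_dataset_codes(SAMPLE_TOC, {"dataset"}))
# New behaviour: leaves of type "table" are picked up as well.
codes_after_fix = list(iter_dataset_codes(SAMPLE_TOC))
```

With this sample, the leaf of type "table" only shows up in codes_after_fix, which is the kind of gap that was visible in the category tree.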
- Bruno Duyé added 6h of time spent at 2019-08-02
- Bruno Duyé added 1h of time spent at 2019-08-26
For people who follow this issue, until eurostat-fetcher!1 (merged) is merged, all updates are in this merge request
This is fixed now, preprod is accessible
- Bruno Duyé added 1 deleted label and removed Doing label
- Bruno Duyé added Doing label and removed 1 deleted label
- Bruno Duyé removed Doing label
- Bruno Duyé added 3h of time spent at 2019-09-17
- Johan Richer added 1 deleted label
- Christophe Benz added Doing label and removed 1 deleted label
In preprod we can check that the missing parts of the category tree are now available:
Also, 1064 new datasets have been added. Only those new datasets are available in preprod (because of disk space and computing time constraints).
Some random "new" datasets:
- Bruno Duyé added 1 deleted label and removed Doing label
- Bruno Duyé changed the description
@MichelJuillard here's the flat list of all new datasets that have been added: new_datasets_in_Eurostat.txt
- Bruno Duyé added Doing label and removed 1 deleted label
- Author Owner
I merged eurostat-fetcher!1 (merged)
Before closing @bduye can you verify tomorrow that datasets appear in production?
Fortunately you didn't close this issue yourself, because there's a "Things todo before closing" section in the issue description.
- Bruno Duyé added 1 deleted label and removed Doing label
The last download job after merging didn't download all the "new" datasets, so not all "new" datasets are available in production.
After investigation, it appears that this is because the download script runs in incremental mode by default: to decide which datasets to download, it compares the lastUpdate attributes in table_of_contents.xml with those in the previously downloaded table_of_contents.xml file.
In our case, only some (162) of the "new" datasets had changed, so only those were downloaded.
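A minimal sketch of why the incremental comparison skips some "new" datasets, under stated assumptions: the two XML samples are simplified, hypothetical stand-ins for the previous and current table_of_contents.xml (lastUpdate is modelled as a child element here, and the real layout may differ), and last_updates and datasets_to_download are made-up names, not functions from download.py.

```python
import xml.etree.ElementTree as ET

# Hypothetical previous and current tables of contents (simplified layout).
PREVIOUS_TOC = """
<tree>
  <leaf type="dataset"><code>ds_a</code><lastUpdate>01.08.2019</lastUpdate></leaf>
  <leaf type="table"><code>tb_b</code><lastUpdate>01.08.2019</lastUpdate></leaf>
  <leaf type="table"><code>tb_c</code><lastUpdate>01.08.2019</lastUpdate></leaf>
</tree>
"""
CURRENT_TOC = """
<tree>
  <leaf type="dataset"><code>ds_a</code><lastUpdate>01.08.2019</lastUpdate></leaf>
  <leaf type="table"><code>tb_b</code><lastUpdate>15.09.2019</lastUpdate></leaf>
  <leaf type="table"><code>tb_c</code><lastUpdate>01.08.2019</lastUpdate></leaf>
</tree>
"""


def last_updates(toc_xml):
    """Map each leaf code to its lastUpdate string."""
    root = ET.fromstring(toc_xml)
    return {
        leaf.findtext("code"): leaf.findtext("lastUpdate")
        for leaf in root.iter("leaf")
    }


def datasets_to_download(previous_toc, current_toc):
    """Return codes whose lastUpdate differs from the previous table of contents."""
    previous = last_updates(previous_toc)
    current = last_updates(current_toc)
    return sorted(
        code for code, updated in current.items() if previous.get(code) != updated
    )


selected = datasets_to_download(PREVIOUS_TOC, CURRENT_TOC)
```

In this sample both "table" leaves were already listed in the previous table of contents, so only tb_b (whose lastUpdate changed) is selected. tb_c is skipped even though it was never downloaded, which mirrors the situation described above.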
What I have done:
- generated list of "new" datasets to download:
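from pathlib import Path

source_data = Path("/home/bruno/dev/jailbreak/dbnomics/fetchers/eurostat/eurostat-source-data-local/")
# Datasets already downloaded, and datasets newly exposed by the fix
all_downloaded = (source_data / "list_all_downloaded").read_text().splitlines()
new = (source_data / "new_datasets").read_text().splitlines()
# Keep only the "new" datasets that have not been downloaded yet
to_download_datasets = set(new) - set(all_downloaded)
(source_data / "new_datasets_to_download").write_text(" ".join(to_download_datasets))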
- copied the generated file to eros
- manually downloaded those datasets:
(eurostat-venv) cepremap@eros:~/fetchers-envs/eurostat/eurostat-fetcher$ python download.py ../eurostat-source-data/ --datasets $( cat ~/new_datasets_to_download )
- added this downloaded data to /home/gitlab-runner/fetchers-envs/eurostat/eurostat-source-data
- committed it and pushed it:
git add . --ignore-removal
git commit -m '#448 - manual add of missing "new" datasets'
- it triggered a convert and index
- now all "new" datasets are in production
Edited by Christophe Benz
- Bruno Duyé added 1 deleted label and removed 1 deleted label
@MichelJuillard I was about to reply to the user who reported the issue, but I can't find the thread in the forum. Maybe it was reported by email? Do you still have the reporter's contact details? Thanks
- Reporter
I can't find the email. Maybe it was me who reported this.
- Maintainer
@bduye I didn't report the issue
Indeed @MichelJuillard, sorry! So this question is for @cbenz
- Author Owner
I can't find the info. I'm surprised I did not put it in the issue...
Maybe it was @thomasbrand, who was working with Eurostat, who reported this to me.
Let's close the issue anyway, thanks @bduye!
- Author Owner
Fixed: I can see this dataset in production: https://db.nomics.world/Eurostat/tqoe1c2
- Christophe Benz closed
- Bruno Duyé changed the description
- Bruno Duyé added 7h of time spent at 2019-10-02