Simplify JSON data model

changed title from dataset json data model to Brainstorm about dataset JSON data model

changed the description

add number of series

I prefer not to store data which can be derived from other data (i.e. keep an orthogonal model).

The web API is there to reconstitute the information, either by adding back the series property as a list of strings, or by adding a series_count property as an integer.
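
A minimal sketch of how the web API could add the derived property when building a response, rather than storing it. The function name `with_series_count`, the sample dataset and the series codes are hypothetical; only the `series_count` property name comes from the comment above.

```python
def with_series_count(dataset: dict, series_codes: list[str]) -> dict:
    """Return a copy of the stored dataset with a derived series_count.

    The stored dataset.json is left untouched; the count is computed
    when the response is built. series_codes is assumed to come from
    wherever the API already looks up series (hypothetical).
    """
    enriched = dict(dataset)  # shallow copy, do not mutate the stored model
    enriched["series_count"] = len(series_codes)
    return enriched


# Hypothetical stored content and series codes, for illustration only.
stored = {"code": "example-dataset", "name": "Example dataset"}
response = with_series_count(stored, ["A.FR", "A.DE", "Q.FR"])
```

The stored model stays free of derived data, and the same approach would work for adding back `series` as a list of strings.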

add doc_href for information about the dataset on the provider web site

This is already supported (see this line) but it is not a required property. I don't think we should enforce it, unless we're sure there's always a URL per dataset.

add notes for unstructured information about the dataset to display under Infos in the UI

This is already expressed as issues: #71 (closed) and #35 (closed)

great minds...

add timestamp for first download
add timestamp for last update by provider
add timestamp for last update by DB.nomics
add timestamp for last visit by DB.nomics

OK to add timestamps for as many events as possible.

do we need the list of series? Some datasets have tens of thousands of series. Wouldn't a query to the indexer be better?

I think we should drop this series property. We'll reintroduce it if needed. The web API can infer this information.

No, it's not related to the indexer.

do we need dimensions_codes and attributes_codes? They are just the keys of dimensions_labels and attributes_labels

They define the order of the dimensions or attributes, as we talked about in this technical committee.
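
To illustrate why a separate list is useful: a JSON object does not guarantee key order, so a list of codes is one way to make the order explicit. A minimal sketch, assuming `dimensions_labels` maps each code to its label; the codes, labels and the function name `ordered_dimension_labels` are hypothetical.

```python
def ordered_dimension_labels(dataset: dict) -> list[tuple[str, str]]:
    """Pair each dimension code with its label, in the order given by dimensions_codes."""
    labels = dataset["dimensions_labels"]
    return [(code, labels[code]) for code in dataset["dimensions_codes"]]


dataset = {
    # Hypothetical values, for illustration only.
    "dimensions_codes": ["FREQ", "COUNTRY"],
    "dimensions_labels": {"COUNTRY": "Country", "FREQ": "Frequency"},
}
ordered = ordered_dimension_labels(dataset)
```

The same pattern would apply to attributes_codes and attributes_labels.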

Related to #93 (closed) especially this question

@cbenz OK. Concerning dimensions_codes and attributes_codes, can we add _order to the name in order to better indicate the use of this field?

@MichelJuillard :

OK. Concerning dimensions_codes and attributes_codes, can we add _order to the name in order to better indicate the use of this field?

That's the case actually, in this question, which is an update of this previous question

It is quite tricky to have many questions in the same issue, sorry.

@cbenz Yes, somebody should summarize the discussions and the pending choices/decisions in a single place

@MichelJuillard Bruno is doing it indeed. But I discovered 10 minutes ago the "Threaded discussions" feature of GitLab, which we should always use!

https://docs.gitlab.com/ce/user/discussions/#threaded-discussions

I started my comment here as a discussion, to test the feature.

For surveys, the process would be:

create a discussion per question, with a vote
discuss below each question, thus separating the threads
summarize the proposal in the description of the issue, at the top

mentioned in issue #93 (closed)

added Priority 2 Should label

changed the description

Do we need to store all the text found in XLS files or HTML pages?

In XLS files, like this one, a lot of text comes with the data. It's certainly the case for other datasets, and probably in static HTML files.

How could we help users find this text?

Proposal A: store it in `dataset.json` and display it in UI

If we decide to store this (quite long) text in dataset.json files, this involves:

adding a long_description key to store this long text, which won't be displayed anywhere other than on the dataset page
adding fetcher development time to parse, transform and store this text

But we could then integrate this text in the UI, and we could also imagine fuzzy search within it.

Proposal B: do not store but give a link to original file

As this text is intended to be read by a human, not by a program, I think it would be more straightforward to add a more_description_link key that contains a link (or several links) to:

an HTML page about the dataset
an XLS file in the ###-source-data git repo
something else?

This won't replace the description key, which is (for me, at least) intended to be "quite short": not so short that it can't be indexed in Solr, and not so long that it can't be displayed in UI pages that also show other information.

I agree, and I prefer proposal B.

If the description is short (1 or 2 paragraphs) it can fit the description property; if it is longer, we can keep the first paragraph.

Besides the description property, we can add a description_href linking to something containing the long description, and we should prefer targeting the source data repository to keep history.
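
A rough sketch of the rule described above (keep a description of 1 or 2 paragraphs as is, otherwise keep only the first paragraph and link to the rest). It assumes paragraphs are separated by blank lines; the function name, the sample text and the URL are hypothetical.

```python
def short_description(full_text: str, max_paragraphs: int = 2) -> str:
    """Return the text unchanged if it is short, else only its first paragraph."""
    paragraphs = [p.strip() for p in full_text.split("\n\n") if p.strip()]
    if len(paragraphs) <= max_paragraphs:
        return "\n\n".join(paragraphs)
    return paragraphs[0]


# Hypothetical long text extracted by a fetcher, for illustration only.
long_text = "Short summary.\n\nMethodology notes.\n\nRevision policy."
dataset = {
    "description": short_description(long_text),
    # Hypothetical URL; ideally a file in the source data repository.
    "description_href": "https://example.org/source-data/example-dataset.xls",
}
```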

If we decide to store long descriptions, I would prefer storing them in a separate file like description.md, description.html or description.txt depending on the format. And the indexer could index it.

changed title from Brainstorm about dataset JSON data model to Simplify JSON data model

changed the description

mentioned in merge request wto-fetcher!7 (merged)

add timestamp for last visit by DB.nomics

I'm afraid this would trigger a change in the JSON repository, and then a re-indexing in Solr, for nothing.

We should commit in the JSON repository only when source data or the conversion script change.
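
One possible way to get this behaviour, sketched under assumptions: serialize the JSON deterministically and rewrite the file only when its content actually differs, so an unchanged dataset produces no diff and nothing to commit. The helper name `write_json_if_changed` is hypothetical and this is not necessarily how the fetchers work.

```python
import json
from pathlib import Path


def write_json_if_changed(path: Path, data: dict) -> bool:
    """Write data to path only if it differs from what is already stored.

    Returns True when the file was (re)written, False when left untouched.
    """
    # Sorted keys and fixed indentation make the output deterministic,
    # so equal data always yields byte-identical text.
    new_text = json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False) + "\n"
    if path.exists() and path.read_text(encoding="utf-8") == new_text:
        return False  # identical content: no diff, so nothing to commit
    path.write_text(new_text, encoding="utf-8")
    return True
```

A "last visit" timestamp stored in the file would defeat this check, since the content would change on every run.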
