Commit 6c43e871 authored by Christophe Benz

Update docs

parent 9c20ed96
Pipeline #119262 passed with stages in 4 minutes and 42 seconds
In order to achieve this, fetchers just write data to the file-system.
This allows anyone to run a fetcher without having to run the complete DBnomics infrastructure.
## Store provider data as-is
Fetchers download data from the provider infrastructure and write it to the file-system as-is.
Providers usually distribute data as:
* static files (sometimes called bulk download): XML, JSON, CSV, XLSX, sometimes archived in ZIP files
* web API, with responses being XML, JSON, etc.
File formats can be:
* machine-readable: XML, JSON, CSV
* human-readable: XLSX files using formatting, colors, etc.
Providers distribute data in various ways.
For example, here are many possible cases:
* a CSV file defining a whole dataset (1 to 1 relationship)
* an XLSX file defining many datasets (1 to many relationship)
* many XML files defining a dataset (many to 1 relationship)
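One way to make these cardinalities explicit in a fetcher is a small mapping table. A minimal sketch; the file names and dataset codes below are purely illustrative, not part of any real provider:

```python
# Illustrative mapping between downloaded files and DBnomics datasets.
# All names below are hypothetical; the real mapping depends on the provider.
FILE_TO_DATASETS = {
    "labour.csv": ["labour"],            # 1 to 1: one CSV, one dataset
    "surveys.xlsx": ["wages", "hours"],  # 1 to many: one XLSX, many datasets
}
DATASET_TO_FILES = {
    "prices": ["prices_a.xml", "prices_b.xml"],  # many to 1: many XML files, one dataset
}


def datasets_of(file_name: str) -> list[str]:
    """Return the dataset codes produced from a downloaded file."""
    return FILE_TO_DATASETS.get(file_name, [])
```

Making the relationship explicit keeps the download and convert steps honest about which files a given dataset depends on.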
.. toctree::
   :maxdepth: 2
   :caption: Contents:

   installation
   design_goals
   writing_a_fetcher
   dbnomics_fetcher_toolbox
# Installation
See https://git.nomics.world/dbnomics/dbnomics-fetcher-toolbox#installation
# Writing a fetcher
## Install and configure environment
* initialize a new project from dbnomics-fetcher-cookiecutter
* create a virtualenv
* install dependencies
* create `source-data` and `json-data` directories
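These steps translate roughly to the shell commands below. The cookiecutter template location and the requirements file name are assumptions, so those steps are left commented:

```shell
# cookiecutter <dbnomics-fetcher-cookiecutter template URL>   # template location assumed
python3 -m venv .venv             # create a virtualenv
. .venv/bin/activate
# pip install -r requirements.txt  # install dependencies (file name assumed)
mkdir -p source-data json-data     # directories used by download.py and convert.py
```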
## Good practices
* follow the directives of robots.txt
* discover resources dynamically rather than hard-coding them, so the fetcher stays resilient to provider changes
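For the robots.txt rule, Python's standard library already provides a parser. This sketch parses inline rules instead of fetching a real file, to stay self-contained; the user agent string is illustrative:

```python
from urllib.robotparser import RobotFileParser

# Parse robots.txt rules before downloading; in a real fetcher you would
# fetch them with RobotFileParser(url).read() instead of inlining them.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])

assert robots.can_fetch("dbnomics-fetcher", "https://example.org/data.csv")
assert not robots.can_fetch("dbnomics-fetcher", "https://example.org/private/x")
```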
## Main steps of a script
* start from the skeleton of `download.py` or `convert.py`
* define what is a resource
* implement the `prepare_resources` function
* implement the `process_resource` function
A resource must have a unique `id`. Other attributes (e.g. `file`, `url`, etc.) can be defined by inheriting the base class `dbnomics_fetcher_toolbox.resources.Resource`.
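The two functions can be sketched without the toolbox itself. `SimpleResource`, the dataset codes, and the URLs below are stand-ins, not the real `dbnomics_fetcher_toolbox` API:

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class SimpleResource:
    """Stand-in for dbnomics_fetcher_toolbox.resources.Resource."""
    id: str
    url: str
    file: Path


def prepare_resources() -> Iterator[SimpleResource]:
    """Enumerate the resources the fetcher will download."""
    for dataset_code in ["prices", "wages"]:  # hypothetical dataset codes
        yield SimpleResource(
            id=dataset_code,
            url=f"https://example.org/{dataset_code}.csv",  # hypothetical URL
            file=Path("source-data") / f"{dataset_code}.csv",
        )


def process_resource(resource: SimpleResource) -> None:
    """Download one resource to the file-system (stubbed here)."""
    resource.file.parent.mkdir(parents=True, exist_ok=True)
    resource.file.write_text(f"downloaded from {resource.url}\n")
```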
Using a sub-directory per resource:
* implement `create_context` method which creates the directory
* implement `delete` method which deletes the directory
```python
import shutil
from pathlib import Path

from dbnomics_fetcher_toolbox.resources import Resource


class DaresResource(Resource):
    dir: Path

    def create_context(self):
        """Create the directory holding the resource files."""
        self.dir.mkdir(exist_ok=True)

    def delete(self):
        """Delete HTML file and all Excel files."""
        shutil.rmtree(self.dir)
```
## Run the fetcher
```bash
python download.py source-data
```
You should find `status.jsonl` in `source-data` and several files corresponding to the downloaded resources.
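`status.jsonl` is a JSON Lines file, so it can be inspected with a few lines of Python. The field names used here (`id`, `status`) are an assumption for the sketch, not a documented schema:

```python
import json
from pathlib import Path

# Build a sample status.jsonl; in a real run, download.py writes this file.
status_file = Path("source-data") / "status.jsonl"
status_file.parent.mkdir(exist_ok=True)
status_file.write_text(
    '{"id": "prices", "status": "SUCCESS"}\n'
    '{"id": "wages", "status": "FAILURE"}\n'
)

# List the resources that failed, one JSON object per line.
failed = [
    entry["id"]
    for entry in map(json.loads, status_file.read_text().splitlines())
    if entry["status"] == "FAILURE"
]
```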
To display the available script options:
```bash
python download.py source-data --help
```
## Technical details
* why use asyncio? Downloading many resources is I/O-bound, so asyncio lets requests run concurrently instead of one at a time.
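A minimal illustration of the benefit: with asyncio, several downloads wait concurrently, so total wall-clock time is close to the slowest resource instead of the sum. The `download` coroutine below sleeps instead of doing real HTTP:

```python
import asyncio
import time


async def download(resource_id: str, seconds: float) -> str:
    """Stand-in for an HTTP download; sleeps instead of doing I/O."""
    await asyncio.sleep(seconds)
    return resource_id


async def main() -> list[str]:
    # The three "downloads" overlap, so total time is close to the longest
    # one (0.2 s), not the sum (0.6 s).
    return await asyncio.gather(
        download("a", 0.2), download("b", 0.2), download("c", 0.2)
    )


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
```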