Reduce data availability delay
Drafting of this issue is in progress.
- As a user
- I want to have data on DBnomics as soon as it is published by the provider
- in order to build my workflows with DBnomics without having to wait too long.
Description
The goal of this issue is to describe the tasks that help reduce the time between the publication of a time series by the provider and its availability on DBnomics.
This is a strategic problem DBnomics has to solve, because data freshness is one of the main reasons a new user chooses to use DBnomics or not.
- the current scheduler is static (it runs based on cron expressions), but we need a more dynamic scheduler, based on strategies
- GitLab does not supply such a scheduler, so we may need to introduce an external one (e.g. Apache Airflow) that would either trigger GitLab CI jobs or run its own jobs
- when data providers do not give access to a log of data updates, and data fetching is costly:
  - we may need to trigger jobs manually from the website or the dashboard, with authentication allowing users to do so (consumers, or even "insiders" on the provider side wanting to help keep DBnomics up to date)
  - we may need to trigger jobs on a restricted scope, like specific datasets or series
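As a sketch of the manual-trigger idea, the payload for GitLab's pipeline trigger API (`POST /projects/:id/trigger/pipeline`) could carry the restricted scope as pipeline variables. The variable names (`PROVIDER_CODE`, `DATASET_CODES`) are hypothetical conventions that the fetcher pipeline would have to interpret; nothing like this exists yet:

```python
def build_trigger_payload(token, ref, provider_code, dataset_codes=None):
    """Build form data for GitLab's pipeline trigger API, restricting
    a fetch to specific datasets via pipeline variables.

    PROVIDER_CODE and DATASET_CODES are hypothetical variable names.
    """
    payload = {
        "token": token,  # pipeline trigger token
        "ref": ref,      # branch to run the pipeline on
        "variables[PROVIDER_CODE]": provider_code,
    }
    if dataset_codes:
        # Comma-separated list; the job script would split it again.
        payload["variables[DATASET_CODES]"] = ",".join(dataset_codes)
    return payload

# Example: restrict the BDF fetcher to two datasets.
payload = build_trigger_payload("secret-token", "master", "BDF",
                                ["DATASET_ONE", "DATASET_TWO"])
# This dict would then be POSTed, e.g. requests.post(trigger_url, data=payload).
```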
Example of declarative schedule expressions:
```yaml
# fetchers.yml
fetchers:
  - provider_code: BDF
    maintainer: bduye
    star: true
    schedules:
      # Fetch all data once a day, but do not tell when.
      - scope: all
        strategy:
          every: 24h
      # Fetch 2 specific datasets every hour, but do not tell when.
      - scope:
          - dataset_code: DATASET_ONE
          - dataset_code: DATASET_TWO
        strategy:
          every: 1h
      # Fetch 3 specific series as often as possible.
      - scope:
          - dataset_code: DATASET_ONE
            series_code: SERIES_ONE
          - dataset_code: DATASET_ONE
            series_code: SERIES_TWO
          - dataset_code: DATASET_TWO
            series_code: SERIES_THREE
        strategy: as_often_as_possible # TODO choose better name
      # Fetch a specific series each first day of the month at 1am French time.
      - scope:
          - dataset_code: DATASET_ONE
            series_code: SERIES_ONE
        strategy:
          cron: 0 1 1 * *
        # TODO allow expressing it in ISO 8601
```
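As a sketch of how a dynamic scheduler could consume such declarations, the helpers below parse `every` durations and decide whether a schedule entry is due. The entry structure mirrors the YAML example once loaded into Python dicts; the function names are illustrative, not an existing API:

```python
from datetime import datetime, timedelta

def parse_every(value):
    """Parse a duration like '24h', '30m', '10s' or '2d' into a timedelta."""
    units = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
    unit = value[-1]
    if unit not in units:
        raise ValueError(f"Unsupported duration: {value!r}")
    return timedelta(**{units[unit]: int(value[:-1])})

def is_due(strategy, last_run, now):
    """Tell whether a schedule entry should run again.

    Only the 'every' and 'as_often_as_possible' strategies are sketched
    here; 'cron' would delegate to a cron-parsing library.
    """
    if strategy == "as_often_as_possible":
        return True
    if isinstance(strategy, dict) and "every" in strategy:
        return now - last_run >= parse_every(strategy["every"])
    raise NotImplementedError(f"Unknown strategy: {strategy!r}")

now = datetime(2019, 1, 2, 12, 0)
print(is_due({"every": "24h"}, datetime(2019, 1, 1, 11, 0), now))   # True
print(is_due({"every": "1h"}, datetime(2019, 1, 2, 11, 30), now))   # False
```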
Source: https://www.draw.io/#G1qz5oLw5Hu21vSqD7FqAZtYoZuI8WGL_Y
TODO:
- look at existing schedulers for the commonly used names of these strategies
Tasks
- reduce jobs waiting time: cf #694 (closed)
- declare schedules in fetchers.yml
- display schedule information on website (extension of #664 with new schedule strategies)
- document why some providers have fresh data on DBnomics, and some others don't
Technical analysis
Technical response
Current situation:
- we use a GitLab CI to run the download, convert and index jobs
- jobs are executed by runners
- runners are hosted by servers
- we can configure a limit of N jobs per runner
This creates a situation where jobs can wait in the queue for some time before being taken by a matching runner, and this waiting time is difficult to predict.
The execution time of a job, by contrast, is easier to predict.
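One way to quantify this waiting time is to compare the `created_at` and `started_at` timestamps that the GitLab jobs API exposes for each job; a minimal sketch, assuming timestamps in ISO 8601 without a timezone suffix:

```python
from datetime import datetime

def queue_wait_seconds(created_at, started_at):
    """Compute how long a job waited in the queue, in seconds.

    Both arguments are ISO 8601 timestamps, like the `created_at`
    and `started_at` fields of the GitLab jobs API.
    """
    created = datetime.fromisoformat(created_at)
    started = datetime.fromisoformat(started_at)
    return (started - created).total_seconds()

# A job created at 10:00 that only started at 10:07:30 waited 450 seconds.
print(queue_wait_seconds("2019-03-01T10:00:00", "2019-03-01T10:07:30"))  # 450.0
```

Aggregating this metric per runner would give a baseline before trying any of the tracks below.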
There are many tracks to be explored:
- auto-provision servers and runners in a cloud architecture (called "autoscale", cf. GitLab docs)
- find a way to express custom scheduling rules: priorities, distinguishing bandwidth-bound jobs from CPU-bound jobs, and taking the time of publication into account (on GitLab CI or another job orchestrator, e.g. Apache Airflow)
- maybe more ideas to come...
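As a toy illustration of such custom rules (not an existing DBnomics component), a scheduler could keep one queue per resource kind and order each by the provider's publication time; all job and provider names below are made up:

```python
import heapq

class JobQueue:
    """Toy scheduler: one priority queue per job kind
    ('bandwidth' for download jobs, 'cpu' for convert jobs),
    each ordered by the provider's publication time."""

    def __init__(self):
        self.queues = {"bandwidth": [], "cpu": []}

    def push(self, kind, publication_time, job_name):
        heapq.heappush(self.queues[kind], (publication_time, job_name))

    def pop(self, kind):
        """Return the most urgent job of the given kind."""
        return heapq.heappop(self.queues[kind])[1]

q = JobQueue()
q.push("bandwidth", "2019-03-01T09:00", "download BDF")
q.push("bandwidth", "2019-03-01T08:00", "download INSEE")
q.push("cpu", "2019-03-01T08:30", "convert ECB")
print(q.pop("bandwidth"))  # download INSEE (earliest publication first)
```

Separate queues keep a long download from starving convert jobs; runners would then be sized per kind.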
We already tried to tackle runner autoscale; this is documented in #496 (closed), but we switched back to a VM without autoscale. The 2 main reasons: we tried it on the Scaleway infrastructure (I'm pretty sure it would work better with Google Cloud or AWS), and we did not find a way to provision small VMs for light jobs and powerful VMs for CPU-intensive jobs.