Reduce data availability delay
Drafting of this issue is in progress.
- As a user
- I want to have data on DBnomics as soon as it is published by the provider
- in order to build my workflows with DBnomics without having to wait too long.
Description
The goal of this issue is to describe the tasks that help reduce the time between the publication of a time series by the provider and its availability on DBnomics.
This is a strategic problem DBnomics has to solve, because data freshness is one of the main reasons a new user chooses to use DBnomics or not.
- the current scheduler is static (it runs based on cron expressions), but we need a more dynamic scheduler, based on strategies
- GitLab does not supply such a scheduler, so we may need to introduce an external one (e.g. Apache Airflow) that would either trigger GitLab CI jobs or run its own jobs
- when data providers do not give access to a log of data updates, and data fetching is costly:
  - we may need to trigger jobs manually from the website or the dashboard, with authentication allowing users to do so (consumers, or even "insiders" on the provider side wanting to help keep DBnomics up to date)
  - we may need to trigger jobs on a restricted scope, like specific datasets or series
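As a sketch of the manual-trigger idea, the payload for GitLab's pipeline trigger API (`POST /projects/:id/trigger/pipeline`) could carry the restricted scope as pipeline variables. The variable names (`PROVIDER_CODE`, `DATASET_CODES`) are hypothetical conventions that the fetcher pipeline would have to interpret; nothing like this exists yet:

```python
def build_trigger_payload(token, ref, provider_code, dataset_codes=None):
    """Build form data for GitLab's pipeline trigger API, restricting
    a fetch to specific datasets via pipeline variables.

    PROVIDER_CODE and DATASET_CODES are hypothetical variable names.
    """
    payload = {
        "token": token,  # pipeline trigger token
        "ref": ref,      # branch to run the pipeline on
        "variables[PROVIDER_CODE]": provider_code,
    }
    if dataset_codes:
        # Comma-separated list; the job script would split it again.
        payload["variables[DATASET_CODES]"] = ",".join(dataset_codes)
    return payload

# Example: restrict the BDF fetcher to two datasets.
payload = build_trigger_payload("secret-token", "master", "BDF",
                                ["DATASET_ONE", "DATASET_TWO"])
# This dict would then be POSTed, e.g. requests.post(trigger_url, data=payload).
```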
Example of declarative schedule expressions:
```yaml
# fetchers.yml
fetchers:
  - provider_code: BDF
    maintainer: bduye
    star: true
    schedules:
      # Fetch all data once a day, but do not tell when.
      - scope: all
        strategy:
          every: 24h
      # Fetch 2 specific datasets every hour, but do not tell when.
      - scope:
          - dataset_code: DATASET_ONE
          - dataset_code: DATASET_TWO
        strategy:
          every: 1h
      # Fetch 3 specific series as often as possible.
      - scope:
          - dataset_code: DATASET_ONE
            series_code: SERIES_ONE
          - dataset_code: DATASET_ONE
            series_code: SERIES_TWO
          - dataset_code: DATASET_TWO
            series_code: SERIES_THREE
        strategy: as_often_as_possible # TODO choose better name
      # Fetch a specific series each first day of the month at 1am French time.
      - scope:
          - dataset_code: DATASET_ONE
            series_code: SERIES_ONE
        strategy:
          cron: 0 1 1 * *
        # TODO allow expressing it in ISO 8601
```
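As a sketch of how a dynamic scheduler could consume such declarations, the helpers below parse `every` durations and decide whether a schedule entry is due. The entry structure mirrors the YAML example once loaded into Python dicts; the function names are illustrative, not an existing API:

```python
from datetime import datetime, timedelta

def parse_every(value):
    """Parse a duration like '24h', '30m', '10s' or '2d' into a timedelta."""
    units = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}
    unit = value[-1]
    if unit not in units:
        raise ValueError(f"Unsupported duration: {value!r}")
    return timedelta(**{units[unit]: int(value[:-1])})

def is_due(strategy, last_run, now):
    """Tell whether a schedule entry should run again.

    Only the 'every' and 'as_often_as_possible' strategies are sketched
    here; 'cron' would delegate to a cron-parsing library.
    """
    if strategy == "as_often_as_possible":
        return True
    if isinstance(strategy, dict) and "every" in strategy:
        return now - last_run >= parse_every(strategy["every"])
    raise NotImplementedError(f"Unknown strategy: {strategy!r}")

now = datetime(2019, 1, 2, 12, 0)
print(is_due({"every": "24h"}, datetime(2019, 1, 1, 11, 0), now))   # True
print(is_due({"every": "1h"}, datetime(2019, 1, 2, 11, 30), now))   # False
```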
Source: https://www.draw.io/#G1qz5oLw5Hu21vSqD7FqAZtYoZuI8WGL_Y
TODO:
- look at existing schedulers for the commonly used names of these strategies
Tasks
- reduce jobs waiting time: cf #694 (closed)
- declare schedules in fetchers.yml
- display schedule information on website (extension of #664 with new schedule strategies)
- document why some providers have fresh data on DBnomics, and some others don't
Technical analysis
Technical response
Current situation:
- we use a GitLab CI to run the download, convert and index jobs
- jobs are executed by runners
- runners are hosted by servers
- we can configure a limit of N jobs per runner
This creates a situation where jobs can wait in the queue for some time before being taken by a matching runner, and this waiting time is difficult to predict.
The execution time of a job, by contrast, is easier to predict.
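One way to quantify this waiting time is to compare the `created_at` and `started_at` timestamps that the GitLab jobs API exposes for each job; a minimal sketch, assuming timestamps in ISO 8601 without a timezone suffix:

```python
from datetime import datetime

def queue_wait_seconds(created_at, started_at):
    """Compute how long a job waited in the queue, in seconds.

    Both arguments are ISO 8601 timestamps, like the `created_at`
    and `started_at` fields of the GitLab jobs API.
    """
    created = datetime.fromisoformat(created_at)
    started = datetime.fromisoformat(started_at)
    return (started - created).total_seconds()

# A job created at 10:00 that only started at 10:07:30 waited 450 seconds.
print(queue_wait_seconds("2019-03-01T10:00:00", "2019-03-01T10:07:30"))  # 450.0
```

Aggregating this metric per runner would give a baseline before trying any of the tracks below.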
There are many tracks to be explored:
- auto-provision servers and runners in a cloud architecture (called "autoscale", cf. GitLab docs)
- find a way to express custom scheduling rules: priorities, distinguishing bandwidth-bound jobs from CPU-bound jobs, and taking the time of publication into account (on GitLab CI or another job orchestrator, e.g. Apache Airflow)
- maybe more ideas to come...
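As a toy illustration of such custom rules (not an existing DBnomics component), a scheduler could keep one queue per resource kind and order each by the provider's publication time; all job and provider names below are made up:

```python
import heapq

class JobQueue:
    """Toy scheduler: one priority queue per job kind
    ('bandwidth' for download jobs, 'cpu' for convert jobs),
    each ordered by the provider's publication time."""

    def __init__(self):
        self.queues = {"bandwidth": [], "cpu": []}

    def push(self, kind, publication_time, job_name):
        heapq.heappush(self.queues[kind], (publication_time, job_name))

    def pop(self, kind):
        """Return the most urgent job of the given kind."""
        return heapq.heappop(self.queues[kind])[1]

q = JobQueue()
q.push("bandwidth", "2019-03-01T09:00", "download BDF")
q.push("bandwidth", "2019-03-01T08:00", "download INSEE")
q.push("cpu", "2019-03-01T08:30", "convert ECB")
print(q.pop("bandwidth"))  # download INSEE (earliest publication first)
```

Separate queues keep a long download from starving convert jobs; runners would then be sized per kind.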
We already tried to tackle runner autoscale; this is documented in #496 (closed), but we switched back to a VM without autoscale. The 2 main reasons: we tried it on the Scaleway infrastructure (I'm pretty sure it would work better with Google Cloud or AWS), and we did not find a way to provision small VMs for light jobs and powerful VMs for CPU-intensive jobs.