Backport pipeline from OECD instance
Description
The DBnomics fetcher pipeline has evolved in many directions.
Pipeline Versions
version 1: GitLab Runner
- used in prod for some fetchers
- uses GitLab Runner with Docker or Shell executors, depending on the fetcher
- runs on statically provisioned servers
- pipeline definition is copy-pasted in each fetcher repo (`.gitlab-ci.yml`); a minimal sketch follows the Notes below
- data is stored either in containers for Docker executors (so it is lost between each job and pipeline run), or in a persisted directory for Shell executors (Eurostat)
- starts by cloning Git repos of source-data and json-data, deleting files (not for all fetchers, cf Eurostat which is an exception), and executes the Python scripts
- fetchers are run from a python:3.x container image, dependencies are installed at each pipeline run
- incremental mode is explicit: last pipeline execution date is read from Git history
- data is deployed to production by the index job, which is triggered by pushing to the json-data repo; it indexes data to Solr, but also does a "git pull" in the directory served by the API; for that it is bound to a Shell executor on the server hosting the API and Solr, which therefore have to be served from the same server
Notes:
- this pipeline is attached to Git
- scripts are attached to Git
- jobs taken by Shell executors are bound to the Python version available on the server
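For illustration, here is a minimal sketch of the kind of `.gitlab-ci.yml` that was copy-pasted into each fetcher repo in version 1; the provider name, repository URLs and the `download.py`/`convert.py` entry points are assumptions, not the actual fetcher scripts:

```yaml
# Sketch only: PROVIDER, repo URLs and the download.py/convert.py entry points
# are placeholders, not the real fetcher code.
stages:
  - download
  - convert

variables:
  PROVIDER: example-provider

download:
  stage: download
  image: python:3.9            # dependencies are reinstalled at each pipeline run
  script:
    - pip install -r requirements.txt
    - git clone https://git.nomics.world/dbnomics-source-data/${PROVIDER}-source-data.git
    # explicit incremental mode: the last pipeline execution date is read from Git history
    - export LAST_RUN=$(git -C ${PROVIDER}-source-data log -1 --format=%cI)
    - python download.py --start-date "$LAST_RUN" ${PROVIDER}-source-data
    - git -C ${PROVIDER}-source-data commit -am "Update source data" && git -C ${PROVIDER}-source-data push

convert:
  stage: convert
  image: python:3.9
  script:
    - pip install -r requirements.txt
    # with a Docker executor, data is lost between jobs, so both repos are cloned again
    - git clone https://git.nomics.world/dbnomics-source-data/${PROVIDER}-source-data.git
    - git clone https://git.nomics.world/dbnomics-json-data/${PROVIDER}-json-data.git
    - python convert.py ${PROVIDER}-source-data ${PROVIDER}-json-data
    # pushing to the json-data repo triggers the index job that deploys to production
    - git -C ${PROVIDER}-json-data commit -am "Convert data" && git -C ${PROVIDER}-json-data push
```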
version 2: GitLab Runner + Kubernetes
- used in prod for some fetchers
- uses GitLab Runner with Kubernetes executor
- runs in the k8s cluster named "condescending borg" on Scaleway
- common pipeline definition is stored in https://git.nomics.world/dbnomics/dbnomics-fetcher-pipeline/ and is included by each fetcher repo (`.gitlab-ci.yml` using the `include` directive); a minimal sketch follows the Notes below
- CI scripts are also downloaded from https://git.nomics.world/dbnomics/dbnomics-fetcher-pipeline/ with `wget` at each pipeline execution
- data is stored in a PVC backed by an NFS storage class, shared by all jobs, which cooperate by each using their own sub-directory based on the provider slug
- starts by cloning Git repos of source-data and json-data, deleting files, and executes the Python scripts
- fetchers are run from a python:3.x container image, dependencies are installed at each pipeline run
- incremental mode is explicit: last pipeline execution date is read from Git history
- data is deployed to production by the same index job as for version 1
Notes:
- this pipeline is attached to Git
- scripts are attached to Git
- using NFS is very slow
- cloning/updating Git repositories upfront then deleting files is sub-optimal
- having a block storage volume that is always up can be expensive
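For illustration, a fetcher repo's `.gitlab-ci.yml` in version 2 might look roughly like this; the included file name, ref and variable names are assumptions, not the actual pipeline definition:

```yaml
# Sketch only: the included file name, ref and variable names are assumptions.
include:
  - project: dbnomics/dbnomics-fetcher-pipeline
    ref: master
    file: fetcher-pipeline.yml

variables:
  # each job works in its own sub-directory of the shared NFS-backed PVC
  PROVIDER_SLUG: example-provider

# the common definition also downloads its CI scripts with wget at every run, e.g.:
#   wget https://git.nomics.world/dbnomics/dbnomics-fetcher-pipeline/raw/master/pipeline.sh
```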
version 3: Tekton + Kubernetes
- used on the "next" instance
- uses Tekton
- runs in the k8s cluster named "dbnomics next" on Scaleway
- common pipeline definition is stored in dbnomics-kubernetes repo
- data is stored in workspaces that are provisioned at each pipeline run and shared between jobs
- starts from an empty directory for source-data and json-data and executes the Python scripts
- fetcher source code is contained in a container image with its complete environment
- incremental mode for convert is implicit (i.e. only downloaded data is converted), but the scripts can't read the previous execution date from Git history anymore; that metadata has to be passed in by the pipeline (see the Tekton sketch after the Notes below)
- committing to Git can be done at the end of the pipeline, but it has been replaced with S3
- data is deployed to production by the index job, which just indexes data to Solr, and the sync job, which pushes to S3; the API reads dynamically from S3 rather than from the file system
Notes:
- the pipeline is attached to Git only for the last job
- the scripts are not attached to Git
- it's possible to change the storage engine later by replacing the last job only
- Tekton was tried because of its ability to create "workspaces" shared between all jobs of a pipeline
- pipeline runs are not cleaned up automatically, so the PVCs backing the workspaces are not freed
- Tekton does not offer a database/API for pipeline metadata, making it impossible to build a domain-level dashboard
- even if Tekton has a dashboard, job logs are not persisted by default, so they are lost after pods are deleted; it's possible to add a log aggregation stack to the cluster (e.g. Elastic Stack, Loki Stack...) but its UX targets sysadmin/devops work, not domain-level interaction
- workspaces backed by PVCs can be expensive because of the provider's minimum PVC size (e.g. Scaleway's minimum is 1 GB)
- Tekton is not integrated with GitLab
- packaging fetchers in container images defines a clear boundary between the pipeline and the fetcher by defining its full environment (programming language, language dependencies, OS-level packages like chromium...)
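For illustration, a version 3 Tekton pipeline might be structured roughly as follows; task names, the param name and the storage size are assumptions:

```yaml
# Sketch only: task names, param names and sizes are assumptions.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: fetcher-pipeline
spec:
  params:
    - name: previous-run-date     # passed by the pipeline, since Git history is no longer read
      type: string
  workspaces:
    - name: data                  # shared by all tasks of a run
  tasks:
    - name: download
      taskRef:
        name: fetcher-download
      params:
        - name: previous-run-date
          value: $(params.previous-run-date)
      workspaces:
        - name: data
          workspace: data
    - name: convert
      runAfter: [download]
      taskRef:
        name: fetcher-convert
      workspaces:
        - name: data
          workspace: data
    - name: index-and-sync        # indexes to Solr and pushes json-data to S3
      runAfter: [convert]
      taskRef:
        name: fetcher-deploy
      workspaces:
        - name: data
          workspace: data
---
# Each run provisions the workspace as a PVC (subject to the provider's minimum size)
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: fetcher-pipeline-run-
spec:
  pipelineRef:
    name: fetcher-pipeline
  params:
    - name: previous-run-date
      value: "2024-01-01T00:00:00Z"
  workspaces:
    - name: data
      volumeClaimTemplate:
        spec:
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 1Gi
```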
version 4: OECD in-house
During the deployment of a private DBnomics instance for OECD, the GitLab integration of fetchers and the dashboard were important enough to backport the Tekton pipeline to GitLab Runner, while keeping the additions of version 3 such as running from container images and starting from an empty directory.
The dashboard is used to consult previous fetcher pipelines and to start new ones.
The UX of GitLab CI is more advanced than Tekton's in that help messages and default values can be added to the "new pipeline" form.
- used on OECD private instance
- uses GitLab Runner with Kubernetes executor
- runs in a k8s cluster
- common pipeline definition is stored in a private repo
- data is stored in a block storage PVC that is attached to the job pods via the gitlab-runner config and used cooperatively by all the fetchers
- starts from an empty directory for source-data and json-data and executes the Python scripts
- fetcher source code is contained in a container image with its complete environment
- incremental mode for convert is implicit (i.e. only downloaded data is converted), but the scripts can't read the previous execution date from Git history anymore; that metadata has to be passed in by the pipeline (see the sketch after this list)
- committing to Git is done at the end of the pipeline
- data is deployed to production by the index job that just indexes data to Solr, the sync job that pushes to the Git repos, and a sidecar container of the API that periodically pulls every known json-data repo
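For illustration, jobs from the version 4 common pipeline definition might look roughly like this; the image registry, mount path, variable names and entry points are assumptions:

```yaml
# Sketch only: registry, mount path, variable names and entry points are assumptions.
stages:
  - download
  - convert

download:
  stage: download
  # the fetcher and its complete environment are packaged in a container image
  image: registry.example.org/dbnomics-fetchers/${PROVIDER_SLUG}:latest
  script:
    # start from empty directories on the block storage PVC attached via the gitlab-runner config
    - mkdir -p /data/${PROVIDER_SLUG}/source-data
    # the previous execution date is passed by the pipeline, since Git history is no longer read
    - download-fetcher --start-date "$PREVIOUS_RUN_DATE" /data/${PROVIDER_SLUG}/source-data

convert:
  stage: convert
  image: registry.example.org/dbnomics-fetchers/${PROVIDER_SLUG}:latest
  script:
    - mkdir -p /data/${PROVIDER_SLUG}/json-data
    # implicit incremental mode: only what was downloaded in this run is converted
    - convert-fetcher /data/${PROVIDER_SLUG}/source-data /data/${PROVIDER_SLUG}/json-data
    # committing to Git happens at the end of the pipeline; the API sidecar container
    # later pulls every known json-data repo
```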
Goals
- keep the best of each version (fetchers in containers, GitLab CI UX, have a domain-level dashboard)
- evaluate what needs to be adapted from version 4 (OECD) to be used for DBnomics (the OECD POC has fetchers with small datasets)
- avoid having a small fetcher blocked by a big one that takes a long time
Acceptance Criteria
Being able to start porting all fetchers to the new pipeline on pipeline-ng instance:
- data produced by a fetcher on the new pipeline is available to end users on pipeline-ng.db.nomics.world and regularly updated
- pipeline jobs for each ported fetcher can be accessed from the DBnomics dashboard
Later, when all fetchers run on the pipeline-ng instance, the DNS will be switched for pipeline-ng to become the production instance.
Tasks
- integrate Sentry into the dbnomics-sync-git script used by the API sidecar container
- fix Solr with Eurostat
- fix Git push with Eurostat
- fix dashboard for the `main` branch
- let dashboard support pipelines v1, v2, v5
- Solr incremental indexation (without reading the Git commit)
- Git sync with merge strategy (cf. #845 (closed))
- let the validation job fail when there are validation errors (e.g. this job, fixed in this job)