Backport pipeline from OECD instance
Description
The DBnomics fetcher pipeline has evolved in many directions.
Pipeline Versions
version 1: GitLab Runner
- used in prod for some fetchers
- uses GitLab Runner with Docker or Shell executors, depending on the fetcher
- runs on statically provisioned servers
- pipeline definition is copy-pasted in each fetcher repo (`.gitlab-ci.yml`); a minimal sketch follows the Notes below
- data is stored either in containers for Docker executors (so it is lost between each job and pipeline run), or in a persisted directory for Shell executors (Eurostat)
- starts by cloning Git repos of source-data and json-data, deleting files (not for all fetchers, cf Eurostat which is an exception), and executes the Python scripts
- fetchers are run from a python:3.x container image, dependencies are installed at each pipeline run
- incremental mode is explicit: last pipeline execution date is read from Git history
- data is deployed to production by the index job, which is triggered by pushing to the json-data repo; it indexes data to Solr, but also does a "git pull" in the directory served by the API; for that it is bound to a Shell executor on the server hosting the API and Solr, which therefore have to be served from the same server
Notes:
- this pipeline is attached to Git
- scripts are attached to Git
- jobs taken by Shell executors are bound to the Python version available on the server
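For illustration, here is a minimal sketch of the kind of `.gitlab-ci.yml` that was copy-pasted into each fetcher repo in version 1; the provider name, repository URLs and the `download.py`/`convert.py` entry points are assumptions, not the actual fetcher scripts:

```yaml
# Sketch only: PROVIDER, repo URLs and the download.py/convert.py entry points
# are placeholders, not the real fetcher code.
stages:
  - download
  - convert

variables:
  PROVIDER: example-provider

download:
  stage: download
  image: python:3.9            # dependencies are reinstalled at each pipeline run
  script:
    - pip install -r requirements.txt
    - git clone https://git.nomics.world/dbnomics-source-data/${PROVIDER}-source-data.git
    # explicit incremental mode: the last pipeline execution date is read from Git history
    - export LAST_RUN=$(git -C ${PROVIDER}-source-data log -1 --format=%cI)
    - python download.py --start-date "$LAST_RUN" ${PROVIDER}-source-data
    - git -C ${PROVIDER}-source-data commit -am "Update source data" && git -C ${PROVIDER}-source-data push

convert:
  stage: convert
  image: python:3.9
  script:
    - pip install -r requirements.txt
    # with a Docker executor, data is lost between jobs, so both repos are cloned again
    - git clone https://git.nomics.world/dbnomics-source-data/${PROVIDER}-source-data.git
    - git clone https://git.nomics.world/dbnomics-json-data/${PROVIDER}-json-data.git
    - python convert.py ${PROVIDER}-source-data ${PROVIDER}-json-data
    # pushing to the json-data repo triggers the index job that deploys to production
    - git -C ${PROVIDER}-json-data commit -am "Convert data" && git -C ${PROVIDER}-json-data push
```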
version 2: GitLab Runner + Kubernetes
- used in prod for some fetchers
- uses GitLab Runner with Kubernetes executor
- runs in the k8s cluster named "condescending borg" on Scaleway
- common pipeline definition is stored in https://git.nomics.world/dbnomics/dbnomics-fetcher-pipeline/ and is included by each fetcher repo (`.gitlab-ci.yml` using the `include` directive); a minimal sketch follows the Notes below
- CI scripts are also downloaded from https://git.nomics.world/dbnomics/dbnomics-fetcher-pipeline/ with `wget` at each pipeline execution
- data is stored in a PVC backed by an NFS storage class, shared by all jobs, which cooperate by each using their own sub-directory based on the provider slug
- starts by cloning Git repos of source-data and json-data, deleting files, and executes the Python scripts
- fetchers are run from a python:3.x container image, dependencies are installed at each pipeline run
- incremental mode is explicit: last pipeline execution date is read from Git history
- data is deployed to production by the same index job as for version 1
Notes:
- this pipeline is attached to Git
- scripts are attached to Git
- using NFS is very slow
- cloning/updating Git repositories upfront then deleting files is sub-optimal
- having a block storage volume that is always up can be expensive
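For illustration, a fetcher repo's `.gitlab-ci.yml` in version 2 might look roughly like this; the included file name, ref and variable names are assumptions, not the actual pipeline definition:

```yaml
# Sketch only: the included file name, ref and variable names are assumptions.
include:
  - project: dbnomics/dbnomics-fetcher-pipeline
    ref: master
    file: fetcher-pipeline.yml

variables:
  # each job works in its own sub-directory of the shared NFS-backed PVC
  PROVIDER_SLUG: example-provider

# the common definition also downloads its CI scripts with wget at every run, e.g.:
#   wget https://git.nomics.world/dbnomics/dbnomics-fetcher-pipeline/raw/master/pipeline.sh
```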
version 3: Tekton + Kubernetes
- used on the "next" instance
- uses Tekton
- runs in the k8s cluster named "dbnomics next" on Scaleway
- common pipeline definition is stored in dbnomics-kubernetes repo
- data is stored in workspaces that are provisioned at each pipeline run and shared between jobs
- starts from an empty directory for source-data and json-data and executes the Python scripts
- fetcher source code is contained in a container image with its complete environment
- incremental mode for convert is implicit (i.e. only downloaded data is converted), but the scripts can't read the previous execution date from Git history anymore; that metadata has to be passed in by the pipeline (see the Tekton sketch after the Notes below)
- committing to Git can be done at the end of the pipeline, but it has been replaced with S3
- data is deployed to production by the index job, which just indexes data to Solr, and the sync job, which pushes to S3; the API reads dynamically from S3 rather than from the file system
Notes:
- the pipeline is attached to Git only for the last job
- the scripts are not attached to Git
- it's possible to change the storage engine later by replacing the last job only
- Tekton was tried because of its ability to create "workspaces" shared between all jobs of a pipeline
- pipeline runs are not cleaned up automatically, so the PVCs backing the workspaces are not freed
- Tekton does not offer a database/API for pipeline metadata, making it impossible to build a domain-level dashboard
- even if Tekton has a dashboard, job logs are not persisted by default, so they are lost after pods are deleted; it's possible to add a log aggregation stack to the cluster (e.g. Elastic Stack, Loki Stack...) but its UX targets sysadmin/devops work, not domain-level interaction
- workspaces backed by PVCs can be expensive because of the provider's minimum PVC size (e.g. Scaleway's minimum is 1 GB)
- Tekton is not integrated with GitLab
- packaging fetchers in container images defines a clear boundary between the pipeline and the fetcher by defining its full environment (programming language, language dependencies, OS-level packages like chromium...)
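For illustration, a version 3 Tekton pipeline might be structured roughly as follows; task names, the param name and the storage size are assumptions:

```yaml
# Sketch only: task names, param names and sizes are assumptions.
apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: fetcher-pipeline
spec:
  params:
    - name: previous-run-date     # passed by the pipeline, since Git history is no longer read
      type: string
  workspaces:
    - name: data                  # shared by all tasks of a run
  tasks:
    - name: download
      taskRef:
        name: fetcher-download
      params:
        - name: previous-run-date
          value: $(params.previous-run-date)
      workspaces:
        - name: data
          workspace: data
    - name: convert
      runAfter: [download]
      taskRef:
        name: fetcher-convert
      workspaces:
        - name: data
          workspace: data
    - name: index-and-sync        # indexes to Solr and pushes json-data to S3
      runAfter: [convert]
      taskRef:
        name: fetcher-deploy
      workspaces:
        - name: data
          workspace: data
---
# Each run provisions the workspace as a PVC (subject to the provider's minimum size)
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  generateName: fetcher-pipeline-run-
spec:
  pipelineRef:
    name: fetcher-pipeline
  params:
    - name: previous-run-date
      value: "2024-01-01T00:00:00Z"
  workspaces:
    - name: data
      volumeClaimTemplate:
        spec:
          accessModes: [ReadWriteOnce]
          resources:
            requests:
              storage: 1Gi
```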
version 4: OECD in-house
During the deployment of a private DBnomics instance for OECD, the GitLab integration of fetchers and the dashboard were important enough to backport the Tekton pipeline to GitLab Runner, while keeping the additions of version 3 such as running from container images and starting from an empty directory.
The dashboard is used to consult previous fetcher pipelines and to start new ones.
The UX of GitLab CI is more advanced than Tekton's in that help messages and default values can be added to the "new pipeline" form.
- used on OECD private instance
- uses GitLab Runner with Kubernetes executor
- runs in a k8s cluster
- common pipeline definition is stored in a private repo
- data is stored in a block storage PVC that is attached to the job pods via the gitlab-runner config and used cooperatively by all the fetchers
- starts from an empty directory for source-data and json-data and executes the Python scripts
- fetcher source code is contained in a container image with its complete environment
- incremental mode for convert is implicit (i.e. only downloaded data is converted), but the scripts can't read the previous execution date from Git history anymore; that metadata has to be passed in by the pipeline (see the sketch after this list)
- committing to Git is done at the end of the pipeline
- data is deployed to production by the index job that just indexes data to Solr, the sync job that pushes to the Git repos, and a sidecar container of the API that periodically pulls every known json-data repo
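For illustration, jobs from the version 4 common pipeline definition might look roughly like this; the image registry, mount path, variable names and entry points are assumptions:

```yaml
# Sketch only: registry, mount path, variable names and entry points are assumptions.
stages:
  - download
  - convert

download:
  stage: download
  # the fetcher and its complete environment are packaged in a container image
  image: registry.example.org/dbnomics-fetchers/${PROVIDER_SLUG}:latest
  script:
    # start from empty directories on the block storage PVC attached via the gitlab-runner config
    - mkdir -p /data/${PROVIDER_SLUG}/source-data
    # the previous execution date is passed by the pipeline, since Git history is no longer read
    - download-fetcher --start-date "$PREVIOUS_RUN_DATE" /data/${PROVIDER_SLUG}/source-data

convert:
  stage: convert
  image: registry.example.org/dbnomics-fetchers/${PROVIDER_SLUG}:latest
  script:
    - mkdir -p /data/${PROVIDER_SLUG}/json-data
    # implicit incremental mode: only what was downloaded in this run is converted
    - convert-fetcher /data/${PROVIDER_SLUG}/source-data /data/${PROVIDER_SLUG}/json-data
    # committing to Git happens at the end of the pipeline; the API sidecar container
    # later pulls every known json-data repo
```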
Goals
- keep the best of each version (fetchers in containers, GitLab CI UX, have a domain-level dashboard)
- evaluate what needs to be adapted from version 4 (OECD) to be used for DBnomics (the OECD POC has fetchers with small datasets)
- avoid having a small fetcher blocked by a big one that takes a long time
Acceptance Criteria
Being able to start porting all fetchers to the new pipeline on pipeline-ng instance:
- data produced by a fetcher on the new pipeline is available to end users on pipeline-ng.db.nomics.world and regularly updated
- pipeline jobs for each ported fetcher can be accessed from the DBnomics dashboard
Later, when all fetchers run on the pipeline-ng instance, the DNS will be switched for pipeline-ng to become the production instance.
Tasks
- integrate Sentry into the dbnomics-sync-git script used by the API sidecar container
- fix Solr with Eurostat
- fix Git push with Eurostat
- fix dashboard for the `main` branch
- let dashboard support pipelines v1, v2, v5
- Solr incremental indexation (without reading the Git commit)
- Git sync with merge strategy (cf. #845 (closed))
- let the validation job fail when there are validation errors (e.g. this job, fixed in this job)