Try GitLab Runner autoscaling with Kubernetes

This issue is part of #666


Goal

  • General goal: reduce data availability delay.
  • Specific goal of this issue: reduce the time a job spends waiting in the queue, by using the GitLab Runner autoscaling feature via the Kubernetes executor (cf docs)

Context

  • previously we tried autoscaling with the docker-machine executor and the Scaleway driver
    • mostly because the Docker driver had issues, and the fact that it is not officially supported by Docker does not help
  • my intuition is to try Kubernetes first because it is more widely adopted than docker-machine
    • Scaleway has a managed Kubernetes offer named Kapsule, and we're going to start with it

Tasks

  • create a Kubernetes cluster on Scaleway
  • follow the GitLab Runner on Kubernetes docs
  • set up cluster scale-up
  • set up cluster scale-down
  • set up CI pipeline cache via the S3 distributed cache
  • refactor the GitLab CI pipeline into a single pipeline with download, convert, and index jobs (related to #523 (closed) and #557)
  • consider adapting or removing the current dashboard
  • add Prometheus and collect GitLab Runner metrics (enabled by default in the GitLab Runner chart's values.yaml)
  • set up a dashboard presenting data as a timeline (Grafana?)
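The runner setup above can be sketched as a values.yaml fragment for the GitLab Runner Helm chart. This is a sketch under assumptions, not our actual configuration: the concurrency value, bucket name, S3 endpoint, and secret name are illustrative placeholders.

```yaml
# values.yaml sketch for the gitlab-runner Helm chart (all values are placeholders).
# gitlabUrl and the registration token come from the GitLab instance's runner settings.
gitlabUrl: https://git.nomics.world/
runnerRegistrationToken: "REDACTED"

# Upper bound on concurrently running jobs; the Kubernetes executor creates
# one pod per job on demand (assumed value, to be tuned).
concurrent: 10

# Prometheus metrics are enabled by default in the chart's values.yaml.
metrics:
  enabled: true

runners:
  executor: kubernetes
  # Distributed cache on S3-compatible storage (endpoint, bucket, and secret
  # names are assumptions for illustration).
  cache:
    cacheType: s3
    s3ServerAddress: s3.fr-par.scw.cloud
    s3BucketName: dbnomics-ci-cache
    s3CacheInsecure: false
    secretName: s3-cache-credentials
```

Cluster scale-up and scale-down would be handled on the Scaleway/Kapsule side (node pool autoscaling), not in this chart.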

New CI pipeline

  • cf https://git.nomics.world/dbnomics/dbnomics-fetcher-pipeline
  • write the pipeline with 3 stages: download, convert, index
  • add a validate stage between convert and index, but keep in mind that warnings will need to be introduced
  • test with one big fetcher
  • test with many fetchers at the same time
  • remove the test triggers (afdb, ecb)
  • do not use the dev branch in wget ... git-pull-or-clone.py
  • remove the new-pipeline branch name in the Index job
  • add "git push" info to the dashboard (cf this JSON)
  • add support for the errors.json artifact
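The three-stage pipeline can be sketched as a minimal .gitlab-ci.yml. Job names and stage layout follow the list above; the script bodies are placeholders, not the actual dbnomics-fetcher-pipeline content:

```yaml
# .gitlab-ci.yml sketch: one pipeline with download, convert, and index stages.
# A validate stage would later be inserted between convert and index.
stages:
  - download
  - convert
  - index

download:
  stage: download
  script:
    - echo "fetch source data"  # placeholder for the real download command

convert:
  stage: convert
  script:
    - echo "convert source data to DBnomics format"  # placeholder

index:
  stage: index
  script:
    - echo "index converted data"  # placeholder
```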

Migrating fetchers to k8s

In management project:

  • update fetchers.yml to remove the legacy_pipeline flag
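For illustration, a provider entry in fetchers.yml might look like the following; the file's exact structure and the provider slug are assumptions:

```yaml
# fetchers.yml -- hypothetical provider entry.
providers:
  - slug: example-provider
    # legacy_pipeline: true  <- delete this flag to switch to the k8s pipeline
```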

Then use switch-provider-to-k8s-pipeline.py:

python switch-provider-to-k8s-pipeline.py -v --dry-run PROVIDER_SLUG
# if everything seems OK
python switch-provider-to-k8s-pipeline.py -v PROVIDER_SLUG

About k8s resource requests and limits:

  • start with the default settings provided by fetcher-gitlab-ci.yml
  • look at the metrics in Grafana
  • adjust memory and CPU requests and limits based on what is actually used
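If fetcher-gitlab-ci.yml relies on the Kubernetes executor's per-job overrides, requests and limits can be adjusted through CI variables. This assumes the runner configuration allows these overrides; the numbers are placeholders to be tuned from the Grafana metrics:

```yaml
# Per-job resource overrides recognized by the GitLab Runner Kubernetes executor
# (values are placeholders, not measured figures).
variables:
  KUBERNETES_CPU_REQUEST: "500m"
  KUBERNETES_CPU_LIMIT: "1"
  KUBERNETES_MEMORY_REQUEST: "512Mi"
  KUBERNETES_MEMORY_LIMIT: "2Gi"
```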

Rollback?

If needed, it's possible to go back to the old pipeline by reverting the steps below, then launching the configure-ci-for-provider.py script:

  • revert commit in fetcher source code about .gitlab-ci.yml
  • update fetchers.yml to add legacy_pipeline: true flag
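The fetchers.yml side of the rollback would look roughly like this; as above, the file structure and provider slug are assumptions:

```yaml
# fetchers.yml -- hypothetical provider entry, flag added back for rollback.
providers:
  - slug: example-provider
    legacy_pipeline: true
```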
Edited Nov 17, 2020 by Christophe Benz