Try runner autoscaling with Kubernetes
This issue is part of #666
Goal
- General goal: reduce data availability delay.
- Specific goal of this issue: reduce the time a job spends waiting in the queue by using the GitLab Runner autoscaling feature via the Kubernetes executor (cf. docs)
Context
- previously, we tried autoscaling with the docker-machine executor and the Scaleway driver
- mostly because the Docker driver had issues, and the fact that it is not officially supported by Docker does not help
- my intuition tells me to try Kubernetes first because it is more widely adopted than docker-machine
- Scaleway has a commercial Kubernetes offering named Kapsule, and we're going to start with it
Tasks
- create a Kubernetes cluster on Scaleway
- follow the GitLab Runner on Kubernetes docs
- set up cluster scale-up
- set up cluster scale-down
- set up CI pipeline cache via S3 distributed cache (see the values sketch after this list)
- refactor the GitLab CI pipeline into a single pipeline with download, convert and index jobs (related to #523 (closed) and #557)
- consider adapting or removing the current dashboard
- add Prometheus and collect GitLab Runner metrics (enabled by default with the GitLab Runner chart in `values.yaml`)
- set up a dashboard presenting data as a timeline (Grafana?)
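As a starting point for the cache and metrics tasks above, here is a minimal sketch of GitLab Runner Helm chart values, assuming a recent `gitlab/gitlab-runner` chart where `runners.config` embeds a `config.toml` fragment; the Scaleway endpoint, bucket name, namespace and token are placeholders, and key names may differ between chart versions.

```yaml
# values.yaml sketch for the gitlab/gitlab-runner Helm chart
# (placeholders throughout; chart key names may vary by version)
gitlabUrl: https://git.nomics.world/
runnerRegistrationToken: "REPLACE_ME"   # placeholder registration token

concurrent: 10        # max number of jobs the runner manager runs in parallel

metrics:
  enabled: true       # expose Prometheus metrics for the Grafana dashboard

runners:
  config: |
    [[runners]]
      [runners.kubernetes]
        namespace = "gitlab-runner"     # placeholder namespace
      [runners.cache]
        Type = "s3"
        Shared = true
        [runners.cache.s3]
          ServerAddress = "s3.fr-par.scw.cloud"  # placeholder: Scaleway S3 endpoint
          BucketName = "dbnomics-ci-cache"       # placeholder bucket
          BucketLocation = "fr-par"              # placeholder region
```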
New CI pipeline
- cf https://git.nomics.world/dbnomics/dbnomics-fetcher-pipeline
- write the pipeline with 3 stages: download, convert, index (see the sketch after this list)
- add a validate stage between convert and index, but keep in mind that warnings will need to be introduced
- test with one big fetcher
- test with many fetchers at the same time
- remove the test triggers (afdb, ecb)
- do not use the `dev` branch in `wget ... git-pull-or-clone.py`
- remove the `new-pipeline` branch name in the Index job
- add "git push" info to the dashboard (cf. this JSON)
- add support for the `errors.json` artifact
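A minimal sketch of the unified `.gitlab-ci.yml`, assuming one job per stage; the image, script entry points and artifact paths are placeholders, not the actual content of dbnomics-fetcher-pipeline.

```yaml
# .gitlab-ci.yml sketch: one pipeline with download, convert, index
# (image, scripts and artifact paths are placeholders)
stages:
  - download
  - convert
  - index

download:
  stage: download
  image: python:3.9          # placeholder image
  script:
    - python download.py     # placeholder: fetcher download entry point
  artifacts:
    paths:
      - source-data/         # placeholder: raw data handed to convert

convert:
  stage: convert
  image: python:3.9
  script:
    - python convert.py      # placeholder: fetcher convert entry point
  artifacts:
    paths:
      - json-data/           # placeholder: converted data handed to index

index:
  stage: index
  image: python:3.9
  script:
    - python index.py        # placeholder: indexation step
```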
Migrating fetchers to k8s
- update `fetchers.yml` to remove the `legacy_pipeline` flag (a hypothetical sketch follows)
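A hypothetical sketch of that change; the issue only confirms the `legacy_pipeline` flag itself, so the surrounding structure of `fetchers.yml` (entry layout, key names) is an assumption.

```yaml
# fetchers.yml sketch (hypothetical structure; only legacy_pipeline is
# confirmed by this issue)

# before: provider still runs the legacy pipeline
- slug: some-provider      # hypothetical key and value
  legacy_pipeline: true

# after: drop the flag so the provider uses the new k8s pipeline
- slug: some-provider
```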
Then use `switch-provider-to-k8s-pipeline.py`:

```
python switch-provider-to-k8s-pipeline.py -v --dry-run PROVIDER_SLUG
# if everything seems OK
python switch-provider-to-k8s-pipeline.py -v PROVIDER_SLUG
```
About k8s resource requests and limits:
- start with the default settings provided by `fetcher-gitlab-ci.yml`
- look at the metrics in Grafana
- adjust memory and CPU requests and limits based on what is really used (see the sketch after this list)
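Once Grafana shows what a job actually uses, requests and limits can be tuned per job through the Kubernetes executor overwrite variables, provided the runner config allows the overwrites; a sketch with placeholder values and a placeholder job name:

```yaml
# .gitlab-ci.yml sketch: per-job resource tuning via the Kubernetes executor
# overwrite variables (values are placeholders; the runner must allow them
# through the *_overwrite_max_allowed settings in its config.toml)
convert:
  stage: convert
  variables:
    KUBERNETES_CPU_REQUEST: "500m"
    KUBERNETES_CPU_LIMIT: "1"
    KUBERNETES_MEMORY_REQUEST: "512Mi"
    KUBERNETES_MEMORY_LIMIT: "2Gi"
  script:
    - python convert.py      # placeholder entry point
```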
Rollback?
If needed, it's possible to go back to the old pipeline by reverting the following steps and running the `configure-ci-for-provider.py` script:
- revert the commit about `.gitlab-ci.yml` in the fetcher source code
- update `fetchers.yml` to add the `legacy_pipeline: true` flag