Add a server dedicated to runners
Related: #511 (closed)
## Context
- DBnomics' architecture currently uses 3 servers
- `dolos` currently hosts the DBnomics UI & API in production, as well as 1 runner
- Disk space is a recurring problem on `dolos`
- Adding new servers has already been considered to reinforce DBnomics' architecture going forward
- Compared to other infrastructure items on the roadmap, installing and configuring a new runner is a relatively easy and short-term task
Links:
- servers list on Scaleway: https://console.scaleway.com/instance/servers
- Scaleway pricing page: https://www.scaleway.com/en/pricing/
## Acceptance criteria
- there are no more runners on `dolos`
- runners for download and convert jobs are installed on a (new?) server
- how the DBnomics project uses runners is documented in the technical wiki
## Analysis
### Runners on cloud instances

- GitLab uses docker-machine to auto-scale runners
  - docker-machine providers: https://docs.docker.com/machine/drivers/
  - the Scaleway driver is not official...
- Kubernetes is considered first because:
  - it's the most widespread
  - it's compatible with GitLab out of the box
  - it's supported by most providers
  - it's agnostic of the provider's infrastructure
  - it enables infrastructure as code
- Kubernetes providers to consider:
  - Scaleway
  - Google Cloud Platform
  - OVH
  - Azure
  - Amazon
    - for offering and pricing, Scaleway should rather be compared with Amazon Lightsail
    - for AWS, GitLab recommends using EC2 (VPS) + S3 (object storage), so nothing that isn't already available at Scaleway, for example, and more expensive at AWS
- what's the pricing?
  - in particular, what do we pay if a machine is archived (or stopped, and what's the difference)?
- how to control the type of machine created on the provider's platform?
  - e.g. a small machine for AMECO, a medium one for INSEE...
- how to monitor:
  - Kubernetes VMs, system resources...
  - runners and jobs on cloud instances
  - use the GitLab-integrated Prometheus instance?
  - use a SaaS monitoring solution (Datadog?)
- it's OK to pay for an over-sized server while the migration is ongoing
- fetchers (download/convert) should not take much RAM, only CPU, disk space, disk I/O and network I/O
- buy 1 to 3 medium-sized servers (2/4 cores, 8 GB RAM, 500 GB disk) for download/convert jobs
- keep `dolos` for the API and Solr
  - don't work on separating Solr and the API for now
  - Solr needs RAM, CPU and disk I/O; the API needs disk I/O
  - `dolos` has 1 TB of disk space and currently uses 118 GB for Solr and 537 GB for JSON data
    - this will be solved by moving the API to another server
## Plan
- create a small server on Scaleway (bastion)
  - document the installation on the technical wiki
  - very optional: automate the server setup (Ansible or Terraform...)
  - choose a Docker base image
  - common setup (there is an Ansible playbook for that)
  - hardening:
    - disable SSH password authentication
  - packages: gitlab-runner
  - install https://github.com/scaleway/docker-machine-driver-scaleway
- register a runner
  - using the `docker+machine` executor
  - with the tags `download-convert` and `autoscale`
  - configure the runner: `IdleCount`..., cache storage
- test running a new job on a dummy repo with a single `.gitlab-ci.yml` doing an `echo "hello"`, using the `autoscale` tag
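The registration step above would end up producing a `config.toml` roughly along these lines. This is only a sketch: the runner name, default image, commercial type and idle settings are illustrative assumptions, not decisions made in this issue.

```toml
# Hypothetical /etc/gitlab-runner/config.toml for the docker+machine executor.
concurrent = 4

[[runners]]
  name = "autoscale-download-convert"
  url = "https://git.nomics.world/"
  token = "RUNNER_TOKEN"            # obtained when registering the runner
  executor = "docker+machine"
  [runners.docker]
    image = "debian:stretch"        # default image for jobs; assumption
  [runners.machine]
    MachineDriver = "scaleway"      # requires docker-machine-driver-scaleway
    MachineName = "runner-%s"
    MachineOptions = [
      "scaleway-commercial-type=DEV1-S",  # machine type created on Scaleway
    ]
    IdleCount = 1                   # keep 1 machine ready for the next job
    IdleTime = 1800                 # remove idle machines after 30 minutes
```

Registration itself would be something like `gitlab-runner register --executor docker+machine --tag-list download-convert,autoscale ...`.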
- prefer using tags denoting the nature of the job (`download`, `convert`, `index`) or the execution environment (`autoscale`, `stateful`, `pre-prod`)
  - avoid server names (`dolos`, `eros`) or technical terms (`docker`, `docker-machine`)
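Following that convention, a fetcher's `.gitlab-ci.yml` might declare its jobs like this (job names and scripts are placeholders):

```yaml
# Hypothetical fetcher pipeline fragment.
download:
  tags:
    - download-convert   # nature of the job
    - autoscale          # execution environment
  script:
    - ./download.sh

convert:
  tags:
    - download-convert
    - autoscale
  script:
    - ./convert.sh
```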
- tasks for switchover (respect the order and ensure everything is OK before doing the next step):
  - change the `docker` tag to `download-convert` in the `.gitlab-ci.yml` of one trial fetcher
    - if it's OK, generalize to all fetchers
  - add the `download-convert` tag to the `eros` runner
    - so that all jobs are spread between `eros` and the new server
  - delete the `docker` tag from the `eros` runner
  - delete the `dolos` runner in the GitLab admin
    - don't uninstall the `gitlab-runner` Debian package on `dolos` (because of the Solr index job)
## Pending questions
- 1 or more new servers?
  - 1 server as a first step, and see if it's enough
  - more servers can simply be added using the Ansible configuration
- on-premises server(s) or cloud instances?
  - on-premises would require manually adding a server to the pool and writing automation scripts (Ansible...) for its setup
  - cloud would automate obtaining new server instances, but we would have to learn how to do it
- 1 or more runners per server?
  - if using cloud mode, this will be 1 runner per server
  - otherwise we can configure concurrency (see below)
- runners for small fetchers and other ones for heavy fetchers?
- should we move all download/convert jobs to the new server, or just the ones currently handled by the `dolos` runner?
  - keep `eros` for a certain time
- what about `docker system prune --volumes` on dynamically created machines?
- what's the algorithm used by the GitLab server to assign a job to a runner?
  - especially about tags: if a runner X has the 2 tags `a`, `b`, another runner Y has the 2 tags `a`, `c`, and a job has the 2 tags `a`, `b`, will the job always be assigned to runner X? In other words, does the tag matching use "any" or "all" semantics? We want "all".
    - see https://docs.gitlab.com/ce/ci/yaml/README.html#tags
      > [...] The specification above, will make sure that job is built by a Runner that has both ruby AND postgres tags defined.
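Per the documentation quoted above, GitLab's tag matching uses "all" semantics: a runner is eligible only if it carries every tag the job requests. A toy sketch of that rule for the X/Y example:

```python
def runner_matches(job_tags, runner_tags):
    """A runner is eligible only if it has ALL of the job's tags."""
    return set(job_tags) <= set(runner_tags)

runner_x = {"a", "b"}
runner_y = {"a", "c"}
job = {"a", "b"}

print(runner_matches(job, runner_x))  # True: X has both a and b
print(runner_matches(job, runner_y))  # False: Y lacks b
```

Note this only decides eligibility; which eligible runner actually picks up the job depends on runner polling.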
- be sure that the docker-machine gitlab-runner driver won't delete any existing server on Scaleway (like the email server, ...), thinking it's an idle machine
- concurrency config: server => N runners => N jobs
  - at server level: `concurrent` (default 1): number of jobs globally
  - at runner level: `limit` (default 0): how many jobs can be handled concurrently
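In `config.toml` terms, the two levels could be sketched as follows (all values are arbitrary examples):

```toml
# Global cap: this gitlab-runner process runs at most 4 jobs at a time.
concurrent = 4

[[runners]]
  name = "runner-a"
  limit = 2   # this runner takes at most 2 concurrent jobs (0 = unlimited)

[[runners]]
  name = "runner-b"
  limit = 2
```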
- will all AMECO jobs be executed by the same machine?
  - no: by design, the machine can be trashed at any moment...
- how to ensure that jobs don't leave garbage data in Docker volumes (json-data, source-data)?
  - this is currently solved on `dolos` and `eros` by running `docker system prune --volumes`
  - this problem would continue to occur with autoscale machines
    - because it's not possible to choose which machine runs a specific fetcher
    - even if it were possible, each job execution of the same fetcher would create a new Docker volume anyway
  - solutions:
    - handle garbage data collection in the job (`rm -rf source-data`...)
    - run `docker system prune --volumes` on the server
      - but this would require adding a "cron" job on each new machine, by building a custom server image
    - trash each machine after 1 job... not optimal
    - or learn how GitLab runner users handle this in general... cf https://gitlab.com/gitlab-org/gitlab-runner-docker-cleanup
      - maybe the `docker+machine` driver can install it automatically when creating a new machine??
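The first solution (cleaning up inside the job) could look like this in a fetcher's `.gitlab-ci.yml`; the job name and script are placeholders, and the paths mirror the volumes mentioned above:

```yaml
download:
  script:
    - ./download.sh                  # placeholder for the real download step
  after_script:
    # after_script runs even if the job failed, so nothing is left behind
    # on whichever autoscaled machine ran the job
    - rm -rf source-data json-data
```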
- how to configure gitlab-runner to ensure that the number of servers won't grow indefinitely (cf the gitlab-runner autoscale doc page, the `IdleCount` (?) option)
  - i.e. we would like to have at most 5 servers, shut them down when unused, and power them up when needed (for now we have 7 autoscaled servers)
- why are servers archived (and not deleted, or kept running for idle ones) when jobs finish?
- is it possible to delete the volume of a stopped server on Scaleway, to avoid paying for them (cf the Scaleway pricing page)?
  - but idle servers of the docker+machine pool should be kept running with their volumes
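If `IdleCount` turns out to be the right knob, capping the fleet might look like this in `config.toml`; all values are guesses to be validated against the autoscale docs:

```toml
[[runners]]
  limit = 5            # at most 5 concurrent jobs, hence at most 5 busy machines
  [runners.machine]
    IdleCount = 0      # create machines on demand, keep none idle
    IdleTime = 600     # remove a machine 10 minutes after it becomes idle
    MaxBuilds = 10     # recycle a machine after 10 jobs (also limits garbage buildup)
```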
## Tasks
- @cbenz and @pdi pair to document the architecture, especially the runners, on a pad
- once this brainstorming has reached a good state, move the content to the technical wiki (in the https://git.nomics.world/dbnomics-fetchers/documentation/wikis/ci-jobs-and-runners page) (can be done continuously)
- list services and servers
  - cf https://git.nomics.world/dbnomics-fetchers/documentation/wikis/servers#services
  - resources (directories on disk, config files, systemd services)
    - show their size
  - dependencies on other services
  - indicate which services can be used publicly (e.g. not Solr, which runs on `localhost:8983`)
  - show which service runs on which server
- find out which service to move from dolos to avoid a full disk
  - => runners
- continue documenting runners
  - graph runner usage (input/output)
- identify the consequences of moving download/convert runners to another server
- very optional: also try reducing the time between json-data availability and Solr index update
- find a server (or order a new one) for the moved runners => autoscale
  - requirements:
    - a lot of disk space
    - CPU: 4 to 16 cores
    - RAM: 16 to 64 GB
    - one big server, multiple little ones, or cloud instances
- @pdi read the runners doc about how to register a new runner
- @pdi + @cbenz pair programming to register a new runner
- play with a pipeline on the new bastion
  - from https://git.nomics.world/cbenz/docker-machine-dummy-job
  - adding the `download-convert` tag to .gitlab-ci.yml
  - => pipeline completed in 3m16s: not so fast
- migrate AMECO, taking precise notes in a migration log:
  - update .gitlab-ci.yml to use `autoscale` and `download-convert`, to be picked up by the new autoscale runner
  - ask for a download (`./trigger-job-for-provider.py download ameco`)
    - it worked! but it took 15 minutes 18 seconds instead of ~45s with the previous system (https://git.nomics.world/dbnomics-fetchers/ameco-fetcher/-/jobs)
    - is it setup time or a server limitation?
- create a DNS entry for the `mania` server (IP: 163.172.146.196) on Gandi (credentials in the Jailbreak KeePassX vault; use the cbenz delegated account, not Sébastien's or Stéphane's from CEPREMAP), using a `CNAME` entry, like any other server
- answer the different questions above based on what we learnt, or try to find the answers...
- start a migration log for the fetchers (technical wiki page)
  - starting with AMECO, indicating the server type used (`DEV1-S`) and the average disk occupation
- add the `DEV1-S` tag to the AMECO job
- once it is working, have a review with the team, and demo
- write an analysis about the advantages/drawbacks of using autoscale runners for DBnomics
  - ongoing in this issue
Edited by Christophe Benz