Add a server dedicated to runners
Related: #511 (closed)
Context
- DBnomics' architecture currently uses 3 servers
- `dolos` currently hosts the DBnomics UI & API in production, as well as 1 runner
- Disk space is a recurrent problem on `dolos`
- Adding new servers has already been considered to reinforce DBnomics' architecture going forward
- Compared to other infrastructure items on the roadmap, installing and configuring a new runner is a relatively easy and short-term task
 
Links:
- servers list on Scaleway: https://console.scaleway.com/instance/servers
- Scaleway pricing page: https://www.scaleway.com/en/pricing/
 
Acceptance criteria
- there are no more runners on `dolos`
- runners for download and convert jobs are installed on a (new?) server
- how the DBnomics project uses runners is documented in the technical wiki
Analysis
runners on cloud instances
- GitLab uses docker-machine to auto-scale runners
  - docker-machine providers: https://docs.docker.com/machine/drivers/
  - the Scaleway driver is not an official one...
 
- Kubernetes is considered first because:
  - it's the most widespread
  - compatible with GitLab out of the box
  - supported by most providers
  - agnostic of provider infrastructure
  - enables infrastructure as code
 
- Kubernetes providers to consider:
  - Scaleway
  - Google Cloud Platform
  - OVH
  - Azure
  - Amazon
    - Scaleway should rather be compared with Amazon Lightsail in terms of offering and pricing.
    - For AWS, GitLab recommends using EC2 (VPS) + S3 (object storage), i.e. nothing that isn't already available at Scaleway for instance, and more expensive at AWS.
 
- what's the pricing?
  - in particular, what do we pay if a machine is archived (or stopped; what's the difference)?
- how to control the type of machine created on the provider platform? (see the config sketch below)
  - e.g. a small machine for AMECO, a medium one for INSEE...
 
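A minimal sketch of what this could look like in the runner's `config.toml`, assuming the `docker+machine` executor with the Scaleway docker-machine driver; the `scaleway-*` option names should be checked against the driver's README, and the credential values are placeholders. Machine options are set per runner, so a small machine for AMECO and a medium one for INSEE would require registering several runners with distinct tags.

```toml
# Hypothetical excerpt of /etc/gitlab-runner/config.toml (values are placeholders)
[[runners]]
  name = "autoscale-download-convert"
  executor = "docker+machine"
  [runners.machine]
    IdleCount = 1                           # keep 1 idle machine ready (0 = create on demand)
    IdleTime = 1800                         # destroy idle machines after 30 minutes
    MachineDriver = "scaleway"              # provided by docker-machine-driver-scaleway
    MachineName = "runner-%s"               # %s is replaced by a unique machine id
    MachineOptions = [
      "scaleway-commercial-type=DEV1-S",    # instance type created on Scaleway (assumed option name)
      "scaleway-token=PLACEHOLDER",
      "scaleway-organization=PLACEHOLDER",
    ]
```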
- how to monitor?
  - Kubernetes VMs, system resources...
  - runners and jobs on cloud instances
  - use the GitLab integrated Prometheus instance? (see the metrics snippet below)
  - use a SaaS monitoring solution (Datadog?)
 
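If the integrated Prometheus route were chosen, a first step could be exposing the runner's own metrics endpoint; a small sketch, assuming the conventional metrics port:

```toml
# Global section of the runner's config.toml (assumed port, adjust as needed)
listen_address = ":9252"   # exposes Prometheus metrics at http://<runner-host>:9252/metrics
```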
- it's OK to pay for an over-sized server while the migration is ongoing
- fetchers (download/convert) should not take too much RAM, only CPU, disk space, disk I/O and network I/O
- buy 1 to 3 mid-sized servers (2 to 4 cores, 8 GB RAM, 500 GB disk) for download/convert jobs
- keep `dolos` for the API and Solr
  - don't work on separating Solr and the API for now
  - Solr needs RAM, CPU and disk I/O; the API needs disk I/O
  - `dolos` has 1 TB of disk space and currently uses 118 GB for Solr and 537 GB for JSON data
    - this will be solved by moving the API to another server
 
 
 
Plan
- create a small server on Scaleway (bastion)
  - document the installation on the technical wiki
  - very optional: automate server setup (Ansible or Terraform...)
  - choose a Docker base image
  - common setup (there is an Ansible playbook for that)
  - security hardening:
    - disable SSH password authentication (see the sshd_config sketch below)
  - packages: gitlab-runner
  - install https://github.com/scaleway/docker-machine-driver-scaleway
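A minimal sketch of that hardening step, assuming a stock OpenSSH server on the bastion:

```
# /etc/ssh/sshd_config: accept SSH keys only
PasswordAuthentication no
ChallengeResponseAuthentication no
# then reload the service: systemctl reload ssh
```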
- register a runner
  - using the `docker+machine` executor
  - with the tags `download-convert`, `autoscale`
- configure the runner: `IdleCount`..., cache storage
- test running a new job on a dummy repo with a single `.gitlab-ci.yml` doing an `echo "hello"`, using the tag `autoscale` (see the sketch below)
  - prefer tags denoting the nature of the job (`download`, `convert`, `index`) or the execution environment (`autoscale`, `stateful`, `pre-prod`)
  - avoid server names (`dolos`, `eros`) or technical terms (`docker`, `docker-machine`)
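A sketch of what the dummy repository's `.gitlab-ci.yml` could look like; the job name and image are arbitrary, only the tag matters for routing the job to the new runner:

```yaml
# Hypothetical .gitlab-ci.yml for the dummy test repository
hello:
  image: debian:stable     # any small image will do
  tags:
    - autoscale            # picked up by the docker+machine autoscale runner
  script:
    - echo "hello"
```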
- tasks for switchover (respect the order and ensure everything is OK before doing the next step):
  - change the tag `docker` to `download-convert` in the `.gitlab-ci.yml` of one trial fetcher
    - if it's OK, generalize to all fetchers
  - add the `download-convert` tag to the `eros` runner
    - so that all jobs are spread between `eros` and the new server
  - delete the `docker` tag from the `eros` runner
  - delete the `dolos` runner in the GitLab admin
    - don't uninstall the `gitlab-runner` Debian package on `dolos` (because of the Solr index job)
 
Pending questions
- 1 or more new servers?
  - start with 1 server and see if it's enough
  - more servers can simply be added using the Ansible configuration
 
- on-premises server(s) or cloud instances?
  - on-premises would require manually adding a server to the pool and writing automation scripts (Ansible...) for its setup
  - cloud would automate obtaining new server instances, but we would have to learn how to do it
 
- 1 or more runners per server?
  - if using cloud mode, this will be 1 runner per server
  - otherwise we can configure concurrency (see below)
 
- runners for small fetchers and other ones for heavy fetchers?
- should we move all download/convert jobs to the new server, or just the ones currently handled by the `dolos` runner?
  - keep `eros` for a certain time
- what about `docker system prune --volumes` on dynamically created machines?
- what's the algorithm used by the GitLab server to assign a job to a runner?
  - especially regarding tags: if runner X has the 2 tags `a`, `b`, another runner Y has the 2 tags `a`, `c`, and a job has the 2 tags `a`, `b`, will the job always be assigned to runner X? In other words, does the tag matching use "any" or "all" semantics? We want "all". (see the example below)
    - See https://docs.gitlab.com/ce/ci/yaml/README.html#tags: "[...] The specification above, will make sure that job is built by a Runner that has both ruby AND postgres tags defined."
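To make the expected "all" semantics concrete, using the tag names from the question above:

```yaml
# A job carrying both tags should only be assigned to runners that declare both `a` AND `b`,
# i.e. runner X (tags a, b) qualifies while runner Y (tags a, c) does not.
some-job:
  tags:
    - a
    - b
  script:
    - echo "tagged a and b"
```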
- be sure that the docker-machine gitlab-runner driver won't delete any existing server on Scaleway (like the email server, ...) thinking it's an idle machine
 
- concurrency config (see the sketch below):
  - server => N runners => N jobs
  - at the server level: `concurrent` (default 1): number of jobs run globally
  - at the runner level: `limit` (default 0, i.e. unlimited): how many jobs this runner can handle concurrently
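A sketch of where those two settings live in `config.toml`; the values are only illustrative:

```toml
# config.toml: global vs. per-runner concurrency (illustrative values)
concurrent = 4      # at most 4 jobs in total, across all runners registered on this host

[[runners]]
  name = "download-convert"
  limit = 2         # this particular runner handles at most 2 jobs at the same time
```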
- will all AMECO jobs be executed by the same machine?
  - no, by design, the machine can be trashed at any moment...
 
- how to ensure that jobs don't leave garbage data in Docker volumes (json-data, source-data)?
  - this is currently solved on `dolos` and `eros` by running `docker system prune --volumes`
  - this problem would continue to occur with autoscale machines
    - because it's not possible to choose which machine runs a specific fetcher
    - even if it were possible, each job execution of the same fetcher would create a new Docker volume anyway
  - solutions (see the cron sketch below):
    - handle garbage data collection in the job (`rm -rf source-data`...)
    - run `docker system prune --volumes` on the server
      - but this would require adding a "cron" job on each new machine, by building a custom server image
    - trash each machine after 1 job... not optimal
    - or learn how GitLab runner users handle this in general... cf https://gitlab.com/gitlab-org/gitlab-runner-docker-cleanup
      - maybe the `docker+machine` driver can install it automatically when creating a new machine??
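If the cron route were taken, the per-machine cleanup could look like the sketch below (the schedule is arbitrary); baking it into a custom machine image is exactly the extra step mentioned above.

```
# Hypothetical /etc/cron.d/docker-prune on each autoscaled machine
# Every hour, remove unused containers, networks, dangling images and unused volumes.
0 * * * * root docker system prune --volumes --force >> /var/log/docker-prune.log 2>&1
```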
- how to configure gitlab-runner to ensure that the number of servers won't grow indefinitely (cf. the gitlab-runner autoscale doc page, `IdleCount`(?) option)
  - i.e. we would like to have at most 5 servers, shut them down when unused, and power them up when needed (for now we have 7 autoscaled servers)
- why are servers archived (and not deleted, or kept running for the idle ones) when jobs finish?
 
- is it possible to delete the volume of a stopped server on Scaleway, to avoid paying for them (cf. the Scaleway pricing page)?
  - but the idle servers of the docker+machine pool should be kept running with their volumes
 
 
Tasks
- @cbenz and @pdi pair to document the architecture, especially the runners, on a pad
- once this brainstorming has reached a good state, move the content to the technical wiki (to the https://git.nomics.world/dbnomics-fetchers/documentation/wikis/ci-jobs-and-runners page) (can be done continuously)
- list services and servers
  - cf https://git.nomics.world/dbnomics-fetchers/documentation/wikis/servers#services
  - resources (directories on disk, config files, systemd services)
    - show their size
  - dependencies on other services
  - indicate which services can be used publicly (e.g. not Solr, which runs on `localhost:8983`)
  - show which service runs on which server
- find out which service to move from `dolos` to avoid a full disk
  - => runners
 
- continue documenting runners
  - graph the runners' usage (input/output)
- identify the consequences of moving download/convert runners to another server
- very optional: also try reducing the time between json-data availability and the Solr index update
- find a server (or order a new one) for the moved runners => autoscale
  - requirements:
    - a lot of disk space
    - CPU: 4 to 16 cores
    - RAM: 16 to 64 GB
  - one big server, multiple small ones, or cloud instances
- @pdi reads the runners doc about how to register a new runner
- @pdi + @cbenz pair programming to register a new runner
- play with a pipeline on the new bastion
  - from https://git.nomics.world/cbenz/docker-machine-dummy-job
  - adding the `download-convert` tag to `.gitlab-ci.yml`
    - => pipeline completed in 3m16s: not so fast
 
- migrate AMECO, taking precise notes in the migration log:
  - update `.gitlab-ci.yml` to use `autoscale` and `download-convert`, so the job is picked up by the new autoscale runner (see the sketch below)
  - ask for a download (`./trigger-job-for-provider.py download ameco`)
    - it worked! but it took 15 minutes 18 seconds instead of ~45s with the previous system (https://git.nomics.world/dbnomics-fetchers/ameco-fetcher/-/jobs)
    - is it setup time or a server limitation?
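A sketch of the kind of change meant here; the job name and script are hypothetical, since the AMECO fetcher's real `.gitlab-ci.yml` is not reproduced in this issue, and only the tags come from the plan above:

```yaml
# Hypothetical excerpt of the AMECO fetcher's .gitlab-ci.yml
download:
  tags:
    - autoscale           # run on the new docker+machine autoscale runner
    - download-convert    # nature of the job (replaces the old `docker` tag)
  script:
    - ./download.sh       # placeholder for the fetcher's actual download command
```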
- create a DNS entry for the `mania` server (IP: 163.172.146.196) on Gandi (credentials in the Jailbreak KeepassX vault; use the cbenz delegated account, not Sébastien's or Stéphane's from CEPREMAP), using a `CNAME` entry, like any other server
- answer the different questions above based on what we learnt, or try to find the answers...
- start a migration log for the fetchers (technical wiki page)
  - starting with AMECO
  - indicating the server type used (`DEV1-S`) and the average disk occupation
- add the `DEV1-S` tag to the AMECO job
- once it is working, have a review with the team, and a demo
- write an analysis about the advantages/drawbacks of using autoscale runners for DBnomics
  - ongoing in this issue
 
 