Add a server dedicated to runners
Related: #511 (closed)
## Context
- DBnomics' architecture currently uses 3 servers
- `dolos` currently hosts the DBnomics UI & API in production, as well as 1 runner
- Disk space is a recurring problem on `dolos`
- Adding new servers has already been considered to reinforce DBnomics' architecture going forward
- Compared to other infrastructure items on the roadmap, installing and configuring a new runner is a relatively easy and short-term task
Links:
- servers list on Scaleway: https://console.scaleway.com/instance/servers
- Scaleway pricing page: https://www.scaleway.com/en/pricing/
## Acceptance criteria
- there are no more runners on `dolos`
- runners for download and convert jobs are installed on a (new?) server
- how the DBnomics project uses runners is documented in the technical wiki
## Analysis
### Runners on cloud instances

- GitLab uses docker-machine to auto-scale runners
  - docker-machine providers: https://docs.docker.com/machine/drivers/
  - the Scaleway driver is not official...
- Kubernetes is considered first because:
  - it's the most widespread
  - it's compatible with GitLab out of the box
  - it's supported by most providers
  - it's agnostic of the provider's infrastructure
  - it enables infrastructure as code
- Kubernetes providers to consider:
  - Scaleway
  - Google Cloud Platform
  - OVH
  - Azure
  - Amazon
    - for offering and pricing, Scaleway should rather be compared with Amazon Lightsail
    - for AWS, GitLab recommends using EC2 (VPS) + S3 (object storage), so nothing that isn't already available at Scaleway, for example, and more expensive at AWS
- what's the pricing?
  - in particular, what do we pay if a machine is archived (or stopped, and what's the difference)?
- how to control the type of machine created on the provider's platform?
  - e.g. a small machine for AMECO, a medium one for INSEE...
- how to monitor:
  - Kubernetes VMs, system resources...
  - runners and jobs on cloud instances
  - use the GitLab-integrated Prometheus instance?
  - use a SaaS monitoring solution (Datadog?)
- it's OK to pay for an over-sized server while the migration is ongoing
- fetchers (download/convert) should not take much RAM, only CPU, disk space, disk I/O and network I/O
- buy 1 to 3 medium-sized servers (2/4 cores, 8 GB RAM, 500 GB disk) for download/convert jobs
- keep `dolos` for the API and Solr
  - don't work on separating Solr and the API for now
  - Solr needs RAM, CPU and disk I/O; the API needs disk I/O
  - `dolos` has 1 TB of disk space and currently uses 118 GB for Solr and 537 GB for JSON data
    - this will be solved by moving the API to another server
## Plan
- create a small server on Scaleway (bastion)
  - document the installation on the technical wiki
  - very optional: automate the server setup (Ansible or Terraform...)
  - choose a Docker base image
  - common setup (there is an Ansible playbook for that)
  - hardening:
    - disable SSH password authentication
  - packages: gitlab-runner
  - install https://github.com/scaleway/docker-machine-driver-scaleway
- register a runner
  - using the `docker+machine` executor
  - with the tags `download-convert` and `autoscale`
  - configure the runner: `IdleCount`..., cache storage
- test running a new job on a dummy repo with a single `.gitlab-ci.yml` doing an `echo "hello"`, using the `autoscale` tag
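The registration step above would end up producing a `config.toml` roughly along these lines. This is only a sketch: the runner name, default image, commercial type and idle settings are illustrative assumptions, not decisions made in this issue.

```toml
# Hypothetical /etc/gitlab-runner/config.toml for the docker+machine executor.
concurrent = 4

[[runners]]
  name = "autoscale-download-convert"
  url = "https://git.nomics.world/"
  token = "RUNNER_TOKEN"            # obtained when registering the runner
  executor = "docker+machine"
  [runners.docker]
    image = "debian:stretch"        # default image for jobs; assumption
  [runners.machine]
    MachineDriver = "scaleway"      # requires docker-machine-driver-scaleway
    MachineName = "runner-%s"
    MachineOptions = [
      "scaleway-commercial-type=DEV1-S",  # machine type created on Scaleway
    ]
    IdleCount = 1                   # keep 1 machine ready for the next job
    IdleTime = 1800                 # remove idle machines after 30 minutes
```

Registration itself would be something like `gitlab-runner register --executor docker+machine --tag-list download-convert,autoscale ...`.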
- prefer using tags denoting the nature of the job (`download`, `convert`, `index`) or the execution environment (`autoscale`, `stateful`, `pre-prod`)
  - avoid server names (`dolos`, `eros`) or technical terms (`docker`, `docker-machine`)
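Following that convention, a fetcher's `.gitlab-ci.yml` might declare its jobs like this (job names and scripts are placeholders):

```yaml
# Hypothetical fetcher pipeline fragment.
download:
  tags:
    - download-convert   # nature of the job
    - autoscale          # execution environment
  script:
    - ./download.sh

convert:
  tags:
    - download-convert
    - autoscale
  script:
    - ./convert.sh
```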
- tasks for switchover (respect the order and ensure everything is OK before doing the next step):
  - change the `docker` tag to `download-convert` in the `.gitlab-ci.yml` of one trial fetcher
    - if it's OK, generalize to all fetchers
  - add the `download-convert` tag to the `eros` runner
    - so that all jobs are spread between `eros` and the new server
  - delete the `docker` tag from the `eros` runner
  - delete the `dolos` runner in the GitLab admin
    - don't uninstall the `gitlab-runner` Debian package on `dolos` (because of the Solr index job)
## Pending questions
- 1 or more new servers?
  - 1 server as a first step, and see if it's enough
  - more servers can simply be added using the Ansible configuration
- on-premises server(s) or cloud instances?
  - on-premises would require manually adding a server to the pool and writing automation scripts (Ansible...) for its setup
  - cloud would automate obtaining new server instances, but we would have to learn how to do it
- 1 or more runners per server?
  - if using cloud mode, this will be 1 runner per server
  - otherwise we can configure concurrency (see below)
- runners for small fetchers and other ones for heavy fetchers?
- should we move all download/convert jobs to the new server, or just the ones currently handled by the `dolos` runner?
  - keep `eros` for a certain time
- what about `docker system prune --volumes` on dynamically created machines?
- what's the algorithm used by the GitLab server to assign a job to a runner?
  - especially about tags: if a runner X has the 2 tags `a`, `b`, another runner Y has the 2 tags `a`, `c`, and a job has the 2 tags `a`, `b`, will the job always be assigned to runner X? In other words, does the tag matching use "any" or "all" semantics? We want "all".
    - see https://docs.gitlab.com/ce/ci/yaml/README.html#tags
      > [...] The specification above, will make sure that job is built by a Runner that has both ruby AND postgres tags defined.
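Per the documentation quoted above, GitLab's tag matching uses "all" semantics: a runner is eligible only if it carries every tag the job requests. A toy sketch of that rule for the X/Y example:

```python
def runner_matches(job_tags, runner_tags):
    """A runner is eligible only if it has ALL of the job's tags."""
    return set(job_tags) <= set(runner_tags)

runner_x = {"a", "b"}
runner_y = {"a", "c"}
job = {"a", "b"}

print(runner_matches(job, runner_x))  # True: X has both a and b
print(runner_matches(job, runner_y))  # False: Y lacks b
```

Note this only decides eligibility; which eligible runner actually picks up the job depends on runner polling.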
- be sure that the docker-machine gitlab-runner driver won't delete any existing server on Scaleway (like the email server, ...), thinking it's an idle machine
- concurrency config: server => N runners => N jobs
  - at server level: `concurrent` (default 1): number of jobs globally
  - at runner level: `limit` (default 0): how many jobs can be handled concurrently
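In `config.toml` terms, the two levels could be sketched as follows (all values are arbitrary examples):

```toml
# Global cap: this gitlab-runner process runs at most 4 jobs at a time.
concurrent = 4

[[runners]]
  name = "runner-a"
  limit = 2   # this runner takes at most 2 concurrent jobs (0 = unlimited)

[[runners]]
  name = "runner-b"
  limit = 2
```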
- will all AMECO jobs be executed by the same machine?
  - no: by design, the machine can be trashed at any moment...
- how to ensure that jobs don't leave garbage data in Docker volumes (json-data, source-data)?
  - this is currently solved on `dolos` and `eros` by running `docker system prune --volumes`
  - this problem would continue to occur with autoscale machines
    - because it's not possible to choose which machine runs a specific fetcher
    - even if it were possible, each job execution of the same fetcher would create a new Docker volume anyway
  - solutions:
    - handle garbage data collection in the job (`rm -rf source-data`...)
    - run `docker system prune --volumes` on the server
      - but this would require adding a "cron" job on each new machine, by building a custom server image
    - trash each machine after 1 job... not optimal
    - or learn how GitLab runner users handle this in general... cf https://gitlab.com/gitlab-org/gitlab-runner-docker-cleanup
      - maybe the `docker+machine` driver can install it automatically when creating a new machine??
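The first solution (cleaning up inside the job) could look like this in a fetcher's `.gitlab-ci.yml`; the job name and script are placeholders, and the paths mirror the volumes mentioned above:

```yaml
download:
  script:
    - ./download.sh                  # placeholder for the real download step
  after_script:
    # after_script runs even if the job failed, so nothing is left behind
    # on whichever autoscaled machine ran the job
    - rm -rf source-data json-data
```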
- how to configure gitlab-runner to ensure that the number of servers won't grow indefinitely (cf the gitlab-runner autoscale doc page, the `IdleCount` (?) option)
  - i.e. we would like to have at most 5 servers, shut them down when unused, and power them up when needed (for now we have 7 autoscaled servers)
- why are servers archived (and not deleted, or kept running for idle ones) when jobs finish?
- is it possible to delete the volume of a stopped server on Scaleway, to avoid paying for them (cf the Scaleway pricing page)?
  - but idle servers of the docker+machine pool should be kept running with their volumes
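If `IdleCount` turns out to be the right knob, capping the fleet might look like this in `config.toml`; all values are guesses to be validated against the autoscale docs:

```toml
[[runners]]
  limit = 5            # at most 5 concurrent jobs, hence at most 5 busy machines
  [runners.machine]
    IdleCount = 0      # create machines on demand, keep none idle
    IdleTime = 600     # remove a machine 10 minutes after it becomes idle
    MaxBuilds = 10     # recycle a machine after 10 jobs (also limits garbage buildup)
```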
## Tasks
- @cbenz and @pdi pair to document the architecture, especially the runners, on a pad
- once this brainstorming has reached a good state, move the content to the technical wiki (in the https://git.nomics.world/dbnomics-fetchers/documentation/wikis/ci-jobs-and-runners page) (can be done continuously)
- list services and servers
  - cf https://git.nomics.world/dbnomics-fetchers/documentation/wikis/servers#services
  - resources (directories on disk, config files, systemd services)
    - show their size
  - dependencies on other services
  - indicate which services can be used publicly (e.g. not Solr, which runs on `localhost:8983`)
  - show which service runs on which server
- find out which service to move from dolos to avoid a full disk
  - => runners
- continue documenting runners
  - graph runner usage (input/output)
- identify the consequences of moving download/convert runners to another server
- very optional: also try reducing the time between json-data availability and Solr index update
- find a server (or order a new one) for the moved runners => autoscale
  - requirements:
    - a lot of disk space
    - CPU: 4 to 16 cores
    - RAM: 16 to 64 GB
    - one big server, multiple little ones, or cloud instances
- @pdi read the runners doc about how to register a new runner
- @pdi + @cbenz pair programming to register a new runner
- play with a pipeline on the new bastion
  - from https://git.nomics.world/cbenz/docker-machine-dummy-job
  - adding the `download-convert` tag to .gitlab-ci.yml
  - => pipeline completed in 3m16s: not so fast
- migrate AMECO, taking precise notes in a migration log:
  - update .gitlab-ci.yml to use `autoscale` and `download-convert`, to be picked up by the new autoscale runner
  - ask for a download (`./trigger-job-for-provider.py download ameco`)
    - it worked! but it took 15 minutes 18 seconds instead of ~45s with the previous system (https://git.nomics.world/dbnomics-fetchers/ameco-fetcher/-/jobs)
    - is it setup time or a server limitation?
- create a DNS entry for the `mania` server (IP: 163.172.146.196) on Gandi (credentials in the Jailbreak KeePassX vault; use the cbenz delegated account, not Sébastien's or Stéphane's from CEPREMAP), using a `CNAME` entry, like any other server
- answer the different questions above based on what we learnt, or try to find the answers...
- start a migration log for the fetchers (technical wiki page)
  - starting with AMECO, indicating the server type used (`DEV1-S`) and the average disk occupation
- add the `DEV1-S` tag to the AMECO job
- once it is working, have a review with the team, and demo
- write an analysis about the advantages/drawbacks of using autoscale runners for DBnomics
  - ongoing in this issue
Edited by Christophe Benz