[EPIC] Platform reliability
This EPIC lists issues aimed at improving DBnomics reliability and resilience. The overall goal is to minimize failures and maximize service availability (SLA). Data quality is out of scope here.
Goals:
- modularize the architecture
- add redundancy for critical parts
- back up the whole system, or use distributed services
- optimize the CI pipeline: minimize the desynchronization time between the available data (JSON-data) and the Solr index update
- anticipate data volume augmentation
- limit access to the web API based on what the architecture allows, and announce it clearly to users on the homepage and in the docs; limit by IP address or API token
- ensure API requests can scale, and predict the limits
Questions:
- should DBnomics use a managed Solr instance (IaaS, AWS, ...)? What are the prices compared to dolos?
- should DBnomics use a cloud storage service like S3/EFS to work around the disk-size problem and the SSD vs. mechanical disk trade-off?
Reliability Issues
- #495 (closed) Monitor DBnomics
- #496 (closed) Add a server dedicated to runners
- #555 Move Solr to dedicated servers in cloud mode
- #556 (closed) Backup GitLab instance via S3
- #521 Allow restoring provider data at specific revision
- #552 (closed) Configure Sentry for web API and web site
- #666 Reduce data availability delay
- #694 (closed) Try runners autoscale with Kubernetes
- #747 (closed) Configure DBnomics services to run on Kubernetes
- #821 (closed) Run fetchers with Tekton in Kubernetes
- ...
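For #556 (backing up the GitLab instance via S3), GitLab Omnibus supports uploading backups to an S3-compatible bucket directly from `/etc/gitlab/gitlab.rb`. A minimal sketch, assuming an AWS bucket; the region, bucket name, and credential placeholders below are hypothetical:

```ruby
# /etc/gitlab/gitlab.rb -- upload `gitlab-backup create` archives to S3.
# Region and bucket name are placeholders; credentials should come from
# a secrets store, not be committed anywhere.
gitlab_rails['backup_upload_connection'] = {
  'provider' => 'AWS',
  'region' => 'eu-west-1',
  'aws_access_key_id' => 'REPLACE_ME',
  'aws_secret_access_key' => 'REPLACE_ME'
}
gitlab_rails['backup_upload_remote_directory'] = 'dbnomics-gitlab-backups'
```

After `gitlab-ctl reconfigure`, each run of `gitlab-backup create` would also upload the archive to the bucket.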
Incidents
- #544 (closed) Incident on prod: no space left on device
- #553 (closed) Incident on prod: too many open files
Edited by Johan Richer