Monitor DBnomics

EPIC: #512 – related to #511 (closed) #494 (closed)

The final result is a Grafana dashboard, accessible here: https://git.nomics.world/-/grafana/d/rRxXxKhZz/dbnomics?orgId=1

Description

The DBnomics platform needs proper monitoring:

  • useful metrics (number of datasets, disk space usage, convert time...)
  • presented in a dashboard
  • access to history, available to query in case of a production failure

Areas:

  • monitor applications: Solr, API, UI
  • monitor fetcher jobs: download/convert, Solr indexation
    • have a "pipeline" view per provider
  • monitor data: source-data, json-data

Report average metrics for each provider to the docs, to see the big picture of providers:

  • https://git.nomics.world/dbnomics-fetchers/documentation/wikis/monitoring
  • but also details about the download process: how long it takes to download the category tree, each dataset...

Monitoring primer

  • metrics collection and storage https://prometheus.io
  • alerts (with history DB) https://prometheus.io/docs/alerting/alertmanager/
  • dashboards https://grafana.com
  • GitLab exports prometheus metrics
  • Solr exports prometheus metrics: https://lucene.apache.org/solr/guide/8_1/monitoring-solr-with-prometheus-and-grafana.html
  • DBnomics can write an exporter of prometheus metrics
    • number of providers, datasets...
    • Cf https://prometheus.io/docs/instrumenting/exposition_formats/
  • we could correlate system resource metrics with service metrics and application metrics
    • for example, to understand why a job failed, we could align the job time series with the RAM consumption of each server, or with disk activity... all synchronised on the same time axis
    • we could annotate the time series in Grafana and share the link as a comment on the issue...
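As a sketch of the "DBnomics can write an exporter" idea above, an exporter could simply render the Prometheus text exposition format by hand (see the exposition_formats link); the metric name `dbnomics_datasets` and the sample counts below are hypothetical:

```python
# Minimal sketch of a DBnomics Prometheus exporter emitting the text
# exposition format directly, with no dependency on a client library.
# The metric name and the example counts are made up for illustration.

def render_metrics(dataset_counts):
    """Render per-provider dataset counts as Prometheus text format."""
    lines = [
        "# HELP dbnomics_datasets Number of datasets per provider",
        "# TYPE dbnomics_datasets gauge",
    ]
    for provider, count in sorted(dataset_counts.items()):
        lines.append('dbnomics_datasets{provider="%s"} %d' % (provider, count))
    return "\n".join(lines) + "\n"


print(render_metrics({"AMECO": 120, "BIS": 25}))
```

Serving this string on an HTTP endpoint (e.g. `/metrics`) is enough for Prometheus to scrape it.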

Note: the previous monitoring tool was https://www.netdata.cloud/, which collects and stores metrics, but only over a short time window (configurable; we used 6h), provides a simple dashboard, and sends email alerts (but does not store alert history). It is an all-in-one solution that is very simple to install and get running, but it is more of an "htop" on steroids than a proper monitoring solution that lets you dig into application failures. It is not designed to integrate with Grafana, Alertmanager or other tools.

Acceptance criteria

Here are the metrics that we want:

  • source-data size on eros, per provider
  • json-data size on eros, per provider
  • json-data size on dolos (/home/gitlab-runner/json-data), per provider
  • Solr CPU/RAM
  • Solr disk usage
  • the number of requests to Solr, by response status code
  • the number of requests to API, by response status code
  • disk size of prometheus itself
  • uWSGI CPU/RAM
  • ...
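For the per-provider disk-size metrics above, one option is a small script that measures each provider's sub-directory and writes the result in the node_exporter "textfile collector" format, so the values reach Prometheus without a dedicated exporter. This is a sketch under assumptions: the directory layout (one sub-directory per provider) and the metric name `dbnomics_source_data_bytes` are hypothetical.

```python
# Sketch: compute per-provider directory sizes (e.g. under source-data) and
# write them in the node_exporter textfile-collector format. The metric name
# and the assumption of one sub-directory per provider are hypothetical.
import os


def dir_size_bytes(path):
    """Total size of regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total


def write_textfile(base_dir, out_path):
    """Write one gauge sample per provider sub-directory of base_dir."""
    lines = ["# TYPE dbnomics_source_data_bytes gauge"]
    for entry in sorted(os.listdir(base_dir)):
        provider_dir = os.path.join(base_dir, entry)
        if os.path.isdir(provider_dir):
            lines.append('dbnomics_source_data_bytes{provider="%s"} %d'
                         % (entry, dir_size_bytes(provider_dir)))
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

Run from cron and point node_exporter's `--collector.textfile.directory` at the output location; node_exporter then exposes the samples on its next scrape.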

Tasks

  • Enable built-in Prometheus in GitLab server
    • read https://prometheus.io
    • https://docs.gitlab.com/ce/administration/monitoring/
  • Grafana is installed by GitLab: enable it, or make it accessible under a domain... read the GitLab docs about Grafana...
    • it will help with testing the monitoring
  • Monitor eros (the machine hosting GitLab) with basic metrics
  • Monitor dolos (and ioke) with basic metrics
    • Install node_exporter on dolos...
  • Monitor Solr on dolos
  • Monitor disk size of dbnomics-source-data and dbnomics-json-data GitLab groups
  • Monitor disk size of /home/gitlab-runner/json-data on dolos
  • ...
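For the node_exporter tasks above, the scrape side of the configuration might look like the following `prometheus.yml` fragment; the hostnames come from this ticket, and 9100 is node_exporter's default port (adjust to the actual setup):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - eros:9100
          - dolos:9100
          - ioke:9100
```

If we reuse GitLab's bundled Prometheus instead of installing a separate server, this would go in the corresponding section of `gitlab.rb`/its managed Prometheus config rather than a standalone `prometheus.yml`.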

Questions

  • What does GitLab's built-in Prometheus track?
  • Is it necessary to install another Prometheus server?
    • or can we just add metrics to GitLab Prometheus server?
  • Should we keep DBnomics dashboard or use Prometheus? What's the overlap?
  • Should we monitor web API usage per user?
    • by introducing an API key...
Edited Oct 17, 2019 by Bruno Duyé