Monitor DBnomics
EPIC: #512 – related to #511 (closed) #494 (closed)
The final result is a Grafana dashboard, accessible here: https://git.nomics.world/-/grafana/d/rRxXxKhZz/dbnomics?orgId=1
Description
The DBnomics platform needs proper monitoring:
- useful metrics (number of datasets, disk space usage, convert time...)
- presented in a dashboard
- with access to history, to query in case of a production failure
Areas:
- monitor applications: Solr, API, UI
- monitor fetcher jobs: download/convert, Solr indexation
- have a "pipeline" view per provider
- monitor data: source-data, json-data
Report average metrics for each provider to the docs, to see the big picture of providers:
- https://git.nomics.world/dbnomics-fetchers/documentation/wikis/monitoring
- but also details about the download process: how long the category tree download takes, how long the datasets take...
Monitoring primer
- metrics collection and storage https://prometheus.io
- alerts (with history DB) https://prometheus.io/docs/alerting/alertmanager/
- dashboards https://grafana.com
- GitLab exports Prometheus metrics
- Solr exports Prometheus metrics: https://lucene.apache.org/solr/guide/8_1/monitoring-solr-with-prometheus-and-grafana.html
- DBnomics can write its own Prometheus metrics exporter
  - number of providers, datasets...
  - cf. https://prometheus.io/docs/instrumenting/exposition_formats/
- we could correlate system resource metrics with service metrics and application metrics
  - for example, to answer why a job failed, we could align the job time series with the RAM consumption of each server, or the disk activity... all synchronised on the same time axis
  - we could annotate the time series in Grafana and share the link as a comment on the issue...
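A custom exporter only needs to serve plain text in the Prometheus exposition format over HTTP. Below is a minimal, stdlib-only sketch of what such a DBnomics exporter could look like; the metric names (`dbnomics_providers_total`, `dbnomics_datasets_total`), the hard-coded counts, and the port are all hypothetical placeholders, not the actual implementation:

```python
# Minimal sketch of a DBnomics Prometheus exporter using only the standard
# library. Metric names and the counting logic are placeholder assumptions.
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(n_providers: int, n_datasets: int) -> str:
    """Render two gauges in the Prometheus text exposition format."""
    return (
        "# HELP dbnomics_providers_total Number of providers.\n"
        "# TYPE dbnomics_providers_total gauge\n"
        f"dbnomics_providers_total {n_providers}\n"
        "# HELP dbnomics_datasets_total Number of datasets.\n"
        "# TYPE dbnomics_datasets_total gauge\n"
        f"dbnomics_datasets_total {n_datasets}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # In a real exporter these counts would be read from json-data.
        body = render_metrics(n_providers=80, n_datasets=24000).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


# To actually serve: HTTPServer(("", 9101), MetricsHandler).serve_forever()
# (the port is arbitrary; the target would then be added to the Prometheus
# scrape_configs)
```

Prometheus would then scrape this endpoint like any other target, which also answers part of the questions below: the GitLab-bundled Prometheus could simply be pointed at it.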
Note: the previous monitoring tool was https://www.netdata.cloud/, which collects metrics and stores them, but only over a short time window (configurable; we used 6h), provides a simple dashboard, and sends email alerts (but does not store their history). It is an all-in-one solution, very simple to install and get running, but it's more an "htop" on steroids than a proper monitoring solution that lets you dig into application failures. It's not designed to be integrated with Grafana, Alertmanager or other tools.
Acceptance criteria
Here are the metrics that we want:
- `source-data` size on `eros`, per provider
- `json-data` size on `eros`, per provider
- `json-data` size on `dolos` (`/home/gitlab-runner/json-data`), per provider
- Solr CPU/RAM
- Solr disk usage
- the number of requests to Solr, by response status code
- the number of requests to the API, by response status code
- disk size of Prometheus itself
- uWSGI CPU/RAM
- ...
Tasks
- Enable built-in Prometheus in the GitLab server
- Grafana is installed by GitLab: enable it or make it accessible with a domain... read the GitLab doc about Grafana; it will help with testing monitoring
- Monitor `eros` (the machine hosting GitLab) with basic metrics
- Monitor `dolos` (`andioke`) with basic metrics
  - Install `node_exporter` on `dolos`
  - ...
- Monitor Solr on `dolos`
- Monitor disk size of the `dbnomics-source-data` and `dbnomics-json-data` GitLab groups
- Monitor disk size of `/home/gitlab-runner/json-data` on `dolos`
- ...
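For the disk-size tasks, one lightweight approach is node_exporter's textfile collector: a cron job periodically writes a `.prom` file that node_exporter exposes alongside its own metrics. A sketch under stated assumptions — the metric name `dbnomics_data_dir_bytes`, the one-subdirectory-per-provider layout, and the output path are all hypothetical:

```python
# Sketch: render per-provider directory sizes in the Prometheus textfile-
# collector format. The data layout (one sub-directory per provider) and
# the metric name are assumptions, not the actual DBnomics conventions.
import os


def dir_size(path: str) -> int:
    """Total size in bytes of all regular files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def render_provider_sizes(data_dir: str, metric: str = "dbnomics_data_dir_bytes") -> str:
    """One gauge sample per provider sub-directory of `data_dir`."""
    lines = [f"# TYPE {metric} gauge"]
    for provider in sorted(os.listdir(data_dir)):
        provider_dir = os.path.join(data_dir, provider)
        if os.path.isdir(provider_dir):
            lines.append(f'{metric}{{provider="{provider}"}} {dir_size(provider_dir)}')
    return "\n".join(lines) + "\n"


# Usage (e.g. from a cron job on dolos); the textfile collector directory
# below is an assumption and depends on how node_exporter is configured:
#   out = render_provider_sizes("/home/gitlab-runner/json-data")
#   with open("/var/lib/node_exporter/textfile/dbnomics.prom", "w") as f:
#       f.write(out)
```

Writing to a temporary file and renaming it into place would avoid node_exporter reading a half-written file.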
Questions
- What does GitLab's built-in Prometheus track?
- Is it necessary to install another Prometheus server?
- or can we just add metrics to GitLab Prometheus server?
- Should we keep the DBnomics dashboard or use Prometheus? What's the overlap?
- Should we monitor web API usage per user?
- by introducing an API key...