The final result is a Grafana dashboard, accessible here: https://git.nomics.world/-/grafana/d/rRxXxKhZz/dbnomics?orgId=1
DBnomics platform needs proper monitoring:
- useful metrics (number of datasets, disk space usage, convert time...)
- presented in a dashboard
- have access to history to query in case of production failure
- monitor applications: Solr, API, UI
- monitor fetcher jobs: download/convert, Solr indexation
- have a "pipeline" view per provider
- monitor data: source-data, json-data
Report average metrics for each provider in the docs, to give the big picture of providers:
- but also details about the download process: how long the category tree download takes, how long each dataset takes...
- metrics collection and storage https://prometheus.io
- alerts (with history DB) https://prometheus.io/docs/alerting/alertmanager/
- dashboards https://grafana.com
- GitLab exports prometheus metrics
- Solr exports prometheus metrics: https://lucene.apache.org/solr/guide/8_1/monitoring-solr-with-prometheus-and-grafana.html
- DBnomics can write an exporter of prometheus metrics
- number of provider, datasets...
- Cf https://prometheus.io/docs/instrumenting/exposition_formats/
- we could match system resource metrics with services metrics, and application metrics
- for example to answer why a job failed, we could align the job time series with the RAM consumption of each servers, or the disk activity... all synchronised on the same time X axis
- we could annotate the time series in grafana and share the link as a comment on the issue...
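As a starting point for the custom exporter idea above, here is a minimal sketch that renders provider/dataset counts in the Prometheus text exposition format (see the exposition_formats doc linked above). The metric names and the `count_datasets_per_provider` helper are assumptions; a real exporter would back the helper with json-data or Solr and serve the output over HTTP.

```python
# Sketch of a DBnomics Prometheus exporter (text exposition format).
# count_datasets_per_provider is a placeholder; metric names are assumptions.
def count_datasets_per_provider():
    # placeholder: would scan json-data or query Solr
    return {"AMECO": 1200, "BIS": 340}

def render_metrics():
    counts = count_datasets_per_provider()
    lines = [
        "# HELP dbnomics_providers_total Number of providers",
        "# TYPE dbnomics_providers_total gauge",
        f"dbnomics_providers_total {len(counts)}",
        "# HELP dbnomics_datasets_total Number of datasets per provider",
        "# TYPE dbnomics_datasets_total gauge",
    ]
    for provider, n in sorted(counts.items()):
        lines.append(f'dbnomics_datasets_total{{provider="{provider}"}} {n}')
    return "\n".join(lines) + "\n"

print(render_metrics())
```

The same output could instead be produced with the official `prometheus_client` Python library, which also handles the HTTP serving.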
Note: the previous monitoring tool was https://www.netdata.cloud/, which collects metrics, stores them but only over a short time window (configurable; we used 6h), provides a simple dashboard, and sends email alerts (but does not store their history). It is an all-in-one solution that is very simple to install and run, but it's more of an "htop" on steroids than a proper monitoring solution that allows digging into application failures. It is not designed to integrate with Grafana, Alertmanager or other tools.
Here are the metrics that we want:
- source-data size on eros, per provider
- json-data size on eros, per provider
- json-data size on dolos (/home/gitlab-runner/json-data), per provider
- Solr CPU/RAM
- Solr disk usage
- the number of requests to Solr, by response status code
- the number of requests to API, by response status code
- disk size of prometheus itself
- Uwsgi CPU/RAM
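The per-provider size metrics in the list above could be computed with a small script like the following sketch (the data root layout, one directory per provider, is an assumption; the result would then be exposed via an exporter or the node_exporter textfile collector).

```python
# Sketch: compute per-provider directory sizes under a data root
# (e.g. source-data or json-data). Assumes one sub-directory per provider.
import os

def dir_size(path):
    """Total size in bytes of all files under `path`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total

def provider_sizes(root):
    """Map each provider directory under `root` to its size in bytes."""
    return {
        entry.name: dir_size(entry.path)
        for entry in os.scandir(root)
        if entry.is_dir()
    }
```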
- Enable built-in Prometheus in GitLab server
Grafana is installed by GitLab: enable it or make it accessible via a domain... read the GitLab docs about Grafana...
- it will help test the monitoring
- Monitor eros (the machine hosting GitLab) with basic metrics
- Monitor ioke with basic metrics
- Monitor Solr on
- Monitor disk size of
- Monitor disk size of
- What does GitLab's built-in Prometheus track?
- Is it necessary to install another Prometheus server?
- or can we just add metrics to GitLab Prometheus server?
- Should we keep DBnomics dashboard or use Prometheus? What's the overlap?
- Should we monitor web API usage per user?
- by introducing an API key...
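Counting API requests by status code (and, once an API key exists, by user) could be done with a small WSGI middleware around the uWSGI application. This is a sketch; the counters would in practice be Prometheus metrics rather than an in-process `Counter`, and the API-key header name is an assumption.

```python
# Sketch: WSGI middleware counting responses per HTTP status code.
# In production these would be prometheus_client Counters, optionally
# labeled by a hypothetical API-key header to track usage per user.
from collections import Counter

class MetricsMiddleware:
    def __init__(self, app):
        self.app = app
        self.requests_by_status = Counter()

    def __call__(self, environ, start_response):
        def counting_start_response(status, headers, exc_info=None):
            code = status.split(" ", 1)[0]  # e.g. "200" from "200 OK"
            self.requests_by_status[code] += 1
            return start_response(status, headers, exc_info)

        return self.app(environ, counting_start_response)
```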