Monitor DBnomics

EPIC: #512 – related to #511 (closed) #494 (closed)

The final result is a Grafana dashboard, accessible here: https://git.nomics.world/-/grafana/d/rRxXxKhZz/dbnomics?orgId=1

Description

The DBnomics platform needs proper monitoring:

  • useful metrics (number of datasets, disk space usage, convert time...)
  • presented in a dashboard
  • access to history, available to query in case of a production failure

Areas:

  • monitor applications: Solr, API, UI
  • monitor fetcher jobs: download/convert, Solr indexation
    • have a "pipeline" view per provider
  • monitor data: source-data, json-data

Report average metrics for each provider to the docs, to see the big picture of providers:

  • https://git.nomics.world/dbnomics-fetchers/documentation/wikis/monitoring
  • but also details about the download process: how long it takes to download the category tree, each dataset...

Monitoring primer

  • metrics collection and storage https://prometheus.io
  • alerts (with history DB) https://prometheus.io/docs/alerting/alertmanager/
  • dashboards https://grafana.com
  • GitLab exports prometheus metrics
  • Solr exports prometheus metrics: https://lucene.apache.org/solr/guide/8_1/monitoring-solr-with-prometheus-and-grafana.html
  • DBnomics can write an exporter of prometheus metrics
    • number of providers, datasets...
    • Cf https://prometheus.io/docs/instrumenting/exposition_formats/
  • we could correlate system resource metrics with service metrics and application metrics
    • for example, to understand why a job failed, we could align the job time series with the RAM consumption of each server, or with disk activity... all synchronised on the same time axis
    • we could annotate the time series in Grafana and share the link as a comment on the issue...
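As a sketch of the "DBnomics can write an exporter" idea above, an exporter could simply render the Prometheus text exposition format by hand (see the exposition_formats link); the metric name `dbnomics_datasets` and the sample counts below are hypothetical:

```python
# Minimal sketch of a DBnomics Prometheus exporter emitting the text
# exposition format directly, with no dependency on a client library.
# The metric name and the example counts are made up for illustration.

def render_metrics(dataset_counts):
    """Render per-provider dataset counts as Prometheus text format."""
    lines = [
        "# HELP dbnomics_datasets Number of datasets per provider",
        "# TYPE dbnomics_datasets gauge",
    ]
    for provider, count in sorted(dataset_counts.items()):
        lines.append('dbnomics_datasets{provider="%s"} %d' % (provider, count))
    return "\n".join(lines) + "\n"


print(render_metrics({"AMECO": 120, "BIS": 25}))
```

Serving this string on an HTTP endpoint (e.g. `/metrics`) is enough for Prometheus to scrape it.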

Note: the previous monitoring tool was https://www.netdata.cloud/, which collects and stores metrics, but only over a short time window (configurable; we used 6h), provides a simple dashboard, and sends email alerts (but does not store alert history). It is an all-in-one solution that is very simple to install and get running, but it is more of an "htop" on steroids than a proper monitoring solution that lets you dig into application failures. It is not designed to integrate with Grafana, Alertmanager or other tools.

Acceptance criteria

Here are the metrics that we want:

  • source-data size on eros, per provider
  • json-data size on eros, per provider
  • json-data size on dolos (/home/gitlab-runner/json-data), per provider
  • Solr CPU/RAM
  • Solr disk usage
  • the number of requests to Solr, by response status code
  • the number of requests to API, by response status code
  • disk size of prometheus itself
  • uWSGI CPU/RAM
  • ...
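For the per-provider disk-size metrics above, one option is a small script that measures each provider's sub-directory and writes the result in the node_exporter "textfile collector" format, so the values reach Prometheus without a dedicated exporter. This is a sketch under assumptions: the directory layout (one sub-directory per provider) and the metric name `dbnomics_source_data_bytes` are hypothetical.

```python
# Sketch: compute per-provider directory sizes (e.g. under source-data) and
# write them in the node_exporter textfile-collector format. The metric name
# and the assumption of one sub-directory per provider are hypothetical.
import os


def dir_size_bytes(path):
    """Total size of regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if not os.path.islink(fp):
                total += os.path.getsize(fp)
    return total


def write_textfile(base_dir, out_path):
    """Write one gauge sample per provider sub-directory of base_dir."""
    lines = ["# TYPE dbnomics_source_data_bytes gauge"]
    for entry in sorted(os.listdir(base_dir)):
        provider_dir = os.path.join(base_dir, entry)
        if os.path.isdir(provider_dir):
            lines.append('dbnomics_source_data_bytes{provider="%s"} %d'
                         % (entry, dir_size_bytes(provider_dir)))
    with open(out_path, "w") as f:
        f.write("\n".join(lines) + "\n")
```

Run from cron and point node_exporter's `--collector.textfile.directory` at the output location; node_exporter then exposes the samples on its next scrape.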

Tasks

  • Enable built-in Prometheus in GitLab server
    • read https://prometheus.io
    • https://docs.gitlab.com/ce/administration/monitoring/
  • Grafana is installed by GitLab: enable it, or make it accessible under a domain... read the GitLab docs about Grafana...
    • it will help with testing the monitoring
  • Monitor eros (the machine hosting GitLab) with basic metrics
  • Monitor dolos (and ioke) with basic metrics
    • Install node_exporter on dolos...
  • Monitor Solr on dolos
  • Monitor disk size of dbnomics-source-data and dbnomics-json-data GitLab groups
  • Monitor disk size of /home/gitlab-runner/json-data on dolos
  • ...
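For the node_exporter tasks above, the scrape side of the configuration might look like the following `prometheus.yml` fragment; the hostnames come from this ticket, and 9100 is node_exporter's default port (adjust to the actual setup):

```yaml
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - eros:9100
          - dolos:9100
          - ioke:9100
```

If we reuse GitLab's bundled Prometheus instead of installing a separate server, this would go in the corresponding section of `gitlab.rb`/its managed Prometheus config rather than a standalone `prometheus.yml`.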

Questions

  • What does GitLab's built-in Prometheus track?
  • Is it necessary to install another Prometheus server?
    • or can we just add metrics to GitLab Prometheus server?
  • Should we keep DBnomics dashboard or use Prometheus? What's the overlap?
  • Should we monitor web API usage per user?
    • by introducing an API key...
Edited Oct 17, 2019 by Bruno Duyé