How to deal with failures in production? What to do if something happens?
Criticality level
The criticality level of a failure depends on what is affected:
- UI or API: critical
- data or jobs:
  - star-provider (see the dashboard): urgent
  - other providers: normal
Critical failures MUST be reported by SMS, e-mail, chat message or direct call to the person in charge of maintenance, in addition to the normal procedure (see sections below).
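The mapping above can be sketched as a small Python function. This is only an illustration: the names `criticality_level`, `needs_direct_notification` and `STAR_PROVIDERS`, as well as the provider codes, are hypothetical. The actual list of star providers is the one shown on the dashboard.

```python
from typing import Optional

# Hypothetical placeholder codes; the real star providers are listed on the dashboard.
STAR_PROVIDERS = {"EXAMPLE1", "EXAMPLE2"}


def criticality_level(component: str, provider: Optional[str] = None) -> str:
    """Return the criticality level of a failure, following the list above.

    component: "ui", "api", "data" or "jobs".
    provider: provider code, only used for "data" and "jobs".
    """
    if component in ("ui", "api"):
        return "critical"
    if component in ("data", "jobs"):
        return "urgent" if provider in STAR_PROVIDERS else "normal"
    raise ValueError(f"unknown component: {component!r}")


def needs_direct_notification(level: str) -> bool:
    """Critical failures must also be reported directly to the maintainer."""
    return level == "critical"
```

For example, `criticality_level("jobs", "EXAMPLE1")` returns `"urgent"`, and only the `"critical"` level triggers the direct notification on top of the normal procedure.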
Monitoring
TODO
Failure with UI
When something goes wrong with the production UI. This does not include problems with data (see "Failure with data").
Typical symptoms:
- "Error loading page" message displayed on the web page. This means that the data needed to initialize the page could not be fetched by the UI.
How to report the bug: create an issue on the board, in the Maintenance column, with:
- the label "UI"
- the URL showing the problem, in the description
- a description of what is wrong, and what was expected
- optional: a screenshot or copy/paste of the error
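As a rough sketch, the content of such a report could be assembled as follows. The function name `build_ui_issue` and the dictionary layout are assumptions made for illustration; they do not correspond to any API of the board, and the URL in the usage note is a placeholder.

```python
from typing import Optional


def build_ui_issue(url: str, observed: str, expected: str,
                   error_text: Optional[str] = None) -> dict:
    """Assemble the fields of a UI bug report (hypothetical structure)."""
    lines = [
        f"URL: {url}",
        "",
        f"What is wrong: {observed}",
        f"What was expected: {expected}",
    ]
    if error_text:
        # Optional copy/paste of the error message.
        lines += ["", "Error:", error_text]
    return {
        "column": "Maintenance",
        "labels": ["UI"],
        "description": "\n".join(lines),
    }
```

Calling `build_ui_issue("https://example.org/some/page", '"Error loading page" is displayed', "the page loads")` yields a description that starts with the URL, followed by what is wrong and what was expected.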
How to investigate: see troubleshooting
Failure with data
When something goes wrong with data, as seen in the production UI or through a client package for a programming language (like DBnomics-Python).
Typical symptoms:
- data seems incomplete or wrong
How to report the bug: create an issue on the board, with:
- a title starting with the name of the provider
- the labels "Maintenance" and "Fetcher"
- the URL showing the problem, in the description
- a description of what is wrong, and what was expected
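The conventions above (title prefixed with the provider name, fixed labels, URL in the description) can be illustrated with a short helper. `build_data_issue` and its return structure are hypothetical, and the provider code and URL in the usage note are placeholders.

```python
def build_data_issue(provider: str, summary: str, url: str,
                     observed: str, expected: str) -> dict:
    """Assemble the fields of a data bug report (hypothetical structure)."""
    description = "\n".join([
        f"URL: {url}",
        "",
        f"What is wrong: {observed}",
        f"What was expected: {expected}",
    ])
    return {
        # The title starts with the name of the provider.
        "title": f"{provider}: {summary}",
        "labels": ["Maintenance", "Fetcher"],
        "description": description,
    }
```

For instance, `build_data_issue("EXAMPLE1", "missing observations", "https://example.org/series", "last year is missing", "data up to the latest release")` produces the title `"EXAMPLE1: missing observations"`.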
Failure with jobs
When something goes wrong with download, conversion or indexing jobs, as seen on the dashboard.
Typical symptoms:
- job status is "failed"
How to report the bug: create an issue on the board, with:
- a title starting with the name of the provider
- the labels "Maintenance" and "Fetcher"
- the URL of the failed job, in the description
How to investigate:
- Open the job URL; the next steps depend on what you see there.
- If needed, try to reproduce the problem in a pre-production fetcher environment.