Try S3 object storage for converted data

Following #821 (comment 24382)

Description

As a consequence, we want to quickly explore using object storage:

  • use object storage to store provider data, both source-data and json-data
  • use the TSV representation for the series (we don't have to use JSON-Lines because S3 tolerates many small files better than file systems do)
  • store one provider per bucket or one dataset per bucket, depending on Scaleway limits
  • use S3 object versions
  • update dbnomics-data-model and dbnomics-api to read from object storage instead of Git repositories
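
A minimal sketch of the layout suggested above, assuming one object per series stored as TSV. The key scheme (provider/dataset/series), the column names `period` and `value`, and the function names are hypothetical and are not taken from dbnomics-data-model:

```python
import csv
import io


def series_object_key(provider_code, dataset_code, series_code):
    """Build an object key for one series (assumed layout, one object per series)."""
    return f"{provider_code}/{dataset_code}/{series_code}.tsv"


def series_to_tsv(observations):
    """Serialise (period, value) pairs to TSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["period", "value"])
    writer.writerows(observations)
    return buffer.getvalue()


def tsv_to_series(text):
    """Parse TSV text produced by series_to_tsv back into (period, value) pairs."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row
    return [(period, value) for period, value in reader]
```

Values are kept as strings here so that markers such as "NA" survive a round trip unchanged; converting them to numbers would be up to the reader.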

This would be a nice solution with many advantages over Git repositories:

  • object storage is cheaper than block storage
  • object storage is better protected against data loss
  • revisions are implementable in a more efficient way than reading Git history
  • more efficient than file systems: fewer problems with many small files or huge files
  • no more problems with the GitLab server (slow git clones, pushes...)

Tasks

  • sync json-data after pipeline runs to S3 (@eraviart)
  • adapt dbnomics-api and dbnomics-data-model to read from S3 (@cbenz)
  • check that the dbnomics-solr indexing script works (@cbenz)
  • import Git repositories history for source-data and json-data as S3 versions
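
The sync task could start from something like the sketch below, which only decides what needs uploading. It assumes a mapping from object key to SHA-256 of the content already stored remotely is available somehow; that mapping, the function names, and the choice to skip the `.git` directory are all hypothetical. The upload itself (through whichever S3 client is chosen) is deliberately left out:

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def plan_sync(local_dir, remote_hashes):
    """List (local path, object key) pairs that are new or changed.

    remote_hashes maps object keys to the SHA-256 of the stored content
    (an assumption: how this mapping is obtained is not covered here).
    """
    root = Path(local_dir)
    to_upload = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip directories and anything inside the Git metadata directory.
        if not path.is_file() or ".git" in relative.parts:
            continue
        key = relative.as_posix()
        if remote_hashes.get(key) != file_sha256(path):
            to_upload.append((path, key))
    return to_upload
```

Comparing hashes before uploading keeps unchanged files from creating a new S3 object version on every pipeline run.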

Before closing issue

  • merge imf-fetcher!8 (merged)
Edited Feb 10, 2021 by Christophe Benz