Saying hi, and a few questions and ideas
Hi all,
This project is truly amazing, and I'm so glad I found it. I only wish I'd discovered it earlier.
I've got some questions, which you're of course under no obligation to answer, although answers would help me understand the data better:
- Do you have any estimates of the total size of the data on disk (as opposed to the number of data series)? What about just the metadata?
- How many servers are running the GitLab instances, the data downloads and conversion, the website, and the search service?
- How big are the indexes for the search service (Solr), and how much RAM and CPU does the service generally use?
- Are you using a CDN in any part of the system?
I've also got a few ideas:
- Provide data dumps of all the metadata. Right now, the only way to get access to all the Providers, Datasets, and Series metadata (e.g. just the dimension values, not the observations) is either through the API or by reading the JSON files in the `json-data` repositories. Both of these are somewhat limited. To let users make the best use of the metadata (e.g. to discover relationships or to normalize data across series), it would be good to have a separate export for this. (See the first sketch below this list for the kind of access I mean.)
- If you aren't already, consider putting a CDN in front of the website and API. This would reduce load on the origin server and allow the CDN to serve thousands or millions of users requesting the same files (pages or API responses), since those would be served from cache. Cloudflare has a very good free tier that scales with your traffic. (The second sketch below shows the kind of cache headers involved.)
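To make the first idea concrete, here's a minimal sketch of what harvesting metadata through the HTTP API looks like today. I'm assuming the v22 API base URL and a paginated response envelope with a `docs` key from my reading of the API, so the details may need adjusting:

```python
import requests

API_BASE = "https://api.db.nomics.world/v22"  # public API base; version may change

def fetch_providers():
    """Fetch provider metadata only (no observations).

    The response envelope ("providers" -> "docs") is my assumption from
    reading the API; adjust the keys if the real shape differs.
    """
    resp = requests.get(f"{API_BASE}/providers", timeout=30)
    resp.raise_for_status()
    return resp.json().get("providers", {}).get("docs", [])

for provider in fetch_providers():
    print(provider.get("code"), "-", provider.get("name"))
```

Doing this for every provider, dataset, and series means a lot of paginated requests, which is exactly why a single metadata dump would be so useful.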
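And for the CDN idea, a toy Flask-style sketch of the kind of `Cache-Control` headers that let a CDN like Cloudflare answer repeat requests at the edge. The route and framework are hypothetical; I don't know what the real API server uses:

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/v22/providers")  # hypothetical route mirroring the public API
def providers():
    resp = jsonify({"providers": []})  # placeholder payload
    # "public" marks the response as cacheable; "s-maxage" governs shared
    # caches (the CDN) while "max-age" governs browsers. With headers like
    # these, the CDN can serve repeat requests without touching the origin.
    resp.headers["Cache-Control"] = "public, max-age=300, s-maxage=3600"
    return resp

if __name__ == "__main__":
    app.run()
```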
Right now, I'm interested in using the metadata from DBNomics to see whether Linked Data and Knowledge Graph technology can enrich the data, and also to improve statistical data visualization.
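As a rough illustration of what I have in mind, here's a sketch that maps a series' dimensions to RDF triples with rdflib. The namespace URI and the flat dimension-to-predicate mapping are just my assumptions for the example:

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Hypothetical namespace for DBNomics resources; the real URI scheme would
# need to be agreed with the maintainers.
DBN = Namespace("https://db.nomics.world/")

g = Graph()
g.bind("dbn", DBN)

# Example series metadata, shaped like the API's dimensions (values made up).
series = {
    "provider_code": "AMECO",
    "dataset_code": "ZUTN",
    "series_code": "EA19.1.0.0.0.ZUTN",
    "dimensions": {"freq": "a", "geo": "ea19"},
}

series_uri = URIRef(
    DBN[f"{series['provider_code']}/{series['dataset_code']}/{series['series_code']}"]
)
g.add((series_uri, RDF.type, DBN.Series))
for dim, value in series["dimensions"].items():
    # Each dimension becomes a predicate; a flat mapping like this is only a
    # starting point before linking values to shared vocabularies.
    g.add((series_uri, DBN[dim], Literal(value)))

print(g.serialize(format="turtle"))
```

Once dimension values are linked to shared vocabularies (countries, units, frequencies), series from different providers become queryable together, which is where the enrichment would come from.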
I noticed that you are working towards moving the DBNomics data pipelines into Kubernetes; I have experience with k8s, DevOps, and distributed systems, so I can definitely see myself contributing to the project in some way.
Thanks for all your great work!