ONS: Download categories tree and CSV files

  • As a system
  • I want to obtain categories and CSV files from the ONS website
  • In order to store the hierarchy of information for ONS

Acceptance criteria

  • All category branches MUST end with a CSV file
  • Each dataset page MUST be stored as an HTML file in order to keep the description paragraph that follows its title
  • Category pages SHOULD be stored as HTML files in sub-directories reproducing the URL paths
    • OR produce a JSON file of the category hierarchy, if that is simpler
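
One possible way to reproduce the URL paths on disk is sketched below. The function name `url_to_local_path`, the `ons` root directory and the `index.html` file name are assumptions made for illustration; the ticket does not specify any of them.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def url_to_local_path(url, root="ons"):
    """Map a page URL to an HTML file path that mirrors the URL path.

    Assumption: every page is saved as <root>/<url path>/index.html, so a
    category and its sub-categories can share the same directory.
    """
    # Drop empty segments and "." / ".." so the result stays under root.
    parts = [p for p in urlparse(url).path.split("/") if p not in ("", ".", "..")]
    return PurePosixPath(root, *parts, "index.html")


# Illustrative usage; prints the relative path where the page would be saved.
print(url_to_local_path("https://www.ons.gov.uk/economy"))
```

Saving each page as `index.html` inside its own directory avoids a clash between a category file and the directory holding its children. If the JSON option is chosen instead, the same path segments could serve as keys of a nested dictionary.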

Technical tasks

  • Start with page https://www.ons.gov.uk/
  • Parse the HTML pages linked from these menu entries: Business, Industry and Trade; Economy; Employment and Labour Market; People, population and community (the menu has class="primary-nav__list")
  • On each of these pages, open the link with a green background named "View all datasets related to FOO"
  • Items containing the string "Dataset ID: XXXX" have a corresponding CSV file (to be verified with an assertion). Ignore the other items.
  • Parse each HTML page until CSV files are found (the CSV links follow <h2>Your download options</h2>)
  • A branch that doesn't have a CSV file as a leaf MUST be pruned
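
The last three tasks could be sketched as follows, using only the Python standard library. This is a minimal sketch under assumptions, not a tested scraper: the names `DATASET_ID_RE`, `CsvLinkParser` and `prune`, the `name` / `children` / `csv` dictionary keys, and the rule that a CSV link is an `href` ending in `.csv` are all hypothetical and would need checking against the real pages.

```python
import re
from html.parser import HTMLParser

# Assumption: the ID is a single whitespace-free token after the label.
DATASET_ID_RE = re.compile(r"Dataset ID:\s*(\S+)")


class CsvLinkParser(HTMLParser):
    """Collect .csv hrefs found after <h2>Your download options</h2>."""

    def __init__(self):
        super().__init__()
        self._in_h2 = False
        self._h2_text = []
        self._after_heading = False
        self.csv_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._h2_text = []
        elif tag == "a" and self._after_heading:
            href = dict(attrs).get("href") or ""
            # Assumption: CSV downloads are recognisable by their extension.
            if href.lower().endswith(".csv"):
                self.csv_links.append(href)

    def handle_data(self, data):
        if self._in_h2:
            self._h2_text.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self._in_h2 = False
            heading = "".join(self._h2_text).strip()
            # Any other <h2> ends the download section.
            self._after_heading = heading == "Your download options"


def prune(node):
    """Return a copy of node keeping only branches that end in a CSV leaf.

    Returns None when no CSV file is reachable from node.
    """
    if node.get("csv"):
        return {"name": node["name"], "csv": list(node["csv"])}
    kept = [c for c in map(prune, node.get("children", [])) if c is not None]
    if not kept:
        return None
    return {"name": node["name"], "children": kept}
```

A crawler could feed each dataset page to `CsvLinkParser`, assert that `csv_links` is non-empty whenever `DATASET_ID_RE` matched the listing item, and call `prune` on the finished tree before writing it out.
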
Edited Nov 08, 2017 by Bruno Duyé