ONS: Download categories tree and CSV files
- As a system
- I want to obtain categories and CSV files from the ONS website
- In order to store the hierarchy of information for ONS
Acceptance criteria
- All category branches MUST end with a CSV file
- Each dataset page MUST be stored as an HTML file in order to keep the description paragraph that follows its title
- Category pages SHOULD be stored as HTML files in sub-directories that reproduce the URL paths; alternatively, produce a JSON file of the category hierarchy if that is simpler
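
To illustrate the "sub-directories reproducing the URL paths" criterion, here is a minimal sketch of one possible URL-to-file mapping. The helper name `local_path_for`, the `ons` root directory and the `index.html` file name are assumptions made for the example; the issue does not prescribe them.

```python
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(url: str, root: Path = Path("ons")) -> Path:
    """Map a page URL to an HTML file path that mirrors the URL path.

    `root` and the `index.html` file name are illustrative choices,
    not something the issue specifies.
    """
    # Drop empty, "." and ".." segments so the result stays inside `root`.
    segments = [
        part
        for part in urlparse(url).path.split("/")
        if part not in ("", ".", "..")
    ]
    return root.joinpath(*segments, "index.html")
```

With this layout, every category page gets its own directory, so child pages can be stored beside the parent's HTML file without name clashes. The home page https://www.ons.gov.uk/ would map to `ons/index.html`.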
Technical tasks
- Start with page https://www.ons.gov.uk/
- Parse HTML pages related to menu entries: Business, Industry and Trade; Economy; Employment and Labour Market; People, population and community (class="primary-nav__list")
- Open the link with a green background named "View all datasets related to FOO"
- Items with the string "Dataset ID: XXXX" have a corresponding CSV file (to be verified with an assertion). Ignore the others.
- Parse each HTML page until CSV files are found (the CSV files follow <h2>Your download options</h2>)
- A branch that doesn't have a CSV file as a leaf MUST be pruned
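
The filtering and pruning rules above could be sketched as follows. This is only an illustration under stated assumptions: the `Node` structure, its field names and the helper names are hypothetical, and the regular expression assumes the dataset identifier is a single token without whitespace.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: the identifier after "Dataset ID:" contains no whitespace.
DATASET_ID = re.compile(r"Dataset ID:\s*(\S+)")


@dataclass
class Node:
    """One crawled page (hypothetical structure for this sketch)."""
    url: str
    csv_urls: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def dataset_id(item_text: str) -> Optional[str]:
    """Return the dataset ID found in a listing item, or None to ignore it."""
    match = DATASET_ID.search(item_text)
    return match.group(1) if match else None


def prune(node: Node) -> Optional[Node]:
    """Return a copy of the tree keeping only branches that reach a CSV file."""
    kept = [
        child
        for child in (prune(c) for c in node.children)
        if child is not None
    ]
    # A page with no CSV of its own and no surviving child is a dead branch.
    if not kept and not node.csv_urls:
        return None
    return Node(node.url, list(node.csv_urls), kept)
```

Pruning bottom-up means a category disappears as soon as all of its descendants have been removed, which matches the requirement that every remaining branch ends with a CSV file. The assertion mentioned in the tasks could then check that each item for which `dataset_id` returns a value leads to a node with a non-empty `csv_urls`.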
Edited by Bruno Duyé