dbnomics / dbnomics-data-model / Issues / #5

Closed
Open
Created Dec 09, 2021 by Gyorgy Gyomai@buerokrata

Encoding issue - when reading series.jsonl

I am working on a new toolbox-based version of the Aqicn-fetcher. Everything seems fine, but when I run a validation on the converted output, the program fails with an error that traces back to this line: https://git.nomics.world/dbnomics/dbnomics-data-model/-/blob/master/dbnomics_data_model/storages/filesystem.py#L196

The file is opened with a wrongly guessed codec (cp1252, according to the traceback), so some lines cannot be decoded and validation aborts. The error is as follows:

  File "E:\venv\aqicn\lib\site-packages\dbnomics_data_model\storages\filesystem.py", line 202, in iter_series_json_from_jsonl
    for line in fp:
  File "\\main.oecd.org\em_apps\python\current\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4777: character maps to <undefined>
- Dataset "aqicn/aqicn" at location aqicn/dataset.json 
  Error code: storage-error
  Message: Could not load "dataset.json"
Encountered errors codes:
    - storage-error: 1
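
The decoding failure itself can probably be reproduced outside the validator. The following is a minimal sketch, not the validator's code: the record and the `decode_error` helper are made up for illustration. It relies on the fact that "Á" (U+00C1) is stored as the bytes C3 81 in UTF-8, and that 0x81 has no mapping in Python's cp1252 codec.

```python
import json

# Hypothetical series record; "Á" (U+00C1) encodes to the bytes C3 81 in UTF-8.
line = json.dumps({"code": "station-1", "name": "Ávila"}, ensure_ascii=False)
raw = line.encode("utf-8")


def decode_error(data: bytes, encoding: str):
    """Return the UnicodeDecodeError raised while decoding, or None on success."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc
    return None


# cp1252 has no character for byte 0x81, so decoding the UTF-8 bytes fails.
cp1252_error = decode_error(raw, "cp1252")
# The same bytes decode cleanly as UTF-8.
utf8_error = decode_error(raw, "utf-8")

print(cp1252_error)
print(utf8_error)
```

If this behaves the same way on the affected machine, the file on disk is most likely valid UTF-8 and only the reader's codec is wrong.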

We wrote the jsonl file as UTF-8. What should I do: force cp1252 when writing (though I am not sure it would cover all the characters), or is there a way to specify the encoding in the validator?
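
For comparison, here is a minimal sketch of reading a UTF-8 jsonl file with an explicit encoding, so that Python does not fall back to the locale codec. The `iter_series_json` function and the sample records are hypothetical and are not the library's `iter_series_json_from_jsonl`; the only point is the `encoding="utf-8"` argument to `open()`.

```python
import json
import tempfile
from pathlib import Path


def iter_series_json(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    # Passing encoding="utf-8" avoids the platform default (cp1252 in the traceback).
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            line = line.strip()
            if line:
                yield json.loads(line)


# Hypothetical records containing non-ASCII characters.
records = [{"code": "s1", "name": "Ávila"}, {"code": "s2", "name": "Győr"}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "series.jsonl"
    # Write the file as UTF-8, as the fetcher does.
    with open(path, "w", encoding="utf-8") as fp:
        for record in records:
            fp.write(json.dumps(record, ensure_ascii=False) + "\n")
    # Read it back with the same explicit encoding.
    loaded = list(iter_series_json(path))

print(loaded)
```

If the validator cannot be changed, one possible workaround is Python's UTF-8 mode (the `PYTHONUTF8=1` environment variable or `python -X utf8`, available since Python 3.7), which makes `open()` default to UTF-8. Whether that fits this setup is an assumption I have not verified.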
