Encoding issue when reading series.jsonl
I am working on a new toolbox-based version of the Aqicn-fetcher. Everything seems fine, but when I run a validation on the converted output, the program fails with an error that traces back to this line: https://git.nomics.world/dbnomics/dbnomics-data-model/-/blob/master/dbnomics_data_model/storages/filesystem.py#L196
The file is opened with a wrongly guessed codec (cp1252, according to the traceback), so some lines cannot be decoded and validation aborts. The error is as follows:
File "E:\venv\aqicn\lib\site-packages\dbnomics_data_model\storages\filesystem.py", line 202, in iter_series_json_from_jsonl
for line in fp:
File "\\main.oecd.org\em_apps\python\current\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 4777: character maps to <undefined>
- Dataset "aqicn/aqicn" at location aqicn/dataset.json
Error code: storage-error
Message: Could not load "dataset.json"
Encountered errors codes:
- storage-error: 1
We used UTF-8 when writing the jsonl file. What should I do: enforce cp1252 when writing (although I am not sure it would cover all the characters), or is there a way to specify the encoding in the validator?
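
For reference, here is a minimal sketch that reproduces what I think is happening. It assumes the file is opened in text mode without an explicit encoding, so that Python falls back to the Windows locale codec (cp1252). The record content and the `read_lines` helper are made up for illustration and are not part of dbnomics-data-model.

```python
import tempfile
from pathlib import Path

# Hypothetical record: "Á" (U+00C1) is encoded as bytes 0xC3 0x81 in UTF-8,
# and byte 0x81 has no mapping in Python's cp1252 codec.
record = '{"code": "station-1", "name": "Ávila"}\n'


def read_lines(path, encoding):
    """Read all lines of a text file with an explicit encoding."""
    with open(path, encoding=encoding) as fp:
        return [line for line in fp]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "series.jsonl"
    # Write the file as UTF-8, as our converter does.
    path.write_text(record, encoding="utf-8")

    # Reading it back as cp1252 mimics the Windows default and fails.
    try:
        read_lines(path, "cp1252")
        cp1252_failed = False
    except UnicodeDecodeError as exc:
        cp1252_failed = True
        print(exc)

    # Reading it back with an explicit UTF-8 encoding works.
    utf8_lines = read_lines(path, "utf-8")
```

If that assumption holds, passing `encoding="utf-8"` to `open()` in the library would probably be the cleanest fix. As a possible workaround on our side, Python 3.7+ has a UTF-8 mode (`python -X utf8` or the environment variable `PYTHONUTF8=1`) that makes `open()` default to UTF-8; I have not yet checked whether the validator behaves correctly with it.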