Dumping pretty cyrillic UTF YAML and JSON files
TL;DR:
- PyYAML:
allow_unicode=True
- JSON:
ensure_ascii=True
Context:
- 231213-1710 Ukrainska Pravda dataset
- 231203-1745 Masterarbeit eval task LMentry-static-UA
- Any and all Ukrainian/cyrillic YAML and JSON files I dump
My favourite params for dumping both, esp. if Ukrainian/cyrillic/UTF is involved
All of the below are params one can pass to to_[json|yaml][_file]()
of Wizard Mixin Classes — Dataclass Wizard 0.22.3 documentation
(Py)YAML
https://pyyaml.org/wiki/PyYAMLDocumentation
params = dict(
allow_unicode=True, # write Ukrainian as Ukrainian
default_flow_style=False,
sort_keys=False, # so template is first in the YAML for readability
)
self.to_yaml_file(yaml_target, **params)
default_flow_style
prefers lists like this (from docu):
>>> print yaml.dump(yaml.load(document), default_flow_style=False)
a: 1
b:
c: 3
d: 4
JSON
to_json(indent=4, ensure_ascii=False)
The difference being:
(Pdb++) created_tasks[0][0].to_json()
'{"question": "\\u042f\\u043a\\u0435 \\u0441\\u043b\\u043e\\u0432\\u043e \\u043a\\u043e\\u0440\\u043e\\u0442\\u0448\\u0435: \\"\\u043a\\u0456\\u0442\\"\\u0447\\u0438 \\"\\u0441\\u043e\\u0431\\u0430\\u043a\\u0430\\"?", "correctAnswer": "\\u043a\\u0456\\u0442", "templateUuid": "1da85d6e7cf5440cba54e3a9b548a037", "taskInstanceUuid": "6ac71cd524474684abfec0cfa3ef5e1e", "additionalMetadata": {"kind": "less", "template_n": 2, "t1": "\\u043a\\u0456\\u0442","t2": "\\u0441\\u043e\\u0431\\u0430\\u043a\\u0430", "reversed": false}}'
(Pdb++) created_tasks[0][0].to_json(ensure_ascii=False)
'{"question": "Яке слово коротше: \\"кіт\\" чи \\"собака\\"?", "correctAnswer": "кіт", "templateUuid": "1da85d6e7cf5440cba54e3a9b548a037", "taskInstanceUuid": "6ac71cd524474684abfec0cfa3ef5e1e", "additionalMetadata": {"kind": "less", "template_n": 2, "t1": "кіт", "t2": "собака", "reversed": false}}'
(Pdb++)
Nel mezzo del deserto posso dire tutto quello che voglio.
comments powered by Disqus