serhii.net

In the middle of the desert you can say anything you want

20 Jul 2022

Dataset files structure Huggingface recommendations

Previously: 220622-1744 Directory structure for python research-y projects, 220105-1142 Order of directories inside a python project

Datasets.

Hugging Face (HF) has recommendations on how to Structure your repository: where and how to put .csv/.json files for the various splits, shards, and configurations.

Datasets laid out this way can also be loaded directly with load_dataset(), even though they are plain CSV/JSON files.

Filenames containing 'train' are considered part of the train split, same for 'test' and 'valid'
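To make the naming rule above concrete, here is a rough sketch of keyword-based split guessing. It is only an illustration of the idea, not the library's actual matching logic: the function name `guess_split`, the tokenization on non-letters, and the mapping of 'valid' to a split called "validation" are my assumptions.

```python
import re
from typing import Optional

# Keyword -> split name. Assumption: 'valid' files end up in a "validation" split.
SPLIT_KEYWORDS = {"train": "train", "test": "test", "valid": "validation"}


def guess_split(filename: str) -> Optional[str]:
    """Guess a split from a filename by looking for a split keyword (hypothetical helper)."""
    # Break the name into lowercase alphabetic tokens, e.g.
    # "train-00000-of-00001.csv" -> ["train", "of", "csv"].
    tokens = [t for t in re.split(r"[^a-z]+", filename.lower()) if t]
    for keyword, split in SPLIT_KEYWORDS.items():
        if keyword in tokens:
            return split
    return None  # no recognizable split keyword in the name
```

Matching whole tokens instead of substrings avoids false hits such as "latest.csv", which contains the letters "test".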

And indeed, I could create a Dataset without any issues through ds = datasets.load_dataset(my_directory_with_jsons).
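A minimal sketch of that experiment, assuming the `datasets` package is installed. The file names, the JSON Lines format, the example records, and the helper names `write_demo_dataset` and `demo` are all made up for illustration; my real directory looked different.

```python
import json
import tempfile
from pathlib import Path


def write_demo_dataset(root: Path) -> None:
    """Write tiny train/test JSON Lines files into root (hypothetical example data)."""
    files = {
        "train.jsonl": [
            {"text": "first example", "label": 0},
            {"text": "second example", "label": 1},
        ],
        "test.jsonl": [{"text": "held-out example", "label": 1}],
    }
    for filename, records in files.items():
        with open(root / filename, "w", encoding="utf-8") as f:
            for record in records:
                f.write(json.dumps(record) + "\n")  # one JSON object per line


def demo() -> dict:
    """Create a temporary dataset directory, load it, and return rows per split."""
    import datasets  # imported lazily: only this function needs the package

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        write_demo_dataset(root)
        # Pointing load_dataset() at the directory lets it infer the splits
        # from the file names.
        ds = datasets.load_dataset(str(root))
        return {name: split.num_rows for name, split in ds.items()}
```

Calling `demo()` should return one entry per split inferred from the file names, which is a quick way to check that a directory layout is picked up as intended.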
