Hugging Face dataset build configs
Goal: create multiple dataset configs for the 231203-1745 Masterarbeit LMentry-static-UA task.
- Example: datasets/templates/new_dataset_script.py at main · huggingface/datasets
- Tutorial: Builder classes
- One example they give: https://huggingface.co/datasets/frgfm/imagenette/blob/main/imagenette.py
Developing:
- One can provide paths to local files in _URLS as well, to speed up development (see the sketch after the snippet below)!
- It’s not magic dictionaries either: the Features and the rest are basically syntax I already know, which is neat!
# inside _generate_examples(), branching on the selected config:
elif self.config.name == "WhichWordWrongCatTask":
    yield key, {
        "question": data["question"],
        "correctAnswer": data["correctAnswer"],
        "options": data["additionalMetadata_all_options"],
        # "second_domain_answer": "" if split == "test" else data["second_domain_answer"],
    }
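For context, here is a minimal self-contained version of such a builder the way I understand it; the class name, the local paths and the reduced feature set are my own placeholder assumptions, not the real script:

import json

import datasets

# Local file paths instead of hub URLs while developing; the download
# manager passes local paths through unchanged.
_URLS = {
    "LOWTask": "data/tt_nim/LOWTask.jsonl",
    "WISTask": "data/tt_nim/WISTask.jsonl",
}

class LMentryStaticUA(datasets.GeneratorBasedBuilder):
    # One BuilderConfig per task file.
    BUILDER_CONFIGS = [datasets.BuilderConfig(name=name) for name in _URLS]

    def _info(self):
        # The Features definition uses the same syntax as everywhere else in `datasets`.
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "question": datasets.Value("string"),
                    "correctAnswer": datasets.Value("string"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        path = dl_manager.download_and_extract(_URLS[self.config.name])
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={"filepath": path},
            )
        ]

    def _generate_examples(self, filepath):
        # One JSON object per line; the key only has to be unique.
        with open(filepath, encoding="utf-8") as f:
            for key, line in enumerate(f):
                data = json.loads(line)
                yield key, {
                    "question": data["question"],
                    "correctAnswer": data["correctAnswer"],
                }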
Ah, the dataset viewer is not available :( But apparently one can use manual configuration and then it works: https://huggingface.co/docs/hub/datasets-manual-configuration
I can use https://huggingface.co/datasets/scene_parse_150/edit/main/README.md as an example here.
dataset_info:
- config_name: scene_parsing
  features:
  - name: image
    dtype: image
  - name: annotation
    dtype: image
  - name: scene_category
    dtype:
      class_label:
        names:
          '0': airport_terminal
          '1': art_gallery
          '2': badlands
- config_name: instance_segmentation
  features:
  - name: image
    dtype: image
  - name: annotation
    dtype: image
…
This shows WISTask in the viewer, but not LOWTask (it fails with 'str' object has no attribute 'items').
configs:
- config_name: LOWTask
  data_files: "data/tt_nim/LOWTask.jsonl"
  features:
  - name: question
    dtype: string
  - name: correctAnswer
    dtype: string
  default: true
- config_name: WISTask
  data_files: "data/tt_nim/WISTask.jsonl"
And I can’t download either of them with Python.
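My test script is roughly the following (the loop is reconstructed from the test.py lines in the traceback below; the repo path is a placeholder):

from datasets import get_dataset_config_names, load_dataset

# Placeholder repo id, not the real one.
path = "USER/lmentry-static-ua"

# Try to build every config and see which ones fail.
for n in get_dataset_config_names(path):
    ds = load_dataset(path, n)
    print(n, ds)

It fails with: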
Traceback (most recent call last):
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 1873, in _prepare_split_single
writer.write_table(table)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 568, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/table.py", line 2290, in table_cast
return cast_table_to_schema(table, schema)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/table.py", line 2248, in cast_table_to_schema
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
question: string
correctAnswer: string
templateUuid: string
taskInstanceUuid: string
additionalMetadata_kind: string
additionalMetadata_template_n: int64
additionalMetadata_option_0: string
additionalMetadata_option_1: string
additionalMetadata_label: int64
additionalMetadata_t1_meta_pos: string
additionalMetadata_t1_meta_freq: int64
additionalMetadata_t1_meta_index: int64
additionalMetadata_t1_meta_freq_quantile: int64
additionalMetadata_t1_meta_len: int64
additionalMetadata_t1_meta_len_quantile: string
additionalMetadata_t1_meta_word_raw: string
additionalMetadata_t2_meta_pos: string
additionalMetadata_t2_meta_freq: int64
additionalMetadata_t2_meta_index: int64
additionalMetadata_t2_meta_freq_quantile: int64
additionalMetadata_t2_meta_len: int64
additionalMetadata_t2_meta_len_quantile: string
additionalMetadata_t2_meta_word_raw: string
additionalMetadata_reversed: bool
additionalMetadata_id: int64
system_prompts: list<item: string>
child 0, item: string
to
{'question': Value(dtype='string', id=None), 'correctAnswer': Value(dtype='string', id=None), 'templateUuid': Value(dtype='string', id=None), 'taskInstanceUuid': Value(dtype='string', id=None), 'additionalMetadata_kind': Value(dtype='string', id=None), 'additionalMetadata_template_n': Value(dtype='int64', id=None), 'additionalMetadata_all_options': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'additionalMetadata_label': Value(dtype='int64', id=None), 'additionalMetadata_main_cat_words': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'additionalMetadata_other_word': Value(dtype='string', id=None), 'additionalMetadata_cat_name_main': Value(dtype='string', id=None), 'additionalMetadata_cat_name_other': Value(dtype='string', id=None), 'additionalMetadata_id': Value(dtype='int64', id=None), 'system_prompts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
because column names don't match
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test.py", line 18, in <module>
ds = load_dataset(path, n)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/load.py", line 1797, in load_dataset
builder_instance.download_and_prepare(
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 890, in download_and_prepare
self._download_and_prepare(
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 1746, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 1891, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
Oh goddammit. Relevant:
- pytorch - Load dataset with datasets library of huggingface - Stack Overflow
- python - Huggingface datasets ValueError - Stack Overflow (there it was an issue because he loaded the dataset locally)
- machine learning - Huggingface Load_dataset() function throws “ValueError: Couldn’t cast” - Stack Overflow (also a Ukrainian dataset)
I give up.
Back to the script.
Last thing I’ll try (as suggested by tau/scrolls · Dataset Viewer issue: DatasetWithScriptNotSupportedError):
Convert Dataset To Parquet - a Hugging Face Space by albertvillanova
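If the Space doesn’t cooperate, my understanding is that the same result can be achieved locally: load each config through the loading script and push it with push_to_hub(), which writes Parquet files the viewer can display. A sketch, with a placeholder repo id and an assumed local script filename:

from datasets import load_dataset

# Config names taken from this page; repo id and script filename are placeholders.
for n in ["LOWTask", "WISTask", "WhichWordWrongCatTask"]:
    ds = load_dataset("lmentry_static_ua.py", n)  # assumed local script name
    ds.push_to_hub("USER/lmentry-static-ua", config_name=n)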
…
feels so unsatisfying not to see the datasets in the viewer :(
tau/scrolls · Dataset Viewer issue: DatasetWithScriptNotSupportedError still feels like something relevant to me. We’ll see.