15 Feb 2024

Huggingface Hub full dataset card metadata

The HF Hub dataset UI allows to set only six fields in the metadata, the full fields can be set through the YAML it generates. Here’s the full list (hub-docs/datasetcard.md at main · huggingface/hub-docs):

---
# Example metadata to be added to a dataset card.  
# Full dataset card template at https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md
language:
- {lang_0}  # Example: fr
- {lang_1}  # Example: en
license: {license}  # Example: apache-2.0 or any license from https://hf.co/docs/hub/repositories-licenses
license_name: {license_name}  # If license = other (license not in https://hf.co/docs/hub/repositories-licenses), specify an id for it here, like `my-license-1.0`.
license_link: {license_link}  # If license = other, specify "LICENSE" or "LICENSE.md" to link to a file of that name inside the repo, or a URL to a remote file.
license_details: {license_details}  # Legacy, textual description of a custom license.
tags:
- {tag_0}  # Example: audio
- {tag_1}  # Example: bio
- {tag_2}  # Example: natural-language-understanding
- {tag_3}  # Example: birds-classification
annotations_creators:
- {creator}  # Example: crowdsourced, found, expert-generated, machine-generated
language_creators:
- {creator}  # Example: crowdsourced, ...
language_details:
- {bcp47_lang_0}  # Example: fr-FR
- {bcp47_lang_1}  # Example: en-US
pretty_name: {pretty_name}  # Example: SQuAD
size_categories:
- {number_of_elements_in_dataset}  # Example: n<1K, 100K<n<1M, …
source_datasets:
- {source_dataset_0}  # Example: wikipedia
- {source_dataset_1}  # Example: laion/laion-2b
task_categories:  # Full list at https://github.com/huggingface/huggingface.js/blob/main/packages/tasks/src/pipelines.ts
- {task_0}  # Example: question-answering
- {task_1}  # Example: image-classification
task_ids:
- {subtask_0}  # Example: extractive-qa
- {subtask_1}  # Example: multi-class-image-classification
paperswithcode_id: {paperswithcode_id}  # Dataset id on PapersWithCode (from the URL). Example for SQuAD: squad
configs:  # Optional for datasets with multiple configurations like glue.
- {config_0}  # Example for glue: sst2
- {config_1}  # Example for glue: cola

# Optional. This part can be used to store the feature types and size of the dataset to be used in python. This can be automatically generated using the datasets-cli.
dataset_info:
  features:
    - name: {feature_name_0}    # Example: id
      dtype: {feature_dtype_0}  # Example: int32
    - name: {feature_name_1}    # Example: text
      dtype: {feature_dtype_1}  # Example: string
    - name: {feature_name_2}    # Example: image
      dtype: {feature_dtype_2}  # Example: image
    # Example for SQuAD:
    # - name: id
    #   dtype: string
    # - name: title
    #   dtype: string
    # - name: context
    #   dtype: string
    # - name: question
    #   dtype: string
    # - name: answers
    #   sequence:
    #     - name: text
    #       dtype: string
    #     - name: answer_start
    #       dtype: int32
  config_name: {config_name}  # Example for glue: sst2
  splits:
    - name: {split_name_0}                  # Example: train
      num_bytes: {split_num_bytes_0}        # Example for SQuAD: 79317110
      num_examples: {split_num_examples_0}  # Example for SQuAD: 87599
  download_size: {dataset_download_size}   # Example for SQuAD: 35142551
  dataset_size: {dataset_size}             # Example for SQuAD: 89789763

# It can also be a list of multiple configurations:
# ```yaml
# dataset_info:
#   - config_name: {config0}
#     features:
#       ...
#   - config_name: {config1}
#     features:
#       ...
# ```

# Optional. If you want your dataset to be protected behind a gate that users have to accept to access the dataset. More info at https://huggingface.co/docs/hub/datasets-gated
extra_gated_fields:
- {field_name_0}: {field_type_0}  # Example: Name: text
- {field_name_1}: {field_type_1}  # Example: Affiliation: text
- {field_name_2}: {field_type_2}  # Example: Email: text
- {field_name_3}: {field_type_3}  # Example for speech datasets: I agree to not attempt to determine the identity of speakers in this dataset: checkbox
extra_gated_prompt: {extra_gated_prompt}  # Example for speech datasets: By clicking on “Access repository” below, you also agree to not attempt to determine the identity of speakers in the dataset.

# Optional. Add this if you want to encode a train and evaluation info in a structured way for AutoTrain or Evaluation on the Hub
train-eval-index:
  - config: {config_name}           # The dataset config name to use. Example for datasets without configs: default. Example for glue: sst2
    task: {task_name}               # The task category name (same as task_category). Example: question-answering
    task_id: {task_type}            # The AutoTrain task id. Example: extractive_question_answering
    splits:
      train_split: train            # The split to use for training. Example: train
      eval_split: validation        # The split to use for evaluation. Example: test
    col_mapping:                    # The columns mapping needed to configure the task_id.
    # Example for extractive_question_answering:
      # question: question
      # context: context
      # answers:
      #   text: text
      #   answer_start: answer_start
    metrics:
      - type: {metric_type}         # The metric id. Example: wer. Use metric id from https://hf.co/metrics
        name: {metric_name}         # Tne metric name to be displayed. Example: Test WER
---

Valid license identifiers can be found in [our docs](https://huggingface.co/docs/hub/repositories-licenses).

For the full dataset card template, see: [datasetcard_template.md file](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/datasetcard_template.md).

Found this in their docu: Sharing your dataset — datasets 1.8.0 documentation

Full MD template : huggingface_hub/src/huggingface_hub/templates/datasetcard_template.md at main · huggingface/huggingface_hub EDIT: oh nice “import dataset card template” is an option in the GUI and it works!

Nel mezzo del deserto posso dire tutto quello che voglio.

serhii.net

Huggingface Hub full dataset card metadata