In the middle of the desert you can say anything you want
Punctuation
word~\cite{xxx}.
sentence.\footnote{}
sent~\cite{}.\footnote{}
\enquote{} (with italics for longer sentences.)
Bits
Not bits
====== Open research questions:
SH, [10 Apr 2024 14:58:39] LMES: investigate model robustness, and for example look at how the accuracy of humans and AI depends on, hmm, the difference in word length or on the position of the word (“what is the one hundred and thirteenth word in the sentence …”). CBT-UA: evaluate it properly, and also, for both humans and machines, look at the scores when only the challenge segment is given. I tested this with neural networks (it didn't make it into the paper), and the results were very often better with the fragment than with the whole fairy tale
SH, [10 Apr 2024 14:59:57] Make a dataset on biases and feminitives; I have code written that generates a version zero, which is basically sentences like “my wife does computer systems programming, that is, by profession she is a ….”
SH, [10 Apr 2024 15:00:20] A lifelong dream: finally make a Russian-Ukrainian interference dataset covering Russianisms and Russian-influenced errors
SH, [10 Apr 2024 15:02:57] UA-CBT: take fairy tales from Project Gutenberg, take foreign fairy tales translated into Ukrainian, and compare model scores on tasks built from stories from these different sources. One could skip the filtering and just do a human baseline on a part of the generated dataset. That way one can build an absurdly large dataset and know that the maximum score there is roughly 80%, because 20% of the tasks are garbage
Also:
\autoref is like \ref, but it adds the word, not just the number: 3.2 -> Figure 3.2
(cross referencing - What’s the difference between \ref and \autoref? - TeX - LaTeX Stack Exchange)
Wrapping stuff in this command makes it stand out; also greppable by TODO which removes the need to remember commands
\newcommand{\TODO}[1]{{\color{magenta}#1}}
Previously:
Is there a suggested way of debugging dataset generators? - 🤗Datasets - Hugging Face Forums
Instead of committing etc. every time, one can clone the dataset path locally through git and then point load_dataset() to that local folder with the dataset script file!
Random nugget from Document to compress data files before uploading · Issue #5687 · huggingface/datasets:
- gz, to compress individual files
- zip, to compress and archive multiple files; zip is preferred rather than tar because it supports streaming out of the box
(Streaming: https://huggingface.co/docs/datasets/v2.4.0/en/stream TL;DR: for very large datasets, don’t download the entire dataset; add streaming=True to the load_dataset() call.)
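A minimal sketch of the streaming idea, assuming the datasets library is installed and the dataset has a split called "train". The function name peek_streaming and its parameters are made up for illustration; only load_dataset() and its streaming argument come from the library.

```python
from itertools import islice


def peek_streaming(dataset_name, n=3, split="train"):
    """Return the first n examples of a Hub dataset without downloading all of it."""
    # Imported lazily so the sketch can be read without `datasets` installed.
    from datasets import load_dataset

    # streaming=True gives an iterable dataset that fetches examples on demand.
    stream = load_dataset(dataset_name, split=split, streaming=True)
    return list(islice(stream, n))
```

Calling peek_streaming() with a dataset name should fetch only as much data as the first few examples need.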
TIL from NASA’s (!) docs that there are two sub-levels after subsubsection:
\subsubsection{Example Sub-Sub-Section}
\label{sec:example-subsubsection}
\ref{sec:example-subsubsection} is an example of \texttt{subsubsection}.
\paragraph{Example Paragraph}
\label{sec:example-paragraph}
\ref{sec:example-paragraph} is an example of \texttt{paragraph}.
\subparagraph{Example Sub-Paragraph}
\label{sec:example-subparagraph}
\ref{sec:example-subparagraph} is an example of \texttt{subparagraph}.
I so needed them!
Goal: create multiple dataset configs for 231203-1745 Masterarbeit LMentry-static-UA task.
Developing: in _URLS, one can provide paths to local files as well, to speed up development!
It’s not magic dictionaries, it’s basically syntax known to me (with Features etc.) which is neat!
elif self.config.name == "WhichWordWrongCatTask":
    yield key, {
        "question": data["question"],
        "correctAnswer": data["correctAnswer"],
        "options": data["additionalMetadata_all_options"],
        # "second_domain_answer": "" if split == "test" else data["second_domain_answer"],
    }
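The fragment above is one branch of a per-config generator. As a rough stdlib-only stand-in (not the actual dataset script), the same idea can be sketched as a plain function that reads a JSONL file and yields fields depending on the config name. The function name generate_examples and the assumption that every row is one JSON object per line are mine; the field names are taken from the fragment.

```python
import json


def generate_examples(filepath, config_name):
    """Yield (key, example) pairs whose fields depend on the config name."""
    with open(filepath, encoding="utf-8") as f:
        for key, line in enumerate(f):
            if not line.strip():
                continue  # skip blank lines
            data = json.loads(line)
            example = {
                "question": data["question"],
                "correctAnswer": data["correctAnswer"],
            }
            if config_name == "WhichWordWrongCatTask":
                # Only this task carries a list of answer options.
                example["options"] = data["additionalMetadata_all_options"]
            yield key, example
```

Keeping the shared fields in one place and adding the task-specific ones per branch makes it easier to see which columns each config is expected to have.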
Ah, dataset viewer not available :( But apparently one can use manual configs and then it works: https://huggingface.co/docs/hub/datasets-manual-configuration
I can use https://huggingface.co/datasets/scene_parse_150/edit/main/README.md as an example here.
dataset_info:
- config_name: scene_parsing
  features:
  - name: image
    dtype: image
  - name: annotation
    dtype: image
  - name: scene_category
    dtype:
      class_label:
        names:
          '0': airport_terminal
          '1': art_gallery
          '2': badlands
- config_name: instance_segmentation
  features:
  - name: image
    dtype: image
  - name: annotation
    dtype: image
…
This shows WISTask in the viewer, but not LOWTask (because 'str' object has no attribute 'items'):
configs:
- config_name: LOWTask
  data_files: "data/tt_nim/LOWTask.jsonl"
  features:
  - name: question
    dtype: string
  - name: correctAnswer
    dtype: string
  default: true
- config_name: WISTask
  data_files: "data/tt_nim/WISTask.jsonl"
And I can’t download either with python because
Traceback (most recent call last):
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 1873, in _prepare_split_single
writer.write_table(table)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/arrow_writer.py", line 568, in write_table
pa_table = table_cast(pa_table, self._schema)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/table.py", line 2290, in table_cast
return cast_table_to_schema(table, schema)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/table.py", line 2248, in cast_table_to_schema
raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
question: string
correctAnswer: string
templateUuid: string
taskInstanceUuid: string
additionalMetadata_kind: string
additionalMetadata_template_n: int64
additionalMetadata_option_0: string
additionalMetadata_option_1: string
additionalMetadata_label: int64
additionalMetadata_t1_meta_pos: string
additionalMetadata_t1_meta_freq: int64
additionalMetadata_t1_meta_index: int64
additionalMetadata_t1_meta_freq_quantile: int64
additionalMetadata_t1_meta_len: int64
additionalMetadata_t1_meta_len_quantile: string
additionalMetadata_t1_meta_word_raw: string
additionalMetadata_t2_meta_pos: string
additionalMetadata_t2_meta_freq: int64
additionalMetadata_t2_meta_index: int64
additionalMetadata_t2_meta_freq_quantile: int64
additionalMetadata_t2_meta_len: int64
additionalMetadata_t2_meta_len_quantile: string
additionalMetadata_t2_meta_word_raw: string
additionalMetadata_reversed: bool
additionalMetadata_id: int64
system_prompts: list<item: string>
child 0, item: string
to
{'question': Value(dtype='string', id=None), 'correctAnswer': Value(dtype='string', id=None), 'templateUuid': Value(dtype='string', id=None), 'taskInstanceUuid': Value(dtype='string', id=None), 'additionalMetadata_kind': Value(dtype='string', id=None), 'additionalMetadata_template_n': Value(dtype='int64', id=None), 'additionalMetadata_all_options': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'additionalMetadata_label': Value(dtype='int64', id=None), 'additionalMetadata_main_cat_words': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), 'additionalMetadata_other_word': Value(dtype='string', id=None), 'additionalMetadata_cat_name_main': Value(dtype='string', id=None), 'additionalMetadata_cat_name_other': Value(dtype='string', id=None), 'additionalMetadata_id': Value(dtype='int64', id=None), 'system_prompts': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)}
because column names don't match
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "test.py", line 18, in <module>
ds = load_dataset(path, n)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/load.py", line 1797, in load_dataset
builder_instance.download_and_prepare(
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 890, in download_and_prepare
self._download_and_prepare(
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 1746, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/sh/.local/lib/python3.8/site-packages/datasets/builder.py", line 1891, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
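The ValueError in the traceback says that the columns found in the data differ from the schema they are being cast to. A small stdlib-only sketch for checking this locally before uploading is to compare the top-level keys of two JSONL files. The helper names jsonl_columns and column_diff are hypothetical, and the sketch assumes each non-empty line is a single JSON object.

```python
import json


def jsonl_columns(path):
    """Return the set of top-level keys seen across all rows of a JSONL file."""
    columns = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                columns.update(json.loads(line).keys())
    return columns


def column_diff(path_a, path_b):
    """Return (only_in_a, only_in_b) as sorted lists of column names."""
    a, b = jsonl_columns(path_a), jsonl_columns(path_b)
    return sorted(a - b), sorted(b - a)
```

Running column_diff() on two of the task files should list the columns that exist in one file but not in the other. If both lists are non-empty, the files probably cannot share one schema.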
Oh goddammit. Relevant:
I give up.
Back to the script.
Last thing I’ll try (as suggested by tau/scrolls · Dataset Viewer issue: DatasetWithScriptNotSupportedError):
Convert Dataset To Parquet - a Hugging Face Space by albertvillanova
…
feels so unsatisfying not to see the datasets in the viewer :(
tau/scrolls · Dataset Viewer issue: DatasetWithScriptNotSupportedError this feels like something relevant to me. We’ll see.
Got bit by this.
random — Generate pseudo-random numbers — Python 3.12.2 documentation
- random.sample() IS WITHOUT REPLACEMENT: no duplicates unless present in list (like random.shuffle())
- random.choices() IS WITH REPLACEMENT: duplicates MAY happen.
Also: random.shuffle() works in-place. Sampling len(x) is a way to shuffle immutable lists.
Converting JSONL to a JSON array and back with jq:
jq -s '.' input.jsonl > output.json
jq -c '.[]' input.json > output.jsonl
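Going back to the random module notes above, the with/without replacement difference can be checked in a few lines. This is a small sketch using only the standard library; the seed and the example tuple are arbitrary choices for illustration.

```python
import random

rng = random.Random(0)  # seeded only to make the sketch repeatable
items = ("a", "b", "c", "d")  # an immutable sequence with unique elements

# sample(): without replacement, so unique input gives unique output.
picked = rng.sample(items, k=3)
assert len(set(picked)) == 3

# choices(): with replacement, so the same element may appear repeatedly.
drawn = rng.choices(items, k=10)
assert len(set(drawn)) <= len(items)  # 10 draws from 4 items must repeat

# shuffle() mutates a list in place and returns None.
mutable = list(items)
result = rng.shuffle(mutable)
assert result is None

# Sampling len(x) elements gives a shuffled copy of an immutable sequence.
shuffled_copy = rng.sample(items, k=len(items))
assert sorted(shuffled_copy) == sorted(items)
```

The last two cases are the practical difference: shuffle() needs a mutable list, while sample() with k=len(x) leaves the original untouched and returns a new list.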