Huggingface HF Custom NER with BERT: tokenizing, aligning tokens, etc.
Really nice google colab showing more advanced datasets
bits in addition to what’s on the label:
Custom Named Entity Recognition with BERT.ipynb - Colaboratory
Pasting this example from there:
class dataset(Dataset):
def __init__(self, dataframe, tokenizer, max_len):
self.len = len(dataframe)
self.data = dataframe
self.tokenizer = tokenizer
self.max_len = max_len
def __getitem__(self, index):
# step 1: get the sentence and word labels
sentence = self.data.sentence[index].strip().split()
word_labels = self.data.word_labels[index].split(",")
# step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
# BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
encoding = self.tokenizer(sentence,
is_pretokenized=True,
return_offsets_mapping=True,
padding='max_length',
truncation=True,
max_length=self.max_len)
# step 3: create token labels only for first word pieces of each tokenized word
labels = [labels_to_ids[label] for label in word_labels]
# code based on https://huggingface.co/transformers/custom_datasets.html#tok-ner
# create an empty array of -100 of length max_length
encoded_labels = np.ones(len(encoding["offset_mapping"]), dtype=int) * -100
# set only labels whose first offset position is 0 and the second is not 0
i = 0
for idx, mapping in enumerate(encoding["offset_mapping"]):
if mapping[0] == 0 and mapping[1] != 0:
# overwrite label
encoded_labels[idx] = labels[i]
i += 1
# step 4: turn everything into PyTorch tensors
item = {key: torch.as_tensor(val) for key, val in encoding.items()}
item['labels'] = torch.as_tensor(encoded_labels)
return item
def __len__(self):
return self.len
For aligning tokens, there’s Code To Align Annotations With Huggingface Tokenizers. It has a repo: LightTag/sequence-labeling-with-transformers: Examples for aligning, padding and batching sequence labeling data (NER) for use with pre-trained transformer models
Also the official tutorial (Token classification) has a function to do something similar:
def tokenize_and_align_labels(examples):
tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
labels = []
for i, label in enumerate(examples[f"ner_tags"]):
word_ids = tokenized_inputs.word_ids(batch_index=i) # Map tokens to their respective word.
previous_word_idx = None
label_ids = []
for word_idx in word_ids: # Set the special tokens to -100.
if word_idx is None:
label_ids.append(-100)
elif word_idx != previous_word_idx: # Only label the first token of a given word.
label_ids.append(label[word_idx])
else:
label_ids.append(-100)
previous_word_idx = word_idx
labels.append(label_ids)
tokenized_inputs["labels"] = labels
return tokenized_inputs
Nel mezzo del deserto posso dire tutto quello che voglio.
comments powered by Disqus