In the middle of the desert you can say anything you want

01 Jun 2022

Hugging Face (HF) custom NER with BERT: tokenizing, aligning tokens, etc.

Really nice Google Colab notebook showing more advanced dataset handling in addition to what’s on the label: Custom Named Entity Recognition with BERT.ipynb

Pasting this example from there:

import numpy as np
import torch
from torch.utils.data import Dataset

# labels_to_ids (tag string -> integer id) is defined elsewhere in the notebook.

class dataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __getitem__(self, index):
        # step 1: get the sentence and word labels
        # (assumes the dataframe has "sentence" and "word_labels" columns)
        sentence = self.data.sentence[index].strip().split()
        word_labels = self.data.word_labels[index].split(",")

        # step 2: use tokenizer to encode sentence (includes padding/truncation up to max length)
        # BertTokenizerFast provides a handy "return_offsets_mapping" functionality for individual tokens
        # (the keyword arguments are an assumption inferred from the two comments above)
        encoding = self.tokenizer(sentence,
                                  is_split_into_words=True,
                                  return_offsets_mapping=True,
                                  padding="max_length",
                                  truncation=True,
                                  max_length=self.max_len)

        # step 3: create token labels only for first word pieces of each tokenized word
        labels = [labels_to_ids[label] for label in word_labels]
        # code based on
        # create an array of -100 of length max_length
        encoded_labels = np.ones(len(encoding["offset_mapping"]), dtype=int) * -100

        # set only labels whose first offset position is 0 and the second is not 0
        i = 0
        for idx, mapping in enumerate(encoding["offset_mapping"]):
            if mapping[0] == 0 and mapping[1] != 0:
                # overwrite label
                encoded_labels[idx] = labels[i]
                i += 1

        # step 4: turn everything into PyTorch tensors
        item = {key: torch.as_tensor(val) for key, val in encoding.items()}
        item['labels'] = torch.as_tensor(encoded_labels)
        return item

    def __len__(self):
        return self.len
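
To look at the offset trick from step 3 in isolation, here is a minimal sketch that runs without a tokenizer. The offsets are hand-written stand-ins for what a fast tokenizer would return for pre-split words (offsets relative to each word, `(0, 0)` for special and padding tokens). The function name `first_piece_labels`, the word-piece split and the label ids are my own assumptions, not from the notebook.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def first_piece_labels(offset_mapping, word_label_ids, ignore_index=IGNORE_INDEX):
    """Label only the first word piece of each word, everything else gets ignore_index."""
    encoded = [ignore_index] * len(offset_mapping)
    i = 0  # index into the word-level labels
    for idx, (start, end) in enumerate(offset_mapping):
        # start == 0 and end != 0: first piece of a word.
        # (0, 0) is a special/padding token, start > 0 is a continuation piece.
        if start == 0 and end != 0:
            encoded[idx] = word_label_ids[i]
            i += 1
    return encoded


# Hypothetical split of ["Angela", "visited", "Heidelberg"]:
# [CLS] angela visited heidel ##berg [SEP] [PAD] [PAD]
offsets = [(0, 0), (0, 6), (0, 7), (0, 6), (6, 10), (0, 0), (0, 0), (0, 0)]
word_label_ids = [1, 0, 2]  # e.g. B-per, O, B-geo under some made-up labels_to_ids

print(first_piece_labels(offsets, word_label_ids))
# [-100, 1, 0, 2, -100, -100, -100, -100]
```

The `##berg` piece and all special/padding positions stay at -100, so the loss skips them.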

For aligning tokens, there’s the article “Code To Align Annotations With Huggingface Tokenizers”. It has a repo, LightTag/sequence-labeling-with-transformers: examples for aligning, padding and batching sequence labeling data (NER) for use with pre-trained transformer models.
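
As a rough illustration of what aligning annotations means when the labels come as character spans rather than per-word tags, here is a minimal sketch of the general idea. It is not the code from the linked article or repo: the function name `char_spans_to_token_labels`, the annotation dict format and the B-/I- tagging are my assumptions, and the offsets are hand-written stand-ins for the character offsets a fast tokenizer returns when given the raw string.

```python
def char_spans_to_token_labels(offset_mapping, annotations, outside="O"):
    """Give each token a B-/I- tag if its character range overlaps an annotation."""
    labels = [outside] * len(offset_mapping)
    for ann in annotations:
        first = True
        for idx, (start, end) in enumerate(offset_mapping):
            if start == end:
                continue  # special tokens have empty (0, 0) offsets
            # Token overlaps the annotated character span.
            if start < ann["end"] and end > ann["start"]:
                labels[idx] = ("B-" if first else "I-") + ann["label"]
                first = False
    return labels


text = "Angela visited Heidelberg"
# Hypothetical tokens: [CLS] angela visited heidel ##berg [SEP]
offsets = [(0, 0), (0, 6), (7, 14), (15, 21), (21, 25), (0, 0)]
annotations = [
    {"start": 0, "end": 6, "label": "per"},
    {"start": 15, "end": 25, "label": "geo"},
]

print(char_spans_to_token_labels(offsets, annotations))
# ['O', 'B-per', 'O', 'B-geo', 'I-geo', 'O']
```

Note that here the offsets index into the whole string, unlike the per-word offsets you get for pre-split input.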

Also the official tutorial (Token classification) has a function to do something similar:

def tokenize_and_align_labels(examples):
    # `tokenizer` is a fast tokenizer defined elsewhere in the tutorial.
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:  # Later pieces of the same word are ignored too.
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs
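
The same logic for a single sentence, as a minimal sketch that runs without a tokenizer. The `word_ids` list is a hand-written stand-in for what `word_ids()` returns (`None` for special tokens, otherwise the index of the word each token came from). The function name `align_with_word_ids` and the example values are my assumptions.

```python
def align_with_word_ids(word_ids, word_labels, ignore_index=-100):
    """Keep the label on the first token of each word, ignore everything else."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(ignore_index)  # special token
        elif word_idx != previous_word_idx:
            label_ids.append(word_labels[word_idx])  # first token of a word
        else:
            label_ids.append(ignore_index)  # continuation of the same word
        previous_word_idx = word_idx
    return label_ids


# Hypothetical: [CLS] angela visited heidel ##berg [SEP]
word_ids = [None, 0, 1, 2, 2, None]
ner_tags = [1, 0, 2]  # one made-up tag id per word

print(align_with_word_ids(word_ids, ner_tags))
# [-100, 1, 0, 2, -100, -100]
```

Compared to the offset-based version, this indexes the labels by word id directly, so it doesn't need a running counter.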