26 May 2022

Three stories

Three in-the-trenches-type stories. The first two happened when I was a Werkstudent (working student), the third when I was working full-time.

Bachelor’s thesis

Native language identification on English tweets.

In: English-language tweets, represented by things like tokens, length, punctuation, and other typical features.
Out: the geographic area (US, India, Russia, …) where the author lives, as a proxy for their first language.

I decided to try TensorFlow's tf.data and tf.estimator for that; why not learn something new.

My very first version performed incredibly well. There had to be an error somewhere. There was.

Due to my unfamiliarity with the libraries, I left the lat/lon in the input data. So the network learned to predict the country from the coordinates instead of identifying the native language from the text.
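
A minimal sketch of the kind of fix this needs, assuming each example is a plain dict of features. The key names `lat` and `lon` and the helper `drop_leaky_features` are hypothetical; the original pipeline used tf.data, not plain Python.

```python
# Hypothetical feature names: anything that gives away the label directly.
LEAKY_KEYS = {"lat", "lon"}


def drop_leaky_features(example, leaky_keys=LEAKY_KEYS):
    """Return a copy of the example without the label-leaking features."""
    return {key: value for key, value in example.items() if key not in leaky_keys}


example = {"tokens": ["hello", "world"], "length": 2, "lat": 50.4, "lon": 30.5}
clean = drop_leaky_features(example)  # only "tokens" and "length" remain
```

Applying a filter like this before the data reaches the model means the network only ever sees features derived from the tweet text.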

Most interesting bug in my career

I was trying to train a language model and see what happens if random words in the training data are replaced by garbage tokens.

With a certain probability a sentence was picked, and within a picked sentence a token was, again with a certain probability, masked or replaced by random text.

In the town where I [PREDICT_ME] born, lived a [GARBAGE], who sailed the sea.
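
The corruption step can be sketched roughly like this. It is not the original code: the function name, the probabilities, and the choice to corrupt each token independently are assumptions, and a fixed `[GARBAGE]` placeholder stands in for the random text.

```python
import random

MASK = "[PREDICT_ME]"
GARBAGE = "[GARBAGE]"  # placeholder for "random text"


def corrupt(tokens, rng, p_sentence=0.5, p_token=0.15, p_mask=0.5):
    """Corrupt one tokenized sentence.

    The sentence is selected with probability p_sentence. In a selected
    sentence every token is corrupted independently with probability
    p_token: it becomes MASK with probability p_mask, GARBAGE otherwise.
    """
    if rng.random() >= p_sentence:
        return list(tokens)  # sentence not picked, return it unchanged
    out = []
    for token in tokens:
        if rng.random() < p_token:
            out.append(MASK if rng.random() < p_mask else GARBAGE)
        else:
            out.append(token)
    return out


rng = random.Random(0)  # seeded so runs are reproducible
sentence = "In the town where I was born".split()
corrupted = corrupt(sentence, rng, p_sentence=1.0, p_token=0.3)
```

Passing the `random.Random` instance in explicitly keeps the corruption reproducible, which matters a lot once you have to replay a failing run.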

Training LMs is hard, my dataset was HUGE, and training took days/weeks.

At random, sometimes on day 2, 3 or 4, training crashed with a nondescript error; other times it didn’t, and training went on.

I think it was TF1, and outputting the specific examples that made it fail was nontrivial, though I don’t remember the details.

Debugging it was nightmarish. Running a program that takes a week, SOMETIMES crashes, and then has to be started all over again was very, very frustrating.

The cause:

In one place in the code I tokenized by splitting the sentence on any whitespace, in another by splitting on single spaces.
In the infinite-billion-sentences dataset, TWO sentences contained a double space.

The dataset was shuffled, and sooner or later we got to one of these sentences. Split on single spaces, it produced a token of length 0:

>>> "We all live in a yellow  submarine".split(' ')
['We', 'all', 'live', 'in', 'a', 'yellow', '', 'submarine']

If the randomization picked one of these sentences and in it SPECIFICALLY the '' token, everything crashed.
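
To make the mismatch concrete, here is a small comparison, assuming the two code paths amounted to `str.split()` and `str.split(" ")`. Filtering out empty tokens is one possible guard, not necessarily the fix that was applied.

```python
sentence = "We all live in a yellow  submarine"  # note the double space

by_whitespace = sentence.split()   # splits on runs of whitespace, no empty tokens
by_space = sentence.split(" ")     # splits on every single space, yields ''

# One defensive option: drop empty tokens so both paths agree.
safe = [token for token in by_space if token]
```

The more robust option is to tokenize in exactly one place and reuse that function everywhere.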

For this to happen, the randomness gods had to provide a very specific set of circumstances: each unlikely on its own, but likely to occur at least once during a week-long training run.
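
A back-of-the-envelope version of that argument: if a single encounter with a bad sentence crashes with probability p, the chance of at least one crash in n encounters is 1 - (1 - p)^n. All numbers below are made up for illustration; I don't have the real ones.

```python
def p_at_least_once(p_single, n_trials):
    """Probability that an event with per-trial probability p_single
    happens at least once in n_trials independent trials."""
    return 1.0 - (1.0 - p_single) ** n_trials


# Hypothetical values, not the real training configuration:
p_sentence_picked = 0.5   # sentence selected for corruption
p_empty_token_hit = 0.15  # the '' token is the one that gets randomized
encounters = 10           # times a bad sentence is seen during one run

p_crash_per_encounter = p_sentence_picked * p_empty_token_hit
p_crash_per_run = p_at_least_once(p_crash_per_encounter, encounters)
```

With these invented numbers a run crashes a bit more often than not, which is the kind of "sometimes it dies, sometimes it doesn't" behaviour that made the bug so hard to pin down.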

Quickly creating a synthetic-but-human dataset

I’m really proud of this one.

The problem

Given: 10 example forms. Mostly consistent ones, containing a table with rows. Some rows contain text, some contain signatures, all by different people. The forms themselves were similar but not identical, and scan quality, pen colors, printing etc. varied.

Task: detect which rows and cells have signatures. Rows might have gaps, rows could have text but no signature. And do it really fast: both providing the proof of concept and the runtime of the network had to be quick.¹

Problem: you need data to evaluate your result … and data to train your NN, if you use one. And you need it fast.

Attempt 1 - vanilla computer vision

I tried vanilla computer vision first: derotate, find the horizontal/vertical lines, derive the cells from them, and check whether a cell has more stuff (darker color) than some average value.
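
The last step of that pipeline can be sketched in plain Python. This assumes derotation and line detection are already done, so the row and column boundaries are simply given; the function names, the nested-list image format, and the `factor` threshold are all hypothetical.

```python
def cell_darkness(image, top, bottom, left, right):
    """Mean darkness of a cell; pixels are grayscale, 0 = black, 255 = white."""
    total = 0
    count = 0
    for y in range(top, bottom):
        for x in range(left, right):
            total += 255 - image[y][x]
            count += 1
    return total / count


def filled_cells(image, row_edges, col_edges, factor=1.5):
    """Return the (row, col) cells that are clearly darker than the average cell."""
    scores = {}
    for r in range(len(row_edges) - 1):
        for c in range(len(col_edges) - 1):
            scores[(r, c)] = cell_darkness(
                image, row_edges[r], row_edges[r + 1], col_edges[c], col_edges[c + 1]
            )
    average = sum(scores.values()) / len(scores)
    return {cell for cell, score in scores.items() if score > factor * average}


# Toy 4x4 "page" split into 2x2 cells, with ink only in the top-left cell.
image = [[255] * 4 for _ in range(4)]
image[0][0] = image[0][1] = image[1][0] = 0
found = filled_cells(image, [0, 2, 4], [0, 2, 4])
```

A per-cell threshold like this has no notion of where the ink came from, so writing that spills over from a neighbouring row counts just as much as writing that belongs in the cell.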

It failed because the signatures and text belonged to people filling in hundreds of such forms per day, so the writing could be bigger than the row it was on. You never knew whether it was someone who wrote a shorter text, or text ‘leaking’ from the rows above/below.

Attempt 2 - ML

I knew that Detectron2 could be trained to detect handwriting/signatures on relatively few examples. But again, we need evaluation data to see how well we do. 10 documents are too few.

… okay, but what would evaluation look like?

I don’t really need the pixel-positions of the signatures. I need just rows/cells info. And filling forms is fast.

Crowdsourcing the eval dataset.

I wrote a pseudorandom number generator that, based on a number, returns which rows in which columns have to be filled.

Bob, your number is 123. This means:

in document 1 you fill row 2,4,7 in column 1, row 3,8,9 in column 3, and sign row 2 and 7 in column 3. Save it as 123-1.pdf

in document 2, …

Zip the results and email them to me as 123.zip
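
A generator like that can be sketched with a seeded `random.Random`. Everything concrete here is an assumption made for illustration: the function name, the table size, which columns take text or signatures, and how many rows get filled.

```python
import random


def assignment(participant, doc, n_rows=10, text_columns=(1, 3),
               signature_column=3, n_text=3, n_signatures=2):
    """Deterministic fill plan for one participant and one document."""
    # Same (participant, doc) always gives the same plan; assumes doc < 1000.
    rng = random.Random(participant * 1000 + doc)
    rows = range(1, n_rows + 1)
    text = {col: sorted(rng.sample(rows, n_text)) for col in text_columns}
    # Signatures only go into rows not already used for text in that column.
    taken = set(text.get(signature_column, []))
    free = [row for row in rows if row not in taken]
    signatures = sorted(rng.sample(free, n_signatures))
    return {"text": text, "signatures": {signature_column: signatures}}


plan = assignment(123, 1)  # the plan for Bob's first document
```

Because the plan depends only on the participant number and the document index, it never has to be stored: it can be regenerated later from the file name alone.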

Then 100+ forms were printed out and distributed. Everyone got a number, N forms, and instructions on which cells to fill.

Then they sent us the zips, from which we could immediately derive the ground truth: we knew which cells contained what in each image without having to label anything manually.
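
Recovering the labels from the file names could look roughly like this. The naming scheme `123-1.pdf` is from the instructions above; the helper names and the stand-in `toy_plan` are hypothetical, and `plan_for` is meant to be the same deterministic generator that produced the instructions.

```python
import re

SCAN_NAME = re.compile(r"^(\d+)-(\d+)\.pdf$")


def parse_scan_name(name):
    """Split '123-1.pdf' into (participant, document) = (123, 1)."""
    match = SCAN_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return int(match.group(1)), int(match.group(2))


def ground_truth(names, plan_for):
    """Map every scan to the plan its participant was told to follow."""
    return {name: plan_for(*parse_scan_name(name)) for name in names}


def toy_plan(participant, doc):
    """Stand-in for the real generator, for illustration only."""
    return {"signed_rows": [(participant + doc) % 10]}


labels = ground_truth(["123-1.pdf", "123-2.pdf"], toy_plan)
```

In practice the names could come straight from `zipfile.ZipFile(path).namelist()`. Raising on unexpected names is deliberate: a misnamed file is better caught here than silently left out of the evaluation set.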

The dataset was not too synthetic: we got different people using different pens, different handwriting and different signatures to fill the forms.²

That dataset then gave us a good idea of how well our network would perform in real life.
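
With cell-level ground truth, scoring a prediction reduces to comparing two sets of cells. I don't remember which metric we actually used, so take per-cell precision and recall as an assumed example; the function name and the toy cells are made up.

```python
def precision_recall(predicted, expected):
    """Per-cell precision and recall for detected signatures."""
    predicted, expected = set(predicted), set(expected)
    hits = len(predicted & expected)
    precision = hits / len(predicted) if predicted else 1.0
    recall = hits / len(expected) if expected else 1.0
    return precision, recall


expected = {(2, 3), (7, 3)}    # (row, column) cells that should be signed
predicted = {(2, 3), (8, 3)}   # cells the detector flagged
precision, recall = precision_recall(predicted, expected)
```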

¹ This meant that we had to use only one NN, because running separate ones for detecting different things would have been slow. So we had to train a checkpoint that predicts ALL THE CLASSES, ergo we couldn’t train something to detect handwriting using another dataset. (Using a NN to find signatures and using something else to find other classes would work, though.) Honestly I don’t remember that part well, but there were reasons why we couldn’t just train something to find handwriting on another dataset without generating our own.
² Much better than we could have done by, say, generating ‘handwriting’ using fonts and distortions and random strings.

In the middle of the desert I can say whatever I want.

serhii.net