serhii.net

In the middle of the desert you can say anything you want

20 Dec 2023

GBIF iNaturalist plantNet duplicates

TL;DR

  • Pl@ntNet does no deduplication that I could find
  • iNat
    • allows literally duplicating an observation with a click, keeping the image files the same, e.g. if there are multiple plants to identify in one picture
    • etiquette allows duplicate observations in many cases, unless it’s e.g. literally the same photo (but separate observations for different growth stages are fine)
      • good karma to mention related observations in the description
    • does no deduplication by images etc. on its side, but may soon have something
  • GBIF
    • is more worried about adding the same herbarium plant twice from different collections, or the same picture from both iNat and Pl@ntNet
      • does deduplication more like record linking
    • has a clustering/collection feature, but it looks for similar observations only across datasets, not within the same one. The blog post about it is awesome, though, and the approach is usable for deduplication within a single dataset as well.
  • Recommendation: compute SHA1 hashes of the images ourselves
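The SHA1 recommendation could look roughly like the sketch below. It assumes the images are already downloaded as local files; the function names `sha1_of_file` and `find_duplicate_files` are made up for illustration. Note that this only catches byte-identical files: a cropped or re-encoded copy of the same photo gets a different hash.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large images do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(paths):
    """Group files by SHA1 digest and keep only groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[sha1_of_file(path)].append(Path(path))
    return {sha1: files for sha1, files in groups.items() if len(files) > 1}
```

Something like `find_duplicate_files(Path("images").glob("*.jpg"))` (the directory name is an assumption) would then return a mapping from hash to the list of files sharing it.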

iNaturalist

Conclusion

  • duplicates can happen if:
    • bad: multiple pictures of the same organism, taken at the same time, are uploaded as different observations
    • good: different organisms in the same picture
      • using the duplicate function -> URI seems to be preserved
      • manually uploading the picture again, possibly cropped
  • etiquette seems to be: if the observations are different parts of the same picture, mention that in the description
    • we can look for links to observations in descriptions
  • no automatic duplicate detection is included in iNat. Wikipedia uses SHA1 for this purpose.
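Looking for links to other observations in descriptions could be sketched as below. It assumes the description is available as a plain string and that links use the usual `inaturalist.org/observations/<id>` form; the function name `linked_observation_ids` is hypothetical.

```python
import re

# Matches links such as https://www.inaturalist.org/observations/12345
OBSERVATION_LINK = re.compile(
    r"https?://(?:www\.)?inaturalist\.org/observations/(\d+)"
)


def linked_observation_ids(description):
    """Return the IDs of all observations linked in a description."""
    # Treat a missing description (None) as empty text.
    return [int(found) for found in OBSERVATION_LINK.findall(description or "")]
```

Observations that link to each other this way are candidates for intentional duplicates (same picture, different organism) rather than accidental ones.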

Refs

Example instances

PlantNet

  • TL;DR nothing.
  • Found nothing useful after a quick Google search
  • Pl@ntNet automatically identified occurrences
    • TODO understand what that is, but I think it’s more about requests than about plants:

      - remove shared queries (already present in observation dataset)
      - remove duplicate session (keep the most recent query based on the session number)
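One possible reading of the second rule is "per session, keep only the newest query". A minimal sketch of that reading, not Pl@ntNet's actual code: the record layout (dicts with `session` and `timestamp` keys) and the function name are assumptions.

```python
def keep_latest_query_per_session(queries):
    """Keep only the most recent query for each session."""
    latest = {}
    for query in queries:
        current = latest.get(query["session"])
        # Replace the stored query if this one is newer.
        if current is None or query["timestamp"] > current["timestamp"]:
            latest[query["session"]] = query
    return list(latest.values())
```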

GBIF

Conclusion

Refs

Random

GBIF/iNat

Random / fun

Next steps

Look for GBIF/iNat/Pl@ntNet repos on GitHub and search them for mentions of duplicates
