20 Dec 2023

GBIF iNaturalist plantNet duplicates

TL;DR

Pl@ntNet does no deduplication that I could find
iNat
- lets you duplicate an observation with one click, e.g. when a photo shows several plants to identify; the image files stay the same
- etiquette allows duplicate observations in many cases, unless it’s e.g. literally the same photo (but separate observations for different growth stages are fine)
  - good karma to mention related observations in the description
- does no deduplication by image etc. on its side, but may get something soon
GBIF
- is more worried about the same herbarium specimen being added twice from different collections, or the same picture arriving from both iNat and Pl@ntNet
  - does deduplication more like record linking
- has a clustering/collection feature, but it looks for similar observations only across datasets, not within the same one. Still, the blog post about it is awesome, and its approach can be used for deduplication within a single dataset as well.
Recommendation: roll our own SHA1-based duplicate check

iNaturalist

Summary

duplicates can happen if:
- bad: multiple pics of the same organism, taken at the same time, are uploaded as different observations
- good: different organisms are on the same pic, handled by either
  - using the duplicate function -> URI seems to be preserved
  - manually uploading the picture again, possibly cropped
etiquette seems to be: if the observations cover different parts of the same picture, say so in the description
- so we can look for links to other observations in descriptions
iNat has no automatic duplicate detection. Wikipedia uses SHA1 for this purpose.
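A minimal sketch of what a SHA1-based check on our side could look like, assuming the images are already downloaded into a local directory. The function names and the directory layout are hypothetical. Only byte-identical files are caught: a cropped or re-encoded copy of the same photo gets a different hash.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(image_dir):
    """Group all files under image_dir by SHA1 (hypothetical helper).

    Returns a dict {sha1: [paths]} containing only hashes shared by
    more than one file, i.e. byte-identical copies.
    """
    groups = defaultdict(list)
    for path in sorted(Path(image_dir).rglob("*")):
        if path.is_file():
            groups[sha1_of_file(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Calling `find_exact_duplicates("images/")` on a download folder would return the groups of identical files; an empty dict means no exact duplicates were found.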

Refs

https://forum.inaturalist.org/t/what-do-the-community-guidelines-mean-by-duplicate-observations/39454

“Finally, you can totally upload the same picture multiple times if there’s multiple organisms in one picture that you would like identified - you want to have a separate observation for each organism. Usually if I do this I’ll make a note in the description of what I’m looking to have IDed.”
- duplicate observations OK if same plant different time etc.
https://www.inaturalist.org/posts/28325-tech-tip-tuesday-duplicating-observations: “perfectly okay to duplicate observations”, because species in the background etc.
- there’s literally a function for this
- otherwise one can crop the picture
https://forum.inaturalist.org/t/duplicate-obervations/18378/4
- Wikimedia uses SHA1 to identify duplicates
https://forum.inaturalist.org/t/duplicate-prevention-notify-observers-if-their-image-checksums-match-others-on-the-site/258/38
- iNat doesn’t have duplicate detection for pictures uploaded multiple times (the thread discusses adding this feature)
Random
- just came across a user who repeatedly submits pairs of some robber fly photo weeks apart. (https://forum.inaturalist.org/t/create-a-flag-category-for-duplicate-observations/29647/42)

Example instances

same pic, different crops:
- https://www.inaturalist.org/observations/56301457
- https://www.inaturalist.org/observations/55724066
Other observations mentioned in descriptions
- https://www.inaturalist.org/observations/26942475

PlantNet

TL;DR nothing.
Found nothing useful after a quick google search
Pl@ntNet automatically identified occurrences
- TODO understand what this is, but I think it’s about requests rather than about plants:
  
  remove shared queries (already present in observation dataset) - remove duplicate session (keep the most recent query based on the session number) -

GBIF

Summary

GBIF has clustering of records to find duplicates, but does it only between datasets, not within the same one
- REALLY cool pages on the topic and in general: Identifying potentially related records - How does the GBIF data-clustering feature work? - GBIF Data Blog
- Examples
  - how to search for it: Occurrence search (“has related records”) lists all Pl@ntNet observations that are part of a cluster
  - an observation that is part of a cluster: Occurrence 4011692074
Darwin Core has associatedOccurrences for kinda similar ones, incl. by parsing the descriptions for mentions etc.: Darwin Core Quick Reference Guide - Darwin Core
- not used for duplicate detection as of the awesome blog post above from 2021-11-04
- some people use it in other ways: https://discourse.gbif.org/t/differences-in-use-between-associatedoccurrences-and-associatedorganisms/2561/2
- I can’t find how to access it from the API
Their main threat model is the same herbarium specimen being scanned and put into different collections, or the same plant photo being uploaded to two different platforms (e.g. Pl@ntNet + iNat) and ending up in GBIF twice
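Since the clustering only compares across datasets, a within-dataset check has to happen on our side. Below is a rough sketch of one way to find candidates: group records that share species, day and (rounded) coordinates. The choice of fields and the rounding precision are my assumptions, not GBIF's actual clustering rules; the field names (`key`, `taxonKey`, `eventDate`, `decimalLatitude`, `decimalLongitude`) are the ones found in GBIF occurrence JSON.

```python
from collections import defaultdict


def candidate_key(record, decimals=4):
    """Build a grouping key for a GBIF-style occurrence dict.

    Returns None when a needed field is missing or unparsable,
    so such records are never grouped with anything.
    """
    try:
        lat = round(float(record["decimalLatitude"]), decimals)
        lon = round(float(record["decimalLongitude"]), decimals)
        day = record["eventDate"][:10]  # keep only YYYY-MM-DD
        return (record["taxonKey"], day, lat, lon)
    except (KeyError, TypeError, ValueError):
        return None


def group_candidates(records):
    """Return lists of occurrence keys that share the same grouping key."""
    groups = defaultdict(list)
    for rec in records:
        key = candidate_key(rec)
        if key is not None:
            groups[key].append(rec.get("key"))
    return [ids for ids in groups.values() if len(ids) > 1]
```

The groups are only candidates for a manual look, not confirmed duplicates. Records without coordinates are skipped entirely, so duplicates missing location info would slip through here as well.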

Refs

Has clustering of records that appear to be similar:
- Identifying potentially related records - How does the GBIF data-clustering feature work? - GBIF Data Blog
  - very thorough
  - downsides
    - clustering compares only across datasets, not within the same one!
    - runs once every few weeks
  - Biodiversity Data Use (less cool page mentioning duplications)
  - More theory on this: New data-clustering feature aims to improve data quality and reveal cross-dataset connections
- Occurrence search (“has related records”)
  - Example: Occurrence 4036191318
- in API:
  - New data-clustering feature aims to improve data quality and reveal cross-dataset connections
    
    matching similar entries in individual fields across different datasets
  - https://api.gbif.org/v1/occurrence/1141981356/experimental/related
  - a plain curl https://api.gbif.org/v1/occurrence/4011664186 | jq -C | less shows an isInCluster field but nothing more
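The two API calls above can be chained, as in the sketch below: fetch the occurrence, and only if `isInCluster` is set, ask the experimental endpoint for its related records. The helper names are hypothetical, and I make no assumption about the structure of the `related` payload; it is returned as parsed JSON to inspect by hand.

```python
import json
from urllib.request import urlopen

API = "https://api.gbif.org/v1/occurrence"


def fetch_json(url):
    """GET a URL and parse the response body as JSON."""
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def related_records(occurrence_key, fetch=fetch_json):
    """Return the raw 'related' payload, or None if not in a cluster.

    `fetch` is injectable so the logic can be tried without network access.
    """
    occurrence = fetch(f"{API}/{occurrence_key}")
    if not occurrence.get("isInCluster"):
        return None
    return fetch(f"{API}/{occurrence_key}/experimental/related")
```

For example, `related_records(4011664186)` would issue the same request as the curl call above and then, if the record is clustered, the `experimental/related` one. As the endpoint name says, it is experimental and may change.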
GBIF has associatedOccurrences
- Darwin Core Quick Reference Guide - Darwin Core
- not yet used in clustering
- parses descriptions etc. for links
Darwin Core Resource Relationship – Extension: the Darwin Core documentation on relationships between records
Duplicate occurrence records - Data Publishing - GBIF community forum
- user distinguishes three types of duplicates: exact, strict, relaxed
- discussions of using bash/cli magic to parse occurrences.txt to find them
  - ref. A data cleaner’s cookbook - Content
  - BASHing data: Partial duplicates
  - all with a more record linking vibe
- I’m not aware of any backend or external packages (e.g. in R or Python) that can tidy a Darwin Core dataset
- Fortunately or unfortunately, Darwin Core datasets are complex beasts that don’t lend themselves to automated checking and fixing. For this reason people (not backend routines) are the best Darwin Core data cleaners. The code recipes I use are freely available on the Web and I (and now others) are happy to train others in their use.
Duplicate observations across datasets - GBIF community forum
- Something that I often hear repeated on iNaturalist and BugGuide is that posting the same observation on both platforms results in the observation being ingested twice by GBIF.
- TL;DR: posted twice, but not detected as a duplicate because the record had no location info on one of the platforms

Random

Developer Blog: “I noticed that the GBIF data portal has fewer records than it used to – what happened?” about removing duplicates from GBIF in 2012

GBIF/iNat

Looking for observations with links in description

creeping thistle from Norfolk County, ON, Canada on July 12, 2019 at 08:44 AM by Jeremy Hussell. Same plants as in this previous observation: https://www.inaturalist.org/observations/28646599 · iNaturalist == Occurrence 3005247358
“occurrence remarks” in GBIF are iNat’s “description”
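A small sketch of how such links could be pulled out of the remarks text, assuming the description is available as a string (in Darwin Core the term is `occurrenceRemarks`). The regular expression is my own and only matches full observation URLs like the one in the example above; the function name is hypothetical.

```python
import re

# Matches e.g. https://www.inaturalist.org/observations/28646599
OBSERVATION_LINK = re.compile(
    r"https?://(?:www\.)?inaturalist\.org/observations/(\d+)"
)


def linked_observation_ids(remarks):
    """Return the iNaturalist observation IDs mentioned in a remarks string."""
    if not remarks:
        return []
    return [int(obs_id) for obs_id in OBSERVATION_LINK.findall(remarks)]
```

An occurrence whose remarks yield a non-empty list explicitly points at another observation, which makes it a good candidate for "same plant, different observation".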

Random / fun

https://community.openstreetmap.org/t/semi-automated-tree-additions/102197/57?page=4 OpenStreetMap discussion about automatically adding trees and the resulting need for deduplication

Next steps

Look for GBIF/iNat/Pl@ntNet repos on GitHub and search them for mentions of duplicates

Nel mezzo del deserto posso dire tutto quello che voglio.

serhii.net
