GBIF iNaturalist plantNet duplicates
TL;DR
- plantNet does no deduplication I could find
- iNat
- lets you duplicate an observation with one click, e.g. when one photo shows several plants to identify; the image files stay the same
- etiquette allows duplicating observations in many cases, unless it’s e.g. literally the same photo of the same organism (but separate observations for different growth stages are fine)
- good karma to mention related observations in the description
- does no deduplication by images etc. on its side, but may soon have something
- GBIF
- is more worried about adding the same herbarium plant twice from different collections or same picture both from iNat and Pl@ntNet
- does deduplication more like record linking
- has a clustering feature, but it looks for similar observations only across datasets, not within the same one. The blog post describing it is awesome, though, and its approach could be reused for deduplication within a single dataset as well.
- Recommendation: roll SHA1 ourselves
iNaturalist
Summary
- duplicates can happen if:
- bad: multiple pics of the same organism at same time as different observations
- good: different organisms on the same pic
- using the duplicate function -> URI seems to be preserved
- manually upload picture, possibly cropped
- etiquette seems to be: if observations show different parts of the same picture, say so in the description
- we can look for links to observations in descriptions
- no automatic duplicate detection included in iNat. Wikipedia uses SHA1 for this purpose.
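A minimal sketch of what a SHA1-based check could look like on our side, assuming the images are available as local files. The function names and the directory layout are made up for illustration. Note that this only catches byte-identical files: a cropped or re-encoded copy of the same photo gets a different hash.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(image_dir):
    """Group all files under image_dir by SHA-1 digest.

    Returns {digest: [paths]} only for digests shared by more than one file.
    """
    by_hash = defaultdict(list)
    for path in sorted(Path(image_dir).rglob("*")):
        if path.is_file():
            by_hash[sha1_of_file(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Calling `find_exact_duplicates("images/")` (hypothetical folder) would return one entry per group of identical files; everything else is treated as unique.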
Refs
- “Finally, you can totally upload the same picture multiple times if there’s multiple organisms in one picture that you would like identified - you want to have a separate observation for each organism. Usually if I do this I’ll make a note in the description of what I’m looking to have IDed.”
- duplicate observations OK if same plant different time etc.
- https://www.inaturalist.org/posts/28325-tech-tip-tuesday-duplicating-observations: “perfectly okay to duplicate observations”, because of species in the background etc.
- there’s literally a function for this
- otherwise one can crop the picture
- https://forum.inaturalist.org/t/duplicate-obervations/18378/4
- wikimedia uses SHA1 to identify duplicates
- iNat doesn’t have duplicate detection for pictures uploaded multiple times (discussion on adding this feature)
Random
- “just came across a user who repeatedly submits pairs of some robber fly photo weeks apart.” (https://forum.inaturalist.org/t/create-a-flag-category-for-duplicate-observations/29647/42)
Example instances
- same pic, different crops:
- Other observations mentioned in descriptions
PlantNet
- TL;DR nothing.
- Found nothing useful after a quick google search
- Pl@ntNet automatically identified occurrences
- TODO: understand what this is; I think it’s about requests rather than about plants:
- remove shared queries (already present in observation dataset)
- remove duplicate session (keep the most recent query based on the session number)
GBIF
Summary
- GBIF has clustering of records to find duplicates, but does it only between datasets, not within the same one
- REALLY cool pages on the topic and in general: Identifying potentially related records - How does the GBIF data-clustering feature work? - GBIF Data Blog
- Examples
- how to search for it: all Pl@ntNet observations that are part of a cluster: Occurrence search (“has related records”)
- an observation that is part of a cluster: Occurrence 4011692074
- Darwin Core has associatedOccurrences for kinda similar ones, incl. by parsing the descriptions for mentions etc.: Darwin Core Quick Reference Guide - Darwin Core
- not used for duplicate detection as of the awesome blog post above from 2021-11-04
- some people use it in other ways: https://discourse.gbif.org/t/differences-in-use-between-associatedoccurrences-and-associatedorganisms/2561/2
- I can’t find how to access it from the API
- Their main threat model is the same herbarium specimen scanned and put into different collections, or the same plant photo uploaded to two different platforms (e.g. Pl@ntNet + iNat) ending up in GBIF twice
Refs
- Has clustering of records that appear to be similar:
- Identifying potentially related records - How does the GBIF data-clustering feature work? - GBIF Data Blog
- very thorough
- downsides
- clustering compares only across datasets, not within the same one!
- runs once every few weeks
- Biodiversity Data Use (less cool page mentioning duplications)
- More theory on this: New data-clustering feature aims to improve data quality and reveal cross-dataset connections
- Occurrence search (“has related records”)
- Example: Occurrence 4036191318
- in API:
- New data-clustering feature aims to improve data quality and reveal cross-dataset connections
matching similar entries in individual fields across different datasets
- https://api.gbif.org/v1/occurrence/1141981356/experimental/related
- vanilla
curl -s https://api.gbif.org/v1/occurrence/4011664186 | jq -C . | less -R
has isInCluster, but nothing more
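The two API calls above could be combined roughly like this. It is only a sketch: the endpoint paths and the `isInCluster` field are the ones noted above, the helper names are mine, and I am not assuming anything about the shape of the `experimental/related` response, so it is returned as-is.

```python
import json
from urllib.request import urlopen

API_ROOT = "https://api.gbif.org/v1/occurrence"


def occurrence_url(key):
    """URL of the plain occurrence record."""
    return f"{API_ROOT}/{key}"


def related_url(key):
    """URL of the experimental related-records endpoint for an occurrence."""
    return f"{API_ROOT}/{key}/experimental/related"


def fetch_json(url, timeout=30):
    """Download a URL and parse the body as JSON."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def is_in_cluster(occurrence):
    """True if the occurrence record carries a truthy isInCluster flag."""
    return bool(occurrence.get("isInCluster", False))


def related_records(key):
    """Return the raw related-records JSON, or None if the record is not clustered."""
    record = fetch_json(occurrence_url(key))
    if not is_in_cluster(record):
        return None
    return fetch_json(related_url(key))
```

`related_records(4011664186)` would hit the live API twice, so it is best run by hand rather than in a loop over a whole dataset.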
- GBIF has associatedOccurrences
- Darwin Core Quick Reference Guide - Darwin Core
- not yet used in clustering
- parses descriptions etc. for links
- Darwin Core Resource Relationship – Extension: the Darwin Core documentation on relationships between records
- Duplicate occurrence records - Data Publishing - GBIF community forum
- user distinguishes three types of duplicates: exact, strict, relaxed
- discussions of using bash/cli magic to parse occurrences.txt to find them,
- ref. A data cleaner’s cookbook - Content 1
- BASHing data: Partial duplicates
- all with a more record linking vibe
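For comparison with the bash recipes, a rough Python sketch of the same record-linking idea: group rows of a tab-separated occurrence file by a handful of fields and report groups with more than one record. The choice of key fields and the normalisation are my assumptions, not the forum's definitions of exact/strict/relaxed duplicates, and the ID column name may differ per file.

```python
import csv
from collections import defaultdict

# Assumption: these Darwin Core terms are a reasonable "same record" key.
DEFAULT_KEY_FIELDS = (
    "scientificName",
    "eventDate",
    "decimalLatitude",
    "decimalLongitude",
    "recordedBy",
)


def normalise(value):
    """Collapse whitespace and ignore case so trivial differences still match."""
    return " ".join((value or "").split()).casefold()


def read_occurrences(path):
    """Yield rows of a tab-separated occurrence file as dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def group_partial_duplicates(rows, key_fields=DEFAULT_KEY_FIELDS, id_field="occurrenceID"):
    """Return lists of record IDs whose key fields match after normalisation."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(normalise(row.get(field)) for field in key_fields)
        if any(key):  # skip rows where every key field is empty
            groups[key].append(row.get(id_field))
    return [ids for ids in groups.values() if len(ids) > 1]
```

Usage would be something like `group_partial_duplicates(read_occurrences("occurrences.txt"))`; candidates found this way still need a human look.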
- “I’m not aware of any backend or external packages (e.g. in R or Python) that can tidy a Darwin Core dataset”
- “Fortunately or unfortunately, Darwin Core datasets are complex beasts that don’t lend themselves to automated checking and fixing. For this reason people (not backend routines) are the best Darwin Core data cleaners. The code recipes I use are freely available on the Web and I (and now others) are happy to train others in their use.”
- Duplicate observations across datasets - GBIF community forum
- “Something that I often hear repeated on iNaturalist and BugGuide is that posting the same observation on both platforms results in the observation being ingested twice by GBIF.”
- TL;DR: posted twice, but not detected as a duplicate because the record had no location info on one of the platforms
Random
- Developer Blog: “I noticed that the GBIF data portal has fewer records than it used to – what happened?” about removing duplicates from GBIF in 2012
GBIF/iNat
Looking for observations with links in description
- creeping thistle from Norfolk County, ON, Canada on July 12, 2019 at 08:44 AM by Jeremy Hussell. Same plants as in this previous observation: https://www.inaturalist.org/observations/28646599 · iNaturalist == Occurrence 3005247358
- “occurrence remarks” in GBIF are iNat’s “description”
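Finding such cross-references could be sketched as a regex over the remarks/description text. The pattern and function name are mine; it only matches full iNaturalist observation URLs like the one in the example above, not free-text mentions.

```python
import re

# Matches links such as https://www.inaturalist.org/observations/28646599
OBSERVATION_LINK = re.compile(
    r"https?://(?:www\.)?inaturalist\.org/observations/(\d+)"
)


def linked_observation_ids(remarks):
    """Return the iNaturalist observation IDs linked in a text, in order, without repeats."""
    if not remarks:
        return []
    ids = []
    for match in OBSERVATION_LINK.finditer(remarks):
        obs_id = int(match.group(1))
        if obs_id not in ids:
            ids.append(obs_id)
    return ids
```

Applied to the occurrence remarks of a GBIF record (or the description of an iNat observation), a non-empty result marks the record as a candidate for a related or duplicated observation.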
Random / fun
- https://community.openstreetmap.org/t/semi-automated-tree-additions/102197/57?page=4 openstreetmap automatically adding trees and needing deduplication
Next steps
Look for GBIF/iNat/Pl@ntNet repos on GitHub and search them for mentions of duplicates