In the middle of the desert you can say anything you want
I was trying to do a join based on two columns, one of which is a pd.Timestamp.
What I learned: If you’re trying to join/merge two DataFrames not by their indexes,
pandas.DataFrame.merge
is better (yay precise language) than
pandas.DataFrame.join.
For some reason I had issues with df.join(..., on=[col1, col2]), even with df.set_index([col1, col2]).join(df2.set_index...); then it went out of memory and I gave up.
Then an SO answer1 said:
use merge if you are not joining on the index
I tried it, and df.merge(..., on=col2) magically worked!
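A minimal sketch of the kind of merge that worked, with made-up column names (not the real data): merge accepts a list for on= when joining on several columns, a Timestamp column included.

import pandas as pd

# hypothetical data: 'ts' is a pd.Timestamp column, 'key' an ordinary one
left = pd.DataFrame({
    'ts': pd.to_datetime(['2023-01-01', '2023-01-02']),
    'key': ['a', 'b'],
    'x': [1, 2],
})
right = pd.DataFrame({
    'ts': pd.to_datetime(['2023-01-01', '2023-01-02']),
    'key': ['a', 'b'],
    'y': [10, 20],
})

# merge joins on columns rather than the index; on= can be a list
merged = left.merge(right, on=['ts', 'key'], how='inner')
print(merged)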
Spent hours trying to understand what was happening.
TL;DR: categorical dtypes inside groupbys show ALL categories, even if there are no instances of a specific category in the actual data.
# Shows all categories, including OTHER
df_item[df_item['item.item_category'] != "OTHER"].groupby(['item.item_category']).sum()

# Cast the categorical column to plain strings
df_item['item.item_category'] = df_item['item.item_category'].astype(str)

# Now shows only the three remaining categories
df_item[df_item['item.item_category'] != "OTHER"].groupby(['item.item_category']).sum()
Rel. thread: groupby with categorical type returns all combinations · Issue #17594 · pandas-dev/pandas
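A minimal self-contained repro of this behaviour (toy column names, not the real data); note that groupby() also takes an observed= parameter, which as far as I understand is the intended way to drop unused categories without casting to str:

import pandas as pd

df = pd.DataFrame({
    "cat": pd.Categorical(["A", "B", "OTHER"]),
    "val": [1, 2, 3],
})

# With observed=False (the long-time default), the filtered-out "OTHER"
# category still gets a row, because it is still part of the dtype
print(df[df["cat"] != "OTHER"].groupby("cat", observed=False).sum())

# observed=True keeps only the categories actually present in the data
print(df[df["cat"] != "OTHER"].groupby("cat", observed=True).sum())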
Both snippets below work! Seaborn is smart and handles pandas groupby results as-is:
sns.histplot(data=gbc,
             x='items_available',
             hue="item.item_category",
             )
sns.histplot(data=gbc.reset_index(),
             x='items_available',
             hue="item.item_category",
             )
TL;DR:
df.loc[row_indexer, col_indexer] = value
col_indexer can be a column that doesn’t exist yet! And row_indexer can be anything, including something based on a groupby filter.
Below, the groupby filter has dropna=False, which also returns the rows that don’t match the filter, giving a Series with the same index as the main df:
# E.g. this groupby filter - NB dropna=False
df_item.groupby(['item.item_id']).filter(lambda x: x.items_available.max() > 0, dropna=False)['item.item_id']

# Then we use that in the condition - a nice arbitrary example, with `item.item_id` not being the index of the DF
df_item.loc[df_item['item.item_id'] == df_item.groupby(['item.item_id']).filter(lambda x: x.items_available.max() > 0, dropna=False)['item.item_id'], 'item_active'] = True
I’m not sure whether this is the “best” way to incorporate groupby results, but it seems to work OK for now.
In particular, the remaining rows end up with NaN instead of False; this can be worked around, but it’s ugly:
df_item['item_active'] = df_item['item_active'].notna()
# For plotting purposes
sns.histplot(data=df_item.notna(), ... )
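For comparison, a sketch of a possibly cleaner way to do the same thing (assuming the same df_item columns as above; this is not from the original notes): groupby().transform() returns a Series aligned with the original index, so the boolean can be assigned directly and the non-matching rows come out as False rather than NaN.

# max items_available per item.item_id, broadcast back onto every row
df_item['item_active'] = (
    df_item.groupby('item.item_id')['items_available'].transform('max') > 0
)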
Pandas Filter by Column Value - Spark By {Examples} has more examples of conditions:
# From https://sparkbyexamples.com/pandas/pandas-filter-by-column-value/
df.loc[df['Courses'] == value]
df.loc[df['Courses'] != 'Spark']
df.loc[df['Courses'].isin(values)]
df.loc[~df['Courses'].isin(values)]
df.loc[(df['Discount'] >= 1000) & (df['Discount'] <= 2000)]
df.loc[(df['Discount'] >= 1200) & (df['Fee'] >= 23000 )]
df[df["Courses"] == 'Spark']
df[df['Courses'].str.contains("Spark")]
df[df['Courses'].str.lower().str.contains("spark")]
df[df['Courses'].str.startswith("P")]
mwouts/itables: Pandas DataFrames as Interactive DataTables:
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)
It kept “loading”. I set the notebook to ‘trusted’ as per the help, which didn’t help. But this did:
init_notebook_mode(all_interactive=True, connected=True)
(connected=True makes it load the libraries dynamically from the internet, and is not the default.)
This allows more interesting interactive dynamic tables, incl. things like sorting by column etc.
Courtesy of ChatGPT, after 15 minutes of googling didn’t turn up anything immediately copy-pasteable.
I’ll deal with the existential crisis from this later.
To explode the cells containing dictionaries in a pandas DataFrame and append them as additional columns, you can use the apply() method along with the pd.Series() constructor. Here’s an example:
import pandas as pd

# create example DataFrame
df = pd.DataFrame({
    'id': [1, 2],
    'info': [
        {'name': 'Alice', 'age': 25},
        {'name': 'Bob', 'age': 30, 'city': 'New York'}
    ]
})

# use apply() with pd.Series to explode the dictionaries and append them as columns
df = pd.concat([df.drop('info', axis=1), df['info'].apply(pd.Series)], axis=1)
print(df)
Output:
   id   name  age      city
0   1  Alice   25       NaN
1   2    Bob   30  New York
In this example, we first create a DataFrame with two rows, where the info column contains dictionaries. We then use the apply() method with pd.Series() to explode the dictionaries and create additional columns for each key in the dictionaries. Finally, we concatenate the exploded columns with the original DataFrame using pd.concat().
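For reference, a variant that I believe also works, using pd.json_normalize instead of apply(pd.Series) (usually faster on larger frames); same example data as above:

import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'info': [
        {'name': 'Alice', 'age': 25},
        {'name': 'Bob', 'age': 30, 'city': 'New York'},
    ]
})

# json_normalize builds a DataFrame with one column per dict key
expanded = pd.json_normalize(df['info'].tolist())
df = pd.concat([df.drop(columns='info'), expanded], axis=1)
print(df)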
Related: 230529-1413 Plants datasets taxonomy
Citizen science (similar to [..] participatory/volunteer monitoring) is scientific research conducted with participation from the general public
most citizen science research publications being in the fields of biology and conservation
can mean multiple things, usually using citizens acting as volunteers to help monitor/classify/.. stuff (but also citizens initiating stuff; also: educating the public about scientific methods, e.g. in schools)
allowed users to upload photos of a plant species and its components, enter its characteristics (such as color and size), compare it against a catalog photo and classify it. The classification results are juried by crowdsourced ratings.4
“Here we present two Pl@ntNet citizen science initiatives used by conservation practitioners in Europe (France) and Africa (Kenya).”
@fuccilloAssessingAccuracyCitizen2015 (2015) z>
Volunteers demonstrated greatest overall accuracy identifying unfolded leaves, ripe fruits, and open flowers.
@crallAssessingCitizenScience2011 Assessing citizen science data quality (2011) z>
@chenPerformanceEvaluationDeep2021 (2021) z>
- Georeferenced plant observations from herbarium, plot, and trait records;
- Plot inventories and surveys;
- Species geographic distribution maps;
- Plant traits;
- A species-level phylogeny for all plants in the New World;
- Cross-continent, continent, and country-level species lists.
@ortizReviewInteractionsBiodiversity2021 A review of the interactions between biodiversity, agriculture, climate change, and international trade (2021) z/d>
(e.g. strong colour variation and the transformation of 3D objects after pressing like fruits and flowers) <@waldchenMachineLearningImage2018 (2018) z>
@goeau2021overview (2021) z> <@goeauAIbasedIdentificationPlant2021 (2021) z>
“Lab-based setting is often used by biologist that brings the specimen (e.g. insects or plants) to the lab for inspecting them, to identify them and mostly to archive them. In this setting, the image acquisition can be controlled and standardised. In contrast to field-based investigations, where images of the specimen are taken in-situ without a controllable capturing procedure and system. For fieldbased investigations, typically a mobile device or camera is used for image acquisition and the specimen is alive when taking the picture (Martineau et al., 2017).” <@waldchenMachineLearningImage2018 (2018) z>
@pearseDeepLearningPhenology2021 (2021) z>, but there DL failed less without flowers than non-DL), but sometimes don’t <@walkerHarnessingLargeScaleHerbarium2022 (2022) z/d>
@goodwinWidespreadMistakenIdentity2015 (2015) z/d>
EDIT: separate post about this: 230529-1413 Plants datasets taxonomy
We can classify existing datasets into two types:
@giselssonPublicImageDatabase2017 (2017) z>), common weeds in Denmark dataset <@leminenmadsenOpenPlantPhenotype2020 (2020) z/d> etc.
FloraCapture requests contributors to photograph plants from at least five precisely defined perspectives
There are some special datasets, satellite and whatever, but especially:
@mamatAdvancedTechnologyAgriculture2022 Advanced Technology in Agriculture Industry by Implementing Image Annotation Technique and Deep Learning Approach (2022) z/d> has an excellent overview of these)
Additional info present in datasets or useful:
http://ceur-ws.org/Vol-2936/paper-122.pdf / <@goeau2021overview
(2021) z> ↩︎
https://hal-lirmm.ccsd.cnrs.fr/lirmm-03793591/file/paper-153.pdf / <@goeau2022overview
(2022) z> ↩︎
IBM and SAP open up big data platforms for citizen science | Guardian sustainable business | The Guardian ↩︎
Deep Learning with Taxonomic Loss for Plant Identification - PMC ↩︎
Goal: Interact with Zotero from within Obsidian
Solution: “Citations”1 plugin for Obsidian, “Better Bibtex”2 plugin for Zotero!
Neat bits:
There’s a configurable “Citations: Insert Markdown Citation” thing!
<_`@{{citekey}}` {{titleShort}} ({{year}}) [z]({{zoteroSelectURI}})/[d](https://doi.org/{{DOI}})_>
- {{citekey}}
- {{abstract}}
- {{authorString}}
- {{containerTitle}}
- {{DOI}}
- {{eprint}}
- {{eprinttype}}
- {{eventPlace}}
- {{page}}
- {{publisher}}
- {{publisherPlace}}
- {{title}}
- {{titleShort}}
- {{URL}}
- {{year}}
- {{zoteroSelectURI}}
hans/obsidian-citation-plugin: Obsidian plugin which integrates your academic reference manager with the Obsidian editor. Search your references from within Obsidian and automatically create and reference literature notes for papers and books. ↩︎
retorquere/zotero-better-bibtex: Make Zotero effective for us LaTeX holdouts ↩︎
<C-N> for me.
zotero:// links don’t work for me, and the default .desktop file they provide seems broken - TODO later
Gitstats is the best I know: tomgi/git_stats: GitStats is a git repository statistics generator.
gitstats /path/to/repo /path/to/output/dir
Generates comprehensive static HTML reports with graphs: authors, files, times of the day/week/month, …