07 Jun 2023

Vaex as faster pandas alternative

I have a larger-than-usual text-based dataset and need to analyze it, but pandas is slow (hell, even wc -l takes 50 seconds…)

Vaex: Pandas but 1000x faster - KDnuggets - that’s a way to catch one’s attention.

Reading files

I/O Kung-Fu: get your data in and out of Vaex — vaex 4.16.0 documentation

- vx.from_csv() reads a CSV into memory; kwargs get passed to pandas’ read_csv().
- vx.open() reads stuff lazily, but I can’t find a way to tell it that my .txt file is a CSV, and more critically, how to pass params like sep etc.
  - vx.from_ascii() has a parameter called seperator?! (see the vaex 4.16.0 API documentation)
- The first two support convert=, which converts stuff to things like HDF5; optionally, chunk_size= is the chunk size in lines. It’ll create $N/chunk_size$ chunks and concat them together at the end.
- Ways to limit stuff:
  - nrows= is the number of rows to read; works with convert etc.
  - usecols= limits reading to the given columns (by name, index, or callable); this also speeds things up, and by a lot.
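The options above could be combined roughly like this. A minimal sketch, assuming a hypothetical tab-separated data.txt with columns named id and text (file name, separator, column names and sizes are all made up). vaex is imported inside the functions, so the chunk arithmetic can be tried without it installed.

```python
import math


def expected_chunks(n_rows: int, chunk_size: int) -> int:
    """Number of intermediate chunks for n_rows lines: ceil(N / chunk_size)."""
    return math.ceil(n_rows / chunk_size)


def peek(path: str = "data.txt"):
    """Read only the first 1000 rows and two columns into memory."""
    import vaex  # lazy import; needs vaex installed

    # sep, nrows and usecols are forwarded to pandas.read_csv()
    return vaex.from_csv(path, sep="\t", nrows=1000, usecols=["id", "text"])


def convert(path: str = "data.txt"):
    """Convert the whole file to HDF5, one million lines per chunk."""
    import vaex  # lazy import; needs vaex installed

    # convert=True asks vaex to write an HDF5 copy and open that instead
    return vaex.from_csv(path, sep="\t", convert=True, chunk_size=1_000_000)


print(expected_chunks(10_500_000, 1_000_000))  # 11
```

Running peek() first is a cheap way to check the separator and column names before starting a long convert().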

Writing files

I can do df.export_hdf5() in vaex, but pandas can’t read the result. It may be related to the opposite problem: vaex can’t open pandas HDF5 files directly, because one stores the data row-wise and the other column-wise. (See FAQ)
When converting CSV to HDF5, it breaks if one of the columns was detected as an object; in my case it was a boolean. Objects are not supported¹, and my boolean column had been read as an object. Not a trivial situation, because converting that column to, say, int would have meant reading the entire file, which is just what I don’t want to do: I want to convert to HDF5 to make the data manageable.
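One common way a boolean column ends up as object in pandas’ reader is a missing value somewhere in it. Whether that was the cause here is an assumption. A minimal sketch with made-up data, pandas only:

```python
import io

import pandas as pd

# Hypothetical data: the last row has no value in the boolean column
csv_text = "id,flag\n1,True\n2,False\n3,\n"

inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["flag"].dtype)  # object, because of the missing value

# Asking for pandas' nullable boolean type avoids the object fallback
typed = pd.read_csv(io.StringIO(csv_text), dtype={"flag": "boolean"})
print(typed["flag"].dtype)  # boolean
```

Since vx.from_csv() forwards kwargs to read_csv(), a dtype= hint like this might be applied chunk by chunk during conversion. Whether vaex then exports the nullable type cleanly is not verified here.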

Doing stuff

The syntax is similar to pandas, but the documentation... I can’t put my finger on it, but somehow I don’t enjoy it.

Stupid way to find columns that are all NA

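# df is an existing vaex DataFrame; describe() gives one column per df column
l_desc = df.describe()
# Keep the names of columns whose NA count differs from the dataset length,
# i.e. the columns that are NOT entirely NA
not_empty_cols = list(l_desc.T[l_desc.T.NA!=df.count()].T.columns)
# Filter the description by them
interesting_desc = l_desc[not_empty_cols]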
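The same filter can be tried on a toy table. A minimal sketch, assuming the describe() result is a pandas DataFrame with one column per dataset column and a row labelled NA (which is what the .T.NA access implies). The column names a and b and the row count are made up.

```python
import pandas as pd

# Hypothetical stand-in for df.describe(): column "b" is entirely NA
n_rows = 3  # stands in for the dataset length
l_desc = pd.DataFrame({"a": {"count": 3, "NA": 0}, "b": {"count": 0, "NA": 3}})

# Row "NA" holds the number of missing values per column
na_counts = l_desc.loc["NA"]
not_empty_cols = list(na_counts[na_counts != n_rows].index)
interesting_desc = l_desc[not_empty_cols]

print(not_empty_cols)  # ['a']
```

Selecting the NA row with .loc avoids the double transpose and reads a bit more directly.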

¹ [BUG-REPORT] TypeError: Cannnot export column of type: object · Issue #2033 · vaexio/vaex

Nel mezzo del deserto posso dire tutto quello che voglio.

serhii.net
