Vaex as faster pandas alternative
I have a larger-than-usual text-based dataset, need to do analysis, pandas is slow (hell, even wc -l
takes 50 seconds…)
Vaex: Pandas but 1000x faster - KDnuggets - that’s a way to catch one’s attention.
Reading files
I/O Kung-Fu: get your data in and out of Vaex — vaex 4.16.0 documentation
- vx.from_csv() reads a CSV into memory; kwargs get passed to pandas' read_csv().
- vx.open() reads stuff lazily, but I can't find a way to tell it that my .txt file is a CSV and, more critically, how to pass params like sep etc.
- vx.from_ascii() has a parameter called seperator?! (API documentation for vaex library — vaex 4.16.0 documentation)
- The first two support convert=, which converts stuff to things like HDF5. Optionally, chunk_size= is the chunk size in lines: it'll create $N/chunk_size$ chunks and concat them together at the end.
- Ways to limit stuff:
  - nrows= is the number of rows to read, works with convert etc.
  - usecols= limits to columns by name, id or callable; speeds stuff up too, and by a lot.
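A rough sketch of how the options above fit together. The function names, the tab separator and the default sizes are made up for illustration; only from_csv(), convert=, chunk_size=, nrows=, usecols= and sep= come from the notes above, and I haven't timed any of this.

```python
import math


def expected_chunks(n_rows, chunk_size):
    """Number of intermediate chunks for n_rows lines (the last one may be partial)."""
    return math.ceil(n_rows / chunk_size)


def csv_to_hdf5(path, sep="\t", chunk_size=5_000_000):
    """Convert a delimited text file to HDF5 chunk by chunk and return the vaex DataFrame."""
    import vaex  # imported lazily so expected_chunks() works without vaex installed

    return vaex.from_csv(
        path,
        convert=True,           # write an HDF5 copy instead of keeping it all in memory
        chunk_size=chunk_size,  # lines per intermediate chunk
        sep=sep,                # extra kwargs are forwarded to pandas.read_csv
    )


def peek(path, columns, n_rows=10_000, sep="\t"):
    """Read only the first n_rows lines and the given columns, for a quick look."""
    import vaex

    return vaex.from_csv(path, sep=sep, nrows=n_rows, usecols=columns)
```

With 12 million lines and chunk_size=5_000_000, expected_chunks() gives 3: two full chunks and a partial one.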
Writing files
- I can do df.export_hdf5() in vaex, but pandas can't read that. It may be related to the opposite problem: vaex can't open pandas HDF5 files directly, because one saves them as rows, the other as columns. (See FAQ)
- When converting CSV to HDF5, it breaks if one of the columns was detected as an object; in my case it was a boolean. Objects are not supported [1], and the boolean column ends up as one. Not a trivial situation, because converting that to, say, int would have meant reading the entire file, which is just what I don't want to do: I want to convert to HDF5 to make it manageable.
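One possible way around the object column, not tried on the real data: since from_csv() forwards kwargs to pandas' read_csv(), a per-value converter could turn the boolean column into integers while each chunk is being read, so nothing has to be loaded whole. The helper names, the accepted spellings and the -1 sentinel for missing values are all assumptions made for this sketch.

```python
def bool_to_int(value):
    """Map the CSV text of a boolean cell to 1/0, with -1 standing in for missing."""
    text = value.strip().lower()
    if text in ("true", "1"):
        return 1
    if text in ("false", "0"):
        return 0
    return -1  # empty or unrecognised cell


def convert_with_flag(path, flag_column, chunk_size=5_000_000):
    """Convert a CSV to HDF5, reading flag_column through bool_to_int()."""
    import vaex  # imported lazily so bool_to_int() can be used on its own

    return vaex.from_csv(
        path,
        convert=True,
        chunk_size=chunk_size,
        # converters= is a pandas.read_csv option: {column name: function(str)}
        converters={flag_column: bool_to_int},
    )
```

The sentinel keeps the column a plain integer type; whether -1 is an acceptable stand-in for "missing" depends on what the analysis does with it later.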
Doing stuff
Syntax is similar to pandas, but the documentation is... I can't quite put my finger on it, but I don't enjoy it.
Stupid way to find columns that are all NA
l_desc = df.describe()
# We find column names that have length_of_dataset NA values
not_empty_cols = list(l_desc.T[l_desc.T.NA!=df.count()].T.columns)
# Filter the description by them
interesting_desc = l_desc[not_empty_cols]
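A slightly less roundabout sketch of the same idea, assuming df.count(expression) returns the number of non-missing values for that expression, which is how I read the vaex API docs. all_na_columns is a name invented here, and it does one count per column, so it may well be slower than the describe() route on wide datasets.

```python
def all_na_columns(df):
    """Return the names of columns that contain no non-missing values at all."""
    empty = []
    for name in df.get_column_names():
        # count() on an expression skips missing/NaN values,
        # so a result of 0 means the whole column is NA
        if int(df.count(df[name])) == 0:
            empty.append(name)
    return empty
```

The non-empty columns are then just the ones from df.get_column_names() that are not in the returned list.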
In the middle of the desert I can say whatever I want.