serhii.net

In the middle of the desert you can say anything you want

07 Jun 2023

Vaex as faster pandas alternative

I have a larger-than-usual text-based dataset, need to do analysis, pandas is slow (hell, even wc -l takes 50 seconds…)

Vaex: Pandas but 1000x faster - KDnuggets - that’s a way to catch one’s attention.

Reading files

I/O Kung-Fu: get your data in and out of Vaex — vaex 4.16.0 documentation

  • vx.from_csv() reads a CSV in memory, kwargs get passed to pandas’ read_csv()
  • vx.open() reads stuff lazily, but I can’t find a way to tell it that my .txt file is a CSV, and more critically - how to pass params like sep etc
  • the first two support convert= that converts stuff to things like HDFS, optionally chunk_size= is the chunk size in lines. It’ll create $N/chunk_size$ chunks and concat together at the end.
  • Ways to limit stuff:
    • nrows= is the number of rows to read, works with convert etc.
    • usecols= limits to columns by name, id or callable, speeds up stuff too and by a lot

Writing files

  • I can do df.export_hdf5() in vaex, but pandas can’t read that. It may be related to the opposite problem - vaex can’t open pandas HDF5 files directly, because one saves them as rows, other as columns. (See FAQ)
  • When converting csv to hdf5, it breaks if one of the columns was detected as an object, in my case it was a boolean. Objects are not supported1, and booleans are objects. Not trivial situation because converting that to, say, int, would have meant reading the entire file - which is just what I don’t want to do, I want to convert to hdf to make it manageable.

Doing stuff

Syntax is similar to pandas, but the documentation is somehow .. can’t put my finger on it, but I don’t enjoy it somehow.

Stupid way to find columns that are all NA

l_desc = df.describe()
# We find column names that have length_of_dataset NA values
not_empty_cols = list(l_desc.T[l_desc.T.NA!=df.count()].T.columns)
# Filter the description by them
interesting_desc = l_desc[not_empty_cols]
Nel mezzo del deserto posso dire tutto quello che voglio.