Filtering csv files bigger than memory into a Pandas dataframe

Vincent Teyssier
3 min read · Aug 11, 2018


Very often we need to parse big csv files and load only the lines that match certain criteria into a dataframe. With files bigger than memory, however, we can't simply load everything into a dataframe and then select what we need. The solution is to parse the csv file in chunks and append only the needed rows to our dataframe. In this article I also give a few tools to look at memory usage in general.

First, a few methods to determine the different sizes we will need (available memory, csv file, dataframe).

Finding the amount of RAM available:

To do that we need a library called psutil

pip install psutil

And use the following to get the amount of available RAM:

import psutil
svmem = psutil.virtual_memory()
print(svmem.available)

Getting the size of your csv:

We will use the os library for that. If your dataset is split across several files, we can also iterate through the folder and sum their sizes:

import os

# single-file dataset:
os.path.getsize('./dataset.csv')

# same but for a folder containing several files:
total_filesize = 0
for filename in os.listdir('./dataset_folder/'):
    total_filesize += os.path.getsize(os.path.join('./dataset_folder/', filename))
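
Comparing these two numbers tells us whether the file can be loaded in one go or has to be read in chunks. Here is a minimal sketch, assuming the single-file dataset path used above:

import os
import psutil

csv_size = os.path.getsize('./dataset.csv')
available_ram = psutil.virtual_memory().available

# the in-memory dataframe is usually larger than the csv on disk,
# so anything close to the available RAM is a candidate for chunking
if csv_size >= available_ram:
    print('csv is bigger than the available RAM: read it in chunks')
else:
    print('csv should fit in memory')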

Getting the size of a dataframe:

The following command returns the memory size, in bytes, occupied by a dataframe df, including its index.

df.memory_usage(index=True).sum() 

Another option is to use the sys library as follows:

import sys
sys.getsizeof(df)
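
Note that for dataframes containing string (object) columns, memory_usage only counts the column references by default; passing deep=True inspects the actual objects and usually reports a larger, more realistic figure. A quick comparison on a hypothetical dataframe:

import pandas as pd

# hypothetical dataframe with a string column
df = pd.DataFrame({'name': ['alice', 'bob', 'carol'] * 1000,
                   'value': range(3000)})

print(df.memory_usage(index=True).sum())             # shallow estimate
print(df.memory_usage(index=True, deep=True).sum())  # counts the string contents too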

Processing csv files in chunks

Now that we have the necessary bricks, let's read the first lines of our csv and see how much memory they take.

import pandas as pd

df_sample = pd.read_csv('./dataset.csv', nrows=10)
df_sample_size = df_sample.memory_usage(index=True).sum()

In the following example, I dedicate 1 GB of RAM to the processing. You can also warn yourself during the process if the result dataframe is getting too big to be kept in memory (see the sketch further down).

To do that we use the chunksize and iterator parameters, which determine the number of rows we read at each iteration:

# define a chunksize that would occupy a maximum of 1Gb
# we divide by 10 because we sampled 10 lines in df_sample
# and keep only the integer part of the result
my_chunk = int((1000000000 / df_sample_size) / 10)

# create the iterator
iter_csv = pd.read_csv(
    './dataset.csv',
    iterator=True,
    chunksize=my_chunk)

# concatenate according to a filter to build our result dataframe
df_result = pd.concat(
    [chunk[chunk['my_field'] > 10]
     for chunk in iter_csv])

In the concatenation we pass the filter we want to apply to the data, so that only the lines matching it are retained. In this example we keep the lines where the column my_field has a value above 10, but you can use any classic pandas way of filtering your data.
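
If you also want the warning mentioned earlier when the result grows beyond your budget, you can replace the one-line concat with an explicit loop. This is a minimal sketch, assuming the same dataset, filter, my_chunk value and 1 GB budget as above:

filtered_chunks = []
accumulated_size = 0

for chunk in pd.read_csv('./dataset.csv', iterator=True, chunksize=my_chunk):
    filtered = chunk[chunk['my_field'] > 10]
    accumulated_size += filtered.memory_usage(index=True).sum()
    filtered_chunks.append(filtered)
    if accumulated_size > 1000000000:
        print('Warning: the filtered result exceeds the 1Gb budget')

df_result = pd.concat(filtered_chunks)

The filter itself can be anything pandas accepts, for example several conditions combined with & or |.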

Full code below so you don't need to stitch the pieces of code above together:

import psutil
import pandas as pd

svmem = psutil.virtual_memory()
print(svmem.available)

PATH = r"C:\tmp\dataset\tf_train.csv"
df_sample = pd.read_csv(PATH, nrows=10)
df_sample_size = df_sample.memory_usage(index=True).sum()
print(df_sample_size)
print(df_sample)

# define a chunksize that would occupy a maximum of 1Gb
# we divide by 10 because we sampled 10 lines in df_sample
my_chunk = int((1000000000 / df_sample_size) / 10)  # keep the integer part
print(my_chunk)

# create the iterator
iter_csv = pd.read_csv(
    PATH,
    iterator=True,
    chunksize=my_chunk)

# concatenate according to a filter to build our result dataframe
df_result = pd.concat(
    [chunk[chunk['n3'] > 0]
     for chunk in iter_csv])
print(df_result)
