How do you iterate over rows in a DataFrame in Python?

Never use iterrows and itertuples again

Image by author, emojis by OpenMoji (CC BY-SA 4.0).

When I started machine learning, I followed the guidelines and created my own features by combining multiple columns in my dataset. It’s all well and good, but the way I did it was horribly inefficient. I had to wait several minutes to do the most basic operations.

My problem was simple: I didn’t know the fastest way to iterate over rows in Pandas.

I often see people online using the same techniques I used to apply. It’s not elegant but it’s ok if you don’t have much data. However, if you process more than 10k rows, it quickly becomes an obvious performance issue.

In this article, I’m gonna give you the best way to iterate over rows in a Pandas DataFrame, with no extra code required. It’s not just about performance: it’s also about understanding what’s going on under the hood to become a better data scientist.

Let’s import a dataset in Pandas. In this case, I chose the one I worked on when I started: it’s time to fix my past mistakes! 🩹

You can run the code with the following Google Colab notebook.

This dataset has 22k rows and 43 columns with a combination of categorical and numerical values. Each row describes a connection between two computers.

Let’s say we want to create a new feature: the total number of bytes in the connection. We just have to sum two existing features: src_bytes and dst_bytes. Let’s look at different methods to calculate this new feature.
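The full dataset isn’t reproduced here, but the setup can be sketched with a tiny stand-in DataFrame (the values below are made up; only the column names src_bytes and dst_bytes match the real data):

```python
import pandas as pd

# Tiny stand-in for the real 22k-row dataset (hypothetical values)
df = pd.DataFrame({
    "src_bytes": [100, 250, 300],
    "dst_bytes": [50, 75, 25],
})

# Goal: a new feature equal to src_bytes + dst_bytes for every row
```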

❌❌ 1. Iterrows

According to the official documentation, iterrows() iterates "over the rows of a Pandas DataFrame as (index, Series) pairs". It converts each row into a Series object, which causes two problems:

  1. It can change the type of your data (dtypes);
  2. The conversion greatly degrades performance.

For these reasons, the ill-named iterrows() is the WORST possible method to actually iterate over rows.
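A minimal sketch of this pattern, using a toy DataFrame in place of the real dataset:

```python
import pandas as pd

df = pd.DataFrame({"src_bytes": [100, 250, 300],
                   "dst_bytes": [50, 75, 25]})

total = []
for index, row in df.iterrows():
    # row is a pd.Series: convenient, but expensive to build for every row
    total.append(row["src_bytes"] + row["dst_bytes"])
```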

10 loops, best of 5: 1.07 s per loop

Now let’s see slightly better techniques…

❌ 2. For loop with .loc or .iloc (3× faster)

This is what I used to do when I started: a basic for loop to select rows by index (with .loc or .iloc).

Why is it bad? Because DataFrames are not designed for this purpose. As with the previous method, rows are converted into Pandas Series objects, which degrades performance.

Interestingly enough, .iloc is faster than .loc. It makes sense, since Python doesn't have to check user-defined labels and can look directly at where the row is stored in memory.
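Here is a sketch of this approach with the same toy DataFrame (made-up values):

```python
import pandas as pd

df = pd.DataFrame({"src_bytes": [100, 250, 300],
                   "dst_bytes": [50, 75, 25]})

total = []
for i in range(len(df)):
    # Each df.iloc[i] still builds a pd.Series for the row
    total.append(df.iloc[i]["src_bytes"] + df.iloc[i]["dst_bytes"])
```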

10 loops, best of 5: 600 ms per loop
10 loops, best of 5: 377 ms per loop

Even this basic for loop with .iloc is 3 times faster than the first method!

❌ 3. Apply (4× faster)

The apply() method is another popular choice to iterate over rows. It creates code that is easy to understand but at a cost: performance is nearly as bad as the previous for loop.

This is why I would strongly advise you to avoid this function for this specific purpose (it's fine for other applications).

Note that I convert the resulting Series into a list using the to_list() method to obtain the same output as the other techniques.
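A minimal sketch of the apply() version (toy data, hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({"src_bytes": [100, 250, 300],
                   "dst_bytes": [50, 75, 25]})

# axis=1 applies the lambda to every row: a for loop in disguise
total = df.apply(lambda row: row["src_bytes"] + row["dst_bytes"],
                 axis=1).to_list()
```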

10 loops, best of 5: 282 ms per loop

The apply() method is a for loop in disguise, which is why the performance doesn't improve that much: it's only 4 times faster than the first technique.

❌ 4. Itertuples (10× faster)

If you know about iterrows(), you probably know about itertuples(). According to the official documentation, it iterates "over the rows of a DataFrame as namedtuples of the values". In practice, it means that rows are converted into tuples, which are much lighter objects than Pandas Series.

This is why itertuples() is a better version of iterrows(). This time, we need to access the values with an attribute (or an index). If you need to access them with a string instead (e.g., if the column name contains a space), you can use the getattr() function.
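Sketched on the same toy DataFrame (made-up values):

```python
import pandas as pd

df = pd.DataFrame({"src_bytes": [100, 250, 300],
                   "dst_bytes": [50, 75, 25]})

total = []
for row in df.itertuples():
    # Attribute access on a lightweight namedtuple
    total.append(row.src_bytes + row.dst_bytes)
    # Equivalent access by string: getattr(row, "src_bytes")
```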

10 loops, best of 5: 99.3 ms per loop

This is starting to look better: it is now 10 times faster than iterrows().

❌ 5. List comprehensions (200× faster)

List comprehensions are a fancy way to iterate over a list as a one-liner.

For instance, [print(i) for i in range(10)] prints numbers from 0 to 9 without any explicit for loop. I say "explicit" because Python actually processes it as a for loop if we look at the bytecode.

So why is it faster? Quite simply because we don't call the .append() method in this version.
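One way to write it as a comprehension is to zip the two columns and add each pair of values (a sketch on toy data; the real dataset isn’t shown here):

```python
import pandas as pd

df = pd.DataFrame({"src_bytes": [100, 250, 300],
                   "dst_bytes": [50, 75, 25]})

# One-liner over both columns: no .append() calls
total = [src + dst for src, dst in zip(df["src_bytes"], df["dst_bytes"])]
```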

100 loops, best of 5: 5.54 ms per loop

Indeed, this technique is 200 times faster than the first one! But we can still do better.

✅ 6. Pandas vectorization (1500× faster)

Until now, all the techniques used simply add up single values. Instead of adding single values, why not group them into vectors to sum them up? The difference between adding two numbers or two vectors is not significant for a CPU, which should speed things up.

On top of that, these operations run in optimized, compiled code instead of the Python interpreter, which is where most of the speedup comes from.

The syntax is also the simplest imaginable: this solution is extremely intuitive. Under the hood, Pandas takes care of vectorizing our data with optimized C code using contiguous memory blocks.
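The whole operation boils down to a single Series addition (toy data again):

```python
import pandas as pd

df = pd.DataFrame({"src_bytes": [100, 250, 300],
                   "dst_bytes": [50, 75, 25]})

# Whole-Series addition: no Python-level loop at all
df["total_bytes"] = df["src_bytes"] + df["dst_bytes"]
```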

1000 loops, best of 5: 734 µs per loop

This code is 1500 times faster than iterrows() and it is even simpler to write.

✅✅ 7. NumPy vectorization (1900× faster)

NumPy is designed for scientific computing. It has less overhead than Pandas methods because the Series are converted into np.array objects, stripping away index alignment and other bookkeeping. Otherwise, it relies on the same optimizations as Pandas vectorization.

There are two ways of converting a Series into a np.array: using .values or .to_numpy(). The official documentation has discouraged the former for years, which is why we're gonna use .to_numpy() in this example.
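The only change from the previous version is converting each Series before the addition (toy data, hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({"src_bytes": [100, 250, 300],
                   "dst_bytes": [50, 75, 25]})

# Convert each Series to a np.array, then add the arrays
df["total_bytes"] = df["src_bytes"].to_numpy() + df["dst_bytes"].to_numpy()
```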

1000 loops, best of 5: 575 µs per loop

We found our winner with a technique that is 1900 times faster than our first competitor! Let’s wrap things up.

🏆 Conclusion

The number of rows in the dataset can greatly impact the performance of certain techniques (image by author).

Don’t be like me: if you need to iterate over rows in a DataFrame, vectorization is the way to go! You can find the code to reproduce the experiments at this address. Vectorization is not harder to read, it doesn’t take longer to write, and the performance gain is incredible.

It’s not just about performance: understanding how each method works under the hood helped me to write better code. Performance gains are always based on the same techniques: transforming data into vectors and matrices to take advantage of parallel processing. Alas, this is often at the expense of readability. But it doesn’t have to be.

Iterating over rows is just one example, but it shows that, sometimes, you can have your cake and eat it too. 🎂

If you liked this article, follow me on Twitter @maximelabonne for more tips about data science and machine learning!
