If you iterate a pandas.DataFrame directly in a for loop, the column names are returned. To iterate over the columns or rows themselves, use the items(), iterrows(), and itertuples() methods. Note that iteritems() was deprecated in pandas 1.5.0 and removed in 2.0.0; items() is its replacement.
This article describes the following contents.
- Iterate pandas.DataFrame in for loop as it is
- Iterate columns of pandas.DataFrame
  - DataFrame.items()
- Iterate rows of pandas.DataFrame
  - DataFrame.iterrows()
  - DataFrame.itertuples()
- Iterate only specific columns
- Update values in for loop
- Speed comparison
For more information on the for statement in Python, see the following article.
- for loop in Python (with range, enumerate, zip, etc.)
Use the following pandas.DataFrame as an example.
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92
Iterate pandas.DataFrame in for loop as it is
If you iterate pandas.DataFrame in a for loop as is, the column names are returned in order.
for column_name in df:
    print(column_name)
# age
# state
# point
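As a small sketch of what this implies, the labels produced by iterating the DataFrame directly match df.columns, and each label can be used to look up its column inside the loop. The variable names below are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

# Iterating the DataFrame itself yields the column labels.
names_from_df = [name for name in df]
names_from_columns = list(df.columns)

print(names_from_df)
# ['age', 'state', 'point']
print(names_from_df == names_from_columns)
# True

# Each label can be used to select the corresponding column.
for name in df:
    print(name, df[name].tolist())
# age [24, 42]
# state ['NY', 'CA']
# point [64, 92]
```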
Iterate columns of pandas.DataFrame
DataFrame.items()
The items() method iterates over columns and yields (column name, Series) tuples, where the Series holds the contents of the column. It replaces iteritems(), which was deprecated in pandas 1.5.0 and removed in 2.0.0.
- pandas.DataFrame.items (pandas documentation)
for column_name, item in df.items():
    print(column_name)
    print('------')
    print(type(item))
    print(item)
    print('------')
    # Use iloc for positional access; item[0] is deprecated for a labeled Series.
    print(item.iloc[0], item['Alice'], item.Alice)
    print(item.iloc[1], item['Bob'], item.Bob)
    print('======\n')
# age
# ------
# <class 'pandas.core.series.Series'>
# Alice    24
# Bob      42
# Name: age, dtype: int64
# ------
# 24 24 24
# 42 42 42
# ======
#
# state
# ------
# <class 'pandas.core.series.Series'>
# Alice    NY
# Bob      CA
# Name: state, dtype: object
# ------
# NY NY NY
# CA CA CA
# ======
#
# point
# ------
# <class 'pandas.core.series.Series'>
# Alice    64
# Bob      92
# Name: point, dtype: int64
# ------
# 64 64 64
# 92 92 92
# ======
#
Iterate rows of pandas.DataFrame
The iterrows() and itertuples() methods iterate over rows. The itertuples() method is faster.
If you only need the values of particular columns, it is even faster to iterate over those columns individually, as explained in the section on iterating only specific columns. The results of the speed comparison are shown at the end.
DataFrame.iterrows()
The iterrows() method iterates over rows and returns (index, Series), a tuple with the index and the content as pandas.Series.
- pandas.DataFrame.iterrows — pandas 1.4.2 documentation
for index, row in df.iterrows():
    print(index)
    print('------')
    print(type(row))
    print(row)
    print('------')
    # Use iloc for positional access; row[0] is deprecated for a labeled Series.
    print(row.iloc[0], row['age'], row.age)
    print(row.iloc[1], row['state'], row.state)
    print(row.iloc[2], row['point'], row.point)
    print('======\n')
# Alice
# ------
# <class 'pandas.core.series.Series'>
# age      24
# state    NY
# point    64
# Name: Alice, dtype: object
# ------
# 24 24 24
# NY NY NY
# 64 64 64
# ======
#
# Bob
# ------
# <class 'pandas.core.series.Series'>
# age      42
# state    CA
# point    92
# Name: Bob, dtype: object
# ------
# 42 42 42
# CA CA CA
# 92 92 92
# ======
#
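Because each row is converted to a single Series, iterrows() does not preserve the per-column dtypes; the output above shows dtype: object for every row. The following sketch uses a hypothetical all-numeric DataFrame (df_num, with made-up columns qty and ratio) to show an integer column being upcast to float within a row, while itertuples() keeps the integer value.

```python
import pandas as pd

# Hypothetical example data: one integer column and one float column.
df_num = pd.DataFrame({'qty': [1, 2], 'ratio': [0.5, 1.5]})

# iterrows(): the row Series must hold both columns, so it is upcast to float64.
_, first_row = next(df_num.iterrows())
print(first_row.dtype)
# float64
print(isinstance(first_row['qty'], float))
# True

# itertuples(): each value keeps the type of its own column.
first_tuple = next(df_num.itertuples())
print(first_tuple)
# Pandas(Index=0, qty=1, ratio=0.5)
print(isinstance(first_tuple.qty, float))
# False
```

If the original types matter inside the loop, itertuples() is usually the safer choice of the two.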
DataFrame.itertuples()
The itertuples() method iterates over rows and returns a tuple of the index and the content. The first element of the tuple is the index.
- pandas.DataFrame.itertuples — pandas 1.4.2 documentation
By default, it returns a namedtuple named Pandas. Because it is a namedtuple, you can access each element by attribute name (.) as well as by position ([]).
for row in df.itertuples():
    print(type(row))
    print(row)
    print('------')
    print(row[0], row.Index)
    print(row[1], row.age)
    print(row[2], row.state)
    print(row[3], row.point)
    print('======\n')
# <class 'pandas.core.frame.Pandas'>
# Pandas(Index='Alice', age=24, state='NY', point=64)
# ------
# Alice Alice
# 24 24
# NY NY
# 64 64
# ======
#
# <class 'pandas.core.frame.Pandas'>
# Pandas(Index='Bob', age=42, state='CA', point=92)
# ------
# Bob Bob
# 42 42
# CA CA
# 92 92
# ======
#
A normal tuple is returned if the name parameter is set to None.
for row in df.itertuples(name=None):
    print(type(row))
    print(row)
    print(row[0], row[1], row[2], row[3])
    print('======\n')
# <class 'tuple'>
# ('Alice', 24, 'NY', 64)
# Alice 24 NY 64
# ======
#
# <class 'tuple'>
# ('Bob', 42, 'CA', 92)
# Bob 42 CA 92
# ======
#
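itertuples() also accepts an index parameter, and name can be any string. The sketch below omits the index with index=False and uses the arbitrary name Person. It also shows that a column label that is not a valid Python identifier is replaced by a positional name; the DataFrame df_odd and its column 'first name' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

# index=False drops the index element; name sets the namedtuple's type name.
rows = list(df.itertuples(index=False, name='Person'))
for row in rows:
    print(row)
# Person(age=24, state='NY', point=64)
# Person(age=42, state='CA', point=92)

# Hypothetical DataFrame with a label that is not a valid identifier.
df_odd = pd.DataFrame({'first name': ['Alice'], 'age': [24]})
odd_row = next(df_odd.itertuples())
print(odd_row)
# Pandas(Index=0, _1='Alice', age=24)
```

For such columns, access by position (odd_row[1]) or by the positional name (odd_row._1) still works.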
Iterate only specific columns
If you only need the elements of a particular column, you can iterate over that column directly.
Each column of a pandas.DataFrame is a pandas.Series.
print(df['age'])
# Alice    24
# Bob      42
# Name: age, dtype: int64

print(type(df['age']))
# <class 'pandas.core.series.Series'>
Iterating over a pandas.Series in a for loop yields its values in order. So, by selecting a column of a pandas.DataFrame and looping over it, you get the values of that column in order.
for age in df['age']:
    print(age)
# 24
# 42
You can also get the values of multiple columns with the built-in zip() function.
- zip() in Python: Get elements from multiple lists
for age, point in zip(df['age'], df['point']):
    print(age, point)
# 24 64
# 42 92
Use the index attribute if you want to get the index. As in the example above, you can get it together with other columns by zip().
print(df.index)
# Index(['Alice', 'Bob'], dtype='object')

print(type(df.index))
# <class 'pandas.core.indexes.base.Index'>

for index in df.index:
    print(index)
# Alice
# Bob

for index, state in zip(df.index, df['state']):
    print(index, state)
# Alice NY
# Bob CA
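As an alternative to zip(), the same pairs can be obtained with existing pandas methods. This is only a sketch of two options: Series.items() yields (index, value) pairs for a single column, and itertuples() can be called on a DataFrame reduced to the columns of interest.

```python
import pandas as pd

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

# Series.items() yields (index label, value) pairs for one column.
for name, age in df['age'].items():
    print(name, age)
# Alice 24
# Bob 42

# Select the columns first, then iterate over plain tuples without the index.
for age, point in df[['age', 'point']].itertuples(index=False, name=None):
    print(age, point)
# 24 64
# 42 92

# Collected into lists for reuse.
age_items = list(df['age'].items())
age_point_pairs = list(df[['age', 'point']].itertuples(index=False, name=None))
```

The speed of these variants is not measured in this article, so zip() over the individual columns remains the approach covered by the comparison at the end.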
Update values in for loop
The pandas.Series returned by the iterrows() method is generally a copy, not a view, so modifying it does not update the original data.
for index, row in df.iterrows():
    row['point'] += row['age']

print(df)
#        age state  point
# Alice   24    NY     64
# Bob     42    CA     92
You can update it by selecting elements of the original DataFrame with at[].
for index, row in df.iterrows():
    df.at[index, 'point'] += row['age']

print(df)
#        age state  point
# Alice   24    NY     88
# Bob     42    CA    134
See the following article on at[].
- pandas: Get/Set element values with at, iat, loc, iloc
However, in many cases, it is not necessary to use a for loop to update an element or to add a new column based on an existing column. It is simpler and faster to write without a for loop.
Same process without a for loop:
df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

df['point'] += df['age']

print(df)
#        age state  point
# Alice   24    NY     88
# Bob     42    CA    134
You can add a new column.
df['new'] = df['point'] + df['age'] * 2

print(df)
#        age state  point  new
# Alice   24    NY     88  136
# Bob     42    CA    134  218
You can also apply NumPy functions to each element of a column.
df['age_sqrt'] = np.sqrt(df['age'])

print(df)
#        age state  point  new  age_sqrt
# Alice   24    NY     88  136  4.898979
# Bob     42    CA    134  218  6.480741
For strings, various methods are provided to process the columns directly. The following is an example of converting to lower case and selecting the first character.
- pandas: Handle strings (replace, strip, case conversion, etc.)
- pandas: Slice substrings from each element in columns
df['state_0'] = df['state'].str.lower().str[0]

print(df)
#        age state  point  new  age_sqrt state_0
# Alice   24    NY     88  136  4.898979       n
# Bob     42    CA    134  218  6.480741       c
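Updates that depend on a condition can often be written without a loop as well. The following is a minimal sketch, assuming a made-up rule (add 10 points where age is under 30, then label each row by a threshold of 80). It uses boolean indexing with loc[] and numpy.where(); the column name grade is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [24, 42], 'state': ['NY', 'CA'], 'point': [64, 92]},
                  index=['Alice', 'Bob'])

# Update only the rows that satisfy the condition.
df.loc[df['age'] < 30, 'point'] += 10
print(df['point'].tolist())
# [74, 92]

# Build a new column from a condition, element by element, without a loop.
df['grade'] = np.where(df['point'] >= 80, 'high', 'low')
print(df['grade'].tolist())
# ['low', 'high']
```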
Speed comparison
Compare the speed of iterrows(), itertuples(), and the method of specifying columns.
Use a pandas.DataFrame with 100 rows and 10 columns as an example. It contains only numeric elements, and both the row labels (index) and the column labels (columns) are the default sequential integers.
- numpy.arange(), linspace(): Generate ndarray with evenly spaced values
- pandas: Get first/last n rows of DataFrame with head(), tail(), slice
import numpy as np
import pandas as pd

# pd.np was removed in pandas 2.0, so NumPy is imported directly.
df = pd.DataFrame(np.arange(1000).reshape(100, 10))

print(df.shape)
# (100, 10)

print(df.head())
#     0   1   2   3   4   5   6   7   8   9
# 0   0   1   2   3   4   5   6   7   8   9
# 1  10  11  12  13  14  15  16  17  18  19
# 2  20  21  22  23  24  25  26  27  28  29
# 3  30  31  32  33  34  35  36  37  38  39
# 4  40  41  42  43  44  45  46  47  48  49

print(df.tail())
#       0    1    2    3    4    5    6    7    8    9
# 95  950  951  952  953  954  955  956  957  958  959
# 96  960  961  962  963  964  965  966  967  968  969
# 97  970  971  972  973  974  975  976  977  978  979
# 98  980  981  982  983  984  985  986  987  988  989
# 99  990  991  992  993  994  995  996  997  998  999
Note that the code below uses the Jupyter Notebook magic command %%timeit and does not work when run as a Python script.
- Measure execution time with timeit in Python
%%timeit
for i, row in df.iterrows():
    pass
# 4.53 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
for t in df.itertuples():
    pass
# 981 µs ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
for t in df.itertuples(name=None):
    pass
# 718 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
for i in df[0]:
    pass
# 15.6 µs ± 446 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
for i, j, k in zip(df[0], df[4], df[9]):
    pass
# 46.1 µs ± 588 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
for t in zip(df[0], df[1], df[2], df[3], df[4], df[5], df[6], df[7], df[8], df[9]):
    pass
# 147 µs ± 3.78 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
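Outside Jupyter Notebook, a similar comparison can be run as a plain script with the standard library's timeit module. The following is a rough sketch rather than a reproduction of the measurements above: the helper function names and the repeat counts are arbitrary choices, and the timings depend on the environment, so no expected output is shown.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1000).reshape(100, 10))


def loop_iterrows():
    for _ in df.iterrows():
        pass


def loop_itertuples():
    for _ in df.itertuples():
        pass


def loop_itertuples_plain():
    for _ in df.itertuples(name=None):
        pass


def loop_zip_all_columns():
    for _ in zip(*(df[c] for c in df.columns)):
        pass


candidates = {
    'iterrows()': loop_iterrows,
    'itertuples()': loop_itertuples,
    'itertuples(name=None)': loop_itertuples_plain,
    'zip() over all columns': loop_zip_all_columns,
}

NUMBER = 20  # calls per measurement (arbitrary)
results = {}
for label, func in candidates.items():
    # Take the best of three measurements and convert to seconds per call.
    best = min(timeit.repeat(func, number=NUMBER, repeat=3))
    results[label] = best / NUMBER

for label, seconds in results.items():
    print(f'{label}: {seconds * 1e6:.1f} µs per loop')
```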
iterrows() is slow because it converts each row to pandas.Series.
itertuples() is faster than iterrows(), but the method of specifying columns is the fastest. In the example environment, it is faster than itertuples() even if all columns are specified.
As the number of rows increases, iterrows() becomes even slower. You should try using itertuples() or column specification in such a case.
Of course, as mentioned above, it is best to avoid a for loop altogether when it is not necessary.