If you have ever worked with datasets, you know that you often need to extract specific pieces of information from the data as a whole. If you use Python and handle data with pandas, filtering data is quite easy: there are many different ways to select rows from a DataFrame based on column values.
In this tutorial, we will explore various techniques for selecting rows from a DataFrame. By the end, you will be able to filter and extract specific rows of interest from your data, making it easier to analyze and draw insights.
Method 1: Boolean Indexing
The most straightforward method to select rows based on column values is boolean indexing. To create a boolean mask, you can use comparison operators (`>`, `<`, `>=`, `<=`, `==`, `!=`) or boolean operators (`|` for OR, `&` for AND, `~` for NOT). You can then use this mask to select the corresponding rows from the DataFrame by passing it inside square brackets.
You can also combine multiple conditions using parentheses and boolean operators, as in the example below, where we select rows matching either of two cities. This lets you build more complex boolean masks that select rows based on multiple conditions.
```python
import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['New York', 'Paris', 'London', 'Sydney']})

# select rows where Age is greater than 30
df_gt_30 = df[df['Age'] > 30]

# select rows where City is 'Paris' or 'Berlin'
df_paris_berlin = df[(df['City'] == 'Paris') | (df['City'] == 'Berlin')]
```
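The `!=` and `~` operators mentioned above work the same way; here is a quick sketch of using them to exclude rows rather than select them:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['New York', 'Paris', 'London', 'Sydney']})

# select rows where City is NOT 'Paris' using !=
df_not_paris = df[df['City'] != 'Paris']

# the same result, negating an equality mask with ~
df_not_paris_alt = df[~(df['City'] == 'Paris')]
```

Both expressions produce the same three-row result, so you can use whichever reads better in context.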
In the above sample DataFrame we have three columns (`Name`, `Age`, and `City`) and four rows. The resulting DataFrame after boolean indexing contains only the selected rows but all columns. If you want to choose specific columns as well, you can pass a list of column names to the `loc` indexer. Consider the following example:
```python
df_gt_30_names = df.loc[df['Age'] > 30, ['Name', 'City']]
```
This will select only the `Name` and `City` columns for the rows where `Age` is greater than 30.
Boolean indexing can also be used with the `loc` method. In pandas, the `loc` method is used to select data from a DataFrame based on the labels of rows and columns. Boolean indexing with the `loc` method allows you to select rows based on a boolean condition.
Here’s an example:
```python
import pandas as pd

# create a DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['New York', 'Paris', 'London', 'Sydney']})

# boolean indexing with loc
condition = df['Age'] > 30
result = df.loc[condition]
print(result)
```
In the above code, the `condition` variable holds a boolean Series, which is `True` for rows where the Age is greater than 30 and `False` for rows where the Age is less than or equal to 30. The `loc` method is then used to select the rows where the condition is `True`.
The result of this code will be a new DataFrame that contains only the rows where the Age is greater than 30:
```
      Name  Age    City
2  Charlie   35  London
3    David   40  Sydney
```
Method 2: Using isin for a List of Values
If you have a list of values and want to select the DataFrame rows whose column value appears in that list, you can use the `isin` method.
Suppose you have a DataFrame `df` with columns `Name` and `Age`:
```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
                   'Age': [25, 30, 35, 40]})
```
Now you want to select only the rows where the `Name` column is either `Alice` or `Charlie`. The following code will do the job.
```python
selected_rows = df[df['Name'].isin(['Alice', 'Charlie'])]
```
This will create a new DataFrame `selected_rows` that contains only the rows where the `Name` column is either `Alice` or `Charlie`.
Note that the `isin` method takes a list of values to match against and returns a boolean Series that can be used to select the desired rows from the DataFrame using the indexing operator `[]`.
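Because `isin` returns a boolean Series, you can also invert it with `~` to keep only the rows that are *not* in the list. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Dave'],
                   'Age': [25, 30, 35, 40]})

# invert the isin mask with ~ to exclude Alice and Charlie
not_selected = df[~df['Name'].isin(['Alice', 'Charlie'])]
```

This leaves only the `Bob` and `Dave` rows, which is handy for "everything except this list" filters.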
Method 3: Where Method
Another way to filter a DataFrame in pandas is to use the `where` method. The `where` method returns a new DataFrame that is a copy of the original, but the values in the rows that do not meet the specified condition are set to `NaN`. This means that the resulting DataFrame has the same shape as the original, but some of its values will be missing.
Consider the following code:
```python
import pandas as pd

# create a DataFrame
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['New York', 'Paris', 'London', 'Sydney']})

# use the where method to select rows where Age > 30
df_filtered = df.where(df['Age'] > 30)
print(df_filtered)
```
Output:
```
      Name   Age    City
0      NaN   NaN     NaN
1      NaN   NaN     NaN
2  Charlie  35.0  London
3    David  40.0  Sydney
```
As you can see, it has returned a new DataFrame where the rows that do not meet the condition have been set to `NaN`.
To remove the rows with missing values, you can use the `dropna` method:
```python
df_filtered = df.where(df['Age'] > 30).dropna()
print(df_filtered)
```
Output:
```
      Name   Age    City
2  Charlie  35.0  London
3    David  40.0  Sydney
```
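Instead of leaving `NaN` behind, `where` also accepts an `other` argument that supplies a replacement value for the positions that fail the condition. A quick sketch using a placeholder string:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'Age': [25, 30, 35, 40],
                   'City': ['New York', 'Paris', 'London', 'Sydney']})

# replace values in non-matching rows with '-' instead of NaN
df_replaced = df.where(df['Age'] > 30, other='-')
```

Note that mixing a string placeholder into numeric columns upcasts them to object dtype, so this is best suited for display rather than further computation.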
Method 4: Query Method
If you use SQL often, you might want to filter a DataFrame with a query. The `query` method will save you: it provides a way to filter and select rows from a DataFrame using a string expression. It allows you to write query-like syntax to filter rows based on certain conditions, making the code more concise and readable.
The syntax of the `query()` method is as follows:
```python
DataFrame.query(expr, inplace=False, **kwargs)
```
where:

- `expr`: a string expression to filter rows based on certain conditions.
- `inplace`: a boolean value indicating whether to modify the original DataFrame or return a new one. If this is set to `True`, the original DataFrame is changed permanently.
- `**kwargs`: additional parameters to be passed to the query function.
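For instance, with `inplace=True` the DataFrame is filtered in place and the method returns `None`. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'age': [25, 30, 35, 40]})

# filter df itself; query returns None when inplace=True
df.query('age > 30', inplace=True)
```

After this call, `df` contains only the rows for Charlie and David.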
Here is an example:
```python
import pandas as pd

# create a sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'age': [25, 30, 35, 40],
    'gender': ['F', 'M', 'M', 'M']
})

# select rows where age is greater than 30 and gender is male
selected_rows = df.query('age > 30 and gender == "M"')
print(selected_rows)
```
Output:
```
      name  age gender
2  Charlie   35      M
3    David   40      M
```
Note that the query expression is written as a string and enclosed in quotes. It returns a new DataFrame with the selected rows. It can be beneficial when dealing with large datasets, as it can make the code more readable and maintainable by separating the filtering logic from the rest of the code.
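One more feature worth knowing: inside a query expression you can reference local Python variables by prefixing them with `@`, which keeps the filtering logic in one readable string:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', 'Charlie', 'David'],
                   'age': [25, 30, 35, 40]})

min_age = 30
# @min_age refers to the local variable defined above
older = df.query('age > @min_age')
```

This is convenient when the threshold comes from user input or configuration rather than being hard-coded.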
Time Comparison
Now let’s compare the running time of the methods we have discussed. If you are looking for the most time-efficient method, this section will help you.
Let’s first generate a sample DataFrame with 10 million rows and two columns:
```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, size=(10000000, 2)),
                  columns=['A', 'B'])
```
We will compare the following methods for selecting rows based on column conditions:
- Boolean indexing with the `loc` method
- The `query` method
- The `where` method

Following is a time comparison for selecting rows where the value of column `A` is greater than 5:
```python
import timeit

# boolean indexing with loc
start_time = timeit.default_timer()
df.loc[df['A'] > 5]
print("Time taken by boolean indexing with loc: {:.5f} seconds".format(
    timeit.default_timer() - start_time))

# query method
start_time = timeit.default_timer()
df.query('A > 5')
print("Time taken by query method: {:.5f} seconds".format(
    timeit.default_timer() - start_time))

# where method
start_time = timeit.default_timer()
df.where(df['A'] > 5).dropna()
print("Time taken by where method: {:.5f} seconds".format(
    timeit.default_timer() - start_time))
```
Output:
```
Time taken by boolean indexing with loc: 0.16906 seconds
Time taken by query method: 0.21511 seconds
Time taken by where method: 0.69927 seconds
```
From this rough benchmark, we can see that boolean indexing with `loc` is the fastest method, followed by the `query` method and then the `where` method. However, it is essential to note that the exact timings may vary depending on the size and complexity of the DataFrame and the specific conditions used for selection.
Conclusion
That concludes this tutorial. Do practice these methods on your own; that is how you will master them and learn to apply them in different scenarios. The ability to select rows based on column values is an essential skill for any data analyst or data scientist working with pandas, and it can significantly improve the efficiency and accuracy of your data analysis workflow. If you liked this tutorial, consider checking out our Python tutorials page, where we regularly post content for beginners and advanced developers. You’re sure to find something interesting there.