What is Pandas and What is it Used For
Pandas is a Python data library. It makes life easier for analysts: Where 10 lines of code were needed before, one line is now enough.
For example, to read data from a CSV file, in standard Python you must first decide how to store the data, then open the file, read it line by line, separate the values, and strip the data of special characters.
With pandas, things are simpler. First, you do not have to worry about how the data is stored — it’s in the data frame. Second, you only have to write a single command:
Pandas adds new data structures to Python — series and dataframes. Let’s examine them closely.
Data Structures: Series and DataFrames
That is, if you multiply a list by 2, you get the same list repeated 2 times.
But if you multiply the series, its length will not change, but the elements will double.
Note the first column of the output. This is the index that stores the addresses of each element in the series. Each element can then be retrieved by accessing the desired address.
Another difference between series and lists is that you can use any values as indexes, which makes the data clearer. Let us say we are analyzing monthly sales. We use the names of the months as indexes, and the values are the sales.
Now we can get the values of each month:
Since a series is a
Dataframes are tables. They have rows, columns and cells.
Technically, columns of a dataframe are series. Since columns usually describe the same objects, all columns share the same index.
I will explain how to create dataframes and load data into them in the following chapter.
Creating and DataFrame and Loading Data
Sometimes we don’t know what the data is and can’t specify the structure beforehand. Then it’s convenient to create an empty dataframe and fill it with data later.
And sometimes the data is already there, but stored in a variable from standard Python, such as a dictionary. To create the dataframe, we pass this variable to the same command.
It happens that data is missing in some records. For example, look at the
goods_sold list — it contains sales broken down by product category. In the first month we sold cars, computers and software. In the second month there were no cars, but bicycles, and in the third month there were cars again, but no bicycles.
When you load the data into the dataframe, Pandas creates columns for all product categories and fills them with data where possible.
Note that the sales of bicycles in the first and third months are
NaN — which stands for Not a Number. This is how Pandas marks missing values.
Now let’s break down how to load data from files. Most often the data is stored in Excel spreadsheets or csv, tsv files.
Excel tables are read with the pd.read_excel() function. Pass the address of the file on your computer and the name of the sheet you want to read as parameters. The command works with both xls and xlsx:
The csv and tsv files are text files with data separated by commas or tabs.
Both are read with the
.read_csv() function, the tab character is passed with the
sep parameter (from «separator»).
When loading, we can assign a column as an index. Imagine you are loading a table with orders. Each order has its own unique number. If we assign this number as an index, we can then extract the data using the
df[order_id] command. Otherwise we have to write the filter
df[df['id'] == order_id ].
In a later section, I will explain how to get the data from the dataframes. To assign a column as an index, add the
index_col parameter to the
read_csv() function, which corresponds to the name of the desired column.
After you load the data into the dataframe, it’s a good idea to examine it — especially if you are not familiar with it.
Exploring the Data
Let’s imagine we’re analyzing the sales of an online store. We have data on orders and customers. Let’s load the file with the sales into the variable orders and let’s specify that the id column should be used as a dataframe index,
Let’s examine four attributes of every dataframe:
.columns, .index and .dtypes.
.shape indicates how many rows and columns the dataframe has. It returns a tuple of values (n_rows, n_columns). The rows come first, then the columns.
Our dataframe has 5009 rows and 5 columns.
Okay, we know the scale. Now we want to see what information is in each column. Use
.columns to find out the column names.
Now we see that the table contains the order date, delivery method, customer number and sales.
.dtypes to find out the data types in each column and see if they need to be processed. There are cases where numbers are loaded as text. If we try to add two text values
'1' + '1', we’ll not get the number
2, but the string
object type is text, float64 is a fractional number like 3.14.
.index attribute we can see the row names.
As expected, there are order numbers in the index of the dataframe: 100762, 100860, and so on.
sales column, the value of each order is stored. To get the range, average and median cost, we use the
Finally, to look at some examples of dataframe entries, we use the
.head() command. It returns 6 first records of the dataframe.
Now that we’ve a first idea or dataframes, let’s discuss how to get data out of them.
Getting the Data from DataFrames
You can retrieve data from dataframes in a number of ways: by specifying column and row numbers, by using conditional statements, or by using the query language. In the following chapters I’ll tell you more about each method.
Specifying the right rows and columns
Let’s continue with the analysis of the online store sales that we uploaded in the previous section. Suppose I want to display the sales column. To do this, enclose the column name in square brackets and place it after the name of the dataframe:
Note that the result of this command is a Series with the same index.
If you need to output multiple columns, insert a list with their names in square brackets:
orders[['customer_id', 'sales']]. Caution: The square brackets are now double. The first pair of brackets is from the dataframe, the second pair is from the list.
Let’s move on to the rows. You can filter them by index and rank. For example, if we want to display only orders 100363, 100391 and 100706, there’s a command
.loc for that.
Another time we want to get only orders 1 to 3 in the order, regardless of their number in the table. Then the command
.iloc is used.
You can filter dataframes by columns and columns at the same time:
Often you don’t know the ids of the desired orders in advance. For example, if the task is to get orders worth more than $1000. This task can be conveniently solved with the help of conditional operators.
If — then. Conditional operators
Problem: You need to find out where the largest orders come from. Let’s start with all purchases worth more than $1000:
Remember I mentioned at the beginning of this article that in the Series all operations are applied item by item? Well, the
orders['sales'] > 1000 operation is applied to every element in the series and, if the condition is met, returns
True. If it’s not met,
False is returned. The resulting row is stored in the
The second command filters the rows of the data frame containing the Series. If the
filter_large element is
True, the row is displayed, if it’s
False — not. The result is a data frame with orders worth more than $1000.
I wonder how many expensive orders were delivered with the first class? Let’s add another condition to the filter.
The logic hasn’t changed. In the variable
filter_large we store the row that satisfies the condition
orders['sales'] > 1000. In
filter_first_class — the row that satisfies
orders['ship_mode'] == 'First'.
Then we merged both rows using the logical
filter_first_class & filter_first_class. We obtained a new Series of the same length, in the elements of which only orders of value greater than 1000, supplied with the first class are
True. There can be any number of these conditions.
The Query Language
Another way to solve the previous problem is to use the query language. We write all the conditions in one string
'sales > 1000 & ship_mode == «First»' and pass them to the
.query() method. The query is more compact.
Pro tip: you can store filter values in a variable and refer to it in the query with the
'sales > @sales_filter'.
Now that we’ve figured out how to retrieve data from the data frame, let’s move on to counting aggregate metrics: Number of Orders, Total Sales, Average Check, and Conversion Rate.
Let’s calculate how much money the store made with each delivery class. Let’s start with a simple calculation: we add up the income from all orders. To do this, use the
Let’s add a shipping class. Before summing, group the data using the
3.514284e+05 — scientific format of number output. This means 3.51 * 105. Since we don’t need such precision, we can instruct Pandas to round the values to hundredths.
Much better! Now we see the amount of revenue for each delivery class. We can’t tell from the total revenue whether things are getting better or worse. Let’s add a breakdown by order date.
You can see that the revenue jumps from day to day: sometimes it’s $10, sometimes it’s $378. I wonder if it’s the number of orders or the average check that changes. Let’s add the number of orders to the sample. To do this, instead of
.sum(), we use the
.agg() method, passing in a list of the names of the required functions.
Wow, it turns out that’s how the average check jumps. I wonder what the best day was? To find out, we sort the resulting dataframe: We display the top 10 days by revenue.
The command has become very large and doesn’t read well. To simplify it, you can split it into several lines. We put a backslash
\ at the end of each line.
On its most successful day, March 18, 2014, the store made $27,000 with a standard shipping class. I wonder where the customers who placed those orders came from? To find out, you need to combine the order data with the customer data.
So far, we have only looked at the table of orders. However, we also have data about the customers of the online store. Let us load them into the
customers variable and see what they are.
We know the type of customer, their location, their name and the name of their contact person. Each customer has a unique customer ID. The same number is in the
customer_id column of the orders table. We can find out which orders each customer has placed by using these columns. For example, let us look at the orders placed by the user
Let us return to the task from the previous section: find out which customers placed orders with standard shipping on March 18. To do this, we merge the Customers and Orders tables. Dataframes are joined using the
.join() methods. They all do the same thing, but differ in syntax — in practice it is enough to know how to use one of them.
Let me show you an example of
.merge(), I first specified the names of the dataframes I wanted to join. Then I specified exactly how they should be merged and which columns should be used as keys.
The key is a column that connects the two data frames. In our case, it is the customer ID. In the table with orders it is in the column
customer_id, and in the table with customers it is in the index. So in the command we write:
Solving the Problem
Let’s consolidate what we’ve learned by solving a problem. Let’s find the 5 cities that generated the most revenue in 2016.
First, let’s filter the orders from 2016
The city is an attribute of users, not orders. Let’s add user information.
Now we need to group the resulting dataframe by city and to calculate the revenue.
Next: sort by descending order of sales and keep the top 5:
Try it yourself:
- How many orders were shipped first class in the last 5 years?
- How many customers from California are in the database?
- How many orders have they placed?
- Create a dataframe with the average checks for all states for each year.
That’s it for today! See you again soon!
Every few weeks I send a summary of the content I have created and interesting things I have found online. of their release.