Data analysis – Pandas (3)

Table merge

Vertical merge

In daily work, merging tables is a common requirement. Let's take the earlier sales data as an example. Typically, each team records only its own sales data and then submits it to the person in charge, who combines everything into a final summary.
The sales tables of the four teams (A, B, C, and D) are shown below. We merge them vertically, that is, stack the table contents from top to bottom.

This uses the concat() function in pandas; the tables to be merged are passed in as a list:

pd.concat([df1, df2])

Applied to our example, the complete code is as follows:

import pandas as pd

df_a = pd.read_csv('2019 Team A sales data.csv')
df_b = pd.read_csv('2019 Team B sales data.csv')
df_c = pd.read_csv('2019 Team C sales data.csv')
df_d = pd.read_csv('2019 Team D sales data.csv')

df = pd.concat([df_a, df_b, df_c, df_d])
print(type(df))
# Output: <class 'pandas.core.frame.DataFrame'>

As you can see, the merged table is also a DataFrame type.
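By default, concat() keeps each original table's row index, so the merged table can contain repeated index labels. A minimal sketch (with hypothetical toy data) showing how ignore_index=True renumbers the rows after stacking:

import pandas as pd

# Hypothetical toy tables standing in for the team sales files
df_a = pd.DataFrame({'salesperson': ['Liu Yi'], 'sales': [100]})
df_b = pd.DataFrame({'salesperson': ['Chen Er'], 'sales': [200]})

# ignore_index=True renumbers the merged rows 0, 1, 2, ...
df = pd.concat([df_a, df_b], ignore_index=True)
print(df)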

Horizontal merge

In addition to merging tables vertically, there is another situation: merging tables horizontally.
Assume that the sales data of team A is divided into the first half of the year and the second half of the year. The two tables are as follows:

The simplest way to horizontally merge these two tables is pd.merge(Table 1, Table 2). After merging, you will get a new table as shown below:

The above example is the simplest use of the merge() method, passing in only the two tables to be merged horizontally. The merge() method also has two other important parameters: on and how.

The parameter on specifies the column name(s) used for merging; a list can be used to specify multiple column names, and these columns must exist in both tables. If on is not specified, pandas automatically merges on the columns that the two tables have in common.

In the example above, on defaults to the salesperson column. When merging, pandas matches rows with the same value in that column and merges them horizontally rather than simply joining them in order, so the two tables merge correctly even if their rows are arranged differently.
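A minimal runnable sketch of this matching behavior, using hypothetical half-year tables whose rows are in different orders:

import pandas as pd

# Hypothetical tables standing in for the two half-year files
first_half = pd.DataFrame({'salesperson': ['Liu Yi', 'Chen Er'], 'Q1 sales': [10, 20]})
second_half = pd.DataFrame({'salesperson': ['Chen Er', 'Liu Yi'], 'Q3 sales': [30, 40]})

# Rows are matched on salesperson, so the differing row order does not matter
df = pd.merge(first_half, second_half, on='salesperson')
print(df)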

The two tables in the example above are simple, and the salespeople are identical. Now suppose Liu Yi resigned before the third quarter, so the second-half table no longer contains his rows:

At this point, the result of pd.merge(Table 1, Table 2) is as follows:

As you can see, Liu Yi's data has been dropped from the merged result. This is caused by the parameter how, which specifies the merging method. Four methods are available; the default is inner:

  • left
  • right
  • outer
  • inner (default)

inner (inner join) keeps only the values of the on column (salesperson in our example) that appear in both tables. Because Liu Yi appears only in Table 1, his row is dropped after the merge.

outer (outer join) is the opposite of inner: it keeps all data from both tables and fills in missing values with NaN. In our example, Liu Yi's third- and fourth-quarter data would be filled with NaN.

Note that if NaN is filled into a column, that column becomes floating-point data.


left (left join) merges Table 2 into Table 1: all rows of Table 1 are kept, data for salespeople common to both tables is merged in from Table 2, and rows unique to Table 2 are dropped. Missing values are again filled with NaN. right (right join) is the opposite: Table 1 is merged into Table 2.
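A minimal sketch of the four join types, again with hypothetical data in which Liu Yi appears only in the first table:

import pandas as pd

first_half = pd.DataFrame({'salesperson': ['Liu Yi', 'Chen Er'], 'Q1 sales': [10, 20]})
second_half = pd.DataFrame({'salesperson': ['Chen Er'], 'Q3 sales': [30]})

print(pd.merge(first_half, second_half, how='inner'))  # Liu Yi dropped
print(pd.merge(first_half, second_half, how='outer'))  # Liu Yi kept, Q3 sales is NaN (column becomes float)
print(pd.merge(first_half, second_half, how='left'))   # all rows of first_half kept
print(pd.merge(first_half, second_half, how='right'))  # all rows of second_half kept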

Data cleaning

Until now, our data has been complete and uniformly formatted. Sometimes, however, data may be partially missing, inconsistently formatted, or even wrong. Such data is "dirty", and we need to clean it.

Missing data is one of the most common problems. It can be caused by omissions during manual entry, by the data never existing in the first place, and so on.
I deleted some entries from the Douban Books Top 250 data to deliberately create missing values for us to clean.

After reading the table file, call the info() method and describe() method to see the following information:
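A sketch of this step, assuming a hypothetical file name for the Douban data:

import pandas as pd

# Hypothetical file name
df = pd.read_csv('douban_books_top250.csv')

df.info()             # column types and non-null counts
print(df.describe())  # summary statistics for numeric columns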

From this we can see that there are 250 rows in total, and the price column is missing 2 values. Missing data will cause errors in later analysis. Common ways of dealing with it include deleting the rows with missing data and assigning values to the missing data.

Delete missing data rows

Suppose we decide that rows with missing data are untrustworthy and should be deleted. One line of code is enough:

df.dropna()

The dropna() method deletes all rows that contain NaN; after running it, the two rows with missing data in the table above are removed.
dropna() returns a new table with those rows deleted and does not affect the original table. To delete the rows directly in the original table, pass inplace=True.

However, this approach is rather crude: rows may be deleted even though the missing values would not have affected the analysis. So we need to delete strategically.
We can pass how='all' so that a row is deleted only when all of its values are NaN. The default is how='any', which deletes a row as soon as it contains a single NaN.

df.dropna(how='all')

We can also pass the thresh parameter to delete a row when its number of non-null values falls below a threshold. For example, to delete rows that have fewer than 5 non-null values:

df.dropna(thresh=5)

When we only want to analyze prices, missing values in other columns do not affect the analysis. In that case we can use the subset parameter to decide which columns are checked for missing data before a row is deleted. For example, to delete rows only when the book title or price column is missing:

df.dropna(subset=['Book title', 'Price'])

To sum up: dropna() deletes rows containing NaN; how controls whether any or all values must be missing, thresh sets the minimum number of non-null values a row must have to survive, and subset restricts the check to particular columns.
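A minimal runnable sketch of these options on hypothetical toy data:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Book title': ['A', None, 'C'],
    'Price': [10.0, np.nan, np.nan],
    'Rating': [9.1, np.nan, 8.7],
})

print(df.dropna())                                # keeps only row 0
print(df.dropna(how='all'))                       # drops only row 1 (entirely NaN)
print(df.dropna(thresh=2))                        # keeps rows with at least 2 non-null values
print(df.dropna(subset=['Book title', 'Price']))  # checks only these two columns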

Data filling

In addition to deleting rows with missing data, in some cases we can assign values to the missing data. For example, we can set a missing price to 0:

df['price'].fillna(0)

First select the column with missing data, then call the fillna() method, which fills every missing value in that column with the argument passed in.

We can also pass inplace=True to change the original table directly. Most methods in pandas return a modified copy without touching the original table, and most accept inplace=True to modify the original in place. If you do not want to modify the original table, simply save the returned result to a new variable.

But setting missing prices directly to 0 is not optimal, since it drags the overall average down. We can use a closer value instead, such as the average price:

df['price'].fillna(df['price'].mean())

After this treatment the data deviates less. However, the code above describes an ideal situation; in practice it raises an error. Why? Let's look at our data table again.

As you can see, besides the missing values, the format of the price column is not uniform: some entries are integers, some are decimals, and some have "yuan" appended. We cannot analyze such data as-is; we need to unify the format by removing the word "yuan" and keeping one decimal place.

Unified data format

The first step is to remove the word "yuan", which is really just string processing. In Python, the replace() method performs string replacement. We can remove "yuan" like this:

'29.00 yuan'.replace('yuan', '')

There are also these string methods in pandas, which are hidden under the str attribute. So in pandas we can write:

df['price'].str.replace('yuan', '')

First access the str attribute, then call the string method on it. Python's string methods, such as replace(), upper(), lower(), and split(), are all available there.

Note that because strings are immutable, these methods return new data and do not change the original strings. So we still need the following assignment to replace the price column in the original table with the modified data:

df['price'] = df['price'].str.replace('yuan', '')

After this processing the values look like numbers, but their type is still string. We need the astype() method to convert the type. Prices are decimals, so we convert the column to float:

df['price'] = df['price'].str.replace('yuan', '').astype('float')

The table after processing the prices looks like this:

Apart from the missing values (NaN), all prices are now unified as floating-point numbers with one decimal place. Now let's run df.describe() again to look at the processed data:

Because the price column now holds numbers, price statistics appear in the summary. And if you run df['price'].fillna(df['price'].mean()) at this point, no error is reported.
The calculated average price is 33.090161, but we do not need that many decimal places, so we round it with the round() method, whose argument is the number of decimal places to keep. The final code:

df['price'].fillna(df['price'].mean().round(1), inplace=True) # Keep 1 decimal place
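Putting the whole cleaning flow together, a minimal sketch on hypothetical toy data:

import pandas as pd

# Hypothetical messy prices: plain numbers, a 'yuan' suffix, and a missing value
df = pd.DataFrame({'price': ['29.00 yuan', '35', None, '42.5 yuan']})

df['price'] = df['price'].str.replace('yuan', '').astype('float')
df['price'].fillna(df['price'].mean().round(1), inplace=True)
print(df)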

Remove duplicate data

First, we deliberately create a table with duplicate data for demonstration:

import pandas as pd

df = pd.DataFrame({'Username': ['Liu Yi', 'Chen Er', 'Liu Yi', 'Zhang San'], 'Transaction amount': [25.8, 15.5, 56.3, 46.2]})
repeat = pd.concat([df, df])
print(repeat)
    Username  Transaction amount
0     Liu Yi                25.8
1    Chen Er                15.5
2     Liu Yi                56.3
3  Zhang San                46.2
0     Liu Yi                25.8
1    Chen Er                15.5
2     Liu Yi                56.3
3  Zhang San                46.2

I concatenated the table with itself to obtain a table with duplicate data. We can call the drop_duplicates() method to delete fully duplicated rows, that is, rows whose values are identical in every column.

print(repeat.drop_duplicates())

Result:

    Username  Transaction amount
0     Liu Yi                25.8
1    Chen Er                15.5
2     Liu Yi                56.3
3  Zhang San                46.2

We can also use the subset parameter to deduplicate by specific columns: as long as the data in those columns repeats, the duplicated rows are deleted.

print(repeat.drop_duplicates(subset=['Username']))

    Username  Transaction amount
0     Liu Yi                25.8
1    Chen Er                15.5
3  Zhang San                46.2

As you can see, after deduplicating by username only one of Liu Yi's two records remains. By default the first occurrence is kept; if you want to keep the last occurrence instead, pass keep='last':

print(repeat.drop_duplicates(subset=['Username'], keep='last'))

    Username  Transaction amount
1    Chen Er                15.5
2     Liu Yi                56.3
3  Zhang San                46.2

Assuming that the order of data is arranged by transaction time, the above code can obtain the most recent transaction record of each user.

So far we have discussed removing duplicates; what if we instead want to know which distinct values exist? That is simple: just call the unique() method on the corresponding column.

print(repeat['Username'].unique())
# Output: ['Liu Yi' 'Chen Er' 'Zhang San']

print(len(repeat['Username'].unique()))
# Output: 3

As you can see, the 8 transaction records were actually generated by only 3 users. Besides using len(repeat['Username'].unique()) to count unique values, we can get the count directly with the nunique() method.

print(repeat['Username'].nunique())
# Output: 3

To sum up: drop_duplicates() removes duplicate rows (subset limits the check to given columns, keep chooses which occurrence survives), unique() lists a column's distinct values, and nunique() counts them.

A very practical method in pandas: apply()

The first parameter of the apply() method is a function. apply() applies that function to every piece of data in the column, passing each value in as the argument (a bit like mapping).

For example:

# Without the apply() method
df['price'] = df['price'].str.replace('yuan', '').astype('float')

# With the apply() method
def format_price(x):
    return float(x.replace('yuan', ''))

df['price'] = df['price'].apply(format_price)

It seems that there is more code, but it is just longer to define the function. If we use an anonymous function, we can also do it in one line:

df['price'] = df['price'].apply(lambda x: float(x.replace('yuan', '')))

Previously we processed the data with pandas-specific tools, which requires knowing about the str attribute, the astype() method, and so on in advance. The apply() method is much more flexible: we get each value directly and process it with plain Python, so the processing rules can be as flexible as the function we define.
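For instance, here is a sketch (assuming the same messy price column as above) in which a single function handles both the 'yuan' suffix and missing values:

import pandas as pd

df = pd.DataFrame({'price': ['29.00 yuan', '35', None]})

def format_price(x):
    # Leave missing values untouched; strip 'yuan' from everything else
    if pd.isna(x):
        return x
    return float(str(x).replace('yuan', ''))

df['price'] = df['price'].apply(format_price)
print(df)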

Appendix: anonymous functions

Anonymous functions in Python are defined with the lambda keyword.
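A minimal sketch comparing an ordinary function with an equivalent anonymous function:

# Ordinary function
def add(x, y):
    return x + y

print(add(1, 2))  # Output: 3

# Equivalent anonymous function, assigned to a variable so it has a name
add = lambda x, y: x + y

print(add(1, 2))  # Output: 3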

Using anonymous functions does not make the code more efficient; it just looks more concise. Some functions are used only once, and when you cannot think of a good name for one, an anonymous function is very convenient.

We can also assign anonymous functions to variables, so that anonymous functions also have names.

