Detailed explanation of finding the difference between two dataframes using Pandas
1. Intersection
intersected = pd.merge(df1, df2, how='inner')
Extension (intersect on specific columns only):
intersected = pd.merge(df1, df2, on=['name'], how='inner')
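As a minimal sketch of the two merge calls above, the following uses two small, made-up frames (the column names "name" and "score" and all values are assumptions for illustration, not data from the original article):

```python
import pandas as pd

# Hypothetical sample data; both frames share the columns "name" and "score".
df1 = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "score": [1, 2, 3]})
df2 = pd.DataFrame({"name": ["Bob", "Cid", "Dee"], "score": [2, 9, 4]})

# Without `on`, merge joins on every shared column, so whole rows must match.
intersected = pd.merge(df1, df2, how="inner")

# With on=["name"], only "name" has to match; the other shared column
# appears twice, suffixed _x (from df1) and _y (from df2).
by_name = pd.merge(df1, df2, on=["name"], how="inner")

print(intersected)
print(by_name)
```

Only the Bob row is identical in both frames, so it is the only row in the full-row intersection, while the name-only intersection also keeps Cid.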
2. Difference set (df1-df2 as an example)
diff=pd.concat([df1,df2,df2]).drop_duplicates(keep=False)
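A short worked example of the expression above, assuming two made-up frames with the same columns (names and values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data with identical column layouts.
df1 = pd.DataFrame({"name": ["Ann", "Bob", "Cid"], "score": [1, 2, 3]})
df2 = pd.DataFrame({"name": ["Bob", "Dee"], "score": [2, 4]})

# Stack df1 once and df2 twice, then drop every row that occurs more than once.
diff = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

print(diff)
```

The result keeps the original row labels from df1 (here 0 and 2); append .reset_index(drop=True) if a fresh index is preferred.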
Detailed explanation of difference set function:
1. pandas combines Series and DataFrame objects with the concat() function. Its most commonly used parameters are: pd.concat(objs, axis=0, join='outer', ignore_index=False). The join_axes parameter shown in older tutorials was removed in pandas 1.0.
2. A DataFrame may contain rows that repeat the same values in one or more columns. drop_duplicates() removes such rows.
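To make the concat() parameters concrete, here is a small sketch with two made-up frames whose columns only partly overlap (the names top and bottom and all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames: both have column "a", but "b" and "c" differ.
top = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
bottom = pd.DataFrame({"a": [5], "c": [6]})

# axis=0 stacks rows; join="outer" keeps the union of columns (gaps become NaN).
# ignore_index=True renumbers the rows 0..n-1 instead of keeping old labels.
outer = pd.concat([top, bottom], axis=0, join="outer", ignore_index=True)

# join="inner" keeps only the columns present in every input.
inner = pd.concat([top, bottom], axis=0, join="inner", ignore_index=True)

print(outer)
print(inner)
```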
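For example:
import pandas as pd
data = {"a": [2, 2, 3, 5, 5, 10], "c": [4, 5, 6, 7, 8, 12]}
pd_data = pd.DataFrame(data=data)
print(pd_data)
# Rows with the same value in column a count as duplicates; keep the last of each.
t = pd_data.drop_duplicates(subset=['a'], keep='last', inplace=False)
print(t)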
Notes:
keep='first' keeps the first occurrence of each set of duplicate rows; this is the default. The other two values are 'last', which keeps the last occurrence, and False, which removes every row that has a duplicate.
inplace=True removes the duplicates directly from the original DataFrame and returns None. The default, inplace=False, leaves the original untouched and returns a new DataFrame.
subset names the columns used to detect duplicates. With subset=['c','b'], for instance, two rows count as duplicates only when they match in both column c and column b. If subset is omitted, all columns are compared.
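The three keep settings can be compared side by side. This sketch reuses the sample values from the example above and deduplicates on column "a" (that column choice is an assumption made for illustration):

```python
import pandas as pd

data = {"a": [2, 2, 3, 5, 5, 10], "c": [4, 5, 6, 7, 8, 12]}
pd_data = pd.DataFrame(data=data)

# Column "a" holds the values 2 and 5 twice each.
first = pd_data.drop_duplicates(subset=["a"], keep="first")
last = pd_data.drop_duplicates(subset=["a"], keep="last")
none = pd_data.drop_duplicates(subset=["a"], keep=False)

print(first.index.tolist())  # row labels kept with keep="first"
print(last.index.tolist())   # row labels kept with keep="last"
print(none.index.tolist())   # row labels kept with keep=False

# inplace defaults to False, so the original frame still has all six rows.
print(len(pd_data))
```

Printing the row labels rather than the full frames makes it easy to see which member of each duplicate pair survives.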
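3. Combining concat and drop_duplicates yields the difference set. Because df2 is concatenated twice, every row of df2, including the rows it shares with df1, occurs at least twice, so drop_duplicates(keep=False) removes all of them and only the rows unique to df1 remain. Note that rows duplicated within df1 itself are removed as well.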
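The role of the second df2 can be checked with a small experiment. The frames below are made up for illustration (the column name "k" and its values are assumptions):

```python
import pandas as pd

# Hypothetical single-column frames; the value 3 occurs in both.
df1 = pd.DataFrame({"k": [1, 2, 3]})
df2 = pd.DataFrame({"k": [3, 4]})

# df2 listed once: rows unique to df2 survive too (a symmetric difference).
once = pd.concat([df1, df2]).drop_duplicates(keep=False)

# df2 listed twice: every df2 row is a duplicate, so only df1-only rows remain.
twice = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

# Caveat: a row repeated inside df1 is also dropped by keep=False.
df1_dup = pd.DataFrame({"k": [1, 1, 2]})
lossy = pd.concat([df1_dup, df2, df2]).drop_duplicates(keep=False)

print(once["k"].tolist())
print(twice["k"].tolist())
print(lossy["k"].tolist())
```

If df1 may contain repeated rows that should be preserved, this expression is not a safe way to compute df1-df2 without further handling.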
In addition, there is another way to achieve the same purpose: