Merging dataframes in Python Pandas is possible through merge() function. The format is like this –
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Suppose you have two dataframes – df1
and df2
–
df1 = pd.DataFrame({'leftKey': ['ironman', 'hulk', 'thor', 'thor'], 'value': [1, 2, 3, 5]}) df2 = pd.DataFrame({'rightKey': ['hulk', 'thor', 'hulk', 'ironman'], 'value': [5, 6, 7, 8]})
And you want to merge them, then the function will be called like this –
df1.merge(df2)
The output will be –
leftKey value rightKey 0 thor 5 hulk
merge()
(without extra flags) checks the common values for keys in both dataframes and include them only in the final dataframe.
In df1
and df2
, only value 5 is common. Their corresponding keys are thor
and hulk
.
Merging with left_on right_on
With left_on
and right_on
parameters, we can join both dataframes on common keys. So the values of both dataframes will be there in the merged dataframe.
df1.merge(df2, left_on='leftKey', right_on='rightKey')
Output:
leftKey value_x rightKey value_y 0 ironman 1 ironman 8 1 hulk 2 hulk 5 2 hulk 2 hulk 7 3 thor 3 thor 6 4 thor 5 thor 6
It works like this – a key of df1
will be merged with all the similar keys in df2
. So, hulk
in df1
will be merged with two hulk
keys in df2
.
Total entries of key1 = total entries of key1 in df1 x total entries of key1 in df2.
Using value suffix
A suffix is used for value column to distinguish it with the value column of second dataframe.
In the above output the value columns are represented as value_x
and value_y
. Although the name was value
for both df1
and df2
dataframes.
This is because _x
and _y
are the default values of suffixes. Pandas attach them automatically if we do not specify any custom suffix.
df1.merge(df2, left_on='leftKey', right_on='rightKey', suffixes=('_a', '_b'))
Output:
leftKey value_a rightKey value_b 0 ironman 1 ironman 8 1 hulk 2 hulk 5 2 hulk 2 hulk 7 3 thor 3 thor 6 4 thor 5 thor 6
It is mandatory to have a suffix otherwise it will throw this error –
ValueError: columns overlap but no suffix specified
df1.merge(df2, left_on='leftKey', right_on='rightKey', suffixes=(False, False))
Output:
Traceback (most recent call last): File "<string>", line 8, in <module> File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 9190, in merge return merge( File "/usr/local/lib/python3.8/dist-packages/pandas/core/reshape/merge.py", line 121, in merge return op.get_result() File "/usr/local/lib/python3.8/dist-packages/pandas/core/reshape/merge.py", line 717, in get_result llabels, rlabels = _items_overlap_with_suffix( File "/usr/local/lib/python3.8/dist-packages/pandas/core/reshape/merge.py", line 2314, in _items_overlap_with_suffix raise ValueError(f"columns overlap but no suffix specified: {to_rename}") ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')
Merging with Left Join
You can merge two dataframes similar to left join in sql. It means, all the keys of left dataframe will be included but for right dataframe, only matching keys will be included.
In our example, df1
and df2
have only 1 matching value, i.e. 5, for keys ironman (df1)
and hulk (df2)
.
We set the left join using how='left'
parameter –
df1.merge(df2, how='left')
Output:
leftKey value rightKey 0 ironman 1 NaN 1 hulk 2 NaN 2 thor 3 NaN 3 thor 5 hulk
By default, it matches on the common column. In our example, value
column is common in both dataframes, so it did the join on value
.
But you can also specify the column on which the join should perform. We can pass that using on='common_column_name'
parameter –
df1.merge(df2, how='left', on='value')
If you will use a non-common column like leftKey
or rightKey
then it will throw error –
>>> df1.merge(df2, how='left', on='leftKey') Traceback (most recent call last): ... raise KeyError(key) KeyError: 'leftKey'
Merging with Inner Join
If you are familiar with sql inner join, then you can easily visualize the dataframe merge with inner join.
It will only include the fields where common field values (or specified column using on
parameter) are same.
In our example, only value 5 is same in both dataframes –
df1.merge(df2, how='inner')
Output:
leftKey value rightKey 0 thor 5 hulk
Using Cross Join Merge
Cross join merges each value of first dataframe with the second. So, in total there are len(df1) x len(df2) values in merged dataframe.
df1.merge(df2, how='cross')
Output:
leftKey value_x rightKey value_y 0 ironman 1 hulk 5 1 ironman 1 thor 6 2 ironman 1 hulk 7 3 ironman 1 ironman 8 4 hulk 2 hulk 5 5 hulk 2 thor 6 6 hulk 2 hulk 7 7 hulk 2 ironman 8 8 thor 3 hulk 5 9 thor 3 thor 6 10 thor 3 hulk 7 11 thor 3 ironman 8 12 thor 5 hulk 5 13 thor 5 thor 6 14 thor 5 hulk 7 15 thor 5 ironman 8
Merging Multiple Dataframes
The merge function works on two dataframes. In case you have multiple dataframes and you want to merge them, then either you will have to use a loop or a recursion function over all the dataframes.
You can also use reduce() function with fillna(void) to merge multiple dataframes.
Merging multiple dataframes using loop
You can run a loop over a collection of dataframes and merge their outputs. To see this in action, left take few more dataframes –
df1 = pd.DataFrame({'leftKey': ['ironman', 'hulk', 'thor', 'thor'], 'value': [1, 2, 3, 5]}) df2 = pd.DataFrame({'rightKey': ['hulk', 'thor', 'hulk', 'ironman'], 'value': [5, 6, 7, 8]}) df3 = pd.DataFrame({'thirdKey': ['hulk', 'hulk', 'hulk', 'ironman'], 'value': [5, 9, 8, 8]}) df4 = pd.DataFrame({'fourthKey': ['thor', 'thor', 'ironman', 'ironman'], 'value': [8, 6, 6, 6]})
Now we have 4 dataframes and we can run a loop over them –
final_df = df1 for df in [df2, df3, df4]: final_df = final_df.merge(df, how='left') print(final_df)
Output:
leftKey value rightKey thirdKey fourthKey 0 ironman 1 NaN NaN NaN 1 hulk 2 NaN NaN NaN 2 thor 3 NaN NaN NaN 3 thor 5 hulk hulk NaN
Using Recursion
We can merge dataframes using a recursion function. This function will merge two dataframes and pass the output to itself and merge with another dataframe. This goes on till the last dataframe in the list. Here is the code –
def mergeAllDfs(dfs, mergedDf=''): if len(dfs) == 0 or (not isinstance(mergedDf, pd.DataFrame) and len(dfs) < 2): return mergedDf if not isinstance(mergedDf, pd.DataFrame): mergedDf = dfs.pop(0) nextDf = dfs.pop(0) return mergeAllDfs(dfs, mergedDf.merge(nextDf, how='left')) mergeAllDfs([df1, df2, df3, df4])
Output:
leftKey value rightKey thirdKey fourthKey 0 ironman 1 NaN NaN NaN 1 hulk 2 NaN NaN NaN 2 thor 3 NaN NaN NaN 3 thor 5 hulk hulk NaN
Using reduce() function
As you might already know, a reduce()
function uses lambda to run a function over a list of values and return a consolidated output. For example –
reduce(lambda x, y: x * y, [1, 2, 3, 4, 5]) # Output - 120
In this code the final output will be 120. Because reduce function will pick one value from list and perform the operation with output as second value.
Here we are doing x * y, so it will work like this – ((((1 * 2) * 3) * 4) * 5)
Let’s use this function to merge multiple dataframes. Remember to import reduce from functools.
from functools import reduce reduce(lambda x, y: x.merge(y, how='left'), [df1, df2, df3, df4])
Output:
leftKey value rightKey thirdKey fourthKey 0 ironman 1 NaN NaN NaN 1 hulk 2 NaN NaN NaN 2 thor 3 NaN NaN NaN 3 thor 5 hulk hulk NaN
Conclusion
In this article we saw different methods to merge two dataframes. Also, we discussed about ways to handle merging of multiple dataframes using loops, recursion and reduce function.