How to merge multiple dataframes in Python Pandas? Code Example

Total
0
Shares

Merging dataframes in Python Pandas is possible through merge() function. The format is like this –

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Suppose you have two dataframes – df1 and df2

df1 = pd.DataFrame({'leftKey': ['ironman', 'hulk', 'thor', 'thor'], 'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rightKey': ['hulk', 'thor', 'hulk', 'ironman'], 'value': [5, 6, 7, 8]})

And you want to merge them, then the function will be called like this –

df1.merge(df2)

The output will be –

    leftKey  value   rightKey
0      thor      5       hulk

merge() (without extra flags) checks the common values for keys in both dataframes and include them only in the final dataframe.

In df1 and df2, only value 5 is common. Their corresponding keys are thor and hulk.

Merging with left_on right_on

With left_on and right_on parameters, we can join both dataframes on common keys. So the values of both dataframes will be there in the merged dataframe.

df1.merge(df2, left_on='leftKey', right_on='rightKey')

Output:

   leftKey  value_x rightKey  value_y
0  ironman        1  ironman        8
1     hulk        2     hulk        5
2     hulk        2     hulk        7
3     thor        3     thor        6
4     thor        5     thor        6

It works like this – a key of df1 will be merged with all the similar keys in df2. So, hulk in df1 will be merged with two hulk keys in df2.

Total entries of key1 = total entries of key1 in df1 x total entries of key1 in df2.

Using value suffix

A suffix is used for value column to distinguish it with the value column of second dataframe.

In the above output the value columns are represented as value_x and value_y. Although the name was value for both df1 and df2 dataframes.

This is because _x and _y are the default values of suffixes. Pandas attach them automatically if we do not specify any custom suffix.

df1.merge(df2, left_on='leftKey', right_on='rightKey',
          suffixes=('_a', '_b'))

Output:

   leftKey  value_a rightKey  value_b
0  ironman        1  ironman        8
1     hulk        2     hulk        5
2     hulk        2     hulk        7
3     thor        3     thor        6
4     thor        5     thor        6

It is mandatory to have a suffix otherwise it will throw this error –

ValueError: columns overlap but no suffix specified

df1.merge(df2, left_on='leftKey', right_on='rightKey',
          suffixes=(False, False))

Output:

Traceback (most recent call last):
  File "<string>", line 8, in <module>
File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 9190, in merge
return merge(
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/reshape/merge.py", line 121, in merge
return op.get_result()
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/reshape/merge.py", line 717, in get_result
llabels, rlabels = _items_overlap_with_suffix(
File "/usr/local/lib/python3.8/dist-packages/pandas/core/reshape/merge.py", line 2314, in _items_overlap_with_suffix
raise ValueError(f"columns overlap but no suffix specified: {to_rename}")

ValueError: columns overlap but no suffix specified: Index(['value'], dtype='object')

Merging with Left Join

You can merge two dataframes similar to left join in sql. It means, all the keys of left dataframe will be included but for right dataframe, only matching keys will be included.

In our example, df1 and df2 have only 1 matching value, i.e. 5, for keys ironman (df1) and hulk (df2).

We set the left join using how='left' parameter –

df1.merge(df2, how='left')

Output:

   leftKey  value rightKey
0  ironman      1      NaN
1     hulk      2      NaN
2     thor      3      NaN
3     thor      5     hulk

By default, it matches on the common column. In our example, value column is common in both dataframes, so it did the join on value.

But you can also specify the column on which the join should perform. We can pass that using on='common_column_name' parameter –

df1.merge(df2, how='left', on='value')

If you will use a non-common column like leftKey or rightKey then it will throw error –

>>> df1.merge(df2, how='left', on='leftKey')

Traceback (most recent call last):
  ...
raise KeyError(key)
KeyError: 'leftKey'

Merging with Inner Join

If you are familiar with sql inner join, then you can easily visualize the dataframe merge with inner join.

It will only include the fields where common field values (or specified column using on parameter) are same.

In our example, only value 5 is same in both dataframes –

df1.merge(df2, how='inner')

Output:

  leftKey  value rightKey
0    thor      5     hulk

Using Cross Join Merge

Cross join merges each value of first dataframe with the second. So, in total there are len(df1) x len(df2) values in merged dataframe.

df1.merge(df2, how='cross')

Output:

    leftKey  value_x rightKey  value_y
0   ironman        1     hulk        5
1   ironman        1     thor        6
2   ironman        1     hulk        7
3   ironman        1  ironman        8
4      hulk        2     hulk        5
5      hulk        2     thor        6
6      hulk        2     hulk        7
7      hulk        2  ironman        8
8      thor        3     hulk        5
9      thor        3     thor        6
10     thor        3     hulk        7
11     thor        3  ironman        8
12     thor        5     hulk        5
13     thor        5     thor        6
14     thor        5     hulk        7
15     thor        5  ironman        8

Merging Multiple Dataframes

The merge function works on two dataframes. In case you have multiple dataframes and you want to merge them, then either you will have to use a loop or a recursion function over all the dataframes.

You can also use reduce() function with fillna(void) to merge multiple dataframes.

Merging multiple dataframes using loop

You can run a loop over a collection of dataframes and merge their outputs. To see this in action, left take few more dataframes –

df1 = pd.DataFrame({'leftKey': ['ironman', 'hulk', 'thor', 'thor'],
                    'value': [1, 2, 3, 5]})

df2 = pd.DataFrame({'rightKey': ['hulk', 'thor', 'hulk', 'ironman'],
                    'value': [5, 6, 7, 8]})
                    
df3 = pd.DataFrame({'thirdKey': ['hulk', 'hulk', 'hulk', 'ironman'],
                    'value': [5, 9, 8, 8]})

df4 = pd.DataFrame({'fourthKey': ['thor', 'thor', 'ironman', 'ironman'],
                    'value': [8, 6, 6, 6]})

Now we have 4 dataframes and we can run a loop over them –

final_df = df1
for df in [df2, df3, df4]:
    final_df = final_df.merge(df, how='left')
    
print(final_df)

Output:

   leftKey  value rightKey thirdKey fourthKey
0  ironman      1      NaN      NaN       NaN
1     hulk      2      NaN      NaN       NaN
2     thor      3      NaN      NaN       NaN
3     thor      5     hulk     hulk       NaN

Using Recursion

We can merge dataframes using a recursion function. This function will merge two dataframes and pass the output to itself and merge with another dataframe. This goes on till the last dataframe in the list. Here is the code –

def mergeAllDfs(dfs, mergedDf=''):
    if len(dfs) == 0 or (not isinstance(mergedDf, pd.DataFrame) and len(dfs) < 2):
        return mergedDf
    if not isinstance(mergedDf, pd.DataFrame):
        mergedDf = dfs.pop(0)
    nextDf = dfs.pop(0)
   
    return mergeAllDfs(dfs, mergedDf.merge(nextDf, how='left'))

mergeAllDfs([df1, df2, df3, df4])

Output:

   leftKey  value rightKey thirdKey fourthKey
0  ironman      1      NaN      NaN       NaN
1     hulk      2      NaN      NaN       NaN
2     thor      3      NaN      NaN       NaN
3     thor      5     hulk     hulk       NaN

Using reduce() function

As you might already know, a reduce() function uses lambda to run a function over a list of values and return a consolidated output. For example –

reduce(lambda x, y: x * y, [1, 2, 3, 4, 5])

# Output - 120

In this code the final output will be 120. Because reduce function will pick one value from list and perform the operation with output as second value.

Here we are doing x * y, so it will work like this – ((((1 * 2) * 3) * 4) * 5)

Let’s use this function to merge multiple dataframes. Remember to import reduce from functools.

from functools import reduce

reduce(lambda x, y: x.merge(y, how='left'), [df1, df2, df3, df4])

Output:

   leftKey  value rightKey thirdKey fourthKey
0  ironman      1      NaN      NaN       NaN
1     hulk      2      NaN      NaN       NaN
2     thor      3      NaN      NaN       NaN
3     thor      5     hulk     hulk       NaN

Conclusion

In this article we saw different methods to merge two dataframes. Also, we discussed about ways to handle merging of multiple dataframes using loops, recursion and reduce function.