ValueError: cannot reindex from a duplicate axis – Code Example

In this article I will quickly go through some code examples that will help you resolve ValueError: cannot reindex from a duplicate axis. pandas raises this error when you call an operation such as reindex() on an object whose row or column labels contain duplicate values.

Why are duplicates allowed?

According to the pandas user guide –

If you’re familiar with SQL, you know that row labels are similar to a primary key on a table, and you would never want duplicates in a SQL table. But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.

Code Example

Error Code – Let’s first replicate this error –

import pandas as pd
s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])  # the label "b" appears twice
s1.reindex(["a", "b", "c"])

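Since the index in the above example contains the label "b" twice, reindex() raises an error. Newer pandas releases word the same ValueError as "cannot reindex on an axis with duplicate labels" –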

Traceback (most recent call last)
<ipython-input-4-18a38f6978fe> in <module>
----> 1 s1.reindex(["a", "b", "c"])
ValueError: cannot reindex from a duplicate axis
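
If you would rather handle the failure than let it stop your script, the same reproduction can be wrapped in a try/except. This is only a minimal sketch: the variable name error_message is illustrative, and the exact message text depends on your pandas version.

```python
import pandas as pd

# Series whose index contains the label "b" twice
s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

try:
    s1.reindex(["a", "b", "c"])
    error_message = None
except ValueError as exc:
    # The wording of the message varies between pandas versions
    error_message = str(exc)

print(error_message)
```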

Solution

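Method 1 – Detect duplicate row labels –

Here df2 stands for the DataFrame you are working with.

df2.index.is_unique

If this returns False, the index contains duplicate labels and you need to clean it up before reindexing.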

Method 2 – Detect duplicate columns –

df2.columns.is_unique

If it returns False, the columns have duplicate labels and need the same cleanup.
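
Both checks can be tried on a small frame. The sketch below builds a hypothetical df2 (the data and labels are made up for illustration) in which one row label and one column label repeat, then lists the offending labels.

```python
import pandas as pd

# Hypothetical sample frame: row label "b" and column label "A" both repeat
df2 = pd.DataFrame(
    [[0, 1, 2], [3, 4, 5], [6, 7, 8]],
    index=["a", "b", "b"],
    columns=["A", "A", "B"],
)

rows_unique = df2.index.is_unique      # False, because "b" repeats
cols_unique = df2.columns.is_unique    # False, because "A" repeats

# Collect the labels that occur more than once on each axis
dup_rows = df2.index[df2.index.duplicated()].unique().tolist()
dup_cols = df2.columns[df2.columns.duplicated()].unique().tolist()

print(rows_unique, cols_unique, dup_rows, dup_cols)
```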

Method 3 – Drop rows with duplicate labels using duplicated()

First, check which index labels are repeats of an earlier label –

df2.index.duplicated()

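This returns an array of boolean values that is True for every repeat of a label. By default the first occurrence of each label is marked False.

Now let's drop the repeated rows, keeping the first occurrence of each label –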

df2.loc[~df2.index.duplicated(), :]
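
Here is a small end-to-end sketch of this method. The frame df2 is made-up sample data; the keep argument of duplicated() decides which occurrence survives, and once the index is unique reindex() works again.

```python
import pandas as pd

# Hypothetical sample frame with the row label "b" repeated
df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "b", "b"])

# keep="first" is the default: only later repeats are flagged
mask_first = df2.index.duplicated()

# Keep the first or the last row for each label
first_kept = df2.loc[~mask_first, :]
last_kept = df2.loc[~df2.index.duplicated(keep="last"), :]

# With a unique index, reindex no longer raises
reindexed = first_kept.reindex(["a", "b", "c"])

print(first_kept)
print(last_kept)
print(reindexed)
```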

Method 4 – Handling duplicates using groupby()

Using groupby() we can aggregate the rows that share a label instead of dropping them. For example, taking the mean collapses the duplicate values 0 and 1 into a single row with value 0.5 –

df2.groupby(level=0).mean()
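
To see the 0, 1 -> 0.5 case concretely, here is a minimal sketch with a made-up df2 whose label "a" appears twice. Other aggregations such as sum() or first() can be swapped in for mean() depending on what the duplicates represent.

```python
import pandas as pd

# Hypothetical sample frame: label "a" carries the values 0 and 1
df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Group on the index (level 0) and average the rows that share a label
averaged = df2.groupby(level=0).mean()

print(averaged)
```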

Method 5 – Prevent duplicate labels using the allows_duplicate_labels flag

pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)

With this flag set to False, pandas refuses duplicate labels and raises an error if any are present –

Traceback (most recent call last)
<ipython-input-19-11af4ee9738e> in <module>
----> 1 pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
DuplicateLabelError: Index has duplicates.
      positions
label          
b        [1, 2]

For a DataFrame, this applies to both row and column labels.
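
The sketch below shows the flag on a DataFrame. It assumes a pandas version that provides set_flags() and pd.errors.DuplicateLabelError (1.2 or later); the sample frames and the variable names clean, flag and rejected are illustrative.

```python
import pandas as pd

# Unique labels: the flag is accepted and stored on the object
clean = pd.DataFrame({"A": [0, 1]}, index=["x", "y"]).set_flags(
    allows_duplicate_labels=False
)
flag = clean.flags.allows_duplicate_labels

# Duplicate column labels are rejected in the same way as duplicate row labels
try:
    pd.DataFrame([[0, 1]], columns=["A", "A"]).set_flags(
        allows_duplicate_labels=False
    )
    rejected = False
except pd.errors.DuplicateLabelError:
    rejected = True

print(flag, rejected)
```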