Converting Hierarchical Indexes to Relational Model Structure (Flatten a Binary Tree) using Python
A lot of times we have to deal with Multi-Dimensional (Cube) Databases where the data is stored in a parent-child column relationship , much like a tree (Binary Tree). But as a Data Engineer to perform ETL and store the data in SQL as a dimension or a fact table we are often required to flatten the tree into a more familiar relational model/Row-column format.
Let me give you an example:
In the above pandas Dataframe the first column contains the parent and the second column contains the child. These two columns contains data regarding the multiple levels of a Tree.
The levels of the tree are compressed into a row and column format in the above mentioned product category dataframe. The below tree is just a representation of how the actual data looks like in a Binary Tree.
This the output we want:
result = df[~df['child'].isin(df['parent'])]
result.columns = ['leaf', 'lev_1']
ix = 1
while True:
result = result.merge(df, 'left', left_on=f'lev_{ix}', right_on='child').drop(columns='child')
if (result['parent'].isna().all()):
result = result.drop(columns='parent')
break
result.loc[result['parent'].isna(), f'lev_{ix}':'parent'] = result[result['parent'].isna()][['parent', f'lev_{ix}']].values
ix += 1
result = result.rename(columns={'parent': f'lev_{ix}'})# rename and reorder columns to match your expected result
result = result.rename(columns={f'lev_{ix-i}': f'lev_{i}' for i in range(ix)}).reindex(columns=[f'lev_{i}' for i in range(ix)]
+ ['leaf'])
Here is an step by step walkthrough the whole of the code:
result = df[~df['child'].isin(df['parent'])]
The above snippet is find all those child which does not have a parent. These would come out to be the root nodes. In our case this would come out to be 5. As the number 5 is the root node of the whole tree.
result.columns = ['leaf', 'lev_1']
Then modifying the result dataframe as to name the columns as leaf and lev_1.
ix = 1
while True:
result = result.merge(df, 'left', left_on=f'lev_{ix}', right_on='child').drop(columns='child')
if (result['parent'].isna().all()):
result = result.drop(columns='parent')
break
result.loc[result['parent'].isna(), f'lev_{ix}':'parent'] = result[result['parent'].isna()][['parent', f'lev_{ix}']].values
ix += 1
result = result.rename(columns={'parent': f'lev_{ix}'})
- The above code starts with a counter counts which keeps the count of levels that the tree has. The levels is the distance/height from leaf node to the root node.
- Then we are performing a left join to merge the parents of every node adjacent to the lev_{ix}, lev_{ix+1}, lev_{ix+2}, ………up-to the level at which the tree goes.
- If the node has no parent that means it has only one node i.e,. the Root Node. Hence drop the parents column and break out of the infinite while loop.
- Then rearrange the columns to get the desired output.
I hope you got the idea how to convert the hierarchical indexes to a columns based table.