Feb 9, 2018

Merging/Combining multiple CSV files with different column(s) in Python

Scenario: One of my friend was tasked with merging/concatenating multiple CSV files. Files have different/new column(s) i.e. some files having different shape/structure(varying column(s) and rows). How can you merge them into one CSV file? without loss of data (both columns and rows)?

Solution: My friend tried but nothing worked for him and asked me to give it a try. Solutions out there on the Web dint satisfactorily solved the requirement, also the code looked complex for me for a routine problem of merging, and I felt there can be a simple solution. As most of the solutions though combined multiple files, there was some loss of data(rows).

So after some thought below is working solution:-

Ensure all CSV files are placed in a single directory on the disk and use below code.
import pandas as pd
import os
import glob

def merge_files(files):
    #files[0] contains the name of the 1st csv file you want to merge.
    df = pd.read_csv(files[0])  
    #Iterate from 2nd file to list of all file names to merge. 
    for cur_csvfile in files[1:]:
        #And perform outer join with previous dataframe        
        df = df.merge(pd.read_csv(cur_csvfile,encoding='latin1'), how='outer')        
    return df    
        
#Code starts here
srcpath="D:\\Data Science\\filenames\\" 
os.chdir(srcpath) 
#If you need a list of filenames that all have a certain extension, prefix, or any common string in the middle,
#use glob instead of writing code to scan the directory contents yourself.
fileNames_toMerge = glob.glob("*.csv")  
mergedDf = merge_files(fileNames_toMerge) 
mergedDf.to_csv("D:\\Data Science\\Full\\mergedFile.csv",index=None)

Above code was tested on merging 16 CSV files, each with approx:10,000 rows with varying column(s).

Please share your experiences on merging and if this post helped you.

No comments:

Post a Comment