Scenario: One of my friend was tasked with merging/concatenating multiple CSV files. Files have different/new column(s) i.e. some files having different shape/structure(varying column(s) and rows). How can you merge them into one CSV file? without loss of data (both columns and rows)?
Solution: My friend tried but nothing worked for him and asked me to give it a try. Solutions out there on the Web dint satisfactorily solved the requirement, also the code looked complex for me for a routine problem of merging, and I felt there can be a simple solution. As most of the solutions though combined multiple files, there was some loss of data(rows).
So after some thought below is working solution:-
Ensure all CSV files are placed in a single directory on the disk and use below code.
Above code was tested on merging 16 CSV files, each with approx:10,000 rows with varying column(s).
Please share your experiences on merging and if this post helped you.
Solution: My friend tried but nothing worked for him and asked me to give it a try. Solutions out there on the Web dint satisfactorily solved the requirement, also the code looked complex for me for a routine problem of merging, and I felt there can be a simple solution. As most of the solutions though combined multiple files, there was some loss of data(rows).
So after some thought below is working solution:-
Ensure all CSV files are placed in a single directory on the disk and use below code.
import pandas as pd
import os
import glob
def merge_files(files):
#files[0] contains the name of the 1st csv file you want to merge.
df = pd.read_csv(files[0])
#Iterate from 2nd file to list of all file names to merge.
for cur_csvfile in files[1:]:
#And perform outer join with previous dataframe
df = df.merge(pd.read_csv(cur_csvfile,encoding='latin1'), how='outer')
return df
#Code starts here
srcpath="D:\\Data Science\\filenames\\"
os.chdir(srcpath)
#If you need a list of filenames that all have a certain extension, prefix, or any common string in the middle,
#use glob instead of writing code to scan the directory contents yourself.
fileNames_toMerge = glob.glob("*.csv")
mergedDf = merge_files(fileNames_toMerge)
mergedDf.to_csv("D:\\Data Science\\Full\\mergedFile.csv",index=None)
Above code was tested on merging 16 CSV files, each with approx:10,000 rows with varying column(s).
Please share your experiences on merging and if this post helped you.
No comments:
Post a Comment