Feb 9, 2018

Merging/Combining multiple CSV files with different column(s) in Python

Scenario: A friend of mine was tasked with merging/concatenating multiple CSV files. The files have different/new columns, i.e. some files have a different shape/structure (varying columns and rows). How can you merge them into one CSV file without loss of data (both columns and rows)?

Solution: My friend tried but nothing worked for him, so he asked me to give it a try. The solutions out there on the Web did not satisfactorily solve the requirement, and the code looked too complex to me for a routine merging problem; I felt there had to be a simpler solution. Although most of those solutions did combine multiple files, there was some loss of data (rows).

So, after some thought, below is a working solution:

Ensure all the CSV files are placed in a single directory on the disk and use the code below.
import pandas as pd
import os
import glob

def merge_files(files):
    # files[0] is the first CSV file to merge.
    df = pd.read_csv(files[0], encoding='latin1')
    # Iterate over the remaining file names, outer-joining each one
    # with the accumulated dataframe so no rows or columns are lost.
    for cur_csvfile in files[1:]:
        df = df.merge(pd.read_csv(cur_csvfile, encoding='latin1'), how='outer')
    return df

# Code starts here
srcpath = "D:\\Data Science\\filenames\\"
os.chdir(srcpath)
# If you need a list of filenames that all share an extension, a prefix, or any
# common substring, use glob instead of scanning the directory contents yourself.
fileNames_toMerge = glob.glob("*.csv")
mergedDf = merge_files(fileNames_toMerge)
mergedDf.to_csv("D:\\Data Science\\Full\\mergedFile.csv", index=None)

The above code was tested by merging 16 CSV files, each with approximately 10,000 rows and varying columns.
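The outer join at the heart of merge_files is what preserves both rows and columns. A minimal sketch with two hypothetical dataframes (standing in for two CSV files that share only one column):

```python
import pandas as pd

# Two small frames standing in for CSV files that share only the "id" column.
df1 = pd.DataFrame({"id": [1, 2], "a": [10, 20]})
df2 = pd.DataFrame({"id": [2, 3], "b": [200, 300]})

# With no "on" argument, merge joins on the columns the frames share ("id");
# how='outer' keeps rows that appear in only one frame, filling gaps with NaN.
merged = df1.merge(df2, how="outer")
print(merged.shape)  # (3, 3): ids 1, 2, 3 and columns id, a, b
```

The default inner join would have kept only id 2; the outer join is why no rows are dropped when the files differ.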

Please share your experiences with merging, and whether this post helped you.

Feb 2, 2018

Visualizations in R vs Python

Python and R are both among the top languages for data analysis and data modeling, but R is better for data visualization: with fewer lines of code you can produce complex visuals.

We can see this in action with the “Iris” data. The four measurements for each species are plotted as subplots. Contrast Figure-1 (R visual) with Figure-2 (Python visual).

Figure-1(R visual)

Figure-2(Python visual)
R Code
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)

dfiris<- fread("D:\\Data Science\\Data Sets\\iris.csv")
dfiris %>%
  gather(Measurement, Value, -species) %>%
  ggplot(aes(x = species, y = Value, fill = species)) +
  geom_boxplot() +
  facet_grid(Measurement ~ .)

Python Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

dfiris = pd.read_csv('D:/Data Science/Data Sets/iris.csv')
fig, axis = plt.subplots(4, 1, figsize=(10, 8))
sns.boxplot(y="petal_length", x="species", data=dfiris, orient='v', ax=axis[0])
sns.boxplot(y="petal_width", x="species", data=dfiris, orient='v', ax=axis[1])
sns.boxplot(y="sepal_length", x="species", data=dfiris, orient='v', ax=axis[2])
sns.boxplot(y="sepal_width", x="species", data=dfiris, orient='v', ax=axis[3])

As you can see, R’s ggplot2 automatically displays a polished plot with a legend, whereas in Python’s Matplotlib and seaborn you have to write additional code explicitly to show a legend and apply theming to make the plot look attractive.

Simple visuals are easy to code in both Python and R, but for complex visuals more lines of code have to be written in Python. The Python community is actively working on a port of R’s ggplot2, but it has yet to mature.
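To give a concrete sense of that extra code, here is a sketch (using a tiny hand-made dataframe rather than the iris file above, and a hypothetical output filename) of the styling that seaborn/Matplotlib leave to you: the theme, the title, and the figure layout all have to be requested explicitly.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny stand-in data; column names mirror the iris example above.
dfiris = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "petal_length": [1.4, 1.3, 4.7, 4.5],
})

sns.set_style("whitegrid")               # theme must be chosen explicitly
fig, ax = plt.subplots(figsize=(6, 4))
sns.boxplot(x="species", y="petal_length", data=dfiris, ax=ax)
ax.set_title("Petal length by species")  # title added by hand
fig.tight_layout()                       # layout adjusted by hand
fig.savefig("petal_boxplot.png")         # hypothetical output file
```

ggplot2, by contrast, bundles the theme, facet labels, and legend into its default output.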

Please share your experiences on which visualizations worked better for you in R or Python, and why.