Feb 10, 2021

Multi-class Classification using Bert with Keras and Tensorflow

Problem Statement

We will be implementing Multi-class classification using BERT with Keras and Tensorflow.

In the current scenario, we will classify each document/sentence into one of 8 categories labeled 0 to 7, i.e. the trained Bert model will assign each sentence in the text corpus one of the labels 0, 1, 2, 3, 4, 5, 6, or 7.

Important Details of the Solution

The complete code with output is available on my GitHub at:

https://github.com/srichallla/NLP/blob/main/Bert_kerasTF_multiclassification.ipynb

We will be using a pre-trained Bert model to extract embeddings for each sentence in the text corpus and then use these embeddings to train a text classification model. We then use this trained Bert model to classify text on an unseen test dataset.

Bert expects labels/categories to start from 0 instead of 1; otherwise the classification task may not work as expected or can throw errors. If your dataset has labels starting from 1, you should modify them. In the current dataset the labels start from 1 (1 to 8), so we remap them to start from 0 as below:

df3['label_encode'] = df3['Label'].map({'1':0,'2':1,'3':2,'4':3,'5':4,'6':5,'7':6,'8':7})

Here we are creating a new column in a dataframe to store the modified labels, instead of overwriting an existing one.

Also, the "label" column should be of type int or float. If the "label" column is of type obj OR string, it has to be converted to int or float. Else Bert will not work as expected or can throw errors. In the current dataset "label_encode" column is of type obj. So converting it into int type as below:

df3['label_encode'] = df3.label_encode.astype(int)

An important limitation of Bert is that the maximum length of each sentence/sequence is 512 tokens. Here we set it to 200: sentences shorter than 200 tokens are padded with zeros, and longer ones are truncated.

The larger the maximum sentence length, the longer the training takes.

Bert inputs to TFBertForSequenceClassification model:

1) input_ids: Token indices, i.e. numerical representations of each token in a sentence. The Bert tokenizer splits each sentence into word pieces and maps each piece to a number from the "WordPiece vocabulary" by means of a lookup, where the key is the token and the value is its numerical index.

2) token_type_ids: These are all the same, since we do not have question-answer or sentence-pair inputs. For this classification problem we treat the whole sentence as a single segment, so in our case they are all zeros.

3) attention_mask: Tells the model not to pay attention to the [PAD] tokens.

4) labels: Actual Labels (label_encode feature/column) from a given labeled dataset. Required for training and validation datasets. Not required for the test dataset, as the model should predict those.

We will use the tokenizer's encode_plus function, which produces the first three inputs for us; the fourth (labels) we handle ourselves.

tokenizer.encode_plus(sentence, 
                add_special_tokens = True, # add [CLS], [SEP]
                max_length = 200, # max length of the text that can go to BERT
                pad_to_max_length = True, # add [PAD] tokens
                return_attention_mask = True, # add attention mask to not focus on pad tokens
              )
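
Applied over the whole corpus, encode_plus can be used to build the arrays the model expects. Below is a minimal sketch; the 'text' column name is an assumption, and the argument names follow the post's transformers version (newer versions use padding='max_length' and truncation=True):

import numpy as np
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

input_ids, attention_masks = [], []
for sentence in df3['text']:   # 'text' column name is an assumption
    encoded = tokenizer.encode_plus(sentence,
                                    add_special_tokens=True,   # add [CLS], [SEP]
                                    max_length=200,            # truncate/pad to 200 tokens
                                    pad_to_max_length=True,    # add [PAD] tokens
                                    return_attention_mask=True)
    input_ids.append(encoded['input_ids'])
    attention_masks.append(encoded['attention_mask'])

input_ids = np.array(input_ids)
attention_masks = np.array(attention_masks)
labels = df3['label_encode'].values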

For multi-class classification, we must specify the number of unique categories/labels to classify the sentences/documents, 8 in our case.

model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased',num_labels=8)

If you do not specify the "num_labels", predictions on the test dataset will all be nan.

Finally, we train the Bert model on our dataset and then use the trained model to predict the class label on the test dataset.
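
A minimal sketch of those final steps, reusing the arrays built above (the test arrays, learning rate, and dict-style inputs are assumptions; the batch size and epochs follow this post):

import numpy as np
import tensorflow as tf

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),  # the model outputs logits
              metrics=['accuracy'])

model.fit({'input_ids': input_ids, 'attention_mask': attention_masks},
          labels, batch_size=6, epochs=1)

# predict on the (similarly encoded) unseen test set
preds = model.predict({'input_ids': test_input_ids, 'attention_mask': test_attention_masks})
logits = preds.logits if hasattr(preds, 'logits') else preds[0]  # return type varies across transformers versions
pred_labels = np.argmax(logits, axis=1)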

With a batch size of 6, a max sentence length of 200, and 1 epoch, we got 97% accuracy. 

Aug 5, 2020

Programmatically get Spark Performance Metrics using SparkListener in PySpark

Code for accessing Spark performance measures or metrics programmatically in PySpark using SparkListener.

# A no-op base listener that implements the Spark Java listener interface
# through the py4j callback server; subclasses override only the events they need.
class SparkListener(object):
    def onApplicationEnd(self, applicationEnd):
        pass
    def onApplicationStart(self, applicationStart):
        pass
    def onBlockManagerRemoved(self, blockManagerRemoved):
        pass
    def onBlockManagerAdded(self, blockManagerAdded):
        pass
    def onBlockUpdated(self, blockUpdated):
        pass
    def onEnvironmentUpdate(self, environmentUpdate):
        pass
    def onExecutorAdded(self, executorAdded):
        pass
    def onExecutorMetricsUpdate(self, executorMetricsUpdate):
        pass
    def onExecutorRemoved(self, executorRemoved):
        pass
    def onJobEnd(self, jobEnd):
        pass
    def onJobStart(self, jobStart):
        pass
    def onOtherEvent(self, event):
        pass
    def onStageCompleted(self, stageCompleted):
        pass
    def onStageSubmitted(self, stageSubmitted):
        pass
    def onTaskEnd(self, taskEnd):
        pass
    def onTaskGettingResult(self, taskGettingResult):
        pass
    def onTaskStart(self, taskStart):
        pass
    def onUnpersistRDD(self, unpersistRDD):
        pass
    class Java:
        implements = ["org.apache.spark.scheduler.SparkListenerInterface"]


# Prints a few task-level metrics every time a task finishes
class TaskEndListener(SparkListener): 
    def onTaskEnd(self, taskEnd): 
        print("executorRunTime : " + str(taskEnd.taskMetrics().executorRunTime()) ) 
        print("resultSize : " + str(taskEnd.taskMetrics().resultSize()) ) 
        print("executorCpuTime : " + str(taskEnd.taskMetrics().executorCpuTime()) )   

spark = init_spark("SparkPerfListeners")  # init_spark: small helper that returns a SparkSession, e.g. SparkSession.builder.appName(...).getOrCreate()
spark.sparkContext._gateway.start_callback_server()
te_listener = TaskEndListener()
spark.sparkContext._jsc.sc().addSparkListener(te_listener)
numRdd = spark.sparkContext.parallelize(range(6), 2)  # 2 partitions and 2 tasks per stage
for i in numRdd.collect(): print(i) # stage-1
sumRes = numRdd.reduce(lambda x,y : x+y) # stage-2
print("Sum : " + str(sumRes))
spark.sparkContext._gateway.shutdown_callback_server()
spark.sparkContext.stop()

Output of the above program

executorRunTime : 823
resultSize : 1483
executorCpuTime : 27362130
executorRunTime : 823
resultSize : 1483
executorCpuTime : 38787637
0
1
2
3
4
5
executorRunTime : 45
resultSize : 1418
executorCpuTime : 2543456
executorRunTime : 48
resultSize : 1418
executorCpuTime : 2636528
Sum : 15

Decoding the output for understanding

Stage-1 (Collect) Task-1 Metrics
executorRunTime : 823        (in milliseconds)
resultSize      : 1483       (in bytes)
executorCpuTime : 27362130   (in nanoseconds)

Stage-1 (Collect) Task-2 Metrics
executorRunTime : 823        (in milliseconds)
resultSize      : 1483       (in bytes)
executorCpuTime : 38787637   (in nanoseconds)

Stage-2 (Reduce) Task-1 Metrics
executorRunTime : 45         (in milliseconds)
resultSize      : 1418       (in bytes)
executorCpuTime : 2543456    (in nanoseconds)

Stage-2 (Reduce) Task-2 Metrics
executorRunTime : 48         (in milliseconds)
resultSize      : 1418       (in bytes)
executorCpuTime : 2636528    (in nanoseconds)

Here I have collected metrics for each task. You can also get aggregated metrics per stage by implementing "onStageCompleted" and the other callbacks, as sketched below.
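
For example, a stage-level listener can be registered the same way. This is a minimal sketch; the StageInfo accessors below follow the Spark 2.x Java API and may differ across versions:

class StageCompletedListener(SparkListener):
    def onStageCompleted(self, stageCompleted):
        info = stageCompleted.stageInfo()
        print("Stage name      : " + str(info.name()))
        print("Number of tasks : " + str(info.numTasks()))
        # metrics accumulated over all tasks in the stage
        print("executorRunTime : " + str(info.taskMetrics().executorRunTime()))

sc_listener = StageCompletedListener()
spark.sparkContext._jsc.sc().addSparkListener(sc_listener)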

For more TaskMetrics to report on refer to:
https://spark.apache.org/docs/2.2.0/api/java/org/apache/spark/scheduler/SparkListenerTaskEnd.html
https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/status/api/v1/TaskMetrics.html 

May 6, 2020

Building a Deep Learning Model That Suggests Closest Emotions to Given Emotion


In this post, I will demonstrate how to build a model that, given an emotion and its effect/type (positive or negative), suggests the three closest synonymous emotions.

I sometimes use an online dictionary, and that gave me an idea: why not pick some synonyms for "hate" and "love" and build a Deep Learning (DL) model using Keras with a Tensorflow backend that uses embedding vectors (to know more about entity embeddings and their advantages over one-hot encoded vectors, click here)? Below is the list of synonyms I picked from Dictionary.com.

emotionsList = ['like','antipathy','hostility','love','warmth','loathe','abhor',
                'intimacy','dislike','venom','affection','tenderness','animosity',
                'attachment','infatuation','fondness','hate']

But to build any good model we need representative data. That's when I realized I needed one more feature to help the DL model suggest better closest emotions, so I added the "emoaffect" feature below. For example, if someone says "I like you", it gives a positive impression/feeling, but if someone says "I hate you", it gives a negative feeling.

emoaffect = ['positive','negative','negative','positive','positive','negative','negative',
             'positive','negative','negative','positive','positive','negative','positive',
             'positive','positive','negative']

Here the objective is not to build the model with the best accuracy, but to learn good embeddings, so that we can use these embedding vectors to find the closest matches. We therefore treat this problem as a supervised task; the supervised task is just the means through which we train the network.
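
Once training is done, the learned embedding weights can be used directly to find the closest emotions. Below is a minimal sketch, assuming the weights are available as a matrix emotions_weights with one row per entry in emotionsList (as in the embeddings post linked above):

import numpy as np

def closest_emotions(emotion, emotionsList, emotions_weights, top_n=3):
    """Return the top_n emotions whose embedding vectors are closest (cosine similarity)."""
    idx = emotionsList.index(emotion)
    vec = emotions_weights[idx]
    # cosine similarity of the query vector against every embedding row
    norms = np.linalg.norm(emotions_weights, axis=1) * np.linalg.norm(vec)
    sims = emotions_weights @ vec / norms
    sims[idx] = -np.inf                      # exclude the query emotion itself
    best = np.argsort(sims)[::-1][:top_n]
    return [emotionsList[i] for i in best]

print(closest_emotions('hate', emotionsList, emotions_weights))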

The complete code with explanation is available on Github at:

Feb 12, 2020

Transforming Categorical Variable into Embedding Vectors using Deep Learning

In this post, we will go through how to transform categorical variable(s) into low-dimensional embedding vectors.

The complete code with sample output is available on Github at:
https://github.com/srichallla/DeepLearning/blob/master/CategoricalToEmbeddings.ipynb

First, we will transform the categorical variable into one-hot encodings.

For simplicity, we will have a collection of emotions in a list named "emotionsList". Below code transforms the contents of the list into one-hot encodings.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

emotionsList = ['like','antipathy','hostility','love','warmth','loathe','abhor',
                'intimacy','dislike','venom','affection','tenderness','animosity',
                'attachment','infatuation','fondness','hate']

# One-hot encode the above list
ohe = OneHotEncoder()
X = np.array(emotionsList, dtype=object).reshape(-1, 1)
transformed_X = ohe.fit_transform(X).toarray()
transformed_X  # one-hot encoded array of shape (17, 17) is the result

This transformation results in a high-dimensional sparse matrix of shape (17, 17); the size of the one-hot encoding equals the number of unique elements in "emotionsList". The drawbacks of this one-hot transformation are: high dimensionality, sparsity (mostly zeros), and no capture of semantics or meaning.

Next, we will transform the same "emotionsList" into embedding vectors using a deep-learning embedding layer in Keras with a Tensorflow backend, as below.

import numpy as np
from keras.layers import Input, Embedding, Reshape
from keras.models import Model

# Define embedding size (half the number of categories, capped at 50)
embedding_size = int(min(np.ceil(len(emotionsList) / 2), 50))

# Convert the categorical input to an entity embedding using a neural-network
# embedding layer, reducing the dimensions from 17 to 9
input_model = Input(shape=(1,))
output_model = Embedding(len(emotionsList), embedding_size, name='emotions_embedding')(input_model)
output_model = Reshape(target_shape=(embedding_size,))(output_model)

model = Model(inputs=input_model, outputs=output_model)
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])

# The weights of the embedding layer are the embeddings, so retrieve them
emotions_layer = model.get_layer('emotions_embedding')
emotions_weights = emotions_layer.get_weights()[0]
print(emotions_weights)  # embedding matrix of shape (17, 9)

The above code transforms the single categorical variable held in "emotionsList" into embedding (weight) vectors of shape (17, 9): each emotion is now represented by a vector of size 9, as opposed to 17 with one-hot.
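
To see the lookup in action, you can pass an emotion's integer index through the model and check that it returns the same row of the weight matrix (a small sketch using the names defined above):

import numpy as np

# index of 'love' in our small vocabulary of emotions
love_idx = emotionsList.index('love')

# the embedding layer is just a lookup table: both of these give the same 9-dim vector
from_weights = emotions_weights[love_idx]
from_model = model.predict(np.array([love_idx]))[0]

print(from_weights)
print(np.allclose(from_weights, from_model))  # True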

The advantages of converting categorical variables to embedding vectors are: low dimensionality, dense representation, and the ability to capture semantics.

In my next blog, I will come up with a post that demonstrates how to train a neural network on emotions like hate, love etc... so that once it's trained, it will suggest the semantically closest emotion(s) to the given emotion.

Jan 24, 2019

GitHub: Setup, Configuration and GitBash commands on Windows

I was working on a project for GE power. GitHub was the version control tool being used for code check-in and check-out. Here, I list the commands being used, so that it can act as a reference for me and others at a future date.

Setup and configuration for first-time use on a Windows machine using GitBash


 1) Download "GitBash" for Windows from the link below and install it:
    https://gitforwindows.org/

 2) To create and register an SSH key to sync data from a GitHub repository, follow GitHub's instructions for Windows.

 3) After step 2, open GitBash and enter the command below:
    ssh-keygen -t rsa -C youremail@company.com

 4) eval $(ssh-agent -s)

 5) ssh-add /D/Srinivas/Git_SSHKeys/id_rsa   (replace this with the path and filename you used to store the SSH keys in step 2)

 6) Open the .pub file in Notepad and copy the public key.

 7) Add the key to GitHub, following GitHub's instructions for Windows.

 8) Copy the SSH URL (not HTTPS) from GitHub's "Clone or download" button, cd to the directory/path where you want to download or sync the GitHub code (for example: cd /D/Srinivas/My_SourceCode/directoryname), and issue the command below:
    git clone <SSH URL>

End of first-time setup and configuration steps.

     Some GitHub commands you may use regularly using GitBash


 1) To create a local branch for development from the master (default) branch:
    git checkout -b YourLocalBranchName origin/master

       List of commands to push changes in your local branch to GitHub

 1) Go to a local branch other than master, where you made changes to existing files or added new files.
 2) git add .
 3) git status   (inspects the contents of the working directory and staging area)
 4) git commit -m "Your comments"
 5) git push origin YourLocalBranchName
 6) Browse to GitHub and check that your changes are now visible.

Note: You have to commit your code changes to your current local/dev branch before checking out (switching to) a different branch; otherwise your uncommitted changes are carried over to the branch you check out.

Creating a pull request using GitBash

 1) First sync your local directory with origin/upstream, using the command below:

    git fetch origin

 2) Go to GitHub and create a new pull request (follow GitHub's instructions under the "Create a pull request" subheading; a quick Google search will also find them).

To flush or revert changes to any branch

 1) git fetch origin

 2) git reset --hard origin/master   (or origin/YourBranchName for a non-master branch)

Apr 9, 2018

Working With Large Datasets

Working with very large datasets that are too big to fit into memory is an important task. Below are some of the techniques and tools to work efficiently with large datasets:-

1) Consider exploring large datasets on a machine with more RAM, a faster processor, and/or more cores. Instead of investing in physical hardware, consider the cheaper option of renting virtual machines.

2) When creating models using complex Machine Learning algorithms on huge data, write smarter code:-
 a) Vectorize your code and avoid for loops.
 b) Allocate memory efficiently by creating empty variable(s) with the appropriate number of elements.
    Example in R: var1 <- numeric(10000)  (if you know the number of elements required in advance)

3) You can work around this constraint by storing your data objects in a database and loading only selected subsets that fit in memory.

4) When exploring data or training a model, we may not need the complete dataset, so use sub-samples of the available data instead of holding the whole dataset at once. But ensure that your sub-sample is representative of the complete dataset.

5) Divide and conquer: split the data into chunks and analyze them in batches of a fixed size that fits into memory (see the pandas sketch after this list).

6) Use parallel computation as opposed to serial. Use split-apply-combine or MapReduce techniques.

7) If your data is very large, consider tools like Spark/HDFS. The Spark shell enables interactive data analysis using Python, and data analysis and machine learning are supported by the MLlib library. If the dataset is so huge that R or Pandas cannot handle it and analyzing a sub-sample is not an option for you, consider using Spark.
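
As an example of point 5, pandas can read a large CSV in fixed-size chunks and aggregate the pieces without ever holding the full file in memory. This is a minimal sketch; the file name and column name are placeholders:

import pandas as pd

total = 0
row_count = 0
# read 100,000 rows at a time instead of loading the whole file into memory
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    total += chunk['price'].sum()        # 'price' is a placeholder column name
    row_count += len(chunk)

print("Mean price over all chunks:", total / row_count)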

If you know of any other tools and techniques to work with very large datasets, feel free to comment below.

Mar 9, 2018

Case Study: Data Analysis with Hotel Prices Dataset

A data analysis case study with a hotel prices dataset. The detailed code and outputs are on GitHub at the URL below:-

https://github.com/srichallla/Analyzing_Hotels_Dataset/blob/master/HotelPrices_eda.ipynb

Let's first load the data and understand the features:-

import pandas as pd

df = pd.read_csv("bookings.csv", encoding='latin1')
df.info()

RangeIndex: 221069 entries, 0 to 221068
Data columns (total 21 columns):
room_id 221069 non-null int64
host_id 221067 non-null float64
room_type 221055 non-null object
borough 0 non-null float64
neighborhood 221069 non-null object
reviews 221069 non-null int64
overall_satisfaction 198025 non-null float64
accommodates 219750 non-null float64
bedrooms 220938 non-null float64
price 221057 non-null float64
minstay 63969 non-null float64
latitude 221069 non-null float64
longitude 221069 non-null float64
last_modified 221069 non-null object
date 221069 non-null object
survey_id 70756 non-null float64
country 0 non-null float64
city 70756 non-null object
bathrooms 0 non-null float64
name 70588 non-null object
location 70756 non-null object
dtypes: float64(12), int64(2), object(7)

We don't have a business objective for this dataset, but it's good to define one before performing exploratory data analysis. Here are some possible business objectives I could think of by looking at the dataset and its features:- 

1) Predict the hotel prices. It's a supervised learning regression problem. 

2) We can also classify prices as high, medium, or low, which makes the objective a supervised learning classification problem. 

3) We can perform hotel segmentation using clustering, to find hidden patterns or meaningful clusters.

For the current analysis, my business objective would be to predict hotel prices.

Looking at the output of df.info(), we can infer:-

a) borough, country, and bathrooms features are all nulls, and hence can be dropped.

b) host_id, room_type, overall_satisfaction, accommodates, bedrooms, and price have 2, 14, 23044, 1319, 131, and 12 missing values, respectively.

c) minstay has 157100 (more than 70%) missing values, hence we will not include it in our analysis.

d) survey_id has 150313 (more than 65%) missing values, we will drop it too.

e) city also has more than 65% missing values, and it has only two unique values: df.city.unique() gives [nan, 'Amsterdam']. So we will drop it too.

f) name and location have 150481 (68%) and 150313 (67%) missing values, so we will drop these too.


g) Finally, I will also drop the "last_modified" feature from the analysis, as I feel it's not significant for analyzing our business objective of predicting hotel prices.

Just by looking at the summary of the features, we were able to drop the unwanted ones, so we now have 221069 rows and 12 columns in the dataset.
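
A short sketch of those drops in one call (the column names follow the df.info() output above):

cols_to_drop = ['borough', 'country', 'bathrooms', 'minstay', 'survey_id',
                'city', 'name', 'location', 'last_modified']
df = df.drop(columns=cols_to_drop)
print(df.shape)  # (221069, 12)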

We will be merging the "room_id" and "host_id" and create a new feature "hotel_id". And then drop
those 2 features. Although it's better to concatenate these two features. Here I am going to add them up.

df['hotel_id'] = df['room_id'] + df['host_id']

Let's plot some Boxplots to gain insights into some of the features

Looking at box plots above, we can notice significant outliers. It's good to understand if these are indeed outliers, if so, what caused these outliers, and finally how to deal with them.

Let's look at "bedrooms":-

df.bedrooms.unique()
array([  1.,   2.,   3.,  nan,   4.,   0.,   5.,  10.,   7.,   9.,   6.,
         8.])
Looking at the above values, I feel an individual house can have 10 bedrooms, and some listings rent just cabins, so 0 bedrooms is possible.
So I don't think these are outliers, and hence we leave them as is.

Let's look at "accommodates"-
array([  2.,   4.,   6.,   3.,   1.,   5.,   8.,   7.,  16.,  12.,  14.,
         9.,  10.,  13.,  15.,  11.,  nan,  17.])
I think independently rented houses can accommodate 17 people, so we leave these as is.

Now let's look at some "Price" statistics:-
* Let's make an assumption that all "prices" in the dataset are per night in "Euros".
* With that assumption we can see that the cheapest hotel price is "10 Euros" and the costliest one is "9916 Euros" for one night.

Lets group by "neighborhood" and see if these outliers are specific to one neighborhood. As can be seen in below boxplot(s), there are outliers across many neighborhoods. And "9916 euros" per night seems to be too much. Also "10 euros" per night seems unreasonable to me. So let's remove these outliers from "price".

Also, using R, I plotted "latitude" and "longitude" on a map (see below). By looking at the data points on the map, and by verifying that "location" for all data points contains "Amsterdam", we can conclude that they all belong to one city, so there are no outliers as far as location is concerned.
We will therefore drop "latitude" and "longitude" from the dataset to reduce redundancy with the "neighborhood" feature.

R script snippet for the above map plotting


# Get the location with detailed address from Latitude and Longitude
result <- do.call(rbind,
                  lapply(1:nrow(df),
                         function(i)revgeocode(as.numeric(df[i,6:5]))))
df <- cbind(df,result)

# Now let's plot this same data on an image from Google Maps using R's ggmap
AmsterdamMap <- qmap("amsterdam", zoom = 11)        ## this fetches the Amsterdam map

AmsterdamMap +
  geom_point(aes(x = longitude, y = latitude), data = df)   ## this adds the points to it

There are various ways to treat outliers, but one approach statisticians follow is based on the inter-quartile range (IQR), where IQR = Q3 - Q1 (the 75th minus the 25th percentile, which we get from df.price.describe()). Anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR can be removed, and that's what we are going to do here.
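
A small sketch of that rule applied to the "price" column (the quartile values come from the data itself):

q1 = df['price'].quantile(0.25)
q3 = df['price'].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# keep only the prices inside the IQR fences
df = df[(df['price'] >= lower) & (df['price'] <= upper)]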

Note: In practice, whenever we make assumptions or perform data transformations like treating outliers, it's always advisable to discuss them with a domain expert.

The diagram below depicts the price boxplot after removing the outliers.

Let's see if a relationship exists between price and neighborhood. Let's get the mean price by neighborhood and plot a bar chart.
Here I am plotting the top 15 of 23 neighborhoods to prevent clutter.

Looks like hotel rates in some neighborhoods are costlier than in others; see the bar chart below.

More often than not we need to look at multiple types of visualizations to find patterns and gain insights.

So let's plot a scatterplot of price and neighborhood. But since a scatterplot cannot work on the text/string data in "neighborhood", let's map it to integer values and then plot it.

Let's create a new column 'neighborhood_num' with the mappings; we may later drop "neighborhood".
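
One simple way to create that mapping (a sketch; the notebook may use an explicit dictionary instead):

# encode each neighborhood string as an integer code in a new column
df['neighborhood_num'] = df['neighborhood'].astype('category').cat.codes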

Looking at the above plot, observe that the data points are clustered as vertical lines in the scatter plot; this is because each line belongs to one neighborhood. The regression line goes up, which indicates a positive correlation between neighborhood and price. To confirm, let's also look at a line plot.

Looks like there is mostly an upward trend in price and neighborhood (neighborhood_num). Will look at correlation scores a little later. Let's get into time series analysis.

There's a "date" column. Generally, Hotel prices are higher during weekends. We will create a new column "day_of_week" from "date" column.

In the bar chart below, average hotel prices are higher on weekends, specifically on Saturdays (day_of_week = 5). There are no samples for Sunday in this dataset, which is strange! In practice we would need to check with domain expert(s) and/or the teams responsible for maintaining the data, like the DB team.

Let's look at correlation heat map.

You can see that there's a positive correlation of 0.29 between price and neighborhood (neighborhood_num). So we are right.
But currently, we are dealing with over 200000 samples.

Note: Sometimes it's enough to do the analysis on a subset of the whole population. Let's see if the above correlations between price, neighborhood (neighborhood_num), and the other variables hold for a subset of samples. 

Since most of the hotels in a given neighbourhood have a similar per-night price, let's find the sum of reviews and the mean of overall_satisfaction and price, and merge these sum and mean values back into the original dataframe (the code is in the accompanying Jupyter notebook).

Let's see if the above correlations hold for this subset of the whole hotel population.

We can see above that most of the correlations hold for the subset of hotel samples, so we can work with the subset from now on; working with a subset of samples saves training time and storage.

From the above correlation heatmap we can find:-
a) bedrooms and accommodates have a high positive correlation of 0.62, so we can consider only one of the two for predicting prices; it's better to go with "accommodates", because it has the higher correlation of 0.45 with price.
b) price and neighborhood (neighborhood_num) have a positive correlation of 0.31.

Lets see if  "room_type" has any relationship with Price, which is our target or dependent variable.

Looking at the above bar chart, it's clearly visible that there's a positive correlation between room_type and price. To find its correlation score we need to convert "room_type" from text/string to numeric.

As we can see above, room_type (roomtype_num) and price have a positive correlation. 

After type casting price from float to int and filtering out nulls, price distribution can be seen below.

In the price distribution above, mean is greater than the median, so the distribution is right-skewed.  I am going to end my analysis here.

If anyone has a suggestion to view in a different dimension or another kind of plot(s) to gain more insights, please feel free to comment below.