Feb 12, 2020

Transforming Categorical Variables into Embedding Vectors using Deep Learning

In this post, we will go through how to transform categorical variable(s) into low-dimensional embedding vectors.

The complete code with sample output is available on Github at:
https://github.com/srichallla/DeepLearning/blob/master/CategoricalToEmbeddings.ipynb

First, we will transform the categorical variable into one-hot encodings.

For simplicity, we will work with a collection of emotions in a list named "emotionsList". The code below transforms the contents of the list into one-hot encodings.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

emotionsList = ['like', 'antipathy', 'hostility', 'love', 'warmth', 'loathe', 'abhor',
     'intimacy', 'dislike', 'venom', 'affection', 'tenderness', 'animosity', 'attachment',
     'infatuation', 'fondness', 'hate']

# One-hot encode the above list
ohe = OneHotEncoder()
X = np.array(emotionsList, dtype=object).reshape(-1, 1)
transformed_X = ohe.fit_transform(X).toarray()
transformed_X  # one-hot array of shape (17, 17), mostly zeros

This transformation results in a high-dimensional sparse matrix of shape (17, 17). The size of each one-hot encoding is equal to the number of unique elements in "emotionsList". The drawbacks of this one-hot transformation are: high dimensionality, sparsity (the matrix is mostly zeros), and no capture of semantics or meaning.
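To see the sparsity for yourself, you can inspect a single encoded row and map it back to its original category. A minimal sketch using the ohe and transformed_X objects defined above:

# Each row contains exactly one 1 and sixteen 0s
print(transformed_X[0])
# The column order follows the sorted unique categories
print(ohe.categories_[0])
# Recover the original label from a one-hot row
print(ohe.inverse_transform(transformed_X[:1]))  # [['like']]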

Next, we will transform the same "emotionsList" into embedding vectors using a deep learning embedding layer in Keras with a TensorFlow backend, as below.

from tensorflow.keras.layers import Input, Embedding, Reshape
from tensorflow.keras.models import Model

# Define the embedding size: a common heuristic is half the number of
# categories, capped at 50
embedding_size = int(min(np.ceil(len(emotionsList) / 2), 50))

# Convert the categorical input to an entity embedding using a neural network
# embedding layer, reducing the dimensions from 17 to 9
input_model = Input(shape=(1,))
output_model = Embedding(len(emotionsList), embedding_size, name='emotions_embedding')(input_model)
output_model = Reshape(target_shape=(embedding_size,))(output_model)

model = Model(inputs = input_model, outputs = output_model)
model.compile(optimizer = 'Adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# The embedding layer's weights are the embeddings, so retrieve them
# (they are randomly initialized here, since the model has not been trained yet)
emotions_layer = model.get_layer('emotions_embedding')
emotions_weights = emotions_layer.get_weights()[0]
print(emotions_weights)  # embedding matrix of shape (17, 9)

The above code transforms the single categorical variable held in "emotionsList" into embedding/weight vectors of shape (17, 9); the embedding size of 9 comes from the heuristic min(ceil(17/2), 50) = 9. You can clearly see that each emotion from "emotionsList" is transformed into an embedding weight vector of size 9, as opposed to 17 with one-hot.
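To look up the embedding vector for a single emotion, you can feed its integer index through the model. A minimal sketch, assuming each emotion is encoded as its position in "emotionsList" (note the vectors are still random, since the model has not been trained):

# The embedding layer maps an integer index to its 9-dimensional vector
idx = emotionsList.index('love')
love_vector = model.predict(np.array([[idx]]))  # shape (1, 9)
print(love_vector)

# The prediction equals the corresponding row of the weight matrix
assert np.allclose(love_vector[0], emotions_weights[idx])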

The advantages of converting categorical variables to embedding vectors are: low dimensionality, dense representation, and the ability to capture semantics (once trained).
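Once the embeddings are trained, semantic closeness can be measured directly on the vectors, for example with cosine similarity. A minimal sketch (with the untrained weights above, the scores are still meaningless):

from numpy.linalg import norm

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors; closer to 1 means more similar
    return np.dot(a, b) / (norm(a) * norm(b))

v_love = emotions_weights[emotionsList.index('love')]
v_hate = emotions_weights[emotionsList.index('hate')]
print(cosine_similarity(v_love, v_hate))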

In my next blog post, I will demonstrate how to train a neural network on emotions like hate, love, etc., so that once trained, it can suggest the semantically closest emotion(s) to a given emotion.