Language Independent Text-to-Emotion Classification

Location

This project was done as an industrial software internship at Hike Pvt. Ltd., New Delhi, during May–Jul '17, the summer of my second year.

Details

The project aimed at developing a machine learning model to assign an emotion to any chat message provided as input. This would allow quick sticker recommendations to be based not only on the textual content of a message, but also on the emotional state of the user expressed in it.

This task was accomplished by developing neural network models using TensorFlow, Google's open-source machine learning library. To achieve the highest accuracy, multiple models were developed and compared, based on different architectures including CNN, RNN, and LSTM networks.
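As an illustration, a minimal LSTM-based variant of such a classifier might look like the following sketch; the vocabulary size, sequence length, and number of emotion classes are placeholder assumptions, not values from the actual project.

```python
import tensorflow as tf

VOCAB_SIZE = 20000    # assumed vocabulary size after tokenization
MAX_LEN = 50          # assumed maximum message length (in tokens)
NUM_EMOTIONS = 6      # assumed number of predefined emotion classes

# A simple LSTM classifier: embed tokens, encode the sequence,
# and map the final state to a distribution over emotion classes.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_EMOTIONS, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# x_train: int token ids of shape (num_messages, MAX_LEN)
# y_train: emotion class ids of shape (num_messages,)
# model.fit(x_train, y_train, epochs=5, batch_size=64)
```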

The strength of the model lies in its language independence: when trained on enough texts from any language, it should be able to predict the most probable emotion of any text in that language.

The project also involved gathering the data required for training this model. For this purpose, a sub-project was developed that acted as a language classifier, segregating the chat corpus into the top five chat languages in the application, namely Hindi+English, Tamil, Marathi, Gujarati and Bengali; it can easily be extended to any number of languages as required.

This segregation into the different chat languages was required because the training data had to be prepared by manually tagging each message with an emotion class from a predefined set. Notably, once trained on this prepared data, the model does not require any language classification of the input message.

The raw conversation data was obtained from a corpus of anonymized “random” text chat sequences. The language classifier was then applied to separate out the texts of each language. The classifier started with the sticker tags of the languages, processed them in multiple stages, and performed a frequency analysis through a confusion matrix to obtain a unique set of words for each language. Using these words, some texts were extracted from the corpus for each language. These texts were then run through the same process to produce a better and more extensive list of words unique to each language. Finally, using this word list, another frequency-based search over the conversations extracted texts uniquely belonging to each language.
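A minimal sketch of the core idea behind this bootstrapping step, assuming per-language word lists are already in hand; the function names, thresholds, and whitespace tokenization are illustrative, not the project's actual implementation.

```python
from collections import Counter

def unique_words(word_lists, min_count=5):
    """Given {language: [words...]}, keep words that are frequent in
    exactly one language, discarding words shared across languages."""
    counts = {lang: Counter(words) for lang, words in word_lists.items()}
    unique = {}
    for lang, counter in counts.items():
        others = set().union(*(set(c) for l, c in counts.items() if l != lang))
        unique[lang] = {w for w, n in counter.items()
                        if n >= min_count and w not in others}
    return unique

def tag_language(message, unique):
    """Assign a message to the language whose unique words it matches most."""
    tokens = set(message.lower().split())
    scores = {lang: len(tokens & words) for lang, words in unique.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Each pass of the pipeline above widens the unique word lists, which in turn lets `tag_language` claim more of the corpus with confidence.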

These texts then had to be properly “cleaned” so as to contain only relevant information. This was done using some basic Python scripting, followed by running the corpus through a Vector Space Model (VSM) that used word2vec modelling and tf-idf weights to generate word and document vectors, which were then used to remove irrelevant and repeated information. The process also involved gathering data from multiple MongoDB (NoSQL) databases.
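The deduplication part of this cleaning could look roughly like the sketch below, which substitutes scikit-learn's `TfidfVectorizer` for the tf-idf computation; this is an assumption, since the original pipeline only specifies a VSM with word2vec and tf-idf weights, and the similarity threshold is illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(texts, threshold=0.9):
    """Drop texts that are near-duplicates of an earlier text,
    judged by cosine similarity of their tf-idf document vectors."""
    vectors = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(vectors)
    keep = []
    for i in range(len(texts)):
        if all(sims[i, j] < threshold for j in keep):
            keep.append(i)
    return [texts[i] for i in keep]
```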

Finally, server-client support for the model was also developed: the user submits a text message, which the client processes and sends to the server, along with the necessary details, as a gRPC request. The server hosts multiple versions of the model, with the latest version fetched to answer each query it receives from the client. This support was added through Google's TensorFlow Serving library. The model can also be hosted on Google Cloud Machine Learning Engine, support for which can easily be implemented in the existing code.
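A minimal sketch of such a client, assuming the `tensorflow-serving-api` package and the standard TensorFlow Serving gRPC port; the model name, signature name, and tensor keys are illustrative assumptions, not the project's actual configuration.

```python
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

# Connect to a local TensorFlow Serving instance on its default gRPC port.
channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = 'emotion_model'         # served model name (assumed)
request.model_spec.signature_name = 'serving_default'
request.inputs['text'].CopyFrom(
    tf.make_tensor_proto(['how are you doing today'], shape=[1]))

# The server resolves this to the latest loaded model version.
response = stub.Predict(request, 10.0)            # 10-second timeout
print(response.outputs['scores'])                 # per-emotion probabilities (assumed output key)
```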

The TensorFlow Serving library is supported only on Linux environments, so to set up the server system, an appropriate Docker environment, including all the required dependencies, also had to be set up on the local machine.

As a simple but efficient extension, a text message that contains emojis is not analyzed via the model; instead, the emotion is taken directly from the emoji on the client, without any need to hit the server.
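A sketch of this client-side shortcut; the emoji-to-emotion mapping and function names here are illustrative placeholders.

```python
# Illustrative client-side shortcut: map known emojis straight to an
# emotion class and skip the server round-trip entirely.
EMOJI_EMOTION = {
    '😂': 'joy',
    '😢': 'sadness',
    '😡': 'anger',
    '😱': 'fear',
    '❤️': 'love',
}

def classify(message, query_server):
    """Use the emoji if present; otherwise fall back to the model server."""
    for emoji, emotion in EMOJI_EMOTION.items():
        if emoji in message:
            return emotion
    return query_server(message)
```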

Amrit Singhal
MS in Machine Learning

My research interests include machine learning, reinforcement learning and artificial intelligence.
