NLP: Semi-supervised Topic Modelling using Guided LDA
- Sumeet Sinha

- Nov 23, 2021
- 5 min read

Introduction
In the current world, a large volume of data is generated, collected and shared every second. One of the major tasks we face is organising these data into various categories/tags, which in turn provides better-quality data for insight generation using various data science algorithms.
Topic modelling is one of the techniques used to solve this problem. It has many applications in everyday life, e.g. recommendations based on a user's query, reviews and profiling.
What is Topic Modelling?
Topic modelling is an unsupervised machine learning technique that is used to identify the hidden topics present in a document or any large volume of text. A topic is a word or group of words that conveys the context of the document. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modelling.
LDA's topic detection process:
Gibbs sampling is one of the techniques by which LDA learns the topics and the topic representation of each document. The steps involved are as follows:
1. The algorithm goes through each document and assigns every word to one of K topics, where K is chosen beforehand.
2. This random assignment gives the topic representations of all documents and the word distributions of all topics, which act as the starting point of an iterative improvement procedure.
3. Topic assignment improvement: for each document d, go through each word w and compute:
p(topic t | document d): the proportion of words in document d that are assigned to topic t
p(word w | topic t): the proportion of assignments to topic t, over all documents, that come from word w
Then reassign word w a new topic t_n, where topic t_n is chosen with probability p(topic t_n | document d) * p(word w | topic t_n). This is the probability, under the generative model, that topic t_n generated word w.
4. On repeating the last step a large number of times, we reach a steady state where the topic assignments are good. These assignments are then used to determine the topic mixture of each document.
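The steps above can be sketched as a toy collapsed Gibbs sampler; the corpus of word ids, the number of topics and the Dirichlet smoothing priors alpha and beta are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids over a vocabulary of size V
docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 4, 1, 3]]
K, V = 2, 5
alpha, beta = 0.1, 0.01  # smoothing priors (assumed values)

# Step 1: random initial topic assignment for every word occurrence
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]

# Step 2: count matrices derived from the assignments
nd = np.zeros((len(docs), K))   # topic counts per document
nw = np.zeros((K, V))           # word counts per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        nd[d, z[d][i]] += 1
        nw[z[d][i], w] += 1

# Step 3: repeatedly resample each word's topic from p(t|d) * p(w|t)
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            nd[d, t] -= 1; nw[t, w] -= 1          # remove current assignment
            p = (nd[d] + alpha) * (nw[:, w] + beta) / (nw.sum(axis=1) + V * beta)
            t_new = int(rng.choice(K, p=p / p.sum()))  # sample proportionally
            z[d][i] = t_new
            nd[d, t_new] += 1; nw[t_new, w] += 1
```

After enough sweeps, `nd` (normalised) gives each document's topic mixture and `nw` each topic's word distribution, exactly the quantities described in the steps above.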
Following are some practical applications of topic modelling:
1. Text categorisation: It can assist in identifying the various topics present in a collection of documents.
2. Classification: It can be used for Short Message Service (SMS) spam filtering, since the content of spam SMS differs from general messages, using the SM-BTM (Short Message Biterm Topic Model).
3. Content recommendation: It can be used to generate personalised content for readers, based on the keywords in the current article along with a group of readers with similar interests.
4. Search Engine Optimisation (SEO): It can be used to improve search algorithms. In 2018, Google added a topic layer to its Knowledge Graph to assist in identifying the most relevant content for searches.
Why is there a need for semi-supervised topic modelling?
As mentioned earlier, LDA is popularly used for topic modelling, but it is a completely unsupervised ML technique. In real-world business problems, achieving our objectives requires an amalgamation of domain knowledge from experts with the ML algorithm's ability to identify hidden patterns in unlabelled data. For example, consider a scenario where a pharmaceutical company has received tons of feedback on its products across the globe, and the company wants to exploit this information for budget allocation, brand re-imaging or business expansion.
Semi-supervised topic modelling is the right tool to assist them in this process since, with the traditional LDA process, the data science team is likely to end up with bunches of words (i.e. topics) that do not capture the essence of the information from a business perspective. In traditional LDA, topics explain only the most obvious and superficial aspects of the feedback documents. This gives a skewed impression of the corpus, leading to a situation where the topics identified by LDA do not follow the underlying topical structure of the documents.
To keep this article on topic, I am not dwelling on the exact mathematical computations behind LDA; there are various articles on LDA which you can check out. I have just explained what the algorithm lacks when it comes to processing real data. This problem is overcome by providing additional information to the model: we supply seed topics and set the algorithm to use these seed topics as a guide, making the topic modelling semi-supervised. These seeds are created using words/phrases that are highly significant to the context, e.g. for the pharmaceutical industry, seeds can be created from side effects, symptoms, medical conditions, causes of diseases, or the chemical composition of the product. Seed words that are not present in the documents can be replaced, using Stanford's GloVe embeddings, by similar words that are generally used to express the same meaning.
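The replacement of missing seed words can be sketched with cosine similarity over word embeddings. The tiny hand-made vectors below stand in for real GloVe vectors (which you would normally load via gensim, as in the code section later); the words and numbers are illustrative only:

```python
import numpy as np

# Toy embedding table standing in for GloVe vectors (assumed values)
emb = {
    "nausea":   np.array([0.90, 0.10, 0.00]),
    "sickness": np.array([0.85, 0.15, 0.05]),
    "headache": np.array([0.10, 0.90, 0.00]),
    "migraine": np.array([0.12, 0.88, 0.02]),
}

def nearest_in_vocab(word, vocab):
    """Replace a seed word with its closest in-vocabulary
    neighbour by cosine similarity."""
    v = emb[word]
    best, best_sim = None, -1.0
    for cand in vocab:
        u = emb[cand]
        sim = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        if sim > best_sim:
            best, best_sim = cand, sim
    return best

corpus_vocab = {"sickness", "migraine"}   # words actually in the documents
seeds = ["nausea", "headache"]            # seed words missing from the corpus
replaced = [nearest_in_vocab(w, corpus_vocab) for w in seeds]
print(replaced)  # nausea -> sickness, headache -> migraine
```

This keeps the domain expert's intent (the concept behind the seed) even when the exact word never appears in the corpus.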
How does semi-supervised topic modelling work?
In semi-supervised topic modelling, seeds converge towards a topic because the algorithm gives the seed words an extra boost, elevating their significance score and biasing them towards the seeded topic. The algorithm encourages the topic model to follow the seed set but does not force it. This is controlled by the seed confidence, which can be any value between 0 and 1.
When we set the seed confidence to 0, the algorithm is not biased towards the seeds at all and the results are the same as traditional LDA. If the seed confidence is 1, the algorithm is completely biased towards the terms provided as seeds. If we set the seed confidence to 0.1, we are biasing the seed words 10% more towards their seeded topics. The seed confidence can be increased when the seed terms have more supporting data, for proper guidance of the algorithm.
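The effect of the seed confidence can be illustrated with a toy initialisation step. The exact mechanics inside the guidedlda library may differ; this sketch, with the hypothetical seed words "fever" and "rash", only shows how a confidence value skews a seeded word's assignments away from the uniform baseline:

```python
import random

random.seed(9)
K = 3                                    # number of topics
seed_topics = {"fever": 0, "rash": 1}    # hypothetical word -> seeded topic
seed_confidence = 0.1

def initial_topic(word):
    """With probability seed_confidence, force a seeded word onto its
    seed topic; otherwise fall back to a uniform random topic."""
    if word in seed_topics and random.random() < seed_confidence:
        return seed_topics[word]
    return random.randrange(K)

assignments = [initial_topic("fever") for _ in range(10000)]
share = assignments.count(0) / len(assignments)
# 'fever' lands in topic 0 about 40% of the time
# (0.1 + 0.9 * 1/3), versus the uniform 1/3 baseline.
```

A confidence of 0 reduces this to plain random assignment (traditional LDA), while a confidence of 1 pins every seeded word to its seed topic, matching the behaviour described above.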
Coding:
import guidedlda
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors

# GloVe vectors (converted to word2vec format) used to substitute missing seed words
glove_input_file = 'D:/project/glove.6B/glove.6B.100d.txt'
word2vec_outputfile = 'D:/project/glove.6B.100d.txt.word2vec'

# Pre-processing: remove_stopwords, clean_text and cluster_text are
# helper functions/data defined elsewhere in the project
cleanText = [remove_stopwords(doc) for doc in cluster_text]
preprocess = [clean_text(doc) for doc in cleanText]

# Build the vocabulary and document-term matrix
vect = CountVectorizer()
vect.fit(preprocess)
vocab_terms = tuple(vect.get_feature_names())
word2id = dict((v, idx) for idx, v in enumerate(vocab_terms))
document_dtm = vect.transform(preprocess)

# Map each seed word id to its seeded topic
seed_topic_list = clean_seeds(vocab_terms, product_name)
model = guidedlda.GuidedLDA(n_topics=10, n_iter=100, random_state=9, refresh=20)
seed_topics = {}
for t_id, st in enumerate(seed_topic_list):
    for word in st:
        seed_topics[word2id[word]] = t_id

# Fit with a 20% bias towards the seeded topics
model.fit(document_dtm, seed_topics=seed_topics, seed_confidence=0.2)

# Collect the top 10 terms and their scores for every topic
topic_word = model.topic_word_
topic_dic = []
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab_terms)[np.argsort(topic_dist)][:-(10 + 1):-1]
    scores = topic_dist[np.argsort(topic_dist)][:-(10 + 1):-1]
    for j, term in enumerate(topic_words):
        topic_dic.append({'Cluster': cluster_name, 'Topic': i, 'Term': term, 'Score': scores[j]})
df_topic = pd.DataFrame(topic_dic)
Result: Sample topics obtained.
Conclusion
Guided LDA will prove to be a great asset in topic modelling, providing the relevant topics present in the data to assist you in decision making. For example, in our use case: which products are facing which issues in which geographical regions. It also surfaces the key insights present in feedback, such as the complaints customers have in general or their suggestions to improve the customer experience, which would otherwise have fallen through the cracks due to the high volume of data.
Thank you for your time, hope you enjoyed reading it.
About Author:-
Sumeet Sinha is a Data Scientist, proficient in Machine Learning, NLP, Deep Learning, Statistical Modelling and Python. He is passionate about learning and always looks forward to solving challenging analytical problems.