Automatic Topic Classification of Research Papers using the NLP Topic Model NMF

Obianuju Okafor
May 25, 2020 · 6 min read

To keep up with innovations in research fields, literature reviews are often performed. The analysis and coding of papers during a literature review can be quite taxing, especially when the number of papers to be analyzed is large. A possible alternative is to use NLP topic modelling techniques. Topic modelling involves extracting the most representative topics occurring in a collection of documents and grouping the documents under those topics.

There are several topic modelling techniques, such as LDA, LSA, and NMF. In this experiment, I am going to use NMF to automatically classify 50 HCI-related research papers under their most representative topics. I chose NMF because it proved to be the most accurate of the three techniques on my data. The full code can be found here. I followed the steps outlined below.

Methodology

1. Import Libraries

The first thing I did was import all the necessary libraries for this experiment: 'nltk' for text preprocessing, 'gensim' for phrase detection, and 'scikit-learn' for vectorization and the NMF model.

import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.stem.wordnet import WordNetLemmatizer
import re
import gensim
import gensim.corpora as corpora
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import pyLDAvis
from pyLDAvis import sklearn as sklearn_lda

2. Load the data

The data for the selected 50 research papers had been entered into an Excel file and then converted to a CSV file called 'research_papers.csv', with columns including 'ID', 'Authors', 'Title', 'Year', 'Conference/Journal', 'Abstract', and 'Conclusion'. The first 5 rows of the data file are shown below.

I loaded the CSV file into a DataFrame called 'dataset'.

#Load data file
dataset = pd.read_csv(r'research_papers.csv', encoding='ISO-8859-1')
dataset.head()

3. Clean Data

Before proceeding, I had to clean up the data. I dropped the unnecessary columns, such as 'ID', 'Authors', 'Year', and 'Conference/Journal'. For papers with no conclusion, I filled the empty cell with the text "No conclusion". Next, I merged the two columns 'Abstract' and 'Conclusion' into a new column called 'Paper_Text'. The text of this new column is what I worked with for the rest of the experiment.

#Remove the unnecessary columns
dataset = dataset.drop(columns=['Id', 'Reference', 'Codes', 'Authors', 'Year', 'Conference/ Journal'], axis=1)
#Fill in the empty cells
dataset = dataset.fillna('No conclusion')
#Merge abstract and conclusion
dataset['Paper_Text'] = dataset["Abstract"] + dataset["Conclusion"]
#show first 5 records
dataset.head()

4. Preprocess Data

Another important step was to preprocess the data. First, I tokenized each paper's text into a list of words and removed punctuation, then I converted all tokens to lowercase and removed stop words and words of three characters or fewer. Lastly, I lemmatized each token.

#function for lemmatization
def get_lemma(word):
    lemma = wn.morphy(word)
    if lemma is None:
        return word
    else:
        return lemma
# tokenization
tokenized_data = dataset['Paper_Text'].apply(lambda x: x.split())
# Remove punctuation
tokenized_data = tokenized_data.apply(lambda x: [re.sub('[-,()\\!?]', '', item) for item in x])
tokenized_data = tokenized_data.apply(lambda x: [re.sub('[.]', ' ', item) for item in x])
# turn characters to lowercase
tokenized_data = tokenized_data.apply(lambda x: [item.lower() for item in x])
# remove stop-words
stop_words = stopwords.words('english')
stop_words.extend(['from', 'use', 'using', 'uses', 'user', 'users', 'well', 'study', 'survey', 'think'])
# remove stop-words and words of three characters or fewer
tokenized_data = tokenized_data.apply(lambda x: [item for item in x if item not in stop_words and len(item) > 3])
# lemmatize by calling the lemmatization function
tokenized_data = tokenized_data.apply(lambda x: [get_lemma(item) for item in x])
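To confirm the cleaning worked, you can print a handful of tokens from the first paper (a quick check; the output naturally depends on your own data):

# Inspect the first few cleaned, lemmatized tokens of the first paper
print(tokenized_data[0][:10])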

5. Create Bigrams and Trigrams

An additional step I took was to create bigrams and trigrams. Bigrams and trigrams are sequences of two or three words that frequently occur together in a document. Some examples from my dataset include 'block based programming', 'visually impaired', 'screen reader', 'programming language', and 'computer science'. I used Gensim's Phrases model to build and apply the bigrams and trigrams. The two important arguments to Phrases are min_count and threshold: the higher their values, the harder it is for words to be combined into bigrams or trigrams.

# Build the bigram and trigram models
bigram = gensim.models.Phrases(tokenized_data, min_count=5, threshold=10) # higher threshold, fewer phrases
trigram = gensim.models.Phrases(bigram[tokenized_data], threshold=10)
# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
# Define functions for creating bigrams and trigrams
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]
def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]
# Form Bigrams
tokenized_data_bigrams = make_bigrams(tokenized_data)
# Form Trigrams
tokenized_data_trigrams = make_trigrams(tokenized_data)
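To get a feel for what the phrase model does, you can pass a single token list through 'bigram_mod' (a small sketch; whether the words are actually joined depends on how often they co-occur in your own corpus):

# Frequently co-occurring words are joined with an underscore, e.g.
# ['block', 'based', 'programming'] may become ['block_based', 'programming']
print(bigram_mod[['block', 'based', 'programming']])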

After creating the bigrams and trigrams, I combined the tokens back into sentences. The de-tokenized, cleaned, and preprocessed text was stored in a new column of my DataFrame called 'clean_text'.

# de-tokenization: combine tokens back into sentences
detokenized_data = []
for i in range(len(dataset)):
    t = ' '.join(tokenized_data_trigrams[i])
    detokenized_data.append(t)
dataset['clean_text'] = detokenized_data
documents = dataset['clean_text']

6. Create Document-Term Matrix

This is the first step towards topic modelling. Every term and document in the dataset has to be represented as a vector. We will use sklearn's TfidfVectorizer to create a document-term matrix using only the top 1000 terms (words) from our corpus. I could have used all the terms in the corpus to build this matrix, but that would require much more computation.

#Set the number of terms
no_terms = 1000
# NMF works with a TF-IDF matrix
# Initialise the TF-IDF vectorizer with the English stop words
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, max_features=no_terms, stop_words='english')
# Fit and transform the text
document_matrix = vectorizer.fit_transform(documents)
# Get the feature (term) names
feature_names = vectorizer.get_feature_names()  # on scikit-learn >= 1.2, use get_feature_names_out()
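As a quick sanity check (a small sketch; the exact numbers depend on your corpus), you can inspect the shape of the matrix and a few of the selected terms:

# Rows correspond to papers, columns to the (at most) 1000 TF-IDF terms
print(document_matrix.shape)
# A sample of the terms that made it into the vocabulary
print(feature_names[:10])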

7. Generate topics using Topic Model

The generated document-term matrix will be decomposed into multiple matrices. We will use sklearn's decomposition model NMF to perform the matrix decomposition. The number of topics to generate is specified with the n_components parameter. The matrices produced by the topic model are a document-topic matrix and a term-topic matrix. In the term-topic matrix, sorting each row's values in descending order reveals the top terms for each topic.

#Set the number of topics and top words
no_topics = 10
no_top_words = 10
# Function for displaying topics
def display_topic(model, feature_names, num_topics, no_top_words, model_name):
    print("Model Result:")
    word_dict = {}
    for i in range(num_topics):
        # for each topic, obtain the largest values, and add the words they map to into the dictionary
        words_ids = model.components_[i].argsort()[:-no_top_words - 1:-1]
        words = [feature_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i)] = words
    topics_df = pd.DataFrame(word_dict)
    topics_df.to_csv('%s.csv' % model_name)
    return topics_df
# Apply NMF topic model to document-term matrix
# (on scikit-learn >= 1.2, use alpha_W/alpha_H instead of alpha)
nmf_model = NMF(n_components=no_topics, random_state=42, alpha=.1, l1_ratio=.5, init='nndsvd').fit(document_matrix)
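For reference, the two matrices mentioned above can be pulled out of the fitted model directly (a quick sketch; the variable names W and H are my own):

# W: document-topic matrix, one row per paper, one column per topic
W = nmf_model.transform(document_matrix)
# H: term-topic matrix, one row per topic, one column per term
H = nmf_model.components_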

To view the generated topics, we call our 'display_topic' function; it prints the 10 topics and their top 10 terms. From the top 10 terms, you can infer what a topic is about. For example, looking at 'Topic #05', we can tell that it is about the challenges blind developers face in software development environments. Every paper in the dataset falls under one of these topics, and we will use the generated topics to classify every paper in the corpus.
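The call itself looks like this (the last argument is just the file name used for the CSV the function writes; 'NMF' is an arbitrary choice):

# Print and save the 10 topics with their top 10 terms each
display_topic(nmf_model, feature_names, no_topics, no_top_words, 'NMF')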

10 generated topics and their top words

8. Classify papers under topics

Lastly, I used the 10 topics generated by the NMF model to categorize every paper in my dataset.

#Use NMF model to assign topic to papers in corpus
nmf_topic_values = nmf_model.transform(document_matrix)
dataset['NMF Topic'] = nmf_topic_values.argmax(axis=1)
#Save dataframe to csv file
dataset.to_csv('final_results.csv')
dataset.head(10)

The table below shows the topic classification of the first 10 papers in my dataset. As we can see, the first paper was categorized under 'Topic 4'. From the title and text of the paper, we can see that it relates to 'block-based programming'. Looking at the previous table showing each of the 10 topics and their top words, we can see that 'Topic 4' is about designing a block-based programming environment. This means that the paper was correctly classified by the NMF model. The same can be seen in the classification of the other papers in the dataset.

Result showing assigned topics
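If you prefer to reproduce this view in code rather than opening the CSV file, a short sketch like the one below works (it assumes the 'Title' column from the original file is still in the DataFrame):

# Show each paper's title alongside the topic number NMF assigned to it
print(dataset[['Title', 'NMF Topic']].head(10))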

If you like this post, hit the 'Buy me a coffee' button! Thanks for reading.

Every small contribution will encourage me to create more content like this.

