Natural Language Processing

Topic modeling

Topic modeling is an unsupervised machine learning technique for discovering the abstract "topics" that occur in a collection of documents. It analyzes which words tend to appear together in the documents and groups them into topics, where each topic is defined as a probability distribution over words.

Explanation

Topic modeling algorithms, such as Latent Dirichlet Allocation (LDA), identify underlying themes within a set of documents without requiring predefined labels. LDA treats each document as a mixture of topics and each topic as a probability distribution over words. Given only the observed words, the algorithm infers the topic distribution for each document and the word distribution for each topic. Inference is iterative: the topic assigned to each word is repeatedly re-estimated, which in turn updates the per-document topic proportions and the per-topic word probabilities, until the estimates converge.

These models are useful for document classification, information retrieval, and exploring large collections of text. They automatically organize and summarize a corpus, revealing thematic structure that is hard to see through manual inspection. The quality of a topic model is often evaluated with perplexity (how well the model predicts held-out text; lower is better) and topic coherence (how semantically related the top words of each topic are; higher is better).
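The iterative refinement described above can be illustrated with a small sketch. The text does not say which inference method is meant, so this example assumes collapsed Gibbs sampling, one common choice for LDA (variational inference is another). The function name fit_lda, the hyperparameters alpha and beta, and the toy documents are all hypothetical and chosen only for illustration; a real project would normally use an established library instead.

```python
import random

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of token lists. Returns (theta, phi, vocab), where
    theta[d][k] is the share of topic k in document d and phi[k][w] is
    the probability of vocabulary word w under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables that the sampler keeps up to date.
    doc_topic = [[0] * num_topics for _ in docs]       # tokens in doc d with topic k
    topic_word = [[0] * V for _ in range(num_topics)]  # times word w has topic k
    topic_total = [0] * num_topics                     # tokens assigned to topic k

    # Start from a random topic for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Take the token's current assignment out of the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Weight of each topic given every other token's assignment.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and put the token back into the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Turn the final counts into smoothed probability estimates.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + V * beta) for c in topic_word[k]]
        for k in range(num_topics)
    ]
    return theta, phi, vocab

# Hypothetical toy corpus: two documents about pets, two about finance.
docs = [
    "cat dog pet cat vet".split(),
    "dog pet vet dog cat".split(),
    "stock market trade stock bank".split(),
    "bank trade market bank stock".split(),
]
theta, phi, vocab = fit_lda(docs, num_topics=2)
for k, row in enumerate(phi):
    top_words = [w for _, w in sorted(zip(row, vocab), reverse=True)[:3]]
    print("topic", k, top_words)
```

Each row of theta is a document's mixture of topics and each row of phi is a topic's distribution over words, matching the two quantities LDA infers. The sketch runs a fixed number of sweeps rather than testing for convergence, and results on such a tiny corpus depend on the random seed.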

Related Terms