Topic models are useful for surfacing emergent trends in a document collection, but they are easily confused by noise, especially in the vocabulary they operate on. For example, they have no inherent way of knowing that "laptop" and "laptops" are forms of the same dictionary word.
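To make the problem concrete, here is a minimal sketch using only the standard library. The two sentences and the whitespace tokenization are illustrative assumptions, not part of the pipeline described in this post; a bag-of-words model counts surface forms in much the same way.

```python
from collections import Counter

# Two made-up sentences (illustrative only).
sample_docs = [
    "my laptop is slow",
    "new laptops are fast",
]

# Naive bag-of-words: lowercase, split on whitespace, count surface forms.
counts = Counter(token for doc in sample_docs for token in doc.lower().split())

# "laptop" and "laptops" end up as two unrelated vocabulary entries,
# so evidence for the same concept is split between them.
print(counts["laptop"], counts["laptops"])
```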
Lemmatization
The standard fix for the vocabulary issue is lemmatization, which is not a solved task for every language. In addition, full-fledged approaches often require downloading large language-specific models, sometimes gigabytes in size, which is overkill for simple text clustering.
Simplemma is a lightweight, zero-dependency alternative that handles nearly 50 languages out of the box. Here is a minimal, reproducible pipeline showing how its language-chaining feature cleans up a bilingual dataset with very little code.
Reproducible Example
Install the requirements:
pip install simplemma scikit-learn
Imagine a tiny dataset mixing English and Spanish tech terms across two topics: Data Science and Server Hardware.
docs = [
    "Data scientists are computing complex algorithms.",
    "El cálculo de algoritmos es complejo.",
    "The server hardware is overheating rapidly.",
    "Necesitamos enfriar los servidores.",
]
Instead of building complex routing logic to detect the language of each sentence, we just pass lang=('en', 'es'). The library checks its English dictionary first, then seamlessly falls back to Spanish.
import simplemma

cleaned_docs = []
for text in docs:
    tokens = simplemma.simple_tokenizer(text)
    lemmas = [
        simplemma.lemmatize(t, lang=('en', 'es'))
        for t in tokens
        if t.isalpha()
    ]
    cleaned_docs.append(" ".join(lemmas))
Impact on the Models
When the lemmatizer hits a Spanish word that it cannot resolve with the English dictionary, it falls back to the Spanish one. It reduces the plural "servidores" to "servidor" and the conjugated "necesitamos" to the base verb "necesitar".
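The fallback idea can be sketched with plain dictionaries. This is a toy illustration of language chaining under stated assumptions, not simplemma's actual implementation: `TOY_EN`, `TOY_ES`, and `chain_lemmatize` are hypothetical names, and the tiny lookup tables are hand-written.

```python
# Hand-written toy lookup tables (hypothetical, for illustration only).
TOY_EN = {"servers": "server", "scientists": "scientist"}
TOY_ES = {"servidores": "servidor", "necesitamos": "necesitar"}


def chain_lemmatize(token, dictionaries):
    """Return the lemma from the first dictionary that knows the token."""
    key = token.lower()
    for table in dictionaries:
        if key in table:
            return table[key]
    # Unknown in every table: return the token unchanged.
    return token


words = ["servers", "servidores", "necesitamos", "hardware"]
lemmas = [chain_lemmatize(w, (TOY_EN, TOY_ES)) for w in words]
print(lemmas)
```

In this sketch the order of the tables matters: a token present in both would resolve through the first one, which is why the order of the language codes is worth a moment of thought when a corpus is dominated by one language.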
Let's use sklearn's LDA algorithm for the sake of simplicity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Initialize the vectorizer (with stop words)
vectorizer = CountVectorizer(stop_words=['el', 'ser', 'be', 'de', 'the'])

# Initialize LDA for 2 topics (Data Science and Server Hardware)
lda = LatentDirichletAllocation(n_components=2, random_state=42)

# Vectorize the text and fit the model in one go
lda.fit(vectorizer.fit_transform(cleaned_docs))
feature_names = vectorizer.get_feature_names_out()

# Print the topics
for i, topic in enumerate(lda.components_):
    top_words = [feature_names[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {top_words}")

# The topics come out cleanly separated
# Small gotcha: Data -> datum
Topic 0: ['scientist', 'compute', 'datum', 'complex', 'algorithm']
Topic 1: ['overheat', 'hardware', 'server', 'rapidly', 'necesitar']
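Beyond the top words, it can help to look at which topic each document is assigned to. The sketch below is one way to do that with scikit-learn; the `lemmatized_docs` strings are hand-written stand-ins that I assume resemble lemmatized text, not actual simplemma output, and with only four sentences the assignments should be treated as a toy result.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hand-written stand-ins for lemmatized documents (assumed, not real output).
lemmatized_docs = [
    "datum scientist compute complex algorithm",
    "cálculo algoritmo complejo",
    "server hardware overheat rapidly",
    "necesitar enfriar servidor",
]

vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(lemmatized_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
# fit_transform returns one row per document and one column per topic;
# each row is a probability distribution over topics.
doc_topics = lda.fit_transform(doc_term)

for doc, weights in zip(lemmatized_docs, doc_topics):
    # argmax picks the topic with the largest share for this document.
    print(f"topic {weights.argmax()} <- {doc}")
```

This makes it easy to check whether the English and Spanish sentences about the same subject actually land in the same topic.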
By the time the cleaned text reaches LDA, the vocabulary is more compact: inflected variants of a word are counted as a single feature. This lets the model produce denser, more readable clusters without the statistical dilution caused by word variations.
Normalizing the vocabulary across both languages makes the statistical weight of the underlying concepts much clearer. Other problems remain, such as distributional imbalance between the languages, which I could address in a follow-up post.
TL;DR: If you need deep morphological parsing, use state-of-the-art packages. For unsupervised topic modeling across multilingual text, simplemma is a great choice for speed and simplicity.