#281: Language Detection in Python

For our experiment with BERT and Goodreads, it did not matter whether a review was written in English or German. But not all tools are that flexible when it comes to language. Often we need to load a language-specific model, and for that we need a reliable way to detect the language. Let us explore two libraries that can help us with this task in Python.

Beware of outdated libraries

If you search for a Python library to detect the language of a text, you end up with many results. Unfortunately, most of these libraries are outdated, no longer maintained, or only remain on GitHub as archives. That is why I compare only two libraries in this post, both of which received commits in 2025.

Lingua-py

Lingua-py is a Python library designed for accurate natural language detection, able to identify languages in both short texts and mixed-language content. It supports 75 languages and works completely offline.

We can install it with this command:

uv pip install lingua-language-detector

To get a first impression of how it handles short texts, we can run this little script:

from lingua import Language, LanguageDetectorBuilder

comments = [
    "Hi",
    "Welcome",
    "Hei",
    "Hej",
    "Hallå",
    "Hola",
    "God dag",
    "Guten Tag",
    "Ich spreche",
    "Bara lite grann",
    "Sólo un poco",
    "Gewoon een beetje",
    "Un peu",
    "Parlez-vous français? ",
    "Ich spreche nur ein bisschen Französisch.",
    "A little bit is better than nothing."
]


detector = LanguageDetectorBuilder.from_all_languages().with_preloaded_language_models().build()
for comment in comments:
    print(f"\n\n{comment}")
    confidence_values = detector.compute_language_confidence_values(comment)
    top_languages = confidence_values[:3]
    for confidence in top_languages:
        print(f"{confidence.language.name}: {confidence.value:.2f}")

When we run it, we get the following output and may be surprised by the languages it comes up with for single-word texts:

Hi
MAORI: 0.06
TSONGA: 0.06
SWAHILI: 0.05


Welcome
ENGLISH: 0.33
TAGALOG: 0.06
MAORI: 0.06


Hei
MAORI: 0.10
GERMAN: 0.09
IRISH: 0.08


Hej
ALBANIAN: 0.16
YORUBA: 0.11
SLOVAK: 0.10


Hallå
SWEDISH: 0.81
BOKMAL: 0.09
NYNORSK: 0.06


Hola
SOTHO: 0.09
ZULU: 0.08
TSWANA: 0.07


God dag
DANISH: 0.10
NYNORSK: 0.09
BOKMAL: 0.08


Guten Tag
GERMAN: 0.16
BASQUE: 0.16
BOKMAL: 0.05


Ich spreche
GERMAN: 0.67
ITALIAN: 0.05
SHONA: 0.04


Bara lite grann
SWEDISH: 0.10
ITALIAN: 0.08
NYNORSK: 0.06


Sólo un poco
SPANISH: 0.40
TAGALOG: 0.11
SOTHO: 0.07


Gewoon een beetje
DUTCH: 0.90
AFRIKAANS: 0.03
DANISH: 0.01


Un peu
FRENCH: 0.13
CATALAN: 0.07
YORUBA: 0.05


Parlez-vous français?
FRENCH: 0.55
TAGALOG: 0.05
MALAY: 0.03


Ich spreche nur ein bisschen Französisch.
GERMAN: 1.00
DUTCH: 0.00
ESPERANTO: 0.00


A little bit is better than nothing.
ENGLISH: 0.47
TAGALOG: 0.07
GERMAN: 0.04

The longer the input text, the more accurate and confident the detection becomes. It is good to know that we can access confidence.value to make our own decisions.
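One way to act on these confidence values is a small decision rule that only accepts a result when the top language is clearly ahead. The following is a minimal sketch and not part of lingua-py: the helper name pick_language and both thresholds are made up for illustration, and it assumes the confidence values have already been turned into (language name, value) pairs.

```python
def pick_language(confidences, min_confidence=0.5, min_margin=0.2):
    """Return the top language name, or None if the result is too uncertain.

    confidences: list of (language_name, value) pairs. With lingua they could
    be built like this (assumes a detector as in the script above):
    [(c.language.name, c.value)
     for c in detector.compute_language_confidence_values(text)]
    """
    if not confidences:
        return None
    # Sort by confidence, highest first, so the input order does not matter.
    ranked = sorted(confidences, key=lambda pair: pair[1], reverse=True)
    best_name, best_value = ranked[0]
    runner_up_value = ranked[1][1] if len(ranked) > 1 else 0.0
    # Reject results that are weak in absolute terms ...
    if best_value < min_confidence:
        return None
    # ... or that are barely ahead of the second candidate.
    if best_value - runner_up_value < min_margin:
        return None
    return best_name


# Values copied from the output above.
guten_tag = [("GERMAN", 0.16), ("BASQUE", 0.16), ("BOKMAL", 0.05)]
gewoon = [("DUTCH", 0.90), ("AFRIKAANS", 0.03), ("DANISH", 0.01)]
print(pick_language(guten_tag))  # None
print(pick_language(gewoon))     # DUTCH
```

The thresholds need tuning for real data. With the defaults chosen here, even the English sentence at 0.47 would be rejected.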

Another great use case is to detect multiple languages in the same text. For that we can go with the example from the documentation and filter for the languages we expect:

from lingua import Language, LanguageDetectorBuilder

# Restrict the detector to the languages we expect in the text.
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
sentence = (
    "Parlez-vous français? "
    "Ich spreche nur ein bisschen Französisch. "
    "A little bit is better than nothing."
)
# Each result carries the language and the start/end index of its section.
for result in detector.detect_multiple_languages_of(sentence):
    print(f"{result.language.name}: '{sentence[result.start_index:result.end_index]}'")

All three sentences are identified correctly, and we even get back which part of the text is in which language:

FRENCH: 'Parlez-vous français? '
GERMAN: 'Ich spreche nur ein bisschen Französisch. '
ENGLISH: 'A little bit is better than nothing.'
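If the goal is to hand each section to a language-specific model, it can help to collect the sections per language first. The sketch below is only an illustration: group_by_language is a hypothetical helper, and it works on plain (language name, start index, end index) triples so that it runs without lingua. The triples are built by hand here to mirror the result above.

```python
from collections import defaultdict


def group_by_language(text, spans):
    """Collect the text sections for each language.

    spans: iterable of (language_name, start_index, end_index) triples.
    With lingua they could be built like this (assumes a detector as above):
    [(r.language.name, r.start_index, r.end_index)
     for r in detector.detect_multiple_languages_of(text)]
    """
    grouped = defaultdict(list)
    for language_name, start, end in spans:
        grouped[language_name].append(text[start:end].strip())
    return dict(grouped)


# Hand-built stand-in for the detection result shown above.
parts = [
    ("FRENCH", "Parlez-vous français? "),
    ("GERMAN", "Ich spreche nur ein bisschen Französisch. "),
    ("ENGLISH", "A little bit is better than nothing."),
]
sentence = "".join(part for _, part in parts)

spans = []
offset = 0
for language_name, part in parts:
    spans.append((language_name, offset, offset + len(part)))
    offset += len(part)

for language_name, sections in group_by_language(sentence, spans).items():
    print(language_name, sections)
```

Each list can then be passed to whatever model handles that language.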

langdetect

Langdetect is a Python library that ports Google's language-detection library from Java to Python, maintaining the original classes and methods for consistency. It supports the detection of 55 languages using their ISO 639-1 codes, such as 'en' for English and 'fr' for French. We can install it with this command:

uv pip install langdetect

To see what we can detect with langdetect, we can use this little script. Here detect returns a single language code, while detect_langs returns a list of candidate languages together with their probabilities:

from langdetect import detect, detect_langs

comments = [
    "Hi",
    "Welcome",
    "Hei",
    "Hej",
    "Hallå",
    "Hola",
    "God dag",
    "Guten Tag",
    "Ich spreche",
    "Bara lite grann",
    "Sólo un poco",
    "Gewoon een beetje",
    "Un peu",
    "Parlez-vous français? ",
    "Ich spreche nur ein bisschen Französisch.",
    "A little bit is better than nothing."
]


for comment in comments:
    print(f"\n\n{comment}")
    print(detect(comment))
    print(detect_langs(comment))


sentence = "Parlez-vous français? " + \
           "Ich spreche nur ein bisschen Französisch. " + \
           "A little bit is better than nothing."
print(f"\n\n{sentence}")
print(detect(sentence))
print(detect_langs(sentence))

As before, the shorter the text, the worse the language detection, no matter how confident langdetect is:

Hi
nl
[nl:0.9999944602981674]


Welcome
it
[it:0.7142815525185242, de:0.28571376167761736]


Hei
de
[de:0.9999963235044235]


Hej
nl
[nl:0.9999951521367719]


Hallå
sv
[sv:0.9999932220872253]


Hola
tr
[tr:0.9999956518286951]


God dag
so
[so:0.9999966770843363]


Guten Tag
de
[de:0.9999935539163833]


Ich spreche
de
[de:0.9999953395598714]


Bara lite grann
it
[it:0.7142834162838725, id:0.28571313631944445]


Sólo un poco
es
[es:0.9999958090034128]


Gewoon een beetje
nl
[nl:0.9999964156786304]


Un peu
fr
[fr:0.9999958181234762]


Parlez-vous français?
fr
[fr:0.9999969499106933]


Ich spreche nur ein bisschen Französisch.
de
[de:0.9999983074529484]


A little bit is better than nothing.
en
[en:0.9999973130917644]


Parlez-vous français? Ich spreche nur ein bisschen Französisch. A little bit is better than nothing.
de
[de:0.9999941750659501]
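Since the reported probability does not warn us about wrong answers on short inputs, one option is to skip detection below a minimum length. The sketch below is only an illustration: detect_or_none and min_words are hypothetical names, and the detection function is passed in as a parameter so that the example runs without langdetect installed. In practice langdetect.detect could be passed in its place.

```python
def detect_or_none(text, detect_fn, min_words=4):
    """Run detect_fn only if the text has at least min_words words.

    detect_fn: any callable that takes a string and returns a language code,
    for example langdetect.detect.
    """
    if len(text.split()) < min_words:
        return None  # too short to trust the result
    return detect_fn(text)


def fake_detect(text):
    """Stand-in for a real detector so the example runs on its own."""
    return "de"


print(detect_or_none("Hi", fake_detect))  # None
print(detect_or_none("Ich spreche nur ein bisschen Französisch.", fake_detect))  # de
```

A word count is a crude filter. With min_words=4 it would also skip "Gewoon een beetje" and "Sólo un poco", which were detected correctly above, so the cutoff is a trade-off that depends on the data.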

Conclusion

Both lingua-py and langdetect can reliably detect languages when we have more than just a few words. However, I really like lingua-py's ability to detect multiple languages in the same text and will use this library for my experiments.

The next idea I want to explore is text-to-speech and speech-to-text. Since we need to store some intermediate output, we should first look at how Python handles temporary files.