GlotLID: Language Identification for Low-Resource Languages

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) is efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguages vs. varieties and, in general, noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model, code, and list of data sources are available: https://github.com/cisnlp/GlotLID.


Introduction
The NLP community should create technology that covers as many languages as possible, not only medium-resource and high-resource languages. This goal can only be achieved if corpora for low-resource languages are available. Web-mined datasets, including CC100 (Wenzek et al., 2020), mC4 (Xue et al., 2021) and OSCAR (Abadji et al., 2021; Ortiz Suárez et al., 2019), have made important contributions to low-resource NLP. In particular, they lay the ground for multilingual neural models like XLM-R (Conneau et al., 2020), mT5 (Xue et al., 2021) and Glot500 (ImaniGooghari et al., 2023). However, existing web-mined datasets have systematic quality issues (Kreutzer et al., 2022) and insufficient coverage of low-resource languages.
Low-quality datasets cause poor performance for downstream applications. They can also give rise to a misleading perception of progress when coverage of a low-resource language is claimed based on noisy data. NLP for low-resource languages requires high-quality datasets and high-quality datasets require high-quality LID (language identification). For this reason, high-quality LID for low-resource languages is paramount. To address this need, in this paper we present GlotLID-M, a high-quality LID that covers 1665 languages. We use ISO 639-3 to individuate languages.
When expanding the scope of LID from a few hundred to 1665 languages, the problem of granularity becomes severe. In real-world settings, LID needs to support both macrolanguages and their varieties; it also needs to be robust against out-of-model cousins (Caswell et al., 2020; Kreutzer et al., 2022). We pay particular attention to this issue.
While low-resource is our main focus, Blevins and Zettlemoyer (2022) point out that low-quality LID also affects high-resource corpora through contamination, resulting in claims of successful cross-lingual transfer that are due to unrecognized coverage of low-resource languages. We also address this issue; e.g., we improve English F1 on the "Universal Declaration of Human Rights" corpus (UDHR) to .85, compared to .43 for OpenLID.
Contributions. (i) We curate GlotLID-C, a comprehensive dataset covering 1665 languages, most of them low-resource, from a diverse set of domains. (ii) We train GlotLID-M on GlotLID-C, an open-source LID covering these 1665 languages. (iii) In our experiments, GlotLID-M outperforms several baselines by more than 12% absolute F1 on UDHR, which we take as the best benchmark for our focus on low-resource languages. (iv) When balancing F1 and false positive rate (FPR), GlotLID-M also outperforms baselines on FLORES-200, which is dominated by high-/medium-resource languages.
Requirements for low-resource LID

Main use case: Corpus creation. Corpus creation and cleaning is the main use case for our low-resource LID because we want to address the need for high-quality corpora for low-resource languages. Line-by-line LID filtering is an effective method for achieving high corpus quality. Reliable LID can eliminate various types of noise (see Caswell et al., 2020; Kreutzer et al., 2022), including data from other languages and nonlinguistic data, that is frequent, especially in web-crawled content. By adjusting the confidence threshold, users will have control over the level of quality of the corpora they create.
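To make this concrete, here is a minimal sketch of such a line-by-line filtering step. It assumes a FastText-style model binary (the file name is hypothetical) whose labels follow the __label__<code> convention; the threshold and target label are illustrative choices, not fixed parts of the method.

```python
import fasttext

# Hypothetical file name; GlotLID-M is distributed as a FastText binary.
model = fasttext.load_model("glotlid_model.bin")

def filter_lines(lines, target_label, threshold=0.7):
    """Keep only lines classified as target_label with confidence >= threshold."""
    kept = []
    for line in lines:
        labels, probs = model.predict(line.strip(), k=1)
        if labels[0] == target_label and probs[0] >= threshold:
            kept.append(line)
    return kept

# e.g., retain only high-confidence sentences for one language:
# clean = filter_lines(raw_lines, "__label__aka_Latn", threshold=0.7)
```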
Broad coverage of languages, minimize out-of-model cousin errors. We strive for as broad a coverage as is possible given available datasets. This has two benefits. First, it reduces "out-of-model cousin" errors (Caswell et al., 2020; Kreutzer et al., 2022), i.e., it reduces the risk that a language not covered is misclassified as a closely related covered language. Second, having LIDs that discriminate many low-resource languages is a prerequisite for developing NLP technologies for the largest possible number of languages. Yet many existing LIDs only cover a few hundred languages. In this study, we therefore focus on LIDs having a broad coverage, excluding CLD2 (McCandless, 2010), Equilid (Jurgens et al., 2017), Langdetect (Shuyo, 2010) and langid.py (Lui and Baldwin, 2012). These LIDs cover less than 100 languages or are outperformed by the models we compare with.
Open-source. LIDs should be open-source to encourage open collaboration and conform to best research practices. Some LIDs that meet our other requirements are not open-source, e.g., those published by Caswell et al. (2020); Bapna et al. (2022); Kudugunta et al. (2023). CLD3 (Botha et al., 2017; Salcianu et al., 2018) is freely available, but its training code is not open-source.
Ease of use. LIDs should be easily deployable across platforms and programming environments without having to worry about dependencies, compatibility and lack of maintenance.
Because of this ease-of-use requirement, we do not consider whatlang (Brown, 2014a,b) or idNet (Dunn, 2020), two broad-coverage LIDs that meet many other requirements but are hard to use in many practical scenarios due to software issues and lack of maintenance.
Uncertainty assessment. In our use cases, we would like to rely on uncertainty measures to distinguish cases where the highest-probability language is certain from those where it is not. This would allow us to choose a level of confidence for the resulting corpus. For example, we may want to retain only sentences identified with a high confidence (say, 70%). This is essential to produce high-quality low-resource corpora.
Because of this requirement, we do not consider Franc (Wormer, 2014) as a baseline. While it has many desirable properties, it generally does not provide well-calibrated probabilities. It usually returns several classes, giving 1.0 to the top class and values close to 1.0 to several others.
Efficiency. LID is easy to run in parallel, but we still need an efficient solution to make it applicable to large corpora, not least for ecological reasons.
Lack of efficiency is the reason why we do not use AfroLID (Adebara et al., 2022) as a baseline, despite its excellent coverage of African languages. AfroLID uses a transformer architecture and is less efficient than its competitors.
Granularity flexibility. When scaling LID from a few hundred languages to more than 1500, it is hardly practical to restrict the set of labels to a single level of the language hierarchy (e.g., using resources like iso639-3.sil.org). This is due to the complexity of defining and delimiting languages, including the coexistence of macrolanguages and their varieties. In many cases, we want to keep both the macrolanguage and the varieties in our label set because the varieties we have data for are important languages in their own right. But for other varieties, we do not have variety-labeled data, so the only way to include them is through the macrolanguage. For example, FLORES-200 (NLLB Team et al., 2022) covers the macrolanguage aka (Akan) and its variety twi (Twi), but not its variety fat (Fanti). Keeping both aka and twi gives flexibility to LID users: they can either differentiate aka and twi or they can consolidate the two labels to the single label aka, depending on what makes more sense in their setting.

Dataset curation
We now describe GlotLID-C, a corpus for LID training that covers 1832 languages.
Source selection. We choose sources that we deem trustworthy (i.e., high chance of correct language label). To address the domain sensitivity of LID and broaden language coverage, we curate a diverse set of text domains.
We review sources referenced by ImaniGooghari et al. (2023); Burchell et al. (2023); Blaschke et al. (2023); Adebara et al. (2022); Adebara and Abdul-Mageed (2022). In each case, we consider the collection methodology, selecting sources whose language labels are trustworthy. We generally do not use web-crawled sources to avoid the associated problems (Kreutzer et al., 2022). Most selected sources are derived from Wikipedia, religious texts, collaborative translations, storybooks, and news sites. This gives us a coverage of 1832 languages, more than any other public LID. For a list of data sources, see §3.1.
Preprocessing. We ensure that each sentence is written in the correct script, based on the writing system databases of Kargaran et al. (2023) and van Esch et al. (2022). We use the GlotScript (Kargaran et al., 2023) Python library to determine scripts. We also eliminate duplicate sentences.
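The paper relies on GlotScript for script detection; purely as an illustration of the underlying idea, the sketch below approximates the same filtering with the third-party regex module's Unicode script properties, plus a set-based deduplication step. The script names and the 0.9 ratio are our own choices, not the paper's.

```python
import regex  # pip install regex; supports \p{Script=...} properties

def majority_script_ok(sentence, script="Latin", min_ratio=0.9):
    """Accept a sentence if at least min_ratio of its letters are in `script`."""
    letters = regex.findall(r"\p{L}", sentence)
    if not letters:
        return False
    in_script = regex.findall(rf"\p{{Script={script}}}", sentence)
    return len(in_script) / len(letters) >= min_ratio

def preprocess(sentences, script="Latin"):
    """Keep deduplicated sentences written (mostly) in the expected script."""
    seen, out = set(), []
    for s in sentences:
        s = s.strip()
        if s and s not in seen and majority_script_ok(s, script):
            seen.add(s)
            out.append(s)
    return out
```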
Statistics. Our final corpus, GlotLID-C, comprises 289 million sentences (i.e., lines of data) totaling 40GB and spans 1832 languages (identified by their ISO 639-3 code). 1677 languages have more than 1000 sentences. Refer to §B for the total number of sentences per language.
Train/test split. We designate 85% of the data as GlotLID-C train. Let n_l be the number of sentences from language l in the remaining 15%. Then we sample min(1000, n_l) sentences from it. We refer to the resulting dataset as GlotLID-C test.
Contamination. To make sure our evaluation data (especially UDHR, refer to §5.2) do not overlap with our sources, we compute contamination of UDHR in GlotLID-C train.
We count a UDHR test sentence as occurring in the training set if all of its word four-grams occur in one sentence of GlotLID-C. Most of these contaminations are due to two resources: Wikipedia and Tatoeba (Tatoeba Community, 2023). GlotLID-C train shares 374 languages with UDHR.
For 292 languages, we find that none of the UDHR test sentences occurs in the training data. For 57 languages, less than 10% of UDHR test sentences occur in the training data. The remaining 25 languages with a contamination rate over 10% are all high/medium-resource languages.
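The four-gram criterion described above can be expressed directly in code. The following is a naive reading of that criterion; a real implementation would index the training data rather than scan it sentence by sentence.

```python
def word_ngrams(sentence, n=4):
    """Set of word n-grams of a sentence."""
    words = sentence.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_sentence, train_sentences, n=4):
    """A test sentence counts as contaminated if ALL of its word four-grams
    occur within a single training sentence."""
    test_grams = word_ngrams(test_sentence, n)
    if not test_grams:
        return False
    return any(test_grams <= word_ngrams(t, n) for t in train_sentences)
```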

GlotLID-M
We select FastText (Joulin et al., 2017) as the architecture for GlotLID-M because it satisfies all requirements outlined in §2, as we will explain now.
We train our FastText model GlotLID-M on GlotLID-C train with 1832 languages. FastText can easily handle the large number of languages in the corpus. Because of this broad coverage, out-of-model cousin errors are reduced. Although we restrict the number of classes to 1665 for some experiments (e.g., in Table 2), GlotLID-M's classification always uses all 1832 languages to mitigate out-of-model cousin errors. This satisfies the first requirement from §2: GlotLID-M is a useful tool for corpus creation because it has a broad coverage of languages that can occur in raw data.
FastText is easy to use: It offers a number of language bindings, making it compatible with multiple programming languages (including C++, Python, Java, Node.js, Rust, Ruby, R) and reducing dependency, incompatibility and other software issues.
FastText meets the requirement of uncertainty assessment because it provides confidence scores that can serve as thresholds to effectively mitigate noise in the data. For the same reason, FastText also supports granularity flexibility: we can accumulate probabilities over language varieties to get a good estimate of the probability of the macrolanguage. To this end, we simply add to the macrolanguage probability the probabilities of its varieties. This way, the system can return appropriate estimates at various levels of granularity.
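A minimal sketch of this accumulation, assuming predictions are available as a label-to-probability dict and a hand-built variety-to-macrolanguage mapping (both the interface and the label names are illustrative):

```python
def macro_probability(probs, variety_to_macro, macro):
    """Sum the probability mass of a macrolanguage and all of its varieties.

    probs: dict mapping language labels to predicted probabilities.
    variety_to_macro: dict mapping variety labels to their macrolanguage.
    """
    total = probs.get(macro, 0.0)
    for variety, m in variety_to_macro.items():
        if m == macro:
            total += probs.get(variety, 0.0)
    return total

# Toy numbers for the aka/twi case from Section 2:
# probs = {"aka": 0.35, "twi": 0.40, "eng": 0.05}
# macro_probability(probs, {"twi": "aka", "fat": "aka"}, "aka")  # -> 0.75
```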
As a professionally designed and implemented linear classifier, FastText is efficient: it had the best throughput of the candidate solutions we tested and can process large corpora with high speed. As a linear model, FastText has the additional advantage of delivering explainable classification decisions. FastText is a multinomial logistic classifier. The input sentence is represented as an average of n-gram embeddings. This allows us to visualize how much each n-gram contributed to the final prediction. See NLLB Team et al. (2022), Fig. 8, for details.
Taking all these requirements together (and its good LID performance demonstrated in §6 and acceptable calibration in §C), GlotLID-M, based on FastText, is, in our opinion, an excellent tool for supporting our use case, the creation of high-quality low-resource corpora.

Experimental setup
We train GlotLID-M on GlotLID-C train using the hyperparameters in (NLLB Team et al., 2022; Burchell et al., 2023) and otherwise FastText defaults (see §5.1). Following Arivazhagan et al. (2019), NLLB Team et al. (2022) and Burchell et al. (2023), we perform up-sampling for low-resource languages. Sentences from a language l representing a fraction p_l of the dataset are sampled proportionally to p_l^(1/T), where T is the temperature. Following NLLB Team et al. (2022) and Burchell et al. (2023), we set 1/T = .3.
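A small sketch of this up-sampling computation (the counts and helper name are illustrative):

```python
def sampling_weights(lang_counts, inv_T=0.3):
    """Per-language sampling probabilities proportional to p_l^(1/T)."""
    total = sum(lang_counts.values())
    scores = {l: (c / total) ** inv_T for l, c in lang_counts.items()}
    z = sum(scores.values())
    return {l: s / z for l, s in scores.items()}

# Example: a 100:1 imbalance is softened to roughly 4:1 with 1/T = 0.3.
# sampling_weights({"eng": 10000, "aka": 100})
# -> {'eng': ~0.80, 'aka': ~0.20}
```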

GlotLID-M hyperparameters
We provide the hyperparameters used to train GlotLID-M in Table 1.

Evaluation data
We evaluate GlotLID-M on GlotLID-C test, FLORES-200 (NLLB Team et al., 2022) and UDHR (Universal Declaration of Human Rights). While testing on data unseen in training is standard in NLP, the results have to be taken with a grain of salt because there is often a domain mismatch in real-world applications of LID (Caswell et al., 2020; Dunn, 2020). FLORES-200 and UDHR address this concern: they are not part of our training set (however, see discussion in §3) and do not draw on our sources. Many other benchmarks share sources like Wikipedia with us (Thoma, 2018; Haas and Derczynski, 2021; Ahmadi et al., 2023). FLORES-200 and UDHR are also the benchmarks with the broadest available language coverage. FLORES-200 is a collection of 842 articles obtained from English-language Wikimedia projects. Each sentence in the articles was translated into 204 distinct language-script combinations, corresponding to 196 distinct languages, and human-verified. It provides 997 sentences for development, 1012 for dev-test and 992 for test. FLORES-200 test is not publicly available. Following prior work, we use dev-test as our FLORES test set.
The level of granularity across language (sub)families varies in FLORES; e.g., it includes nine varieties of Arabic. On the other hand, some languages (e.g., est: Estonian) are only available as a macrolanguage. In some cases, FLORES includes both a macrolanguage and varieties, e.g., aka (Akan) and its variety twi (Twi), and zho (Chinese) and its variety yue (Yue Chinese). Although some issues have been reported with FLORES (see §A.1), we do not have the resources to investigate them, so we use it as is.
UDHR consists of more than 500 translations of the "Universal Declaration of Human Rights". 419 translations available from the "UDHR in Unicode" project have an ISO 639-3 code that is not "und" (undetermined). We discard short sentences (e.g., consisting of just an article number or the single English word 'missing') by discarding the 35% shortest sentences for each language.
In some cases (e.g., Zulu and Quechua), UDHR contains both a macrolanguage and one of its varieties. We have also seen some issues in UDHR (see §A.2), but we have not extensively investigated these potential problems.
Baselines

CLD3. CLD3 uses an n-gram (1≤n≤3) based neural network model. CLD3 sometimes deviates from established metadata conventions. For example, ISO 639-1 ku refers to kur (Kurdish), but in CLD3 ku refers to its variety kmr (Northern Kurdish). CLD3 refers to Hebrew as iw, but the ISO code for Hebrew has changed to he and heb.
FT176. FT176 is a FastText model that uses Wikipedia (WP) codes as labels. The documentation of language metadata is sometimes unclear; e.g., FT176 refers to Alemannic German as als although ISO 639-3 als is Tosk Albanian. It refers to the Malay macrolanguage as ms, but unlike ISO 639-3, this does not include ind (Indonesian).
NLLB and OpenLID. NLLB and OpenLID are FastText models. Their language label sets are mostly taken from FLORES, so granularity and coverage are similar to FLORES.
Language metadata matching. Matching the metadata of the models to the metadata of the benchmarks (FLORES, UDHR, GlotLID-C) is not easy. First, models do not consistently adhere to standard language codes. In addition, differences in granularity require matching rules. For example, if a benchmark only covers a macrolanguage and none of its varieties, then we consolidate classification decisions for the macrolanguage and its varieties into the macrolanguage label.
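An illustrative version of such a matching rule, with a hypothetical variety-to-macrolanguage mapping:

```python
def consolidate(prediction, benchmark_langs, variety_to_macro):
    """Map a predicted variety to its macrolanguage when the benchmark
    covers only the macrolanguage."""
    if prediction in benchmark_langs:
        return prediction
    macro = variety_to_macro.get(prediction)
    if macro in benchmark_langs:
        return macro
    return prediction

# If a benchmark has "aka" but not "twi", a "twi" prediction is scored as "aka":
# consolidate("twi", {"aka", "eng"}, {"twi": "aka"})  # -> "aka"
```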

Decision rule
Given an LID classifier m, a base set B of languages and a threshold θ, we assign label ϕ(s, m, B, θ) to sentence s as follows: ϕ is the language in B to which m assigns the highest probability if that probability is at least θ, and "undetermined" otherwise (see Figure 1). We distinguish two scenarios: SET! and SET?. In scenario SET!, the set of languages covered by the evaluation benchmark is known. We restrict a model's predictions to those languages that occur in the benchmark. This means that B is a (proper or improper, see table captions for details) subset of the languages occurring in the benchmark. In scenario SET?, the set of languages covered by the evaluation benchmark is not known. We do not restrict a model m's predictions: the model considers the entire set of languages it was trained on. This means that B is the set of languages that m was trained on.

Confidence thresholds. For CLD3, we use .5 and .7, the two preset thresholds in Google's CLD3 repository. For the other three baselines and GlotLID-M, we also use .5, but we use .3 as the second threshold value because .7 severely reduces the number of positive predictions for the FastText models, resulting in low F1.
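A sketch of the decision rule ϕ as we read it from Figure 1, using a plain label-to-probability dict as the model output (the interface is our simplification):

```python
def phi(probs, base_set, theta):
    """Decision rule: argmax over the base set B, or 'und' below threshold.

    probs: dict of label -> probability for sentence s under model m.
    base_set: B, the labels the model may predict (SET! restricts B to the
    benchmark's languages; SET? uses all languages the model was trained on).
    """
    candidates = {l: p for l, p in probs.items() if l in base_set}
    if not candidates:
        return "und"
    label = max(candidates, key=candidates.get)
    return label if candidates[label] >= theta else "und"
```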
Prior work has not systematically investigated the effect of confidence thresholding. However, it is of key importance for our use case of creating high-quality corpora for low-resource languages. See §5.4 and §6 for discussion of this point.

SET! scenario. When comparing LIDs m_1 and m_2 (trained on the sets of languages M_1 and M_2) on a benchmark T (supporting the set of languages B(T)), many evaluations create a subset M_1 ∩ M_2 ∩ B(T) and remove all sentences in the benchmark that are labeled with languages outside of M_1 ∩ M_2 ∩ B(T). SET! evaluation replicates this standard way of evaluating LIDs.

SET? scenario. We believe that the SET! scenario makes the LID task unrealistically easy: a portion of the data that could give rise to false positives (data not in M_1 ∩ M_2 ∩ B(T)) is removed. It is particularly unrealistic for our low-resource scenario. Instead of hundreds of languages that are not supported by all models, we have more than a thousand. We therefore run evaluations on the data for all languages, not just for M_1 ∩ M_2 ∩ B(T). That is, we run evaluations on the entire benchmark T, not on the subset in M_1 ∩ M_2 ∩ B(T). This is the SET? setting in Table 3, where SET? signifies that the LID is not given prior knowledge about which languages occur in T. For example, for the comparison of CLD3 and GlotLID-M on FLORES in the top part (SET?) of Table 3, both CLD3 and GlotLID-M are run on the entire FLORES test set. We do not exclude the languages that are present in T but are not part of M_CLD3 ∩ M_GlotLID, i.e., the languages outside of the set of 95 languages common to CLD3 and GlotLID-M.
Macro average. For a fair comparison to prior work, we restrict the macro average over languages to a subset of languages in order to replicate the experimental setup of this prior work. This subset is indicated in the tables.
Realistic evaluation for low-resource scenarios. We believe that our new evaluation setup SET? better approximates real-world situations. In cleaning pipelines, LID models are often presented with an unknown set of languages without prior knowledge. Therefore, it is crucial for an LID to have the capacity to handle unknown languages. This can be achieved by setting a threshold θ on the confidence scores. If the confidence score for a predicted label falls below the threshold, the model should label the input text as "undetermined". This reduces the risk of languages unknown to the model being incorrectly categorized as a known language (the out-of-model problem). Consequently, when comparing LIDs, it is necessary to apply each model to the entire benchmark.

Evaluation measures
Unlike some older prior work (Jauhiainen et al., 2019b), we do not use accuracy because classes are highly imbalanced. Instead, we follow recent prior work (NLLB Team et al., 2022; Burchell et al., 2023) and use F1 and false positive rate (FPR). F1 is an aggregate measure of precision and recall, both of which are important: we want accurate classification decisions (precision) and we do not want to lose too much data (recall). FPR is defined as FPR = FP / (FP + TN), where FP is the number of false positives, and TN is the number of true negatives. FPR helps us assess the potentially fatal effect of an even low false positive rate when the negative class is huge, which is the case in our scenario. For example, an FPR of .01 (which prima facie may seem ok) for a language l with base frequency .01 can result in a corpus for l that contains 50% noise, an unacceptably high level.
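The arithmetic behind this example can be checked in a few lines; perfect recall is an assumption we make for simplicity.

```python
def corpus_noise(base_rate, fpr, recall=1.0):
    """Fraction of noise in the retrieved corpus for a language with the
    given base rate, under the stated FPR and recall."""
    tp = recall * base_rate          # true positives as a fraction of the data
    fp = fpr * (1.0 - base_rate)     # false positives from the huge negative class
    return fp / (tp + fp)

# The paper's example: base rate .01 and FPR .01 give ~50% noise.
# corpus_noise(0.01, 0.01)  # -> 0.4975
```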

Results
Table 2 gives results on GlotLID-C test, UDHR and FLORES-200. GlotLID-M does not perform well on some languages. In particular, there are 167 (1832-1665) low-resource languages for which either F1 < .01 or FPR > .0005, often due to very small GlotLID-C training sets. The table gives results for "all" 1832 languages as well as for the "subset" of 1665 well-performing languages. We run GlotLID-M in two settings: θ = .0 (i.e., we choose the highest probability class no matter how low its probability is) and θ = .5 (i.e., we only assign a language label if its probability exceeds .5). See Figure 1 for the definition of our decision rule.
Focusing on the "subset" results for θ = .5,F1 is .973on GlotLID-C and .924 on FLORES; and FPR is .0002on GlotLID-C and .0010 on FLORES.This is a very good performance, in particular for the use case of low-resource corpus creation because low FPR means that the resulting corpora will be less contaminated.On UDHR, again for the "subset" results for θ = .5,F1 is .770and FPR .0006.This is again an encouragingly low FPR, but F1 is quite a bit lower than for GlotLID-C and FLORES.The reason is that we have a domain shift (compared to GlotLID-C) and many more languages (compared to FLORES), resulting in lower F1.Although the UDHR results should be improved further, we will now show that they outperform the state of the art.
Table 3 compares GlotLID-M with four baselines. We consider two evaluation settings (SET? and SET!) and three thresholds θ. The top part of the table (SET?) corresponds to the case where the set of languages in the benchmark is not known, i.e., the LID makes predictions for all languages it was trained on. In contrast, in the SET! setting (bottom part), the set of languages in the benchmark is known, and each LID only makes predictions for those languages. SET? is a more realistic setting, as we usually do not know which languages occur in a corpus that needs to be cleaned.
For the SET? setting, GlotLID-M consistently outperforms CLD3 by a large margin. Taking into account that F1 and FPR should be balanced, we also take it to outperform FT176. Even though GlotLID-M's FPR is slightly higher in some cases, its F1 is better by a large margin, so that it is clearly the better performing system.
On UDHR, GlotLID-M also clearly outperforms OpenLID and NLLB for F1 and FPR by large margins. On FLORES, F1 is slightly worse and FPR slightly better compared with OpenLID and NLLB. We point out that this comparison is not entirely fair since OpenLID and NLLB were designed with FLORES in mind. More importantly, our use case is the creation of low-resource corpora, for which UDHR is the more appropriate benchmark.
Comparing results for different thresholds, we observe that increasing θ lowers F1 (because recall is hurt) and lowers FPR (because precision is increased). This suggests that a higher threshold should be used since lower FPR will result in low-resource corpora with less contamination from high-resource languages.
For the less realistic SET! setting, GlotLID-M performs better than CLD3 and FT176 and comparably to OpenLID and NLLB. Overall, GlotLID-M clearly outperforms all baselines for the low-resource corpus creation use case.
To analyze variance of results, we ran three GlotLID experiments with different initial seeds on the 200 languages with the most data, splitting the data into 80% train and 20% test. The F1 score was .991 each time. This indicates that the variance of FastText in this task (and by extension GlotLID) is negligible.

Table 3: Evaluation of LID performance. Top ("SET?"): The set of languages is not known, i.e., each LID makes predictions for all languages it was trained on. Bottom ("SET!"): The set of languages is known: each LID only makes predictions for languages that occur in the benchmark. For the more realistic "SET?" setting, GlotLID-M outperforms the baselines on UDHR (which we take to be the best benchmark for the low-resource case) assuming a good tradeoff between FPR and F1 is desired; it either matches or outperforms them on FLORES. Let M_i be the set of languages model m_i was trained on and B(T) the set of languages covered by benchmark T. Then F1 and FPR are averages over L = M_1 ∩ M_2 ∩ B(T) when comparing models m_1 and m_2; this is indicated in the third row of the table, e.g., |L| = 96 for m_1 = CLD3, m_2 = GlotLID. θ_1 = .5 for CLD3, θ_1 = .3 for FT176, OpenLID and NLLB. θ_2 = .7 for CLD3, θ_2 = .5 for FT176, OpenLID and NLLB. Referring to Figure 1, the base set B in SET? has size 103 for CLD3, 176 for FT176, 195 for OpenLID, 211 for NLLB and 1832 for GlotLID-M (i.e., the languages the LID was trained on). For scenario SET!, B = L, i.e., B = M_1 ∩ M_2 ∩ B(T). For example, |B| = 96 (for both CLD3 and GlotLID) for the four cells in the SET! rows and the CLD3 columns in the lower left corner of the table. The best result in each column is bolded, and the second-best result is underlined.

Analysis
In this section, we analyze the GlotLID-M results summarized in Table 2 (θ=.0, "all") for our main use case, the creation of high-quality corpora. We address four questions. (i) For which languages do we get a high number of false positives? (ii) For which languages do we produce a corpus with a high contamination rate? (iii) For which languages does learning completely fail? (iv) Is it more realistic to evaluate LID on a balanced test set (as in prior work) or on one that is skewed in favor of high-resource languages?
Most errors. We first analyze languages with a high number of errors. Table 4 (top, "most errors") gives for each of the three benchmarks the five languages that have the highest number of errors (column "language"). "FP" is the number of false positives, "cl" the ratio of true positives to all positives (that is, the "cleanness" of the corpus), "top FP source" the language that contributed most of the errors and "%" is the portion of these false positives as a percentage of all false positives. We use the cl measure in our analysis because it is ultimately the measure we want to optimize to produce high-quality low-resource corpora. Note that cl (the denominator is the total number of positive sentences) is not directly related to FPR (the denominator is the number of sentences that do not belong to the language). cl is a more direct measure of the utility of the resulting corpus of a low-resource language (e.g., for training a language model) than FPR.
Most of the fifteen pairs of "conflated" languages shown in the table are closely related languages: varieties of Arabic (Standard, Najdi, Egyptian and Levantine), Persian (Iranian, Dari), Chinese (Mandarin, Yue, Wu, Hakka), English (Standard, Liberian), Quechua (Huallaga Huánuco, Huamalíes-Dos de Mayo Huánuco), Finnic (Finnish, Karelian), Slavic (Russian, Church Slavic), Bihari (Bihari, Bhojpuri) and Hindi (Standard, Awadhi). In many of these cases, speakers of one variety of the pair also have good knowledge of the other; e.g., many speakers of Arabic varieties know Standard Arabic. The two Quechua varieties are spoken in neighboring areas of Peru. The quantitatively largest use of Church Slavic (which may be reflected in the size of our corpora) is in Russia by Russian speakers.
Arabic, Chinese and English (and perhaps also Hindi, Persian and Bihari) are diglossic linguistic communities. There may be a lack of clear separation between the two conflated varieties in the available corpora because speakers switch back and forth between more formal and less formal ways of speaking depending on factors like context, audience and subject. This type of fluid switching between languages often occurs in a single sentence or conversation, i.e., it manifests as code switching. As a result, much of the text (and speech) produced in one language may be mixed with the other language. New methods will have to be developed to deal with these quite complex challenges of creating training corpora for language identification; see also (Aguilar et al., 2020).

Table 4: Analysis of the GlotLID-M runs with settings θ=.0, SET? from Table 2 and Table 3. "most errors": languages with the most false positives. "most noisy": a sample of languages with cleanness between 0 and .5. "no positives": a sample of languages without positives. "hi resource": a more realistic setting in which the distribution is skewed in favor of high-resource languages. For each "language", we give the number of false positives ("FP"), the cleanness of the resulting corpus ("cl": ratio of true positives to all positives), its most conflated language ("top FP source"), the FP contributed by that language and the ratio of the two FP numbers ("%"). To save space, we write .99 for 1.00.
Apart from these related languages, at least four conflated language pairs in Table 4 are clear errors: Mandarin/Cherokee, Russian/Gilyak, Spanish/Piaroa and Liberian English/Dinka. Similar to the situation we described for the closely related languages, Gilyak (resp. Piaroa) is spoken in an area where Russian (resp. Spanish) is the dominant official language. This means that our training corpora will need to be improved: they most likely contain many sentences labeled as Gilyak/Piaroa that are partially or completely Russian/Spanish. We leave it to future work to revisit and improve our corpus selection and preprocessing methodology to address this data quality problem.
GlotLID-M confuses Mandarin and Cherokee because our Cherokee training data do not cover the Cherokee syllabary script. Sentences written in this script are returned with a close to uniform distribution over several other scripts, including Chinese, Japanese and Thai, which explains the confusion. The Dinka test set is noisy. In a manual inspection, we found 377 sentences that are clearly English, not Dinka. Because GlotLID-M did not learn very well to discriminate English and Liberian English, 174 of these 377 sentences were classified as Liberian English.
Most noisy corpora. The second part of Table 4 ("most noisy") gives, for each benchmark, a random selection of five languages whose cleanness score cl (ratio of true positives to all positives) is in the range 0 < cl < .5. The total number of languages in this range is 9 for FLORES, 27 for UDHR and 6 for GlotLID-C. Again, most of the conflated pairs are closely related languages as in the last section. Additional pairs that occur here are Dyula/Bambara, Evenki/Orok, Croatian/Bosnian, Berber languages (Standard Moroccan Tamazight, Atlas Tamazight) and two varieties of Chorote (Iyo'wujwa, Iyojwa'ja). The resulting corpora are noisy, an issue that we will have to address in future work.
No positives. Part 3 of Table 4 ("no positives") gives five random examples from languages for which there was not a single positive classification. There were no such languages for FLORES.
For UDHR, we identified two reasons. (i) Performance on GlotLID-C is good, but poor on UDHR. Tetum is an example. The most likely cause is a domain shift or some other big train/test difference. (ii) The training set is too small (less than 30 sentences): hsn (Xiang Chinese), abk (Abkhazian), vep (Veps) and niv (Gilyak) are in this class.
For the five GlotLID-C random examples with no positives, the reason is also that the training sets were too small (less than 40 sentences): sck (Sadri), chg (Chagatai), liv (Liv), gbm (Garhwali) and tmw (Temuan). We should have set a higher threshold for the minimum size of the training corpus. Note that the number of 1665 languages that we use throughout the paper already reflects this insight. Even though we train on 1832 languages, we claim reasonable performance for only 1665 (Table 2).
Test set skewed in favor of high-resource. FLORES and UDHR test sets are balanced: high-resource and low-resource languages have about the same size. Following this model, we constructed the test set of GlotLID-C in the same way. F1 is independent of this distribution, but FPR and cleanness ("cl") are strongly dependent on it. The Spanish corpus generated by GlotLID-M on GlotLID-C test has a dismal cleanness of only .34. Is this a problem for GlotLID-M?
We believe the answer is no, as the corpora we run LID on will have a distribution skewed in favor of high-resource languages. To simulate this more realistic scenario, the last part of Table 4 ("hi resource") gives five selected languages for each benchmark where we have inflated the subsets for high-resource languages by a factor of 100. For example, instead of a single copy of the English part of FLORES, the test set now contains 100 copies.
We see in Table 4 that this results in clean corpora (cl=.99) for each of the fourteen high-resource languages shown: Standard Arabic, Hindi, Russian, Spanish (FLORES); Mandarin, Finnish, Hindi, Russian, Spanish (UDHR); Mandarin, English, Hindi, Russian, Spanish (GlotLID-C). As an example, looking at Spanish for GlotLID-C (the first and last lines in the table), the number of false positives (1952) and the number of false positives contributed by the low-resource language Piaroa (156) are the same. But since the size of Spanish is increased 100x, its cleanness improves from .34 for the unrealistic uniform distribution to .99 for the realistic skewed distribution. Thus, as we would expect, LID for high-resource languages is a relatively easy problem and this does not change much if we run a broad-coverage LID like GlotLID-M.
Conversely, LID numbers for low-resource languages can be negatively affected. The Dzongkha corpus generated from FLORES in the uniform setting has 103 false positives and a cleanness of .91 (not shown). In the skewed setting, making Tibetan a high-resource language causes 10,300 false positives from Tibetan to leak into Dzongkha, reducing its cleanness to an unacceptable .09.
This discussion suggests that the established evaluation methodology for LID is unsatisfactory. We recommend that future work consider both uniform and skewed test sets to better assess how LID is expected to perform in the real world.
This analysis demonstrates how much harder LID becomes when we cover as large and diverse a set of languages as we do. What we have shown is that there is a real danger of creating corpora that are badly contaminated. To address this, we need to develop methodologies and resources that better handle low-resource languages.
Based on the analysis described in this section, we created and open-sourced a much improved version of the UDHR test set for evaluation of LID. All UDHR results in this paper are based on the version of the UDHR test set described in §5.2.

Conclusion
We create GlotLID-C, an LID resource that covers 1832 languages, several times more than prior work. We introduce GlotLID-M, an open-source LID that covers 1665 languages with good results. The comparison of GlotLID-M against four LID baselines shows superior performance for the low-resource use case. In future research, we would like to improve the quality of our training corpora and add more low-resource languages to GlotLID. We hope GlotLID will be a valuable resource in creating higher-quality corpora for low-resource languages.

Limitations
(1) We publish the list of GlotLID-C data sources as part of this work. There is no other LID benchmark available that covers as many languages as GlotLID-C does. GlotLID-C, FLORES and UDHR all have drawbacks as evaluation datasets for LID. An LID trained on GlotLID-C train and tested on GlotLID-C test will often find the same domain in the test set as in the training set. It is well known that this results in overly optimistic evaluation numbers. FLORES and UDHR consist of data that were not originally produced in each language. Rather, they were translated from high-resource languages. The same is true to a lesser extent for GlotLID-C. Translated language is only an imperfect evaluation benchmark because it can differ greatly from natural language data, i.e., translationese is often not a good model of natural language data.
(2) Many corpora for the lowest-resource languages are derived from religious sources. It should be noted that many Bible translations do not reflect actual language use.
(3) We do not conduct hyperparameter search and instead use the hyperparameters employed by previous studies. However, conducting such a search could make our findings more robust, considering the difference in the number of languages included in our study compared to prior work.

Ethics Statement
We here highlight key ethical considerations for GlotLID.
Data. The data used in our study comes from openly available (but not necessarily freely redistributable) datasets, including resources previously published by researchers, publishers, and translators. We ensured that the data collection process complied with the licensing of each dataset.
Bias. We recognize potential biases towards higher-resource languages. We conducted a comprehensive analysis of errors and evaluated their impact on our results.
Inclusivity. We acknowledge the challenges associated with low-resource languages and have taken steps to include a diverse range of languages in our study.
Ethical Use. We have demonstrated both positive and negative outcomes of applying GlotLID-M as an LID tool. We acknowledge that GlotLID-M has a high error rate for some low-resource languages. This means that there is a potential risk of excluding low-resource languages during the collection and processing of NLP corpora.
Transparency. We provide detailed descriptions of our methodology, model architecture, and evaluation process. Additionally, we make our research artifacts, including model, code, and list of data sources, openly available to foster collaboration and reproducibility.

A Evaluation data issues

A.1 FLORES-200
There are some mistakes in the FLORES-200 dataset which have been raised by the community. For example, in a GitHub issue (https://github.com/facebookresearch/flores/issues/61), it is pointed out that yue_Hant and zho_Hant should actually be very easy to distinguish from each other, and that the Cantonese (Yue Chinese, yue_Hant) data in FLORES-200 is completely wrong.
In another issue (https://github.com/facebookresearch/flores/issues/63), it is mentioned that the Central Atlas Tamazight (tzm) data is actually in Standard Moroccan Tamazight (zgh), as confirmed by a native speaker of Central Atlas Tamazight.

A.2 UDHR
There are some mistakes in UDHR. For example, the ckb and kmr files are identical. ckb is known for the Arabic script, although it can also be written in Latin. There are also some files whose writing system is not in popular use (based on Kargaran et al. (2023) metadata):
• ckb_Latn (Arabic script is in use.)
• azb_Latn (Arabic script is in use.)
• khk_Mong (Cyrillic script is in use.)
• vie_Hani (Latin script is in use.)

B Performance of GlotLID-M per language
The list of languages used to train GlotLID-M, along with the corresponding amount of available data and detailed results for each language, can be found in Tables 5-29.

C Calibration

As stated in §2, an LID model should provide a calibrated confidence measure in addition to its prediction. Reliability diagrams illustrate model calibration (DeGroot and Fienberg, 1983; Niculescu-Mizil and Caruana, 2005). These diagrams use expected sample accuracy as a function of confidence. If the model is perfectly calibrated, then the diagram plots the identity function.
We provide the reliability diagram for GlotLID-M on GlotLID-C test in Figure 2. For GlotLID-C test, the plot is close to the identity function. However, for some of the low confidence scores, the model is not well calibrated. This mostly happens because we included so many languages in our model, and some of these languages are very similar to each other or have small training sets.
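For reference, a reliability diagram of this kind can be computed by binning predictions by confidence and comparing mean confidence with accuracy per bin. The sketch below is a generic implementation of that binning, not the paper's exact procedure.

```python
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Mean confidence vs. accuracy per equal-width confidence bin.
    A perfectly calibrated model has accuracy == mean confidence per bin."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    # Bin index in [0, n_bins-1]; confidence 1.0 falls into the top bin.
    idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((b / n_bins, (b + 1) / n_bins,
                         conf[mask].mean(), corr[mask].mean()))
    return rows  # list of (bin_lo, bin_hi, mean_confidence, accuracy)
```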

Figure 1: Decision rule for assigning classes (i.e., languages) in language identification

Table 2: Performance of GlotLID-M on GlotLID-C, UDHR and FLORES-200 test sets. Subset: restriction to an "operational" subset of languages that are either high-resource or for which GlotLID-M achieves F1 ≠ 0 and FPR ≤ .0005 on GlotLID-C test. L: intersection of GlotLID-M languages (all: 1832 or subset: 1665) and languages present in the benchmark. Referring to Figure 1, the size of the base set B is either 1832 (all) or 1665 (subset). L is the set of languages over which the macro average is computed. For example, for the last line (FLORES-200 subset), B consists of 1665 languages and the reported macro averages are computed over 177 languages.
