Innovation August 13, 2019

Machine Learning Esperanto is here

Sebastiaan Van den Branden

Data Scientist

Kenny Helsens

Artificial Intelligence Lead


Ward Van Driessche

Machine Learning Engineer

Thomas Smolders

Resident Writer

Four-fifths of the world's population master at most two languages in their lifetime, and fewer than 0.1% speak five languages fluently. But do you know someone who speaks 100 languages? Machine Learning (ML) gets pretty close to this these days through advancements in Natural Language Processing (NLP). Traditionally, however, NLP problems are solved one language at a time. Wouldn't it be great if we could extend a trained model to 100 extra languages without specifically training it in those languages? Google offers a promising solution to this problem with a state-of-the-art multilingual model that looks borderline magical, called BERT multilingual. Their solution offered a new perspective while we were building an email classifier for a client, one that automatically routes an incoming email to the right department based on its content. Belgium has three national languages, so the classifier had to handle Dutch, French and German emails to ensure usability for all Belgian users. This was the perfect opportunity to put the multilingual solution to the test.


Training ML models that can classify and analyse text is a difficult task. For social networks like Facebook, whose service largely revolves around language, NLP is a top priority. Facebook has 2.3 billion monthly active users around the globe; its service is available in 43 languages, with translations into another 60 languages in progress. This illustrates how important it is for Facebook to find scalable ways to deploy its services in foreign languages. Aside from interface translation, content moderation and sentiment analysis are other valuable services that benefit from multilingual solutions.

If we were to take a more traditional approach to a multilingual problem, we could train an algorithm in a high-resource language — a language for which we have sufficient training data — and translate each document that originates from a foreign language into this high-resource language before making predictions. This sounds like a viable solution, but the approach has some flaws. Errors in translation propagate into classification, potentially degrading performance. Latency from translation service calls further hurts the user experience, and using third-party translation services generates additional costs.

Another approach would be to train a parallel model for each language we want to support and detect the language of the input text before routing it to the right model. The downside is that we need sufficient training data for every inference language, and training multiple models increases the complexity of the pipeline, which in turn makes the solution difficult to scale.
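The parallel-models approach above can be sketched in a few lines. Everything here is a hypothetical stand-in: the word-list language detector and the per-language classifiers are placeholders for a real language-ID model and one trained classifier per language.

```python
# Minimal sketch of the per-language routing pipeline (all stand-ins).

DUTCH_WORDS = {"de", "het", "een", "bericht"}
FRENCH_WORDS = {"le", "la", "un", "message"}

def detect_language(text):
    """Toy language detector: a real system would use a language-ID model."""
    words = set(text.lower().split())
    if words & DUTCH_WORDS:
        return "nl"
    if words & FRENCH_WORDS:
        return "fr"
    return "en"  # fall back to English

# Stand-in classifiers: in practice, one separately trained model per language.
def classify_nl(text): return "billing"
def classify_fr(text): return "billing"
def classify_en(text): return "billing"

MODELS = {"nl": classify_nl, "fr": classify_fr, "en": classify_en}

def route(text):
    """Detect the language, then dispatch to that language's model."""
    lang = detect_language(text)
    return lang, MODELS[lang](text)

print(route("een bericht over facturatie"))  # ('nl', 'billing')
```

Note how every new language adds a branch: another training set, another model to maintain. That maintenance burden is exactly what a single multilingual model avoids.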


There have been many advancements in the NLP space over the last few years. A problem traditional neural networks face with sequences is that they require the dimensionality of the input and output to be known and fixed. In 2014, Google introduced a Sequence to Sequence model that can map an input to an output, where the lengths of the input and output may differ. A multilayered Long Short-Term Memory (LSTM) network maps the input sequence to a vector of fixed dimensionality, and a second deep LSTM decodes the target sequence from that vector. This is essential when, for instance, translating a sentence whose translation has a different number of words.

The introduction of attention in NLP models allowed them to concentrate on the most relevant information instead of an entire sentence: the algorithm pays attention only to the words in the input sentence it finds most important. Attention is a key step in extracting semantics from an input sentence and passing them to the decoder to, for example, improve translation accuracy.

State-of-the-art NLP models like Google's BERT take the attention approach one step further by ditching the recurrent components of NLP models, like LSTMs, in favor of stacked Multi-Head Attention and Feed-Forward layers: the Transformer architecture. Abandoning recurrence allows the attention mechanism to be parallelized, improving performance. By taking this approach, they further push the envelope on what NLP models are capable of.
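The attention mechanism at the heart of the Transformer boils down to a small computation: score each query against each key, turn the scores into weights with a softmax, and return a weighted average of the values. A minimal NumPy sketch of this scaled dot-product attention (random toy inputs, single head, no masking):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how much each query attends to each key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: a "sentence" of 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8): one context-mixed vector per token
```

Because every token's output is computed from all tokens at once with matrix products, rather than step by step as in an LSTM, the whole operation parallelizes trivially on a GPU — the property the paragraph above refers to.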


Not this Bert

As stated before, Google's most recent take on the Transformer approach is called BERT, which stands for Bidirectional Encoder Representations from Transformers: a method of pre-training language representations that obtains state-of-the-art results on a wide array of NLP tasks. BERT outperformed previous methods because it is the first unsupervised, deeply bidirectional system for pre-training NLP. Unsupervised training allowed its creators to pre-train the model on an enormous amount of plain text publicly available on the web, like Wikipedia. The bidirectional part of BERT means the model extracts a word's contextual representation from the words to both its left and its right. The result is state-of-the-art performance on a variety of datasets covering sentence-level, sentence-pair-level, word-level and span-level tasks, with almost no task-specific modifications.

So how does BERT help us with our multilingual email classification problem, you ask? Here's where BERT's multilingual models come in. Aside from single-language pre-trained models, BERT offers a multilingual model trained on 100 of the most-used languages on Wikipedia. This allows us to train an NLP task in a high-resource language and still achieve great performance in a foreign language, even though our model has never seen labelled examples in that language, as explained in this paper. Because the model can predict examples in a language it was not trained on, this is called zero-shot learning.
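The intuition behind this zero-shot transfer is that multilingual BERT places translations of the same concept close together in one shared representation space, so a classifier fitted on English representations also works on Dutch ones. A deliberately tiny analogy, with hand-crafted word vectors standing in for BERT's learned embeddings (all words, vectors and sentences below are invented for illustration):

```python
import numpy as np

# Hand-crafted "shared multilingual space": an English word and its
# Dutch counterpart are given the same vector, mimicking what
# multilingual BERT learns from data.
SPORT, FINANCE = np.array([1.0, 0.0]), np.array([0.0, 1.0])
EMBED = {
    "goal": SPORT,     "doelpunt": SPORT,     # English / Dutch
    "match": SPORT,    "wedstrijd": SPORT,
    "market": FINANCE, "markt": FINANCE,
    "shares": FINANCE, "aandelen": FINANCE,
}

def embed_sentence(words):
    """Average the word vectors, skipping out-of-vocabulary words."""
    return np.mean([EMBED[w] for w in words if w in EMBED], axis=0)

# "Fine-tune" on English examples only: one centroid per class.
train = {
    "sports":   [["goal", "match"], ["match"]],
    "business": [["market", "shares"], ["shares"]],
}
centroids = {label: np.mean([embed_sentence(s) for s in sents], axis=0)
             for label, sents in train.items()}

def classify(words):
    """Assign the class whose centroid is nearest in the shared space."""
    v = embed_sentence(words)
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))

# Zero-shot: Dutch sentences the classifier never saw still land in the
# right class, because their words live in the same space as English.
print(classify(["doelpunt", "wedstrijd"]))  # sports
print(classify(["aandelen", "markt"]))      # business
```

The real model does this with contextual representations learned from 100 Wikipedias rather than a hand-made lookup table, but the mechanism — classify in a language-agnostic space — is the same.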


How good is this approach? We wanted to put the model to the test, so we used a BBC news dataset of 2225 English news articles divided into 5 categories: business, entertainment, politics, sports and tech. We used 80% of this dataset to fine-tune the multilingual BERT model into a classifier for these 5 categories. Re-training BERT multilingual took about 15 minutes on a Google Cloud Virtual Machine (VM) with a Tesla P100 GPU. Including the time it took to set up our project, we needed only about an hour of VM time, at a total cost of roughly $0.25. We then used the remaining 20% of the dataset to test our model and achieved 97% accuracy. To test the classification capabilities on Dutch articles, we used the Google Translate API to translate this test set into Dutch and measured 92% accuracy; part of that 5-point drop may be due to machine translation errors. Finally, we took 15 Dutch news articles from the Flemish public broadcaster VRT and saw comparable results: our model predicted the right category for 14 of the 15 articles. The fact that our trained model is capable of classifying news articles in Dutch without ever having seen an annotated Dutch example is remarkable. The main takeaway is that this approach eliminates the need for sufficient training data in every language we want to predict.
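For readers who want to reproduce the bookkeeping of an experiment like this, the split and the accuracy metric are simple. The articles and labels below are random dummies (not the real BBC data) and the "model" is a placeholder; the real experiment fine-tuned multilingual BERT.

```python
import random

# Dummy stand-in for the dataset: 2225 (text, label) pairs.
random.seed(13)
LABELS = ["business", "entertainment", "politics", "sports", "tech"]
articles = [("article %d" % i, random.choice(LABELS)) for i in range(2225)]

# 80/20 train/test split, as in the experiment above.
random.shuffle(articles)
split = int(0.8 * len(articles))
train, test = articles[:split], articles[split:]
print(len(train), len(test))  # 1780 445

def accuracy(model, dataset):
    """Fraction of examples the model labels correctly."""
    correct = sum(model(text) == label for text, label in dataset)
    return correct / len(dataset)

# Sanity check with a perfect "oracle" model that looks the label up.
oracle = dict(articles)
print(accuracy(lambda text: oracle[text], test))  # 1.0
```

With the split in hand, the rest of the experiment is swapping the oracle for a fine-tuned classifier and, for the Dutch test, running `accuracy` on a translated copy of `test`.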


At In The Pocket we always strive to implement state-of-the-art ML technologies in elegant consumer products, going the extra mile to deliver the best possible user experience. Multilingual models like this make NLP problems very scalable across languages. Imagine a brand with worldwide e-commerce activities that wants to gain insights from user reviews and social media posts; doing this for every local market would be very time-consuming. With this technology, they could train a model capable of performing sentiment analysis using only English training examples, and deploy it across every other language their customers use.