AI gets a better score than 80% of Swedes* on our test for language understanding

*WIP, preliminary results

Here are a few demos where you can try things on your own


Model thinks: lärare vid högskola


Model thinks: prokrastinera – aktivt


Model thinks: Avsaknaden av forskning som har konkret koppling till lärarnas undervisningssituation.


Deep Neural Networks on Högskoleprovet: testing AI on the language understanding required for higher education

The breakthrough in AI: Transformers

In September 2019, I woke up high as a kite after a collarbone operation. I looked at my phone and saw in my feed that there was a new model (ALBERT, A Lite BERT) that beat the average person on reading comprehension tasks in English. This made my high head spin even more. What were the possibilities? What if I were doing this? I have to do this!

The breakthrough came from a model created for machine translation, the Transformer, and its successors in the form of GPT and BERT. The new models completely shattered many of the leaderboards that researchers "compete" on for various tasks. I had done some experiments in English, but it was Swedish that interested me most, and there were no good models for it.

In November, somewhat recovered, I started a journey to explore what the latest breakthroughs in artificial neural networks for language understanding meant for my language, Swedish. What fascinated me most was not the high results on text and sentence classification, but that the models managed to do tests of reading comprehension. Although the implementations are incredibly similar on a technical level, it is very exciting that a different type of reading comprehension than ours seems to have arisen, one that can still "reason" (pattern match?) over texts and answer questions. What did this mean to me? How does it work? What does it output? Are there any similarities with how we do it?

What made it possible to beat the average Swede?


The new models were trained on huge amounts of data compared with my earlier experiments, which mostly used Swedish Wikipedia, only a fraction of the size of English Wikipedia. The models that came out were trained on 16GB, 38GB, 140GB and more. In the past, people often trained on individual lines or sentences, but now it was full texts. I had to rebuild my entire dataset from scratch. No small task to do alone. All in all, I came up with about 100GB. If you assume that 1MB is 250 A4 pages, the text would fill 25 million printed sheets. Placing them one after the other, you would get from Stockholm to Seoul, South Korea. Assuming that it takes a human an average of 1.7 minutes to read a page, it would take over 80 years of constant reading. And this is for the small models.
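The back-of-the-envelope figures above can be checked with a few lines of arithmetic. This is only a sketch under the stated assumptions (250 pages per MB, 1.7 minutes per page); the 1 GB = 1000 MB conversion and the 0.297 m length of an A4 sheet are my own additions.

```python
# Back-of-the-envelope check of the corpus-size figures in the text.
CORPUS_GB = 100            # corpus size from the text
PAGES_PER_MB = 250         # assumption stated in the text
MINUTES_PER_PAGE = 1.7     # assumption stated in the text
A4_LENGTH_M = 0.297        # long side of an A4 sheet in metres (added here)

pages = CORPUS_GB * 1000 * PAGES_PER_MB          # 1 GB taken as 1000 MB
distance_km = pages * A4_LENGTH_M / 1000         # sheets laid end to end
reading_years = pages * MINUTES_PER_PAGE / (60 * 24 * 365)

print(f"{pages:,} pages")
print(f"{distance_km:,.0f} km laid end to end")
print(f"{reading_years:.1f} years of nonstop reading")
```

With these inputs the sheet count comes out at 25 million and the reading time at a little over 80 years, matching the numbers in the text.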

Fortunately, there are a lot of resources available. The largest is CommonCrawl, which contains dumps of the internet. After cleaning the data, I ended up with about 16GB from my first dump. The Riksdag (the Swedish parliament) shares a lot of texts, and these were added quickly and easily. OPUS, with its movie subtitles, and Wikipedia went in as well. Later, OSCAR was added. Together, these sources make up the roughly 100GB.

Training these new large models requires a lot of computing power. On my small workstation with an RTX2080 graphics card, it would probably take several years to train a model really well. To be able to train, a supercomputer was needed. Google has developed a new type of hardware, the Tensor Processing Unit (TPU), that is specifically designed for training models. These are provided free of charge through the TensorFlow Research Cloud. As a base member of the TensorFlow Research Cloud, you get access to a bunch of TPUs, enough to run "small-scale" experiments by the standards of this world. It is thanks to TPU access that I have been able to do this.


Internationally, research in NLP (Natural Language Processing) is incredibly open. Everything moves like a runaway locomotive, and it is terribly hectic to keep up with. Fortunately, a lot of models, data and code are shared. Even companies like Google share a lot and are a big part of why so much is happening right now. This suits me perfectly, since I love to pick things apart instead of endlessly repeating the basics until they stick without ever seeing a nice end result.

What should I start disassembling? I asked the people behind ALBERT if I could see their code for how they solved RACE (reading comprehension tasks from English exams for Chinese middle and high school students), and two days later a couple of files landed on GitHub. The grind begins ...

How do you teach an AI Swedish?


En bilfärja lämnade lilla varholmen
[ "en", "bil", "##fär", "##ja", "lämnade", "lilla", "var", "##holmen" ]

Does it see words? Letters? The model sees "tokens"; a token can be a letter, part of a word or a whole word. Dividing texts into tokens has its pros and cons. The biggest advantage of these models, and their biggest problem, is that they look at how all the tokens in a text affect each other ("attention"). If every token were a single letter, the example above would be 31 tokens, and longer texts would require a lot more compute. If we instead use whole words, it is only 5. But if we store every word, we get a gigantic vocabulary, and we also miss context that the word parts could provide. With the current tokenization we get 8, a better trade-off. For example, the model can guess that ["any token", "##holmen"] is about a place by the water even if it has never seen that combination before.
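To make the split above concrete, here is a minimal sketch of a greedy longest-match sub-word tokenizer. I am assuming a WordPiece-style scheme, which the "##" prefix suggests; the tiny vocabulary is hypothetical and contains only the pieces from the example, whereas a real vocabulary is learned from the corpus and has tens of thousands of entries.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into sub-word tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a piece inside a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # nothing in the vocabulary fits
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary: just the pieces from the example sentence.
VOCAB = {"en", "bil", "##fär", "##ja", "lämnade", "lilla", "var", "##holmen"}


def tokenize(text, vocab=VOCAB):
    """Lowercase, split on whitespace, then split each word into pieces."""
    return [tok for word in text.lower().split() for tok in wordpiece(word, vocab)]


sentence = "En bilfärja lämnade lilla varholmen"
tokens = tokenize(sentence)
print(tokens)
print(len(sentence.split()), "words,", len(tokens), "tokens")
```

Note that "holmen" at the start of a word would need its own entry without the "##" prefix, which is exactly the kind of duplication that compound-heavy languages run into.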

In Swedish, we have an insane number of compound words, such as "longtimestorage" and "pretimevoting". This quickly becomes a problem, because the current algorithms for finding tokens are not at all adapted to it. The tokens that the model sees are often a poor fit, and a different inflection of one sub-word can shuffle all the tokens of the whole compound. There are also duplicates depending on whether a piece appears at the start of a word or inside it, e.g. "holme" and "##holme" are seen as completely different tokens. You can test this yourself in the demo below.

On the one hand, there is a lot of research on "attention" itself, since its cost makes it difficult to look at longer sequences, and some progress has been made. On the other hand, you could also have a smarter system for tokenizing. There is some research on this, and it may get more focus for Germanic languages other than English. I myself have a lot of ideas that I would like to explore, and I assume more people are already working on different solutions. It may also simply be the case that more computing power is required, which resolves itself over time, so that all of this is only a temporary quick fix for what will look like slow computers from a future perspective. Need m0ar compute.






The model begins by creating embeddings, a kind of representation of what each token means. For example, it tries to place "agreements" and "contracts" close to each other, and if it is well trained, "lawyer" is probably nearby.

Then these representations go through several transformer layers. A transformer layer learns to ask questions ("attention") from each token to all other tokens. An example could be "who created the contract". A base model often has 12 such questions per layer, called heads. The answers to these questions are merged and passed through a neural network that creates a new representation of the token, which moves on to the next layer. A base model often has 12 layers. If you increase the number of heads and layers, you increase the number of patterns the model can pick up. Well-known models have different names, but it is usually variations on this architecture or on the training method that differentiate them. For example, GPT-3 has 96 heads and 96 layers and a slimmer variant of attention, which makes it less expensive in FLOPs to look at longer contexts (2048 tokens).
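As a rough illustration of one such "question", here is a sketch of scaled dot-product attention for a single head in plain Python. It is simplified on purpose: a real layer first maps each token vector through learned query, key and value projections, while here the raw toy vectors (made-up numbers) play all three roles.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on plain lists of floats."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per (query token, key token) pair.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # New representation: weighted average of all value vectors.
        outputs.append(
            [sum(wj * v[i] for wj, v in zip(w, values)) for i in range(len(values[0]))]
        )
    return outputs, weights


# Three toy token vectors (made-up numbers, not real embeddings).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(x, x, x)

for row in w:
    print([round(v, 3) for v in row])
print("scores computed:", len(w) * len(w[0]))
```

The number of scores grows with the square of the number of tokens, which is why the choice of tokenization and the length of the context matter so much for compute.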



Learning Swedish

To teach a model Swedish, you must have a task that it can practice on, a task where you can see if it is wrong. For it to learn as well as possible, you need extreme numbers of examples of that task. It is therefore best to come up with tasks that work on any text. Previously, this has meant removing tokens and letting the model guess which token should be there, either in the middle of a sentence or as the next token in a text. If you train it to guess tokens in the middle of a text, it becomes good at understanding texts. If you train it to guess the next token, it becomes good at writing texts.
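The two tasks described above can be sketched as ways of turning any token list into training examples. This is a simplified illustration: the function names are hypothetical, and the 15% masking rate is an assumption borrowed from common practice, not a figure from this text.

```python
import random


def make_masked_example(tokens, mask_prob=0.15, seed=0):
    """Hide some tokens; the model must guess the hidden ones."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            targets.append(tok)      # the answer the model is scored on
        else:
            inputs.append(tok)
            targets.append(None)     # no feedback at this position
    return inputs, targets


def make_next_token_examples(tokens):
    """Every prefix of the text becomes a 'guess the next token' example."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["en", "bil", "##fär", "##ja", "lämnade", "lilla", "var", "##holmen"]

masked_inputs, masked_targets = make_masked_example(tokens, mask_prob=0.3)
print(masked_inputs)
print(masked_targets)

for context, answer in make_next_token_examples(tokens)[:3]:
    print(context, "->", answer)
```

Both tasks need no human labels, which is what makes it possible to train on the whole corpus.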

More efficient
One method, developed by Clark et al. (Stanford, Google), uses one model that replaces tokens in texts with similar tokens and another model that guesses which tokens have been replaced. This gives the model a more difficult task and forces it to look hard at every token, which provides better feedback. As a result, the model requires significantly fewer resources to train, and is therefore more energy-efficient and more environmentally friendly than previous methods. I chose this method.
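A minimal sketch of how training examples for this replaced-token task could be built. The small replacement table is a hypothetical stand-in for the first model, which in the real method is itself a small language model proposing plausible tokens; the 30% replacement rate is also just an illustrative choice.

```python
import random


def corrupt(tokens, replacements, replace_prob=0.3, seed=1):
    """Stand-in for the generator: swap some tokens for plausible alternatives.

    Returns the corrupted tokens and one label per position:
    1 = replaced, 0 = original.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        options = replacements.get(tok, [])
        if options and rng.random() < replace_prob:
            corrupted.append(rng.choice(options))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


# Hypothetical replacement table, for illustration only.
REPLACEMENTS = {
    "lämnade": ["nådde", "passerade"],
    "lilla": ["stora", "gamla"],
    "bil": ["tåg"],
}

tokens = ["en", "bil", "##fär", "##ja", "lämnade", "lilla", "var", "##holmen"]
corrupted, labels = corrupt(tokens, REPLACEMENTS, replace_prob=0.5)
print(corrupted)
print(labels)
```

The second model gets a label at every position, not only at the few masked ones, which is the intuition behind the better feedback per training example.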

Below you can see the output from the model. The larger the number, the more strongly the model believes that the token has been replaced.


Test its abilities on the university entrance exam

The goal is to train the model on tasks that are similar to the university entrance exam (Högskoleprovet, HP) and then test it on the exam itself, as a test of the generalization potential of the new models. For example, a recurring question on the exam is "what does the author say in the first paragraph?", and the exam uses a particular style of language. I want to see if the model has such a general understanding that you do not have to train it on small tricks to get good results.


In this task you are given a word, a combination of words or a phrase, and must choose the option most similar in meaning. It is not as easy as just having a list of words, because there are proverbs and other expressions too.

The dataset consists of words from Swedish Wiktionary. I use a cosine similarity loss, split the data 80/10/10, and get a Pearson correlation of ~0.91 on cosine similarity.

What is interesting about this task is that you can see how the model and method generalize beyond the list of synonyms, which only consists of single words, to cope with phrases and with words that are not in the list. When I tested the model on 3 exams, I got around 84% accuracy on the words. I also tried the standard approach for encoders, with pooling, and got around 89% at best. However, I do not use that result, as I used the 3 exams as a dev set, and that is cheating. What I want to do is train the model to solve tasks similar to HP in a fairly general way and only use HP as a final test. Besides generalization, this task is probably a very good measure of how well the tokenizer divides words into tokens. Correct subword -> more generalization.
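The word task above can be sketched as picking the option whose vector has the highest cosine similarity to the prompt. Everything here is illustrative: the vectors are made-up numbers rather than model outputs, the words are placeholders, and the loss shown (squared error between the cosine and a target score) is one common form of a cosine similarity loss, which may differ from the exact one used here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_similarity_loss(a, b, target):
    """Squared error between the cosine similarity and a target in [0, 1]."""
    return (cosine(a, b) - target) ** 2


def pick_most_similar(query_vec, option_vecs):
    """Return the option with the highest cosine similarity to the query."""
    sims = {name: cosine(query_vec, vec) for name, vec in option_vecs.items()}
    return max(sims, key=sims.get), sims


# Made-up 3-dimensional vectors standing in for pooled model outputs.
query = [0.9, 0.1, 0.2]                     # e.g. "agreement"
options = {
    "contract": [0.8, 0.2, 0.1],
    "ferry": [0.0, 0.9, 0.3],
    "teacher": [0.1, 0.2, 0.9],
    "island": [0.1, 0.8, 0.5],
}

best, sims = pick_most_similar(query, options)
print("best option:", best)
for name, sim in sims.items():
    print(f"{name}: {sim:.3f}")
```

Because phrases and unseen words also get vectors, the same selection step works beyond the single words the model was trained on.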


Read a text that has several blanks, each of which fits one or more words. For each blank, choose the best of 4 options.

Here I did not train at all. The model already has a real talent for this from its basic understanding of Swedish! You can test this in the demo above to see what the model outputs when you replace different words. Even if it thinks that the word you have inserted could belong there, it will generally be more uncertain about the words around it. So by looking at a window around the words, you get a pretty good basis for judging whether they fit. This gave 78% accuracy on 3 tests.
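The window idea can be sketched roughly like this: fill the blank with each option, ask the model how "replaced" every token looks, average those scores in a window around the inserted words, and pick the option with the lowest average. The scoring function below is a toy stand-in based on a hand-written list of word pairs, not the real model, and averaging with a radius of 2 is my own assumption about how the window could be aggregated.

```python
def window_score(scores, start, length, radius=2):
    """Mean 'replaced' score over the inserted tokens and their neighbours."""
    lo = max(0, start - radius)
    hi = min(len(scores), start + length + radius)
    window = scores[lo:hi]
    return sum(window) / len(window)


def choose_option(tokens, blank_index, options, score_fn, radius=2):
    """Fill the blank with each option and keep the least suspicious one."""
    results = {}
    for option in options:
        filled = tokens[:blank_index] + option + tokens[blank_index + 1:]
        scores = score_fn(filled)
        results[" ".join(option)] = window_score(scores, blank_index, len(option), radius)
    best = min(results, key=results.get)
    return best, results


# Toy stand-in for the model: a token looks suspicious if it does not
# form a known pair with its left or right neighbour.
KNOWN_PAIRS = {("the", "ferry"), ("ferry", "left"), ("left", "the"), ("the", "island")}


def toy_score_fn(tokens):
    scores = []
    for i, tok in enumerate(tokens):
        left_ok = i == 0 or (tokens[i - 1], tok) in KNOWN_PAIRS
        right_ok = i == len(tokens) - 1 or (tok, tokens[i + 1]) in KNOWN_PAIRS
        scores.append(0.1 + 0.4 * (not left_ok) + 0.4 * (not right_ok))
    return scores


text = ["the", "[BLANK]", "left", "the", "island"]
options = [["ferry"], ["teacher"], ["contract"], ["lawyer"]]

best, results = choose_option(text, 1, options, toy_score_fn)
print("best option:", best)
print(results)
```

With the real model, `toy_score_fn` would be replaced by a call that returns the per-token "replaced" scores for the filled-in text.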

There is a lot to gain here, for example by training the model to find replaced words in texts of extremely high quality rather than random pages from the internet. You could then add a classifier and train it to choose the right alternative. However, this goes a bit against the purpose, which is to test how well the model generalizes.


Read a text and a question. Choose the best answer from 4 options.

Here comes the difficult part. The university entrance exam is probably the largest dataset on reading comprehension in Swedish that is somewhat easily available. There may be material from national tests for high school or middle school, or at a company that offers training materials for them. I chose to look at what material was available in English. I trained a model to translate from English to Swedish and had it translate reading comprehension tasks into Swedish. Then I trained a Swedish model to do the tasks. I have shared the TensorFlow training code on Hugging Face's GitHub. The best result I got with this method was 1.0. I think there is a lot to be gained from better data and a larger model; unfortunately, this was resource-intensive both in compute and in time. I hope to be able to continue experimenting with this in the future. There is also an English model, UnifiedQA. It is built on T5, a model that is available in extreme sizes, and was trained to do very broad reading comprehension tasks. In the end I test their 3B variant, using a model that translates the Swedish texts into English, to see where the limit seems to be.

The biggest limitations here are data and model size. Reading comprehension requires a lot from the model. The more patterns the model can find, the better the results. Having many examples of the patterns required for the task also improves its chances. What I was able to build was basically a model with 12 layers, based on what I knew at the time. I started experimenting with how to fit and optimize larger models. Using UnifiedQA 11B would possibly increase the accuracy by about 5%.

Looking at the results below, it still seems that it might be better to train a smaller model in Swedish, which can still beat the big English model with translation. It would be exciting to see what happens if I continue! It should also be taken into account that the English tasks are easier than the Swedish ones.

