Language learning has surged during the pandemic. Duolingo, which is synonymous with gamified language learning, saw its fastest growth period this March, with a 101% global increase in new users. From those who simply have more time on their hands to students trying to keep up during the pandemic school year, the app is a huge boon. All that extra data isn’t going to waste — because Duolingo invested early in AI, the app keeps getting better as it grows beyond the 30 million monthly active users reported in December 2019.
“One of the things people don’t know is that even though Duolingo is very gamified and it just looks very cutesy, we actually record everything you do to try to basically have a model of what you know,” Duolingo CEO Luis von Ahn told VentureBeat. We spoke to von Ahn about all the ways Duolingo uses AI and then followed up with the company’s research director, Burr Settles, who joined in 2013 (Duolingo was founded in 2012). “We hired this guy named Burr who has a Ph.D. in AI,” von Ahn said when describing the company’s first foray into AI. “He came in and the idea was ‘Try to figure out how to use AI to improve Duolingo.’”
We’ve already done deep dives into how Duolingo uses AI to humanize virtual language lessons and to drive its English proficiency tests. This is a closer look at all aspects of the app itself, including the AI behind Stories, Smart Tips, podcasts, reports, and even notifications.
All of that adds up to a superior language learning experience, Duolingo says. Indeed, the company today published a report claiming its users performed as well on reading and listening tests as students who took four semesters of university classes, in half as many hours.
As you use it, Duolingo builds an exceptionally detailed profile based on what you know and what you don’t know.
“We know everything by individual word,” von Ahn said. “We have a whole spaced repetition system. We know how many times you’ve seen that word and we have an idea of how long it will take you to forget this word.”
The spaced repetition system was the very first AI project the company undertook, back in 2013. The model is able to predict when you’ve forgotten something because you haven’t seen it very frequently, or very recently. To this day, Duolingo uses it to help select which challenges it will put into a practice session for you.
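The article does not say which formula sits behind that prediction. As a minimal sketch, assume a forgetting curve in which the chance of recalling a word halves every `half_life_days`. The class, the function names, and all the numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WordMemory:
    word: str
    days_since_seen: float  # time since the learner last saw the word
    half_life_days: float   # assumed: days until recall drops to 50%


def recall_probability(memory: WordMemory) -> float:
    """Assumed forgetting curve: recall halves every half_life_days."""
    return 2.0 ** (-memory.days_since_seen / memory.half_life_days)


def pick_words_to_practice(memories: List[WordMemory], count: int) -> List[str]:
    """Return the words the learner is most likely to have forgotten."""
    ranked = sorted(memories, key=recall_probability)
    return [m.word for m in ranked[:count]]


memories = [
    WordMemory("chair", days_since_seen=1.0, half_life_days=8.0),
    WordMemory("banana", days_since_seen=10.0, half_life_days=5.0),
    WordMemory("man", days_since_seen=3.0, half_life_days=3.0),
]
print(pick_words_to_practice(memories, 2))  # ['banana', 'man']
```

Words seen rarely get a short half-life and words seen long ago have a large elapsed time, so both cases drift to the top of the practice list.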
“That’s still in production,” Settles said. “That’s a project that we actually hadn’t really touched for about seven years, and over the last quarter we’re actually reviving that and finally going back and improving on those with some things we’ve learned. And also in 2013, we built a computer adaptive placement test. When you first sign up for a course, you can take about five minutes and it will place you into where you belong in the course. And we’re also doing some active improvements on that. That second project was the inspiration for the Duolingo English tests. VentureBeat recently did a rather in-depth thing on that, but that’s AI end to end.”
Within each lesson, Duolingo decides which exercises to give you based on the words and concepts the app believes you need to practice. The specific exercises you are served vary, so each overall lesson of exercises ends up being different for everyone.
“We may have a list of 20 words that we’re trying to teach you. That is the same for everyone,” von Ahn explained. “But with those words, we have some latitude about how we teach them to you. For example, we may teach you the word ‘chair’ by giving you the sentence, ‘I love this chair’ or we may give you the sentence ‘I sat on this chair.’ And we make the choice about which one to teach you the word chair with based on what we think may be better for you.”
If you’re struggling with the past tense and Duolingo has an array of exercises for your level in various tenses, it will pick the past tense ones within the lesson you’re doing, just to make sure you practice that more.
This is all possible thanks to a machine learning implementation affectionately called Birdbrain.
“Generally, for every exercise we have a really good idea of how hard that exercise is for you,” von Ahn said. “For every sentence, before we give it to you we have a probability of what likelihood there is that you’re going to get that sentence right, that exercise right. It gives us no explanation about what you know and what you don’t know, it just says ‘Emil, for the sentence the man is on the chair, has a 93% chance of getting it right.’”
Furthermore, Birdbrain adjusts the difficulty within a lesson based on how hard a sentence is for you specifically. “And we use that to calibrate difficulty,” von Ahn said. “If you’re getting everything right, we say ‘Let’s give you something that we think you only have a 70% chance of getting right to see whether you get it right or not.’ If you’re getting a lot of things wrong, we actually start giving you things that are easier.”
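To make that calibration concrete, here is a rough sketch. Only the 70% figure comes from von Ahn. The accuracy thresholds, the other targets, and the function names are assumptions made for illustration.

```python
from typing import Dict, List


def target_probability(recent_results: List[bool]) -> float:
    """Pick a target success probability from recent answers (assumed thresholds)."""
    if not recent_results:
        return 0.85
    accuracy = sum(recent_results) / len(recent_results)
    if accuracy >= 0.9:
        return 0.70  # getting everything right: try something harder
    if accuracy <= 0.6:
        return 0.95  # getting a lot wrong: serve something easier
    return 0.85


def choose_exercise(predicted: Dict[str, float], target: float) -> str:
    """Return the exercise whose predicted success chance is closest to target."""
    return min(predicted, key=lambda ex: abs(predicted[ex] - target))


# Hypothetical per-learner predictions for three candidate sentences.
predicted = {
    "The man is on the chair": 0.93,
    "I sat on this chair": 0.72,
    "I love this chair": 0.85,
}
streak = [True, True, True, True, True]
print(choose_exercise(predicted, target_probability(streak)))  # I sat on this chair
```

A learner who has missed most recent answers would instead be steered toward the 93% sentence.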
Think of Birdbrain as the ultimate personalized learning system.
“It’s a massive system that trains every night using the half a billion or so lessons from the day before,” Settles explained. “As a byproduct of making these predictions, it models how hard the challenges are, as well as how proficient the users are. And so we’ve got this microservice now within the system where what we call session generator, that’s the system that constructs your lesson just for you when you go in to do a lesson or a practice session. And it would say ‘Okay, here’s like 200 challenges that I could put into this. I’m only going to use 14 of them, but here [are] 200 or so that might fit.’ Birdbrain will come back and say ‘Well, out of those 200, here’s the probability for each one of those.’ And then session generator can use that to help pick which challenges will be in there. It can use it to sequence or order which challenges will be in that particular lesson.”
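Settles does not describe Birdbrain's internals, so the following is only a sketch of one way a model could learn learner proficiency and challenge difficulty together: a logistic function of ability minus difficulty, nudged after each answer. The 200 candidates and 14 picks echo the quote. The learning rate, the 0.8 target, and every name here are assumptions.

```python
import math
import random


def predict(ability: float, difficulty: float) -> float:
    """Assumed model: chance of a correct answer from ability minus difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def update(ability: float, difficulty: float, correct: bool, rate: float = 0.1):
    """One gradient step on log loss; returns the new (ability, difficulty)."""
    error = (1.0 if correct else 0.0) - predict(ability, difficulty)
    return ability + rate * error, difficulty - rate * error


def generate_session(ability, difficulties, size=14, target=0.8):
    """Pick `size` challenges nearest the target probability, easiest first."""
    scored = {cid: predict(ability, d) for cid, d in difficulties.items()}
    chosen = sorted(scored, key=lambda cid: abs(scored[cid] - target))[:size]
    return sorted(chosen, key=lambda cid: scored[cid], reverse=True)


random.seed(0)
# 200 hypothetical candidate challenges with made-up difficulty values.
candidates = {f"challenge_{i}": random.uniform(-3.0, 3.0) for i in range(200)}
session = generate_session(ability=0.5, difficulties=candidates)
print(len(session))  # 14
```

A correct answer raises the learner's ability estimate and lowers the challenge's difficulty estimate, and a wrong answer does the opposite, which is how a single stream of right and wrong answers can inform both quantities.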
Duolingo can start giving you custom AI-generated lessons or suggestions within lessons after you’ve completed about a hundred exercises, or just five or six lessons. The system is fairly new — Duolingo only started developing it in October 2019 and launched a product feature that used it in March.
“Multiple teams use this service to fine-tune the experiences that they own,” Settles said. “And so over time the fraction of sessions that are being personalized with Birdbrain keeps going up.”
Last month, 6% to 8% of Duolingo sessions were being personalized by Birdbrain. Today the number is 12%, as teams at the company keep finding new ways to use it.
For Birdbrain’s personalization to be truly useful, Duolingo needs to know why you are failing certain exercises.
“When you get a challenge right or wrong, at this point Birdbrain doesn’t actually know why you got it right or wrong,” Settles said. “If there was a misconjugation or if it was a noun adjective agreement, or if you just typed in word salad, it doesn’t distinguish between those as far as Birdbrain is concerned.”
Duolingo uses everything it knows about every exercise, which is tagged with as much detail as possible (part of speech, sentence structure, tense, and so on), so it can figure out what to blame. Those tags were once done manually, but not anymore.
“We do a lot of it automatically now — of the tagging for every exercise,” von Ahn said. “And then, whenever you get it wrong we have this algorithm that’s called Blame that we try to assign blame for why you got it wrong. So when you enter and you get it wrong, we try to figure out like ‘Oh it’s because you didn’t know the word for that or it’s because you knew the word for that but you don’t know how to make it go into the past tense.’ And then we have a pretty good idea of the things that you often get wrong.”
There is no separate algorithm when you get the exercise right, but Duolingo tracks that as well.
“If it’s right, we give you credit. We say ‘Okay, he just did an exercise that has these words and these concepts and got it right, therefore our confidence that this person knows these concepts went up.’ But if you get it wrong, it’s much harder because Blame is trying to figure out which of the concepts is the culprit for why you got it wrong. And sometimes we can’t because you enter an answer that is so off that who knows? But most of the mistakes that people enter, usually it’s like one or two things off. And we try to figure out what concept you didn’t know. Did you not know the word for it? Did you not know the gender of the word for it? Did you not know how to conjugate into the past tense? Did you not know that adjectives come before the noun?”
Blame can spit out multiple reasons for why you got something wrong. And of course, the more mistakes you make, the harder it is to decipher. “At some point it just kind of gives up,” von Ahn said.
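Duolingo has not published how Blame works. A toy sketch of the idea, assuming each word of the correct answer carries one hypothetical concept tag, might look like this. The tag names and the `max_errors` cutoff are invented for illustration.

```python
from typing import Dict, List, Optional


def assign_blame(expected_tags: Dict[str, str], answer: str,
                 max_errors: int = 2) -> Optional[List[str]]:
    """Return the concepts blamed for a wrong answer, or None when giving up.

    expected_tags maps each lowercase word of the correct answer to a
    hypothetical concept tag.
    """
    answer_tokens = set(answer.lower().split())
    missing = [tok for tok in expected_tags if tok not in answer_tokens]
    if len(missing) > max_errors:
        return None  # too far off to explain
    return [expected_tags[tok] for tok in missing]


tags = {
    "i": "word:I",
    "sat": "tense:past",
    "on": "word:on",
    "this": "word:this",
    "chair": "word:chair",
}
print(assign_blame(tags, "I sit on this chair"))  # ['tense:past']
print(assign_blame(tags, "word salad"))           # None
```

This version ignores word order and can return several concepts at once. It also shows why typing the one word you do recognize helps: that word's concept stays off the blame list.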
If you know you’re going to get a challenge wrong, but you recognize a word, translating just that word would be better than responding with gobbledygook. “It would definitely be better for our model. Our model would have a better opinion of you.”
Conversely, if you get the whole thing right, Duolingo does not necessarily think you know all the concepts therein — maybe you just guessed correctly. “That’s exactly right,” von Ahn confirmed. “This is all probabilistic. Now we have a little more confidence that you know the word for ‘banana.’”
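One common way to express "a little more confidence" is a Bayes update that allows for lucky guesses and careless slips, in the style of Bayesian knowledge tracing. The article does not say Duolingo uses this method, and the `guess` and `slip` values below are assumptions.

```python
def update_knowledge(p_known: float, correct: bool,
                     guess: float = 0.2, slip: float = 0.1) -> float:
    """Bayes update of the belief that the learner knows a concept.

    guess: assumed chance of answering correctly without knowing the concept.
    slip: assumed chance of answering wrongly despite knowing it.
    """
    if correct:
        known = p_known * (1.0 - slip)
        unknown = (1.0 - p_known) * guess
    else:
        known = p_known * slip
        unknown = (1.0 - p_known) * (1.0 - guess)
    return known / (known + unknown)


belief = update_knowledge(0.5, correct=True)
print(round(belief, 3))  # 0.818
```

Because a correct answer could have been a guess, the belief rises without ever reaching certainty.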
Knowing how to translate individual words isn’t enough for effective communication in a new language. Sentence construction and understanding are just as important. Last year, the company started working on a feature called Smart Tips. For some mistakes that you make, Duolingo tries to figure out the root cause so it can offer you a timely tip. For example, if Duolingo notices that you entered the right words but in the incorrect order, it can give you a corrective grammar tip right after it tells you that your input was incorrect.
Seems simple enough, right? It turns out Smart Tips isn’t just straightforward machine learning.
“That required some major creativity,” Settles said. “Each challenge and each response gets run through what is a pretty textbook natural language processing pipeline. Here’s the sentence, these are all the nouns. This noun is masculine, it’s plural, and it is the subject of this verb. All of that stuff is pretty textbook. But then figuring out that this person made this specific mistake — they got the word order wrong or they got the gender of the noun and adjective agreement wrong. Those are rules that are human crafted on top of the textbook, natural language processing pipeline.”
Settles’ Ph.D. is in active learning, and he wrote a book about machine learning algorithms that ask questions. Rather than just passively consuming data and learning to predict something, they develop a hypothesis or multiple hypotheses and try to figure out which is the right one by asking questions of a human oracle.
“What we’re doing here is we run an NLP for the correct answer and we run the NLP pipeline on the wrong answer,” Settles said. “We look at the difference between those and try to come up with a bunch of explanations of what’s wrong. We know it’s wrong. But what’s wrong about it? And then do that in aggregate over a couple million exercises every day. And then make suggestions to a human of like, ‘Hey, this is what I think is going wrong in a lot of these challenges.’ And then it will propose some rules, and they can kind of click on the rules and see examples of the correct answer and the incorrect answer that would be covered by that rule. They kind of collaborate with the AI to come up with the right set of rules.”
It’s this back and forth between the AI and the human staff that results in rules for common grammatical error patterns. The process requires aggregating all the data about the mistakes that Duolingo users make every day. Duolingo’s staff then decides what is a rule and whether it should be published as a tip. Some compiling and optimization follow to ensure that the new tip shows up quickly on your phone when you make the corresponding mistake. And then it happens all over again, with new types of mistakes and rules published.
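The loop described above can be sketched in a few lines. This is an illustration only: the real pipeline uses full natural language processing, while the version below compares plain tokens, and the labels, the `min_count` threshold, and the function names are hypothetical.

```python
from collections import Counter


def explain(correct: str, answer: str) -> str:
    """Label the difference between a correct answer and a learner's answer."""
    right, given = correct.lower().split(), answer.lower().split()
    if right == given:
        return "no_error"
    if sorted(right) == sorted(given):
        return "word_order"  # same words, different order
    if len(right) == len(given):
        if sum(a != b for a, b in zip(right, given)) == 1:
            return "single_word"  # one word differs, e.g. agreement
    return "unexplained"


def propose_rules(pairs, min_count=2):
    """Aggregate explanations and surface frequent ones for human review."""
    counts = Counter(explain(c, a) for c, a in pairs)
    return [(label, n) for label, n in counts.most_common()
            if n >= min_count and label not in ("no_error", "unexplained")]


pairs = [
    ("la casa roja", "la roja casa"),
    ("el gato negro", "el negro gato"),
    ("la casa roja", "la casa rojo"),
    ("la casa roja", "hola"),
]
print(propose_rules(pairs))  # [('word_order', 2)]
```

The output is a suggestion for a person to accept or reject, which mirrors the collaboration Settles describes.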
Duolingo even uses AI to improve the effectiveness of its notifications. The app sends you a notification each day to remind you to practice.
“We use AI to figure out when to send them to you and what to tell you,” von Ahn said. “We trained the whole system trying to figure out when is the best time to send the notification based on your own activity. We know your activity on Duolingo and then for a given day we’ve watched all the days in the past when you’ve used Duolingo, and then we pick a time when to best send you the reminder and also what to say in that reminder. We’ve made pretty big gains in terms of number of people coming back.”
After Duolingo implemented its novel bandit algorithm, the company saw a 2% increase in new user retention, measured from one day through one week after users downloaded the app.
That might not seem like a lot, but it’s a significant increase if you consider that the only input data is when the app is used. After just a few days, Duolingo can optimize when to send you the notification. Even one day of data is useful.
“It’s pretty good actually,” von Ahn said. “It’s interesting. If we only have one day of information about you, you know what the system does? It sends you the notification at exactly the same time the next day. Turns out that’s actually pretty good. After we have a few days, we get better and better. Probably after about a week of usage, we get a pretty good idea of when you use Duolingo. Sometimes it may vary by day of the week, so we have noticed that for some people it does something a little different for the weekends than during the weekdays. The system is kind of all trained using just data from you, but it gets pretty good, pretty fast.”
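A simple stand-in for the timing behavior von Ahn describes is sketched below. It is not Duolingo's bandit algorithm. It just picks the hour you have practiced at most often, keeps weekdays and weekends apart, and falls back to all sessions when there is no matching history. The function name is hypothetical.

```python
from collections import Counter
from datetime import datetime
from typing import List, Optional


def reminder_hour(sessions: List[datetime], for_weekend: bool) -> Optional[int]:
    """Pick the hour of day to send a reminder, or None with no history."""
    if not sessions:
        return None
    # weekday() is 5 for Saturday and 6 for Sunday.
    matching = [s for s in sessions if (s.weekday() >= 5) == for_weekend]
    pool = matching or sessions  # fall back to all history if nothing matches
    return Counter(s.hour for s in pool).most_common(1)[0][0]


sessions = [
    datetime(2020, 9, 5, 11, 0),   # Saturday
    datetime(2020, 9, 6, 11, 30),  # Sunday
    datetime(2020, 9, 7, 8, 15),   # Monday
    datetime(2020, 9, 8, 8, 40),   # Tuesday
    datetime(2020, 9, 9, 19, 5),   # Wednesday
]
print(reminder_hour(sessions, for_weekend=False))  # 8
print(reminder_hour(sessions, for_weekend=True))   # 11
```

With a single recorded session, the function returns that session's hour, which matches the "same time the next day" behavior in the quote.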
Unlike most AI implementations, where there is always a ton of potential for improvement, this feels like a solved problem. “I don’t know if it’s a solved problem, but we feel pretty good with what we have there, and it’s hard to imagine that we can do a lot better,” von Ahn said. “Like maybe we can do a little better, but it does a pretty good job.”
Whenever you submit an answer to a challenge and Duolingo says you got it wrong, you have the option to hit the Report button. If you think you got it right, you can appeal.
“We get about, somewhere between half a million and a million of those every week, and 90% of them are junk,” Settles said. “They’re either accidental taps or the people are wrong but they think they’re right. But about 10% of those are bugs in the course. Or not necessarily bugs, but things that are acceptable. Maybe they’re not the most fluent or idiomatic way of doing it, but they’re correct, and so we should modify the course content to include those. But it’s a real needle in the haystack kind of process for the course content maintainers and developers.”
To address this challenge, the team built a machine learning system using a logistic regression algorithm that would surface the useful reports.
“For a while, we just sort of ranked the reports by how many people submitted this particular exact sort of report,” Settles said. “And that helped a little bit. But in the process of doing that we collected a lot of training data; well this is actually correct and this is not correct. So we were able to train a machine learning model to predict which reports are likely to be accepted by our contributors. And we did this in a vastly kind of multilingual way so that now there’s an interface that basically rank-orders all of the reports so that they can find the most salient ones to fix first.”
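Here is a minimal sketch of ranking reports with logistic regression, written in plain Python so it runs without extra libraries. The two features (how common the identical report is, and how close the answer is to an accepted one), the training rows, and the hyperparameters are all made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, epochs=500, rate=0.5):
    """Fit logistic regression weights (bias first) with batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            features = [1.0] + list(x)
            error = sigmoid(sum(w * f for w, f in zip(weights, features))) - y
            for i, f in enumerate(features):
                grads[i] += error * f
        weights = [w - rate * g / len(rows) for w, g in zip(weights, grads)]
    return weights


def score(weights, x):
    """Predicted probability that a contributor would accept the report."""
    return sigmoid(sum(w * f for w, f in zip(weights, [1.0] + list(x))))


# Hypothetical history: (report frequency, similarity to an accepted answer).
rows = [(0.9, 0.8), (0.7, 0.9), (0.8, 0.6), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = contributors accepted the report
weights = train(rows, labels)

new_reports = {"A": (0.85, 0.9), "B": (0.15, 0.1), "C": (0.5, 0.5)}
ranked = sorted(new_reports, key=lambda r: score(weights, new_reports[r]), reverse=True)
print(ranked)  # ['A', 'C', 'B']
```

Nothing is thrown away here. Every report keeps a score, so a maintainer can work from the top of the list down.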
It’s important that Duolingo is ranking the reports and not just discarding the less useful ones — after all, no algorithm is perfect. Plus, there are still too many reports for the team to get through, regardless of prioritization.
“At least the ones at the top tend to be more likely to be acceptable and changes that we should actually make,” Settles said. “Some of them when you look at them are like ‘Yeah that’s obvious.’ Language is so expressive. There’s so many ways of saying the exact same thing that even if you’re thinking really hard about it, you won’t necessarily cover all of the bases.”
The results speak for themselves.
“It used to be that when we rolled out a brand-new course that it took about six months or so to graduate from beta,” Settles noted. “One of the criteria for graduating from beta is that we have fewer than a certain number of reports per number of sessions. The first two courses that we rolled out after we created this tool I think were Latin and Scottish Gaelic. Those graduated from beta in five weeks. It has significantly cut down how quickly we could deal with these reports as they came in.”
In a single quarter last year, Duolingo used machine learning to build a tool for determining the difficulty of any text for language learners. The team used the Common European Framework of Reference (CEFR), which has a six-level scale: A1 and A2 (beginner), B1 and B2 (intermediate), and C1 and C2 (advanced).
Not only does the tool classify the language level of the text, it also judges the level of individual words and constructions. The public version (CEFR Checker), which you can try yourself, only has English and Spanish, but internally Duolingo also has it working for French, Portuguese, German, and Italian.
“Our language and curriculum experts, as they’re developing the curricula, they organize vocabulary into the different levels,” Settles explained. “We’re basing this off of decades of research that has already been done on vocabulary profiles. We use that as training data. But the vast majority of work that is done in creating these vocabulary profiles is English-only, because learning English is a multi-billion dollar industry, whereas learning Portuguese, not so much.”
That limitation meant Duolingo had to lean on its team of Ph.D.s in linguistics with classroom instruction experience, who develop a lot of the course content. These linguists put together profiles of around 7,000 English words and labeled them per the CEFR. Then the AI team got to work training the model using massive amounts of text on the internet so it can learn the difficulty of all 10 million words in the English language via word embeddings and transfer learning.
“We invented some multilingual natural language processing ways of transfer learning,” Settles said. “We’re essentially doing multilingual multitask transfer learning, where we have mostly data in English but we’re able to train a system that can make accurate predictions in Spanish, French, German, Italian, and Portuguese, even though we’re bootstrapping from English. It does make some mistakes. The curriculum experts in those languages can correct salient mistakes, and then we retrain the model until it becomes more accurate.”
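Duolingo's actual method is not described in detail. As a much simpler stand-in, the sketch below assumes that words from different languages live in one shared embedding space, then copies the CEFR level of the nearest labeled English word. The three-dimensional vectors and the level labels are toy values invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def predict_level(vector, labeled):
    """Copy the CEFR level of the most similar labeled word (1-nearest neighbor)."""
    nearest = max(labeled, key=lambda word: cosine(vector, labeled[word][0]))
    return labeled[nearest][1]


# Hypothetical English words with toy embeddings and assumed CEFR labels.
labeled_english = {
    "cat": ([0.9, 0.1, 0.0], "A1"),
    "negotiate": ([0.1, 0.8, 0.3], "B2"),
    "ubiquitous": ([0.0, 0.2, 0.9], "C2"),
}

# Unlabeled Spanish words, assumed to sit in the same embedding space.
print(predict_level([0.85, 0.15, 0.05], labeled_english))  # gato -> A1
print(predict_level([0.05, 0.25, 0.85], labeled_english))  # omnipresente -> C2
```

When a prediction like this is wrong, a curriculum expert's correction can be added to the labeled set, which is the retraining loop Settles mentions.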
Duolingo has a Stories tab that features short stories for testing your reading comprehension. The Stories team uses CEFR Checker to test whether the difficulty level of what they write is appropriate.
“We say, ‘Okay, we need 10 more stories at this specific language level,’” von Ahn said. “Then we have writers write them and then we check whether they’re at that language level. If they’re not, we return them to the writers and we say ‘Hey this is too difficult still, you should simplify it.’”
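The review step von Ahn describes could be approximated as follows, assuming a word list that already maps words to CEFR levels. The word levels here are illustrative guesses, and the function name is hypothetical.

```python
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]


def words_above_level(text, word_levels, target):
    """Return the words in text whose labeled level is above the target level."""
    limit = CEFR_ORDER.index(target)
    flagged = []
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")
        level = word_levels.get(word)  # unknown words are skipped in this sketch
        if level is not None and CEFR_ORDER.index(level) > limit and word not in flagged:
            flagged.append(word)
    return flagged


# Hypothetical vocabulary profile.
word_levels = {
    "the": "A1", "man": "A1", "sat": "A1", "on": "A1", "a": "A1",
    "chair": "A1", "negotiate": "B2", "ubiquitous": "C2",
}
story = "The man sat on a ubiquitous chair."
print(words_above_level(story, word_levels, "A2"))  # ['ubiquitous']
```

A writer who gets back a non-empty list knows which words to simplify before resubmitting the story.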
Duolingo also records podcasts so you can keep learning outside of the app. The podcast team similarly uses CEFR Checker to make sure the script they wrote before they start recording is the right level of difficulty for a given episode. Other teams at the company are also using CEFR Checker and making feature requests to the point that Settles wants to go back and improve it.
Above you can see CEFR Checker’s analysis of this article.
The most important question I asked the duo was the one thing Duolingo users struggle with the most: What order should I do lessons in?
“We’ve explored this, and we probably should continue exploring it,” von Ahn said. “This is something that we know that a lot of people struggle with — what is the best order to do it? We’ve thought about this quite a bit and yeah, that’s something that we have used AI for in the past, but I don’t think we’ve ever done anything that is better than what we currently have there, which is just kind of letting people explore.”
Duolingo unlocks harder lessons based on past lessons you’ve completed, but that’s the only guidance you get. Could AI help you pick what to learn next?
“Probably at some point because we now have the tools to start working on that,” Settles said. “So it is in the backlog of things to put on the roadmap.”