What Mistakes Are Translation Machines Still Making In The Real World?
never would have predicted this would be the subject of my very first blog post
Link to Part 2; Link to Part 3.
Bit of a weird first-ever blog post, this. If I’d thought seriously about blogging before, I would have tried to start in a systematic way: an introduction, some general post about what I was interested in and what I wanted to talk to you about. But I never started that blog, did I? I merely figured out answers to some of my obscure questions about archaic Turkish TV vocabulary, and thought vaguely that “someone should write that down”.
I’ve read advice saying that the best way to start writing is to just write something, anything at all, and so here we are. I begin with a post about something I read that made me think.
What I want to write today is really more like a long comment on a discussion that began with this post on AI scaling at Astral Codex Ten. To summarise, the argument is about whether machine-learning language models like OpenAI’s GPT series are missing some “spark” that will be required for them to ever pass as human. Will they one day develop genuine comprehension of what language is, as a consequence of simply taking the current models and scaling them bigger? Or will they keep making the occasional truly bone-headed mistake, demonstrating that we can never trust them to do anything important? Is language comprehension even necessary for them to do their job? To quote the ACX post directly:
My understanding of [Gary] Marcus is that he’s going off some argument like this: GPT makes ridiculous mistakes that no human would make. This proves that it doesn’t have common sense - or, more technically some specific skill called “world-modeling”. If you could model the world, you would never make these kinds of ridiculous mistakes. Therefore, until someone figures out how to add in that skill, GPT will continue to be deficient.
Of course then there was some back-and-forth on Twitter that I didn’t follow, but Gary Marcus helpfully collated some of the arguments (and important people to read) from the opposite point of view at his blog here.
My own humble contribution to this debate is just some real-world examples. I can’t formulate a priori examples of problems computers will never be able to solve, but I immediately thought about the types of mistakes I see them make every day in the real world. Real-world examples have got to be useful in a way that’s complementary to carefully formulated test questions, right?
So why am I particularly qualified to talk about this subject? I’m not a linguist, or a programmer, and I have no special interest in machine learning (my day job is in pharmaceutical R&D, so if this blog ever starts to sound like a scientific report, that’s why). However, since the lockdowns at the start of the pandemic I have watched hundreds of hours of Turkish television, using subtitles of varying quality. And I have laughed out loud at some hilarious machine-generated translation errors that were obvious to me, even as a native English-speaking white Australian with no previous connection to the Turkish language. In fact, it was repeatedly seeing the same translation errors popping up that inspired me to learn some Turkish, to try to understand what the hell the subtitle robot was even doing.
Now I know what you’re thinking: this guy who probably adds “of course this will lead directly to a cure for cancer” to all of his scientific papers is now adding “this will solve important problems in machine learning” to his personal blog about watching TV. To which I reply: yeah, you got me. But pretend you’re a grant reviewer, and let me give it my best shot: if the subtitle generator knows a lot of Turkish, but lacks a world model, what happens when it meets me, who possesses an educated adult human’s concept of language, but minimal Turkish? I won’t notice any errors due to subtleties of Turkish grammar or the like (in fact, a malicious translator could rewrite the story entirely, à la Monty Python’s Dirty Hungarian Phrasebook sketch, and I wouldn’t necessarily notice). If I am rubbish at Turkish, surely the errors I do notice will be those that contradict my own internal model of the world. Or to state it as formally as I can: if any transcription/translation errors are in principle unfixable without the addition of a cognitive model of the world, they may be enriched in that subset of errors that I am capable of noticing. Let’s take a look!
Background to the Examples
This part is more about the subject matter of future blog posts, and I’ve probably included it because I am so used to putting a Methods section in a scientific report. If you really don’t want to hear about me and some fun TV shows, you can skip it (Hah! Just like the Methods section in a real scientific paper! Am I right? Guys?) and go to the examples below.
The TV shows: Kuruluş: Osman and Destan
Kuruluş: Osman (“Establishment: Osman”) is a soapy historical drama about the eponymous founder of the Ottoman Empire, set at the end of the 13th century. Lots of sword-fights and scheming and cliff-hanger endings (and if this sounds like your thing, you may be at the right blog).
Destan (“Epic”) is another historical fiction, this one set on the Central Asian steppes in the 8th century. More sword-fights. More scheming. Fabulous costumes. Trailer here.
These are both shows I’m currently watching, and they present interesting challenges for a transcription/translation robot: in particular a mix of modern dialogue with archaic words, plus some pseudo-archaic stuff mixed in to make it sound “ye olde”. From user daltonlarinjoe on Reddit:
They are using some old words such as "pusat" for weapon, instead of Turkish word "silah" but it's not accurate either. Turks were using the word "yarağ/g/k" back then for "weapon" but I guess the producers couldn't use that word cause it means "dick" in Turkish slang today. And there is a slight difference in grammar for example they are always saying "sen ne dersin?" instead of "ne diyorsun sen?" etc.
The language models: YouTube’s Automatic Subtitle service, and Google Translate
Here is the support page for the YouTube Automatic Captioning service. I have no idea which particular language models/algorithms are actually used:
Note: These automatic captions are generated by machine learning algorithms, so the quality of the captions may vary. We encourage creators to add professional captions first. YouTube is constantly improving its speech recognition technology. However, automatic captions might misrepresent the spoken content due to mispronunciations, accents, dialects, or background noise. You should always review automatic captions and edit any parts that haven't been properly transcribed.
I am fairly confident that the Turkish captions for the two shows above are generated by the speech recognition algorithms with almost no human review or curation, though with my level of Turkish I can’t be certain. Evidence in favor of machine transcription: firstly, it always takes a few days after a two-hour episode is uploaded for the captions to appear; secondly, I have a few examples where the Turkish subtitles look wrong due to accents/dialects, as warned above. Evidence against machine transcription is that the (Turkish) subtitles are unbelievably good. I am often amazed at the quality of the speech recognition engine, even when actors whisper, growl, or speak really fast.
I won’t personally catch many transcription errors though, as I almost always use the English subtitles (except when they go hilariously wrong). Although the initial transcription takes a couple of days, as soon as the Turkish subtitles go up you can get them translated in situ to almost any of the Google Translate-supported languages. While the final results aren’t flawless by any means, the shows are eminently watchable. Like I said, I could be wrong about the level of human involvement in the transcription process, but I feel like most of the errors I am capable of noticing creep in during the translation step.
(Just make sure you set the subtitles to “Turkish” and not “Turkish (auto-generated)” as in the picture below, before selecting “Auto-translate” to English. The “Turkish (auto-generated)” subtitles are generated word-by-word as the video plays, and are of extremely poor quality.)
The human evaluator: Me
No formal training in languages other than English after my high-school French/Italian. I have always been interested, though, and for example I spent the summer before a trip to Thailand learning enough of the alphabet that I could puzzle through a street sign or a menu in Thai. Like everyone else in lockdown in 2020, I installed a couple of language apps on my phone, and according to these I have now gained a vocabulary of a few thousand Turkish words. My comprehension is still poor though; watching a show in Turkish without subtitles, even at normal speaking speeds, just sounds like a stream of semi-recognisable words going past, far too fast for me to parse.
Example 1: Adding grammatical gender in translations
This one is straightforward to describe and seems pretty trivial, but it strikes me as a really difficult problem for a machine. It starts with one of the pleasingly simple things about learning the Turkish language: the lack of grammatical gender. In place of the English 3rd-person singular pronouns he/she/it/they/etc., Turkish has a single pronoun “o” which covers all of them.
The problem arises when you want to translate the neutral pronoun “o” into English (or some other gendered language), and you need to add the grammatical gender back in.
Put the sentence “she gave it to him” into Google Translate and it spits out the correct Turkish: “ona verdi”. This sentence doesn’t specify the gender of anyone involved. Now click the double arrows at the top to swap the languages over, and translate “ona verdi” back into English: you get two answers, “gave it to her” and “gave it to him”. We have lost one bit of the original gender information altogether (you might want to use a byte or more for real-world examples, but in the case of conservative Turkish TV shows, only male and female are ever portrayed) and are given a choice of two guesses for the other. In practice, watching a translated TV show, the algorithm seems to choose between “him” or “her” fairly randomly.
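To see why the round trip has to lose information, here is a toy sketch in Python (emphatically not what Google Translate actually does internally): the mapping from English third-person pronouns to the Turkish “o” is many-to-one, so it has no unique inverse.

```python
# Toy dictionaries only -- a real translator works on whole sentences,
# but the many-to-one shape of the pronoun mapping is the same.
EN_TO_TR = {"he": "o", "she": "o", "it": "o", "they": "o"}

# Invert the mapping: one Turkish pronoun fans out to several candidates.
TR_TO_EN = {}
for en, tr in EN_TO_TR.items():
    TR_TO_EN.setdefault(tr, []).append(en)

print(EN_TO_TR["she"])  # 'o' -- the gender bit is gone at this point
print(TR_TO_EN["o"])    # ['he', 'she', 'it', 'they'] -- all that's left is a guess
```

Whatever the translator picks on the way back, it is reconstructing information that simply isn’t in the Turkish text.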
So here is a specific formulation of a difficult problem for machine translation, loosely based on a real example: Two characters on a TV show are discussing (in Turkish) a meeting with the Governor of Kulucahisar, who is not present. Both characters (and the viewer) are aware that the Governor is a woman, since when she was introduced in the previous episode she was wearing a tiara and a dress (and this is conservative TV, so you only get two choices for her gender). However, since it’s not relevant to the conversation, and Turkish grammar doesn’t require it, her gender is never mentioned; they can refer to her as the Governor, or with the neutral pronoun “o”.
An alert human viewer, prompted with the current scene, can bring together the audiovisual information about the Governor from the episode they watched last week, and translate the neutral pronoun “o” as “she” without a new audiovisual cue about her gender. What will a machine need to do to perform as well as a human? It needs to link the abstract idea of the Governor in the new episode with the audiovisual picture from the old one, via the only thing they have in common: the phrase “Governor of Kulucahisar”. This sounds a lot like assigning a meaning to that phrase. But of course the actual question we are discussing is whether the machine needs us to install the “Assign meanings to phrases” software suite before it can do this, or whether it only needs to get better at doing what it already does, i.e. just matching stuff up with other stuff it saw in the past.
(Or does it cheat by sticking in a neutral 3rd person singular “they”, and move on?)
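For what it’s worth, here is a toy sketch of the kind of memory I’m imagining: a running table of entity genders learned from earlier episodes, consulted whenever the neutral “o” needs an English gender. This is pure speculation on my part about one possible mechanism, not a description of how any real system works.

```python
# Hypothetical entity-gender memory, populated last week when the
# Governor appeared on screen (tiara, dress, conservative TV).
entity_gender = {"Governor of Kulucahisar": "she"}

def translate_o(antecedent: str) -> str:
    """Pick an English pronoun for Turkish 'o', given the entity it refers to."""
    # Fall back to singular 'they' -- the cheat mentioned above.
    return entity_gender.get(antecedent, "they")

print(translate_o("Governor of Kulucahisar"))      # 'she'
print(translate_o("some character we never met"))  # 'they'
```

Of course, the genuinely hard parts are exactly what this sketch assumes for free: working out that “o” refers to the Governor at all, and extracting “she” from a tiara and a dress in last week’s video.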
Example 2: Turkish Proper Nouns
Like in English, certain Turkish nouns can also be names. For example, actor and former model Burak Çelik has a decidedly manly surname that translates to “Steel”.
The YouTube subtitle transcriber rarely has a problem recognising proper nouns or common names. It seems to capitalise them correctly in the middle of a sentence and use apostrophes correctly under Turkish grammar rules. One common exception is a pre-Islamic word for “heaven”, uncommon anywhere except Destan; instead of the correct place name “Uçmak”, it is transcribed and conjugated as the modern Turkish verb “uçmak” (“to fly”).
Google Translate struggles more often. Proper nouns that were capitalised by the subtitler are usually translated OK, especially if located in the middle of a sentence where they have some context. However, capitalisation becomes ambiguous when a proper noun sits at the start of a sentence, where it would be capitalised in any case. The character named Boran in Kuruluş Osman often gets the short end of the stick, with his name sometimes translated correctly as “Boran”, and other times incorrectly as “Storm”.
As with names, the subtitle transcription engine does pretty well with titles. The title “Han” (usually spelled “Khan” in English) is capitalised and conjugated properly in context. From Google Translate, we get some minor laughs from the close similarity of “Han” (“Khan”), “han” (“inn”), and “hanım” (“woman”); for example something like the acknowledgement “Evet Han’ım” (“Yes, my Khan”) might occasionally come through as “Yes, woman”.
So far so good, right? The simple examples above don’t look hard to fix: just add “Uçmak” to the dictionary, and force Google Translate to use more context from adjacent sentences or something, once more computing power is available. A rough sketch of what the dictionary fix might look like is below; after that, let’s try a more complicated example, with both names and titles.
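Here is the dictionary fix as a toy Python sketch: protect known proper nouns with placeholder tokens before translation, then restore them afterwards. The translate() call is a hypothetical stand-in for whatever engine you use; I have no idea whether Google actually does anything of the sort.

```python
# Names that must never be translated (hand-curated for these shows).
GLOSSARY = ["Uçmak", "Boran", "Bala Hatun"]

def protect(text):
    """Swap glossary names for placeholder tokens the engine should leave alone."""
    placeholders = {}
    for i, name in enumerate(GLOSSARY):
        token = f"__NAME{i}__"
        if name in text:
            text = text.replace(name, token)
            placeholders[token] = name
    return text, placeholders

def restore(text, placeholders):
    """Put the original names back after translation."""
    for token, name in placeholders.items():
        text = text.replace(token, name)
    return text

protected, names = protect("Boran geldi.")
print(protected)                     # '__NAME1__ geldi.'
# translated = translate(protected)  # hypothetical engine call
# print(restore(translated, names))  # 'Boran' survives, never 'Storm'
```

(Even this is fragile in practice: the engine can mangle the placeholder tokens, and Turkish case suffixes attach directly to names, so there is real grammar to handle. But it shows the shape of the fix.)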
The founder of the Ottoman Empire, Osman I, was apparently married to one Râbi'a Bâlâ Hâtun, fictionalised as “Bala Hatun” in Kuruluş Osman. I don’t know if her name “Bala” means anything in particular, but the word “bal” does: it’s the Turkish word for “honey”. Now a particularly stupid and literal-minded robot, familiar with basic Turkish grammar, might confuse the name “Bala” with the word “bala”, meaning “at or towards honey”, but luckily Google Translate is not that stupid…
…Well hardly ever, anyway. The problem comes with Bala’s title, “Hatun”, meaning “Lady”. Perhaps as a direct result of previous TV like this, “Hatun” apparently entered the popular vernacular some time ago as a way to joke around with the girls. User “cyrano” on ekşisözlük.com (“sour dictionary”, roughly equivalent to urbandictionary.com but sooo much more) defines “Hatun” thusly (lightly edited Google Translation):
Since the internet/IRC era, a synonym for “woman” that has become [like chewing gum in] the mouths of people who are copies of each other in terms of word choice/use. A character-defining word.
A fun idiom if I’m reading it right; chewing gum being something that makes your mouth move constantly.
User “horni” says (again Google Translated):
It's a word I hate to use... it's definitely slang. but it's addicting
You get the idea. Bringing this all together, how does Google Translate render Bala’s name in English? Usually correctly, as “Bala Hatun”, but sometimes as “Honey babe”, or even “Honey honey”. This somewhat ruins the immersion in a historical drama, but could possibly cause an international incident if she were e.g. the current President of Türkiye.
We can go deeper. In episode 92 of Kuruluş Osman, starting here around timestamp 1:21:25, the male soldiers (plus a few female soldiers) are away at war, but the remaining women are responsible for the defence of the camp. We see Bala inspecting the troops, and hear the following exchange with Aygül, who is responsible for training the women (English version of the subtitles):
Bala Hatun: “Aygül! What do you think? Are you ready for a new war, sisters?”
Aygül replies: “Even the nastiest chicks I've ever seen are ready to be your sisters.”
Of course Aygül actually used the respectful “hatunlar” (“ladies”), and the adjective “toy” (“naive; inexperienced; green”) to describe them, so the intended meaning was probably something like “Even the greenest recruits I’ve ever seen would be willing to fight beside you”. The problem is that Google Translate, trained in part on Big Data scraped from the web, has seen the recent slang usage of “hatun” much more frequently than the original. Without context other than the adjacent text, Google Translate is only too happy to veer off into offensiveness, given a little nudge.
Can we go deeper still? It can’t get worse than that, can it? Put the text of Aygül’s reply into Google Translate exactly as it appears on screen in Turkish, including the two ellipses indicating the line break in the middle:
“En toy gördüğüm hatunlar bile... ...bacıyan olmaya hazırdır.”
Now put your cursor before the second “b”, and backspace to delete the second ellipsis (like I happened to do while working on this), and you get the following translation:
“Even the nastiest chicks I've ever seen are ready to be your mistress.”
Crikey, Google; settle down! What’s it doing here? My first thought was that while two ellipses mark a line break, maybe a single ellipsis is interpreted as the speaker pausing just before the word “bacıyan” (“your sister”), shifting the meaning to a more euphemistic one. Nice neat theory, but sadly probably wrong: if you delete the remaining ellipsis as well, thus deleting my suggestive “pause”, then the meaning shifts again:
“Even the nastiest chicks I've ever seen are ready to be a mistress.”
Just “a” mistress, instead of Bala’s specifically. So it’s not deriving the same “meaning” that I did from the significant pause. But what if it could? The thing is, the speech recognition AI would probably be reasonably good at guessing true “meaning”, given that it can presumably be trained to distinguish emotions such as anger, scorn, or relief from the tone of voice used. Why isn’t Google Translate hooked up to the speech recognition software, to help it make better translations? What would be required to export data on “meaning” from the speech recognition engine, in a form (text) that Google Translate could use? While I’ll let a machine learning expert answer that one, it seems obvious that Google Translate is still spitting out the occasional horribly inappropriate translation because it’s difficult to stop it from doing so. Is that something that will be fixed by making computer power cheaper, and just doing more of the same only scaled bigger? Or does it mean that in order to go from (spoken words + tone of voice) to text, and then to translated text, you need to go through an internal model called something like “meaning”? Time will tell.
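If nothing else, this kind of instability is easy to measure. Here is a toy test harness: feed the translator several trivially different versions of the same line and flag any disagreement. Again, translate() is a hypothetical stand-in for a real MT system; the fake translator below just replays the outputs I saw.

```python
LINE = "En toy gördüğüm hatunlar bile... ...bacıyan olmaya hazırdır."

# Variants that a human reader would treat as identical.
variants = [
    LINE,                             # both ellipses (the line break)
    LINE.replace("... ...", "... "),  # second ellipsis deleted
    LINE.replace("... ...", " "),     # both ellipses deleted
]

def check_stability(translate, variants):
    """Translate each variant and report if the outputs disagree."""
    translations = [translate(v) for v in variants]
    if len(set(translations)) > 1:
        print("Unstable! Same sentence, different translations:")
        for v, t in zip(variants, translations):
            print(f"  {v!r} -> {t!r}")
    return translations

# Fake translator replaying the behaviour described above.
fake = {
    variants[0]: "...ready to be your sisters.",
    variants[1]: "...ready to be your mistress.",
    variants[2]: "...ready to be a mistress.",
}.get
check_stability(fake, variants)
```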
Concluding remarks
It hasn’t escaped my attention that the two very common and representative error examples I chose today were basically Google Translate a) getting gender wrong and b) saying offensively inappropriate things to women. As Gary Marcus reminds us in his post, “many of the unfortunate consequences of a webscraping/big data approach to AI are disproportionately borne by women and minorities.” That suggests we need to ask two questions: not just “how big would we need to scale up the current models before the current problems go away, preferably a number less than infinity?” but also “what shortcuts can we take to make AI-generated stuff less gross, right now?” If that involves installing a prefrontal cortex on top of current machine learning language models so we can force them to learn manners, then maybe we look at doing that too.
These conclusions are nothing particularly new or original; heck, Gary Marcus’s blog is literally titled “The Road To AI We Can Trust”, and lists a bunch of people who have been working on this for ages. Me watching TV shows in a country far from the place they were made is a very narrow gap to peer through, and of course I won’t see anything clearly. But hey! I learned something, and if any other curious types happen to peer through the same gap, maybe this will help them.
(I do have another example error I wanted to include, a speech recognition error rather than translation, but I wanted to record a clip from the show with subtitles hardcoded in and the copyright holders said “nah”. I’ll try to post it separately if I can get it to work. Otherwise, more talking about TV to come!)
*Update* Here is the next bit: Link to Part 2; Link to Part 3.