Whispers From Beyond The Veil
in which Whisper (OpenAI's transcription machine) battles against a veiled man speaking in riddles
The old hipster stroked his beard. “When I first started watching anime, it was all on VHS. We had to order videotapes by snail-mail, and they came with hard-coded subtitles. That was if you were lucky; often they were dubbed into English using really bad voice actors, and the tape itself was a copy of a copy of a copy and you could barely see through the lines on the screen…”
He swirled the dregs of his beer in its hand-labeled bottle. “You kids have got it easy!”
The ease of watching foreign-language TV and movies these days never ceases to amaze me. Dozens of hours of Turkish TV, far more than I could watch, are uploaded to official YouTube channels every week. Machine transcription and translation allow passable subtitles to be generated within hours, reducing the amount of human curation needed. Language apps feed me Turkish vocabulary whenever I have a spare moment, and obscure words that the translation robots struggle with can be tracked down on the internet with a little effort.
Although machine translation still needs a lot of work (see previous posts), machine transcription is by comparison very, very good. And both are improving rapidly, in quality as well as accessibility. ChatGPT will now translate text for you in an easy web interface, and transcription won’t be far behind. Until recently, getting a large audio file like a two-hour episode of Turkish TV transcribed has generally required a paid service. Now we have Whisper.
Introduced in September 2022, Whisper is the open-source automatic speech recognition system from OpenAI (the ones behind ChatGPT and DALL-E). Anyone can download the code and use it, but until some nice shiny apps are developed you will need a certain level of computer proficiency, and to be comfortable with using the command line. Happily, other people have now developed things to a level that accommodates my own humble computer skills, in the form of the yt-whisper repository on GitHub. This code includes a set of Python scripts for downloading YouTube audio (via the yt-dlp code) and feeding it to Whisper for transcription or translation.1 The output is a subtitle file in either VTT or SRT format.
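To give a flavour of that last step, here is a minimal sketch (my own hypothetical helper functions, not the repository’s actual code) of how Whisper-style segments, dicts with “start”, “end” and “text” fields, can be serialised into the SRT format:

```python
# Sketch: serializing Whisper-style transcript segments to SRT.
# Assumes segments shaped like Whisper's result["segments"]: dicts with
# "start" and "end" times in seconds, plus the transcribed "text".

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render a list of Whisper-style segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

VTT output differs only in details (a “WEBVTT” header, and a full stop instead of a comma in the millisecond field).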
I won’t attempt to talk you through installing and running this code, because I am not confident that I could troubleshoot successfully if anyone tried to follow my guide. Instead I want to show you some results, which are quite impressive.
There’s only one setting that makes a difference to Whisper’s performance,2 and that is the size of the language model to be used. A larger model gives better results, and obviously takes more time to run. The default size is “small”, which apparently works well with audio in English, but it’s suggested you go up to “medium” or “large” for transcribing other languages. Let’s give it a try!
I ran the script twice on episode 35 of Alparslan: Büyük Selçuklu, once on “small” and then again on “large”. Episode runtime is 2:27:48; “small” took 1:45:48 to process this, while “large” took considerably longer at 7:28:11. My computer is nothing special, though.
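For scale, those runtimes correspond to roughly 0.7× real time for “small” and about 3× real time for “large”; a quick back-of-envelope check:

```python
# Quick check of the processing-speed ratios for the runtimes quoted above.

def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to total seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

episode = to_seconds("2:27:48")   # 8868 s of audio
small   = to_seconds("1:45:48")   # 6348 s to transcribe
large   = to_seconds("7:28:11")   # 26891 s to transcribe

print(f"small: {small / episode:.2f}x real time")  # 0.72x: faster than playback
print(f"large: {large / episode:.2f}x real time")  # 3.03x: much slower
```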
For comparison of the results I’ve chosen this scene3 showing Alparslan's fever dream: poisoned and unconscious, he receives healing from some strange veiled figures in the dream world. I chose this mainly because the mysterious figures are speaking in riddles and I really wanted a transcript of what they were saying because I couldn’t make sense of much of it on my own. Also, the combination of groups of people speaking in unison, using unusual vocabulary and religious imagery, along with the ever-present background music and some mystical-sounding echo effects should raise the difficulty level for the language model and present a proper challenge. As usual for this blog, I learned a couple of new words along the way.
In the following, the first line of each quote block is Whisper's transcript using the "small" model; the second line is Whisper's "large" transcript, and the third is an English translation.4 The first few blocks are all spoken by the mysterious veiled man:
1:04:12 Merak etme, yaran hafif. Lakin yükün ağır.
1:04:32 Merak etme, yaran hafif. Lakin yükün ağır.
(Don’t worry, your wound is minor. But your burden is heavy.)
So far so good, both models give a correct transcript. However, the timing of the “small” transcript is off; it’s too early by 20 seconds. Proceeding to the next line:
1:04:19 Şimdi ayak ha ve uyuyana kadar.
1:04:39 Şimdi ayağa kalk ve uyuyanları uyandır.
(Now get up and awaken the sleepers)
The “large” model is almost perfect; the speaker actually uses the archaic “imdi” instead of the modern “şimdi”, but since the two words are almost identical in sound and meaning it’s not a serious mistake. Output from the “small” model is heavily flawed, and the 20-second timing offset remains.
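The gap between the two models can be put on a rough numerical footing by a word-level comparison against the “large” output, treated here as the reference. This is a crude stand-in for a proper word error rate, sketched with Python’s difflib:

```python
import difflib

def rough_wer(reference: str, hypothesis: str) -> float:
    """Approximate word error rate: fraction of reference words not matched.
    (A real WER counts substitutions/insertions/deletions; this is a sketch.)"""
    ref, hyp = reference.split(), hypothesis.split()
    matched = sum(block.size for block in
                  difflib.SequenceMatcher(None, ref, hyp).get_matching_blocks())
    return 1 - matched / len(ref)

large = "Şimdi ayağa kalk ve uyuyanları uyandır."
small = "Şimdi ayak ha ve uyuyana kadar."
print(f"{rough_wer(large, small):.0%} of words wrong")  # 67% of words wrong
```

Only “Şimdi” and “ve” survive intact, so two-thirds of the words in this line are errors in the “small” transcript.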
The next line is a difficult one for me, with the mysterious figure speaking in riddles. Whisper’s “large” model transcribes it as:
1:04:55 Düşen düştüğü gibi kalkarsa kalkmamış say.
(If the fallen one stands up in the manner in which he fell, count him as not having stood up)
It’s possible that the final word could be “sayır” instead, which wouldn’t change the meaning, but I’m not at all confident of the translation I’ve given there. Much like the character of the Sphinx in Mystery Men (1999), I think the guy is being very mysterious here.
You’ll notice, though, that I’ve reported no transcription for the “small” model. That’s because it did nothing but repeat the word “yeter” (“enough”) for fourteen seconds after its previous output. Poor thing; the Sphinx has broken Whisper’s tiny brain! If there is any doubt, its next output confirms the situation:
1:04:40 pronik만et uygulan sabijine de yenilir.
1:04:52 Gühlem gelmelerle gitmekten uygulamas locking Development them offenses sonrasısı 밤en yussana.
1:05:01 Kullan Thingin dinlerinde,
1:05:03 San underground istor düleri��iz Erik Aładat thenrain.
The “small” model vomits out this hilarious stream of broken Turkish, random Korean characters, and the strange phrase “locking Development them offenses” for some reason. Amazingly, it seems to feel much better after this. The transcription is reasonably accurate for the following minute or two of dialogue, and the 20-second timing offset is now corrected!
It’s pretty clear though that the “small” model is not worth using for Turkish transcription. Probably best if we let it sit quietly somewhere comfortable so that it can’t see anything that might upset it, and just use the “large” model from now on.
The mysterious figure continues into a very interesting exchange:
1:05:03 <Mysterious figure> Al yerden bir avuç toprak, şifan ile kalk.
(Take a handful of earth from the ground, stand up and be healed)
1:05:07 <Chorus of veiled figures> Hay Hak.
(Life! Truth!)
1:05:37 <Mysterious figure> Neyle kalktınız Alp'imiz Arslan?
(What enables you to stand, our warrior Arslan?)
1:05:49 <Alparslan> Toprak ile.
(Soil does.)
1:05:51 <Mysterious figure> Toprak insandır evladım. Toprak anadır.
(My child, earth is a person. Earth is a mother.)
1:05:56 <Mysterious figure> Bu topraklar da Anadolu'dur.
(So, these soils are full of the mother.)
1:06:00 <Chorus of veiled figures> İnşallah Anadolu'dur.
(If God wills it, it is Anatolia)
I mentioned the reverence that is held for the soil in my last post - note again the ritualistic sprinkling of earth at 1:05:40. Here, the “Mother Earth” aspect is invoked as part of the healing ritual. This allows the mysterious figure to make a play on the words “ana dolu” (“full of the mother”) and “Anadolu” (the Turkish word for the Anatolian peninsula). I’m pretty sure he emphasises this pun for the audience by bending the vowel harmony slightly: his pronunciation at 1:05:56 sounds to me like “Anadolu'dır”, stressing the connection to “anadır”, instead of the grammatically correct “Anadolu'dur”, which is how the chorus pronounces it at 1:06:00 and how Whisper transcribes it.
(It’s a fun play on words, but it’s not the actual origin for the place name. “Anadolu” apparently comes from the Greek Ἀνατολή (Anatolḗ) referring to the east as the direction where the sun rises.)
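The four-way suffix harmony behind the pun can be sketched in a few lines. This is a toy illustration of the general rule (my own simplification, ignoring the consonant voicing that turns d into t after voiceless consonants):

```python
# Toy sketch of Turkish four-way vowel harmony for the -DIr copula suffix.
# The suffix vowel is chosen by the last vowel of the word it attaches to.

HARMONY = {
    "a": "dır", "ı": "dır",   # back unrounded
    "e": "dir", "i": "dir",   # front unrounded
    "o": "dur", "u": "dur",   # back rounded
    "ö": "dür", "ü": "dür",   # front rounded
}

def copula(word: str) -> str:
    """Attach the vowel-harmonic -DIr suffix ("is") to a Turkish word."""
    last_vowel = next(c for c in reversed(word.lower()) if c in HARMONY)
    return word + HARMONY[last_vowel]

print(copula("ana"))      # anadır   (last vowel a -> dır)
print(copula("Anadolu"))  # Anadoludur  (last vowel u -> dur)
```

(In writing, proper nouns take an apostrophe before the suffix, hence “Anadolu'dur”; the mysterious figure’s “Anadolu'dır” breaks the harmony rule to echo “anadır”.)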
Anyway, that’s a bit of a diversion from what I’m supposed to be doing, which is assessing how good Whisper’s transcription is. I’ll quote more of the scene for this purpose:
1:06:02 <Mysterious figure> Elinle tuttuğun toprak hem şifandır hem gazağındır.
(That earth you took with your hand is both your healing and your Holy War)
1:06:08 <Chorus of veiled figures> Hay hak, Pir'imiz ne güzel dedi.
(Life! Truth! What a beautiful thing our elder said.)
1:06:13 <Alparslan> Ne diye yüzünde perde var?
(Why do you have a veil on your face?)
1:06:16 <Mysterious figure> Vallahi yüzümüzde perde yok. Billahi yüzümüzde perde yok.
(As God is my witness, we do not have veils on our faces. I swear we do not have veils on our faces.)
1:06:21 <Chorus of veiled figures> Vallahi perde insan gözünde, billahi evli insan gözünde.
(As God is my witness, the veils are on the eyes of the people. I swear the veils are on the eyes of the people.)5
In all of the above, Whisper’s Turkish transcription is reproduced uncorrected. This is partly because it rarely needs correcting. There are a couple of errors worth commenting on, though.
As usual, I am amazed by the model’s ability to correctly transcribe groups of people speaking in unison. There’s one place I’m pretty sure it made an error, which is the line at 1:05:07 above. Whisper hears the group chanting “Hay Hak”, but I am quite certain it is actually “Hârî Hak”. Now “hârî” is a very obscure old Ottoman word that apparently translates as “worthy”. It isn’t listed in Wiktionary, nor in the TDK dictionaries, but there are online Ottoman Turkish dictionaries that give a translation, such as luggat.com.
Note that I was unfamiliar with “hârî” until now. I simply thought that “Hay” didn’t sound quite right, then googled “hari” to see what came up.6 This is the process that I concluded must be difficult for Google’s subtitle transcription engine, when I said:
Since words appear in common usage before they appear in dictionaries, machines will regularly be confronted with brand new words that they’ve never seen before. How many generations of language model will we have to go through before a new word can be correctly classified as such by the machine that first encounters it, rather than the alternative above which is to assume it misheard? How long until a chatbot can understand when to coin a new word, if its internal dictionary is lacking the words it needs to express something? Will that new word sound natural to the first human that hears it, or will that language model instantly fail the Turing test?
The answer to the “how long” questions above could well be “ChatGPT already does this now”. But if my understanding of how language models are trained is correct, i.e. they don’t understand “meaning” but instead repeat words based on how frequently they appear together in training data, then the language models could really struggle with adding something truly new.
It seems that Whisper was not familiar with the word “hârî”, and so it went with a (presumably lower quality) match to “hay” instead. I could add this example as further evidence that machines will struggle with words that weren’t in their training data or dictionaries. But look again at the line further down:
1:06:02 <Mysterious figure> Elinle tuttuğun toprak hem şifandır hem gazağındır.
(That earth you took with your hand is both your healing and your Holy War)
My interpretation of this line is that “gazağındır” is a transcription error, and the mysterious figure is probably saying “gazandır” instead,7 derived from the Islamic word “gaza” (“Holy War”) as reflected in my translation. Whisper perhaps mistakes the speaker's slight emphasis on "gaza" as extra syllable-length. But this goes against my belief that transcription machines will ignore the evidence from their own "ears" if the word they hear is new: as far as I can tell, “gazağındır” is not a real word in Turkish, but if it were it would be pronounced pretty similarly to “gazandır”. The machine has transcribed exactly what it heard, guessing the spelling since it apparently doesn't know the word "gaza". It has effectively invented the word "gazak" (which does not appear to be a real Turkish word) and used correct Turkish grammar to build it up into the nonsense word “gazağındır”.
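The morphology Whisper applied here is regular: word-final k softens to ğ before a vowel-initial suffix, and then the second-person possessive -ın and the copula -dır attach. A toy sketch (my own simplified functions, back-vowel suffix forms only):

```python
# Toy sketch of the Turkish morphology Whisper applied to its invented stem:
# final-k softening (k -> ğ) before a vowel-initial suffix, then the
# 2nd-person possessive -ın and the copula -dır (vowel harmony simplified).

def soften_final_k(stem: str) -> str:
    """Word-final k softens to ğ before a vowel-initial suffix."""
    return stem[:-1] + "ğ" if stem.endswith("k") else stem

def your_it_is(stem: str) -> str:
    """Build stem + -ın ("your") + -dır ("it is"), back-vowel forms only."""
    return soften_final_k(stem) + "ın" + "dır"

print(your_it_is("gazak"))  # gazağındır: the nonsense word Whisper produced
```

So from the invented stem “gazak”, the grammatically well-formed (if meaningless) “gazağındır” falls out mechanically.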
There we go. The answer to my “how long” questions above could indeed be “it’s already here”.
Let’s get some more evidence before we proclaim that machines can now transcribe words they’ve never encountered. How does Whisper handle nonsense words in English? The answer here is “pretty damn well”. Spike Milligan’s own reading of his famous poem “On the Ning Nang Nong” is transcribed almost perfectly by Whisper’s “large” model.
I guess it’s possible that words like “ning-nang-nong” were in the training data somehow? We need to find another example to be sure. Try this one: “How English sounds to non-English speakers”, which is based on the short play “Skwerl” by Brian Fairbairn & Karl Eccleston; script published here. How does Whisper cope with this load of nonsense?
Here the answer is “interestingly well”. Below is a partial transcript, beginning at the part where they sit down at the table together, around 50 seconds in. The published script isn’t followed exactly, as far as I can tell (it’s surprisingly hard to follow along, even as a native English speaker), but it’s pretty close. And Whisper is again far better than I am at “understanding” and transcription. The first line in each block is Whisper’s “large” model transcript; the second line (in brackets) is from the published script. Note how Whisper’s transcript is nonsense in terms of meaning, but in contrast to the script it uses very few actual nonsense words:
1:00 So I read your poem on the watch today.
(So I ran to york around the wash today.)
1:02 Oh?
(Oh?)
1:03 Yeah. That dulls a wrinkle in your face.
(Yeah. That doll’s areen blunderface.)
1:05 Can't believe that whore helped him, John. Did you stop by the Love Life Call?
(Can bereave that mory alpen john. Joo flan by the long blat call?)
1:11 Yeah, I couldn't buy the next drink. I played that privatey by the wrong front line today.
(Yeah. I coon by the mex areen. Oh you bleed that pribadium by the rongfort line today?)
1:15 Oh, the Raising Man of the National Marine? Don't you agree with that, Trajan?
(Wha? The razy man in the nash marine? Doan for meen that you greed that tresion.)
1:19 No, Prestation is Trapped. I mean, Why the Crest Soldier for the Magdalene Nation is Further Grat to my Chosec.
(No, purstation is trap. I mean, why the crest soldier for the magbaleen nation? Its further grad to my chosik.)
1:24 Chosec for the Magalong?
(Chosik for the magalon.)
1:25 Magalong my shit.
(Magalon my shit.)
In every case here, you can argue that Whisper is matching to words it knows. Even “Trajan”, “Chosec” and “Magalong” are apparently real people’s names, which would explain why Whisper capitalises “Magalong” at 1:24. “Privatey” looks like a made-up word, until you google it and see all the matches to simple online misspellings of “privately”. None of the nonsense words in the script are transcribed correctly; everything is instead matched to similar but already existing words.
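This “snap to the nearest known word” behaviour is easy to imitate with fuzzy matching against a vocabulary. A sketch using Python’s difflib, with a tiny made-up word list standing in for the huge vocabulary an ASR model effectively learns from its training data:

```python
import difflib

# Tiny stand-in vocabulary; a real speech model effectively matches against
# every word form it saw during training.
vocabulary = ["privately", "private", "pirate", "parade", "primate"]

# The nonsense word from the script snaps to the closest real word.
print(difflib.get_close_matches("privatey", vocabulary, n=1))
# ['privately']
```

Of course Whisper is matching sounds rather than spellings, but the principle is the same: a novel input gets pulled toward the nearest known item rather than being transcribed as itself.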
That means “gazağındır” above could be the only example we have of a robot inventing a new word by writing down exactly what it hears. We need to look at it more closely to make sure it stands up.
Turns out that “gazak” might be an old Persian word for “snack”. This word seems to have been loaned into a few Turkic languages such as Uzbek and Azerbaijani, and even into Hindi (all of these languages are available in Whisper). Azerbaijani is so close to Turkish that they are almost dialects, and “gazağındır” is parsed by Google Translate as a genuine word in Azerbaijani; I don’t know enough to say that this is an error. But we still have no strong evidence that Whisper can transcribe anything other than words it already knows somehow.
Whisper was built8 by scaling up a known technique (weakly supervised speech recognition) by an order of magnitude over what had been done before. It apparently performs almost as well as professional human transcribers (see the Figure reproduced below). Scaling it further will likely make it better still, with fewer and fewer errors. But will it ever be able to do what I did above, which is to recognise that the sound “hârî” is likely to be a new word that it hasn’t heard before? It may spontaneously develop this ability when scaled larger by some unknown number of orders of magnitude. There’s little evidence that it’s beginning to do this yet; instead, it very strongly prefers to match with a known word.
The thing is, as time goes by and training datasets get larger, encompassing all modern languages as well as those words known from the past, the number of truly new word sounds gets smaller. Ultimately, for alphabet-based languages such as Turkish and English, it may not matter if language models never learn to transcribe a completely new sound; simply writing down the closest known word-sound across all languages may effectively cover all the possibilities.
1. Unfortunately Whisper can’t do both at once; transcription and translation must be executed separately, i.e. run it again. Translation into English is currently the only supported target language.
2. Not strictly true; you can also tell Whisper which language it will be transcribing, which saves time compared to the default “detect language” setting.
3. Alparslan: Büyük Selçuklu (TRT 2022) 35. Bölüm; timestamp 1:04:14
4. The translation is based on this one, originally submitted by OpenSubtitles.org user Wild_Hunter803, modified later by user Scooby74, then further modified by me.
5. To understand this religious imagery, see for example sūrat l-baqarah (2:7): “Allah has set a seal upon their hearts and upon their hearing, and over their vision is a veil. And for them is a great punishment.”
6. The trick to doing this is to ask Google the question in Turkish, i.e. type in “hari ne demek”.
7. Interestingly, the “small” model agrees with me, and writes “gazandır” here.
8. Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, Ilya Sutskever; arXiv:2212.04356. https://doi.org/10.48550/arXiv.2212.04356