How does “AI translation” work? Some high-level thoughts

The computer world is a high-bullshit industry.  Every computer system consists of nothing more than silicon chips running streams of ones (1) and zeros (0), however grandly this may be dressed up.  The unwary blindly accept and repeat the words and pictures offered by salesmen with something to sell, and these are repeated by journalists who need something to write about.  Indeed the IT industry is the victim of repeated fads, always hugely oversold, which come, reach a crescendo, and then wither away.  But anybody doing serious work needs to understand what is going on under the hood.  If you cannot express it in your own words, you don’t understand it, and you will make bad decisions.

“AI” is the latest nonsense term being pumped by the media.  “Are the machines going to take over?!” scream the journalists.  “Your system needs AI,” murmur the salesmen.  It’s all bunk: marketing fluff for the less majestic-sounding “large language model (LLM) with a chatbot on the front.”

This area is the preserve of computer science people, who are often a bit strange, and are always rather mathematical.  But it would seem useful to share my current understanding as to what is going on, culled from a number of articles online.   I guarantee none of this; this is just what I have read.

Ever since Google Translate, machine translation has been done by having a large volume of texts in, say, Latin, a similarly large volume in English, and a large body of human-written translations of Latin into English.  The “translator” takes a Latin sentence input by a human, searches for a text containing those words in the mass of Latin texts, looks up the existing English translation of the same text, and spits back the corresponding English sentence.  Of course they don’t just have sentences; they have words, and clauses, all indexed in the same way.  There is much more to this, particularly in how material from one language is mapped to material in the other, but that’s the basic principle.  Systems that worked by this kind of direct lookup were known as “statistical machine translation”; the current version, which learns the same correspondences from the same parallel texts but stores them as numbers in a network, is called – jargon alert – “Neural Machine Translation” (NMT).
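To make the lookup idea concrete, here is a minimal sketch in Python.  Everything in it is invented for illustration; a real system holds millions of sentence pairs, matches much smaller fragments than whole sentences, and scores the candidates statistically.

    # A toy illustration of the "parallel corpus" idea described above.
    # The sentence pairs are invented; real systems match fragments, not
    # whole sentences, across millions of pairs.
    parallel_corpus = [
        ("amor vincit omnia", "love conquers all"),
        ("errare humanum est", "to err is human"),
    ]

    def translate(latin_sentence):
        """Find the stored Latin sentence sharing the most words with the
        input, and return the existing human-made English translation."""
        input_words = set(latin_sentence.lower().split())
        best_english, best_overlap = "[no match found]", 0
        for latin, english in parallel_corpus:
            overlap = len(input_words & set(latin.split()))
            if overlap > best_overlap:
                best_english, best_overlap = english, overlap
        return best_english

    print(translate("Amor omnia vincit"))   # -> "love conquers all"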

This process, using existing translations, is why the English translations produced by Google Translate would sometimes drop into Jacobean English for a sentence, or part of it.

The “AI translation” done using an LLM is a further step along the same road, but with added bullshit at each stage.  The jargon word for this technology seems to be “Generative AI”.

A “large language model” (LLM) is a file.  You can download one from GitHub.  It is a file containing numbers, one after another.  Each word, or part of a word, is represented by numbers.  The numbers are not random either – they are carefully generated to record how that word fits into the language.  Words relating to similar subjects have numbers which are “closer together”.  So in the sentence “John went skiing in the snow,” both “snow” and “skiing” relate to the same subject, and will have numbers closer together than the numbers for “John.”
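The “closer together” point can be shown with a handful of invented numbers.  In a real model each word gets hundreds or thousands of numbers, all generated automatically from the text; the three-number versions below are made up purely for illustration.

    import math

    # Invented numbers for three words; real models generate hundreds of
    # numbers per word from huge quantities of text.
    word_numbers = {
        "skiing": [0.9, 0.8, 0.1],
        "snow":   [0.8, 0.9, 0.2],
        "John":   [0.1, 0.2, 0.9],
    }

    def closeness(a, b):
        """Cosine similarity: close to 1.0 means 'about the same subject'."""
        dot = sum(x * y for x, y in zip(a, b))
        size_a = math.sqrt(sum(x * x for x in a))
        size_b = math.sqrt(sum(y * y for y in b))
        return dot / (size_a * size_b)

    print(closeness(word_numbers["skiing"], word_numbers["snow"]))   # about 0.99
    print(closeness(word_numbers["skiing"], word_numbers["John"]))   # about 0.30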

Again you need a very large amount of text in both languages.  For each language, these texts are then processed into this mass of numbers.  The numbers tell you whether the word is a verb or a noun, or is a name, or is often found with these words, or never found with those.  The mass of numbers is a “language model”, because it contains vast amounts of information about how the language actually works.  The same English word may have more than one number; “right” in “that’s right” is a different concept from the one in “the politicians of the right.”  The more text you have, the more you can analyse, and the better your model of how words and sentences are actually used in real texts will be.  How many sentences contain both “ski” and “snow”?  And so on.  The analysis of the texts starts with human-written code that generates connections; but as you continue to process the data, the process generates yet more connections.
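The “how many sentences contain both ski and snow?” sort of counting is easy to sketch.  The three sentences here are made up; a real model counts patterns like this, and far subtler ones, across billions of sentences.

    from collections import Counter
    from itertools import combinations

    # A few made-up sentences standing in for a vast pile of real text.
    sentences = [
        "john went skiing in the snow",
        "fresh snow is best for skiing",
        "john read a book about rome",
    ]

    # For every pair of words, count how many sentences contain both.
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))
        for pair in combinations(words, 2):
            pair_counts[pair] += 1

    print(pair_counts[("skiing", "snow")])   # 2 - these words travel together
    print(pair_counts[("john", "snow")])     # 1 - a weaker connection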

The end result is these models, which describe the use of the language.  You also end up with a mass of data connecting the two languages together.  The same number on one side of the language pair will also appear in the other model, pointing to the equivalent word or concept.  So 11050 may mean “love” in English but “am-” in Latin.
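Purely to illustrate one number pointing at equivalent material on each side, here is a toy version.  The numbers, and the words attached to them, are invented; real systems connect the two sides in far more elaborate ways than a simple table.

    # Invented numbers; the point is only that the same number appears on
    # both sides of the language pair, pointing at the equivalent material.
    english_side = {11050: "love", 11051: "conquer-", 11052: "all"}
    latin_side   = {11050: "am-",  11051: "vinc-",    11052: "omni-"}

    def latin_equivalent(english_word):
        """Find the number for an English word, then see what that number
        points to on the Latin side."""
        for number, word in english_side.items():
            if word == english_word:
                return latin_side[number]
        return "[unknown]"

    print(latin_equivalent("love"))   # -> "am-"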

As before, there are a lot of steps to this process, which I have jumped over.  Nor is it just a matter of individual words; far from it.

The term used by the AI salesmen for this process is “training the model.”  They use this word to mislead, because it gives to the reader the false impression of a man being trained.  I prefer to say “populating” the model, because it’s just storing numbers in a file.

When we enter a piece of Latin text into an AI translator, it is encoded in the same way.  The system works out the appropriate number for each token – word or part-word – in our text.  The resulting stream of numbers is then fed into the LLM, which sends back the corresponding English text for those numbers, or for numbers which are mathematically “similar”.  Plus a lot of tweaking, no doubt.  The English comes back one token at a time, each one generated on the basis of those before it, which is why AI systems hesitate on-screen as they answer.
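The encoding step looks something like this.  The token table is invented; a real tokeniser learns its word and part-word pieces from the training text, and has a vocabulary of tens of thousands of them.

    # An invented token table. Real systems learn their own word and
    # part-word pieces and have tens of thousands of them.
    token_numbers = {"amor": 11050, "vincit": 11051, "omnia": 11052, "-que": 907}

    def encode(latin_text):
        """Turn the input text into the stream of numbers the model sees."""
        numbers = []
        for word in latin_text.lower().split():
            if word in token_numbers:
                numbers.append(token_numbers[word])
            else:
                numbers.append(0)   # unknown: would be split into known pieces
        return numbers

    print(encode("Amor vincit omnia"))   # -> [11050, 11051, 11052]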

But here’s the interesting bit.  The piece of Latin that we put in, and the analysis of it, is not discarded.  It is more raw data.  It does not change the model file on the spot, but the provider keeps it, and it can be fed into the next round of “training”, so that it ends up in the model.

This has two interesting consequences.

The first consequence is that running the same piece of text through an LLM twice will usually give different results, and not necessarily better ones.  The output is assembled by choosing each next word from a range of likely candidates, with a deliberate dose of randomness thrown in; and the next version of the model, rebuilt with yet more data – perhaps including yours – will behave differently again.

The second consequence is even more interesting: you can poison a model by feeding it malicious data, designed to make it give wrong results.  It’s all data, at the end of the day.  The model is just a file.  It doesn’t know anything.  All it is doing is generating the next word, dumbly.  And what happens if the input is itself AI-generated, but is wrong?
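A toy version of “generating the next word, dumbly” shows how easily wrong input shifts the output.  Everything here is invented, and it is vastly simpler than a real LLM, but the principle of poisoning-by-data is the same: the model only reflects what was counted.

    from collections import Counter, defaultdict

    def build_model(sentences):
        """Count, for each word, which words follow it. That is all this
        'model' knows; it has no idea what any of the words mean."""
        next_word_counts = defaultdict(Counter)
        for sentence in sentences:
            words = sentence.split()
            for current, following in zip(words, words[1:]):
                next_word_counts[current][following] += 1
        return next_word_counts

    def next_word(model, word):
        """Dumbly pick the most frequent next word."""
        return model[word].most_common(1)[0][0]

    honest_data = ["caesar conquered gaul", "caesar conquered britain",
                   "caesar conquered gaul"]
    print(next_word(build_model(honest_data), "conquered"))     # -> "gaul"

    # Now poison the data with a falsehood, repeated often enough to win.
    poisoned_data = honest_data + ["caesar conquered mars"] * 10
    print(next_word(build_model(poisoned_data), "conquered"))   # -> "mars"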

In order to create a model of the language and how it is used, you need as much data as possible.  Long ago Google digitised all the books in the world, and turned them into searchable text, even though 80% of them are in copyright.  Google Books merely gives a window on this database.

AI providers need lots of data.  One reason why they have tried to conceal what they are doing is that the data input is nearly all in copyright.  One incautious AI provider did list the sources for its data in an article, and these included a massive pirate archive of books.  But they had to get their data from somewhere.  Similarly, this is why there are free tiers on all the AI websites – they want your input.

So… there is no magic.  There is no sinister machine intelligence sitting there.  There is a file full of numbers, and processes.

The output is not perfect.  Even Google Translate could do some odd things.  But an AI translator can simply invent material – “hallucinations”.


2 thoughts on “How does “AI translation” work? Some high-level thoughts”

  1. There is another risk about machine translation based on parallel corpora. The principle is sound at first sight: preexisting translations made by humans are supposedly correct, and so can be used safely as training material for the program. The quality of the automatic translation depends directly on that of the human translations it is trained with. But as people start relying more and more on the technology and machine-translated texts get published through traditional channels, sooner or later these will be fed into the system as yet more training material. The obvious consequence is that incorrect patterns derived from a defective machine translation are more likely to be validated by their frequency in a database with an ever growing percentage of machine-translated texts, as the input from actual human translators decreases.
    The news is probably the biggest multilingual corpus in existence, updated daily; and a lot of it is machine-translated these days. It is worrying that all these translations must at some point become part of the machine’s reference material.
    I guess that generative AI in general has to address the same problem, because it depends just as much on the quality of its raw data. AI companies boast that their engines are up to date because they feed on the internet, and at the same time the internet is flooded with AI-generated texts. Is it possible to avoid a vicious circle there?

  2. I agree. As far as I can see, AI must destroy itself over time, precisely because the output increasingly becomes the input. But everyone is having too much fun to care right now.
