Finally, a Use for Big Data: Cracking the Voynich Manuscript

The Voynich Manuscript might have been dropped to Earth by aliens; it might be a medieval cipher whose mystery outlived anyone who had the key; it also might be a prank and moneymaking scheme by some haggard rare bookseller. But whatever the book actually is, Brazilian scientists are pretty certain that the manuscript’s text—which is written in a language and alphabet only found in the Voynich itself—isn’t just gibberish. There’s meaning in there, and complex network modeling or other big data tools might crack the enigma that has thus far proven unbreakable.

Granted, the work led by Dr. Diego Amancio hasn’t yet told us anything new about the manuscript, which is named for Wilfrid Voynich, the antiquarian who came across the medieval-looking book in 1912. A professor at the University of São Paulo’s Institute of Mathematical and Computer Sciences, Amancio found evidence that indicates, at least, that the manuscript makes some sort of sense. Beyond just revealing the manuscript’s secrets, Amancio’s work may help to boost the intelligence of bots past the Turing Test, like the impressive or maybe unimpressive software Eugene Goostman, which famously sort of passed the test early this year.

“Our research has shown that the Voynich Manuscript presents a great deal of statistical patterns that are similar to those of natural languages,” says Amancio. Besides endorsing the existence of some meaning in the text, his conclusions fly in the face of many theories that treat this piece of work as an elaborate prank made by some old-school braggadocio.

Fraud theories have long loomed over studies of the manuscript. Chemical analyses prove that the book was crafted between 1404 and 1439, but much of the book’s life, like its meaning, remains shrouded. Its rise to worldwide fame began at the start of the 20th century, when the Polish bookseller Wilfrid Voynich rediscovered it in Italy.

Voynich wasn’t able to translate the weird book, nor could he find anyone who could. Thanks to his efforts, however, the story of the Voynich Manuscript and its singular Voynichese language has piqued the interest of scientists and cryptographers no less notable than Alan Turing himself. None has succeeded in deciphering it, but people to this day won’t give up trying.

It’s easy to see why: The Voynich Manuscript boasts around 200 pages written in unknown characters, and filled with sketches of bizarre, unrecognizable plants, naked women diving into weird-ass pools, tentacled creatures, and Zodiac constellations. I mean, what’s going on here?

Image: Wikimedia Commons

In his study, published in the scientific journal PLoS One, Amancio applied statistical methods to the text to determine whether it is just codified mumbo-jumbo that looks a lot like a text (which frustrated would-be translators have accused it of being) or whether it’s an actual text in an actual language. Instead of considering possible meanings and attempting to translate, Amancio mapped the words as points joined by connections, in what is called complex network modeling.

“In texts, each distinct word represents a vertex. Two words are connected by an edge if they appear close to each other in the text,” Amancio said. Programming in C and using the Network 3D software, Amancio managed to create gigantic orbital models in which words and their connections are laid out according to their presence and location within the text.
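To make the idea concrete, here is a minimal sketch in Python of the kind of word network Amancio describes. It is not his code (he worked in C); the function name `build_cooccurrence_network`, the `window` parameter, and the sample sentence are all made up for illustration, and "close to each other" is assumed here to mean "directly adjacent" by default.

```python
from collections import defaultdict


def build_cooccurrence_network(text, window=1):
    """Map each distinct word (a vertex) to the set of words it is linked to.

    Two words get an edge when they appear at most `window` positions apart.
    The window size is an illustrative assumption, not a value from the study.
    """
    words = text.lower().split()
    graph = defaultdict(set)
    for i, word in enumerate(words):
        graph[word]  # touching the key makes every distinct word a vertex
        for j in range(i + 1, min(i + 1 + window, len(words))):
            other = words[j]
            if other != word:  # skip self-loops
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


# Toy input; a real run would use a transcription of the manuscript.
sample = "the cat saw the dog and the dog saw the cat"
network = build_cooccurrence_network(sample)

# A vertex's degree is simply how many distinct words it touches.
degrees = {word: len(neighbors) for word, neighbors in network.items()}
for word, degree in sorted(degrees.items(), key=lambda item: -item[1]):
    print(word, degree)
```

Once a text is in this form, its shape can be compared with networks built from known books without understanding a single word, which is what makes the approach usable on an undeciphered text.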

The researcher found that the Voynich systems are, in 90 percent of cases, similar to those of other known books such as the Bible, indicating that it’s an actual piece of text in an actual language, and not well-planned gibberish. Beyond showing that Voynichese is very language-like, Amancio used concepts such as frequency and intermittence, which measure how often a term occurs and how concentrated its occurrences are in the text, to discover the manuscript’s keywords.
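As a rough illustration of those two measures, the Python sketch below counts how often each word occurs and scores how unevenly its occurrences are spread. The intermittence formula used here (standard deviation of the gaps between consecutive occurrences, divided by their mean) is one common way to quantify burstiness and is an assumption on our part; the published study may define it differently. The function name and toy word list are also invented for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def frequency_and_intermittence(words):
    """Return {word: (frequency, intermittence)} for a list of words.

    Intermittence is taken here as std(gaps) / mean(gaps), where gaps are
    the distances between consecutive occurrences of a word. This is an
    assumed, simplified definition. Words with fewer than two gaps get None.
    """
    positions = defaultdict(list)
    for index, word in enumerate(words):
        positions[word].append(index)

    stats = {}
    for word, where in positions.items():
        gaps = [later - earlier for earlier, later in zip(where, where[1:])]
        if len(gaps) >= 2:
            intermittence = pstdev(gaps) / mean(gaps)
        else:
            intermittence = None  # too few occurrences to judge
        stats[word] = (len(where), intermittence)
    return stats


# Toy text: "a" is evenly spaced, "b" arrives in a burst and then once more.
words = "a x a x a x a x b b b x x x x x x b".split()
stats = frequency_and_intermittence(words)
print(stats["a"])
print(stats["b"])
```

Under this definition an evenly spaced word scores zero, while a word that clusters in one part of the text scores higher. Words that are both reasonably frequent and strongly clustered tend to be topic words rather than grammatical filler, which is the intuition behind using such a score to hunt for keywords.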

The gigantic model created by Diego Amancio

What emerged are terms like cthy, qokeedy, and shedy. Although each original Voynichese letter got an equivalent from the Latin alphabet, the results don’t make any sense to human beings yet. Let’s just say that Amancio sees that as an edge piece of a bigger, complicated puzzle. “These words can be studied further by cryptographers and other manuscript scholars,” he says.

After so much research, even he can only guess at what the book’s actual content says. “I believe it’s a compendium of medieval practices involving medicinal recipes, astrological and metaphysical descriptions and fertility rites, as the images imply,” he says.


The Voynich Manuscript is held at the Beinecke Library at Yale. Amancio saved himself a trip to New Haven, and instead examined the pages of the book through its digital version, which has opened the Voynich to any willing cryptographer with an internet connection.

At the beginning of 2014, Stephen Bax, professor of applied linguistics at the University of Bedfordshire, in the UK, suggested that Amancio’s hunch is correct after using a totally different approach to the enigma. In an interview with Motherboard, Bax pointed to one of his articles wherein, just like the first Egyptologists, he began establishing meanings starting from capital letters, proper nouns and illustrative images. In this manner, Bax believes he discovered Voynichese words for “bull” and “coriander.”

His line of work has attracted a great deal of criticism from other scholars, though. The website Cipher Mysteries dismissed Bax as that dude who has come crashing in late to the party. Bax, though, shrugs and goes on trying to break the code. In the end, criticism like this is commonplace in the academic world, and even more so in the case of the manuscript, which has attracted its share of crackpots.

Jorge Stolfi, a specialist in natural language processing, has studied the manuscript for seven years. He believes it to be a transcription of an East Asian language, but the University of Campinas (Unicamp) professor avoids stating for certain what the book contains. “Even if my hypothesis is correct, I don’t dare predict when it’ll be deciphered,” he told FAPESP Magazine in an interview.

Image: Wikimedia Commons

Another language specialist raised on a hard sciences diet, Osvaldo Oliveira Jr. is a bit more cautious regarding the manuscript’s translation. The professor and director of USP São Carlos’ Physics Institute is openly enthusiastic, though, about the possibilities presented by research that merges data and machine analysis with complex network modeling.

“We want to use statistical physics to analyze texts. This is a hardware limitation in 2014, but it may not be so in 2020,” he says. Along with Amancio and researchers Eduardo Altmann and Diego Rybski, Oliveira is a co-author of the article on the manuscript. According to their research, the techniques applied to the book could someday also be used to identify the authors of unknown documents, to determine the most likely meaning of an ambiguous term, and even to judge the quality of a translation.

If all it takes is more hardware to crack the Voynich manuscript, whose authorship, meaning, and translation have resisted revelation for over a century, it’s time to send some processors to Brazil, stat. Making robotic Russian teenagers is one thing, but this is a whole new language game.

This article was translated from Motherboard Brazil by Thiago “Índio” Silva. It has been published in Motherboard US (Aug/2014).


