Scientists Can Now Store Digital Data in DNA With 100 Percent Accuracy
The new method works like a simple Sudoku puzzle, essentially using hints to keep any lost data from ruining the overall picture. "Even if you don't get all the numbers, you can still solve the Sudoku puzzle," Yaniv Erlich, co-author of the paper and a professor of computer science at Columbia University, told me over the phone.
According to the study, done in collaboration with New York Genome Center's Dina Zielinski, this method is much more efficient than previous ones, allowing for more data to be squeezed into and out of DNA strands - fitting 215,000,000 gigabytes on one gram of DNA. Compare that to the DVD's max of 8.5 gigabytes, or the iPhone's max of 256 gigabytes.
In comparison, in a 2013 study, other researchers managed to fit about 2,000,000 gigabytes on one gram of DNA.
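To put those figures side by side, here is a quick back-of-the-envelope calculation in Python. It uses only the numbers quoted above; the variable names are illustrative.

```python
# Storage densities quoted in the article, in gigabytes per gram of DNA.
NEW_METHOD_GB_PER_GRAM = 215_000_000
STUDY_2013_GB_PER_GRAM = 2_000_000

# Consumer storage capacities quoted in the article, in gigabytes.
DVD_GB = 8.5
IPHONE_GB = 256

# How many times denser the new method is than the 2013 result.
improvement = NEW_METHOD_GB_PER_GRAM / STUDY_2013_GB_PER_GRAM

# How many DVDs or iPhones one gram of DNA would replace.
dvds_per_gram = NEW_METHOD_GB_PER_GRAM / DVD_GB
iphones_per_gram = NEW_METHOD_GB_PER_GRAM / IPHONE_GB

print(f"Improvement over the 2013 study: {improvement:.1f}x")
print(f"DVDs per gram of DNA: {dvds_per_gram:,.0f}")
print(f"iPhones per gram of DNA: {iphones_per_gram:,.0f}")
```

By these numbers, one gram of DNA holds roughly 100 times more than the 2013 result, about as much as 25 million DVDs.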
Scientists are looking to DNA for data storage for several reasons: it can pack tons of information into very small molecules, it'll never become obsolete (unlike CDs or cassettes), and it can last for tens of thousands of years. (Scientist and artist Joe Davis had the idea of planting a forest of trees whose DNA encodes all of Wikipedia, as Motherboard previously reported.)
The four-lettered nucleobase alphabet of DNA (A, C, G and T) can be transformed into binary code - for example, as 00 for A, 01 for C, 10 for G and 11 for T.
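As a minimal sketch of that example mapping, the Python below turns bytes into a string of bases and back again. The function names are made up for illustration, and a real encoding scheme would involve more than this direct two-bits-per-base translation.

```python
# The example mapping from the text: two bits per nucleobase.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data):
    """Translate bytes into a DNA string, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(strand):
    """Translate a DNA string (length divisible by 4) back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = bytes_to_dna(b"Hi")
print(strand)                # CAGACGGC
print(dna_to_bytes(strand))  # b'Hi'
```

Each byte becomes four bases, so the two-letter message "Hi" needs eight.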
The crucial advance in this new study is the use of DNA Fountain, or fountain codes - a bit of coding theory that lets you transform whole files into encoded chunks, or "droplets" - to store the files, which Erlich said protects against corruption. If you have a fountain of encoded data, and catch enough droplets, you can put the file back together.
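The droplet idea can be illustrated with a toy fountain code in Python. This is a simplified sketch, not the researchers' implementation: the chunk size, the degree choices and all function names are assumptions made for the example. Each droplet is the XOR of a few randomly chosen chunks of the file, tagged with the seed that picked them, and the decoder "peels" droplets apart until every chunk is known.

```python
import random


def split_chunks(data, size):
    """Pad the data with zero bytes and cut it into equal-sized chunks."""
    padded = data + b"\x00" * (-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def chunk_indices(seed, k):
    """Recreate which chunks a droplet combines, from its seed alone."""
    rng = random.Random(seed)
    degree = min(rng.choice([1, 1, 2, 2, 3, 4]), k)  # toy degree choices
    return set(rng.sample(range(k), degree))


def make_droplet(chunks, seed):
    """XOR a seed-determined subset of chunks into one droplet."""
    payload = bytes(len(chunks[0]))
    for i in chunk_indices(seed, len(chunks)):
        payload = xor_bytes(payload, chunks[i])
    return seed, payload


def decode(droplets, k):
    """Peeling decoder: returns the k chunks, or None if too few droplets."""
    pending = [(chunk_indices(seed, k), payload) for seed, payload in droplets]
    solved = {}
    progress = True
    while progress and len(solved) < k:
        progress = False
        still_pending = []
        for idxs, payload in pending:
            # Strip out every chunk we already know.
            for i in list(idxs):
                if i in solved:
                    payload = xor_bytes(payload, solved[i])
                    idxs = idxs - {i}
            if len(idxs) == 1:
                (i,) = idxs
                solved[i] = payload  # a droplet reduced to a single chunk
                progress = True
            elif len(idxs) > 1:
                still_pending.append((idxs, payload))
        pending = still_pending
    if len(solved) < k:
        return None
    return [solved[i] for i in range(k)]


message = b"Even if you don't get all the numbers, you can still solve it."
chunks = split_chunks(message, 8)
droplets = [make_droplet(chunks, seed) for seed in range(2000)]
# Simulate losing a third of the droplets.
survivors = [d for d in droplets if d[0] % 3 != 0]
recovered = decode(survivors, len(chunks))
if recovered is not None:
    print(b"".join(recovered)[:len(message)])
```

The decoder does not care which droplets went missing, only that enough of them arrive. This toy version makes far more droplets than it needs. A practical scheme would aim to need only slightly more droplets than there are chunks.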
"You can think of every DNA oligo as a hint," Erlich explained. An oligo, or oligonucleotide, is a short DNA molecule. "Even if not all the DNA oligos are going to [survive uncorrupted and be readable], you can still solve the puzzle."
Until now, a popular technique for writing data into DNA (used in a 2013 study, for example) was the repetition mode, he explained. If you were encoding, say, the lyrics to the Beatles' "She Loves You", the sequences would be "she loves," "loves you," "you yeah," "yeah yeah."
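Here is a rough Python sketch of that overlapping approach, using the lyric as the data. The word-level chunks and function names are illustrative assumptions, not the 2013 study's actual scheme.

```python
def overlapping_chunks(words, size=2):
    """Slide a window over the words so neighbouring chunks overlap."""
    return [
        (start, " ".join(words[start:start + size]))
        for start in range(len(words) - size + 1)
    ]


def reassemble(indexed_chunks, total_words):
    """Rebuild the word list from surviving chunks; None marks a lost word."""
    words = [None] * total_words
    for start, chunk in indexed_chunks:
        for offset, word in enumerate(chunk.split()):
            words[start + offset] = word
    return words


lyric = "she loves you yeah yeah".split()
chunks = overlapping_chunks(lyric)
print([text for _, text in chunks])
# ['she loves', 'loves you', 'you yeah', 'yeah yeah']

# Lose one chunk ("loves you"): its neighbours still cover every word.
one_lost = [c for c in chunks if c[0] != 1]
print(reassemble(one_lost, len(lyric)))
# ['she', 'loves', 'you', 'yeah', 'yeah']

# Lose two adjacent chunks ("loves you" and "you yeah"): "you" is gone.
two_lost = [c for c in chunks if c[0] not in (1, 2)]
print(reassemble(two_lost, len(lyric)))
# ['she', 'loves', None, 'yeah', 'yeah']
```

Most words are stored twice, which doubles the space used, yet losing two neighbouring chunks is still enough to leave a hole.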
"They used this repetitive strategy so that if you miss one oligo, you still have the other oligo," Erlich said. If you miss the "loves you" chunk, you would still have "she loves" and "you yeah" to form the lyrics. "But you don't pack your information on the file efficiently, and there's a chance that you'll miss all four oligos in this sentence."
To test their method, Erlich and Zielinski encoded six files into DNA: a full computer operating system, a computer virus, the 1895 film "Arrival of a Train at La Ciotat," an Amazon gift card, a Pioneer plaque, and a study on information theory. They copied and diluted their files several times, and even gave the manuscript to one of Erlich's Twitter followers: if he could download and decode it, they told him, the $50 Amazon Gift Card would be his. He bought a book with the money.
Aside from the time it takes to encode and store, then download and decode, such large files - about 24 hours to decode even 2 megabytes of data - the other remaining limitation for the field is financial, Erlich said.
He estimated that it cost them $7,000 to encode and decode 2 megabytes. "We improved the efficiency by 60 per cent," he said. "But it's still quite expensive to store information on DNA." With time, that too should change.