A Blog by Ed Yong

Shakespeare’s Sonnets and MLK’s Speech Stored in DNA Speck

When Nick Goldman first opened the package, he couldn’t quite believe that it contained anything at all, much less all of Shakespeare’s sonnets. The parcel had come from a facility in the US and arrived at the European Bioinformatics Institute in the UK, in March 2012. It contained a series of small plastic vials, at the bottom of which was… apparently nothing. It was Goldman’s colleague Ewan Birney who showed him the tiny dust-like specks that he had missed.

These specks were DNA, and they contained:

  • All of the Bard’s 154 sonnets.
  • A 26-second clip of Martin Luther King’s legendary “I have a dream” speech
  • A PDF of James Watson and Francis Crick’s classic paper where they detailed the structure of DNA
  • A JPEG photo of Goldman and Birney’s institute
  • A code that converted all of that into DNA in the first place

The team sent the vials off to a facility in Germany, where colleagues dissolved the DNA in water, sequenced it, and reconstructed all the files with 100 percent accuracy. It vindicated the team’s efforts to encode digital information into DNA using a new technique, one that could be easily scaled up to global levels. And it showed the potential of the famous double-helix as a way of storing our growing morass of data.

In cold, dark facilities like Svalbard's Global Seed Vault (which is unstaffed), DNA files could last for tens of thousands of years. Credit: Svalbard Global Seed Vault/Mari Tefre

A better format

DNA has several big advantages over traditional storage media like CDs, tapes or hard disks. For a start, it takes up far less space. Goldman’s files came to 757 kilobytes and he could barely see them. For a more dramatic comparison, CERN, Europe’s big particle physics laboratory, currently stores around 90 petabytes of data (a petabyte is a million gigabytes) on around 100 tape drives. Goldman’s method could fit that into 41 grams of DNA. That’s a cupful.
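As a rough check on that comparison, the arithmetic can be sketched in a few lines of Python. The names below are made up for illustration, and the only inputs are the two figures quoted above.

```python
# Figures quoted in the paragraph above.
CERN_ARCHIVE_PB = 90   # petabytes stored by CERN
DNA_MASS_GRAMS = 41    # grams of DNA said to hold the same archive


def petabytes_per_gram(petabytes: float, grams: float) -> float:
    """Storage density implied by a data volume and a mass of DNA."""
    return petabytes / grams


density = petabytes_per_gram(CERN_ARCHIVE_PB, DNA_MASS_GRAMS)
print(f"Implied density: about {density:.1f} petabytes per gram")
```

In other words, the two quoted numbers imply a little over two petabytes in every gram.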

DNA is also incredibly durable. As long as it is kept in cold, dry and dark conditions, it can last for tens of thousands of years with minimal care. “The experiment was done 60,000 years ago when a mammoth died and lay there in the ice,” says Goldman. Readable DNA fragments have been recovered from such mammoths, as well as a slew of other prehistoric creatures. “And those weren’t even carefully prepared samples. If you did that under controlled circumstances, you should be good for more than 60,000 years.”

(For those of you wondering if the information would mutate, it can’t. It’s not inside a living thing, and not being copied. It’s just the isolated non-living molecule.)

And using DNA would finally divorce the thing that stores information from the things that read it. Time and again, our storage formats become obsolete because we stop making the machines that read them—think about video tapes, cassettes, or floppy disks. That’s a faff—it means that archivists have to constantly replace all their equipment, and laboriously rewrite their documents in the new format du jour, all at great expense. But we will always want to read DNA. It’s the molecule of life. Biologists will always study it. The sequencers may change, but as Goldman says, “You can stick it in a cave in Norway, leave it there in a thousand years, and we’ll still be able to read that.”

Credit: Goldman et al., Nature

The code

DNA has a proven track record for storing information. It already stores all the instructions necessary to build one of you, or a giraffe, or an oak tree, or a beetle (oh so many beetles). To exploit it, all you need to do is to convert the binary 1s and 0s that we currently use into the As, Gs, Cs and Ts of DNA.

A Harvard scientist called George Church did exactly that last year. He used a simple cipher, where A and C represented 0, and G and T represented 1. In this way, he encoded his new book, some image files, and a JavaScript programme, amounting to 5.2 million bits of information.
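A one-bit-per-letter cipher like this is easy to sketch. The Python below is only an illustration of the idea, not Church's actual software. In particular, the rule for choosing between the two letters available for each bit (use the second one whenever the first would repeat the previous letter) is an assumption made for this example.

```python
# Each bit has two candidate letters: A or C for 0, G or T for 1.
BIT_TO_BASES = {"0": "AC", "1": "GT"}
BASE_TO_BIT = {"A": "0", "C": "0", "G": "1", "T": "1"}


def text_to_bits(text: str) -> str:
    """Spell out the UTF-8 bytes of text as a string of 0s and 1s."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def encode_bits(bits: str) -> str:
    """Turn a string of 0s and 1s into DNA letters.

    Assumed tie-break: if the first candidate letter would repeat the
    previous letter, use the second candidate instead.
    """
    dna = []
    for bit in bits:
        first, second = BIT_TO_BASES[bit]
        dna.append(second if dna and dna[-1] == first else first)
    return "".join(dna)


def decode_dna(dna: str) -> str:
    """Recover the bit string; either letter for a bit decodes the same way."""
    return "".join(BASE_TO_BIT[base] for base in dna)


message_bits = text_to_bits("Hi")
strand = encode_bits(message_bits)
print(strand)
print(decode_dna(strand) == message_bits)
```

Because two letters stand for each bit, the encoder has some freedom over which one to write, and decoding never needs to know which choice was made.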

Goldman and Birney have encoded the same amount, but with a more complex scheme. In their system, every byte—a string of 8 ones or zeroes—is converted into five DNA letters. These strings are designed so that there are never any adjacent repeats. This makes it easier for sequencing machines to read and explains why they had a far lower error rate (that is, none) compared to Church’s method.
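One simple way to guarantee that no letter ever sits next to a copy of itself is to write the data in base 3 and let each digit pick one of the three letters that differ from the previous one. The Python sketch below illustrates that idea under stated assumptions; it is not the team's published code. It uses six letters per byte rather than the five described above, because a fixed-width base-3 number needs six digits to cover all 256 byte values (five digits only reach 243). The starting letter "A" and all function names are likewise assumptions.

```python
BASES = "ACGT"
TRITS_PER_BYTE = 6  # assumption: fixed width, since 3**5 = 243 < 256


def byte_to_trits(value: int) -> list[int]:
    """Write one byte (0-255) as six base-3 digits, most significant first."""
    trits = []
    for _ in range(TRITS_PER_BYTE):
        value, digit = divmod(value, 3)
        trits.append(digit)
    return trits[::-1]


def encode(data: bytes, previous: str = "A") -> str:
    """Each base-3 digit selects one of the three letters unlike the last."""
    dna = []
    for byte in data:
        for trit in byte_to_trits(byte):
            choices = [b for b in BASES if b != previous]
            previous = choices[trit]
            dna.append(previous)
    return "".join(dna)


def decode(dna: str, previous: str = "A") -> bytes:
    """Reverse the mapping: recover digits, then regroup them into bytes."""
    trits = []
    for base in dna:
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    out = bytearray()
    for i in range(0, len(trits), TRITS_PER_BYTE):
        value = 0
        for trit in trits[i:i + TRITS_PER_BYTE]:
            value = value * 3 + trit
        out.append(value)
    return bytes(out)


strand = encode(b"To be")
print(strand)
print(decode(strand) == b"To be")
```

Since every letter is chosen from the three that differ from its predecessor, adjacent repeats are impossible by construction.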

Using their cipher, they converted every stream of data into a set of DNA strings. Each one is exactly 117 letters long and contains indexing information to show where it belongs in the overall code. The strings also overlap, so that every bit is covered by four separate strings. Again, this reduces error. Any mistake would have to happen on four separate strings, which is very unlikely.
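The overlap trick can be sketched as a sliding window plus a majority vote. In the Python below, the 100-letter payload and the 25-letter step are assumed values, chosen so that most positions fall inside four windows. The index is kept as a plain number rather than being written into the strand, and letters near the two ends are covered by fewer windows in this simplified version. None of the names come from the team's software.

```python
from collections import Counter

PAYLOAD_LENGTH = 100  # assumed number of data letters per fragment
STEP = 25             # assumed offset between fragments (a quarter of the payload)


def split_into_fragments(dna):
    """Cut dna into overlapping windows, returning (index, payload) pairs."""
    if len(dna) < PAYLOAD_LENGTH or len(dna) % STEP != 0:
        raise ValueError("length must be a multiple of STEP and at least PAYLOAD_LENGTH")
    last_start = len(dna) - PAYLOAD_LENGTH
    return [(start // STEP, dna[start:start + PAYLOAD_LENGTH])
            for start in range(0, last_start + 1, STEP)]


def reassemble(fragments, total_length):
    """Rebuild the strand by taking the most common letter at each position."""
    votes = [Counter() for _ in range(total_length)]
    for index, payload in fragments:
        start = index * STEP
        for offset, base in enumerate(payload):
            votes[start + offset][base] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)


strand = "ACGT" * 50  # 200 letters of placeholder data
fragments = split_into_fragments(strand)

# Simulate a single misread letter at the start of one fragment.
index, payload = fragments[3]
wrong = "A" if payload[0] != "A" else "C"
damaged = list(fragments)
damaged[3] = (index, wrong + payload[1:])

recovered = reassemble(damaged, len(strand))
print(recovered == strand)
```

The misread letter is outvoted by the three other fragments that cover the same position, which is why an error would have to strike several strings at once to slip through.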

Accuracy aside, Goldman’s coding system has another advantage, a more fanciful one: it should be apocalypse-proof. Imagine that there’s a calamity that wrecks human civilisation, creating a huge discontinuity in our technology. The survivors rebuild and eventually relearn what DNA is and how to decode it. Maybe they find some of these stores, locked away in a vault. “They’d quickly notice that this isn’t DNA like anything they’ve seen,” says Goldman. “There are no repeats. Everything’s the same length. It’s obviously not something from a bacterium or a human. Maybe it’s worth investigating. Of course you’d need to send some sort of Rosetta stone to tell people how to decode the message…”

"Well, isn't it lucky we stored our cat photos as DNA before all this happened?" (Scene from The Road, 2929 Productions)

Scaling up

Goldman calculated that this method could be feasibly scaled up to cover all of the world’s data (which currently stands at around 3 zettabytes—3 million million gigabytes). For now, the big problems are cost and speed. It’s still expensive to read DNA, and really expensive to write it. The team estimate that you would pay $12,400 to encode every megabyte of data, and $220 to read it back, based on current costs. But those costs are falling exponentially, far faster than those of other electronics.
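To get a feel for those prices, here is a small Python sketch that applies the two per-megabyte estimates to the 757 kilobytes the team actually encoded. It assumes a megabyte is 1,000 kilobytes, and the function name is invented for this example.

```python
# Per-megabyte estimates quoted above, in US dollars.
WRITE_COST_PER_MB = 12_400
READ_COST_PER_MB = 220


def dna_storage_cost(megabytes: float) -> tuple[float, float]:
    """Return (cost to write, cost to read back) for a given amount of data."""
    return megabytes * WRITE_COST_PER_MB, megabytes * READ_COST_PER_MB


# The team's files: 757 kilobytes, taken here as 0.757 megabytes.
write_cost, read_cost = dna_storage_cost(0.757)
print(f"Write: ${write_cost:,.0f}  Read: ${read_cost:,.0f}")
```

Writing dominates: at these rates it costs more than fifty times as much to encode the data as to read it back.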

If you use DNA, you face a steep one-time cost of writing the data. If you use other technologies, you face the recurring costs of having to re-write the data into whatever new format has arrived. It’s the ratio between these two prices that drives the economics of DNA storage.

At the moment, DNA only becomes cost-effective if you want to store things for 600 to 5,000 years. That’s the threshold where the cost of all the constant re-writing outweighs the one-time cost of encoding. But if the price of writing DNA falls by 100 times in the next decade, as it assuredly will, then DNA becomes a cost-effective option for storing anything beyond 50 years. “Maybe you’d store your wedding videos,” says Goldman.
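The break-even logic can be sketched with a toy model in Python. It is a deliberately crude illustration, not the team's calculation: it assumes a fixed cost for each migration to a new format, ignores read costs and falling tape prices, and every number in it is a hypothetical placeholder.

```python
import math


def break_even_years(dna_write_cost: float,
                     migration_cost: float,
                     years_between_migrations: float) -> float:
    """Years until repeated migrations have cost as much as one DNA write.

    Toy model: one fixed-price migration every `years_between_migrations`.
    """
    migrations_needed = math.ceil(dna_write_cost / migration_cost)
    return migrations_needed * years_between_migrations


# Hypothetical placeholder prices for the same archive, in arbitrary units.
today = break_even_years(dna_write_cost=10_000, migration_cost=10,
                         years_between_migrations=5)
cheaper = break_even_years(dna_write_cost=100, migration_cost=10,
                           years_between_migrations=5)
print(today, cheaper)
```

In this model the break-even point scales directly with the price of writing, so a hundredfold drop in that price shortens the wait a hundredfold.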

DNA technology is also getting faster, but for now, it only makes sense to use it for data that you want to keep for a very long time but aren’t going to access very often.

CERN’s a good example. By 2015, the Large Hadron Collider will be collecting around 50 to 60 petabytes every year—that’s a lot of tape! They also have to migrate their entire archives to new media every four to five years, to save space and avoid the cost of maintaining old equipment. And although people rarely use old data, it has to be kept for at least 20 years, and probably even longer. DNA could be a perfect means of storing these archives (although CERN’s senior computer scientist German Cancio tells me that it will still have to be read and verified every 2 years).

Reference: Goldman, Bertone, Chen, Dessimoz, LeProust, Sipos & Birney. 2013. Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature http://dx.doi.org/10.1038/nature11875


12 thoughts on “Shakespeare’s Sonnets and MLK’s Speech Stored in DNA Speck”

  1. “The team estimate that you would pay $12,400 to encode every megabyte of data, and $220 to read it back, based on current costs.”

    “At the moment, DNA only becomes cost-effective if you want to store things for 600 to 5000 years”

    One of these two statements is wrong. I don’t know which one. But . . . let’s take 4GB as an example.

    That’s 50 cents on a writeable DVD, or $49.6 million in DNA (4,000 megabytes at $12,400 each).

    If DVDs are replaced by a new tech every YEAR, and you spend $10 to upgrade to the new hardware, and the cost of the new tech never ever ever goes down . . .

    After 5,000 years, you will have spent $52,500 to keep your data upgraded. In order to match that initial cost of $49.6 million, you would need to upgrade every year for around 4.7 million years.

    Note that this ignores DNA hardware upgrade costs, the cost of reads, and the fact that DNA is a read-once medium, making it an almost criminal method to store data.

  2. I do not think that money matters at all. It seems to me like trying to compare the costs of communication in Marconi’s and Tesla’s time with today’s TV or fibre optics. What matters is that the DNA may be encoded. Remember, do not worry about money, worry about success; money will take care of itself.

  3. – Thomas Weigel, the problem with your figures is your assumption of the cost of new tech. A little over 10 years ago, the cost of a (then) new DVD recordable drive was $500. A recordable blank DVD was $5. So migrating from the CD’s I’d put something on not quite 10 years earlier – also at a cost about the same to acquire at the time – would have been around $505 for 4 GB. That assumes that the format for the CD’s didn’t change (as they did), that they were still readable, and that I could read the DVD’s when their time was up to write them to something else. That’s also assuming that the files are in a format that’s still readable. I have old procedural word processing files which, if I hadn’t translated formats, would be “unreadable” now, in less than 20 years. What the DNA technology seems to be most useful for is long-term archival copies, rather than “read all the time” materials.

  4. The initial cost of writing to DNA is probably more expensive than other established technologies for the foreseeable future but maybe the process of making copies with PCR is more efficient and cheaper?

  5. Doug Jones: “the problem with your figures is your assumption of the cost of new tech.”

    Longterm data storage today uses tape. Tape is proven and cheap. It is not “new tech.” It’s several decades old!

    One does not use “new tech” for anything one wants to keep around for 5,000 years.

    DVDs are just barely tolerable today. They’re still a bit too new – as late as 2008, studies indicated that they had worse longevity and data integrity than CDs, which were themselves never a good fit.

    I picked DVDs because they are (a) more expensive, (b) barely tolerable, and (c) not quite as innately superior in every possible way to DNA as a storage medium. Tape is cheaper, faster, better for reads, easier to upgrade from, and proven by decades of use.

    I was making a point: the worst barely tolerable choice I could think of was still better than DNA.

    Using DVDs a decade ago would have been data murder. But even with that $505 figure? DVDs are cheaper than DNA storage over a 5,000 year span. Even if you spend $505 per year, instead of per decade.

    Doug Jones: “What the DNA technology seems to be most useful for is long-term archival copies, rather than “read all the time” materials.”

    You make this statement, but I cannot understand why. It isn’t cost-effective or trustworthy at any human timescale. And it’s read-once, which is practically data murder.

  6. @Doug Jones, DNA storage does nothing to combat the data format obsoleteness problem you had with your word processor files. Indeed, the sample mentioned in the article encoded an MP3 file and that format could easily become obsolete over the next few centuries (if not the next few decades).

    @Thomas Weigel, your characterization of DNA as read-once is way off the mark. To begin with, the initial “write” of DNA created about 12 million copies. After that, one of the key evolutionary features of DNA is that it is super-easy to copy. Copying the existing DNA (referred to as “amplifying”) is generally the first step in reading (“sequencing”) it, so in practice every time you read it you end up with more copies than you started with.

    As for the economic arguments, I agree they are nearly worthless. The future cost of data storage and retrieval as well as the future cost of DNA synthesis and sequencing are only going to go down, but how fast and by how much is impossible to know and it seems unreasonable to assume the ratio between the cost of writing DNA and the cost of transferring tape archives is going to be even close to constant over the next few decades.

    The main technology-proofing and cost saving argument comes from the fact that if we store DNA in a cave we can be pretty sure that we will be able to read it in 10,000 years, whereas to be sure of the future readability of man-made digital storage formats we have to regularly convert them as technology changes.

  7. @Thomas Weigel – further to @Jeremy ‘s point about copying and reading DNA, you make the assumption that DNA sequencing will always be a destructive process. Technological improvements may infinitely improve the way DNA is read and, indeed, written.

  8. @Jeremy Excellent point with regards to the Cartesian circle of obsoleteness. However I am sure it is possible to also encode on the DNA how to read these formats or how they work, to enable someone to read them in the future.

    My knowledge of computers is not the best, but are the JPEG and PDF not converted into binary first, before this process is started? So they are universally readable, never going obsolete?

    1. Converting it to binary doesn’t help the problem of the format itself going obsolete. A binary file is nothing more than a series of 1’s and 0’s–without knowing the encoding (not dissimilar from a cypher) there’s no way of knowing how to read the data. Typically a file has the first 4 bytes reserved for a “magic number” which tells the computer what type of file it is (.txt, .jpg, .wav, etc).

      Whether or not a .jpg file would be readable in 100 years depends on whether software engineers continue to see the value in telling a computer how to decode a file that starts with the magic number for .jpg. Personally I think it’s pretty likely they’ll still be able to read those files.

  9. @Thomas, DNA is not a “read-once medium”. They may have used a read-once technique for this experiment for all I know, but that says nothing about DNA itself.
