The research community is excited about the potential for DNA to function as long-term archival storage. That’s largely because DNA is extremely dense, chemically stable for tens of thousands of years, and comes in a format we’re unlikely to forget how to read. Although there has been some interesting progress, efforts have remained primarily in the research community due to high costs and extremely slow read and write speeds. These are problems that must be solved before DNA-based storage can be practical.
So we were surprised to hear that storage giant Seagate had entered into a collaboration with a DNA-based storage company called Catalog. To find out how close the company’s technology is to being useful, we spoke with Catalog CEO Hyunjun Park. Park pointed out that Catalog’s approach is counterintuitive on two levels: it doesn’t store data in the expected way, and it doesn’t focus on file storage at all.
A different kind of storage
DNA is a molecule that can be thought of as a linear array of bases, with each base being one of four different chemicals: A, T, C, or G. Typically, each base in the DNA molecule is used to hold two bits of information, with the bit values conveyed by the specific base that is present. So A can encode 00, T can encode 01, C can encode 10, and G can encode 11; with this encoding, the molecule AA would store 0000, while AC would store 0010, and so on. We can synthesize DNA molecules hundreds of bases long with high efficiency, and we can add flanking sequences that provide the equivalent of file system information, telling us which part of a chunk of binary data an individual piece of DNA represents.
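As a minimal sketch of the conventional two-bits-per-base scheme, the mapping described above can be written in a few lines of Python. The function names `encode_bits` and `decode_bases` are made up for illustration, and the sketch ignores the flanking sequences that carry file system information.

```python
# Mapping taken from the text above: A=00, T=01, C=10, G=11.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_bits(bits: str) -> str:
    """Turn a bit string of even length into a DNA sequence (hypothetical helper)."""
    if len(bits) % 2 != 0:
        raise ValueError("bit string must have an even length")
    # Consume the input two bits at a time; each pair selects one base.
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode_bases(sequence: str) -> str:
    """Turn a DNA sequence back into the bit string it encodes."""
    return "".join(BASE_TO_BITS[base] for base in sequence)


print(encode_bits("0000"))  # AA
print(encode_bits("0010"))  # AC
print(decode_bases("AC"))   # 0010
```

Every additional pair of bits adds one more base that has to be synthesized, which is where the cost discussed next comes from.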
The problem with this approach is that the longer the string of bits you want to store, the more time and money it takes. Robotic hardware performs the synthesis reactions, and each hardware unit can only synthesize a single DNA molecule at a time. The raw materials that the hardware uses to perform that synthesis also add a cost for each molecule stored. While this isn’t a concern for small-scale demo projects, the costs quickly become prohibitive if you start storing large amounts of data. Citing a DNA synthesis cost of about 0.03 cents per base, Park said, “0.03 cents per two bits per base pair per, say, gigabytes, that’s a lot of money. That’s millions of dollars.”
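Park’s back-of-the-envelope figure can be checked with a rough calculation. The sketch below assumes a decimal gigabyte (10^9 bytes), exactly two bits per base, and no overhead for addressing sequences or error correction; the function name is hypothetical and only the per-base price comes from the article.

```python
COST_PER_BASE_USD = 0.03 / 100  # 0.03 cents per base, as quoted by Park
BITS_PER_BASE = 2               # conventional encoding: two bits per base


def synthesis_cost_usd(num_bytes: int) -> float:
    """Estimate the raw synthesis cost of storing num_bytes (hypothetical helper)."""
    bases_needed = num_bytes * 8 / BITS_PER_BASE
    return bases_needed * COST_PER_BASE_USD


one_gigabyte = 10**9  # assumption: decimal gigabyte
print(f"${synthesis_cost_usd(one_gigabyte):,.0f}")  # $1,200,000
```

Under these assumptions, a single gigabyte needs four billion bases and costs about $1.2 million, which is in line with the "millions of dollars" in the quote.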
Park told Ars that Catalog began by rethinking the encoding process to get around this bottleneck. The company’s encoding starts with a library of dozens to hundreds of short pieces of DNA called oligos (short for oligonucleotides). Then each bit in the data is assigned a unique combination of oligos; you can think of this as analogous to a silicon processor assigning each bit in memory a unique 64-bit address. If that bit is a 1, a robot collects small samples of the solutions containing each of the oligos needed to represent it and combines them with an enzyme that can link all the oligos together.
The enzyme fuses the oligos into a single, longer DNA molecule that contains the unique signature of the bit. If, on the other hand, the bit is a 0, the DNA corresponding to its address is not synthesized.
All the molecules that are produced can be combined into a single solution (which can be dried for long-term storage). To read the data, the population of DNA molecules is sequenced and an algorithm recognizes the unique combination of oligos present in each molecule. Recognized addresses are assigned a 1; the rest, a 0. This reconstructs the data that was originally encoded.
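The write and read steps described above can be sketched as a toy model. This is not Catalog’s actual scheme: the article only says that each bit gets a unique combination of oligos, so the layout below (one oligo chosen from each of three positions) is an assumption, and the oligo names are placeholders rather than real sequences.

```python
from itertools import product

# Hypothetical library: 3 positions with 4 candidate oligos each (12 oligos total).
LIBRARY = [
    ["a0", "a1", "a2", "a3"],
    ["b0", "b1", "b2", "b3"],
    ["c0", "c1", "c2", "c3"],
]

# Each combination of one oligo per position acts as the address of one bit.
ADDRESSES = list(product(*LIBRARY))  # 4**3 = 64 distinct addresses


def write(bits: str) -> set:
    """Return the pool of molecules to assemble: one per bit that is 1."""
    if len(bits) > len(ADDRESSES):
        raise ValueError("not enough addresses for this many bits")
    # Bits that are 0 produce no molecule at all.
    return {ADDRESSES[i] for i, bit in enumerate(bits) if bit == "1"}


def read(pool: set, num_bits: int) -> str:
    """Rebuild the bit string: addresses found in the pool are 1, the rest 0."""
    return "".join("1" if ADDRESSES[i] in pool else "0" for i in range(num_bits))


data = "1011001"
pool = write(data)
print(len(pool))              # 4 molecules, one per 1 bit
print(read(pool, len(data)))  # 1011001
```

In this toy model, 12 prefabricated oligos are enough to address 64 bits, and the address space grows multiplicatively as positions or oligos are added. The reader also has to know how many bits were written, since trailing zeros leave no molecules behind.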
This system stores far less data per unit of DNA than putting two bits in each base. But the individual molecules are still small enough to make an impressively compact and stable storage medium. And it saves a lot of time and money because of a fundamental asymmetry: It is much cheaper to synthesize a large amount of one specific DNA sequence than it is to synthesize small amounts of many different DNA sequences. So by assembling each molecule from small samples of oligos that were prefabricated in bulk, the cost of synthesis is dramatically reduced. Each set of reactions can also be run in parallel; by contrast, synthesizing individual sequences ties up the machine they run on until the synthesis is complete.