Written by: Aonghus Topham Edited by: Ryan Khan & Daniele Guido

THE DATA STORAGE PROBLEM - WHERE ARE WE NOW?

We all produce data every day, whether by downloading music, snapping photos, or writing documents, but rarely do we stop and think about where this information is physically stored. The digital universe encompasses all the world’s digital data, and much like the physical universe it is expanding at an exponential rate.

It’s possible to imagine the physical universe constantly expanding into the void of outer space. The same is true of the digital universe; however, this constant expansion is occurring in a room with a locked door, limiting our capacity for long-term storage. Without advances in data storage techniques this door will remain locked and we will be forced to delete current data to make room for new information.

Modern storage solutions cannot keep up with this rate of growth, meaning that data stored today may not be around in 50 or 100 years’ time. In 2011, the capacity of the digital universe surpassed 1.8 zettabytes (1 zettabyte = 1 trillion gigabytes), spread across around 500 quadrillion files [1]. This is predicted to exceed 4.8 zettabytes in 2022 [2], highlighting our desperate need for new, innovative methods of storing digital information.

The history of data storage dates back to the early 1800s, when devices known as punch cards were used to programme automated musical instruments; each card had a capacity of around 80 characters - not even enough for a short tweet these days [3]. The next 200 years saw significant advancements in data storage technology, with the overall goal of creating smaller devices with larger capacities. Electronic data storage began with the invention of the Williams-Kilburn tube in the 1940s, a cathode-ray tube roughly 41 cm long and 15 cm wide that stored bits as a grid of charged dots on its face. Around 72 of these tubes would be needed to store a single JPEG image [3]. Magnetic tape, first developed in the early 1900s, became an important medium for computer data storage; a roll of tape is passed through tape heads that read and write data into the material. Magnetic-core memory, invented in the early 1950s, was the first form of memory widely used in computing, with a capacity of 2 KB (equivalent to a small JPEG) [3]; it remained the standard from its invention until the 1970s. Finally, the solid-state drive (SSD), invented in the late 1970s [3], is still widely used as a data storage device, measuring roughly 6 cm and holding up to 16 TB (16,000 GB).

Although impressive, these advancements are insufficient to account for the rising tide of digital data now being created on a daily basis. For us to continue our data downloading frenzies, new avenues of storage will be needed in the future, and as it turns out the answer may have been with us all along…

USING NATURE’S DATA STORAGE SYSTEM - DNA

The idea of using synthetic DNA as a digital data storage medium has been theorised for over 20 years. Unlike traditional data storage mechanisms, a massive advantage of using DNA is its longevity. With a minimum half-life of 500 years in harsh conditions [4] and an established longevity of thousands of years in the right conditions, DNA could be superior to rotating discs and magnetic tape, which remain stable for 3–5 years and 10–30 years respectively [4]. Data density is another crucial factor in which DNA has the upper hand over previously used data storage materials. Recently Fujifilm™ and IBM Research™ set a new record for the data density of magnetic tape at 317 GB per square inch [5] (approximately 0.5 GB/mm²), compared with the potential data density of DNA of 10⁹ GB/mm³ [4]. To put this into perspective, it is estimated that the chromosomal DNA contained in each human cell holds roughly 1.6 GB [6], suggesting that the theoretical information density of DNA vastly outweighs previous data storage materials.
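The tape figure quoted above can be converted between units with a quick calculation (the conversion factor is the only assumption; note that tape density is areal while the DNA figure is volumetric, so the comparison is necessarily rough):

```python
# Convert the record tape density of 317 GB per square inch into GB per mm^2.
MM2_PER_SQUARE_INCH = 25.4 ** 2            # 645.16 mm^2 in one square inch
tape_density = 317 / MM2_PER_SQUARE_INCH   # areal density in GB/mm^2
print(f"{tape_density:.2f} GB/mm^2")       # about 0.49 GB/mm^2
```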

Figure 1: DNA nucleotide base structures. (Taken from atdbio.com [11])

Advancements within the fields of genomics and biotechnology over recent decades have enabled us to delve into the genome and manipulate it with a relatively high degree of accuracy. DNA is made up of four base units known as nucleotides: adenine, cytosine, guanine and thymine (Figure 1). Millions of these nucleotides are joined together in various sequences to form DNA molecules that are tightly packed into every living cell. DNA acts as a set of instructions for our cells, facilitating the functioning and growth of our body systems.

So, how does DNA data storage work?

Imagine a small picture (JPEG) file. Firstly, the JPEG is translated into a binary string, known as the data object. The write process then encodes this binary string into a nucleotide sequence, and the corresponding DNA molecules are synthesized and stored under controlled conditions. To recover the image, the DNA molecules are sequenced, allowing the nucleotide sequence to be read by a computer and decoded back into the original binary string, which is then translated back into the JPEG image.
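The write and read steps above can be sketched in a few lines of code. This is a toy illustration, not any real system's scheme: it simply maps every two bits of the data object onto one of the four nucleotides.

```python
# Toy sketch of the write/read pipeline: 2 bits of the data object per base.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}

def encode(binary: str) -> str:
    """'Write': translate a binary string into a nucleotide sequence."""
    assert len(binary) % 2 == 0, "pad the data object to an even bit length"
    return "".join(BITS_TO_BASE[binary[i:i + 2]] for i in range(0, len(binary), 2))

def decode(sequence: str) -> str:
    """'Read': translate a sequenced nucleotide string back into binary."""
    return "".join(BASE_TO_BITS[base] for base in sequence)

data_object = "0100100001101001"   # the bits of the ASCII text "Hi"
strand = encode(data_object)       # -> "CAGACGGC"
assert decode(strand) == data_object
```

Real encoding schemes are more elaborate (they avoid long runs of the same base, for instance), but the round trip is the same in principle.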

Figure 2: Diagram of the DNA data storage process. (Taken from Yoo et al., 2021 [12])

CURRENT CHALLENGES WITH USING DNA AS A DATA STORAGE MEDIUM

The theoretical advantages of DNA-based data storage outlined above certainly make it seem an attractive option, given the growing problem of data storage. However, the practical challenges of manipulating DNA are also pertinent and may limit the viability of this storage medium.

One challenge with DNA-based data storage is the trade-off between storage density and fidelity: the drawback of storing masses of different datasets within one small test tube is the reduced accuracy and accessibility of any specific data object. Whereas it is easy to isolate one binary string in traditional storage devices, extracting one particular nucleotide sequence from a pool of DNA molecules is far from easy. Large block access requires the entire DNA pool to be sequenced in order to isolate a single data object, leading to an extremely long read latency and making it an inefficient way of extracting data. Other techniques have been suggested, such as attaching specific primers (short nucleotide sequences) to the individual nucleotide sequences belonging to a given data object. The polymerase chain reaction (PCR) is then used to amplify, and subsequently sequence, only the desired dataset, expediting overall data extraction [4].
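The primer-based random access idea can be illustrated with a hedged sketch. The primer sequences and file names below are invented for illustration, and the "PCR" step is simulated as a simple prefix filter over the pool:

```python
# Sketch of primer-based random access: each data object's strands carry a
# short primer tag, so only matching strands need to be read back, rather
# than sequencing the entire pool. (Primers and file names are made up.)
PRIMERS = {"cat.jpg": "ACGTAC", "song.mp3": "TGCATG"}

pool = [
    "ACGTAC" + "GGATCC",   # strand belonging to cat.jpg
    "TGCATG" + "TTAACC",   # strand belonging to song.mp3
    "ACGTAC" + "CCGGAA",   # another cat.jpg strand
]

def retrieve(filename: str) -> list[str]:
    """Mimic PCR selection: keep only strands carrying the file's primer."""
    primer = PRIMERS[filename]
    return [s[len(primer):] for s in pool if s.startswith(primer)]

print(retrieve("cat.jpg"))   # ['GGATCC', 'CCGGAA']
```

In the laboratory, of course, the selection is chemical amplification rather than string matching, but the addressing principle is the same.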

A second issue with DNA-based data storage is errors within the primary nucleotide sequence. It is well established in the literature that there is generally a 1% error rate per new nucleotide added to the primary sequence. This becomes an issue when aiming to synthesize and store a specific short DNA sequence with a high level of fidelity. Thus, an important aspect of DNA data storage is accounting for errors within the encoding process, mainly by incorporating controllable redundancy to ‘absorb’ errors made during synthesis.
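One simple form of controllable redundancy is a repetition code: synthesize several copies of each strand and take a per-position majority vote when reading back, so that isolated synthesis errors are absorbed. Real schemes use proper error-correcting codes; this is only a minimal sketch of the idea.

```python
from collections import Counter

def majority_consensus(reads: list[str]) -> str:
    """Per-position majority vote across redundant reads of the same strand."""
    return "".join(
        Counter(bases).most_common(1)[0][0]
        for bases in zip(*reads)
    )

intended = "ACGTACGT"
noisy_reads = ["ACGTACGT", "ACCTACGT", "ACGTATGT"]  # two reads each carry one error
assert majority_consensus(noisy_reads) == intended
```

Repetition is wasteful of the very density DNA is prized for, which is why practical proposals favour more efficient codes (e.g. Reed-Solomon-style schemes, as used in [8]).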

Thirdly, whilst the potential longevity of DNA under ideal conditions exceeds 100,000 years [7], DNA is known to degrade in the presence of water, UV light, enzymes, microorganisms, oxygen, and other pollutants. Therefore, storage solutions for the synthesized DNA must be carefully considered. Two main strategies of DNA preservation exist: chemical encapsulation, in which DNA molecules are embedded into a matrix, and physical encapsulation, in which DNA is dry-stored in a hermetic container under an inert gas [7].

WHERE ARE WE AT WITH DNA DATA STORAGE?

Over the last two decades there have been various success stories in storing and recovering relatively small datasets from DNA. In 1999, researchers were able to store and extract a 23-character message [4], followed in 2015 by researchers successfully recovering an 83 KB message from a collection of 5,000 DNA strands [8].

The DNA Data Storage Alliance™ consists of 25 member companies led by industry giant Twist Bioscience™ and is spearheading the creation of “an interoperable storage ecosystem based on DNA”. One exciting member of this collaboration is CATALOG™, a Boston-based start-up aiming to become the first company to create a commercially viable DNA storage system. CATALOG’s method starts with a library of pre-synthesized pools of DNA, whereby DNA polymers from each pool are fused together to create sequences, called identifiers, that correspond to various bits of binary code.

Previous attempts to encode data objects within DNA have mapped the binary code directly onto the four base nucleotides (A, T, C and G), for example by assigning each two-bit pair (00, 01, 10, 11) to one base. CATALOG has approached this encoding paradigm differently, using alternative combinations of pools of pre-made DNA to build ‘identifiers’ which correspond to different binary strings. Through this method, CATALOG was able in 2019 to store the entire 16-gigabyte English-language text of Wikipedia within DNA [9].
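The identifier idea can be sketched in miniature. Everything below is an invented, simplified illustration of the published descriptions, not CATALOG's actual scheme: a small library of pre-made components is combined, one per "layer", and each distinct combination (an identifier) stands for one bit position in the data; including that identifier in the pool writes a 1, omitting it writes a 0.

```python
from itertools import product

# Invented component sequences: one component chosen per layer.
LAYERS = [["AAT", "CCG"], ["GTA", "TGC"], ["ACT", "GGA"]]  # 2*2*2 = 8 identifiers

# Every combination of components yields one identifier (one addressable bit).
identifiers = ["".join(parts) for parts in product(*LAYERS)]

def write(bits: str) -> set[str]:
    """Pool contains an identifier only where the corresponding bit is 1."""
    return {identifiers[i] for i, b in enumerate(bits) if b == "1"}

def read(pool: set[str]) -> str:
    """Recover the bits by testing each identifier for presence in the pool."""
    return "".join("1" if ident in pool else "0" for ident in identifiers)

data = "10110010"
assert read(write(data)) == data
```

The appeal of this combinatorial approach is that only a small library of molecules ever needs to be synthesized from scratch; writing data becomes a matter of mixing, which is far faster than bespoke synthesis.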

IS DNA DATA STORAGE COMMERCIALLY VIABLE?

If DNA data storage is to become a reality, as well as refining the science behind manipulating DNA, it must be commercially viable in comparison to other forms of data storage. Currently, the prices of HDD storage ($30/terabyte), magnetic tape storage ($7/terabyte) and SSD storage ($150/terabyte) all vastly undercut the price of DNA data storage, which is estimated to cost around $1.3 million/terabyte [10]. This expense is mainly due to the cost of DNA synthesis and sequencing, which require complex technology and equipment. However, it is widely accepted that the fall in the cost of sequencing DNA has outpaced Moore’s Law (which predicts that the speed capability of computers doubles every 2 years alongside a decrease in price). In 2021, the DNA Data Storage Alliance published a white paper laying out the ambitious goal of reducing the cost of DNA synthesis to $1/gigabyte by 2024 and $1/terabyte by 2030 [7]. Whilst this may sound far-fetched right now, it is worth keeping in mind that sequencing a whole human genome was estimated to cost around $100M in 2001, a figure that had fallen to around $1,000 by 2020 [7]. With this in mind, the goal of cutting synthesis costs to $1/terabyte by 2030 certainly seems achievable.
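The size of the gap the Alliance's roadmap has to close follows directly from the figures quoted above:

```python
# Cost reduction needed to reach the 2030 roadmap target, from the figures above.
dna_cost_per_tb = 1_300_000   # dollars per terabyte today [10]
goal_2030_per_tb = 1          # dollars per terabyte by 2030 [7]
reduction_factor = dna_cost_per_tb / goal_2030_per_tb
print(f"{reduction_factor:,.0f}x cost reduction needed")   # 1,300,000x
```

A roughly million-fold reduction over a decade sounds extreme, yet it is comparable in scale to the ~100,000-fold drop in genome-sequencing costs between 2001 and 2020 cited above.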

WHAT IS THE FUTURE OF DNA DATA STORAGE?

Since human beings began recording data, the race to invent smaller devices with more capacity has prompted significant technological advancements. However, each new data storage system seems to be rendered obsolete by the next. DNA has been refined as a data storage medium over millions of years of evolution, meaning that unlike modern storage technologies it has ‘eternal relevance’. Data and data storage are arguably the lifeblood of today's society, with enterprises responsible for around 80% of the data within the expanding digital universe [1]. As a result, the data storage problem is one of the biggest challenges facing humanity today. It is clear that DNA has the potential to solve this problem, but in practice will it be enough to make a difference? Advancements in DNA synthesis and sequencing will help, but making large-scale DNA data storage a reality will require increased accuracy and efficiency in DNA manipulation techniques.

REFERENCE LIST

[1] Historyofinformation.com. 2011. IDC Computes the Size of the Expanding Digital Universe: Surpassing 1.8 Zetabytes : History of Information. [online] Available at: <https://www.historyofinformation.com/detail.php?id=2957> [Accessed 14 March 2022].

[2] Twiki.cern.ch. 2019. Cisco Visual Networking Index: Forecast and Trends, 2017–2022. [online] Available at: <https://twiki.cern.ch/twiki/pub/HEPIX/TechwatchNetwork/HtwNetworkDocuments/white-paper-c11-741490.pdf> [Accessed 16 May 2022].

[3] The Gateway. 2018. Evolution of Data Storage Timeline - The Gateway. [online] Available at: <https://www.frontierinternet.com/gateway/data-storage-timeline/> [Accessed 14 March 2022]

[4] Bornholt, J., Lopez, R., Carmean, D., Ceze, L., Seelig, G. and Strauss, K., 2017. A DNA-Based Archival Storage System. IEEE Micro, pp.1-1.

[5] SearchDataBackup. 2021. Potential magnetic tape storage capacity surges in 'renaissance'. [online] Available at: <https://www.techtarget.com/searchdatabackup/news/252495598/Potential-magnetic-tape-storage-capacity-surges-in-renaissance#:~:text=Fujifilm%20and%20IBM%20Research%20set%20a%20record%20for,tests%20using%20strontium%20ferrite%2C%20a%20new%20magnetic%20particle.> [Accessed 19 March 2022].

[6] Twistbioscience.com. n.d. [online] Available at: <https://www.twistbioscience.com/sites/default/files/resources/2019-03/WhitePaper_DataStorage_29Oct18_Rev1.pdf> [Accessed 22 March 2022].

[7] Dnastoragealliance.org. 2021. [online] Available at: <https://dnastoragealliance.org/dev/wp-content/uploads/2021/06/DNA-Data-Storage-Alliance-An-Introduction-to-DNA-Data-Storage.pdf> [Accessed 21 March 2022].

[8] Grass, R., Heckel, R., Puddu, M., Paunescu, D. and Stark, W., 2015. Robust Chemical Preservation of Digital Information on DNA in Silica with Error-Correcting Codes. Angewandte Chemie International Edition, 54(8), pp.2552-2555.

[9] Shankland, S., 2019. Startup Catalog has jammed all 16GB of Wikipedia's text onto DNA strands. [online] CNET. Available at: <https://www.cnet.com/tech/computing/startup-packs-all-16gb-wikipedia-onto-dna-strands-demonstrate-new-storage-tech/> [Accessed 23 March 2022].

[10] Coughlin, T., 2021. DNA Storage Update. [online] Forbes. Available at: <https://www.forbes.com/sites/tomcoughlin/2021/10/28/dna-storage-update/?sh=82f59e02fb74> [Accessed 21 March 2022].

[11] Atdbio.com. n.d. ATDBio - Nucleic acid structure. [online] Available at: <https://atdbio.com/nucleic-acids-book/Nucleic-acid-structure> [Accessed 20 April 2022].

[12] Yoo, E., Choe, D., Shin, J., Cho, S. and Cho, B., 2021. Mini review: Enzyme-based DNA synthesis and selective retrieval for data storage. Computational and Structural Biotechnology Journal, 19, pp.2468-2476.

