Technology


The number of terabytes managed per storage administrator is growing. Most of this growth is in unstructured data – email, photos, videos, web pages, and other digital content that is not in databases. The growth of unstructured data has consistently outpaced both the increase in disk sizes and the decrease in individual disk prices. As a consequence, both end users and the storage industry have been looking for new ways to get ahead of storage growth, contain costs, and simplify management.

How Backup Stacks Up
Block deduplication is a relatively new technology that is already routinely used to address data growth. However, its methodology tends to yield the best results for backups and structured data. Block-level dedupe works when there are multiple duplicate versions of the same file because it compares the file’s raw bits – the 0s and 1s. When a document is backed up over and over again, the 0s and 1s stay the same because the file is simply duplicated. Block dedupe can identify the copies as duplicates because their sequences of 0s and 1s are exactly the same.
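The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular product works: the 4 KB fixed block size, the use of SHA-256 as the block fingerprint, and the names `store_file`, `restore_file`, and `make_document` are all hypothetical choices made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical fixed block size, chosen for illustration


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be short)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def store_file(name: str, data: bytes, block_store: dict, recipes: dict) -> None:
    """Store a file as a list of block fingerprints, keeping each unique block once."""
    digests = []
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # only new blocks consume space
        digests.append(digest)
    recipes[name] = digests


def restore_file(name: str, block_store: dict, recipes: dict) -> bytes:
    """Rebuild a file from its list of block fingerprints."""
    return b"".join(block_store[d] for d in recipes[name])


def make_document(num_blocks: int) -> bytes:
    """Deterministic filler content in which every block is different."""
    count = num_blocks * BLOCK_SIZE // 32  # a SHA-256 digest is 32 bytes
    return b"".join(hashlib.sha256(str(i).encode()).digest() for i in range(count))


# The same unchanged document is backed up on three nights.
document = make_document(4)
block_store, recipes = {}, {}
for night in ("monday", "tuesday", "wednesday"):
    store_file(night, document, block_store, recipes)

logical_blocks = sum(len(r) for r in recipes.values())
unique_blocks = len(block_store)
print(f"blocks backed up: {logical_blocks}, blocks actually stored: {unique_blocks}")
```

Because every nightly copy has exactly the same bits, each of its blocks hashes to a fingerprint that is already in the store, so only the first backup consumes space.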
Online data is different. It has few exact duplicates; instead, there are many files that share a lot of similar content. Furthermore, the majority of files contributing to storage growth are already compressed by their applications: images and video (JPEG, MPEG, TIFF, GIF, PNG), compound documents (zip, email, HTML, web pages, PDFs), and Microsoft Office files (PowerPoint, Word, Excel, SharePoint, etc.). Block deduplication isn’t effective on already compressed files because compressing a file changes its 0s and 1s from the original format, so content that two files have in common no longer produces matching blocks.
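The effect of compression on block matching can be illustrated with a small experiment. This is a sketch under assumptions: `zlib` stands in for whatever compression an application applies, the 1 KB block size is arbitrary, and the two "files" are synthetic (an identical 64 KB body preceded by different 1 KB headers). The helper names are hypothetical.

```python
import hashlib
import zlib

BLOCK_SIZE = 1024  # hypothetical fixed block size, chosen for illustration

WORDS = [b"storage", b"backup", b"block", b"file", b"data", b"email",
         b"photo", b"video", b"server", b"disk", b"volume", b"network",
         b"archive", b"content", b"growth", b"format"]


def block_digests(data: bytes) -> list[str]:
    """Fingerprint every fixed-size block of a byte string."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]


def shared_fraction(first: bytes, second: bytes) -> float:
    """Fraction of the second file's blocks that already occur in the first."""
    known = set(block_digests(first))
    digests = block_digests(second)
    return sum(d in known for d in digests) / len(digests)


def pseudo_bytes(seed: bytes, length: int) -> bytes:
    """Deterministic, effectively incompressible filler bytes."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(out[:length])


def pseudo_text(length: int) -> bytes:
    """Deterministic, compressible text built from a small vocabulary."""
    picks = pseudo_bytes(b"body", length)
    text = b" ".join(WORDS[b % len(WORDS)] for b in picks)
    return text[:length]


# Two similar files: different one-block headers, identical 64-block body.
body = pseudo_text(64 * BLOCK_SIZE)
file_a = b"A" * BLOCK_SIZE + body
file_b = pseudo_bytes(b"header", BLOCK_SIZE) + body

raw_fraction = shared_fraction(file_a, file_b)
compressed_fraction = shared_fraction(zlib.compress(file_a), zlib.compress(file_b))

print(f"shared blocks, uncompressed: {raw_fraction:.0%}")
print(f"shared blocks, compressed:   {compressed_fraction:.0%}")
```

Uncompressed, 64 of the second file's 65 blocks are already present in the first. After compression the shared body is encoded differently in each file, because the encoding depends on everything that precedes it, so few if any blocks should still match.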

Being Content-Aware
Almost all digital content in the modern data center is generated by a set of common applications and stored in the well-understood file formats of those applications. Application developers are focused on providing functionality to end users – they think about the way data is consumed, not how it is stored. As a consequence, the way most applications store data is very inefficient.

Storage platforms, whether SAN or NAS, are built to provide generic storage, regardless of the file formats being stored. In a SAN, you store blocks of data written out by an operating system’s file system or database. In NAS, you store files written over a network storage protocol like NFS or CIFS. In both cases, the storage platform has no idea what is inside those blocks or files or how the data in them is accessed or grown.

The opportunity exists to create a solution that operates between applications and storage platforms, bridging the gap between the two and optimizing how the applications’ data is stored. We call this content-aware storage optimization.
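As a rough illustration of what "content-aware" could mean in practice, the sketch below identifies a file's format from its leading bytes and picks a handling strategy accordingly. The signatures are the standard published ones for each format; the strategy names and the `choose_strategy` function are purely hypothetical and do not describe any specific product.

```python
# Well-known leading byte signatures ("magic numbers") for common formats.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"PK\x03\x04", "zip"),  # also the container for modern Office documents
    (b"%PDF-", "pdf"),
]


def sniff_format(data: bytes) -> str:
    """Return a format name based on the file's leading bytes, or 'unknown'."""
    for magic, name in MAGIC_NUMBERS:
        if data.startswith(magic):
            return name
    return "unknown"


def choose_strategy(data: bytes) -> str:
    """Hypothetical dispatch: a format-specific optimizer if one applies."""
    fmt = sniff_format(data)
    if fmt == "unknown":
        return "generic-block-dedupe"
    return f"{fmt}-aware-optimizer"


for sample in (b"%PDF-1.7\n...", b"GIF89a...", b"plain text"):
    print(sniff_format(sample), "->", choose_strategy(sample))
```

A generic SAN or NAS platform never performs this step; it treats all three samples as the same opaque bytes.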