How to Use Azure Archive Blob Storage for Long-term Data Retention

Long-term retention is a hard problem to solve for several reasons. First, it is a real challenge to find a cost-effective storage medium that is convenient to maintain. Second, it is not easy to keep track of large amounts of data over long time periods. Lastly, recalling data from long-term retention can turn out to be a costly and arduous exercise.

However, this is all changing with the latest advancements in cloud computing.

Meet Microsoft Azure Archive Blob Storage

Microsoft Azure has a new storage tier explicitly designed for long-term retention: Azure Archive Blob Storage.

Before we explore how you can use the Archive tier in Azure, it is essential to understand that cold storage in the cloud is not an entirely new concept. Other cloud vendors have announced cold storage solutions in recent years. However, this is one of those cases where being early to market is not an advantage.

We think Microsoft has leapfrogged competitive offerings with the Archive tier for two main reasons:

Granularity of control – Microsoft’s API design for archiving in the cloud supports comprehensive data management. The ability to query, access, and manage data is becoming more critical than ever before with compliance requirements such as the EU’s upcoming General Data Protection Regulation (GDPR). Unlike other cloud providers’ approaches to deep archiving that force an arbitrary containerization of your data, with Microsoft, you can manage and access individual objects within the Archive tier.
Superior storage economics – Microsoft’s storage prices for the Archive tier are the most attractive on the market today: better than Google Coldline and even better than Amazon Glacier.

Cold Archiving in the Cloud: How It Works

With the introduction of blob-level tiering control, your Azure Blob storage accounts now support a mix of tiers wherein individual items can be either Hot, Cool, or Archive.

Furthermore, you can change the tier at any time. Thus, moving content into and out of a cold storage state is conveniently done in place with no management overhead.
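
As a rough illustration of blob-level tiering, the sketch below moves a single blob between tiers. It assumes a client object exposing set_standard_blob_tier, the method that the azure-storage-blob Python SDK provides on BlobClient. The TierableBlob protocol, the change_tier helper, and the FakeBlob stand-in are hypothetical names used here so the example runs without an Azure account.

```python
from typing import Protocol

VALID_TIERS = ("Hot", "Cool", "Archive")


class TierableBlob(Protocol):
    """Anything with the tiering call found on azure-storage-blob's BlobClient."""

    def set_standard_blob_tier(self, standard_blob_tier: str) -> None: ...


def change_tier(blob: TierableBlob, tier: str) -> None:
    """Move one blob to another tier in place; other blobs are untouched."""
    if tier not in VALID_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    blob.set_standard_blob_tier(tier)


class FakeBlob:
    """Hypothetical in-memory stand-in for a real BlobClient."""

    def __init__(self, name: str, tier: str = "Hot") -> None:
        self.name = name
        self.tier = tier

    def set_standard_blob_tier(self, standard_blob_tier: str) -> None:
        self.tier = standard_blob_tier


report = FakeBlob("projects/2009/report.pdf")
change_tier(report, "Archive")  # send this one blob to cold storage
```

With the real SDK, the blob argument would be a BlobClient. Note that moving an archived blob back to Hot or Cool starts a rehydration rather than completing immediately.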

Can Azure-based Cloud Archiving Replace Tape?

Advocates for tape media will point out that cloud storage does not compete with tape’s low storage prices.

Let’s face it: Provided that you do not have legal discovery and GDPR requests, tape is going to be the most cost-effective medium to store data.

However, ask any experienced eDiscovery lawyer, and they will tell you that tape is extremely expensive in a litigation scenario.

Archiving unstructured data in the cloud is revolutionizing long-term data management because it delivers a cost model comparable to tape economics. Unlike tape, however, the cloud is an agile, intelligent secondary storage environment.

For example, if you run HubStor on Microsoft Azure for long-term retention, you have the following advantages over tape:

Convenience – Cloud storage uses native disk format with synchronous storage redundancy and erasure coding for durability. Even in the Archive tier, your cold data storage does not become dark data. Granular blob-level controls support data management, file analysis, user access, and the GDPR.
Discovery efficiency – Deep archiving in the cloud preserves your ability to look up and even search the data on demand, with access to the information that is precise and always ready. Organizations that might face audits, investigations, or litigation should weigh the cost of discovery on tape against the cloud’s intelligent archive model.
Built-in data protection without lock-in – Data on tape is considered safe when it is offline. However, you must then worry about the physical management of the media, and offsite warehousing vendors can lock you in. Cloud storage is, by default, redundant and self-healing, with options for geo-redundancy and shadow copy that can provide a backup to protect against malicious insider attacks, all within a cloud environment that you own and control.
No infrastructure management overheads – With cloud storage, there is no hardware lifecycle to manage. The IT team no longer worries about maintaining a tape library infrastructure and refreshing the tape hardware every seven to 10 years, for instance.
Security and global reach – All data in the Archive tier is automatically encrypted at rest using 256-bit AES encryption. The Archive tier is available in 14 Azure regions, and its availability will expand to more regions in the future.

How Does HubStor Integrate with Azure Archive Storage?

It is often said that the downside of cloud storage is that it lacks rich data management functions such as search, legal hold, WORM retention policies, and access control. Moreover, things like native deduplication and compression, file analytics, data classification, and activity auditing are also lacking.

We must keep in mind that cold storage options in the cloud are provided at the Infrastructure-as-a-Service (IaaS) level. The primary cloud provider’s IaaS gives us excellent economies of scale, with disk-based storage pricing that is comparable to tape, and because of things like synchronous storage redundancy and erasure coding, you do not need to worry about data durability or hardware refreshes.

However, to satisfy an organization’s need for cloud data management, search, and access, you need more than just the IaaS layer; you also need the Software-as-a-Service (SaaS) layer to be present and tightly integrated with the underlying IaaS.

Let us take a closer look at how HubStor, a SaaS archive solution on Azure, integrates with the new Archive tier to deliver the best of both worlds.

Storage Tiering to the Cloud

The first challenge, of course, is how you can seed data to the cloud quickly.

HubStor includes installable software that connects to a variety of data sources. Whether Windows, CIFS, or NFS-based storage, your on-premises file systems are likely the repository that will benefit the most from having a release valve to long-term retention storage.

Using the policy controls in HubStor’s Connector Service, you can target multiple shares and directories at any level with individual policies. Much like a backup, HubStor takes a point-in-time snapshot with each crawl, capturing new items, changed items (versions), along with changes to folder structures and Access Control Lists (ACLs).
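
The crawl behavior described above can be pictured as a diff between two point-in-time snapshots. The following is only a sketch of that idea, not HubStor's implementation; the FileState fields and the diff_snapshots helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileState:
    """Hypothetical per-file record captured during a crawl."""
    modified: float   # last-modified timestamp
    acl: frozenset    # principals with access


def diff_snapshots(previous: dict, current: dict) -> dict:
    """Compare two {path: FileState} snapshots and classify what changed."""
    changes = {"new": [], "new_version": [], "acl_changed": []}
    for path, state in current.items():
        old = previous.get(path)
        if old is None:
            changes["new"].append(path)
            continue
        if state.modified != old.modified:
            changes["new_version"].append(path)
        if state.acl != old.acl:
            changes["acl_changed"].append(path)
    return changes


monday = {
    "/share/a.docx": FileState(100.0, frozenset({"alice"})),
    "/share/b.xlsx": FileState(100.0, frozenset({"alice"})),
}
tuesday = {
    "/share/a.docx": FileState(250.0, frozenset({"alice"})),         # edited
    "/share/b.xlsx": FileState(100.0, frozenset({"alice", "bob"})),  # ACL change
    "/share/c.pdf": FileState(240.0, frozenset({"bob"})),            # new file
}
changes = diff_snapshots(monday, tuesday)
```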

If you wish to archive older data without disrupting users or applications, HubStor supports storage tiering. The stubbing method differs between Windows and CIFS, but either way, based on policies that you define, you can selectively leave pointers in the filers so that users and applications can initiate recalls from the cloud archive.

Data residing in the Hot or Cool tier is instantly retrievable. However, as we will later explore, items in the cold storage tier in the cloud do not recall immediately. Instead, they have a rehydration lag before the blobs are ready. Thus, you may find it ideal to remove stubs on-premises that point to items in Azure’s Archive tier since the recall request will return an error. Using HubStor’s policy controls, you can phase out stubs according to your rules for moving data to the Archive tier.

However, you do have the option of leaving stubs in your on-premises filers that point to items in cold storage. In this case, a request on a stub will initiate rehydration of the object from the Archive tier, after which the stub will again recall the item as expected.
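
To make the stub behavior concrete, here is a minimal model of what a recall request does in each tier. It is a sketch under stated assumptions rather than HubStor's actual logic: the ArchivedItem record, the status strings, and the finish_rehydration helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArchivedItem:
    """Hypothetical cloud-side record that an on-premises stub points to."""
    path: str
    tier: str                  # "Hot", "Cool", or "Archive"
    rehydrating: bool = False


def recall(item: ArchivedItem) -> str:
    """Return what a stub request does for an item in each tier."""
    if item.tier in ("Hot", "Cool"):
        return "returned"             # content is available immediately
    if not item.rehydrating:
        item.rehydrating = True       # first request starts rehydration
        return "rehydration-started"
    return "rehydration-pending"      # later requests wait for it to finish


def finish_rehydration(item: ArchivedItem, target_tier: str = "Hot") -> None:
    """Simulate the moment the blob becomes readable again."""
    item.tier = target_tier
    item.rehydrating = False


video = ArchivedItem("/share/footage/2004.mov", tier="Archive")
first = recall(video)      # kicks off rehydration
second = recall(video)     # still waiting
finish_rehydration(video)
third = recall(video)      # stub works as expected again
```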

In-cloud Storage Tiering

In the cloud archive, HubStor’s object storage layer includes analytics and a granular policy engine that makes it easy to visualize and manage the distribution of content across the Hot, Cool, and Archive tiers.

HubStor enables IT administrators to manage storage tiering in Azure with rules that target data based on folder, last accessed, type, data owner, user or group access rights, size, DLP tags, and custom fields.
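
One way to picture such rules is as a list of criteria evaluated per item, with the first match deciding the tier. The sketch below covers only a few of the criteria listed above; the Item and TieringRule classes, their field names, and the first-match-wins ordering are assumptions made for illustration, not HubStor's policy engine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Item:
    folder: str
    extension: str
    size_bytes: int
    last_accessed: datetime


@dataclass
class TieringRule:
    """Hypothetical rule: every criterion that is set must match."""
    target_tier: str
    folder_prefix: Optional[str] = None
    extension: Optional[str] = None
    min_size_bytes: Optional[int] = None
    not_accessed_for: Optional[timedelta] = None

    def matches(self, item: Item, now: datetime) -> bool:
        if self.folder_prefix is not None and not item.folder.startswith(self.folder_prefix):
            return False
        if self.extension is not None and item.extension != self.extension:
            return False
        if self.min_size_bytes is not None and item.size_bytes < self.min_size_bytes:
            return False
        if self.not_accessed_for is not None and now - item.last_accessed < self.not_accessed_for:
            return False
        return True


def pick_tier(item, rules, now, default="Hot"):
    """First matching rule wins; unmatched items stay on the default tier."""
    for rule in rules:
        if rule.matches(item, now):
            return rule.target_tier
    return default


now = datetime(2018, 1, 1)
rules = [
    TieringRule("Archive", folder_prefix="/projects/closed", not_accessed_for=timedelta(days=365)),
    TieringRule("Cool", not_accessed_for=timedelta(days=90)),
]
old_cad = Item("/projects/closed/bridge", ".dwg", 50_000_000, datetime(2015, 6, 1))
stale_doc = Item("/hr/forms", ".docx", 20_000, datetime(2017, 8, 1))
fresh_doc = Item("/hr/forms", ".docx", 20_000, datetime(2017, 12, 20))
```

Calling pick_tier(old_cad, rules, now) selects Archive, while the two HR documents fall through to the Cool rule or the Hot default depending on how recently they were accessed.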

HubStor defaults to writing all data to either the Hot or Cool tier. In the cloud, HubStor performs full-text indexing and data classification and integrates with Azure Media Services and other analytics services, all of which can involve opening files to render their contents. Writing data directly to the Archive tier in Azure could therefore raise costs, since other rules may run shortly afterward that need to open the files. Because the Archive tier carries higher activity costs for retrieval, especially early rehydration, writing all data to Hot or Cool first allows time for content analysis, PII detection, OCR, media transcription, and keyword indexing to run before storage tiering rules take effect.

HubStor’s cost-optimization approach for in-cloud tiering also means that data in the Archive tier can be fully searchable. For example, large video files are processed with speech-to-text indexing while they reside on Hot storage, and then the videos are moved to low-cost cold storage. This way, not only do we minimize activity costs, but the video in the Archive tier is readily searchable through the indexed transcript. Rehydration of the footage from the Archive tier only occurs if a user needs to play back or export the file.
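
The ordering described above, process first and archive afterward, can be sketched as a simple gate. This is an illustrative model only; the step names in REQUIRED_STEPS and the StoredItem, mark_done, and try_archive names are hypothetical.

```python
from dataclasses import dataclass, field

REQUIRED_STEPS = {"metadata-index", "content-index"}  # assumed processing steps


@dataclass
class StoredItem:
    name: str
    tier: str = "Hot"                            # new data lands on Hot (or Cool)
    completed: set = field(default_factory=set)  # processing steps already run


def mark_done(item: StoredItem, step: str) -> None:
    item.completed.add(step)


def try_archive(item: StoredItem) -> bool:
    """Move to Archive only after every content-level step has run."""
    if REQUIRED_STEPS <= item.completed:
        item.tier = "Archive"
        return True
    return False  # stay put, so no later step has to rehydrate the blob


video = StoredItem("town-hall.mp4")
mark_done(video, "metadata-index")
too_early = try_archive(video)   # transcript not indexed yet
mark_done(video, "content-index")
archived = try_archive(video)    # all steps done, safe to archive
```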

Search and the Archive Tier

By default, a search cluster in HubStor will index all item-level metadata, folders, and access rights, thus, at minimum, making all data in the Archive tier readily searchable by metadata.

This basic level of indexing does not involve a file open request to render the contents of files. Thus, it does not require a scaled search cluster configuration and very little storage space is needed to maintain the index. As a result, the default indexing in HubStor is fast, highly scalable, and very inexpensive, and delivers a cold archive that is searchable.

If you wish to use HubStor’s full-text search or data classification, or to leverage Microsoft Cognitive Services against your data, then HubStor’s in-cloud storage tiering design will help by having these content-level processes work with the data while it is on the Hot tier. If data is content-indexed or otherwise classified and later moved to the Archive tier, then the data in cold storage will be fully searchable since the contextual data is maintained separately.
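
The key point is that the index lives apart from the blobs, so a query never has to open an archived file. A minimal sketch of that separation follows; the IndexEntry record and search function are hypothetical and far simpler than a real search cluster.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    """Hypothetical index record kept outside the blob itself."""
    name: str
    folder: str
    tier: str
    keywords: frozenset = frozenset()  # filled by content indexing, if it ran


def search(index, term):
    """Match on name, folder, or stored keywords; never reads blob content."""
    term = term.lower()
    return [
        e.name for e in index
        if term in e.name.lower()
        or term in e.folder.lower()
        or term in e.keywords
    ]


index = [
    IndexEntry("budget-2011.xlsx", "/finance", "Archive"),
    IndexEntry("town-hall.mp4", "/media", "Archive", frozenset({"merger", "budget"})),
    IndexEntry("notes.txt", "/media", "Hot"),
]
hits = search(index, "Budget")
```

Here the spreadsheet matches on its file name alone, while the archived video matches only because its transcript keywords were indexed before it moved to cold storage.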

Cloud Data Management and the Archive Tier

Just as important as search is the ability to understand holistically the data you are storing, and be able to manage it as needed.

For example, you may wish to attribute storage costs to business entities such as projects or cost centers in your organization.

A legal situation may arise that requires particular data to be placed on litigation hold. Alternatively, a request under the GDPR may come in that requires you to isolate and delete files with an automatic audit record.

Traditionally, long-term retention, particularly when handled with tape, is burdensome in this regard. It just is not possible to actively manage the data in long-term retention – you have to recall it to manage it.

Fortunately, that is no longer the case with the cloud. Regardless of the tier (Hot, Cool, or Archive), you can actively manage your data in HubStor. Things like litigation hold, associating content with a legal case, classifying the data, storage cost analysis, retention, and search work with the data regardless of the tier.

User Experience and the Archive Tier

Earlier we mentioned that information on the Archive tier is not instantly retrievable. It can take several hours to rehydrate.

In a recent presentation, we introduced the Archive tier to an IT team considering the cloud for long-term retention. In their scenario, it was essential to provide users with self-service access to the cloud archive. HubStor supports this in two ways: 1) stubs in the on-premises file system, and 2) Web portal access with browse, search, recall, and share.

This particular organization felt that their user community would not take well to the Archive tier’s slow retrieval response. Even if the data is 20 years old, they explained, the expectation is that the file opens when requested.

In the screenshot above, we see in HubStor’s Web-access user portal that a search returns across the tiers (items in the Archive tier have grey-colored file names). If the user clicks to open such a file, they see a pop-up that tells them it is now being rehydrated, and the item will be available within 15 hours.

If this user experience will not suffice for your user community, we recommend that your tiering policies in HubStor phase data from Hot to Cool but go no further. This way, you can still reduce your long-term cloud storage costs with the Cool tier, albeit to a lesser degree than with Archive, while supporting immediate access to all content for your users.

The good news is that you have total control over which tier your data resides in, and whether or not end-user accessible data will be placed on the Archive tier.

Conclusion and Future Enhancements

Adoption of the Archive tier depends on your requirements, data management philosophy, and the workloads in question.

We believe the Archive tier is a perfect fit for closed project data, compliance data, legal discovery preservation, ex-employee records, culture preservation, and other such content that you need to keep but will not likely ever need to access again.

At HubStor, we are already looking forward to the next wave of enhancements to make the Archive tier more convenient for our clients. Here are some of the top feature improvements that are in discussion:

Email notification of completed rehydration requests – Today, on HubStor’s self-service Web portal for end-users, a person can opt for email notification when a bulk download completes. We believe it will be useful to offer users the option to receive a similar email notification when a requested item rehydrates from the Archive tier. Like our other notification, this would include the link that takes the user to the file directly.
Cost estimate before rehydration – It has been suggested that HubStor show the user a contextual cost estimate so that they can decide if it’s worth fetching the item or not. We will wait and see on this one. It seems like a useful feature, but it might also annoy users, especially if the cost estimate is fractions of a cent.
Stub sync – We also see the need to offer a checkbox option that will have HubStor automatically remove stubs from on-premises filers for any data that moves to the Archive tier in the cloud. Such automatic syncing would eliminate the need for an IT admin to coordinate stubbing policies in the HubStor Connector Service with Archive tiering policies in the cloud. It will be an essential feature for organizations that want to avoid the scenario of users or applications clicking a stub that can’t instantly fetch an item because it is in cold storage, yet still wish to phase items to the Archive tier eventually.
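
Since stub sync is still only under discussion, the following is purely a sketch of how the selection step might look: given the stubs on a filer and the tier of each item in the cloud, pick the stubs whose targets are now in the Archive tier. The stubs_to_remove function and both of its inputs are hypothetical.

```python
def stubs_to_remove(stub_paths, cloud_tiers):
    """Return stubs whose cloud item now sits in the Archive tier.

    stub_paths  -- paths that currently hold a stub on the on-premises filer
    cloud_tiers -- mapping of path -> tier as recorded in the cloud archive
    """
    return sorted(p for p in stub_paths if cloud_tiers.get(p) == "Archive")


stubs = {"/share/a.docx", "/share/b.mov", "/share/c.pdf"}
tiers = {
    "/share/a.docx": "Cool",
    "/share/b.mov": "Archive",
    "/share/c.pdf": "Archive",
}
removable = stubs_to_remove(stubs, tiers)
```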

Next Steps

You can learn more about HubStor by downloading the recent ESG technical lab validation report here.

Please contact us if you want to start a proof-of-concept or model pricing for your archive requirements.

About the Author

Geoff Bourgeois
