Data Backup and Recovery Considerations for Hadoop and Big Data

IBM estimates that 90% of the data in the world today was created in the last two years alone, and that 80% of the data captured today is unstructured. Sources of unstructured data include posts to social media sites, digital pictures and videos, and point-of-sale systems. All of this unstructured data can be termed Big Data.

Because of the wide-ranging benefits that small and medium-sized businesses (SMBs) can gain from Big Data in today’s competitive world, many are implementing a local Big Data strategy. To help businesses of all sizes manage Big Data, there is Hadoop. The Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

The Hadoop project has various elements. Below are a few of the more pertinent:

  • Hadoop Common – libraries and utilities
  • Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity servers while providing high aggregate bandwidth across the cluster.
  • Hadoop MapReduce – The “Map” step takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its sub-problem and passes the result back to its master node. The “Reduce” step then collects the answers to all the sub-problems and combines them to form the output.
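The Map and Reduce steps above can be illustrated with a minimal, single-process word-count sketch. This is plain Python, not the actual Hadoop MapReduce API; in a real cluster the map tasks would run on the worker nodes that hold the data blocks, and Hadoop would handle the shuffle between the two steps.

```python
from collections import defaultdict

def map_step(document):
    # "Map": break the input into smaller sub-problems --
    # here, one (word, 1) pair per word in the document.
    return [(word, 1) for word in document.split()]

def reduce_step(pairs):
    # "Reduce": collect the intermediate answers and combine
    # them into the final output (total count per word).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

documents = ["big data on commodity servers", "backup big data locally"]

intermediate = []
for doc in documents:  # in Hadoop, this loop is distributed across worker nodes
    intermediate.extend(map_step(doc))

word_counts = reduce_step(intermediate)
print(word_counts["big"])   # 2
print(word_counts["data"])  # 2
```

The same pattern scales because each map call depends only on its own input split, so the sub-problems can run on whichever servers already store the data.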

As Nathan Coutinho states in his CDW blog article “5 Ways to Future-Proof Your Data Center for Big Data”: “The whole point of Hadoop is to keep the data local on commodity servers and economical local storage…”

Small and medium-sized businesses find Hadoop attractive because of its ability to provide high availability of data on local commodity servers.

A data strategy is never complete without a data backup and recovery strategy, and a Big Data implementation using Hadoop demands even more focus on the ability to recover quickly from a catastrophic event. However, the SMB is often not staffed or tooled to design and execute a backup strategy of this complexity. A further consideration is that, since the attraction of Hadoop lies in using local servers, the backup and recovery strategy must be manageable remotely without requiring the live data to be transferred to, or run in, a cloud environment.

There are Data Backup/Recovery Managed Service Providers (DB/R MSPs) that provide remote management of the backup process, along with professional disaster backup and recovery consultation. Contracting a DB/R MSP under this remote-management model allows the SMB to maintain its data locally without hiring new staff or training existing staff in sophisticated data backup and recovery processes. Additionally, the SMB can have a comprehensive data backup and recovery strategy while housing its Big Data locally.