Big Data - What it is and why it matters
Ever wondered how all the photos, videos, and status updates of billions of users are stored at Facebook? The convergence of social media and big data has given rise to a whole new level of technology.
What is Big Data?

Big data is a term that describes the large volume of data that floods a business on an everyday basis. But it's not the amount of data that's important. It's what organizations do with the data that matters. Big data can be analyzed for insights that lead to better decisions and strategic business moves.
The use of Big Data is becoming increasingly common as companies look to outperform their competitors. In most industries, established players and new entrants alike use the strategies drawn from analyzed data to compete, innovate, and capture value.
Big Data helps organizations create new growth opportunities. These companies have ample information about products, services, and consumer preferences that can be captured and analyzed.
Three V’s of Big Data
Big data enables organizations to store, manage, and manipulate vast amounts of disparate data at the right speed and at the right time. To gain the right insights, big data is typically broken down by the three V's:

Volume — Organizations collect data from a variety of sources, including business transactions and social media networks. In the past, storing it would have been a problem, but the introduction of technologies like Hadoop has eased the burden. The size of the data plays a crucial role in determining its value, so volume is one characteristic that must be considered when dealing with Big Data.
Velocity — This refers to the speed at which data is generated. Since the flow of data is massive and continuous, the real potential lies in how fast the data can be generated and how quickly client demands can be met.
Variety — Data comes in all formats, from structured, numeric data in traditional databases to unstructured text documents, email, video, and audio. Variety refers to the large diversity of sources and the nature of the data, both structured and unstructured.
Understanding Unstructured Data
Unstructured data differs from structured data in that its structure is unpredictable. Examples of unstructured data include documents, e-mails, blogs, digital images, videos, and satellite imagery. It also includes some data generated by machines or sensors. In fact, unstructured data accounts for the majority of the data in a company's storage.
Importance of Distributed Storage in Big Data
As data grows significantly, storing large amounts of information across a network of machines becomes a necessity. In the past, most companies were not able to store this vast amount of data; it was too expensive. Even when companies could capture the data, they lacked the tools to analyze it easily and use the results to make decisions. Hence the need for distributed file storage, which controls how data is stored and retrieved.

A distributed storage system is an infrastructure that can split data across multiple physical servers, and across more than one data center. It takes the form of a cluster of storage units, with a mechanism for data synchronization between cluster nodes.
The Hadoop Distributed File System
Hadoop is an open-source, Java-based framework used for storing and processing data in the range of gigabytes to terabytes across different machines. The data is stored on inexpensive commodity servers that run as clusters. Its distributed file system, HDFS, enables concurrent processing and fault tolerance.
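As a rough illustration, here is a minimal sketch of writing and reading a file through Hadoop's Java FileSystem API. The NameNode address (namenode.example.com:8020) and the file path are placeholders for your own cluster's values, not part of any real deployment.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode; host and port are placeholders.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write a small file; HDFS splits larger files into blocks
            // and replicates those blocks across DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back and copy its contents to stdout.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}
```

The client code never needs to know which physical machines hold the data; the NameNode resolves paths to block locations behind the scenes.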
Key HDFS features include:
Distributed file system: HDFS is a distributed file system (or distributed storage) that handles large sets of data on commodity hardware. You can use HDFS to scale a Hadoop cluster to hundreds or even thousands of nodes. Hadoop clusters are composed of a network of master and worker nodes that orchestrate and execute the various operations.
Blocks: HDFS is designed to support very large files. It splits these large files into smaller pieces known as blocks. Each block holds a fixed amount of data that can be read or written, and HDFS stores each file as a sequence of blocks.
Replication: Data blocks are replicated to provide fault tolerance, and an application can specify the number of replicas of a file. The replication factor can be set at file creation time and changed later. You can also replicate HDFS data from one HDFS service to another.
Data reliability: HDFS creates replicas of each data block on the nodes of a cluster, providing fault tolerance. If a node fails, you can still access that data on other nodes that hold copies of the same blocks. By default, HDFS creates three copies of each block; the sketch after this list shows how a client can inspect and change a file's replication.
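To make the block and replication ideas concrete, here is a small sketch, again using Hadoop's Java FileSystem API, that reads a file's block size, block locations, and replication factor, then raises the replication factor. The cluster address and path are the same placeholder values as in the earlier example.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode.example.com:8020"), conf)) {
            Path file = new Path("/user/demo/hello.txt");
            FileStatus status = fs.getFileStatus(file);

            // Block size and replication factor recorded for this file.
            System.out.println("Length:      " + status.getLen() + " bytes");
            System.out.println("Block size:  " + status.getBlockSize() + " bytes");
            System.out.println("Replication: " + status.getReplication());

            // List which DataNodes hold each block of the file.
            BlockLocation[] blocks =
                    fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block at offset " + block.getOffset()
                        + " (" + block.getLength() + " bytes) on hosts: "
                        + String.join(", ", block.getHosts()));
            }

            // Raise the replication factor after creation; the NameNode
            // schedules the extra copies asynchronously.
            fs.setReplication(file, (short) 4);
        }
    }
}
```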
Organizations are realizing that categorizing and analyzing Big Data can help them make major business predictions. Hadoop allows enterprises to store as much data as they need, in whatever form, simply by adding more servers to a Hadoop cluster. Each new server adds more storage and processing power to the cluster. This makes data storage with Hadoop less expensive than earlier data storage methods.
Why is Big Data important for Facebook?

243,055 photos uploaded
100,000 friends requested
150,000 messages sent
13,888 apps installed
3,298,611 items shared
For a user, all this information is just statistics, but for Facebook, these numbers represent big challenges. Facebook requires a massive storage infrastructure to house this enormous amount of data, and it grows steadily as users add hundreds of millions of new photos every day. If Facebook could not handle all this data, its business would drown in the overflow. To handle it, the company has adopted Big Data technology.
"Big data really is about having insights and making an impact on your business. If you aren't taking advantage of the data you're collecting, then you just have a pile of data, you don't have big data," says Jay Parikh, VP of Engineering at Facebook. By processing data within minutes, Facebook can roll out new products, understand user reactions, and modify designs in near real time.