Technical Challenges of Big Data


The technical challenges in developing Big Data applications arise from the non-functional requirements of these systems, such as scalability, performance and security.
Every aspect of Big Data is big, whether it is storage, computation or complexity. Big Data consists of large datasets in terms of Volume, Velocity and/or Variety that require a scalable architecture for efficient storage, processing and analysis. When we perform analytics on top of Big Data, another set of V's is introduced: Veracity, Value and Visualization. Though Value and Visualization pose challenges of their own, they cannot be termed unique to Big Data. The challenges that are unique to Big Data stem from the colossal Volume, Velocity and Variety of captured data, while Veracity of data is a major challenge when performing analytics.
Traditional technologies never envisioned the need for such massive scale along these "V" characteristics. They cannot store, manage and mine such volumes of data in a timely manner, and traditional data management techniques are unable to integrate unstructured and machine-generated data into existing data warehouse infrastructures.
All this has led to the following technical challenges in Big Data.

Data Storage

Companies today store large volumes of diverse data from web logs, click streams, sensors and many other sources. The volume of this data runs into many terabytes, and at times petabytes. Technically it is possible to store this amount of data on disk; the challenge lies in finding cost-effective means of storage. The essential requirements of Big Data storage are that it can handle very large amounts of data, keep scaling to keep up with growth, and allow efficient operations on the data.

A Big Data storage solution should overcome the following challenges:
  • Preventing data loss: When storage systems reach hundreds of terabytes in scale, drive failures and errors present a constant challenge. The traditional approach to protecting data against these failures is replication: creating one or more copies of the data so that a backup is always available in case of loss. However, keeping copies of the data eventually becomes difficult to sustain from a cost and administration standpoint (a rough overhead comparison follows this list). For instance, Cleversafe, Inc. [Cleversafe13] published a paper describing how storage systems based on RAID arrays were not designed to scale to this type of data growth. As a result, the cost of RAID-based storage increases as the total amount of storage increases, while data protection degrades, resulting in permanent loss of digital assets. With the capacity of today's storage devices, RAID-based systems cannot adequately protect data from loss, and most IT organizations using RAID for big storage incur the additional cost of copying their data two or three times to protect it from inevitable loss.
  • Maintaining High Availability of the System: End users and customers have grown to expect 24/7 access to information; downtime is unacceptable. Data must remain available during storage system upgrades, drive failures and network outages, and even during the failure of an entire data center.
  • Protecting data from unauthorized access: Data must be protected from unauthorized access, both when it’s “at rest” and when it’s traversing the network.  The massive scale of big data makes it difficult for organizations to identify and address potential security threats.
  • Managing Economy of Scale:  Storage devices continue to grow in terms of capacity while declining in price (more bits per device at a lower cost).  However, even as the price of capacity declines, the cost to power, cool, house, connect and manage that capacity continues to pose a challenge to budgets.
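
To see why simply keeping extra copies becomes hard to sustain, the following back-of-envelope Python sketch (referenced in the first bullet above) compares the raw capacity needed for three-way replication with that of an illustrative 10+6 dispersal/erasure-coded layout. The replication factor, the code parameters and the 500 TB figure are assumptions chosen for illustration, not numbers taken from the Cleversafe paper.

    # Back-of-envelope comparison of raw-capacity overhead for data protection.
    # The replication factor, erasure-code parameters and dataset size are
    # illustrative assumptions.

    def replication_raw_tb(usable_tb, copies=3):
        """Raw capacity needed when every object is stored `copies` times."""
        return usable_tb * copies

    def erasure_coded_raw_tb(usable_tb, data_slices=10, parity_slices=6):
        """Raw capacity needed for a (data + parity) dispersal/erasure scheme."""
        return usable_tb * (data_slices + parity_slices) / data_slices

    usable = 500.0  # terabytes of data we actually need to keep
    print(f"3x replication   : {replication_raw_tb(usable):.0f} TB raw")
    print(f"10+6 erasure code: {erasure_coded_raw_tb(usable):.0f} TB raw")
    # 3x replication needs 1500 TB of raw disk; the 10+6 layout needs 800 TB
    # for the same usable capacity, which is why plain copying is costly to
    # sustain as volumes grow.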

Data Processing

The data processing infrastructure that was developed to run business operations over the past decades is having trouble keeping pace with today’s digital landscape. Across every industry, daily interactions and transactions are moving online and business services are becoming increasingly automated. Volumes of multi-structured, machine-generated data from a variety of sources have skyrocketed, and smart companies want to capture and make use of it all. As data volumes grow and the complexity of data types and sources increases, data processing workloads take longer to run and the time available for reporting and analysis shrinks.

The magnitude of the 3Vs of Big Data (Volume, Velocity and Variety) determines the computational power required to process the data. Imagine the computational power required by Google to process 20 petabytes of data within a day, by Twitter to analyze 200 million tweets per day to derive trending topics, or by The New York Times to process 4 TB of raw images into 11 million finished PDFs in 24 hours [WikibonBDInfo, 2012]. So the challenge arises not only from the tremendous volume of data, but also from the speed (velocity) at which it is generated and the variety of sources from which it is produced.
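
To put these figures in perspective, here is a rough Python calculation of the sustained throughput implied by processing 20 petabytes within a day; the 24-hour window and decimal units are the only assumptions beyond the numbers quoted above.

    # Rough sustained-throughput estimate for processing 20 PB in 24 hours.
    PETABYTE_GB = 1_000_000          # 1 PB expressed in gigabytes (decimal units)
    SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds

    data_gb = 20 * PETABYTE_GB       # 20 PB of data
    throughput_gb_s = data_gb / SECONDS_PER_DAY

    print(f"Required sustained throughput: ~{throughput_gb_s:,.0f} GB/s")
    # ~231 GB/s sustained for a full day -- far beyond any single machine,
    # which is why the work has to be spread across thousands of nodes.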

Various traditional technologies are available for high-performance computing, including Cluster Computing and Grid Computing. However, the high cost of clusters and the unreliability of Grid Computing make them unsuitable for Big Data processing. Cloud computing, though it may seem an ideal solution for Big Data processing, imposes its own constraints, such as the need to transfer huge volumes of data across the network.

High-performance computing for Big Data poses challenges along the following parameters:

  • Linear Horizontal Scalability: The system should be able to scale horizontally. More important than a system being scalable is a system being linearly scalable [Marz14]. A linearly scalable system can maintain performance under increased load by adding resources in proportion to the increased load. A non-linearly scalable system, while “scalable”, isn’t particularly useful, as it may not be feasible from a cost perspective.
  • Cost Effectiveness: Economy of scale is the most important criterion that needs to be satisfied by any computation solution for Big Data processing.
    Suppose the number of machines required has a quadratic relationship to the load on the system, as in the figure. Then the cost of running that system would rise dramatically over time: increasing the load tenfold would increase costs a hundredfold. Such systems are usually not feasible from a business perspective (see the sketch after this list).
  • Performance: The system has to process Big Data at an acceptable level of performance. The various technologies used to process Big Data should utilize resources efficiently in order to produce high throughput per unit of resource. Resources come at a cost, so it is important that economy of scale is maintained throughout the product life cycle. Examples of resources are memory, CPU, network bandwidth, etc.
  • Minimum data transfer: Big Data means terabytes and petabytes of data, so transmitting that amount of data over the network to a computational resource for processing is not a feasible option. A Big Data computing architecture should restrict the movement of data over the network: the computation needs to be taken to the data, not vice versa.
  • Fault tolerance: The system should be able to continue processing data in the event of the failure of one of its parts. Failures may be due to network partitions, server crashes or disk breakdowns.
  • Auto Elasticity: This is the ability of resources to adjust to the incoming load. Resources should be able to scale up or scale down as per the requirements at a given point in time.
  • Latency: Latency is the delay within a system caused by delays in the execution of a task. There is always a timeliness associated with information that makes it valuable, so it is very important that the latency of the system allows information to be available on time. Consistency and predictability of performance are critical for Big Data systems, especially those that perform analytics on the data.
    Latency is an issue in every aspect of computing, including communications, data management and system performance. It can become a major challenge in real-time and near-real-time applications.
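
As a concrete illustration of the cost-effectiveness point above, the short sketch below contrasts a linearly scalable system with a hypothetical quadratically scaling one. The machine counts and the per-machine cost are made-up numbers used only to show the shape of the two cost curves.

    # Illustrative cost comparison: linearly scalable vs. quadratically scaling
    # system. All figures are made up for illustration.

    COST_PER_MACHINE = 2_000  # assumed yearly cost per machine, in dollars

    def machines_linear(load_units):
        """Linearly scalable: machines grow in proportion to the load."""
        return load_units

    def machines_quadratic(load_units):
        """Non-linear system: machines grow with the square of the load."""
        return load_units ** 2

    for load in (1, 10, 100):
        linear_cost = machines_linear(load) * COST_PER_MACHINE
        quadratic_cost = machines_quadratic(load) * COST_PER_MACHINE
        print(f"load x{load:>3}: linear ${linear_cost:>12,}  quadratic ${quadratic_cost:>12,}")
    # A tenfold increase in load raises the linear system's cost tenfold, but
    # the quadratic system's cost a hundredfold -- the situation described
    # above as not feasible from a business perspective.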

Data Security and Privacy


Data Security and Privacy together deliver data protection across the enterprise. They comprise the people, processes and technology required to protect data from destructive forces and unwanted actions.

Data Security is the protection of data, information and systems from unauthorized access, use, disclosure, disruption, modification or destruction, in order to provide:

  • Integrity: Guarding against improper data modification or destruction, including ensuring data non-repudiation and authenticity (a minimal sketch follows this list);
  • Confidentiality: Preserving authorized restrictions on access and disclosure, including means for protecting personal privacy and proprietary data; 
  • Availability: Ensuring timely and reliable access to and use of data. 
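
As a minimal illustration of the integrity and authenticity goals above (see the note in the first bullet), the sketch below uses Python's standard hashlib and hmac modules to detect tampering with a stored record. The secret key and the record contents are hypothetical; confidentiality would additionally require encryption at rest and TLS in transit, which is not shown here.

    # Minimal integrity/authenticity check for a stored record, using only the
    # Python standard library. The key and payload are hypothetical examples.
    import hashlib
    import hmac

    SECRET_KEY = b"shared-secret-key"   # assumed to be distributed securely

    def fingerprint(data: bytes) -> str:
        """SHA-256 digest used to detect accidental or malicious modification."""
        return hashlib.sha256(data).hexdigest()

    def authenticate(data: bytes) -> str:
        """HMAC tag proving the data was written by a holder of the key."""
        return hmac.new(SECRET_KEY, data, hashlib.sha256).hexdigest()

    record = b'{"patient_id": 42, "diagnosis": "..."}'
    stored_digest = fingerprint(record)
    stored_tag = authenticate(record)

    # Later, before trusting the record, verify it has not been tampered with.
    assert hmac.compare_digest(fingerprint(record), stored_digest)
    assert hmac.compare_digest(authenticate(record), stored_tag)
    print("record integrity and authenticity verified")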

Privacy is the assured, proper, and consistent collection, processing, communication, use and disposition of data associated with personal information (PI) and personally-identifiable information (PII) throughout its life cycle.

Security and Privacy of Big Data are among the major concerns in the Big Data paradigm. Big Data application domains include health care, finance, governance and many others, in both the private and public sectors. These domains deal with very sensitive information such as personal health records, credit histories and personal identities. This data needs to be protected both to meet the compliance requirements of each domain and to ensure the privacy of individuals.

Security and Privacy issues of Big Data can be handled using the same techniques as conventional data environments, but the 3Vs (Volume, Velocity and Variety) of Big Data exacerbate the problem. Other characteristics of Big Data that matter for security and privacy are Volatility (temporality) and Veracity (provenance). Managing the integrity of Big Data also presents additional challenges along all the "V" aspects. Governing and managing Big Data brings many challenges of its own, e.g., restricting access to sensitive data that may have come in through a third-party application, and controlling ad-hoc querying.

Big Data technologies perform their tasks in distributed environments such as the Cloud, which makes it more challenging to comply with the various Security and Privacy guidelines. The Cloud Security Alliance [CSA12] has identified securing computations in distributed programming frameworks as one of the major challenges. At the same time, data provenance is going to be a lingering issue when it comes to Big Data analytics.

Data Governance and Management


Big Data gives enterprises the ability to access, aggregate and analyze ever-increasing amounts of data. It is an enormous opportunity to make information the driver of value creation, but without comprehensive principles, policies and frameworks, Big Data can also generate enormous risks [Soares13]. Big Data needs a governance framework that ensures trustworthy data practices. Without proper governance, the same data that brings value to the organization can also bring misfortune if its security and privacy are compromised. With effective data and information governance in place, business stakeholders have greater trust and confidence in the data, and there is clear accountability throughout the information lifecycle.

The volume of information coming into most companies has exploded in recent years, and many IT shops are dealing with extremely large data sets. This makes data governance and management a continuously evolving process. With Big Data being stored for analytics, storage management tasks keep growing: more drives and devices are needed to house the data, and to keep high-performance server CPUs fed, data must be selectively stored and moved across storage tiers to meet the varying I/O and throughput characteristics of each application.
Business intelligence extracted from new types and growing volumes of data has led companies to use dedicated systems optimized for different business needs. This approach can make inefficient use of storage: spare capacity in one effort sits idle while another group's effort requires buying additional capacity. This increases CAPEX spending, and its impact on IT budgets is compounded by an associated increase in OPEX costs, as the added devices must be managed and maintained, take up data center space, and must be powered and cooled.

This approach also results in siloing, which further complicates matters and prevents organizations from realizing the advantages of a company-wide view of their data. Additionally, since some of the same data (a customer sales database or stock market indices, for example) might be used by different groups for different purposes, keeping multiple versions of this data increases the total data volume. It also increases the need for multiple data entry, which contributes to multiple versions of the truth.

Making matters worse, because storage solutions are optimized to match each analytics application's performance needs, different storage product lines are often used throughout an organization. Each type of storage system typically has its own storage management system, often even when all of the systems come from one vendor.
All this has led to the development of new tools to manage Big Data.

Data Transfer


Traditional WAN-based transport methods cannot move terabytes of data at the speed dictated by businesses; they use only a fraction of the available bandwidth and achieve transfer speeds that are unsuitable for such volumes, introducing unacceptable delays in moving data around. Nor is it practical to move terabytes of data around every day for processing on different nodes.
Other methods, such as writing the data to disk and transporting the physical disk, also have limitations: by the time the data reaches a processing facility, it may already have changed.
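
To make the bandwidth limitation concrete, here is a rough Python estimate of how long it takes to move a multi-terabyte dataset over a WAN link. The 10 TB dataset, the 1 Gbps link and the 30% effective utilization are assumptions chosen only for illustration.

    # Rough WAN transfer-time estimate. Dataset size, link speed and effective
    # utilization are illustrative assumptions.
    DATASET_TB = 10                 # terabytes to move
    LINK_GBPS = 1.0                 # nominal WAN link speed in gigabits/second
    EFFECTIVE_UTILIZATION = 0.3     # fraction of the bandwidth actually achieved

    dataset_bits = DATASET_TB * 8 * 10**12          # decimal terabytes -> bits
    effective_bps = LINK_GBPS * 10**9 * EFFECTIVE_UTILIZATION

    hours = dataset_bits / effective_bps / 3600
    print(f"Moving {DATASET_TB} TB over a {LINK_GBPS:.0f} Gbps link at "
          f"{EFFECTIVE_UTILIZATION:.0%} utilization takes ~{hours:.0f} hours")
    # Roughly 74 hours for a single snapshot -- several days before processing
    # even begins, which is why computation is pushed toward the data instead.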

New, innovative techniques need to be employed to minimize the transfer of data over the network.



