Solution: Big Data Reference Architecture


This section provides a solution to the Big Data problem in general by deriving a reference architecture. It is organized as follows: first the term reference architecture is defined, then the requirements for it are captured, and finally an overview of the architecture and its functional components is presented.



Reference Architecture Definition

The U.S. Department of Defense (DoD) defines a Reference Architecture [DoD10] as:

"Reference Architecture is an authoritative source of information about a specific subject area that guides and constrains the instantiations of multiple architectures and solutions."



As a reference architecture is abstract and designed with generality in mind, it is applicable in different contexts, where the concrete requirements of each context guide its adoption into a concrete architecture. The level of abstraction can, however, differ between reference architectures, and with it the concreteness of the guidance a reference architecture can offer [Galster11].


Capturing Requirements

To build the reference architecture on solid ground and to make evidence-based rather than arbitrary design decisions, it is necessary to base it on a set of requirements. Extracting, formulating and specifying requirements helps in understanding a problem in detail and is therefore a prerequisite to designing a solution for that problem [Maier13].

Requirement extraction for a general-purpose reference architecture is a two-phase process [NBD-WG-RA13].
  1. Capture specific requirements from applications of the domain at which the Reference Architecture is targeted. In this case, these could be applications derived from the Use Cases defined in the previous section. Requirements should include detailed information on the following:
    • Data Source (data size, file formats, growth rate, data at rest, data in motion)
    • Data Life Cycle Management (curation, conversion, quality check, pre-processing, etc.)
    • Data Transformation (fusion, analytics)
    • Big Data Framework and Infrastructure (software tools, platform tools, hardware resources such as storage and networking)
    • Data Usage (processed results in text, table, visual, and other formats).  
  2. Aggregate each application's specific requirements into high-level generalized requirements which are vendor-neutral and technology agnostic.
Following the process defined above, the requirements for the Big Data Reference Architecture are organized into the following categories.

  • Functional Requirements
    • Requirements that define the business functionality of the system.
  • Non-functional Requirements
    • Requirements that define the quality attributes of the system.

Architecture Overview

The high-level reference architecture shown in the figure depicts the main functionality needed in a Big Data System. It can be viewed as a three-step process, i.e. acquiring the data, analyzing it and delivering the results.




The various functional components of the Big Data Reference Architecture (BDRA) can be grouped under the following categories.
  • Data Providers
  • Data Acquisitors
  • Data Processors
  • Data Consumer
  • Cross-cutting Concerns

Data Providers

Data Providers are the source of Big Data. Though they could have been categorized as data sources, a Big Data System does not have access to the data owned by the "data source" custodians unless it is explicitly provided access to it. In other words, they provide data for processing by the Big Data System, hence the name "Data Providers".

Based on the sources of Big Data described previously in Section 3.1.4, the various types of Data Providers are as follows.
  • M2M Networks: Machine to Machine data.
  • Scientific Systems: Biometric data.
  • Enterprise Systems: Big Transactional Data.
  • Applications: Web and social media and human-generated content, e.g. e-mails, customer interaction data, office documents, archived documents, library data, videos, music files, etc.

Data Acquisitors

Data Acquisitors are responsible for acquiring data from Data Providers. A Data Acquisitor introduces new data or information feeds into the Big Data System for discovery, access, and transformation. Data can either be pushed to a Data Acquisitor or pulled by it.

These components directly contribute to all five challenges of Big Data described in the previous section. Their design and implementation is derived from the distributed computing paradigm, as that is the only way to make them handle the high Velocity and Volume of data. These components must be linearly scalable and fault tolerant, as they are the gateways to any Big Data System. Any failure at this functional point dents the overall reliability of the system; a Big Data System is all about data, and any loss of data by this component can result in wrong information being delivered to the end consumer.

Data Acquisitors can be grouped under the following categories based on the mechanism they use to feed data into the system. A system can use one or more of these groups to acquire data from Data Providers.

Data Collector
The Data Collector is responsible for acquiring data in motion. The acquisition includes capabilities to 'subscribe' to a data stream or to simulate a stream via quickly repeated API calls. In the subscription case, data is pushed to the collector, for example by M2M networks and/or Scientific Systems.
The Big Data System publishes information about its collector, such as the IP address on which it is listening, and/or it may provide a client library that Data Providers use to push their data into the system.
Once data is received by the Data Collector, it may be stored temporarily in memory or on disk before being passed to the Stream Processor for near real-time processing. Data can also go through filtration before it is fed to the Stream Processor.

The computational requirements of this component are very high and directly influence the "Data Processing" and "Data Transfer" challenges mentioned in Section 3.3. The final implementation of this component is a distributed, fault-tolerant, linearly scalable and high-performance queue.
In the Big Data technology landscape, the functionality of the Data Collector can be performed by a message queue or a distributed commit log, as some of the queue implementers prefer to describe themselves. These are basically distributed, fault-tolerant queues.

The following are some of the successful open-source products in the message queue category of the Big Data technology landscape right now.

  • RabbitMQ
  • HornetQ
  • ZeroMQ
  • Apache Kafka
  • ActiveMQ


Apache Kafka
Apache Kafka [ApacheKafka14] is publish-subscribe messaging rethought as a distributed commit log. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.

Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact.

Technology Choice
Apache Kafka
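
To make the choice concrete, the following is a minimal sketch of how an event could be pushed into Kafka using the Java producer API. The broker address, topic name and payload are assumptions made purely for illustration.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CollectorProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of the Kafka broker(s) acting as the Data Collector endpoint (assumed host/port)
        props.put("bootstrap.servers", "broker1:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // A Data Provider pushes one event into a hypothetical "m2m-events" topic
            producer.send(new ProducerRecord<>("m2m-events", "device-42", "{\"temp\": 21.5}"));
        }
    }
}

A consumer on the Stream Processor side would subscribe to the same topic and pull the records for near real-time processing.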

Data Extractor
The job of the Data Extractor is to fetch static data from Data Providers. It is primarily a pull mechanism for feeding data into the system. The difference between data extraction and stream collection is that data extraction aims at acquiring data at rest, while the Data Collector, as explained earlier, aims at acquiring data in motion.

Data extraction can also refer to crawling data from web pages. Data can be extracted from structured, semi-structured and unstructured data sources.
Sometimes data may need to be filtered based on its suitability for further analysis, e.g. if there is a privacy concern. The filtering can be rule- or attribute-based, or based on a machine-learning model, e.g. to classify relevant data items.

The Data Extraction component is more batch-oriented and includes connectors to the source systems to acquire the data, but also management components, e.g. to monitor changes in the data source and handle delta loads. The extraction can be realized using standard or proprietary database interfaces (e.g. ODBC), APIs provided by the source application, or file-based interfaces.

Technology Choice
The technologies used for Data Extraction vary based on the data source. There is no single set of technologies that can be used for extracting data from all Data Providers.

Data Processors

Data Processors contain the components that perform the intelligent tasks on the data to extract value from it. These components are critical to the overall scalability and performance of the system. They should be robust, fault-tolerant and linearly scalable, and should support low-latency reads and updates.
As data propagates through the ecosystem, it is processed and transformed in different ways in order to extract value from it. Modern Big Data applications need data to be processed in near real time to generate events as well as in batch mode to run analytic algorithms on the whole data set. This is achieved in the Reference Architecture by having separate components for stream processing and batch processing. This pattern is named the "Lambda Architecture" by Nathan Marz [Lambda14]; however, Marz uses the names Speed Layer, Batch Layer and Serving Layer for the Stream Processor, Batch Processor and Batch Service Provider respectively.

At a high level, the Data Processor performs the following functions:
1. All data entering the system is dispatched to both the Stream Processor and Batch Processor components.
2. The Batch Processor manages the master data set and computes the views.
3. The Batch Service Provider indexes the data views generated by the Batch Processor in order to improve the performance of queries.
4. The Stream Processor compensates for the high latency of updates to the Batch Service Provider (due to the latency in scheduling Batch Processor jobs).
5. The Consolidated Analytics Service merges the results from the Batch Service Provider and the Stream Processor.


The detailed description and technology choices for each of these components are as follows.

Stream Processor

The Stream Processor is responsible for handling the Velocity characteristic of Big Data, which brings in the challenge of "Data Processing" at very high speed with low latency. In other words, it is the component that handles the processing of data in motion.

The main responsibility of the Stream Processor is to process the incoming data in real time, "in-stream", that is, without the need to store it first. This way it can achieve low latency while performing its actions in a pipeline. Here too, data goes through processes like filtration, aggregation, validation and analysis.

A Complex Event Processor (CEP) can carry out event processing within this component to generate alerts based on business rules.

This is one of the most complex components of the Reference Architecture. The challenge lies in the timely completion of tasks: analyzing the data while it is flowing and reacting to the analysis results accordingly. It is fundamentally different from analyzing data at rest. The Stream Processor can, however, use partial results, models and rules pre-calculated during a deep at-rest analytics task to accelerate the analysis, hence the backflow from the data analysis component [Maier13]. This component is completely distributed, fully fault-tolerant and able to handle stream imperfections like missing and out-of-order data.

The most suitable open-source stream processing technology options available for the Stream Processor are:

  • Apache Storm
  • Apache Spark Streaming
  • Apache Samza


Technology Choice

Apache Spark Streaming

Apache Spark Streaming is a stream processing framework built on top of Apache Spark. It is a fault-tolerant and linearly scalable solution for stream processing.
Apache Spark Streaming is well integrated with the Hadoop eco-system. It shares its APIs with Apache Spark, making it possible to maintain a unified code base, which results in low maintenance overhead. Other attributes are high performance, exactly-once semantics for message delivery, and very efficient out-of-the-box fault tolerance.

Streaming frameworks such as Apache Storm and Apache Samza were also considered before finalizing on Spark Streaming.

Apache Storm, though very popular and robust, does not provide exactly-once semantics, which means that a message can be processed more than once; it guarantees "at least once" delivery of messages. Storm also does not provide out-of-the-box fault tolerance, since it expects users to provision persistence of messages if full fault tolerance is needed. The other benefit we get by using Spark Streaming instead of Storm is the ability to maintain a unified code base rather than separate code bases for the streaming and batch components.

Apache Samza is very young and has just released version 0.7. It does provide a rich set of functionality and is used extensively by LinkedIn, but it will need some time to become stable enough for use by other companies.
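
To illustrate the programming model, below is a minimal Spark Streaming sketch in Java that filters records in-stream, without storing them first. The socket source, host, port and the "ALERT" predicate are hypothetical placeholders; in the Reference Architecture the input would typically come from the Data Collector (Kafka).

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamProcessorSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("StreamProcessorSketch").setMaster("local[2]");
        // Micro-batches of one second
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Hypothetical text stream; a Kafka input stream would be used in a real deployment
        JavaReceiverInputDStream<String> records = jssc.socketTextStream("localhost", 9999);

        // In-stream filtration: keep only records that look like alerts and print them per batch
        JavaDStream<String> alerts = records.filter(record -> record.contains("ALERT"));
        alerts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}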

Batch Processor

The Batch Processor is the main component of a Big Data System, without which the system would not exist. It handles the most important characteristic of Big Data, i.e. Volume. This component brings the Data Storage, Data Processing, Data Security and Privacy, and Data Governance and Management challenges into the system. All these Big Data challenges have been described in detail in Section 3.3.
The Batch Processor needs to be able to do two things to do its job: store an immutable, constantly growing master dataset, and compute arbitrary functions on that dataset [Marz14]. The data being immutable means that the only write operation performed is appending a new data unit to the dataset. The storage should therefore be optimized to handle a large, constantly growing set of data.
The Batch Processor is also responsible for computing functions on the dataset to produce batch views. This means that the Batch Processor storage needs to be efficient at reading lots of data at once and does not necessarily have to be optimized for random access to individual pieces of data.

This is where Hadoop fits in, and where concepts like MapReduce and Apache Spark are introduced, since they are the technologies used extensively for batch processing.

Apache Hadoop
Apache Hadoop is an open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware [WikiHadoop14]. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software's ability to detect and handle failures at the application layer.

The Apache Hadoop framework is composed of the following modules (please note that we are describing Hadoop 2, since Hadoop 1 is now considered legacy):
  • Hadoop Common: contains libraries and utilities needed by other Hadoop modules.
  • Hadoop Distributed File System (HDFS):  a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
  • Hadoop YARN: a resource-management platform responsible for managing compute resources in clusters and using them for scheduling of users' applications.
  • Hadoop MapReduce: a programming model for large scale data processing.
All the modules in Hadoop are designed with a fundamental assumption that hardware failures (of individual machines, or racks of machines) are common and thus should be automatically handled in software by the framework. Apache Hadoop's MapReduce and HDFS components originally derived respectively from Google's MapReduce and Google File System (GFS) papers [WikiHadoop14].
Beyond HDFS, YARN and MapReduce, the entire Apache Hadoop "platform", called the Hadoop eco-system, is now commonly considered to consist of a number of related projects as well: Apache Pig, Apache Hive, Apache HBase, Apache Spark, and many more [Apache14].

The following diagram shows the complete Hadoop eco-system with its high-level architecture [Hortonworks14].



MapReduce
Distributed batch processing is an ideal technology choice when the use case is driven by processing massive amounts of data at rest. Distributed processing paradigms, such as MapReduce, enable some of these benefits by processing massive amounts of structured and unstructured data sources in support of new analytical capabilities [Heudecker13].

MapReduce is a distributed computing paradigm originally pioneered by Google that provides primitives for scalable and fault-tolerant batch computation. With MapReduce, one writes computations in terms of map and reduce functions that manipulate key-value pairs. These primitives are expressive enough to implement nearly any function, and the MapReduce framework executes those functions over the master dataset in a distributed and robust manner. Such properties make MapReduce an excellent paradigm for the precomputation needed in the batch layer, but it is also a low-level abstraction where expressing computations can require a large amount of work.

The reason why MapReduce is such a powerful paradigm is because programs written in terms of MapReduce are inherently scalable. A program that runs on ten gigabytes of data will also run on ten petabytes of data. MapReduce automatically parallelizes the computation across a cluster of machines regardless of input size. All the details of concurrency, transferring data between machines, and execution planning are abstracted for you by the framework.
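
As a concrete illustration of the paradigm, the classic word-count job below is expressed purely in terms of map and reduce functions over key-value pairs. The input and output paths are supplied as arguments and would typically point to HDFS directories.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: emit (word, 1) for every word in an input line
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: sum the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}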

Apache Spark
Apache Spark is a fast, in-memory, general-purpose data processing engine specially built to process large volumes of data iteratively at low latency with full fault tolerance [ApacheSpark14]. Spark is a computation system focused on the smart usage of memory. It can be used as a batch system, where it implements MapReduce and also adds operators for caching datasets in memory; as described earlier, Spark Streaming additionally provides a stream processing mode on top of the same engine. For many iterative batch algorithms, like graph and machine learning algorithms, the in-memory caching enables it to achieve much higher performance than the purely disk-based approach taken by MapReduce.

Technology Choice
Both technologies, i.e. MapReduce and Apache Spark, will be used in the batch layer depending on the requirements. Though Hadoop MapReduce has traditionally been used as the data processing engine in Big Data domains, its latency has started to become a major concern for modern applications. The latency is mainly due to the large number of I/Os it has to perform continuously to retrieve data from and push data into a file system. Apache Spark performs its processing in memory, which makes it faster than Hadoop MapReduce; the Spark website reports it to be 10-100x faster than MapReduce, both in memory and on disk [ApacheSpark14]. It also provides a rich ecosystem with frameworks like Spark Streaming, Spark SQL, MLlib, GraphX, etc. to handle other common requirements of Big Data applications.

Another reason for using Apache Spark is its approach to fault tolerance. Spark uses RDDs (Resilient Distributed Datasets), which rebuild lost data using "lineage": each RDD remembers how it was built from other datasets (by transformations like map, join, etc.) and can rebuild itself [Zaharia11]. Therefore, an RDD can be stored in memory without replication, avoiding I/Os between the various passes of an algorithm. This makes it suitable for analytics and data mining. Also, memory performance keeps improving while the disk I/O rate has been stagnant for a while now, so memory-based systems will keep getting faster.
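
The short Java sketch below illustrates this model: a filtered dataset is cached in memory and reused across two passes, and its lineage allows Spark to recompute lost partitions without replication. The HDFS path and the filter predicates are hypothetical.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkBatchSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkBatchSketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Load the (hypothetical) master dataset from HDFS; the RDD records its lineage
            JavaRDD<String> logs = sc.textFile("hdfs:///data/master/events.log");

            // Keep only error records and cache them in memory for iterative reuse
            JavaRDD<String> errors = logs.filter(line -> line.contains("ERROR")).cache();

            // Two passes over the cached RDD; only the first pass reads from disk
            long total = errors.count();
            long timeouts = errors.filter(line -> line.contains("timeout")).count();
            System.out.println("errors=" + total + ", timeouts=" + timeouts);
        }
    }
}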

Batch Service Provider

The Batch Service Provider indexes the pre-computed views generated by the Batch Processor and provides an interface to access the data in these views efficiently. This component is tightly coupled with the Batch Processor, since the Batch Processor is responsible for continually updating its views. These views will always be somewhat out of date due to the high-latency nature of batch computation, but the Stream Processor rescues the system by providing any data not yet available in the Batch Service Provider.

The operations in the Batch Service Provider are very data-intensive, so it is important to outline the requirements for the database used in this component, which are as follows [Marz14].

  • Batch writable: The batch views for this component are produced from scratch. When a new version of a view becomes available, it must be possible to completely swap out the older version for the updated view.
  • Scalable: The database must be capable of handling views of arbitrary sizes.
  • Random reads: In order to support low-latency queries, the database must support random reads, where indexes provide direct access to small portions of a view.
  • Fault-tolerant: Since the database is distributed, it must be fault-tolerant.

It may be obvious by now that the database used in this component does not need to support random writes, which makes the database very simple.

A simple database is more predictable since it fundamentally does fewer things. Since the Batch Service Provider views contain the overwhelming majority of the queryable data, the simplicity of the database for this component is a great property for the architecture to have.

The various technology choices available for the Batch Service Provider database are as follows.

  • HBase
  • ElephantDB
  • SploutSQL
  • Voldemort

Technology Choice
  • HBase
  • ElephantDB
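
As an illustration of the random-read requirement, the sketch below performs a single keyed lookup against a batch view stored in HBase using the standard HBase client API. The table name, column family, qualifier and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchViewReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // Hypothetical table holding a pre-computed batch view keyed by user id
             Table view = connection.getTable(TableName.valueOf("pageviews_by_user"))) {

            // Random read: fetch a single row of the batch view by its key
            Result result = view.get(new Get(Bytes.toBytes("user-42")));
            byte[] count = result.getValue(Bytes.toBytes("v"), Bytes.toBytes("count"));
            System.out.println("pageviews = " + (count == null ? 0 : Bytes.toLong(count)));
        }
    }
}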

Consolidated Analytics Service

The Consolidated Analytics Service is responsible for consolidating the results from the Stream Processor and the Batch Processor (exposed through the Batch Service Provider) by querying their respective databases and merging the results. It is because of this merging of views that this component gives the system the ability to provide near-real-time analytics over the complete dataset.

The complexity of the Consolidated Analytics Service comes from the fact that it has to handle completely different sets of technologies in order to merge the views from the Stream Processor and the Batch Processor. The technology used should therefore be generic and extensible.

The technology choice here is straightforward: picking a programming language that integrates well with both the Stream Processor technology stack and the Batch Processor (or, more specifically, the Batch Service Provider) technology stack.

The following are the options, based on the popularity of these languages and their support by most of the systems.
  • Java
  • Python
Technology Choice
Java

Java is the de facto preferred language for server-side programming because of its rich set of frameworks and libraries. It is robust and used in most enterprise solutions. It is also one of the most popular languages, so the total cost of development (tools, developers, and infrastructure) is low. At the same time, it integrates with all major Big Data frameworks.
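
A minimal sketch of the merge logic is shown below. The two view interfaces and the page-view metric are hypothetical stand-ins for the actual Batch Service Provider and Stream Processor stores chosen above; the point is only to show how a query combines the stale batch view with the real-time delta.

import java.util.HashMap;
import java.util.Map;

public class ConsolidatedAnalyticsService {

    // Served by the Batch Service Provider (e.g. an HBase batch view)
    interface BatchView { long pageviews(String userId); }

    // Served by the Stream Processor (increments not yet absorbed by the batch layer)
    interface RealtimeView { long pageviewsSinceLastBatch(String userId); }

    private final BatchView batch;
    private final RealtimeView realtime;

    public ConsolidatedAnalyticsService(BatchView batch, RealtimeView realtime) {
        this.batch = batch;
        this.realtime = realtime;
    }

    // Merge the two views: batch result plus the real-time delta
    public long pageviews(String userId) {
        return batch.pageviews(userId) + realtime.pageviewsSinceLastBatch(userId);
    }

    public static void main(String[] args) {
        // In-memory fakes standing in for the real stores, just to demonstrate the merge
        Map<String, Long> batchData = new HashMap<>();
        batchData.put("user-42", 1000L);
        Map<String, Long> realtimeData = new HashMap<>();
        realtimeData.put("user-42", 7L);

        ConsolidatedAnalyticsService service = new ConsolidatedAnalyticsService(
                id -> batchData.getOrDefault(id, 0L),
                id -> realtimeData.getOrDefault(id, 0L));
        System.out.println(service.pageviews("user-42")); // prints 1007
    }
}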

Service Delivery Interface

The Service Delivery Interface is responsible for distributing data and analysis results and making them available either to human users, through a user interface, or to other applications that work further with them, through an application interface.

User Interface 

One or several user interfaces are used for the visualization and presentation of data and results to end users. This can, for example, include portals or the distribution of reports as PDFs by e-mail. It decouples presentation, which is typically done at a client, from analytical functionality, which is done closer to the data in an application server layer or even in-database.

Application Interface (API)

Data stored in a Big Data system landscape, and its analysis results, are often provided to other applications. This can include technical interfaces and APIs to access data and processing results from the system, as well as message-passing and brokering functionality. Note that data-consuming applications can very well be identical to data sources, in which case the communication represents a feedback loop.

Meta-data Management

Metadata management refers to the extraction, creation, storage and management of structural, process and operational metadata. 

The first type describes the structure, schema and formats of the stored data and its containers, e.g. tables. Process metadata describes the different data processing and transformation steps, data cleaning, integration and analysis techniques. Operational metadata describes the provenance of each data item: when and from which sources it was extracted, which processing steps have been conducted and which are still to follow. Structural and process metadata need to be provided for all data structures and steps along the processing pipeline, and operational metadata needs to be collected accordingly.

Metadata management is therefore a cross-functional requirement consisting of:
  • Meta-data extraction
  • Meta-data storage
  • Provenance tracking
  • Meta-data access.

Data Store

The Data Store is responsible for providing storage capability to all the components of the Big Data System, including the Batch Processor and the Stream Processor. This is where the major challenge for Big Data lies, as explained in the previous section on Big Data challenges.

The Reference Architecture recommends a "polyglot" persistence model [Fowler11], driven by the idea that "one size does not fit all" by Stonebraker and Cetintemel [Stonebraker05]. The idea is that different components within the Big Data System rely on different database systems, depending on which data model and architectural characteristics best fit the requirements of a particular component. For instance, the database requirements of the Stream Processor differ from those of the Batch Processor: in the Stream Processor there are random reads and writes, while in the Batch Processor there are batch writes and reads.

It is necessary to treat storage as a separate concern, consolidated into a single layer, in order to achieve optimal usage of the underlying infrastructure and to have a single skilled team manage it.
The various types of data storage provided in the Data Store are as follows.
  • Distributed File System
  • RDBMS
  • NoSQL
  • In-Memory Databases (NewSQL)
  • Encrypted Databases

Distributed File System (DFS)

The DFS is primarily used to store data for the Batch Processor. The amount of data that the Batch Processor handles at any point in time is, in the true sense, Big Data; it can run into terabytes and petabytes. So the technology used as the DFS has to be fault-tolerant, robust and highly reliable. There is only one technology that really has the capacity to hold such a volume of data reliably and efficiently, and that is the Hadoop Distributed File System.

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. [ApacheHDFS14]

HDFS and Hadoop MapReduce are the two prongs of the Hadoop project: a Java framework for distributed storage and distributed processing of large amounts of data [Marz14]. Hadoop is deployed across multiple servers, typically called a cluster, and HDFS is a distributed and scalable file system that manages how data is stored across the cluster. Hadoop is a project of significant size and depth, so we will only provide a high level description.

In a Hadoop cluster, there are two types of HDFS nodes: a single namenode and multiple datanodes. When you upload a file to HDFS, the file is first chunked into blocks of a fixed size, typically between 64 MB and 256 MB. Each block is then replicated across multiple datanodes (typically three) that are chosen at random. The namenode keeps track of the file-to-block mapping and where each block is located. This design is shown in the subsequent figure.

Distributing a file in this way across many nodes allows it to be easily processed in parallel. When a program needs to access a file stored in HDFS, it contacts the namenode to determine which datanodes host the file contents. This process is illustrated in the following figure.
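
From a client's point of view this interaction is hidden behind the HDFS API. The short Java sketch below opens and reads a file (the path is hypothetical); the namenode lookup and the block fetches from the datanodes are handled transparently by the client library.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster settings (and hence the namenode address) from the Hadoop configuration
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(fs.open(new Path("/data/master/events.log")),
                                           StandardCharsets.UTF_8))) {
            // Blocks are fetched from the datanodes that host them as the stream is consumed
            System.out.println(reader.readLine());
        }
    }
}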

NoSQL DB

The past decade has seen a huge amount of innovation in scalable data systems. A new style of database called NoSQL (Not Only SQL) has emerged to process large volumes of multi-structured data.

A NoSQL or Not Only SQL [WikiNoSQL14] database provides a mechanism for storage and retrieval of data that is modeled by means other than the tabular relations used in relational databases. Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability. The data structure (e.g. key-value, graph, or document) differs from that of an RDBMS, and therefore some operations are faster in NoSQL and some in an RDBMS. The particular suitability of a given NoSQL DB depends on the problem it must solve (e.g., does the solution use graph algorithms?).

NoSQL databases are increasingly used in big data and real-time web applications. NoSQL systems are also called "Not only SQL" to emphasize that they may also support SQL-like query languages. Many NoSQL stores compromise consistency (in the sense of the CAP theorem) in favor of availability and partition tolerance. Barriers to the greater adoption of NoSQL stores include the use of low-level query languages, the lack of standardized interfaces, and huge investments in existing SQL. Most NoSQL stores lack true ACID transactions, although a few recent systems, such as Google Spanner and FoundationDB, have made them central to their designs.

NoSQL databases are often categorized according to the data model they support, as follows [Global13].

  • Key-value stores: hold a big hash table of keys and values, e.g. Riak, Amazon S3 (Dynamo).
  • Document stores: store documents made up of tagged elements, e.g. Couchbase, MongoDB.
  • Column-family stores: each storage block contains data from only one column, e.g. HBase, Cassandra.
  • Graph databases: network databases that use edges and nodes to represent and store data, e.g. Neo4j.

The above categorization works well at a high level, but at a lower level there are still major differences between individual databases within these categories. This refers to the architecture, replication and distribution models they use, and therefore the trade-offs they make between consistency, latency and availability. These differences make it difficult to discuss 'NoSQL' databases in a general way [Maier13].

NewSQL DB

NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system [WikiNewSQL14].

The various NewSQL DBs available are as follows.

  • Google Spanner
  • VoltDB
  • MemSQL
  • SAP HANA
  • FoundationDB

Technology Choice

  • Distributed File System: HDFS
  • NoSQL DB: Couchbase, HBase
  • New SQL DB: VoltDB
  • RDBMS: MySQL

External Service Orchestrator

The responsibility of this component is to connect the Data Processor components to external services in order to complete their business functionality. It can be viewed as an Enterprise Service Bus (ESB). It promotes agility and flexibility with regard to communication between applications.

There are various technologies available that can be used as an ESB. Popular among them are the following.

Apache Camel
Apache Camel is a rule-based routing and mediation engine that provides a Java object-based implementation of the Enterprise Integration Patterns using an API (or declarative Java Domain Specific Language) to configure routing and mediation rules [WikiCamel14].

Mule
Mule is a lightweight enterprise service bus (ESB) and integration framework. It can handle services and applications using disparate transport and messaging technologies [WikiMule14].

Technology Choice
Apache Camel 
It is chosen as it is the most robust option and has active community support. It offers a large set of DSLs, the most connectors, and support from various vendors.
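
For illustration, a minimal Camel route is sketched below using only endpoints from camel-core (a file directory and a log endpoint as hypothetical stand-ins); in a real deployment the endpoints would be JMS queues, HTTP services or other connectors from Camel's large connector set.

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class OrchestratorRouteSketch {
    public static void main(String[] args) throws Exception {
        DefaultCamelContext context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Route results produced by the Data Processor to an external service;
                // the directory, filter and target endpoint are placeholders for the sketch
                from("file:data/processed-results")
                    .filter(body().contains("ALERT"))
                    .to("log:external-notifications");
            }
        });
        context.start();
        Thread.sleep(10000); // keep the route running briefly for this sketch
        context.stop();
    }
}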

Security and Privacy

Security and Privacy considerations form a fundamental aspect of the Big Data Reference Architecture. This is depicted diagrammatically by having Security and Privacy as the foundation on which all the other components are built. At a minimum, a Big Data reference architecture must provide and ensure verifiable compliance with both GRC (Governance, Risk & Compliance) and CIA (Confidentiality, Integrity & Availability) regulations, standards and best practices.

Security and Privacy concerns are classified into four categories [NIST-BD-Security13]:
  • Infrastructure Security
  • Data Privacy
  • Data Management
  • Integrity and Reactive Security

Data Governance and Management

Soares [Soares13], in one of his presentations, mentions that Big Data governance is part of a broader data governance program that formulates policy relating to the optimization, privacy, and monetization of big data by aligning the objectives of multiple functions. Data Governance and its management play a very crucial role in the success of Big Data solutions.

Data Management is responsible for managing data coming into the system, residing within the system, and going out of the system for application usage. In other words, the role of Data Management is to enforce governance principles such that the data is accessible by the other components throughout its lifecycle, from the moment it is ingested into the system by the Data Provider until the data is disposed of. Moreover, this accessibility has to comply with policies, regulations, and security requirements.

In the context of Big Data, Lifecycle Management has to deal with the three V characteristics: Volume, Velocity and Variety. 

Configuration Management

Configuration management (CM) is a systems engineering process for establishing and maintaining consistency of a product's performance, functional and physical attributes with its requirements, design and operational information throughout its life [WikiCM14].

However, in the case of Big Data, it is more about managing the cluster of servers used for distributed computing. Big Data is also big when it comes to the number of nodes that have to be configured and managed in order to solve a business problem. Here we talk about hundreds and thousands of nodes, so Configuration Management becomes a very critical aspect of Big Data System maintenance.

A centralized service is thus required for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.

ZooKeeper
Apache ZooKeeper is a software project of the Apache Software Foundation, providing an open source distributed configuration service, synchronization service, and naming registry for large distributed systems. [WikiZK14]

ZooKeeper's architecture supports high availability through redundant services. The clients can thus ask another ZooKeeper master if the first fails to answer. ZooKeeper nodes store their data in a hierarchical name space, much like a file system or a Trie data structure. Clients can read and write from/to the nodes and in this way have a shared configuration service. Updates are totally ordered.

Technology Choice
ZooKeeper is the preferred choice for configuration management in most Big Data solutions, and major Hadoop clusters use ZooKeeper. It has been tried and tested by companies including Rackspace, Yahoo! and eBay, as well as by open-source enterprise search systems like Solr.
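
The following minimal Java sketch shows the shared-configuration use case described above: one process writes a configuration value to a znode, and any other node in the cluster can read it back. The connection string, znode path and value are assumptions made for the example, and for brevity the sketch does not wait for the session-established event.

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class SharedConfigSketch {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (host and port are assumed) with a no-op watcher
        ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, event -> { });

        // Publish a shared configuration value under a znode in the hierarchical name space
        byte[] value = "batch.interval=60s".getBytes(StandardCharsets.UTF_8);
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", value, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any other process in the cluster reads the same configuration back
        byte[] stored = zk.getData("/app-config", false, null);
        System.out.println(new String(stored, StandardCharsets.UTF_8));
        zk.close();
    }
}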

Infrastructure

Big Data requires a sufficient network and infrastructure backbone to operate. For a Big Data deployment, it is critical that the infrastructure framework is right-sized.

The infrastructure requirements of Big Data span all the components of the Big Data System.
For Data Acquisitors, since Big Data refers to data streams of high velocity and high variety, the infrastructure required to support acquisition must deliver low, predictable latency both in capturing data and in executing short, simple queries; be able to handle very high transaction volumes, often in a distributed environment; and support flexible, dynamic data structures. The same is the case with the Data Processors as well as the Data Stores. In the case of Data Stores, the infrastructure should be reliable enough to handle the complete life cycle of Big Data.

The solution for infrastructure lies in virtualization using cloud technologies. Infrastructure for Big Data can be provided in the following forms:
  • Public Cloud
  • Private Cloud
  • Hybrid
  • On-Premises.

Data Consumers

The Data Consumer is the role performed by end users or other systems in order to use the results of the Big Data Application Provider. The Data Consumer uses the interfaces (or services) exposed by the Big Data Application Provider to get access to the information of interest. These services can include data reporting, data retrieval, and data rendering [NBD-WG-RA13].
Data Consumer activities can include:
  • Data search, query, retrieval
  • Exploring data using data visualization software
  • Creating reports and organized drill-down using business intelligence software
  • Ingesting data into their own system
  • Putting data to work for the business, for example to convert knowledge produced by the big data applications into business rule transformation
  • Conversion of data into additional data-derived products
A Data Consumer can play the role of a Data Provider to the same system or to another system. The Data Consumer can provide requirements to the System Orchestrator as a user of the output of the system, either initially or in a feedback loop.



