Big Data: Definitions and Concepts


The following sections provide a definition of Big Data and explain the general concepts around it. The section first defines Big Data, then describes its characteristics, Big Data Analytics and Data Science, and finally covers the data elements, modelling, transaction, processing and infrastructure concepts that underpin Big Data systems.

Definition

Let’s look at some of the definitions of Big Data by leading research and consultancy firms.

Gartner, the well-known originator of the 3Vs of Big Data [Gartner14], defines Big Data as high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

McKinsey Global Institute (MGI) [McKinsey11] used the following definition: Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered big data.

IDC [Gantz12] defines Big Data technologies as a new generation of technologies and architectures designed to economically extract value from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis. According to IDC, Big Data has three main characteristics: the data itself, the analytics of the data, and the presentation of the results of the analytics.

All three definitions tie Big Data to technology: Gartner calls for “innovative” forms of processing, McKinsey places the data “beyond the ability” of typical tools, and IDC speaks of a “new generation” of technologies.

So, a summarized definition of Big Data is as follows:
A dataset that traditional data architectures and technologies are not able to manage and process efficiently enough to deliver the required business functionality. Here, “traditional” means “data curation first, storage later” data warehouse architectures, RDBMS-style data modelling, schema-on-write data loading techniques, centralized resources and so on.

The size of such a dataset can range from a few terabytes to many petabytes or even exabytes. In some domains, depending on the technologies available in that domain, even a few hundred gigabytes of data can be classified as “Big Data”. The definition of Big Data is therefore a moving target, determined by the state of technology in a particular domain at a particular point in time.


Big Data Characteristics

Doug Laney from Gartner [Laney01] was the first to attribute the 3Vs to data, though he did not use the term “Big Data”. They are Volume, Velocity and Variety. In other words, high Volume, Velocity and Variety (the 3 Vs) are the attributes that characterize data as Big Data.

Volume

Volume is the characteristic of data at rest that is most associated with Big Data. It is often estimated that 90% of all data ever created was created in the past two years, and that the amount of data in the world is doubling every two years. Should this trend continue, by 2020 there will be 50 times the amount of data there was in 2011. The sheer volume of the data is colossal; the era of a trillion sensors is upon us. This volume presents the most immediate challenge to conventional information technology structures.

Velocity

Velocity is the speed at which data is created, stored, analysed and visualized. In the Big Data era, data is created in real time or near real time. With the availability of Internet-connected devices, wireless or wired, machines and devices can pass on their data the moment it is created. Data flow rates are increasing with enormous speed and variability, creating new challenges for real-time or near-real-time data usage. Traditionally this concept has been described as streaming data.

Variety

The second characteristic of data at rest is the increasing need to use a Variety of data, meaning data that spans a number of data domains and data types. Traditionally, a variety of data was handled through transforms or pre-analytics that extracted features allowing integration with other data through a relational model. Given the wider range of data formats, structures, timescales and semantics that analytics now needs to draw on, the integration of this data becomes more complex: the data to be integrated could be text from social networks, image data, or a raw feed directly from a sensor source.

However, Shan Suthaharan [Suthaharan13] introduced a 3C model to classify Big Data, based on Cardinality, Continuity and Complexity. His argument is that, compared to defining a metric to measure Big Data characteristics in the 3V space, it is much easier to develop a metric in the 3C space using mathematical and statistical tools. In the 3C space, cardinality defines the number of records in the dynamically growing dataset at a particular instant. Continuity defines two characteristics: (i) representation of the data by continuous functions, and (ii) continuous growth of the data size with respect to time. Complexity defines three characteristics: (i) a large variety of data types, (ii) a high-dimensional dataset, and (iii) a very high speed of data processing.
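
As a rough, hypothetical illustration of how metrics might be computed in the 3C space, the sketch below measures cardinality (the record count at an instant) and the growth aspect of continuity for a dataset that accumulates over time; the function names, snapshot format and simple growth-rate measure are assumptions for illustration and are not taken from Suthaharan's paper.

```python
# Illustrative sketch (not Suthaharan's actual metrics): measuring two of the
# 3C characteristics for a dataset that grows over time.
from datetime import datetime, timedelta

def cardinality(records):
    """Cardinality: number of records in the dataset at this instant."""
    return len(records)

def growth_rate(snapshots):
    """Continuity (growth aspect): average records added per second,
    computed from (timestamp, record_count) snapshots."""
    (t0, n0), (t1, n1) = snapshots[0], snapshots[-1]
    elapsed = (t1 - t0).total_seconds()
    return (n1 - n0) / elapsed if elapsed > 0 else 0.0

if __name__ == "__main__":
    records = [{"id": i} for i in range(1_000)]            # toy dataset
    print("cardinality:", cardinality(records))            # 1000
    now = datetime.now()
    snapshots = [(now, 1_000), (now + timedelta(hours=1), 1_450)]
    print("growth rate (records/sec):", growth_rate(snapshots))  # ~0.125
```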

Big Data Analytics

Big Data Analytics is the application of Data Science to Big Data. It is the process of extracting valuable information from Big Data by efficiently processing and analyzing it. The knowledge obtained is used to make better-informed decisions about a given subject.
Big Data Analytics is the term associated with the new type of information extracted from complex datasets using new technical approaches such as MapReduce, information that previously could not have been obtained because of technology limitations and economic feasibility. It adds three more Vs to the Big Data paradigm: veracity of the data, value of the information extracted from that source as judged against the needs of the business, and visualization of that information in a format that conveys it in clear terms.
Big Data analytics engineering can be described as the use of advanced analytic techniques against very large, diverse data sets that include different types, such as structured/unstructured and streaming/batch, and different sizes, from terabytes to zettabytes.
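
Since MapReduce is named above as one of the new technical approaches, here is a minimal, single-process sketch of the programming model using the classic word-count example; it illustrates only the map, shuffle and reduce phases and does not reflect the API of any particular framework such as Hadoop.

```python
# Minimal illustration of the MapReduce programming model (word count).
# A single-process sketch; real frameworks distribute the phases over a cluster.
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) pairs for every word in the document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group intermediate pairs by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts for a single word."""
    return key, sum(values)

if __name__ == "__main__":
    documents = ["big data needs big ideas", "data beats opinion"]
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    results = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
    print(results)  # {'big': 2, 'data': 2, ...}
```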

Data Science

Data Science is the end-to-end data life cycle for extracting information directly from the data. It is the scientific study of the creation, validation and transformation of data to create meaning.

Data Science is a paradigm that refers to formulation of a hypothesis, the collection of the data to address the hypothesis (new or pre-existing), and the analytical confirmation or denial of the hypothesis (or the determination that additional information or study is needed).

Data Science thus can be defined as "extraction of actionable knowledge directly from the data through a process of discovery, hypothesis, and hypothesis testing".

Data Science involves the knowledge of various disciplines and domains including mathematics, computer science, data warehousing, and high performance computing.

A practitioner of Data Science is called a Data Scientist. A Data Scientist should have sufficient knowledge of the enterprise's business requirements, domain, analytics discipline, programming and system management to manage the end-to-end Big Data life cycle. Because of the complex nature of the Data Scientist's job profile, Data Science is usually practiced by a team.

Data Science is a superset of analytics. The various analytics approaches that can be applied to derive value from data are:

  • Blue Sky Data Science: An open-ended discovery process, a purely curiosity-driven exploratory branch of Data Science. It is not a very efficient approach, and it is better to follow a rapid hypothesis-exploration-confirmation cycle than to simply browse the data with no specific goals in mind.
  • Basic Data Science: The more traditional analytics approach, where there is a specific goal in mind. This aligns more closely with traditional statistics and data mining methods.
  • Applied Data Science: The development of practical applications, technologies and engineering practices in the field of Data Science.

Big Data Elements

The concepts explained in this section are not specific to Big Data and have been around for a long time, but they have a significant influence on the approaches taken to engineer Big Data systems.
Data elements can be categorized into the types below (a short sketch after the list illustrates each category):
  • Structured data: Information stored in a well-defined data model, e.g. RDBMS tables, etc.
  • Semi-structured data: Data identified by tags or other markers, e.g. XML, JSON, etc.
  • Unstructured data: Data that does not follow any specific format. It can be textual, e.g. emails, word documents, system logs, etc., or non-textual, e.g. videos, images, audio files, etc.
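A small, hypothetical sketch of the three categories follows; the field names and sample values are invented purely for illustration.

```python
# Hypothetical examples of the three data-element categories; field names
# and values are invented for illustration only.
import csv, io, json

# Structured: fixed columns, well-defined data model (e.g. an RDBMS table row).
structured = io.StringIO("customer_id,name,city\n42,Alice,Berlin\n")
row = next(csv.DictReader(structured))

# Semi-structured: self-describing tags/keys, flexible nesting (e.g. JSON, XML).
semi_structured = json.loads(
    '{"customer_id": 42, "name": "Alice", "orders": [{"sku": "A1", "qty": 2}]}')

# Unstructured: no predefined format; meaning must be extracted analytically.
unstructured = "Alice called from Berlin and asked about her last two orders."

print(row["name"], semi_structured["orders"][0]["qty"], len(unstructured.split()))
```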
As data becomes more open (for example, social media activity) and its usage persists far beyond the control of the data producers, it becomes vital to have information about how the data was collected, transmitted and processed. This data about data is known as metadata.
Another concept, again not specific to Big Data, is the presence of complexity in the data elements: the relationships between data elements, and their position and proximity to other elements, matter.

Data at Rest

Data at Rest is inactive data that is stored physically in any digital form, e.g. databases, data warehouses, spreadsheets, archives, tapes, mobile devices, etc. The term sometimes refers to all data in computer storage, excluding data that is traversing a network or temporarily residing in computer memory to be read or updated (Data in Use).
Volume and Variety are the two typical characteristics of data at rest that associate it with Big Data.

Data in Motion

Data in Motion is data that is being transmitted over a network. It represents the continuous interactions between people, processes, data and things that drive real-time data discovery.
Data in Motion needs to be processed and analyzed in real time or near real time, as it retains its highest value for only a short period of time.
Velocity and Variability are the two typical characteristics of data in motion that identify it with Big Data. Velocity is the speed at which data is created, stored, analyzed and visualized, while variability is the change in the rate of flow of the data.
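
To make the velocity and variability definitions concrete, the sketch below counts events per fixed time window (velocity) and the change in that count between consecutive windows (variability); the window size and event timestamps are assumptions chosen only for the example.

```python
# Illustrative sketch: velocity = events per window, variability = change in
# that rate between consecutive windows. Timestamps are in seconds.
from collections import Counter

def rates_per_window(timestamps, window_seconds=10):
    """Count events falling into each fixed-size time window."""
    counts = Counter(int(ts // window_seconds) for ts in timestamps)
    return [counts[w] for w in sorted(counts)]

def variability(rates):
    """Change in flow rate between consecutive windows."""
    return [b - a for a, b in zip(rates, rates[1:])]

if __name__ == "__main__":
    events = [0.5, 1.2, 3.9, 11.0, 12.5, 13.1, 14.8, 21.0]   # arrival times (s)
    velocity = rates_per_window(events)
    print("events per 10s window:", velocity)        # [3, 4, 1]
    print("variability:", variability(velocity))     # [1, -3]
```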

Big Data Modelling

The Big Data paradigm follows non-relational logical data models for the storage and manipulation of data across horizontally scalable resources. This concept is also known as schema-less data storage and is supported by “NoSQL” databases such as HBase, Couchbase, MongoDB, Cassandra, etc.
Having said that, it does not necessarily mean that the data stored in such databases has no structure. These databases support “schema-on-read”: instead of defining a data schema prior to loading data into the repository, the data is dumped as-is into the repository and the schema is applied while querying the data. The traditional data warehouse is “schema-on-write”, meaning the schema of a dataset must be defined before data is loaded into the repository.
As Big Data mostly deals with semi-structured and unstructured data such as text files, video, audio and graphics from multiple sources, it is beneficial to take a schema-on-read approach for storing and querying the data.
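
As a minimal sketch of the schema-on-read idea, the code below dumps raw JSON records as-is and applies a schema only at query time; the file name, field names and the query itself are assumptions for illustration, not the behaviour of any particular NoSQL database.

```python
# Minimal schema-on-read sketch: raw records are stored as-is (here, JSON lines)
# and a schema is applied only at query time. File and field names are invented.
import json

RAW_FILE = "events.jsonl"   # hypothetical landing file

def store_raw(records, path=RAW_FILE):
    """Schema-on-read: dump records without validating any structure up front."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def query(path=RAW_FILE):
    """Apply the schema while reading: pick only the fields this query needs."""
    with open(path) as f:
        for line in f:
            raw = json.loads(line)
            yield {"user": raw.get("user", "unknown"),
                   "amount": float(raw.get("amount", 0))}

if __name__ == "__main__":
    store_raw([{"user": "u1", "amount": "9.99", "extra": {"tag": "x"}},
               {"amount": 3}])                 # structure varies per record
    print(list(query()))
```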

Big Data Transactions

The Big Data paradigm is built on the principles of distributed computing. According to Brewer's CAP theorem, a distributed system has to forfeit one of three guarantees (Consistency, Availability, Partition tolerance) to achieve the other two. In the case of Big Data, it is usually preferred to forfeit Consistency in exchange for Availability and Partition tolerance.
Big Data storage models cannot support the strict ACID transaction model because of their distributed nature of deployment, so Big Data systems follow an alternative transactional model called BASE. The BASE acronym is often used to describe the types of transaction typically supported by NoSQL databases, and it stands for the following (a minimal sketch of eventual consistency follows the list):

  • Basically Available: The system guarantees availability, but the data returned for a request could be in an inconsistent or changing state.
  • Soft State: The state of the system may change over time as it works towards eventual consistency, so the state of the system at any point is a soft state.
  • Eventual Consistency: The data will be propagated across all the nodes over a period of time, which eventually makes the system consistent. However, the system continues to receive updates during the propagation process without confirming the consistency of previous transactions.
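
The following is a minimal sketch of eventual consistency, assuming a toy in-memory key-value store with two replicas and an asynchronous anti-entropy step; the class and function names are invented for illustration and do not correspond to any real NoSQL product.

```python
# Toy eventual-consistency sketch: writes go to one replica, reads from another,
# and a background "anti-entropy" step propagates updates later.
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}        # key -> (value, version)

    def write(self, key, value, version):
        self.data[key] = (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))[0]

def anti_entropy(source, target):
    """Propagate newer versions from source to target (eventual consistency)."""
    for key, (value, version) in source.data.items():
        if version > target.data.get(key, (None, 0))[1]:
            target.write(key, value, version)

if __name__ == "__main__":
    a, b = Replica("A"), Replica("B")
    a.write("cart", ["book"], version=1)      # accepted by replica A only
    print(b.read("cart"))                     # None -> stale read (soft state)
    anti_entropy(a, b)                        # asynchronous propagation
    print(b.read("cart"))                     # ['book'] -> eventually consistent
```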

Analytics Time Window (ATM)

It can be described as the time difference between when data enters the system and when action is required to be taken after analysis of that data.
At a high level, the available Analytics Time Window (ATM) can be derived from the following processing requirements:
  • Offline Processing
    • Data is first stored and later processed in batches. The ATM for offline processing is the largest. In the Big Data paradigm it is popularly known as batch processing.
  • Online Processing
    • Data is processed as it arrives and results are produced within a very small time interval. It is also known as real-time processing.
  • Interactive Processing
    • This is online processing of data, with the difference that further inputs based on the results are processed interactively.

The expected latency of processing the data, deriving the analysis and performing an action based on that analysis is a major consideration when designing a Big Data system.
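
As a hedged illustration of the offline and online styles (the running-average metric and the event values are assumptions chosen only for the example), the sketch below contrasts a batch computation over stored data with an online computation that updates its result as each event arrives and so keeps the ATM close to zero.

```python
# Illustrative contrast between offline (batch) and online (per-event) processing.
def batch_average(stored_events):
    """Offline processing: data is stored first, then processed in one pass."""
    return sum(stored_events) / len(stored_events)

class OnlineAverage:
    """Online processing: the result is updated as each event arrives,
    keeping the Analytics Time Window close to zero."""
    def __init__(self):
        self.count, self.mean = 0, 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

if __name__ == "__main__":
    events = [4.0, 6.0, 5.0, 9.0]
    print("batch result:", batch_average(events))
    online = OnlineAverage()
    for value in events:
        print("online result so far:", online.update(value))
```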

Big Data Frameworks

Big data frameworks are software libraries along with their associated algorithms that enable distributed processing and analysis of big data problems across clusters of compute units (e.g., servers, CPUs, or GPUs) [Grance13].

Big Data Infrastructure

Big data infrastructure is an instantiation of one or more big data frameworks that includes management interfaces, actual servers (physical or virtual), storage facilities, networking, and possibly back-up systems. Big data infrastructure can be instantiated to solve specific big data problems or to serve as a general-purpose analysis and processing engine [Grance13].
