Big Data Sources


Soares [Soares12] classifies typical sources of ‘Big Data’ into five categories:

1. Web and Social Media

The web is undoubtedly the richest source of data at present. It is also the most diverse data source.
Data from the web can be categorized as:
  • Unstructured Data
    • This is data meant mostly for direct human consumption, e.g. blogs, articles, images and videos.
  • Semi-structured Data
    • This is web content that is structured to be machine-readable. It is intended to let applications access the data, interpret it through its semantics, integrate data from different sources, put it into context and infer new knowledge, e.g. semantic web content. Semi-structured data also includes web logs, click streams, search queries, etc.
  • Structured Data
    • Some web applications collect information about their users in a structured manner, such as profile information or feedback. Because this data is already structured, it can be used more directly for analytics.
A final type of web data comes from social interactions. This can be communication data, e.g. from instant messaging services or status updates on social media sites.
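To make the three structural levels more concrete, here is a minimal Python sketch of how each kind of data is typically accessed. All sample values (the blog snippet, the click-stream event and the profile table) are made up for illustration and do not come from Soares' classification.

```python
import csv
import io
import json
import re

# Unstructured: free text meant for human readers (hypothetical blog snippet).
blog_post = "Just bought the new X200 camera for 499 EUR, and I love it!"
# Extracting a fact requires text processing, here a simple regular expression.
price_match = re.search(r"(\d+)\s*EUR", blog_post)
price_from_text = int(price_match.group(1)) if price_match else None

# Semi-structured: self-describing, but fields can be nested, null or missing
# (hypothetical click-stream event encoded as JSON).
click_event = json.loads(
    '{"user": "u42", "page": "/cameras/x200", "referrer": null, "tags": ["camera"]}'
)
referrer = click_event.get("referrer")  # tolerate an absent or null field

# Structured: every record has the same fixed columns
# (hypothetical user-profile table in CSV form).
profiles_csv = "user_id,country,age\nu42,DE,34\nu43,FR,28\n"
rows = list(csv.DictReader(io.StringIO(profiles_csv)))
average_age = sum(int(row["age"]) for row in rows) / len(rows)
```

The structured case can be aggregated directly, the semi-structured case needs defensive field access, and the unstructured case needs an extraction step before any analysis is possible.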

2. Machine to Machine (M2M) Data

The term M2M refers to the automatic exchange of data between machines without human intervention. It often describes a system of remote sensors or devices that continuously transmit data to a central system. The devices measure physical properties of an object, such as movement or temperature, and generate events accordingly. Typical measurement devices are sensors, RFID chips and GPS receivers.
A concept that has evolved from M2M technology is the Internet of Things (IoT). Kevin Ashton [Ashton09] offered insight into how the IoT is going to evolve and become a mainstream technology that will generate a huge amount of data. Machine-to-machine data is typically semi-structured.
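As a rough illustration of a device that measures a physical property and generates events from it, the following Python sketch turns temperature readings into JSON events for a central system. The function name `make_events`, the event fields and the threshold rule are assumptions chosen for this example, not part of any M2M standard.

```python
import json

def make_events(device_id, readings, threshold_celsius):
    """Turn raw temperature readings into semi-structured events.

    readings: iterable of (unix_timestamp, temperature_celsius) tuples.
    Only readings above the threshold produce an event (assumed rule).
    """
    events = []
    for timestamp, temperature in readings:
        if temperature > threshold_celsius:
            # Each event is a self-describing JSON document.
            events.append(json.dumps({
                "device_id": device_id,
                "timestamp": timestamp,
                "type": "temperature_high",
                "value": temperature,
            }))
    return events

# Hypothetical readings from a single sensor, one per minute.
readings = [(1700000000, 21.5), (1700000060, 30.2), (1700000120, 22.0)]
events = make_events("sensor-17", readings, threshold_celsius=25.0)
```

In a real deployment the events would be sent over a network to the central system; here they are only collected in a list.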

3. Big Data Transactions

Transactional data grew with the scale of the systems recording it and the massive number of operations they conduct [Chu06]. Transactions can be, for example, purchases from large web shops, call detail records (CDRs) from telecommunication companies or payment transactions from credit card companies. These typically produce structured or semi-structured data. Big transactions can also refer to transactions that are accompanied or formed by human-generated, unstructured, mostly textual data. Examples are call center records accompanied by personal notes from the service agent, insurance claims accompanied by a description of the accident, or health care transactions accompanied by diagnosis and treatment notes written by the doctor [Maier13].

4. Biometrics

Biometrics refers to metrics related to human characteristics and traits. Biometric identification (or biometric authentication) is used in computer science as a form of identification and access control. It is also used to identify individuals in groups that are under surveillance [WikiBio14]. Examples of such characteristics are fingerprints, DNA and retinal scans. Biometrics can also refer to behavioral characteristics, as captured by handwriting or keystroke analysis. One important use of large amounts of biometric data is scientific genomic analysis.

5. Human-generated Data

According to Soares [Soares12], human-generated data refers to all data created by humans. He mentions emails, notes, voice recordings, paper documents and surveys. This data is mostly unstructured. There is also a strong overlap with two of the other categories, namely big transaction data and web data. Big transaction data that is categorized as such because it is accompanied by textual data, e.g. call center agents’ notes, obviously overlaps. The same goes for some web content, e.g. blog entries and social media posts. This shows that the categories are not mutually exclusive: a given piece of data can fall into more than one category [Maier13].
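The overlap between categories can be sketched by assigning each data item a set of categories rather than a single label. The item fields (`origin`, `is_transaction`, `written_by_human`) and the rules in `categorize` are hypothetical and only serve to illustrate the point.

```python
def categorize(item):
    """Return every category that applies to a data item (illustrative rules)."""
    categories = set()
    if item.get("origin") == "web":
        categories.add("Web and Social Media")
    if item.get("is_transaction"):
        categories.add("Big Data Transactions")
    if item.get("written_by_human"):
        categories.add("Human-generated Data")
    return categories

# A call center record with agent notes: a transaction and human-generated.
call_center_record = {"origin": "crm", "is_transaction": True, "written_by_human": True}
# A blog entry: web data and human-generated.
blog_entry = {"origin": "web", "is_transaction": False, "written_by_human": True}

record_categories = categorize(call_center_record)
blog_categories = categorize(blog_entry)
```

Both sample items end up in two categories at once, which is why a set is a more natural fit here than a single category field.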
