Everyone defines Big Data with a set of 3 or 4 or 10 V's. Are these V's actually giving us a definition of the Big Data concept, or are they trying to tell us something else? The main reason for using this V-based characterization is to highlight the challenges that come packed with Big Data: capturing, cleaning, curation, integration, storage, processing, and many more.
These V's point you toward the challenges you should prepare for. Challenges that may come your way when you start to manage Big Data, which:
- Increases in big volumes
- Grows with big velocity
- Generates big variety
- Changes with big variability
- Requires processes to maintain big veracity
- Demands big visualization when transformed
- Hides big value
These V's capture the important aspects of Big Data, and of a Big Data strategy, that organizations cannot ignore. Let's look at how each V contributes a different attribute to Big Data:
100 terabytes of data are uploaded daily to Facebook; Akamai analyzes 75 million events a day to target online ads; Walmart handles 1 million customer transactions every single hour. 90% of all data ever created was generated in the past 2 years.
The figures above truly depict what we mean by big volumes of data. It is this first characteristic that makes data "big data". This sheer volume in turn poses the challenge of storing it all.
Every minute of every day, we upload 100 hours of video to YouTube, send over 200 million emails, and post 300,000 tweets.
Underlying the volume numbers is an even larger trend: 90% of existing data has been created in just the last two years. This depicts velocity, the speed at which data is being created, stored, analyzed, and visualized.
The challenge for organizations is to cope with the enormous speed at which data is created and used in real time.
In the past, almost all data created was structured data that fit neatly into columns and rows, but those days are over. 90% of data generated today is unstructured, coming in all shapes and forms: from geospatial data, to tweets that can be analyzed for content and sentiment, to visual data such as photos and videos.
Variety describes one of the biggest challenges of big data. It can be unstructured and it can include so many different types of data from XML to video to SMS. Organizing the data in a meaningful way is no simple task, especially when the data itself changes rapidly.
Variability is often confused with variety. A simple example distinguishes them: Starbucks offers many flavors of cold coffee; that is variety. Now suppose you buy a Caffè Mocha every day, and it tastes and smells slightly different from the day before; that is variability.
Variability in big Data’s context refers to a few different things. One is the number of inconsistencies in the data. These need to be found by anomaly and outlier detection methods in order for any meaningful analytics to occur. Big data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources. Variability can also refer to the inconsistent speed at which big data is loaded into your database.
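The anomaly and outlier detection mentioned above can take many forms; a minimal sketch, assuming a simple z-score rule on a numeric stream (the threshold and data below are illustrative, not a production method):

```python
# Z-score outlier detection: flag values far from the mean in standard-deviation
# units. The threshold of 2.0 is an illustrative assumption.
def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]

# A stream of mostly-consistent readings with one glaring inconsistency:
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 500.0]
print(zscore_outliers(readings))  # → [500.0]
```

Real pipelines use more robust detectors (median-based scores, isolation forests), but the screening idea is the same: find the inconsistencies before analytics begins.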
What's crucial to understand about Big Data is its messy, noisy nature, and the amount of work that goes into producing an accurate dataset before analysis can even begin. Analysis is useless if the data being analyzed is inaccurate or incomplete.
This situation arises when data streams originate from diverse sources with a variety of formats and varying signal-to-noise ratios. The data may be rife with accumulated errors by the time it reaches the Big Data analytics stage.
Veracity is all about making sure the data is accurate, which requires processes to keep the bad data from accumulating in your systems. The simplest example is contacts that enter your marketing automation system with false names and inaccurate contact information. How many times have you seen Mickey Mouse in your database? It’s the classic “garbage in, garbage out” challenge.
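A veracity process for the "Mickey Mouse" case above can be as simple as a gatekeeper that rejects implausible records before they enter the system; a minimal sketch (the blocklist and email rule are illustrative assumptions, not a production validation scheme):

```python
import re

# Illustrative blocklist of joke names commonly entered into web forms.
KNOWN_FAKE_NAMES = {"mickey mouse", "donald duck", "test test"}

def is_plausible_contact(name, email):
    """Reject records with known joke names or malformed email addresses."""
    if name.strip().lower() in KNOWN_FAKE_NAMES:
        return False
    # Very loose email shape check: something@something.tld
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", email) is not None

contacts = [
    ("Ada Lovelace", "ada@example.com"),
    ("Mickey Mouse", "mickey@example.com"),   # fake name: rejected
    ("Grace Hopper", "not-an-email"),         # bad address: rejected
]
clean = [c for c in contacts if is_plausible_contact(*c)]
print(clean)  # → [('Ada Lovelace', 'ada@example.com')]
```

The point is where the check runs: at ingestion, so garbage never gets in rather than being cleaned out later.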
This is the hard part of Big Data; failing at it makes the huge volume of data useless. A core task for any Big Data processing system is to transform its immense scale into something easily comprehended and actionable. For human consumption, one of the best methods is converting it into graphical formats.
Current big data visualization tools face technical challenges due to limitations of in-memory technology and poor scalability, functionality, and response time. Traditional graphs cannot fulfill the need to plot a billion data points, so you need different ways of representing data, such as data clustering, or using tree maps, sunbursts, parallel coordinates, circular network diagrams, or cone trees.
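The core idea behind plotting beyond what traditional graphs can handle is aggregation: instead of drawing every point, bucket them into a coarse grid and render cell densities (the approach popularized by tools such as datashader). A minimal sketch, with illustrative grid size and data:

```python
def bin_points(points, x_range, y_range, cols, rows):
    """Count points per grid cell; the density grid is what gets rendered,
    so render cost depends on grid size, not on the number of points."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        if x0 <= x < x1 and y0 <= y < y1:
            col = int((x - x0) / (x1 - x0) * cols)
            row = int((y - y0) / (y1 - y0) * rows)
            grid[row][col] += 1
    return grid

points = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9)]
density = bin_points(points, (0.0, 1.0), (0.0, 1.0), cols=2, rows=2)
print(density)  # → [[2, 0], [0, 1]]
```

With a billion points the same loop (run in a distributed or vectorized engine) still yields a fixed-size grid that a screen can actually display.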
Value is the end game. The potential value of Big Data is huge. After taking care of volume, velocity, variety, variability, veracity and visualization – which takes a lot of time and effort – it is important to be sure that your organization is getting value from the data.
Of course, data in itself is not valuable at all. The value lies in the analyses done on that data, and in how the data is turned into information and eventually into knowledge.
The above 7 V's tell you about three important aspects of Big Data: its definition, characteristics, and challenges. But as people researched methods to face the aforesaid 7 V's challenges, they came across some other V's. Though these do not play as crucial a part, they complete the list of characteristics and challenges.
Similar to veracity, validity refers to how accurate and correct the data is for its intended use. Clearly, valid data is key to making the right decisions. Data validation also certifies uncorrupted transmission of data.
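Certifying uncorrupted transmission is commonly done by publishing a cryptographic digest alongside the payload and recomputing it on receipt; a minimal sketch using SHA-256 (the payload is illustrative):

```python
import hashlib

def digest(payload: bytes) -> str:
    """SHA-256 digest of the payload, published alongside it by the sender."""
    return hashlib.sha256(payload).hexdigest()

sent = b'{"customer_id": 42, "total": 19.99}'
published_digest = digest(sent)

received = sent  # pretend this crossed the network intact
print(digest(received) == published_digest)   # → True: transmission verified

corrupted = received.replace(b"19.99", b"99.99")
print(digest(corrupted) == published_digest)  # → False: corruption detected
```

Any single flipped bit changes the digest, so the receiver can reject corrupted data before it ever enters analysis.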
Give a thought to the statements below:
- What effect does time of day or day of week have on buying behavior?
- Does a surge in Twitter or Facebook mentions presage an increase or decrease in purchases?
- How do geo-location, product availability, time of day, purchasing history, age, family size, credit limit, and vehicle type all converge to predict a consumer’s propensity to buy?
Our first task is to assess the viability of that data. With so many varieties of data and variables to consider in building an effective predictive model, we want to quickly and cost-effectively test and confirm a particular variable's relevance before investing in the creation of a fully featured model. In other words, we want to validate a hypothesis before taking further action. In the process of determining a variable's viability, we can also expand our view to determine whether other variables, ones that were not part of our initial hypothesis, have a meaningful impact on our desired or observed outcomes.
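A cheap viability screen of this kind can be as simple as correlating a candidate variable with the outcome and only investing in modeling if it clears a threshold; a minimal sketch with illustrative data and cutoff:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Illustrative toy data: do social-media mentions track purchases?
twitter_mentions = [10, 20, 30, 40, 50]
purchases        = [12, 19, 33, 41, 48]

r = pearson(twitter_mentions, purchases)
print(f"r = {r:.2f}, worth modeling: {abs(r) > 0.5}")
```

Correlation is not causation, of course; the screen only decides which variables merit the cost of a fully featured model.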
How old does your data need to be before it is considered irrelevant, historic, or not useful any longer? How long does data need to be kept for?
When we talk about the volatility of Big Data, we can easily recall the retention policies for structured data that we implement every day in our businesses. Once the retention period expires, we can easily destroy the data.
Due to the velocity and volume of big data, however, its volatility needs to be carefully considered. You now need to establish rules for data currency and availability as well as ensure rapid retrieval of information when required.
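A volatility rule of the kind described above can be sketched as a retention-window purge (the 90-day window and record layout are illustrative assumptions):

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # illustrative retention window

def purge_expired(records, now):
    """Keep only records whose timestamp falls inside the retention window."""
    return [r for r in records if now - r["created"] <= RETENTION]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    {"id": 1, "created": datetime(2024, 5, 20, tzinfo=timezone.utc)},  # fresh
    {"id": 2, "created": datetime(2024, 1, 1, tzinfo=timezone.utc)},   # expired
]
print([r["id"] for r in purge_expired(records, now)])  # → [1]
```

At Big Data scale the same rule is usually enforced by partition expiry or TTL settings in the storage layer rather than a scan, but the policy it encodes is identical.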
Do you remember the Ashley Madison hack in 2015? Or that in May 2016 CRN reported that "a hacker called Peace posted data on the dark web to sell, which allegedly included information on 167 million LinkedIn accounts and 360 million emails and passwords for MySpace users"?
Big Data brings new security concerns with it. Given these characteristics, it becomes a challenge to develop a security program for Big Data. After all, a Big Data breach is a big breach.
So what does all of this tell us about the nature of Big Data? Well, it’s massive and rapidly-expanding, but it’s also noisy, messy, constantly-changing, in hundreds of formats and virtually worthless without analysis and visualization.
Volume, velocity, and variety are not just the key parameters of Big Data; they are the reason the concept was born in the first place, and the key features separating normal data from Big Data. While those three are intrinsic to Big Data itself, the other V's (variability, veracity, visualization, and value) are important attributes that reflect the gigantic complexity Big Data presents to those who would process, analyze, and benefit from it.
Unquestionably, Big Data is a key trend that corporate IT must accommodate with proper computing infrastructures. But without high-performance analytics and data scientists to make sense of it all, you run the risk of simply creating Big Costs without creating the value that translates into business advantage.