As companies develop their big data business cases, the platform and speed discussions are only part of the overall conversation about big data delivery. In reality, we’re seeing seven steps necessary for realizing the full potential of big data:
- Collect: Data is collected from the data sources and distributed across multiple nodes – often a grid – each of which processes a subset of data in parallel.
- Process: The system then uses that same high-powered parallelism to perform fast computations against the data on each node. Next, the nodes reduce the resulting data findings into more consumable data sets to be used by either a human being (in the case of analytics) or a machine (in the case of large-scale interpretation of results).
- Manage: Often the big data being processed is heterogeneous, originating from different transactional systems. Nearly all of that data needs to be understood, defined, annotated, cleansed and audited for security purposes.
- Measure: Companies will often measure the rate at which data can be integrated with other customer behaviors or records, and whether the rate of integration or correction is increasing over time. Business requirements should determine the type of measurement and the ongoing tracking.
- Consume: The resulting use of the data should fit in with the original requirement for the processing. For instance, if the point of bringing in a few hundred terabytes of social media interactions is to show whether and how social media data delivers additional product purchases, then there should be rules for how that social media data is accessed and updated. This is equally important for machine-to-machine data access.
- Store: As the “data-as-a-service” trend takes shape, increasingly the data stays in a single location, while the programs that access it move around. Whether the data is stored for short-term batch processing or longer-term retention, storage solutions should be deliberately addressed.
- Govern: Data governance encompasses the policies and oversight of data from a business perspective. As defined, data governance applies to each of the six preceding stages of big data delivery.
By establishing processes and guiding principles, governance sanctions behaviors around data. And big data needs to be governed according to its intended consumption. Otherwise, the risk is disaffection of constituents, not to mention overinvestment.
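To make the Collect and Process steps described above more concrete, here is a minimal sketch in Python. It assumes a toy setup: threads stand in for the nodes of a grid, and the records, the `product` field and the function names (`collect`, `process_partition`, `process`) are hypothetical, chosen only for illustration. A production platform would distribute the work across machines rather than threads.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def collect(records, node_count):
    """Collect: distribute records round-robin across node_count partitions."""
    partitions = [[] for _ in range(node_count)]
    for index, record in enumerate(records):
        partitions[index % node_count].append(record)
    return partitions


def process_partition(partition):
    """Per-node computation: count purchases per product in this subset only."""
    return Counter(record["product"] for record in partition)


def process(partitions):
    """Process: compute on every partition in parallel, then reduce the
    partial results into one smaller, more consumable summary."""
    # Assumption: threads stand in for grid nodes in this sketch.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partial_counts = list(pool.map(process_partition, partitions))
    return reduce(lambda left, right: left + right, partial_counts, Counter())


# Hypothetical sample data, invented for illustration.
records = [
    {"product": "shoes"},
    {"product": "hats"},
    {"product": "shoes"},
]
summary = process(collect(records, node_count=2))
```

Here `summary` holds one count per product, regardless of which partition each record landed on. The design point is that each node sees only its own subset, and only the small partial results are combined at the end.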
Most of the early adopters charged with researching and acquiring big data solutions focus on the Collect and Store steps at the expense of the others. The question is implicit: “How do we gather all these petabytes of data, and where do we put ’em all once we have ’em?”
But the processes for defining discrete business requirements for big data still elude many IT departments. Business people often see the big data trend as just another pretext for IT résumé building with no clear endgame. Such an environment of mutual cynicism is the single biggest culprit for why big data never transcends the tire-kicking phase.