
How To Make the Big Data Headache Go Away

Whew. I hope you got that. I need to spend more time at the Hadoop spa, that’s for sure.

But unlike typical business intelligence tools, Hadoop has the ability to “divide and conquer” large amounts of data in a unique way. As I have written many times on these pages, business intelligence (BI) is not my favorite topic. I have always thought it to be cumbersome, elitist and typically useless. But that’s just me.
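For readers who like to see the gears turning, here is a rough sketch of that "divide and conquer" pattern in plain Python: split the input into chunks, process each chunk independently (the "map" step), then merge the partial results (the "reduce" step). The word-count task, chunking scheme and function names are my own illustrative assumptions, not Hadoop's actual API, which does this across a whole cluster rather than one machine.

```python
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """'Map' step: count words in one independent chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(partials):
    """'Reduce' step: merge the partial counts from every chunk."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    documents = [
        "big data is not the same as big insight",
        "text mining turns raw text into structured signals",
        "hadoop splits the work and merges the results",
    ]
    # Split the corpus into chunks, process them in parallel, then merge --
    # the same divide-and-conquer shape Hadoop applies at far larger scale.
    chunks = [documents[i:i + 1] for i in range(len(documents))]
    with Pool() as pool:
        partials = pool.map(map_chunk, chunks)
    print(reduce_counts(partials).most_common(5))
```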

I’ll start with what Doug Henschen, executive editor of InformationWeek (and coincidentally a former colleague of mine in another life) had to say about big data analytics: Analytics and BI standardization is waning. “A decade ago,” Doug writes, “the BI market was consolidating as companies tried to standardize on fewer BI products that could be deployed throughout the organization. Times are changing. Cloud and mobile innovation and big-data exploration favor experimentation with a new generation of tools.”

Well, Doug might get an argument from the participants in this month's white paper. We found a couple of vendors in the space who hold another view:

“It is widely recognized today that the vast majority of information in any business is unstructured data, typically in text format such as reports, filled forms, emails, memos, log entries, transcripts, etc.,” writes Normand Peladeau, CEO of Provalis Research. “Most of the time, this rich source of information remains untapped—sometimes because companies are not fully aware of its potential value, more often because of the tremendous effort it takes to sift through such large volumes and dig out information manually.

“Text mining provides a viable solution. By combining natural language processing, statistical and machine learning techniques, text mining can quickly extract useful information from large collections of documents. A text mining tool will typically process a million words in a few seconds to automatically extract topics and discover unknown relationships and patterns. Companies see the real power behind text analytics when they combine text mining results with structured data.”
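To make that pipeline a little less abstract, here is a minimal topic-extraction sketch. The toy documents, the two-topic setting and the choice of scikit-learn are my assumptions for illustration; they are not Provalis Research's product or method, and a real tool layers far more linguistics and scale on top of this.

```python
# A minimal sketch of automatic topic extraction, assuming scikit-learn
# is installed; purely illustrative, not any vendor's implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = [
    "customer complained about late delivery and damaged packaging",
    "invoice dispute over billing error and refund delay",
    "shipment arrived late, packaging was torn",
    "billing team issued refund after invoice correction",
]

# Turn free text into a document-term matrix, dropping common stop words.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

# Fit a two-topic LDA model and print the top words per topic.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {topic_idx}: {', '.join(top)}")
```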

Normand continues, quite accurately, that “analyzing human language is a very complex task, and text mining is still, in many respects, in its infancy. Newcomers to text mining expecting their tools to readily provide comprehensive and precise answers to their questions may very well be disappointed. Moving beyond the most obvious to achieve greater detail and precision often requires some effort on the part of the text analyst. It involves building a custom dictionary composed of keywords, key phrases and rules. Such a crucial task may take days, weeks or, in some cases, months. Yet it still represents a tiny fraction of the time it would take to do the work manually. Once developed and validated, such a taxonomy becomes invaluable, allowing one to fully automate the analysis of newly obtained text data or to process incoming streams of text data in real time.

“Text mining regularly turns up previously hidden gems, and companies respond to them quickly and positively. Such insights give them the competitive advantage they are looking for, hidden this whole time in their very own ‘backyard’ data,” says Normand.
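What might such a custom dictionary of keywords, key phrases and rules look like? The sketch below is a hypothetical format of my own, not Provalis Research's, but it shows the basic idea: once the dictionary is built and validated, categorizing a new piece of text is fully automatic.

```python
import re

# Hypothetical custom dictionary: each category maps to keywords,
# key phrases and simple rules (here, regular expressions).
DICTIONARY = {
    "delivery_problem": {
        "keywords": {"late", "delayed", "lost"},
        "phrases": ["did not arrive", "wrong address"],
        "rules": [re.compile(r"\b\d+\s+days?\s+late\b")],
    },
    "billing_issue": {
        "keywords": {"refund", "overcharged", "invoice"},
        "phrases": ["billed twice"],
        "rules": [re.compile(r"\$\d+(\.\d{2})?\s+(error|discrepancy)")],
    },
}

def categorize(text):
    """Return every category whose keywords, phrases or rules match."""
    lowered = text.lower()
    words = set(re.findall(r"[a-z']+", lowered))
    hits = []
    for category, entry in DICTIONARY.items():
        if (words & entry["keywords"]
                or any(p in lowered for p in entry["phrases"])
                or any(r.search(lowered) for r in entry["rules"])):
            hits.append(category)
    return hits

print(categorize("Order arrived 5 days late and I was billed twice."))
# -> ['delivery_problem', 'billing_issue']
```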

Adaptive puts an even finer point on it. The company notes that a “data lake” often resembles a “data swamp.” “To get desired information, someone needs to have a basic understanding of where the data resides to extract it. In a data lake (as Adaptive likes to call it), massive amounts of data are ‘thrown’ into the lake with little contextual information. No one knows what the files are for, how up-to-date they are, who’s responsible for them, whether they can be used, etc. Likewise, any ‘marts’ formed out of the data lake need to have a detailed level of provenance back to the original data source. Analysts already have that for traditional ETL tools—it’s critical that the data lake provide the same capability. For any serious decisions made from the analysis, that same level of provenance is again needed, and in fact legally required by regulators.”
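One lightweight way to picture the fix is a catalog entry that records exactly the context Adaptive says is missing: what a file is, who owns it, how current it is, whether it may be used, and where any derived mart came from. The fields and example data below are a hypothetical sketch of mine, not Adaptive's product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProvenanceRecord:
    """Minimal catalog entry answering the questions a data swamp can't:
    what the file is, who owns it, how fresh it is, and its lineage."""
    path: str
    description: str
    owner: str
    last_updated: datetime
    approved_for_use: bool
    derived_from: list = field(default_factory=list)  # lineage back to sources

catalog = [
    ProvenanceRecord(
        path="lake/raw/web_logs/latest.json",
        description="Raw click-stream logs from the public website",
        owner="web-ops@example.com",
        last_updated=datetime(2015, 7, 1, tzinfo=timezone.utc),
        approved_for_use=True,
    ),
    ProvenanceRecord(
        path="lake/marts/marketing_funnel.parquet",
        description="Aggregated funnel metrics for the marketing team",
        owner="analytics@example.com",
        last_updated=datetime(2015, 7, 2, tzinfo=timezone.utc),
        approved_for_use=True,
        derived_from=["lake/raw/web_logs/latest.json"],  # provenance chain
    ),
]

for record in catalog:
    print(record.path, "<-", record.derived_from or "original source")
```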

They move on to the crux of it: “Organizations have made investments in ‘small data’ for years and many are achieving data governance, or at least understand the gap they need to fill. They know how to work with the relatively small number of technologies in play—including databases (standardized on SQL), ETL, DQ and BI—ideally all linked with modeling tools and/or a business glossary.

“These organizations are now embracing the promise of big data—a new frontier akin to the Wild West or the gold rush—with programmers and data scientists let loose with a daily-growing menagerie of languages and technologies outside the normal IT and governance structure. Sometimes this produces genuinely impressive-looking results and insights—especially those supporting marketing.

“But making decisions based on marketing insights can be low consequence compared to other potential analytical results, such as risk analysis and pricing information.”

So there’s no question that big data analytics and text mining are a huge undertaking, but many prospective partners and trusted advisors are on hand to help. You’ll find some of them in the pages to follow, and I hope you find a way through this maze with their help. Please read on and draw your own conclusions about the value of big data in your organization. I know I have.
