I went to a conference and heard “We’re making a significant shift towards more fact-based decision making. And I was kind of like, what do people usually use? Horoscopes?” There’s now a big drive to be more data-analytical.
The first thing to know about big data according to Dave Coombes is that it’s an overhyped term. “There are few apocryphal stories like ‘We had this huge amount of data and then we shuffled it and, you know, it told us the answer to life, the universe and everything’ but they are few and far between.” he says. “Generally, it’s not a valid approach.‘”
Contrary to popular belief big data doesn’t happen when you get to gigabytes or terabytes but refers to any data analysis that is beyond the existing capabilities of a particular business.
The problem of analytical demands outstripping capability is not new. “We’ve always been at the point where we don’t quite have enough capacity.”, Dave says, “We’re always overstretching.” He has a theory: that the traditional data requirements of transactional applications tend to grow linearly, but post-Internet boom companies such as Google or Facebook base their business on network effects. Those effects scale quadratically, meaning such companies were always going to outgrow their data handling capabilities.
Often, too much data is not the problem. Although some businesses need to do high volume analysis, such as understanding Twitter streams or access logs, it’s not usually of their core business data. Most core business analysis is retrospective – describing things that happened last week, last month or the first quarter last year. But where things really get interesting is in predictive analysis – such as recommendation engines or unsupervised machine learning.
“My background is in theoretical physics and so I still value the scientific method where you posit some hypothesis and then run an experiment to look at your data and work out if what you thought was is true, is actually true.”, Dave says. But when you need to move your business quickly, you need to test and evaluate these hypotheses quickly and that doesn’t always leave room to build a traditional warehouse of historical data.
Dave’s latest project takes the same lean approaches used in application development and applies them to the areas of analytics, reporting and business intelligence to tailor the advertisements web customers see.
A strategic approach would be to consider everything known about customers and coerce all the available data into a single form that meets that requirement. But designing a data strategy that works now and in the future is difficult. It’s easier to take the problem we want to solve and work backwards. Dave asks, “What is the next most important thing, that if the business knew the answer to, they’d be able to make an informed decision?” When you start with a question, you can pull in the information you need rather than pushing everything into a model and analysing it.
Too often, says Dave, the Data Warehouse is forced to codify business rules about the data that other applications provide. The more effort put into creating the Data Warehouse, the more it becomes another application and, therefore, another database with it’s own semantic model. The compounding downside is the data quality is heavily influenced by the quality of its sources.
Codifying rules in the Data Warehouse stops it being an aggregator and creates place where assumptions are concreted in. These assumptions mean that the further you get from the source of the data, the fuzzier the answers you get will be. Keeping the data raw means that storage can be decoupled from presentation allowing you to quickly create new, testable views of that data.
Even so, quickly prototyping new views of data isn’t always the priority. With operational reporting such as end-of-year financial reporting, the accuracy of the information is valued above how quickly you can improve. With a single report such as an end-of-year financial report, you won’t get a chance to improve.
But in predictive analysis any new information is an improvement. If you want to start by increasing your confidence that you’re matching an advertisement to a particular type of person by 10 or 20 percent, then you have a chance to iterate and get feedback on the impact of your analysis. It’s a short cycle, and you can learn and adapt as you go.
The solution, Dave says, is not a custom reporting model. What is needed is to consider the analytics lifecycle when building applications. “The term ad-hoc reporting gives me the shivers.” he says. “Analysis shouldn’t be some sort of esoteric thing that sits off in a gridlock of Excel spreadsheets, or that last card in an Inception you write up when someone says ‘Oh, I guess we’ll need reporting’. It should be something that drives your line of business.”