Thursday, November 27, 2014

Big Data


The urge to be mobile and make a mark is as old as mankind. Ancient tribes moved around guided by stars, sent sound signals and left their ‘selfies’ on cave walls with colorful palm prints. Today we use GPS to get where we’re going, communicate on a variety of devices and are so obsessed with visuals that it would take an individual more than 5 million years to watch the amount of video that will cross global IP networks each month in 2018.
Sharing and generating all this information has led to a new era shaped by a phenomenon we call big data. It’s not easy to explain what big data is or, more importantly, how companies try to manage it so that it’s useful for business purposes. SAP’s head of Platform Solutions, Steve Lucas, said in his interview between two schnitzels that the best way to manage data is by developing cloud applications that allow you to run your business from a mobile phone. Simple, right?
Photo: Shutterstock
What’s different now?
The big difference between data then and now is that before, only humans created and collected data as they went about inventing more and more things and ways to make life easier. With the rise of sensors and other technology that creates and collects data, humans are no longer the center of the data solar system, creating everything, but are just another node in an increasingly autonomous data universe. In an article published in The Human Face of Big Data, Esther Dyson points out that the big in big data is about self-organization. Without our awareness, data is organizing itself, mostly following human rules but without human intervention, acting more like the immune system than the nervous system, which is ruled centrally by the brain. So what?
This means that while we may be able to observe the data around us, there is still much we don’t understand. Just as ancient people lacked knowledge about bacteria and their impact on health, we lack knowledge about how billions of objects really interact with their own virtual presence and identity, sending data they collect to other devices to coordinate common activity and making decisions humans are not even aware of.
Are we playing with fire?
Humans have always tried to model and shape the natural world and sometimes have lost control, leading to disasters and destruction. Think loss of habitat, the extinction of many species, or the change in climate and its impact around the globe. There are many lessons to be learned about messing with nature; the most important one is taking responsibility for the outcome. With big data revolutionizing what it means to mess with something humanity has never known before comes a new responsibility, because the purpose of managing data is not to predict the future but to shape it. And that’s a huge responsibility.
Rising to the challenge
Using technology that provides insight into data, today’s business leaders have a unique opportunity to make thoughtful decisions that will have long-lasting impact. A century ago no one could foresee how the automobile would become a ubiquitous mode of transportation that changed the world. Changes took place slowly in an evolutionary manner, unplanned and unmanaged, brought about by technological advances that led to safer and more reliable cars on one hand and messy traffic and massive ecological problems on the other.
Evolution hasn’t stopped! Connected cars are already here, and driverless, sensor-based cars are coming.  In the new world, people will rely on service providers to get around and won’t need to own cars. There will be no need for car dealerships, insurance, or parking. There will be no car accidents, no speeding tickets, and entirely different energy sources. Data in the new, connected network will autonomously ‘drive’ us safely to wherever we are going.
This kind of transformation has far-reaching implications for the entire ecosystem and our lifestyles. Innovative companies within the automotive ecosystem are already analyzing data to help them understand what needs to be done in 5-year chunks, so they can transform from an automotive company to a ‘mobility company’ using sustainable practices.
Disruptive changes like that are happening in every industry around the world.  Will today’s leaders rise to the challenge of shaping the future in a responsible way? Let’s not just be another node in the data universe.  Let’s leverage our tools and technology to better understand the data around us and use it to make a difference.

Big Data



In a recent survey of CIOs conducted for EMC by CIO Magazine, almost half the respondents
said they agreed with this statement: “Big Data analytics is an evolution but
not a revolution in the area of data warehousing, databases, and big file systems.”
The rest were about evenly divided among “game changer,” “pipedream,” and “not
sure.” What’s up with that? Isn’t Big Data changing the world?
To add perspective to that decidedly equivocal survey result, EMC+ went to EMC’s
Bill Schmarzo and Ben Woo of tech strategy firm Neuralytix. Schmarzo, a regular
EMC Big Data blogger, works in EMC Consulting, and previously was Vice President
of Advertiser Analytics at Yahoo. Before founding Neuralytix, Woo was Program Vice
President of IDC’s Worldwide Storage Systems Research, where he launched the
firm’s Big Data research effort.
We asked Schmarzo and Woo what they thought of that survey result, and neither
was surprised.
Schmarzo: In fact, I would have thought the differentiation would have been even higher.
I think the problem our industry faces is that we focus on the Facebooks, Yahoos,
Zyngas, Googles, and Twitters out there, who built this technology for themselves
because existing platforms didn’t work well for them. But those stories don’t resonate
with 99% of the marketplace.
The companies I talk to have almost no interest at all in what Facebook or Yahoo’s
been doing. It almost has no relevance to them. Whether it’s insurance policies, or
tractors, or toys, or shoes, whatever it is, the stories you hear are about companies
whose only business is data. Too many customers get the impression they can get
to the moon by climbing to the top of a tree.
EMC+: So is Big Data revolutionary or evolutionary?
Woo: Big data is not a new technology. It’s actually just an improvement on the old.
So the people who answered the survey are not entirely wrong in having that position.
Nothing about big data and its processes are anything revolutionary from that
perspective. What is revolutionary and game-changing is that for the first time we have
the necessary compute power, the necessary memory and network, the storage, and
most important, the necessary software to actually consider the entire set of data.
Schmarzo: It’s evolutionary in the technology, but it’s game-changing how it’s applied
to the business. That’s what makes this so interesting, that we’re not talking
game-changing from a technology perspective. That is, leverage your BI and data
warehouse assets to get more out of them. We’re talking game-changing in how you
deploy it at the point of customer engagement.
The challenge is that companies that don’t do this will be out of business. Big Data
is an evolutionary game-changer, where companies are figuring out, how do I bring
data into my product to make it more effective, more productive, more relevant
to the user?
EMC+: So we have a semantic problem here? Because there is something
revolutionary going on, somewhere. Is it a revolution in attitude
and vision more than anything else?
Woo: I think that certainly is a start. One of the challenges we have is the fact that we
haven’t been in a position to do this before, as we didn’t think it was possible. We
always thought it was too expensive. And in many situations, that was absolutely
true. So the difference here is, it’s no longer expensive. We don’t need to go out
and buy a multi-million dollar specialized system. And we can do this at close to
real time. You know, for many, many years, we used to put all this data into a data
warehouse. Then we would do a separate process to mine that data and look at it
and make assumptions about it. With big data, we can do a lot of things in parallel.
Schmarzo: You have to see the data you have as an asset, do some level of intelligent
transformation, which is heavily analytical and data-centric, and turn it into
something that customers want. That’s the start. Maybe it’s turning visitors into
audiences or turning sites and properties into inventory. That is what data monetization
is all about. Do some creative analysis with your data to create something that
your users want to buy.
Whether it’s on the cloud or cheap scale-out architectures on commodity microprocessors,
I can be a regional retailer with maybe 100 stores and I can build out a big
data platform using x86 processors, Hadoop, and an MPP Postgres architecture. It
doesn’t have to cost me very much money.
EMC+: So you can spend a little money, maybe an evolutionary amount
of money, to make a lot?
Woo: We’ve been looking at cost efficiencies since day one of commercial computing.
IT spend, for most organizations, is relatively minimal. We’re talking about low
single-digit spend on IT as a percent of revenue. So let’s just call it 2%, which is a
number that’s relatively well accepted.
If I save your company 10% on the technology, I’ve moved from 2% of revenue to
1.8% of revenue. That’s nice, it’s important, don’t get me wrong. But if I can take
technology as an investment and find or help generate new lines of revenue, new
products, or even more important, create some competitive advantage as a result
of it, now we’re talking about moving tens of percent on the top line.
And when you start looking at that, when you start moving the needle on revenue
and profitability, just by moving the revenue up, you’ve naturally moved the percent
of technology spend down.
EMC+: So basically, it’s just understand your business, and fantasize
about the ways you can make your customers happier. And out of that
is very likely to come a use case for big data.
Schmarzo: Exactly. An auto manufacturer, a telecom, a retailer. Look for opportunities
to leverage this data to provide a much better customer, or user, or shopper, or driver
experience. That, to me, is game-changing. It’s evolutionary in a sense that I’m not
asking people to change their business models. We’re asking them to take advantage
of the data they have already around them to provide a much better experience for
their consumers, for their employees, for their partners, for their suppliers, whoever
it might be within their ecosystem.
What it costs me is having the right kind of creative people who can take what assets
I have and look at what my customers are trying to buy, and make that transformation.
It may be taking advantage of the data I have, or maybe even to instrument
further to get more data. It probably relies heavily on doing some analytics, so I can
take that data and turn it into something that’s useful.
Because none of our customers want to buy data. They want to buy insights. They
want to buy things that make them more productive, make data easier to use,
or create a simpler experience.
EMC+: What is the greatest barrier to entry for a mid-sized company?
Schmarzo: Mindset and developing a Big Data skill set. In terms of mindset, it’s sitting
down and really thinking through what your customers want. I think that’s the
biggest challenge, because the technology is out there. As for the skill set, if you
have to rely on Hadoop to do some of this stuff, which today you have to in many
cases, it means you have to find or develop some level of Hadoop skills.
In fact, this is, I think, a problem that all organizations face, which is, how do I
re-task the skill sets I have? I’ve got all these BI and data warehouse people. They
understand how my customers think. They understand how to interface with them.
But they don’t know how to use some of these other new tools that allow them to
glean new insights out of this data.
So there’s a mindset challenge, and I think there’s a skills re-tasking challenge of how
do I move people from where they are today to give them some new skills. We’re not
asking them to scale Mount Everest. I think people started off by coding in COBOL.
They probably started in Fortran, to COBOL, to C++, to Java, and now we’re saying
you’re going to learn how to write Perl and MapReduce. It’s just coding.
Woo: Big data is not a technology decision. Big data is a business decision, a decision
to say, we believe in our customers and we have information about them we can use.
Every enterprise has information about their own customers. The ability to mine that
data, the ability to extract and abstract new ways of keeping that customer—that is
why you engage in big data. Generally IT is not the first to be initiating this stuff. This
is initiated from marketing or from sales or from the CFO or the CEO. And that’s where companies need to recognize, early in the conversation, that technology
is not just a spend. It is an opportunity to make money, and for the executives or
the business lines execs to ask themselves what do we have already that can make
us more money or have a better understanding of our customer? So those are the
first questions to ask.
EMC+: So you’ve got your mindset right. You’ve got some bright ideas
what you might do with your data. Can just about anybody start small
and accomplish something and build from there?
Schmarzo: Well actually, the key thing is that most organizations already have the
building blocks to have a big data game-changing experience. Let me give you an
example, going back to retail. Most companies have captured and stored, probably
on tape now, all their customer loyalty data, transaction data, right? They’ve got all
this data.
Now historically, they’ve aggregated it for reporting purposes, and so they take in all
the detailed point-of-sale data that came off the cash registers, and they aggregated
it to show, what are my sales by store, by product category, by product, etc. And they
lost the nuances in the data, because they’ve aggregated this data. They still
have that detailed data out there, but they’re working off a data warehouse platform
that can’t handle that detailed data.
Well, guess what, folks? The platforms that can handle that detailed data? They’re out
there. We happen to have, in Greenplum, one of the better ones. Not only because
we’ve got the same sort of scale-out capabilities, but because of the software-only
solution, which really allows companies to get in really cheap and grow incrementally,
without having to take that big giant hurdle of buying some Superdome from HP.
Woo: Yes you can start small, but the thing is to start. I guarantee you that your
competitors are doing something with their data. The most valuable things any
organization has are the records it holds, the transactions, and the digital data.
It’s a goldmine waiting to be mined. If you can understand your own customers
better, you’ll have a much better chance of understanding like-minded customers
and prospective customers.
EMC+: You’re going to need people to implement the program. Do you
buy those people or do you grow them?
Schmarzo: It depends on what you’re doing and who you have on your team—some
will have to do both. But I think to start you need to think about the assets you have
already. Can you grow these people? Because they’re the people who understand
how the organization works. If you bring in people from the outside, you bring in
people who have different business model exposures. Do you need that? You can
always teach technology. You don’t necessarily need to start with a whole new team.
Woo: Let me give you an example. A major multi-billion dollar manufacturer looked
at their own data in terms of revenue and customer satisfaction, etc. They wanted
to find deeper metrics about measuring themselves. But as they went deeper, they
found that they actually had more data on field service and field support of their
products than they realized. Ultimately they were able to do predictive analysis on
failure rates to the point that they could predict with a fairly high degree of certainty
the possible failures that are likely to happen as a result of wear and tear and other
things at their customer sites. Now they could be preemptive and proactive about
maintenance. They grew that insight from inside, with their own people, using data
they already had.
EMC+: How do you get the momentum going? Who starts the ball rolling?
Schmarzo: This stuff is only successful if it’s used to solve some very interesting
problems, or is used to enhance the value of an organization’s value chain. So you’ve
got to have an executive sponsor, and you’ve got to have somebody who understands
that it’s not about technology, it’s about business. The technology’s only an enabler.
I guess that gets back to the point. Is it a game-changer or is it evolution? The answer
is it’s both. And the companies that’ll be successful, the vendors who’ll be successful
in this space, are the ones who are able to convince IT that it’s an evolutionary move
that’s going to build off the assets you already built, but who can also convince the
business executives it’s a game-changer in how they reform their value chain, and
how they integrate and service their customers. That’s a tough line to walk. But once
you do, you can build a more competitive business.

Big Data Databases


Big data events almost inevitably offer an introduction to NoSQL and why you can't just keep everything in an RDBMS anymore. Right off the bat, much of your audience is in unfamiliar territory. There are several types of NoSQL databases and rational reasons to use them in different situations for different datasets. It's much more complicated than tech industry marketing nonsense like "NoSQL = scale."
Part of the reason there are so many different types of NoSQL databases lies in the CAP theorem, aka Brewer's Theorem. The CAP theorem states you can provide only two out of the following three characteristics: consistency, availability, and partition tolerance. Different datasets and different runtime rules cause you to make different trade-offs. Different database technologies focus on different trade-offs. The complexity of the data and the scalability of the system also come into play.
Another reason for this divergence can be found in basic computer science or even more basic mathematics. Some datasets can be mapped easily to key-value pairs; in essence, flattening the data doesn't make it any less meaningful, and no reconstruction of its relationships is necessary. On the other hand, there are datasets where the relationship to other items of data is as important as the items of data themselves.
Relational databases are based on relational algebra, which is more or less an outgrowth of set theory. Relationships based on set theory are effective for many datasets, but where parent-child or distance of relationships are required, set theory isn't very effective. You may need graph theory to efficiently design a data solution. In other words, relational databases are overkill for data that can be effectively used as key-value pairs and underkill for data that needs more context. Overkill costs you scalability; underkill costs you performance.
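A toy Python sketch makes the distinction concrete (all names and data here are hypothetical): some data loses nothing when flattened into key-value pairs, while other data is meaningless without its relationships.

```python
# Session cache: each value stands on its own, so a flat
# key-value layout loses nothing and lookups are a single hash hit.
sessions = {
    "user:42:session": "a91f3c",
    "user:99:session": "7bd210",
}
print(sessions["user:42:session"])  # one lookup by key, no joins

# Friendship graph: here the edges ARE the data. Flattening this
# into independent key-value pairs would discard exactly the
# parent-child/distance structure that set theory handles poorly.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice", "dave"},
    "dave": {"carol"},
}

# "Is dave within two hops of alice?" requires traversing relationships:
two_hops = set().union(*(friends[f] for f in friends["alice"]))
print("dave" in friends["alice"] | two_hops)  # True
```

The first structure maps naturally onto a key-value store; the second is the kind of data where a graph model earns its keep.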

Key-value pair databases

Key-value pair databases include the current 1.8 edition of Couchbase and Apache Cassandra. These are highly scalable, but offer no assistance to developers with complex datasets. If you essentially need a disk-backed, distributed hash table and can look everything up by identity, these will scale well and be lightning fast. However, if you find that you're looking up a key to get to another key to get to another key to get to your value, you probably have a more complicated case.
There are a number of different permutations of key-value pair databases. These are basically different trade-offs on the CAP theorem and different configurations of storage and memory use. Ultimately, you have some form of what is basically a hash table.
This is fine for flat parts lists so long as they don't composite. This is also fine for stock quotes, "right now," or other types of lists where that key has meaning and is the primary way you're going to look up the value. Usually these are combined with an index, and there is a way to query against the values or generate a list of keys, but if you need a lot of that, you probably should look elsewhere.
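The "disk-backed hash table" model is easy to demonstrate with Python's standard-library shelve module. This is only a sketch of the lookup-by-identity access pattern -- it has none of the distribution, replication, or CAP trade-offs a real store like Couchbase or Cassandra provides, and the keys and values are hypothetical.

```python
import os
import shelve
import tempfile

# A key-value store is, at heart, a disk-backed hash table.
# shelve gives the same "look everything up by identity" model.
path = os.path.join(tempfile.mkdtemp(), "quotes.db")

with shelve.open(path) as store:
    # Stock quotes "right now": the key has meaning and is the
    # primary way the value will ever be looked up.
    store["quote:ACME"] = {"price": 31.25, "ts": "2014-11-27T09:30:00"}
    store["quote:INIT"] = {"price": 8.10, "ts": "2014-11-27T09:30:00"}

with shelve.open(path) as store:
    print(store["quote:ACME"]["price"])  # fast: a single lookup by key

# Warning sign: chained lookups such as store[store[store["a"]]]
# mean your data has structure a plain key-value store can't express,
# and you should probably look at another database type.
```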

Column family/big table databases

Most key-value stores (including Cassandra) offer some form of grouping for columns and can be considered "column family" or "big table" stores as well. Some databases, such as HBase, were designed as column family stores from the beginning. This is a more advanced form of a key-value pair database: the keys and values become composite. Think of this as a hash map crossed with a multidimensional array, where each row key maps to a family of columns.
According to Robin Schumacher, the vice president of products for DataStax, which sells a certified version of Cassandra, "A popular use case for Cassandra is time series data, which can come from devices, sensors, websites (e.g., Web logs), financial tick data, etc. The data typically comes in at a high rate of speed, can come from multiple locations at once, adds up quickly, and requires fast write capabilities as well as high-performance reads over time slices."
You can also use MapReduce on these, so they can be good analytical stores for semi-structured data. These are highly scalable, but not usually transactional. If the relationships between the data are as important as the data itself (such as distance or path calculations), then don't use a column family/big table database.
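As a rough illustration of the "hash map crossed with a multidimensional array" idea, here is an in-memory Python sketch of a column-family row holding the kind of sensor time series Schumacher describes, with a slice read over sorted column names. The table layout, sensor name, and readings are hypothetical; a real store like Cassandra or HBase does this on disk, distributed, at scale.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

# Column-family model: row key -> {column name: value}, with columns
# kept in sorted order so contiguous slices can be read efficiently.
table = defaultdict(dict)

# Row key identifies the sensor and day; column names are timestamps,
# so writes at a high rate simply append new columns to the row.
row = "sensor42:2014-11-27"
table[row]["09:00"] = 21.4
table[row]["09:05"] = 21.9
table[row]["09:10"] = 22.3
table[row]["09:15"] = 22.1

def slice_columns(row_key, start, end):
    """Read a contiguous time slice, like a column-store slice query."""
    cols = sorted(table[row_key])
    lo, hi = bisect_left(cols, start), bisect_right(cols, end)
    return {c: table[row_key][c] for c in cols[lo:hi]}

# High-performance reads over time slices:
print(slice_columns(row, "09:05", "09:10"))  # {'09:05': 21.9, '09:10': 22.3}
```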

Document databases

Many developers think document databases are the Holy Grail since they fit neatly with object-oriented programming. With high-flying vendors like 10gen (MongoDB), Couchbase, and Apache's CouchDB, this is where most of the vendor buzz is generated.
Frank Weigel from Couchbase pointed out to me that the company is moving from a key-value pair database in version 1.8 to a document database in 2.0. According to him, the "document database is a natural progression. From clustering to accessing data, document databases and key-value stores are exactly the same, except in a document database, the database understands the documents in the datastore." In other words, the values are JSON, and the elements inside the JSON document can be indexed for better querying and search.
The sweet spot for these is where you're probably already generating JSON documents. As Max Schireson, president of 10gen told me, you should consider a document database if your "data is too complex to model in a relational database. For example, a complex derivative security might be hard to store in a traditional format. Electronic health records provide another good example. If you were considering using an XML store, that's a strong sign to consider MongoDB and its use of JSON/BSON."
This is probably your operational store -- where data is being collected from users, systems, social networks, or whatever else. It is not likely where you are reporting from, though databases such as MongoDB often have some form of MapReduce available. While, at least in MongoDB, you can query on anything, you will not generally achieve acceptable performance without an index.
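The difference Weigel describes -- values the database can see into and index -- can be sketched in a few lines of Python. The documents, field names, and hand-rolled secondary index below are hypothetical stand-ins for what MongoDB or Couchbase 2.0 maintain internally.

```python
# In a key-value store these records would be opaque blobs.
# A document store understands the fields inside each document,
# so it can index them for querying (hypothetical health records).
docs = {
    "p1": {"name": "Ada", "conditions": ["asthma"], "age": 41},
    "p2": {"name": "Bo",  "conditions": ["asthma", "flu"], "age": 29},
    "p3": {"name": "Cy",  "conditions": ["flu"], "age": 35},
}

# Secondary index on an element inside the document -- the kind of
# structure the database would build and maintain for you.
index = {}
for doc_id, doc in docs.items():
    for cond in doc["conditions"]:
        index.setdefault(cond, set()).add(doc_id)

# Query via the index instead of scanning every document:
print(sorted(index["asthma"]))  # ['p1', 'p2']
```

Without that index, the same query degrades to a full scan of every document -- which is why querying on unindexed fields rarely performs acceptably.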

Graph databases

Graph databases are really less about the volume of data or availability and more about how your data is related and what calculations you're attempting to perform. As Philip Rathle, senior director of product engineering at Neo Technologies (makers of Neo4j), told me, graph databases are especially useful when "the data set is fundamentally interconnected and non-tabular. The primary data access pattern is transactional, i.e., OLTP/system of record vs. batch... bearing in mind that graph databases allow relatedness operations to occur transactionally that, in an RDBMS world, would need to take place in batch."
This flies in the face of most NoSQL marketing: A specific reason for a graph database is that you need a transaction that is more correct for your data structure than what is offered by a relational database.
Common uses for graph databases include geospatial problems, recommendation engines, network/cloud analysis, and bioinformatics -- basically, anywhere that the relationship between the data is just as important as the data itself. This is also an important technology in various financial analysis functions. If you want to find out how vulnerable a company is to a bit of "bad news" for another company, the directness of the relationship can be a critical calculation. Querying this in several SQL statements takes a lot of code and won't be fast, but a graph database excels at this task.
You really don't need a graph database if your data is simple or tabular. A graph database is also a poor fit if you're doing OLAP or length analysis. Typically, graph databases are paired with an index to allow for better search and lookup, but the graph part has to be traversed; for that, you need a fix on some initial node.
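The relatedness query from the financial example above is, at bottom, a graph traversal. A minimal breadth-first search in Python over a hypothetical supplier graph shows the kind of operation a graph database optimizes; a real product such as Neo4j adds indexing, transactions, and a query language on top of this.

```python
from collections import deque

# Hypothetical supplier relationships: edges are the data.
# "How directly is AcmeCorp exposed to bad news at MetalsLtd?"
supplies = {
    "AcmeCorp": ["PartsCo", "ChipsInc"],
    "PartsCo": ["MetalsLtd"],
    "ChipsInc": [],
    "MetalsLtd": [],
}

def distance(graph, start, goal):
    """Breadth-first search: hops from start to goal, None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return hops
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None

print(distance(supplies, "AcmeCorp", "MetalsLtd"))  # 2
```

Expressing this in SQL means one self-join per hop, with the number of hops unknown in advance -- exactly the workload where a traversal-native store outperforms a relational one.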

Sorting it all out

Graph databases provide a great example of why it's so hard to name these new database types. "NewDB" is my preferred name -- except that, oops, some are as old as or older than the RDBMS. "NoSQL" isn't a great name because some of these support SQL, and SQL is really orthogonal to the capabilities of these systems.
Finally, "big data" isn't exactly right because you don't need large data sets to take advantage of databases that fit your data more naturally than relational databases. "Nonrelational" doesn't quite apply, either, because graph databases are very relational; they just track different forms of relationships than traditional RDBMSes.
In truth, these are the rest of the databases that solve the rest of our problems. The marketing noise of past decades, combined with hardware and bandwidth limitations, as well as lower expectations in terms of latency and volume, prevented some of the older kinds of databases from becoming as widely known as RDBMSes.
Just as we shouldn't try to solve all of our problems with an RDBMS, we shouldn't try to solve all of our math problems with set theory. Today's data problems are getting complicated: The scalability, performance (low latency), and volume needs are greater. In order to solve these problems, we're going to have to use more than one database technology.

Big Data's Challenges


These days, there are a number of buzzwords being thrown around the marketing industry and the data management space. One of the biggest? Say it with me: Big Data.
NPR argued last December that ‘big data’ should have been the “word of the year,” in part due to the re-election of President Barack Obama. Obama’s campaign managers didn’t let the Republicans’ monetary advantage discourage them. Instead, they gathered information on their voters and compiled important analytics based on that information. By handling this mass of data in an organized and well-thought-out process, they were able to appeal to voters more effectively and ultimately win re-election.
Marketers and corporations across the country were inspired by the campaign’s success, and have turned to big data to solve their problems as well. Anyone who catches the news on a regular basis, shops online, or owns a smartphone can see this evolution firsthand. However, it’s worth mentioning that this progression doesn’t necessarily mean “big understanding” or “big information.” Many companies are faltering in their efforts to harness big data and make real use of it. The pool of information is constantly changing, and as so many businesses rush to gather the data in real-time, it becomes even more challenging to keep pace and actively comprehend information as it becomes available.
And the challenges go beyond the initial harnessing of the data. As big data continues to grow, companies are running into issues of incorrect and duplicate data in their systems. This erroneous data is a result of poor processes that companies have in place, and oftentimes begins at the point of data input.
For a number of companies, data input is performed on a daily basis via their call centers. When incorrect data is recorded, it prevents sales representatives from getting leads in a timely manner, and hampers them further when they try to contact the correct individuals seeking assistance. The resulting slower response time then goes on to impact a company’s SLA and credibility to the population they serve.
There is no doubt that when processed correctly, big data can be integral to a company looking to improve their understanding of the customer’s needs and wants. But data quality is an important consideration during the transition, and one that must be confronted before big data can reveal all it has to offer.
To learn more about big data and how it relates to the data quality initiatives that may be taking place within your organization, watch Experian QAS’ webinar, “Ensuring Data Quality in your Big Data Initiative.”
