Posts Tagged ‘data cleanliness’

Click-through Data Adds to B2B Data Mining Possibilities

Thursday, June 26th, 2008

The knock on B2B data mining has always been that there isn’t B2C-like data available. Instead of multiple transactions that give us customer behavior patterns, we have company demographic information (industry, company size, revenue), some information about the person from the company who we’ll deal with (position/title), and where that person came from (lead source). It’s not behavioral data, which we know to be inherently better as a predictor than demographic data. But some data is better than none, right?

And we can certainly create transactional data that gives us some behavior pattern. If we throw in the contact schedule - the touches - from your company’s representatives, don’t you have a transactional pattern of both buying and non-buying customers? Coupled with the demographic data, you can drum up a model that predicts how many touches a lead might need to become a client and maybe a best guess at the path that should be pursued with a new lead.
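As a minimal sketch of the idea above, the touch log can be joined to the demographic data to estimate how many touches converted leads needed in each industry. Everything here is hypothetical: the lead names, the `industry` and `converted` fields, and the choice of the median as the summary statistic are assumptions for illustration, not a description of any particular CRM export or modeling method.

```python
from collections import defaultdict
from statistics import median

# Hypothetical touch log: one (lead, day) tuple per contact by a representative.
touches = [
    ("A", 1), ("A", 4), ("A", 9),
    ("B", 2), ("B", 3),
    ("C", 1), ("C", 5), ("C", 8), ("C", 12), ("C", 15),
    ("D", 6),
    ("E", 2), ("E", 3), ("E", 7), ("E", 10), ("E", 11),
]

# Hypothetical demographic and outcome data for each lead.
leads = {
    "A": {"industry": "software", "converted": True},
    "B": {"industry": "software", "converted": False},
    "C": {"industry": "manufacturing", "converted": True},
    "D": {"industry": "manufacturing", "converted": False},
    "E": {"industry": "software", "converted": True},
}


def touches_per_lead(touch_log):
    """Count how many touches each lead received."""
    counts = defaultdict(int)
    for lead, _day in touch_log:
        counts[lead] += 1
    return dict(counts)


def median_touches_to_convert(touch_log, lead_info):
    """Median touch count among converted leads, grouped by industry."""
    counts = touches_per_lead(touch_log)
    by_industry = defaultdict(list)
    for lead, info in lead_info.items():
        if info["converted"]:
            by_industry[info["industry"]].append(counts.get(lead, 0))
    return {industry: median(vals) for industry, vals in by_industry.items()}


summary = median_touches_to_convert(touches, leads)
print(summary)
```

A summary like this is only a starting point; a real model would also use the non-converting leads and the timing of each touch.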

More to the point of this post, this is the great thing about click-through data: it has a transactional quality. In fact, it just might be the transactional data for B2B companies. (Aside: This is also one of the reasons why companies like Omniture are becoming so notable: they provide some behavioral patterns, however small.) If we can combine click-through patterns from the person representing the prospect company with the company’s demographic information, then we might have a really interesting model that determines just how serious a lead is about buying from you and their company’s relative experience level with your product area.

Let me close out this post by refuting two of the main complaints about B2B data and its unsuitability for data mining-based models.

There’s Not Enough Data

Everybody loves data mining when it comes to consumer-focused companies. The vast amounts of transactional data are transfixing. The thinking goes something like this: “I’ve got hundreds of thousands of transactions here so whatever our predictive model spits out must be right.” Well, this may be true. And it mayn’t. But that doesn’t make a model built with less data any less compelling. It just means that one model has more data points. Don’t feel inadequate for the difference. Just make sure that you have data that’s important to the business problem you’re trying to solve. For example, if you want to know the next-best product for newly-minted customers, then you’d better have a solid set of second-time customers who bought a bunch of different products. Do you need thousands of these second-time customers? C’mon.

Missing and Bad Data

Isn’t this a reality everywhere? Even consumer-focused companies (with hundreds of thousands of transactions) have this issue. Oh, and I have a suggestion on what to do with that missing and bad data. Throw it out. Chances are, it will have absolutely no effect on the predictive models, unless of course all of the missing or bad data has a common characteristic that isn’t found in the rest of the data. For example, let’s say you’re building a model that predicts the next software product that a first-time customer might want from your company. Well, if everybody who bought a specific product as their first purchase is missing a zip code, then you can’t very well throw all of those records out. It would skew the model irreparably. But as long as the missing data is evenly distributed throughout the records, don’t be afraid to trash ’em.
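Before trashing records, it is worth checking whether the missing values really are spread evenly. The sketch below computes the share of missing zip codes per first product and flags any product that would be largely wiped out by dropping incomplete rows. The records, the field names, and the 0.5 threshold are all made up for illustration.

```python
from collections import Counter

# Hypothetical first-purchase records; None marks a missing zip code.
records = [
    {"first_product": "editor", "zip": "30301"},
    {"first_product": "editor", "zip": None},
    {"first_product": "editor", "zip": "30305"},
    {"first_product": "editor", "zip": "30309"},
    {"first_product": "backup", "zip": None},
    {"first_product": "backup", "zip": None},
    {"first_product": "backup", "zip": None},
    {"first_product": "reports", "zip": "10001"},
    {"first_product": "reports", "zip": "10002"},
]


def missing_rate_by_group(rows, group_key, field):
    """Fraction of rows in each group where `field` is missing (None)."""
    totals = Counter()
    missing = Counter()
    for row in rows:
        totals[row[group_key]] += 1
        if row[field] is None:
            missing[row[group_key]] += 1
    return {group: missing[group] / totals[group] for group in totals}


def groups_at_risk(rows, group_key, field, threshold=0.5):
    """Groups whose missing rate exceeds an assumed, arbitrary threshold."""
    rates = missing_rate_by_group(rows, group_key, field)
    return sorted(group for group, rate in rates.items() if rate > threshold)


rates = missing_rate_by_group(records, "first_product", "zip")
at_risk = groups_at_risk(records, "first_product", "zip")
print(rates)    # missing-zip rate for each first product
print(at_risk)  # products that dropping incomplete rows would mostly remove
```

If the at-risk list is empty, dropping the incomplete rows is unlikely to remove a whole segment; if it is not, those records need a different treatment.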

How Day to Day Data Becomes Predictive Intelligence

Wednesday, June 11th, 2008

Although predictive analytics systems have become more popular in the last couple of years, the term and the systems themselves are still surrounded by a great deal of mystique. In this post I will reveal the best-practices-based process we follow when delivering our predictive analytics solution, in an effort to remove some of the mystique surrounding these valuable systems.

Let me set the stage by defining what predictive analytics is and what information is needed. As the name suggests, predictive analytics systems attempt to forecast trends and behavior based on historical information. Essentially, they predict what will happen given past experience. A good marketing example is product bundling or cross-selling. If many customers are buying a Blu-ray player and a Spider-Man disc together, then the predictive analytics system will report the correlation and possibly drive a new campaign to offer a movie-player bundle.
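One simple way to surface a bundling candidate like this is to count how often product pairs appear in the same order and compare that with what chance alone would predict, a ratio usually called lift. The sketch below uses invented orders and product names, and it is a generic illustration of lift rather than the method any particular predictive analytics product uses.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each set holds the products bought together.
orders = [
    {"blu-ray player", "spider-man"},
    {"blu-ray player", "spider-man", "hdmi cable"},
    {"blu-ray player", "spider-man"},
    {"hdmi cable"},
    {"tv", "hdmi cable"},
    {"tv"},
    {"spider-man"},
    {"tv", "hdmi cable"},
]


def pair_lift(order_list):
    """Lift for every product pair: P(a and b) / (P(a) * P(b))."""
    n = len(order_list)
    item_counts = Counter()
    pair_counts = Counter()
    for order in order_list:
        item_counts.update(order)
        # Sort so each pair is always stored in the same (a, b) order.
        pair_counts.update(combinations(sorted(order), 2))
    lift = {}
    for (a, b), together in pair_counts.items():
        expected = (item_counts[a] / n) * (item_counts[b] / n)
        lift[(a, b)] = (together / n) / expected
    return lift


lifts = pair_lift(orders)
best_pair = max(lifts, key=lifts.get)
print(best_pair, lifts[best_pair])
```

A lift above 1 means the pair is bought together more often than independent purchases would suggest, which is the kind of correlation that could motivate a bundle.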

Not surprisingly, a predictive analytics solution is built on a foundation of data, specifically operational data. Operational data is a collective term having several definitions but for now we will define it as any data originating from a business operations system. Customer order information, on-line shopping activity and direct-mail responses are all examples of operational data.

There you have it. Predictive systems use your operational data to prophesy the future. In our case we are forecasting customer-specific trends and predicting how your customers will behave in various marketing scenarios. Now that we know what we are dealing with, let’s get into how your data is turned into a valuable analytics solution. The first step involves finding the right data to work with.

Data Selection and Retrieval

As you can imagine, even a small business generates vast amounts of operational data, so we must filter out the noise by locating and identifying the data relevant to our predictive analytics solution. Just like preparing to buy groceries, this step requires a human to review the available data sources (on-line traffic logs, historical orders, and customer portfolios) and then grade each data source by fidelity and quality. The data grading checklist used by Istobe is too comprehensive to discuss in this post, but here are some example questions to help you do the same:

  • Is the data redundant (e.g., do multiple account or customer numbers exist)?
  • Is the data updated by a human or a machine?
  • What is the data’s lifetime? In other words, how long does the data stay intact?
  • If the data is related to another source, how is the relation made?
  • Does the data drive any business decisions or is it directly used in any reports?

After each data source is graded, we can start to figure out what to keep and how to improve it. For the data sources that we want to keep, it is usually necessary to filter out dirty data by running it through a cleansing process. You may be surprised to hear that your data is probably very dirty, but dirty data exists even in third-party systems. Imagine these scenarios and you should get a feel for the hundreds of other ways dirty data can get inserted into your data sources:

  • Users trying out new features in a CRM system
  • Test data inserted for quality control
  • Data entry errors
  • Historical data that was updated but never removed
  • System upgrades or merges

In its most basic form the cleansing process sets out to eliminate the dirt by:

  • Standardizing specific values (e.g., date and time formats)
  • Removing duplicate information
  • Removing inconsistent data (e.g., orders which were never completed)
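The three cleansing steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the order rows, the `status` field, the list of date formats, and the rule that an order ID plus date identifies a duplicate are all hypothetical, and a real cleansing process would be driven by the grading done earlier.

```python
from datetime import datetime

# Hypothetical raw order rows: mixed date formats, one duplicate,
# and one order that was never completed.
raw_orders = [
    {"order_id": "1001", "date": "06/11/2008", "status": "completed"},
    {"order_id": "1002", "date": "2008-06-12", "status": "completed"},
    {"order_id": "1002", "date": "2008-06-12", "status": "completed"},
    {"order_id": "1003", "date": "2008-06-13", "status": "abandoned"},
    {"order_id": "1004", "date": "20080626", "status": "completed"},
]

# Date formats we assume appear in the source systems, tried in order.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%Y%m%d")


def standardize_date(text):
    """Convert a date string in any known format to ISO 8601 (YYYY-MM-DD)."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def cleanse(rows):
    """Drop incomplete orders, standardize dates, and remove duplicates."""
    seen = set()
    clean = []
    for row in rows:
        if row["status"] != "completed":
            continue  # inconsistent data: the order was never completed
        fixed = dict(row, date=standardize_date(row["date"]))
        key = (fixed["order_id"], fixed["date"])  # assumed duplicate key
        if key in seen:
            continue  # duplicate information
        seen.add(key)
        clean.append(fixed)
    return clean


clean_orders = cleanse(raw_orders)
for order in clean_orders:
    print(order["order_id"], order["date"])
```

Raising an error on an unrecognized date format, rather than silently skipping the row, keeps new kinds of dirt visible so the format list can be extended deliberately.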

The Data Selection and Retrieval phase is the most intrusive (as it requires collaboration between the custodian(s) of the data and the group building the predictive analytics system), but it is also the most important, as it lays the foundation on which everything else is built.

In the next post I will discuss how the cleansed data is used in the Knowledge Creation phase.