Data Mining: A Technical Overview

Technical Overview

Data mining is used to create models. These models can be used for two different purposes: to describe something present in a set of data, or to predict future behaviour using present data.

Predictive data mining involves finding patterns in present data that can be used to anticipate future behaviour. Several types of models can be used for predictive mining: classification, clustering, and regression. The idea of classification is to separate data into groups based on a single discrete variable. For example, such a variable might measure how often a consumer uses store coupons. The variable for each consumer would take a value from the set { never, sometimes, often, usually, always }. Edelstein describes the classification model as a model that "examines a collection of cases for which the group they belong to is already known. It then uses the data to inductively determine the pattern of attributes or characteristics that identifies the group to which each case belongs."[2]

That is, the classification model partitions data into groups, which are then analyzed. Common attributes are discovered by analyzing present data, and these attributes are then used to predict, or make assumptions about, future behaviour. In the coupon example, one might be able to construct a "profile" of the type of person (demographically speaking) who is more apt to alter their consumption based on coupon promotions. A business could then focus its advertisements on that particular group.
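The idea of learning group profiles from cases whose group is already known can be sketched in a few lines of Python. This is only a minimal illustration, not the method Edelstein describes: it uses a simple lookup table that remembers the most common coupon-usage group for each attribute profile. The attributes (age band and household type), the training cases, and the functions `fit` and `predict` are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical cases: (age_band, household) -> known coupon-usage group.
TRAINING = [
    (("18-34", "single"), "never"),
    (("18-34", "single"), "never"),
    (("18-34", "family"), "sometimes"),
    (("35-54", "family"), "always"),
    (("35-54", "family"), "often"),
    (("35-54", "family"), "always"),
    (("55+", "single"), "sometimes"),
    (("55+", "family"), "always"),
]


def fit(cases):
    """Learn the most common known group for each attribute profile."""
    counts = defaultdict(Counter)
    for attributes, group in cases:
        counts[attributes][group] += 1
    # Fallback for profiles never seen in training: the most common group overall.
    default = Counter(group for _, group in cases).most_common(1)[0][0]
    table = {attrs: c.most_common(1)[0][0] for attrs, c in counts.items()}
    return table, default


def predict(model, attributes):
    """Return the group learned for this profile, or the overall default."""
    table, default = model
    return table.get(attributes, default)


model = fit(TRAINING)
print(predict(model, ("35-54", "family")))
```

A real classifier would generalize across attributes (for example, with a decision tree) rather than memorizing exact profiles, but the workflow is the same: learn from labelled cases, then assign new cases to a group.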

Where classification predicts a discrete variable, regression and "time-series forecasting" predict a continuous one (that is, a value in some interval of real numbers). Time-series forecasting also depends on time, so it may consider additional attributes such as time of day, day of the week, or time of year (holiday versus non-holiday, or seasonal differences). [2]
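As a rough illustration of regression, the sketch below fits a straight line to a handful of points by ordinary least squares and uses it to predict a continuous value. The data (coupon discount in percent versus units sold) are invented for the example, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented observations: discount offered (%) and units sold that week.
discount = [0, 5, 10, 15]
units_sold = [100, 120, 140, 160]

slope, intercept = fit_line(discount, units_sold)
# Predicted units sold for a discount that was not observed.
predicted = slope * 20 + intercept
print(slope, intercept, predicted)
```

A time-series forecast would follow the same pattern but add time-derived inputs, such as day of the week, alongside the other attributes.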

Descriptive data mining involves looking at present data and trying to identify relationships and trends. These relationships can be used to modify business processes. As in predictive data mining, clustering is a possible model, but association analysis and sequence discovery are also commonly used. Association analysis attempts to find relationships in data. These associations are generally of the form "If Event A occurs, then it is y% likely that Event B will also occur." Sequence discovery is similar to association analysis, except that a time dimension is added. That is, the results of "sequence discovery" might be "If Event A occurs, then y% of the time within x minutes Event B will also occur." [2]
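Both kinds of rule reduce to counting. The sketch below estimates the "y%" figure for an association rule (as the share of transactions containing A that also contain B) and for a sequence rule (as the share of A events followed by a B event within a time window). The transactions, the event log, and the function names `confidence` and `followed_within` are assumptions made for illustration.

```python
def confidence(transactions, a, b):
    """Share of transactions containing a that also contain b."""
    with_a = [t for t in transactions if a in t]
    if not with_a:
        return 0.0
    return sum(1 for t in with_a if b in t) / len(with_a)


def followed_within(events, a, b, window):
    """Share of a events followed by some b event within `window` minutes."""
    a_times = [t for t, name in events if name == a]
    b_times = [t for t, name in events if name == b]
    if not a_times:
        return 0.0
    hits = sum(
        1 for ta in a_times
        if any(ta < tb <= ta + window for tb in b_times)
    )
    return hits / len(a_times)


# Invented market baskets.
baskets = [
    {"bread", "butter"},
    {"bread"},
    {"bread", "butter", "milk"},
    {"milk"},
    {"bread", "butter"},
]

# Invented event log: (minute, event name).
log = [(0, "A"), (5, "B"), (20, "A"), (50, "B"), (60, "A"), (62, "B"), (100, "A")]

assoc = confidence(baskets, "bread", "butter")
seq = followed_within(log, "A", "B", window=10)
print(assoc, seq)
```

Here `assoc` reads as "if bread is bought, butter is also bought in this share of cases", and `seq` as "if A occurs, B follows within 10 minutes in this share of cases".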

Clustering, which may be used for both descriptive and predictive data mining, involves creating groups delineated along attributes. The goal is to find both similarities and differences among partitions of data (that is, member A is part of group X but not group Y). Clustering requires the attention of an analyst, since the "clusters" created are not usually obvious. [2]
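One common way to form such groups is k-means, shown here as a minimal one-dimensional sketch. The source does not name a particular clustering algorithm, so k-means is an assumption, as are the data (monthly spend per customer), the starting centers, and the function name `kmeans_1d`.

```python
def kmeans_1d(values, centers, max_iter=100):
    """Group values around centers by repeated assign-and-average steps."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        new_centers = [
            sum(c) / len(c) if c else centers[i]  # keep center if cluster is empty
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Invented monthly spend figures for six customers.
spend = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(spend, centers=[1.0, 10.0])
print(centers, clusters)
```

The algorithm only returns the groups. Deciding what each group means (for example, "occasional" versus "heavy" shoppers) is still left to the analyst.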