Data Mining: Models

Many of the models used in data mining require some knowledge of statistics. To give an introductory "feel" for what is involved, two of the simpler (and better-known) models are discussed below. For those interested in the more mechanical and theoretical aspects of data mining, a resources page is provided as an appendix.

A Predictive Model: Neural Networks

A neural network is a collection of nodes (inputs, or parameters) that influence an output variable. This output variable is predictive in that it can model a discrete set of known observations as a continuous relationship. That is, in the simplest case the neural network takes several plotted points and creates a "line of best fit."

Each node is assigned a particular weight, which is adjusted through rigorous "training." Training is performed by comparing the output of the model (the neural net) with known values and then adjusting the weights of the inputs to reduce the difference. Over many repetitions, the neural network can come to predict outputs with considerable accuracy. The trained neural net can then be used to make predictions for previously unknown scenarios.
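The training loop described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a single output node that forms a weighted sum of its inputs and a simple error-proportional update rule; real neural networks add hidden layers and non-linear nodes. The function names, the learning rate, and the training data (generated from the made-up rule y = 2*x1 + 3*x2 + 1) are hypothetical and chosen purely for demonstration.

```python
def predict(weights, bias, inputs):
    """Output of a single node: the weighted sum of its inputs plus a bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def train_linear_node(samples, targets, learning_rate=0.05, epochs=5000):
    """Adjust the weights by repeatedly comparing predictions with known values."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, known in zip(samples, targets):
            # Compare the model's output with the known value...
            error = predict(weights, bias, inputs) - known
            # ...and nudge each weight in the direction that shrinks the error.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias


# Hypothetical training data following y = 2*x1 + 3*x2 + 1 (assumed for illustration).
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
targets = [1.0, 3.0, 4.0, 6.0, 8.0]

weights, bias = train_linear_node(samples, targets)
# Predict an input that was never part of the training data.
unseen_prediction = predict(weights, bias, (3.0, 2.0))
```

After training, the learned weights should sit close to the values used to generate the data, so the prediction for the unseen input (3.0, 2.0) should land near 13.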

Neural networks have two well-known problems. The first is that selecting the inputs for the model is often difficult: what if a very influential factor is missed? When such a "hidden" factor is left out, the predictive accuracy of the model is greatly reduced. Worse still, a model that fails to consider an important variable might still seem accurate during training.

The second problem is that neural networks tend to be "overtrained." That is, their weights tend to be adjusted too closely to the training data, and so they yield inaccurate results when presented with real-world data. Both problems can be mitigated with known techniques. [2]
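The text does not say which techniques are meant, so the following is only one commonly used guard against overtraining, offered as an assumption rather than as the method the source has in mind: hold back part of the known data as a validation set, and stop training once the error on that held-back data stops improving ("early stopping"). The function names, the patience setting, and the small noisy data set below are all hypothetical.

```python
def mean_squared_error(weights, bias, samples, targets):
    """Average squared difference between predictions and known values."""
    total = 0.0
    for inputs, known in zip(samples, targets):
        predicted = sum(w * x for w, x in zip(weights, inputs)) + bias
        total += (predicted - known) ** 2
    return total / len(samples)


def train_with_early_stopping(train_x, train_y, val_x, val_y,
                              learning_rate=0.05, max_epochs=1000, patience=10):
    """Train on one set, but keep the weights that did best on held-back data."""
    weights = [0.0] * len(train_x[0])
    bias = 0.0
    best_error, best_weights, best_bias, best_epoch = float("inf"), list(weights), bias, 0
    for epoch in range(1, max_epochs + 1):
        for inputs, known in zip(train_x, train_y):
            error = sum(w * x for w, x in zip(weights, inputs)) + bias - known
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
        # Judge the model on data it was NOT trained on.
        val_error = mean_squared_error(weights, bias, val_x, val_y)
        if val_error < best_error:
            best_error, best_weights, best_bias, best_epoch = (
                val_error, list(weights), bias, epoch)
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_weights, best_bias, best_error, best_epoch


# Hypothetical data: noisy training points and held-back points, roughly y = 2*x.
train_x, train_y = [(0.0,), (1.0,), (2.0,), (3.0,)], [0.1, 2.2, 3.9, 6.3]
val_x, val_y = [(0.5,), (1.5,), (2.5,)], [1.0, 3.0, 5.0]

best_weights, best_bias, best_error, best_epoch = train_with_early_stopping(
    train_x, train_y, val_x, val_y)
```

The design choice here is that the weights returned are the ones that performed best on the validation set, not simply the last ones produced by training.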

Decision Trees

Decision trees tend to be easier to understand than neural networks. A decision tree consists of a set of rules or conditions that lead to a decision. One might ask a question such as "Is person X a good student?" and then follow the branches of a decision tree to arrive at one answer or the other.

Decision trees are also easier to create, but because they are a simpler model (they are constrained to being a "tree" graph), they have limitations. First, each split typically tests a single variable, so decisions can't readily be based on combinations of variables. Second, tree paths aren't always enumerable (how does one partition something that's subjective?). Third, splits in the tree often don't consider "future" splits; that is, a decision tree can't be created keeping in mind the effect that a split at level n-5 will have on level n. [2]
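To make the idea of "a set of rules that lead to a decision" concrete, here is a small hand-built tree for the "good student" question, sketched in Python. The attribute names (average_grade, attendance_rate) and the thresholds are hypothetical, invented only for this example; in practice a tree would be derived from data rather than written by hand. Note that each split tests exactly one variable, which illustrates the single-variable limitation mentioned above.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """End of a path: the decision reached."""
    decision: str


@dataclass
class Split:
    """An internal node: tests one attribute against a threshold."""
    attribute: str
    threshold: float
    below: Union["Split", Leaf]
    at_or_above: Union["Split", Leaf]


def decide(node, record):
    """Walk from the root to a leaf, following one rule at each split."""
    while isinstance(node, Split):
        if record[node.attribute] >= node.threshold:
            node = node.at_or_above
        else:
            node = node.below
    return node.decision


# Hypothetical rules: grades are checked first, then attendance.
tree = Split(
    attribute="average_grade", threshold=70.0,
    below=Leaf("not a good student"),
    at_or_above=Split(
        attribute="attendance_rate", threshold=0.8,
        below=Leaf("not a good student"),
        at_or_above=Leaf("good student"),
    ),
)

person_x = {"average_grade": 85.0, "attendance_rate": 0.9}
answer = decide(tree, person_x)
```

For the record shown, both rules are satisfied, so the walk ends at the "good student" leaf.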