
Data Mining

Data Mining can be described informally as a method for discovering knowledge in large databases. The discovery can be user-driven, where user knowledge together with data analysis produces new knowledge, or data-driven, where algorithms produce knowledge from databases. We focus on the second approach here, and view Data Mining as a step in the process of Knowledge Discovery in Databases, which Fayyad et al. [16] define as:

Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.

Here a pattern is an expression in a language that describes data, with a representation that is simpler than listing all the data it describes. Valid patterns also hold for new data with some degree of certainty. Novel patterns are new to the computer system or the human user, relative to expected values or existing knowledge. Potentially useful and ultimately understandable mean that the discovery should lead to some useful action, and that humans should be able to understand the patterns. Brachman and Anand [9] give a more human-centered definition of Knowledge Discovery:

Knowledge Discovery is a knowledge intensive task consisting of complex interactions, protracted over time, between a human and a (large) database, possibly supported by a heterogeneous suite of tools.

In most existing systems, the discovery of knowledge involves many tasks such as selecting, cleaning and filtering data. Data Mining is then defined as:

Data Mining is a step in the KDD process consisting of particular Data Mining algorithms that, under some acceptable computational efficiency limitations, produces a particular enumeration of patterns.
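The notions of pattern and validity used in the definitions above can be made concrete with a small sketch. It assumes a hypothetical rule, "IF income > 50 THEN buys = yes", and made-up records; the field names `income` and `buys` and the `confidence` helper are illustrative only and do not come from [16] or [9]. The rule is a pattern in the sense that it describes several records more compactly than listing them, and checking it against records that were not used to find it gives a rough indication of whether it is valid.

```python
def rule(record):
    """Hypothetical pattern: IF income > 50 THEN buys = yes."""
    return record["income"] > 50


def confidence(pattern, records):
    """Fraction of the records covered by the pattern whose target is 'yes'."""
    covered = [r for r in records if pattern(r)]
    if not covered:
        return 0.0
    return sum(r["buys"] == "yes" for r in covered) / len(covered)


# Made-up data: the rule is assumed to have been found on `seen`
# and is then checked on `new`, which was not used to find it.
seen = [
    {"income": 60, "buys": "yes"},
    {"income": 70, "buys": "yes"},
    {"income": 30, "buys": "no"},
    {"income": 55, "buys": "yes"},
]
new = [
    {"income": 80, "buys": "yes"},
    {"income": 52, "buys": "no"},
    {"income": 20, "buys": "no"},
    {"income": 65, "buys": "yes"},
]

print("confidence on seen data:", confidence(rule, seen))
print("confidence on new data: ", confidence(rule, new))
```

A rule that holds on the data it was found in but holds much less often on new data would not count as valid in the sense used here.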

The KDD process can be divided into nine steps, which are described further in [16] and [9] and illustrated in Figure 3.1. The steps are:

**Figure 3.1:** The Data Mining Process.

1.: Develop an understanding of the application domain, the relevant prior knowledge, and the goals of the end-user. This might also involve obtaining the legal right to use the data, and weighing ethical considerations regarding its use.
2.: Create a target data set: select a data set, or focus on a subset of variables or data samples, on which to do discovery. The data is usually retrieved from existing operational databases or data warehouses. Interesting features are selected and stored in a data table, often by using SQL. Most existing tools only handle data in flat ASCII format.
3.: Clean and preprocess data: remove noise, handle missing or not-applicable data fields, and handle time sequence information. For instance, some systems store date values as the number of days elapsed since a fixed reference date such as 1 January 1900 (a Julian-style day count). This step requires knowledge of the data domain, and includes looking for common errors such as mislabeled fields, special semantics (a feature may, for example, take the value 0 or 99 for ``missing''), and time travel (does a data object contain information that could not have been known at the time the model was intended to apply?).
4.: Reduce and project data: find useful features to represent the data depending on the goal of the task, reduce dimensionality, transform variables, and reduce the number of data records.
5.: Choose the data mining task: classification, clustering, regression. (These tasks are explained in section 3.1).
6.: Choose data mining algorithm(s) to be used in searching for patterns in the data.
7.: Data mining: search for patterns of interest in a particular representational form, such as classification rules or trees, regression models, or clusters.
8.: Interpret the mined patterns, possibly return to steps 1-7 for further iteration.
9.: Consolidate discovered knowledge: incorporate the knowledge into the performance system, or document it and report it to interested parties. Check and resolve potential conflicts with previously believed knowledge.
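The cleaning checks named in step 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular tool: the field names (`age`, `recorded_days`), the sentinel codes, the choice of 1 January 1900 as day 0, and the `clean_record` helper are all hypothetical.

```python
from datetime import date, timedelta

# Assumption: day counts are measured from 1 January 1900, taken as day 0.
EPOCH = date(1900, 1, 1)
# Assumption: these codes mean "missing" for the hypothetical "age" field.
MISSING_CODES = {"age": {0, 99}}


def serial_to_date(days):
    """Convert a day count since EPOCH into a calendar date."""
    return EPOCH + timedelta(days=days)


def clean_record(record, model_date):
    """Return a cleaned copy of record, or None if it fails the time-travel check."""
    cleaned = dict(record)
    # Special semantics: replace sentinel codes with an explicit missing marker.
    for field, codes in MISSING_CODES.items():
        if cleaned.get(field) in codes:
            cleaned[field] = None
    # Time sequence information: turn the stored day count into a real date.
    cleaned["recorded"] = serial_to_date(cleaned.pop("recorded_days"))
    # Time travel: drop records dated after the time the model is meant to apply.
    if cleaned["recorded"] > model_date:
        return None
    return cleaned


# Made-up records for illustration.
raw = [
    {"id": 1, "age": 34, "recorded_days": 35000},
    {"id": 2, "age": 99, "recorded_days": 35100},
    {"id": 3, "age": 41, "recorded_days": 36000},
]
model_date = date(1997, 1, 1)
candidates = (clean_record(rec, model_date) for rec in raw)
cleaned = [rec for rec in candidates if rec is not None]

for rec in cleaned:
    print(rec)
```

Which codes mean ``missing'' and which reference date a system uses can only be settled with knowledge of the data domain, which is why both are kept as explicit, named assumptions at the top of the sketch rather than buried in the logic.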

John [20] views the KDD process a bit differently, naming the main issues as Data Extraction, Data Cleaning, Data Engineering, Algorithm Engineering, Running the Algorithm on Data and, finally, Analyzing the Results.


Torgeir Dingsoyr
2/26/1998