Data Mining can be described informally as a method for discovering
knowledge from large databases. It can be either user-driven, where user
knowledge together with data analysis produces new knowledge, or data-driven,
where algorithms produce knowledge from databases. We focus on the
second approach here, and view Data Mining as a step in the process
of Knowledge Discovery in Databases, which Fayyad et al. [16]
define as:
Knowledge Discovery in Databases (KDD) is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data.
Here a pattern is an expression in a language that describes data
and has a representation that is simpler than listing all the data
that it describes. That the patterns should be valid means that they
should also hold for new data with some degree of certainty. Novel
patterns are patterns that are new to the computer system or human
user, with respect to expected values or the knowledge already held.
Potentially useful and ultimately understandable mean that the discovery
should lead to some useful action, and that the patterns should be
understandable to humans. Brachman and Anand [9] have a more
human-centered definition of Knowledge Discovery:
Knowledge Discovery is a knowledge intensive task consisting of
complex interactions, protracted over time, between a human
and a (large) database, possibly supported by a heterogeneous suite
of tools.
In most existing systems, the discovery of knowledge involves many
tasks such as selecting, cleaning and filtering data. Data Mining is
then defined as:
Data Mining is a step in the KDD process consisting of particular
Data Mining algorithms that, under some acceptable computational
efficiency limitations, produces a particular enumeration of
patterns.
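To make these definitions concrete, the following is a minimal sketch, not taken from [16] or [9], of a data-driven search that enumerates patterns and then checks their validity on new data. The rule form (`x > threshold` implies a positive label), the function names, the candidate thresholds and the toy data are all assumptions made for illustration. Restricting the search to a fixed list of candidate thresholds stands in for the "acceptable computational efficiency limitations" of the definition.

```python
# Illustrative sketch only: rule form, names and data are assumptions.

# Each record is (feature value, label); all values are made up.
TRAIN = [(25, False), (32, False), (41, True), (47, True), (55, True), (38, False)]
NEW = [(29, False), (44, True), (39, True), (60, True)]


def accuracy(threshold, data):
    """Fraction of records for which the rule 'x > threshold' matches the label."""
    correct = sum(1 for x, label in data if (x > threshold) == label)
    return correct / len(data)


def enumerate_rules(train, thresholds, min_accuracy):
    """Enumerate every candidate rule that reaches min_accuracy on the training data.

    The candidate list is fixed in advance, which bounds the amount of search.
    """
    return [
        (threshold, accuracy(threshold, train))
        for threshold in thresholds
        if accuracy(threshold, train) >= min_accuracy
    ]


def validity(threshold, new_data):
    """Estimate how well a discovered rule holds on data not used to find it."""
    return accuracy(threshold, new_data)


if __name__ == "__main__":
    for threshold, train_acc in enumerate_rules(TRAIN, [30, 40, 50], 0.9):
        print(threshold, train_acc, validity(threshold, NEW))
```

Each rule returned is a pattern in the sense used above: a single threshold describes the training records more compactly than listing them, and its accuracy on `NEW` gives a rough measure of how far it holds for new data.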
The KDD process can be divided into nine steps, which are described further in
[16] and [9]. The process is illustrated in Fig. 3.1,
where the steps are:
Figure 3.1: The Data Mining Process.
1. Develop an understanding of the application domain, the relevant
   prior knowledge, and the goals of the end-user. This might also involve
   obtaining the legal rights to use the data, and ethical considerations
   regarding its usage.
2. Create a target data set: select a data set, or focus on a
   subset of variables or data samples, to do discovery on.
   The data is usually retrieved from existing operational databases or
   data warehouses. Interesting features are selected and stored in a
   data table, often by using SQL. Most existing tools only handle
   data in flat ASCII format.
3. Clean and preprocess data: remove noise, handle missing or not
   applicable data fields, and handle time sequence information. For
   instance, many database systems store date values as a day count, such
   as the number of days elapsed since 1 January 1900 (a Julian-style date).
   This step requires knowledge of the data domain, and includes looking
   for common errors like mislabeled fields, special semantics (a feature
   may for example take the value 0 or 99 for "missing"), and
   time travel (does a data object contain information that could
   not be known at the time the model was intended to apply?).
4. Reduce and project data: find useful features to represent the
   data depending on the goal of the task, reduce dimensionality, transform
   variables, and reduce the number of data samples.
5. Choose the data mining task: classification, clustering, or regression.
   (These tasks are explained in section 3.1.)
6. Choose the data mining algorithm(s) to be used in searching for patterns in
   the data.
7. Data mining: search for patterns of interest in a particular
   representational form such as classification rules or trees, regression,
   or clustering.
8. Interpret the mined patterns, possibly returning to steps 1-7 for further
   iteration.
9. Consolidate discovered knowledge: incorporate the knowledge into
   the performance system, or document it and report it to interested parties.
   Check and resolve potential conflicts with previously believed knowledge.
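Steps 2 and 3 above can be sketched in a few lines of code. The example below is a hedged illustration, not a description of any particular tool: the `customers` table, its columns, the use of 99 as the "missing" code for age, and the cutoff date are hypothetical, and the day-count origin of 1 January 1900 is taken from the example in step 3. An in-memory SQLite database stands in for the operational database.

```python
import sqlite3
from datetime import date, timedelta

EPOCH = date(1900, 1, 1)   # assumed origin of the stored day counts
MISSING_CODES = {99}       # assumed sentinel meaning "missing" for age


def build_source_db():
    """Create a tiny in-memory stand-in for an operational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(id INTEGER, age INTEGER, signup_day INTEGER, notes TEXT)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?, ?)",
        [
            (1, 34, 35000, "ok"),
            (2, 99, 35100, "age unknown"),
            (3, 51, 36000, "recorded after the cutoff"),
        ],
    )
    return conn


def select_target(conn):
    """Step 2: select only the features of interest, using SQL."""
    query = "SELECT id, age, signup_day FROM customers ORDER BY id"
    return conn.execute(query).fetchall()


def clean(rows, cutoff):
    """Step 3: decode sentinels, convert day counts, drop 'time travel' rows."""
    cleaned = []
    for row_id, age, signup_day in rows:
        signup = EPOCH + timedelta(days=signup_day)
        if signup > cutoff:
            # Not knowable at the time the model is meant to apply.
            continue
        if age in MISSING_CODES:
            age = None  # make the special semantics explicit
        cleaned.append((row_id, age, signup))
    return cleaned


if __name__ == "__main__":
    target = select_target(build_source_db())
    for record in clean(target, cutoff=date(1998, 1, 1)):
        print(record)
```

Representing a missing age as `None` instead of 99 keeps later steps from treating the sentinel as a real value, which is the kind of domain knowledge step 3 calls for.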
John [20] views the KDD process a bit differently: the main
issues are Data Extraction, Data Cleaning, Data
Engineering, Algorithm Engineering, Running Algorithm on Data,
and finally Analyze the Results.
Torgeir Dingsoyr
2/26/1998