Pre-study Report - Spring 1997

Torgeir Dingsøyr - Endre M. Lidal

Introduction

The purpose of this report is to document the progress made in the pre-study phase. The second section contains the definitions of technical terms that will be used throughout the project. The third section covers criteria for data mining. The data mining methods and tools we have found are classified in the fourth section, together with a list of the methods we want to concentrate on in the rest of the project. Section five lists the literature we want to read, classified by type and subject. The last section contains the plan for the literature study phase.

Definitions

Below are the definitions we will use for some of the terms in the field of data mining and knowledge discovery. Other definitions of these terms exist, but we will use the following in our project.

Knowledge Discovery in Databases
: is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. [Fayyad, Piatetsky-Shapiro, Smyth]

Data mining
: is a step in the KDD process consisting of particular data mining algorithms that, under some acceptable computational efficiency limitations, produces a particular enumeration of patterns. [Fayyad, Piatetsky-Shapiro, Smyth]

Bayesian networks
: is a graphical presentation of probabilistic information, which uses directed acyclic graphs, where each node represents an uncertain variable, and each link represents direct influence, usually of causal nature. [Judea Pearl]

Inductive logic programming
: ILP systems develop predicate descriptions from examples and background knowledge. [Steve Muggleton]

Fuzzy logic
: Truth values are in the interval [0,1] rather than true or false. [D. Dubois, H. Prade]

Rough sets
: Rough set analysis is a first step in analyzing incomplete or uncertain information. Rough set analysis uses only internal information and does not rely on additional model assumptions as fuzzy set methods or probabilistic methods do. [Ivo Düntsch, Günther Gediga]

Neural networks
: Neural networks are composed of a number of nodes, or units, connected by links. Each link has a numeric weight associated with it. Weights are the primary means of long-term storage in neural networks, and learning usually takes place by updating the weights. Some of the units are connected to the external environment, and can be designated as input or output units. [Stuart J. Russell, Peter Norvig]
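
As an illustration of the neural network definition above, the following is a minimal sketch of a single output unit whose link weights are updated from examples. It is not taken from the cited sources. It assumes the simple perceptron learning rule, and the names predict, train and AND_EXAMPLES, as well as the toy data set, are hypothetical and chosen only for this example.

```python
# Minimal sketch: one threshold unit trained with the perceptron rule.
# All names and the toy data set are illustrative assumptions.

def predict(weights, bias, inputs):
    """Output unit: return 1 if the weighted sum of the inputs exceeds zero."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(examples, n_inputs, learning_rate=1.0, epochs=20):
    """Learn by repeatedly updating the link weights from labelled examples."""
    weights = [0.0] * n_inputs  # one numeric weight per input link
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # Each weight moves in proportion to its input and the error.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Logical AND as a toy data set: two input units, one output unit.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_EXAMPLES, n_inputs=2)
for inputs, target in AND_EXAMPLES:
    print(inputs, "->", predict(weights, bias, inputs), "target:", target)
```

After training, everything the unit has learned is stored in the values of weights and bias, which is the sense in which weights act as long-term storage. Networks with several layers of units need other update rules than the one assumed here.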

Criteria

Some technical criteria for data mining are defined in [Fayyad, Piatetsky-Shapiro, Smyth, Uthurusamy] and [Ronald J. Brachman, Tej Anand].

Practical criteria can be:

Methods and Tools

The subsections below contain a list of all the methods we have encountered, a list of the methods we want to concentrate our work on, and a list of tools for these methods.

Early Classification

As a preliminary step, we have chosen to classify the data mining methods into two groups: logically founded methods and statistically founded methods.
Logical methods:

Statistical methods:

Methods Selected for Further Study

We suggest narrowing the project down to three methods in order to study them in more detail. Based on the preliminary information on each method, we have selected the following:

We have chosen Rough Sets and ILP because research in these fields has been carried out at NTNU for some time. Bayesian Networks seem to be the most promising statistical technique for Data Mining.

Tools

The tools we have found so far are:

Literature

The literature is first sorted by type (printed or electronic) and then by subject.

Printed Literature

Printed literature includes books, articles, journals and magazines.

Electronic Literature

Electronic literature includes WWW sites and mailing lists.

Interviews

We have interviewed the following people so far:

Courses

We have attended some lectures from two courses to get more information for the project:

Literature Study Phase

The amount of literature in this project makes it difficult and impractical for both group members to read all of it. We have therefore divided some of the literature between us. Each group member will write a short summary of the texts they have read and present it to the other member. Important things to include in the summary:


Torgeir Dingsoyr
Sat May 3 15:20:22 MET DST 1997