Pre-study Report - Spring 1997
Torgeir Dingsøyr - Endre M. Lidal
The purpose of this report is to document the progress made in the
pre-study phase. The second section contains the definitions of
technical terms that will be used throughout the project. Then there
is a section on criteria for data mining. The data mining methods and
tools we have found are classified in the fourth section, together
with a list of the methods we want to concentrate on for the rest of
the project. In section five there is a list of the literature we want
to read, classified by type and subject. The last section contains the
plan for the literature study phase.
Below are the definitions we will use for some of the words in the
field of data mining and knowledge discovery. There are also other
definitions on these words, but we will use the following in our
project.
- Knowledge Discovery in Databases
- : is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data. [Fayyad, Piatetsky-Shapiro, Smyth]
- Data mining
- : is a step in the KDD process consisting of
particular data mining algorithms that, under some acceptable
computational efficiency limitations, produce a particular
enumeration of patterns. [Fayyad, Piatetsky-Shapiro, Smyth]
- Bayesian networks
- : is a graphical representation of probabilistic
information, which uses directed acyclic graphs, where each node
represents an uncertain variable, and each link represents a direct
influence, usually of a causal nature. [Judea Pearl]
- Inductive logic programming
- : ILP systems develop predicate
descriptions from examples and background knowledge. [Steve Muggleton]
- Fuzzy logic
- : Truth values are in the interval [0,1] rather than
true or false. [D. Dubois, H. Prade]
- Rough sets
- : Rough set analysis is a first step in analyzing
incomplete or uncertain information. Rough set analysis uses only
internal information and does not rely on additional model
assumptions as fuzzy set methods or probabilistic methods do. [Ivo
Düntsch, Günther Gediga]
- Neural networks
- : Neural networks are composed of a number of
nodes, or units, connected by links. Each link has a numeric weight
associated with it. Weights are the primary means of long-term storage
in neural networks, and learning usually takes place by updating the
weights. Some of the units are connected to the external environment,
and can be designated as input or output units. [Stuart J. Russell,
Peter Norvig]
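To make the rough-set notions above concrete, the following is a minimal sketch (in Python, as a present-day illustration rather than any tool discussed in this report) of computing the lower and upper approximations of a concept from a decision table. The toy table, its attributes and all object names are invented for illustration only.

```python
# Toy decision table: objects described by (headache, temperature) -> flu.
# All data here is invented for illustration.
table = {
    "o1": (("yes", "high"), "yes"),
    "o2": (("yes", "high"), "no"),
    "o3": (("no", "normal"), "no"),
    "o4": (("yes", "normal"), "yes"),
}

def equivalence_classes(table):
    """Group objects that are indiscernible on the condition attributes."""
    classes = {}
    for obj, (conditions, _decision) in table.items():
        classes.setdefault(conditions, set()).add(obj)
    return list(classes.values())

def approximations(table, concept):
    """Lower and upper approximation of a set of objects (a concept)."""
    lower, upper = set(), set()
    for cls in equivalence_classes(table):
        if cls <= concept:   # class lies entirely inside the concept
            lower |= cls
        if cls & concept:    # class overlaps the concept
            upper |= cls
    return lower, upper

flu = {o for o, (_, d) in table.items() if d == "yes"}
lower, upper = approximations(table, flu)
```

The boundary region `upper - lower` holds exactly the objects the data cannot classify unambiguously: here o1 and o2 are indiscernible on the condition attributes yet have different decisions, so both end up in the upper but not the lower approximation of the flu concept.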
Some technical criteria for data mining are defined in [Fayyad,
Piatetsky-Shapiro, Smyth, Uthurusamy], and [Ronald J. Brachman, Tej Anand]
- Availability of sufficient data (cases)
- Prior knowledge (Background knowledge)
- Relevance of attributes
- Noise level
- Interpretability
- Output
- Scalability
- Handling of uncertainty
- Metaknowledge
- Who is the user? Human control/interaction
- Number of concepts
- Model underfit/overfit
Practical criteria can be:
- Greater revenue
- Lower costs
- Higher quality
- Time saving
The subsections contain a list of all the methods we have encountered, a
list of the methods we want to concentrate our work on, and tools
implementing the methods.
Preliminarily, we have chosen to classify the data mining methods into two
groups: logically founded methods and statistically founded methods.
Logical methods:
- Rough Sets
- Inductive Logic Programming
- Empirical Methods
- Interactive Methods
- Fuzzy Logic/Sets
Statistical methods:
- Bayesian Network
- Neural Networks
- Adaptive Splines
- Linear Models
- Non-Parametric Methods
- Projection Pursuit
- Polynomial Networks
- Decision Trees
We suggest narrowing the scope down to three methods in order to study
them in more detail. Based on the preliminary information on each method,
we have selected the following:
- Logical methods:
- Rough Sets
- Inductive Logic Programming
- Statistical Methods:
- Bayesian Networks
We have chosen Rough Sets and ILP because research in these fields
has been carried out at NTNU for some time. Bayesian networks seem to
be the most promising statistical technique for data mining.
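As a small illustration of what a Bayesian network provides, the sketch below (Python, with conditional probability numbers invented purely for illustration; it is not any tool evaluated in this project) shows how a two-node network Rain -> WetGrass factorizes the joint distribution and supports inference by Bayes' rule.

```python
# A minimal Bayesian network: two binary variables, Rain -> WetGrass.
# The probability numbers are invented for illustration.
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {
    True:  {True: 0.9, False: 0.1},  # P(WetGrass | Rain=True)
    False: {True: 0.2, False: 0.8},  # P(WetGrass | Rain=False)
}

def joint(rain, wet):
    """Joint probability via the network factorization P(R) * P(W | R)."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

def p_rain_given_wet(rain):
    """Infer P(Rain | WetGrass=True) with Bayes' rule."""
    evidence = joint(True, True) + joint(False, True)
    return joint(rain, True) / evidence
```

The directed acyclic graph thus encodes the joint distribution compactly (one prior and one conditional table instead of a full joint table), and observing a node lets us reason backwards along the causal link.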
The tools we have found so far are:
- Rough Sets: Rosetta and RSDA
- ILP: Progol, RIPPER and ID3 (C4.5)
- Bayesian network: AutoClass
- General tools: Clementine, DataEngine, DataScope, DBLearn
and Explora
The literature is first sorted by type, e.g. printed or electronic,
and then sorted by subject.
Printed literature includes books, articles, journals and magazines.
- Rough Sets:
- A Rough Set Approach to Data Mining: Extracting a Logic of Default
Rules from Data, by Torulf Mollestad (chapter from doctoral thesis)
- A Rough Set Framework for Data Mining of Propositional Default Rules,
by Torulf Mollestad and Andrzej Skowron (article)
- Rough Sets - Basic Notions, by Zdzislaw Pawlak (article)
- Rosetta - A Rough Set Toolkit for Analysis of Data,
by Aleksander Øhrn and Jan Komorowski (article)
- The Rough Set Model of Data Analysis - Introduction and Overview,
by Ivo Düntsch and Günther Gediga (article)
- Comparative Analysis of Data Mining Tools, by Finn Vidar Larsen
(diploma thesis)
- Rough Sets as a Framework for Data Mining, by Øyvind Tuseth Aasheim
and Helge Grenager Solheim (project report)
- Inductive Logic Programming:
- Knowledge Discovery in Databases applied in Inductive Logic
Programming, by Herman Midelfart (diploma thesis)
- Inductive Logic Programming in Knowledge Discovery in Databases,
by Erling Alfheim and Amund Tveit (project report)
- Bayesian Networks:
- A Guide to the Literature on Learning Probabilistic Networks from
Data, by Wray Buntine (article)
- A Tutorial on Learning Bayesian Networks, by David Heckerman (article)
- General:
- Advances in Knowledge Discovery and Data Mining, by Fayyad,
Piatetsky-Shapiro, Smyth and Uthurusamy (collection of articles)
- KDD-93: Progress and Challenges in Knowledge Discovery in Databases,
by Piatetsky-Shapiro, Matheus, Smyth and Uthurusamy (proceedings)
- Data Mining: A Performance Perspective, by Rakesh Agrawal (article)
- Data Mining, by John Krogstie (lecture notes)
- Reference List for the Data Mining Tutorial at CPM'95 Summer School,
by Heikki Mannila (lecture notes)
- NOEMIE:
- First Users Requirements
- First Users Requirements: Schlumberger Detailed Approach
- First Users Requirements: Norsk Hydro Detailed Approach
Electronic literature includes WWW-sites and mailing-lists.
- General:
- http://www.cs.bham.ac.uk/~anp/TheDataMine.html
- http://ai.iit.nrc.ca/misc.html
- Mailing list: KDD Nuggets
- Mailing list: DBworld
- Bayesian Network:
- http://hugin.dk/
- http://bayes.stat.washington.edu/almond/belief.html
- Mailing list: AI and Statistics
- Rough Sets:
- http://renoir.vill.edu/fac/cw/rough.html
- Inductive Logic Programming:
- http://igd.rz-berlin.mpg.de/~www/prolog/formilp.htm
- http://www.iospress.nl/html/node47.html
- http://www-ai.ijs.si/ilpnet.html
- Mailing list: Machine Learning list
- Mailing list: ILP-newsletter
We have interviewed these persons so far:
- Torulf Mollestad, Dr.ing. student - on Rough Sets in Data Mining.
- Øystein Nytrø, amanuensis - on ILP to control the Data Mining process.
- Erling Alfheim, diploma student - on ILP, techniques and tools.
We have attended some lectures from two courses to get more
information for the project:
- Logics for Computer Science - a Dr.ing. course in Data Mining and
Knowledge Discovery, mainly Rough Sets.
- Literature:
- Lecture notes
- Articles from AI-magazine - Fall 1996
- Articles from Communications of the ACM - November 1996
- Computer-intensive statistics - a course in Bayesian networks
(among others).
- Literature:
- Lecture notes
- Adaptive Rejection Metropolis Sampling within Gibbs Sampling,
by W.R. Gilks, N.G. Best and K.K.C. Tan
- Pattern Recognition and Neural Networks, by B.D. Ripley
- BUGS - Bayesian Inference Using Gibbs Sampling (manual),
by David Spiegelhalter, Andrew Thomas, Nicky Best and Wally Gilks
- Graphical Models in Applied Multivariate Statistics, by Joe Whittaker
The amount of literature in this project makes it difficult and
impractical for both group members to read all of it. We have therefore
divided some of the literature between us. Each group member will
write a short resume of the texts he has read and present it to the
other member. Important things to include in the resume:
- Description of the method(s) described in the text.
- List of demands made by the methods (e.g. pre-processing).
- List of the algorithm(s) described in the text.
- Descriptions of the method's properties, such as time complexity,
resources used, utility value and tractability.
Torgeir Dingsoyr
Sat May 3 15:20:22 MET DST 1997