Pre-study Report - Spring 1997
Torgeir Dingsøyr - Endre M. Lidal
The purpose of this report is to document the progress made in the
pre-study phase. The second section contains the definitions of
technical terms that will be used throughout the project. Then there
is a section on criteria for data mining. The data mining methods and
tools we have found are classified in the fourth section, together
with a list of the methods we want to concentrate on for the rest of
the project. In section five there is a list of the literature we want
to read, classified by type and subject. The last section contains the
plan for the literature study phase.
Below are the definitions we will use for some of the words in the
field of data mining and knowledge discovery. There are also other
definitions on these words, but we will use the following in our
project.
- Knowledge Discovery in Databases
- : is the non-trivial process of
identifying valid, novel, potentially useful, and ultimately
understandable patterns in data. [Fayyad, Piatetsky-Shapiro, Smyth]
- Data mining
- : is a step in the KDD process consisting of
particular data mining algorithms that, under some acceptable
computational efficiency limitations, produce a particular
enumeration of patterns. [Fayyad, Piatetsky-Shapiro, Smyth]
- Bayesian networks
- : is a graphical representation of probabilistic
information, which uses directed acyclic graphs, where each node
represents an uncertain variable, and each link represents a direct
influence, usually of a causal nature. [Judea Pearl]
- Inductive logic programming
- : ILP systems develop predicate
descriptions from examples and background knowledge. [Steve Muggleton]
- Fuzzy logic
- : Truth values are in the interval [0,1] rather than
true or false. [D. Dubois, H. Prade]
- Rough sets
- : Rough set analysis is a first step in analyzing
incomplete or uncertain information. Rough set analysis uses only
internal information and does not rely on additional model
assumptions as fuzzy set methods or probabilistic methods do. [Ivo
Düntsch, Günther Gediga]
- Neural networks
- : Neural networks are composed of a number of
nodes, or units, connected by links. Each link has a numeric weight
associated with it. Weights are the primary means of long-term storage
in neural networks, and learning usually takes place by updating the
weights. Some of the units are connected to the external environment,
and can be designated as input or output units. [Stuart J. Russell,
Peter Norvig]
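To make the rough-set notions above concrete, the following is a minimal sketch (in Python, as a present-day illustration rather than any tool discussed in this report) of computing the lower and upper approximations of a concept from a decision table. The toy table, its attributes and all object names are invented for illustration only.

```python
# Toy decision table: objects described by (headache, temperature) -> flu.
# All data here is invented for illustration.
table = {
    "o1": (("yes", "high"), "yes"),
    "o2": (("yes", "high"), "no"),
    "o3": (("no", "normal"), "no"),
    "o4": (("yes", "normal"), "yes"),
}

def equivalence_classes(table):
    """Group objects that are indiscernible on the condition attributes."""
    classes = {}
    for obj, (conditions, _decision) in table.items():
        classes.setdefault(conditions, set()).add(obj)
    return list(classes.values())

def approximations(table, concept):
    """Lower and upper approximation of a set of objects (a concept)."""
    lower, upper = set(), set()
    for cls in equivalence_classes(table):
        if cls <= concept:   # class lies entirely inside the concept
            lower |= cls
        if cls & concept:    # class overlaps the concept
            upper |= cls
    return lower, upper

flu = {o for o, (_, d) in table.items() if d == "yes"}
lower, upper = approximations(table, flu)
```

The boundary region `upper - lower` holds exactly the objects the data cannot classify unambiguously: here o1 and o2 are indiscernible on the condition attributes yet have different decisions, so both end up in the upper but not the lower approximation of the flu concept.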
Some technical criteria for data mining are defined in [Fayyad,
Piatetsky-Shapiro, Smyth, Uthurusamy], and [Ronald J. Brachman, Tej Anand]
- Availability of sufficient data (cases)
- Prior knowledge (Background knowledge)
- Relevance of attributes
- Noise level
- Interpretability
- Output
- Scalability
- Handling of uncertainty
- Metaknowledge
- Who is the user? Human control/interaction
- Number of concepts
- Model underfit/overfit
Practical criteria can be:
- Greater revenue
- Lower costs
- Higher quality
- Time saving
The subsections contain a list of all the methods we have encountered, a
list of the methods we want to concentrate our work on, and tools
implementing the methods.
Preliminarily, we have chosen to classify the data mining methods into two
groups: logically founded methods and statistically founded methods.
Logical methods:
- Rough Sets
- Inductive Logic Programming
- Empirical Methods
- Interactive Methods
- Fuzzy Logic/Sets
Statistical methods:
- Bayesian Network
- Neural Networks
- Adaptive Splines
- Linear Models
- Non-Parametric Methods
- Projection Pursuit
- Polynomial Networks
- Decision Trees
We suggest narrowing the scope down to three methods in order to study
them in more detail. Based on the preliminary information on each method,
we have selected the following:
- Logical methods:
- Rough Sets
- Inductive Logic Programming
- Statistical Methods:
- Bayesian Networks
We have chosen Rough Sets and ILP because research in these fields
has been carried out at NTNU for some time. Bayesian networks seem to
be the most promising statistical technique for data mining.
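As a small illustration of what a Bayesian network provides, the sketch below (Python, with conditional probability numbers invented purely for illustration; it is not any tool evaluated in this project) shows how a two-node network Rain -> WetGrass factorizes the joint distribution and supports inference by Bayes' rule.

```python
# A minimal Bayesian network: two binary variables, Rain -> WetGrass.
# The probability numbers are invented for illustration.
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {
    True:  {True: 0.9, False: 0.1},  # P(WetGrass | Rain=True)
    False: {True: 0.2, False: 0.8},  # P(WetGrass | Rain=False)
}

def joint(rain, wet):
    """Joint probability via the network factorization P(R) * P(W | R)."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

def p_rain_given_wet(rain):
    """Infer P(Rain | WetGrass=True) with Bayes' rule."""
    evidence = joint(True, True) + joint(False, True)
    return joint(rain, True) / evidence
```

The directed acyclic graph thus encodes the joint distribution compactly (one prior and one conditional table instead of a full joint table), and observing a node lets us reason backwards along the causal link.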
The tools we have found so far are:
- Rough Sets: Rosetta and RSDA
- ILP: Progol, RIPPER and ID3 (C4.5)
- Bayesian network: AutoClass
- General tools: Clementine, DataEngine, DataScope, DBLearn
and Explora
The literature is first sorted by type, e.g. printed or electronic,
and then sorted by subject.
Printed literature includes books, articles, journals and magazines.
- Rough Sets:
- A Rough Set Approach to Data Mining: Extracting a Logic of Default
Rules from Data, by Torulf Mollestad (chapter from doctoral thesis)
- A Rough Set Framework for Data Mining of Propositional Default Rules,
by Torulf Mollestad and Andrzej Skowron (article)
- Rough Sets - Basic Notions, by Zdzislaw Pawlak (article)
- Rosetta - A Rough Set Toolkit for Analysis of Data,
by Aleksander Øhrn and Jan Komorowski (article)
- The Rough Set Model of Data Analysis - Introduction and Overview,
by Ivo Düntsch and Günther Gediga (article)
- Comparative Analysis of Data Mining Tools, by Finn Vidar Larsen
(diploma thesis)
- Rough Sets as a Framework for Data Mining, by Øyvind Tuseth Aasheim
and Helge Grenager Solheim (project report)
- Inductive Logic Programming:
- Knowledge Discovery in Databases applied in Inductive Logic
Programming, by Herman Midelfart (diploma thesis)
- Inductive Logic Programming in Knowledge Discovery in Databases,
by Erling Alfheim and Amund Tveit (project report)
- Bayesian Networks:
- A Guide to the Literature on Learning Probabilistic Networks from
Data, by Wray Buntine (article)
- A Tutorial on Learning Bayesian Networks, by David Heckerman (article)
- General:
- Advances in Knowledge Discovery and Data Mining, by Fayyad,
Piatetsky-Shapiro, Smyth and Uthurusamy (collection of articles)
- KDD-93: Progress and Challenges in Knowledge Discovery in Databases,
by Piatetsky-Shapiro, Matheus, Smyth and Uthurusamy (proceedings)
- Data Mining: A Performance Perspective, by Rakesh Agrawal (article)
- Data Mining, by John Krogstie (lecture notes)
- Reference List for the Data Mining Tutorial at CPM'95 Summer School,
by Heikki Mannila (lecture notes)
- NOEMIE:
- First Users Requirements
- First Users Requirements: Schlumberger Detailed Approach
- First Users Requirements: Norsk Hydro Detailed Approach
Electronic literature includes WWW-sites and mailing-lists.
- General:
- http://www.cs.bham.ac.uk/~anp/TheDataMine.html
- http://ai.iit.nrc.ca/misc.html
- Mailing list: KDD Nuggets
- Mailing list: DBworld
- Bayesian Network:
- http://hugin.dk/
- http://bayes.stat.washington.edu/almond/belief.html
- Mailing list: AI and Statistics
- Rough Sets:
- http://renoir.vill.edu/fac/cw/rough.html
- Inductive Logic Programming:
- http://igd.rz-berlin.mpg.de/~www/prolog/formilp.htm
- http://www.iospress.nl/html/node47.html
- http://www-ai.ijs.si/ilpnet.html
- Mailing list: Machine Learning list
- Mailing list: ILP-newsletter
We have interviewed these persons so far:
- Torulf Mollestad, Dr.ing. student - on Rough Sets in Data Mining.
- Øystein Nytrø, amanuensis - on ILP to control the Data Mining process.
- Erling Alfheim, diploma student - on ILP, techniques and tools.
We have attended some lectures from two courses to get more
information for the project:
- Logics for Computer Science - a Dr.ing. course in Data Mining and
Knowledge Discovery, mainly Rough Sets.
- Literature:
- Lecture notes
- Articles from AI-magazine - Fall 1996
- Articles from Communications of the ACM - November 1996
- Computer-intensive statistics - a course in Bayesian networks
(among others).
- Literature:
- Lecture notes
- Adaptive Rejection Metropolis Sampling within Gibbs Sampling,
by W.R. Gilks, N.G. Best and K.K.C. Tan
- Pattern Recognition and Neural Networks, by B.D. Ripley
- BUGS - Bayesian Inference Using Gibbs Sampling (manual),
by David Spiegelhalter, Andrew Thomas, Nicky Best and Wally Gilks
- Graphical Models in Applied Multivariate Statistics, by Joe Whittaker
The amount of literature in this project makes it difficult and
impractical for both group members to read all of it. We have therefore
divided some of the literature between us. Each group member will
write a short resume of the texts he has read and present it to the
other member. Important things to include in the resume:
- Description of the method(s) described in the text.
- List of demands made by the methods (e.g. pre-processing).
- List of the algorithm(s) described in the text.
- Descriptions of the method's properties, such as time complexity,
resources used, utility value and tractability.
Torgeir Dingsoyr
Sat May 3 15:20:22 MET DST 1997