Torgeir Dingsøyr - Endre M. Lidal
Three methods for Data Mining were discussed for use in cooperation with Case-Based Reasoning: Bayesian Networks, Inductive Logic Programming and Rough Sets. Experiments were carried out on AutoClass, which is a Bayesian classifier, and on Rosetta, which is a Rough Set tool producing logic rules. The dataset used for the experiments was taken from a reliability database for the oil industry, OREDA. The Rough Set method produces many rules, and the overfit increases with the size of the training set. Bayesian classification can reveal poor human classification in the dataset, but requires much human interpretation of the results. The rules generated from Rosetta can be used as background knowledge in Case-Based Reasoning, and in making predictions about failures.
This report is written as a project in Course 45073 Computer Science Projects at the Department of Computer and Information Science at the Norwegian University of Science and Technology in Trondheim. It is a part of the NOEMIE project. The work has been carried out during the spring semester, and is stipulated at 20 hours of work per week per person, giving a total of 520 hours.
Through the work with this project, we have had many useful discussions with people from the Department of Computer and Information Science.
We would like to thank Alexander Øhrn for help with Rosetta, Torulf Mollestad for explaining properties of Rough Sets, Øystein Nytrø and Erling Alfheim for discussions on Inductive Logic Programming, and Knut Magne Risvik for discussions on Data Mining in general.
We would also like to thank John Stutz and Will Taylor at NASA Ames Research Center for answering questions about AutoClass. Knut Haugen at SINTEF helped us gain an understanding of the OREDA database.
Through the work on the project, we have followed lectures in Computer Intensive Statistics and Logics for Computer Science. In the latter course we have been able to discuss many of the problems we have met in experimenting with the data.
Finally, we would like to thank our supervisors, professor Agnar Aamodt and professor Jan Komorowski at the Department of Computer and Information Science, NTNU, and Ole Martin Winnem at SINTEF.
Trondheim, 29th of April, 1997,
Endre Mølster Lidal    Torgeir Dingsøyr
The purpose of this report is to evaluate methods and tools for Knowledge Discovery and Data Mining, in the context of integrating Case-Based Reasoning and Data Mining techniques. We have chosen the methods Bayesian Networks, Inductive Logic Programming and Rough Sets for evaluation. Reading this report requires an understanding of logic, logic programming and elementary statistics.
Unfortunately, we did not have time to evaluate the ILP tool because we received the dataset later than planned, which left us with less time to do the experiments. We excluded the ILP tool because the database we experimented with contains much data, and ILP tools have proven to be most effective on smaller datasets.
In chapter , criteria for describing and evaluating Knowledge Discovery and Data Mining methods and tools are listed. Then the three methods are described in chapter . Chapter contains the results from our experiments. The results are discussed in chapter ; conclusions are drawn and further work is outlined in chapter . Resources on the Internet, terminology, SQL queries and descriptions of attribute values from the experiment data sets are listed in the appendices.
Today's computers and databases have made it possible to collect and store huge amounts of data. The size of many databases makes it impossible for humans to interpret the data. In fact, some of the databases are merely in a write-only mode, because the amount and structure of the data are too complex for humans to extract useful knowledge from. The process of extracting new, useful knowledge must therefore be done by computers.
Statisticians have worked with this process for several years. Lately, other sciences like machine learning, artificial intelligence and logic have made progress in this field. Today, more and more corporations have discovered the advantages of Data Mining and Knowledge Discovery. The corporations want to reuse the stored information to reduce costs, increase sales and revenue, and reduce accidents and failures.
There seems to be some confusion in the terminology in the field of extracting knowledge from databases. This may be because it has emerged from many different fields. We have chosen to follow the terminology used by Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy in [19], which seems to have become the standard set of definitions. According to these definitions, Knowledge Discovery in Databases (KDD) is defined as:
The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. Data Mining (DM) is defined as:
A step in the KDD process consisting of particular data mining algorithms that, under some acceptable computational efficiency limitations, produce a particular enumeration of patterns. Thus Knowledge Discovery in Databases is the name for the whole process, whereas Data Mining is a step in KDD.
Even though the KDD process has emerged from different fields, it has almost the same steps in all of the different approaches. These steps are [19]:
The main difference among the various KDD approaches thus lies in the step(s) called Data Mining: how to detect patterns in the data in the best way. This has resulted in a number of different algorithms and methods. This may be because the application area is very heterogeneous. There are similarities between a medical database containing records of patients and a corporate management database, but the differences are usually bigger. Thus, an algorithm proven useful for a medical database may turn out not to be useful in a corporate database. There is a quest to find the right method for a specific problem. The ultimate goal must thus be to design methods and algorithms that are universal.
A more comprehensive introduction to KDD and DM can be found in [19] and [16].
The objective of the NOEMIE project is to improve decision processes by making better use of information through techniques in Data Mining and Case-Based Reasoning. This will be done by developing a methodology and tools for automated use of industrial experience from large collections of data in technical, safety and business domains.
Participants in the project are Université Dauphine, SINTEF, ISIS/JRC, Matra Cap Systems, Acknosoft, Schlumberger and Norsk Hydro.
We have divided the criteria into criteria for Knowledge Discovery in Databases, which refer to the whole process, and criteria for Data Mining, i.e. for the algorithms that extract previously unknown, potentially useful information from large databases.
Criteria for knowledge discovery are divided into practical criteria and technological criteria, which are more general than those for the Data Mining process.
Is the information that can be extracted useful? Will the discovery of new information prevent the recurrence of unwanted events [3]? Will it lead to decreased costs or save time?
Primary tasks in Data Mining are defined in [19]:
Classification - learning a function that maps data into classes.
Regression - learning functions that map data into real-valued prediction variables.
Clustering - identifying a set of categories to describe the data.
Summarization - finding a compact description for a subset of data.
Dependency Modeling - finding a model which describes significant dependencies between variables.
Change and deviation detection - finding the most significant changes in the data from previously measured or normative values.
What tasks in the KDD process does the system support? Is there a shared workspace between tools?
How unexpected or novel are the results produced [19]? Is it possible to find a metric to evaluate the degree of novelty?
Are the attributes relevant to what has been discovered? Does the system allow a manual check of the attributes? Is it using metaknowledge?
Some criteria for Data Mining techniques are listed below. The criteria will be used both to describe and to evaluate the methods and tools in chapter and . Some criteria are described further in [5].
Are data missing, biased or not applicable? Is the system able to reason with noisy data, or must the data be cleaned?
Does the system discover inconsistency? Is it able to reason with inconsistency? Is the system trying to ``hide'' uncertainty or is it actively using and revealing it? [4] Will the system discover latent attributes?
Do we have any prior knowledge of the system that is to be analyzed? Is it in the form of a data dictionary or domain knowledge? Is it extracted automatically? Is the method using metadata to eliminate relations between data that are impossible or unwanted in the model?
What type of output can the system generate? Is it interpretable for humans? Can the output be used as prior knowledge later? Does it produce graphs, written reports or logic rules? Does it state why exactly this output is produced?
What is the time and space complexity of the algorithm? What is the performance of the algorithm on computers that exist today? How does the system handle large amounts of data? Is it scalable both in data and in the number of concepts that it is able to extract?
Is it possible to find out why the algorithm produced the results it did?
How does the algorithm handle errors, invalid input, shortage of memory?
Does the algorithm perform well on test data?
Is the method dynamic, in the sense that it allows results to be refined by adding more data, or does it just allow static data sets?
Data Mining has emerged from many different fields. Thus researchers have approached the knowledge discovery process from different angles, with different algorithms, based on their scientific interests and backgrounds. Well known techniques and methods have been applied to new tasks.
A variety of Data Mining methods and algorithms exist. Some are well founded in mathematics and statistics, whereas others are used simply because they produce useful results. We have listed a few methods we have encountered in Tab .
Table 3.1: List of some Data Mining methods.
We have chosen to study the methods Rough Sets, Bayesian Networks and Inductive Logic Programming. We chose them because we wanted to study methods that were logically traceable, and we wanted to look at methods from different fields. Another reason for choosing the above methods is that they are currently studied at NTNU.
Bayesian networks, or belief networks, are a method to represent knowledge using directed acyclic graphs (DAGs). The vertices are events which describe the state of some part of the desired universe in some time interval. The edges represent relations between events, and the absence of edges represents independence between events.
A variable takes on values that correspond to events. Variables may be discrete or continuous. A database is denoted D in the following, the set of all variables U, and a Bayesian network-structure is denoted B. If D consists of all variables in U, then D is complete. Two structures are said to be equivalent if they have the same set of probability distributions.
The existing knowledge of an expert or set of experts is encoded into a Bayesian network, then a database is used to update the knowledge, and thus creates one or several new networks.
Bayesian Networks are described in [7], [13], [15] and [1]. In Fig , a Bayesian network is illustrated. A disease is dependent on age, work and environment. A symptom is again dependent on a disease.
Figure 3.1: A Bayesian Network example.
In probability theory, P(A) denotes the probability that event A will occur. Belief measures in the Bayesian formalism obey the three basic axioms of probability theory: $0 \leq P(A) \leq 1$, $P(A) = 1$ if A is certain, and, if A and B are mutually exclusive, then:

$P(A \vee B) = P(A) + P(B)$

The Bayes rule is the basis for Bayesian Networks:

$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$

where $P(H \mid E)$ is called the posterior probability and P(H) is called the prior probability.
Probability can be seen as a measure of belief in how probable outcomes are, given an event (subjective interpretation). It can also be interpreted as a frequency; how often it has happened in the past (objective interpretation).
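As a small worked illustration of the rule above (with purely illustrative numbers, not taken from OREDA): suppose a disease H has prior probability $P(H) = 0.01$, a symptom E occurs with probability $P(E \mid H) = 0.9$ when the disease is present, and with probability $P(E) = 0.1$ overall. Then

$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)} = \frac{0.9 \cdot 0.01}{0.1} = 0.09$

so observing the symptom raises the belief in the disease from 1% to 9%.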
When Bayesian Networks are applied in Data Mining, the user may want to apply techniques to learn the structure of a Bayesian network with one or more variables, or to reason on a structure that is already known.
With only one variable, the task is to compute a distribution for the variable given any database. For instance, if we want to compute the probability distribution for a coin that is flipped, the distribution for the variable would be ([7]):

$p(x_{N+1} = \mathit{heads} \mid D, \xi) = \frac{h + 1}{h + t + 2}$

where $\xi$ denotes the background knowledge of the model, h denotes the number of heads and t the number of tails. This result is only dependent on the data in the database. In general, for a variable that can have more outcomes, the distribution for the set of physical probabilities of outcomes, $\theta = (\theta_1, \ldots, \theta_r)$, will be:

$p(\theta \mid D, \xi) = c \prod_{k=1}^{r} \theta_k^{N_k}$

where $N_k$ is the number of times x = k in D, and c is a normalization constant.
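As a minimal sketch of the count-based estimate above (assuming a uniform prior; an illustration only, not the actual AutoClass computation):

from collections import Counter

def predictive_distribution(observations):
    # Posterior-predictive probabilities for a discrete variable under a
    # uniform prior: p(x = k | D) = (N_k + 1) / (N + r), where N_k is the
    # number of times outcome k occurs, N the total count and r the number
    # of distinct outcomes observed.
    counts = Counter(observations)
    n, r = len(observations), len(counts)
    return {k: (c + 1) / (n + r) for k, c in counts.items()}

# Coin example: 7 heads and 3 tails give p(heads) = 8/12 and p(tails) = 4/12.
print(predictive_distribution(["heads"] * 7 + ["tails"] * 3))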
If the structure of the network, B, is known in advance, the aim of the analysis is to compute the probabilities on the vertices from the database D. If the database is complete, and the variables are discretized into r different outcomes, the probabilities can be computed according to the following formula (assuming uniform priors):

$p(x_i = k \mid \mathrm{pa}_i = j, D) = \frac{N_{ijk} + 1}{N_{ij} + r}$

where $N_{ijk}$ is the number of cases in D where $x_i$ takes its k-th value and its parents take their j-th configuration, and $N_{ij} = \sum_k N_{ijk}$.
To learn a structure, we want to generate candidate Bayesian Networks, and choose the structure that is best according to the database we have.
The probability distribution in equation gives the probability distribution for the next observation after looking in the database:
To find a Bayesian Network B for a database D, we can use metrics together with a search algorithm. The metrics approximate equation , and some of the metrics are:
Some search methods that exist are:
If data are missing in the database, they can either be filled in using the posterior distribution, or we can use Gibbs sampling, which approximates the expectation of any function with respect to its probability distribution. Another alternative is to use the expectation-maximization (EM) algorithm, which can be used to compute a score from the BIC metric.
Missing data can be filled in using Gibbs sampling or the posterior distribution. The method is relatively resistant to noise.
In order to make the Bayesian network structure, it has to be learned from the database. It can also exist as prior knowledge.
The output from Bayesian Network algorithms is a graph, which is easy to interpret for humans.
There exist classifiers that perform in linear time, but further iterations give better results.
The method is statistically founded, and it is always possible to examine why the Bayesian method produced certain results.
Inductive Logic Programming (ILP) is a method for learning first-order definite clauses, or Horn clauses, from data. A Horn clause can be:

daughter(X, Y) ← female(X), parent(Y, X)

where the predicate daughter is defined by the predicates female and parent. The two last predicates should be present as attributes in a relational schema in a database, and the new predicate daughter can be the result that has been learned.
The training examples that are used to learn a predicate p are denoted E. The ground facts which are known to belong to the target relation are denoted E+, and the facts that are known not to belong to the target relation are denoted E−. The ground facts are taken from a database.
We use a concept description language, L, to specify syntactic restrictions on the definitions of the predicates.
To learn one or more predicates, we may also have background knowledge, denoted B, and other predicates.
The quest is then to find a definition for a new predicate or set of predicates, H, expressed in L, such that H is complete (covers all examples in E+) and consistent (covers no examples in E−) with respect to the training examples, the background knowledge and the other predicates.
A clause c is more general than a clause d if c can be matched with a subset of d. For instance, the clause daughter(X, Y) ← female(X) is more general than daughter(X, Y) ← female(X), parent(Y, X).
The least general generalization of two clauses c and d is the least general clause which is more general than both c and d with respect to the background knowledge B.
ILP systems can be divided into three groups [6]: Empirical systems, which are single-predicate learning systems that are able to analyze large example sets with little or no human interaction. Interactive systems, which incrementally build complex domain theories consisting of multiple predicates, where the user controls and initiates the model construction and refinement process. Programming assistants, which learn from small example sets without interaction with humans. The systems can use a multitude of techniques when inducing knowledge. The techniques are discussed briefly below. A more thorough discussion can be found in [2].
Inverse Resolution is motivated by the fact that in deduction, the results found by resolution are clauses derived from other clauses. Inverse resolution inverts this procedure to find the clauses from the results.
The method uses the relative least general generalization (described in section ) to compute the most general clause given the background knowledge B.
The top-down learner starts from the most general clause and repeatedly refines it until it no longer covers negative examples. During the search the learner ensures that the clause considered covers at least one positive example.
A rule model is a rule of the form T_1, …, T_n → T, where T and the T_i are literal schemas. A literal schema has the form Q(Y_1, …, Y_k) or not Q(Y_1, …, Y_k), where all non-predicate variables are implicitly universally quantified.
All possible instantiations of the predicate variables are tried systematically for each rule model, and the resulting clauses are tested against the examples and background knowledge to produce new clauses that can be learned.
An ILP problem is transformed from relational to attribute-value form and solved by an attribute-value learner. This approach is only feasible for a small number of ILP problems where the clauses are typed, constrained and non-recursive. The induced hypothesis is transformed back into relational form.
Early ILP systems did not handle noise, but recent systems such as FOIL and LINUS are able to cope with noise.
General rules can not be generated from inconsistent data. Inconsistency must be removed by preprocessing the data.
The ILP methods require background knowledge in the form of definite clauses.
The output is given as definite clauses, often in Prolog syntax.
Inductive logic programming has a mathematical basis, and it is always possible to analyze why results are produced.
Overfitting can be a problem if rules are generated that have low support.
The Rough Sets method was designed as a mathematical tool to deal with uncertainty in AI applications. Rough sets have in recent years also proven useful in Data Mining. The method was first introduced by Zdzislaw Pawlak in 1982 [11], and the work has been continued by [18], [12], [17], [9] and others. This section is a short introduction to the theory of Rough Sets. A more comprehensive introduction can be found in [9] or [10].
The starting point of Rough Sets theory is an information system, an ordered pair 𝒜 = (U, A). U is a non-empty finite set called the universe, and A is a non-empty finite set of attributes. The elements of the universe are called objects, and for each attribute a ∈ A, a(x) denotes the value of attribute a for the object x ∈ U. It is often useful to illustrate an information system in a table, called an information system table. One of the most important concepts in Rough Sets is indiscernibility, denoted IND(B). For an information system 𝒜 = (U, A), an indiscernibility relation is:

$IND(B) = \{(x, y) \in U \times U : a(x) = a(y) \ \text{for all} \ a \in B\}$

where B ⊆ A. The interpretation of IND(B) is that the objects x and y cannot be separated by using the attributes in the set B. The indiscernible objects can be classified in equivalence classes:

$[x]_B = \{y \in U : (x, y) \in IND(B)\}$

again with B ⊆ A. $[x]_B$ denotes the equivalence class of x based on the attributes in the set B. The set U/IND(B) is the set of all equivalence classes in the relation IND(B). Intuitively, the objects of an equivalence class are indiscernible from all other objects in the class, with respect to the attributes in the set B. In most cases we therefore consider the equivalence classes and not the individual objects. The discernibility matrix of an information system 𝒜 with n objects (or equivalence classes) is a symmetric n × n matrix with entries $c_{ij}$ given by:

$c_{ij} = \{a \in A : a(x_i) \neq a(x_j)\}$

hence $c_{ij}$ contains the attributes upon which the classes $[x_i]_B$ and $[x_j]_B$ differ.
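A minimal sketch of these two definitions, on a toy information system with invented attribute names and values (not taken from OREDA), might look like this in Python:

from itertools import combinations

def equivalence_classes(objects, B):
    # Partition U/IND(B): group objects that share values on all attributes in B.
    classes = {}
    for name, values in objects.items():
        key = tuple(values[a] for a in B)
        classes.setdefault(key, set()).add(name)
    return list(classes.values())

def discernibility_matrix(objects, attributes):
    # c_ij = the set of attributes on which objects i and j differ.
    return {(i, j): {a for a in attributes if objects[i][a] != objects[j][a]}
            for i, j in combinations(sorted(objects), 2)}

U = {"x1": {"climate": "arctic", "power": "high"},
     "x2": {"climate": "arctic", "power": "high"},
     "x3": {"climate": "temperate", "power": "low"}}
print(equivalence_classes(U, ["climate", "power"]))  # [{'x1', 'x2'}, {'x3'}]
print(discernibility_matrix(U, ["climate", "power"]))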
Some of the objects' attributes may be less important than others when doing the classification. An attribute a is said to be dispensable or superfluous in B ⊆ A if IND(B) = IND(B − {a}); otherwise the attribute is indispensable in B. A minimal set of attributes that preserves the partitioning of the universe is called a reduct, RED. Thus a reduct is a set of attributes B ⊆ A such that all attributes in B are indispensable and IND(B) = IND(A). There are also object-related reducts, which give the set of attributes needed to separate one particular object/class from all the others.
The theory is illustrated in the example below.
The next concepts to define are the lower and upper approximations, rough definability of sets and the rough membership function. The lower and upper approximations are used to classify a set of objects, X, from the universe that may belong to different equivalence classes. Let X ⊆ U and B ⊆ A; the B-lower and B-upper approximations are:

$\underline{B}X = \{x \in U : [x]_B \subseteq X\} \qquad \overline{B}X = \{x \in U : [x]_B \cap X \neq \emptyset\}$

Hence this is a way to approximate X using only the attributes in the set B. The set $BN_B(X) = \overline{B}X - \underline{B}X$ is called the boundary of X. Intuitively, the objects in $\underline{B}X$ are the objects that are certain to belong to X, while the objects in $\overline{B}X$ are possible members of X, as illustrated in Fig .
Figure 3.2: The B-upper and B-lower approximations of X.
If $\underline{B}X \neq \emptyset$ and $\overline{B}X \neq U$, then X is said to be Roughly B-definable. If $\underline{B}X = \emptyset$ and $\overline{B}X \neq U$, then X is said to be Internally B-undefinable. If $\underline{B}X \neq \emptyset$ and $\overline{B}X = U$, then X is said to be Externally B-undefinable. If $\underline{B}X = \emptyset$ and $\overline{B}X = U$, then X is said to be Totally B-undefinable.
The rough membership function $\mu_X^B(x)$, where x ∈ U, is given by:

$\mu_X^B(x) = \frac{|X \cap [x]_B|}{|[x]_B|}$

The rough membership function denotes the frequency of $[x]_B$ in X, or the probability of x being a member of X, given the attributes in the set B.
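Continuing the toy example from the previous sketch, the approximations and the membership function can be illustrated as follows (illustrative only):

def approximations(classes, X):
    # B-lower and B-upper approximations of X, given the partition U/IND(B).
    lower = set().union(*(c for c in classes if c <= X))   # [x]_B contained in X
    upper = set().union(*(c for c in classes if c & X))    # [x]_B intersects X
    return lower, upper

def rough_membership(classes, X, x):
    # mu_X^B(x) = |X intersect [x]_B| / |[x]_B|
    cls = next(c for c in classes if x in c)
    return len(cls & X) / len(cls)

classes = [{"x1", "x2"}, {"x3"}]           # U/IND(B) from the previous sketch
X = {"x1", "x3"}                           # the set of objects to approximate
print(approximations(classes, X))          # ({'x3'}, {'x1', 'x2', 'x3'})
print(rough_membership(classes, X, "x1"))  # 0.5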
A natural extension of the information system is the decision system. A decision system is an information system with an extra set of attributes, called decision attributes, hence 𝒜 = (U, A ∪ D). In most cases the set D has only one element, d, and we will use this simplification in the rest of this section. The decision attribute, d, reflects a decision made by an expert. The value of d is often a set of integers {1,...,r(d)}, where r(d) denotes the range of d. It is also possible to represent non-numerical decisions, but it is then much simpler to encode the decisions into integers, e.g. false and true can be encoded as 0 and 1 respectively. The partitioning of the universe determined by the decision attribute d is called the set of decision classes, and is a classification of the objects in U. A decision system is said to be consistent if, for all pairs of objects x and y, $(x, y) \in IND(A)$ implies $d(x) = d(y)$.
Otherwise, the decision system is inconsistent. An inconsistent system is a system where two or more of the objects have the same values for the set of attributes A, hence belonging to the same equivalence class, but the decision attribute d differs. A decision table is a table of the indiscernibility classes, the attributes and the decision attribute. The example below illustrates this.
When using Rough Sets as an algorithm for data mining, we generate decision rules that map the values of an object's attributes to a decision value. The rules are generated from a training set. If 𝒜 = (U, A ∪ {d}) is a decision system, and a descriptor is an expression a = v, where a ∈ A and v is a value of a, then a decision rule is an expression generated from descriptors:

$a_1 = v_1 \wedge \ldots \wedge a_n = v_n \rightarrow d = v$

where $v_1, \ldots, v_n$ are values of the attributes in A, d is the decision attribute and v is the value of d. There are two types of decision rules: definite and default decision rules. Definite decision rules are rules that map the attribute values of an object into exactly one decision class. A deterministic decision system can be completely expressed by a set of definite rules, but an indeterministic decision system cannot.
To model an inconsistent decision system, [9] suggests using a set of rules called default rules. Given a decision system 𝒜 and a threshold value τ, a default rule for 𝒜 with respect to the threshold is any decision rule whose accuracy, i.e. the fraction of the objects in the matching equivalence class that have the decision value given by the rule, is at least τ.
Depending on the threshold value, zero, one or more mutually inconsistent rules will be generated. A threshold of 1 gives only the set of default rules that are exactly identical to the set of definite rules. A threshold value greater than 0.5 maps the values of an object's attributes in an inconsistent equivalence class into only one decision class, whereas a threshold value below 0.5 may map them into several decision classes.
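The idea can be sketched as follows (a simplified illustration with hypothetical failure records; Rosetta's actual rule generation, which works via reducts, is more involved):

from collections import Counter, defaultdict

def default_rules(table, condition_attrs, decision_attr, threshold):
    # Group objects into equivalence classes on the condition attributes and
    # emit a rule "a1=v1 and ... -> d=v" when the fraction of the class having
    # decision v is at least the threshold.
    groups = defaultdict(list)
    for row in table:
        key = tuple((a, row[a]) for a in condition_attrs)
        groups[key].append(row[decision_attr])
    rules = []
    for key, decisions in groups.items():
        for value, count in Counter(decisions).items():
            accuracy = count / len(decisions)
            if accuracy >= threshold:
                rules.append((dict(key), value, accuracy))
    return rules

table = [{"fm": "AIR", "severity": "I"}, {"fm": "AIR", "severity": "I"},
         {"fm": "FTS", "severity": "C"}, {"fm": "FTS", "severity": "D"}]
# With threshold 0.5 the inconsistent FTS class yields two rules (C and D);
# any threshold above 0.5 would keep at most one rule per class.
for antecedent, decision, accuracy in default_rules(table, ["fm"], "severity", 0.5):
    print(antecedent, "->", decision, accuracy)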
Definite and default rules generated directly from the training set tend to be very specialized. This can be a problem when the rules are applied to unknown data, because many of the objects may fall outside of, i.e. not match, the specialized rules. If we make the rules more general, chances are that a greater number of the objects can be matched by one or more of the rules. One way to do this is to remove some of the attributes in the decision system, and thereby glue together equivalence classes. This will certainly increase the inconsistency in the decision system, but by assuming that the removed attributes only discern a few rare objects from a much larger decision class, this may not be a problem in most cases.
The subsections below compare the Rough Set method to the criteria outlined in chapter . This comparison will give us a framework that can be used to compare the various data mining techniques.
The Rough Set method is capable of handling most types of noise. If the input data are missing an attribute that is dispensable, then the classification process is not affected. If the input data are missing important attributes, then Rough Sets cannot classify the object, and the object must be thrown away. There are also algorithms that can 'guess' the value of a missing attribute, for instance by statistical insertion, and thereby keep the object.
The Rough Set method does not use prior knowledge when generating default rules. As mentioned before, the default rules generated by Rough Set algorithms are generated only from the objects in the input data.
The output from the Rough Set algorithms is rules, preferably default rules. These rules can then be used to predict decisions based on the input attributes. By varying the threshold value, one can tune the number of default rules to obtain the optimal result. Rules are easy for humans to interpret and use.
The complexity of the partitioning is O(n log(n)), as for a sorting algorithm. Computing the reducts is NP-complete, but there are approximation algorithms (genetic, Johnson, heuristic search) that have proven very efficient.
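For illustration, the Johnson heuristic mentioned above can be sketched as a greedy set cover over the discernibility matrix (this is only the general idea, not Rosetta's implementation):

from collections import Counter

def johnson_reduct(matrix):
    # Greedy reduct approximation: repeatedly pick the attribute occurring in
    # most of the remaining non-empty discernibility entries, and discard the
    # entries it covers, until all entries are covered.
    entries = [e for e in matrix.values() if e]
    reduct = set()
    while entries:
        counts = Counter(a for e in entries for a in e)
        best = counts.most_common(1)[0][0]
        reduct.add(best)
        entries = [e for e in entries if best not in e]
    return reduct

# Discernibility matrix from the toy example earlier; either attribute alone
# preserves the partition, so {'climate'} and {'power'} are both valid reducts.
print(johnson_reduct({("x1", "x2"): set(),
                      ("x1", "x3"): {"climate", "power"},
                      ("x2", "x3"): {"climate", "power"}}))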
It is easy for the user to find out how the results were generated, because the Rough Set method can display the indiscernibility classes it created and the distribution of the objects in these classes. From this, the user can confirm the rules generated and argue, on the basis of the classes, why the rules are correct.
To experiment with the methods chosen, we selected the tools Rosetta, which is a Rough Set analysis tool and AutoClass, which is a Bayesian Networks classifier. The experiments were done on the OREDA database, which was supplied through the NOEMIE project.
OREDA is a reliability database for the oil industry, which has been in use for more than ten years. It contains data from ten different oil companies, and consists of three parts:
Most attributes in the database are composed of a number of codes which for instance describe types of failures. Text strings that can give further information are usually provided. There are several missing attribute values in the database.
The experiments were done on a 200 MHz Pentium Pro processor running Windows NT 4.0. The OREDA database was stored in Oracle, and data were retrieved using SQL*Plus 3.3.
AutoClass C version 3.0, compiled for MS-DOS, was used in testing. This is a public domain C-version of AutoClass III. Except where indicated, AutoClass searched for five minutes for classes. The Rosetta program used was a trial version 1.0b, limiting the decision system to 50 attributes and 5000 objects. Both programs had some minor bugs.
AutoClass is a tool for unsupervised classification using Bayesian methods to determine the optimal classes. AutoClass was developed at the NASA Ames Research Center. A description is given in [14].
In AutoClass, data must be represented in a data vector, with discrete nominal values or real scalar values. Each information vector is called a case, but we will still use the term object for each vector. The vectors can be stored in text- or binary files.
A class is a particular set of attribute (or parameter) values and their associated model. A model is either Single multinomial, Single normal or Multi normal. The first two models assume that attributes are conditionally independent, given a class; the probability that a class would have a particular value of any attribute depends only on the class and is independent of all other attribute values. The Multi normal model expresses mutual dependencies within the class.
Class membership is expressed probabilistically; the class membership probabilities for an object must always sum to one.
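Under the conditional-independence assumption of the Single multinomial model, class membership probabilities can be sketched roughly as below (a naive illustration with made-up numbers, not the actual AutoClass code):

import math

def memberships(obj, classes):
    # P(class | object) for a naive multinomial model: prior class weight times
    # the product of per-attribute value probabilities, normalized to sum to one.
    logs = []
    for cls in classes:
        log_p = math.log(cls["weight"])
        for attr, value in obj.items():
            log_p += math.log(cls["p"][attr].get(value, 1e-6))
        logs.append(log_p)
    m = max(logs)
    weights = [math.exp(l - m) for l in logs]
    total = sum(weights)
    return [w / total for w in weights]

classes = [{"weight": 0.6, "p": {"dc_code": {"IN": 0.8, "AD": 0.2}}},
           {"weight": 0.4, "p": {"dc_code": {"IN": 0.1, "AD": 0.9}}}]
print(memberships({"dc_code": "IN"}, classes))  # approximately [0.92, 0.08]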
AutoClass has three options:
The Case number membership lists objects for each class with a configurable number of attributes. It also lists the probability that the object is put in each class. An example is given in Fig .
Figure 4.1: An extract from a Case Number Membership cross reference.
Most probable class lists each object in the database, and lists probabilities for which classes it is likely to be grouped in. An example of a Most probable Class cross reference is given in Fig .
Figure 4.2: An extract from a Most Probable Class cross reference.
The Class strength given in the Influence value report is a heuristic measure of how strongly each class predicts its instances.
The Normalized weight is a value between 0 and 1 that gives a probability value that a new object will be classified as a member of the given class.
Rosetta is a toolkit for pattern recognition and Data Mining within the framework of Rough Sets. It is developed as a cooperative effort between the Knowledge Systems Group at NTNU, Norway, and the Logic Group at the University of Warsaw, Poland. Further information on Rosetta can be found in [8].
Rosetta runs on Windows NT, and consists of a kernel and a Graphical User Interface (GUI). The user interacts with the kernel through the GUI, but the kernel is an independent unit. The GUI in Rosetta relies on human interaction and human editing. Data are imported as a decision table from a text file. The format of this text file is shown in Fig .
Figure 4.3: Example of input file for Rosetta.
The first line contains the attribute names, the second contains the attribute types: integers, real numbers (float) or strings. Then the objects are listed successively. Missing attribute values are denoted Undefined.
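Since the original figure is not reproduced here, an illustrative file of the kind described (with hypothetical attribute names and values; the exact type keywords may differ in Rosetta) could look like:

FM_CODE  SEVERITY  POWER      I_MODEL
String   String    Integer    String
AIR      I         6900       TORNADO
FTS      C         Undefined  KG5
VIB      D         21000      1304-10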
Rosetta organizes the data in a project and displays relations between the various project items in project trees. Through these trees the user can monitor and manually edit the steps in the process. Most of the structures are displayed in tabular form. Fig is a screen shot from Rosetta, showing the project tree to the left and a decision table to the right.
Figure 4.4: Screen shot from Rosetta.
Rosetta can handle missing attribute values either by removing the objects with missing values or by inserting a mean value based on the values of the other objects' attributes. Other kinds of preprocessing, like splitting into test set and training set, manual removal of selected attributes and/or objects, and discretization, are also supported. The user can calculate reducts, both normal and object-related, and generate rules. The reducts can be calculated by three different algorithms: RSES exhaustive calculation, the Johnson algorithm and a genetic algorithm. All of the algorithms also support dynamic calculation, which can be seen as combining normal reduct computation with statistical resampling techniques. Both the reducts and the rules can be pruned based on support value, or with respect to a decision value or an attribute. It is also possible to calculate the rough classification for the decision system. The decision system and the rules can be exported as Prolog clauses.
Rosetta also has a facility for testing the rules that have been generated. The rules are applied to a test set with known decision values. The result from the test is a confusion matrix, i.e. a matrix where the predicted decision values are compared to the actual decision values; see Tab for an example. The steps performed on the data are recorded together with the data, so reproduction of the results is easy.
Table 4.1: An example of a confusion matrix.
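The overall results quoted in the experiments that follow presumably correspond to the overall hit rate, i.e. the fraction of test objects falling on the diagonal of the confusion matrix. A minimal sketch (with illustrative counts, not our actual results):

def overall_hit_rate(confusion):
    # Fraction of predictions on the diagonal of an {actual: {predicted: count}} matrix.
    correct = sum(confusion[d].get(d, 0) for d in confusion)
    total = sum(sum(row.values()) for row in confusion.values())
    return correct / total

confusion = {"C": {"C": 80, "D": 10, "I": 5},
             "D": {"C": 15, "D": 40, "I": 20},
             "I": {"C": 5, "D": 25, "I": 30}}
print(overall_hit_rate(confusion))  # 150/230, about 0.65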
We did three different experiments, with different data sets. The presentation of the results from each experiment is split into two parts, one for the Rosetta results and one for the results from AutoClass. We have tried to use tables and figures to show the results. Comments and evaluation of the results are presented in chapter . An overview of values, with explanations of selected attributes, is included in appendix .
We extracted the datasets from the OREDA database using SQL. The SQL statements are included in appendix . We manually preprocessed the data in MS-Notepad and through PERL scripts. This was done to remove unwanted space between the attribute values, and to insert attribute types and Undefined for missing values in the Rosetta input file.
The model used in AutoClass in the experiments to follow was Single multinomial. All attributes were treated as discrete values. The reason for this is that AutoClass expects a zero-point and a relative error for scalars. By using discrete values, we were able to make the input more similar to the input to Rosetta. The datasets were preprocessed in Rosetta, and the splits used in Rosetta were exported using cut and paste into MS-Notepad.
We manually read the database diagrams and chose the attributes that we thought had some possible functional dependencies. The attributes, with explanation are listed in Tab .
Table 4.2: Table of the attributes in Experiment I.
When we manually inspected the data, we found that the values of the attributes I_MANUFACT_CTRL_SYS and I_SUBSEA_CS were mostly missing, so we removed them. We also removed three objects out of 3357 that had most of their attribute values missing. Tab below shows the number of objects and attributes in the dataset of experiment I.
Table 4.3: Facts on Experiment I data set.
The dataset was randomly split into a 30% training set, and a 70% test set.
We wanted to see if it was possible to predict the failure type given the values of the other attributes. We chose the attribute FM_CODE as the decision attribute. Then we calculated dynamic reducts, both normal and object-related, and generated rules. We tested the rules on the test set, and the overall hit rate was around 59%. We then tried to remove four of the attributes that had missing values, but this did not improve the results.
After 62 tries, AutoClass found nine classes. The strength of each class is given in Tab , and the influence of each attribute is given in Tab .
Table 4.4: Strength and normalized weight for the classes in Experiment I.
Table 4.5: The relative influence of each attribute in Experiment I.
To get more background information, we contacted an expert on the OREDA database. He suggested that we restrict the dataset to faults on one type of equipment. We selected Gas Turbines because they had many reported failures, and Electro Motors because they had few reported failures. In this way we could test the tools on a set with many objects and on a different set with fewer objects. We selected a new set of attributes, which are listed in Tab . The facts about the Gas Turbines dataset and the Electro Motors dataset are shown in Tab . The Electro Motors set was randomly split into a 30% training set and a 70% test set. The Gas Turbines set was randomly split into two different pairs of training and test sets, with 20-80% and 30-70% for training and testing, respectively.
Table 4.6: Table of the attributes in Experiment II.
Table 4.7: Facts on Experiment II data set.
We kept FM_CODE as the decision attribute, calculated the dynamic reducts for the training set, and generated rules. We then created the confusion matrix, but the overall result was only 19%. We then tried to remove the missing attribute values in the decision table by applying mean/mode fill. This removed the incompleteness, but the rules we now got did not give us any better results. The overall performance for all the tests is recorded in Tab .
Table 4.8: The results from the Electro Motor test.
We kept FM_CODE as decision attribute, and did five different tests. These were:
Table 4.9: The results from Gas Turbine test.
Experiments with the Gas Turbine set were not done in AutoClass.
After 2087 tries, AutoClass found four classes. The strength of each class is given in Tab , and the influence of each attribute is given in Tab .
Table 4.10: Strength and normalized weight for the classes in the Electro Motor dataset.
Table 4.11: The relative influence of each attribute in the Electro Motor dataset.
We tried running tests on a decision attribute with a smaller range. We selected SEVERITY as the new decision attribute. It has four different values: Critical - C, Degraded - D, Incipient - I and Unknown - U, whereas FM_CODE has a range of 42 different values. The goal of our search thus changed from finding the failure type to predicting the severity of the failures, given certain attributes. This time we only worked with the Gas Turbine set. The number of attributes we used was also reduced from 21 to 14, see Tab .
Table 4.12: Table of the attributes in Experiment III.
The attributes removed were either attributes where the value was equal, or missing, in most objects, e.g. IC_CODE, attributes where the value had little or no information, e.g. FCAUSE, or attributes where the value was implicitly given by other values, e.g. I_MANUFACT. We noticed that the value U was only used in six out of 1198 objects, so we removed those objects to reduce the range of SEVERITY. The distribution of SEVERITY is listed in Tab .
Table 4.13: The distribution of Severity.
The gas turbine set was randomly split into three different pairs of training and test sets, with 20-80%, 30-70% and 50-50% for training and testing, respectively. Facts about the set are shown in Tab .
Table 4.14: Facts on Experiment III data set.
This time we did three different tests, one for each of the splits. The results before pruning are shown in Tab .
Table 4.15: Results from Experiment III-a.
We decided to do the pruning more carefully to try to discover how many rules were needed to make good predictions. In Fig we have plotted the overall results versus the support for all of the three test sets, and in Tab we have listed the number of rules left after pruning away the rules with support less than 2, 5, 10, 25, 40 and 50.
Table 4.21: Number of rules left after pruning w.r.t. support.
Figure 4.5: Overall result versus the support for Experiment III-a.
Tab , Tab and Tab contain examples of the strongest reducts, i.e. the reducts with the greatest support, the strongest rules, and an example of a confusion matrix.
Table 4.17: Example of strong reducts.
Table 4.18: Example of strong rules.
Table 4.19: Example of confusion matrix.
The confusion matrices in Tab show that the decision C is well predicted, but the decisions D and I are more difficult to distinguish. This can be a problem if we want to identify every severity type. But if we want to distinguish the critical failures from the non-critical, it is possible to group D and I into one decision value O. This leaves 378 objects with decision C and 819 objects with decision O. We tried this, and the results are presented in Tab .
Table 4.20: Results from Experiment III-b.
Pruning was done on the rules generated from the normal dynamic reducts; the results are plotted in Fig and listed in Tab , in the same way as for the previous experiment. We also give examples of reducts, rules and confusion matrices, in Tab , Tab and Tab , respectively.
Table: Number of rules left after pruning w.r.t. support.
Figure 4.6: Overall result versus the support for Experiment III-b.
Table 4.22: Examples of strong reducts.
Table 4.23: Examples of strong rules.
Table 4.24: Example of a confusion matrix.
Experiments were conducted with 20%, 30% and 50% of the data as training set. Finally, the range of the decision attribute was reduced, as in Rosetta, and the experiment was run in AutoClass as well to see if this had any influence on the classes generated.
After 541 tries, AutoClass found nine classes. The strength of each class is given in Tab , and the influence of each attribute is given in Tab .
The classes were manually inspected, and identified after the name of the manufacturer. A description of each class is listed below:
Table 4.25: Strength and normalized weight for the classes in the Gas Turbine set with 20% training set.
Table 4.26: The relative influence of each attribute for the different training sets.
After 246 tries, AutoClass found seven classes. The strength of each class is given in Tab , and the influence of each attribute is given in Tab . We also tried to let AutoClass search for classes for five hours on the set, but this did not change the result.
The classes generated were:
Table 4.27: Strength and normalized weight for the classes in the Gas Turbine set, 30% training set.
After 139 tries, AutoClass found twelve classes. The strength of each class is given in Tab , and the influence of each attribute is given in Tab .
The classes generated were:
Table 4.28: Strength and normalized weight for the classes in the Gas Turbine set, 50% training set.
After 252 tries, AutoClass found nine classes. The strength of each class is given in Tab , and the influence of each attribute is given in Tab . The training set was 30%.
The classes generated were:
Table 4.29: Strength and normalized weight for the classes in the Gas Turbine set, two decisions.
In this chapter, we discuss the results from the experiments described in chapter . We will first evaluate results from Rosetta, then from AutoClass, and finally we compare the results with respect to the NOEMIE project.
The results obtained from Rosetta were only partly good. The first two experiments had rather low overall results, especially experiment II. Reasons for the low results in experiment I may be a combination of wrong attributes, missing attribute values and a too large range of the decision value. The electro motor part of experiment II suffered from too many missing attribute values. When they were removed, we were left with too few objects to get a good classification. In the gas turbine part, the same problems as in experiment I occurred.
Experiment III gave a much better overall result than the two previous ones. It may therefore be interesting to examine some of the reducts and rules generated here in greater detail. The reducts are presented in Tab and in Tab . The first observation to make is that the reducts are short: only one or two attributes in experiment III-a (three decisions), and two to four in experiment III-b (two decisions). This indicates that decisions can be drawn based on just a few attributes. We also notice that the attributes FM_MODE and FD_NO are present, either alone or together, in all of the reducts. This makes sense, since SEVERITY ought to depend on the failure type and description.
When we look at the rules in Tab and Tab , we first notice that there are many of them. They also generally have low support. This is not an advantage, because the rules are then somewhat specialized. When the weaker rules, the most specialized ones, were pruned away, the overall results dropped. This is illustrated in Fig and Fig . We see that the rules with the lowest support (<2) can be removed without losing classification power. This halves the number of rules. If we remove more rules, those with support <5, we reduce the rule set by approximately another 60%, but we also increase the number of unclassified objects. There is a trade-off between few, strong rules and many unclassified objects. But the quality of some of the strongest rules is high. In Tab the second and third rules are short and general. This is because, due to rules in the OREDA database, the SEVERITY of a failure of type AIR, FM_MODE(AIR), is always Incipient (I), and the SEVERITY of a failure of type FTS, FM_MODE(FTS), is always Critical (C). Rosetta did not have this background information, and it showed strength by finding these dependencies.
Below are some of the indications that we think can be drawn about Rosetta used on the OREDA database, based on the results we have obtained.
The number of attributes is not as important as which attributes to include. Irrelevant attributes can in some cases confuse Rosetta, but in most of the experiments Rosetta used only the most important attributes in the reducts and discarded the rest. We also found that there is a relationship between the number of attributes, their range, and the number of objects needed to make good rules. Thus, if we have many attributes, we must have more objects than with fewer attributes to generate good rules.
We found that in most cases the rules from Rosetta gave us an overfit problem; there were too many specialized rules. This problem could be corrected by applying pruning (a small sketch of support-based pruning is given after these points). We saw some indications, especially in experiment III, that the degree of overfit increased with increasing size of the training set.
We saw in experiments II and III that the range of decision values is important for the overall result. It is difficult for Rosetta to make classifications over too many different decisions. But, again, this depends much on the dataset.
We did not do tests that could indicate runtime complexity, nor did we compare the results against the time used for calculation. But we found that the time used for calculating reducts varied with the type of reduct. Dynamic reducts are much more computer intensive than normal reducts, and object-related reducts are more computer intensive than normal ones. Calculating the dynamic reducts took from 10 to 40 minutes. Normal reducts were calculated within a few minutes, sometimes in less than one minute.
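As referred to above, support-based pruning of a rule set can be sketched as follows (a simplified illustration with hypothetical rules; in Rosetta, pruning on support is applied to the generated rule set directly):

def prune_by_support(rules, min_support):
    # Keep only rules whose support reaches the threshold.
    return [r for r in rules if r["support"] >= min_support]

def coverage(rules, test_objects):
    # Fraction of test objects matched by at least one remaining rule.
    def matches(rule, obj):
        return all(obj.get(a) == v for a, v in rule["if"].items())
    return sum(any(matches(r, o) for r in rules) for o in test_objects) / len(test_objects)

rules = [{"if": {"fm": "AIR"}, "then": "I", "support": 40},
         {"if": {"fm": "FTS", "climate": "arctic"}, "then": "C", "support": 2}]
test = [{"fm": "AIR", "climate": "arctic"}, {"fm": "FTS", "climate": "arctic"}]
print(coverage(rules, test))                       # 1.0 before pruning
print(coverage(prune_by_support(rules, 5), test))  # 0.5 after pruning the weak rule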
AutoClass classifies the data into classes that seem logical to humans. In Experiment II, the relative influence in classifying objects is highest for the attribute POWER. However, it is easy to interpret the classes by the attribute with the second highest influence, the model. The equipment units having I_MODEL = 1535-121G and Avnon1525-121G are grouped together. The units are probably the same, but have been given two labels in OREDA.
We observe that GT35B-850 and KG5 are always grouped together. Both are of design class "industrial" (DC_CODE = IN), as are the TORNADO units, which are grouped together with the two when the training set is 20%.
The attribute with the highest relative influence in experiment I was the DESIGN CLASS attribute, which is the attribute we would expect a human to use to discern between the data.
Some indications that can be drawn from the experiments with AutoClass on OREDA are listed below:
We varied the number of attributes in the experiments from 14 to 21. The number of classes found was between four and eleven. Experiment I, with 18 attribute values, produced eight classes, while experiment II, with 21 attributes, generated between eight and eleven classes when experiments were done on different training set sizes. Experiment III generated four classes from a dataset with 14 attributes. It seems that the number of attributes has little influence on the number of classes produced.
The classes generated by AutoClass are described below for the different training set sizes:
It seems like more data leads to a more precise classification. We could expect that an unsupervised classifier would give an overfitted classification for a small data size, but the number of classes is smaller (8) when the training set is 20% than when it is 50% (11).
AutoClass produced seven classes for the Gas Turbine set with a 30% training set. When the range of SEVERITY was reduced from three to two attribute values, eight classes were produced. Both runs had two classes with high relative normalized weight (over 0.25), while the rest had low relative normalized weight (less than 0.12). The relative influence of SEVERITY increased from 0.034 to 0.049. Class 4 was split into two classes of equal size, Class 7 and Class 8. Class 7 contained objects having POWER=9250 and I_MODEL=1304-10, and Class 8 had objects containing POWER=6900 and I_MODEL = TORNADO. We were not able to discover any pattern in the SEVERITY attribute (it was distributed equally in the classes).
The new classification may have occurred because the datasets are slightly different in the two experiments. The whole data set was split into test and training sets with different random seeds, but this can probably not explain the split of the classes.
AutoClass did not produce any different results when we increased the search time from five minutes to five hours. We did not investigate further how much time AutoClass should search to find sufficiently good classes.
Rosetta and AutoClass differ in both appearance and underlying theory. The tools and methods are discussed in the context of the NOEMIE project. The comparison follows most of the criteria outlined in chapter .
Neither of the tools are robust (they are not commercial products). AutoClass crashed when trying to produce reports after analyzing certain data sets, while Rosetta crashed after some series of operations.
The output from Rosetta is rules, whereas AutoClass returns classes. The rules generated by Rosetta and the classes that AutoClass finds can have several usages in NOEMIE. Rules generated can be used to:
The AutoClass output requires interpretation, and is not usable in the current form except for human analysis.
Rosetta has a way of testing how good the generated rules are on a test set, while AutoClass does not have a decision attribute to test against and only uses the model it has learned to classify the test set.
Both methods are able to handle noise, but only as normal attribute values. Neither of the methods handles attribute values that are ``not applicable''. The user must be cautious when trying to fill in values for missing and ``not applicable'' attribute values, because this can create unreliable results.
As a logical and a statistical method, respectively, both are traceable: it is possible to find out how the tools discovered the rules or classes.
This chapter contains conclusions based on the theoretical descriptions of methods, experiments and the discussion. We also outline possible further work.
From the discussion in chapter , the following conclusions can be drawn after using Rosetta on the OREDA dataset:
For AutoClass, we can draw these conclusions after using it on OREDA:
Both tools can be used in the NOEMIE project, but for different purposes: AutoClass gives an understanding of the data by grouping them into classes, while the output of Rosetta can be used directly, as background knowledge in Case-Based Reasoning, and in making predictions about failures.
A central issue in the NOEMIE project will be to integrate the results from Data Mining with Case-Based Reasoning [3].
A case is defined as a previously experienced situation, and it is usually a problem situation with an interpretation and possibly annotations. Case-Based Reasoning methods identify a new problem and try to find past cases that are similar to the new one. Then the old cases are used to suggest a solution for the new problem. The solution is evaluated, and the system's knowledge base is updated with the new experience.
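A minimal sketch of the retrieval step, using a simple attribute-overlap similarity (hypothetical case structure and solutions, not the NOEMIE design):

def similarity(a, b):
    # Fraction of shared attributes on which two attribute-value descriptions agree.
    shared = set(a) & set(b)
    return sum(a[k] == b[k] for k in shared) / len(shared) if shared else 0.0

def retrieve(case_base, problem, k=1):
    # Return the k past cases whose problem description is most similar to the new problem.
    return sorted(case_base, key=lambda c: similarity(c["problem"], problem), reverse=True)[:k]

case_base = [{"problem": {"fm": "AIR", "climate": "arctic"}, "solution": "replace air filter"},
             {"problem": {"fm": "FTS", "climate": "arctic"}, "solution": "inspect start system"}]
print(retrieve(case_base, {"fm": "FTS", "climate": "temperate"}))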
Questions that will have to be investigated further are how the results from the Data Mining process are to be collected for use in Case-Based Reasoning. The system should be able to handle uncertain data. It could also use metadata to decide what information to include when constructing cases.
After having carried out the experiments in chapter , we see several topics that would be interesting to investigate further:
Data Mining (DM), or Knowledge Discovery in Databases (KDD) as it is also called, comprises many techniques and tools for finding information in large amounts of data. The method is an extension of data analysis methods, aimed at finding relationships across several relations. Some of these techniques are inductive learning, pattern identification, concept discovery, statistical clustering and Bayesian networks. The assignment is to evaluate some of the techniques in use and the tools associated with them. In particular, you will look at tools developed in European projects (e.g. INRECA, SODAS, STARS). The assignment is part of a larger EU project with SINTEF Tele og data and SINTEF Sikkerhet og pålitelighet, in addition to partners from France and Italy. You will work closely with the project's participants, take part in the further work of the project, and you will be able to continue with a diploma thesis.
Goal: Evaluate tools and techniques.
Below, we list the definitions we used for some of the terms in the field of Data Mining and Knowledge Discovery.
Below is a list of the resources on the Internet we encountered during the work on this project.
Below, we list the SQL queries we used to get the data sets from the OREDA data base.
select failure_event.ec_code, failure_event.su_code, fcause_no, sc_code, fd_no, fc_code,
       om_code, failure_event.inst_id, failure_event.i_id, mi_code, ic_code, oc_code,
       climate_code, dc_code, sys_code, op_code, ee_code, i_manufact, i_model,
       i_manufact_ctrl_sys, i_no_of_starts, i_subsea_cs, fm_code
from failure_event, failure_item, inventory, installation
where failure_event.i_id=inventory.i_id
  AND failure_event.owner_id=inventory.owner_id
  AND failure_event.inst_id=inventory.inst_id
  AND failure_event.f_id=failure_item.f_id
  AND failure_event.owner_id=failure_item.owner_id
  AND failure_event.inst_id=failure_item.inst_id
  AND failure_event.owner_id=installation.owner_id
  AND failure_event.inst_id=installation.inst_id;
% SQL for Electric Motors
select fe.inst_id, fcause_no, fd_no, mi_code, fi.su_code, mi_code, climate_code, ic_code,
       dc_code, sys_code, op_code, ee_code, i_manufact, i_model,
       T.inv_spec_char Driver_App, T2.inv_spec_char Power, T3.inv_spec_char Speed,
       T4.inv_spec_char Varspeed, T5.inv_spec_char Voltage, T6.inv_spec_char Motor,
       T7.inv_spec_char Safe, fm_code
from failure_event fe, failure_item fi, installation ins, inventory iv,
     inv_spec T, inv_spec T2, inv_spec T3, inv_spec T4, inv_spec T5, inv_spec T6, inv_spec T7
where fe.i_id=iv.i_id and iv.i_id=T.i_id and T.i_id=T2.i_id and T2.i_id=T3.i_id
  and T3.i_id=T4.i_id and T4.i_id=T5.i_id and T5.i_id=T6.i_id and T6.i_id=T7.i_id
  and fe.inst_id=iv.inst_id and iv.inst_id=ins.inst_id and ins.inst_id=fi.inst_id
  and fi.inst_id=T.inst_id and T.inst_id=T2.inst_id and T2.inst_id=T3.inst_id
  and T3.inst_id=T4.inst_id and T4.inst_id=T5.inst_id and T5.inst_id=T6.inst_id
  and T6.inst_id=T7.inst_id
  and T.attr_field='APPLIC1' and T2.attr_field='POWER' and T3.attr_field='SPEED'
  and T4.attr_field='VAR_SPEED' and T5.attr_field='VOLTAGE' and T6.attr_field='TYPE3'
  and T7.attr_field='SAFECLS'
  and fe.f_id=fi.f_id

% SQL for Gas Turbines
select fe.inst_id, fcause_no, fd_no, mi_code, fi.su_code, mi_code, climate_code, ic_code,
       dc_code, sys_code, op_code, ee_code, i_manufact, i_model, i_no_of_starts,
       T.inv_spec_char Driver_app, T2.inv_spec_char Power, T3.inv_spec_char Speed,
       T4.inv_spec_char Shafts, T5.inv_spec_char Fuel, fm_code
from failure_event fe, failure_item fi, installation ins, inventory iv,
     inv_spec T, inv_spec T2, inv_spec T3, inv_spec T4, inv_spec T5
where fe.i_id=iv.i_id and iv.i_id=T.i_id and T.i_id=T2.i_id and T2.i_id=T3.i_id
  and T3.i_id=T4.i_id and T4.i_id=T5.i_id
  and fe.inst_id=iv.inst_id and iv.inst_id=ins.inst_id and ins.inst_id=fi.inst_id
  and fi.inst_id=T.inst_id and T.inst_id=T2.inst_id and T2.inst_id=T3.inst_id
  and T3.inst_id=T4.inst_id and T4.inst_id=T5.inst_id
  and T.attr_field='APPLIC1' and T2.attr_field='POWER' and T3.attr_field='SPEED'
  and T4.attr_field='NUMBSHAFT' and T5.attr_field='FUEL'
  and fe.f_id=fi.f_id
% SQL for Gas Turbines - severity as decision
select fe.inst_id, fcause_no, fd_no, mi_code, fi.su_code, mi_code, climate_code, ic_code,
       dc_code, sys_code, op_code, ee_code, i_manufact, i_model, i_no_of_starts,
       T.inv_spec_char Driver_app, T2.inv_spec_char Power, T3.inv_spec_char Speed,
       T4.inv_spec_char Shafts, T5.inv_spec_char Fuel, fm_code, sc_code
from failure_event fe, failure_item fi, installation ins, inventory iv,
     inv_spec T, inv_spec T2, inv_spec T3, inv_spec T4, inv_spec T5
where fe.i_id=iv.i_id and iv.i_id=T.i_id and T.i_id=T2.i_id and T2.i_id=T3.i_id
  and T3.i_id=T4.i_id and T4.i_id=T5.i_id
  and fe.inst_id=iv.inst_id and iv.inst_id=ins.inst_id and ins.inst_id=fi.inst_id
  and fi.inst_id=T.inst_id and T.inst_id=T2.inst_id and T2.inst_id=T3.inst_id
  and T3.inst_id=T4.inst_id and T4.inst_id=T5.inst_id
  and T.attr_field='APPLIC1' and T2.attr_field='POWER' and T3.attr_field='SPEED'
  and T4.attr_field='NUMBSHAFT' and T5.attr_field='FUEL'
  and fe.f_id=fi.f_id
This appendix explains the codes used in some of the attribute values.
Table E.1: Description of the value codes for attribute Design Class, DC_CODE, which is a classification based on the design of an item.
Table E.2: Description of the value codes for attribute External Environment, EE_CODE, which is the local environment where the item is placed.
Table E.4: Description of the value codes for attribute Failure Mode, FM_CODE, which describes the type of failure that has occurred.