
Introduction

The purpose of this report is to investigate the integration of algorithms for Data Mining and Case-Based Reasoning. Reading this report requires some knowledge of artificial intelligence, database systems and statistics.

Data Mining (DM) has become a popular method for extracting information from large databases. During the last few years, new technology has reduced the cost of storing data, and better database management technology has made it easier to handle databases with gigabytes or terabytes of data. Most enterprises, organizations and governments now have huge databases, and the focus is thus changing from data collection to data analysis. This task is extremely hard, and sometimes impossible, for humans, given the size of many databases, the complexity of the data representation and the large number of dimensions. This has motivated the creation of algorithms that search for patterns in databases, called Data Mining algorithms. There are several kinds of algorithms, coming from different fields such as artificial intelligence, statistics and logic. Some methods are Inductive Logic Programming, Rough sets, Symbolic Data Analysis and Bayesian networks.

Case-Based Reasoning (CBR) is a method for solving problems by comparing a problem situation - a case - to previously experienced ones. The aim is to store information about earlier situations and, when a new one arrives, to find the most similar stored situation and reuse its solution, or revise that solution if the most similar situation does not match the new problem sufficiently. Revision may involve using background knowledge or asking a user. The Case-Based Reasoning system learns from each problem solving experience, with the aim of handling an increasing number of situations and of reasoning more thoroughly about each situation to verify that it is handled correctly.
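
The retrieve, reuse, revise and learn steps described above can be illustrated with a minimal Python sketch. It is not the system implemented in this report. The names Case, similarity, retrieve and solve, the inverse-distance similarity measure and the threshold value are all assumptions made for illustration, and every case is assumed to be described by the same numeric features.

```python
from dataclasses import dataclass


@dataclass
class Case:
    features: dict   # problem description: feature name -> numeric value (assumed)
    solution: str    # what was done in that situation


def similarity(a, b):
    """Inverse Euclidean distance over shared feature names, in (0, 1].

    Illustrative choice only; a real CBR system may use domain knowledge here.
    """
    keys = a.keys() & b.keys()
    dist = sum((a[k] - b[k]) ** 2 for k in keys) ** 0.5
    return 1.0 / (1.0 + dist)


def retrieve(case_base, problem):
    """Return the stored case most similar to the new problem, with its score."""
    best = max(case_base, key=lambda c: similarity(c.features, problem))
    return best, similarity(best.features, problem)


def solve(case_base, problem, threshold=0.5, revise=None):
    """Reuse the best case's solution, revise it on a weak match, then retain."""
    best, score = retrieve(case_base, problem)
    solution = best.solution
    if score < threshold and revise is not None:
        # Hypothetical hook: background knowledge or a user adapts the solution.
        solution = revise(best, problem)
    # Learning step: store the solved problem as a new case.
    case_base.append(Case(dict(problem), solution))
    return solution


case_base = [
    Case({"pressure": 1.0, "temp": 20.0}, "open valve"),
    Case({"pressure": 5.0, "temp": 80.0}, "shut down pump"),
]
close_match = solve(case_base, {"pressure": 1.0, "temp": 20.5})
weak_match = solve(
    case_base,
    {"pressure": 4.0, "temp": 70.0},
    revise=lambda case, problem: case.solution + " (revised)",
)
```

In this sketch the first problem lies close to a stored case, so its solution is reused unchanged, while the second falls below the threshold and is passed to the revise hook. Both solved problems are added to the case base, which is the simplest possible form of the learning step.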

So what is the motivation for integrating the two methods? Both methods are used for decision support: organizing and processing information so that it can be used to improve the quality of decisions. The decisions might be taken by humans within an organization or by a computer system. The mathematician Seymour Papert stated the following principle on mental growth for humans:

Some of the most crucial steps in mental growth are based not simply on acquiring new skills, but on acquiring new administrative ways to use what one already knows.

By integrating the methods, we hope to make better use of information, and that this can lead to similar growth for computer systems.

CBR relies heavily on the quality of the data collected, the amount of data, the amount of background knowledge and a way of comparing cases to decide which is most similar. The method is best suited for domains that change, and where we have little knowledge of the underlying processes that govern the domain. Data Mining is a way of extracting information from databases and can thus be used for extracting information which is relevant for a problem situation - a case. It could also be used to find "unexperienced" problem situations in a database other than the database of cases, and to represent them as cases, possibly by interacting with a user. Data Mining can infer rules, classifications and graphs from the data, which can be used as background knowledge in a CBR system and also to compute the similarity between cases. Conversely, some Data Mining algorithms require background knowledge, which can be taken from a CBR system.

CBR and DM can obviously be integrated in several ways. The aim of this report is to sketch different methods for integration, and to show that integration is both possible and worthwhile by exemplifying an integrated CBR-DM system. The system implemented uses a "deep" integration method, a kind of integration that has attracted little research attention.

At present, there are some integrated systems under development, like Case-Method, developed by the NEC corporation for handling corporate memory [21], and a system for forecasting epidemics, developed at the University of Rostock [11]. Most research has focused on integrating Bayesian networks with CBR. Microsoft has developed a prototype system, code-named "Aladdin", to diagnose problems in customer support [10]. A system named INBANCA (Integrating Bayes Networks with Case-Based Reasoning for Planning) for planning in a simulated soccer environment has been outlined in [5], but it has been discontinued due to lack of funding. At the University of Salford, a system using Bayesian Networks for indexing cases has been developed [28], and at the University of Helsinki, CBR and Bayesian Networks are used for classification [30]. These approaches are described further in chapter 4.

In this report, the scope of the integration is for use within the Esprit NOEMIE project [2]. The aim of the project is to prevent re-occurrence of unwanted events. Unwanted events are defined as "all events which may influence on production availability, costly repairs or implications on safety". Management must be able to prevent such events, or reduce the likelihood of their happening in the future. Prevention, or reduction of likelihood, will be done by utilizing large databases collected in the oilfield sector, structuring and analyzing the data using Data Mining and Case-Based Reasoning methods. The data used for experiments in this report are taken from the NOEMIE project, to solve problems in a scenario described by the Norwegian oil company Norsk Hydro.

The rest of this report is organized as follows: First, some of the terminology used is introduced, then the theoretical foundation for Case-Based Reasoning and Data Mining is established in chapters 2 and 3. The methods are decomposed in order to describe which parts of them can be integrated, and a method using Bayesian Networks for reasoning on cases is implemented and used for experiments as an example of an integrated CBR-DM method in chapter 4. Results of the new method are given in chapter 5, and are discussed and compared to other approaches in chapter 6. Conclusions are drawn and further work outlined in chapter 7.


Torgeir Dingsoyr
2/26/1998