In Case-Based Reasoning Integrations, notes from the 1998 Workshop, AAAI Press.

Retrieval of Cases by using a Bayesian Network

Torgeir Dingsøyr
Dept. of Computer and Information Science
Norwegian University of Science and Technology
7034 Trondheim, NORWAY
dingsoyr@idi.ntnu.no

Abstract:

A framework for integrating two methods for decision support, Case-Based Reasoning (CBR) and Data Mining (DM), is outlined. The integration approaches are divided according to which method is considered the master and which the slave. A system using a Bayesian network for computing similarity metrics is implemented and compared to a traditional CBR system. Data are taken from a database from the oil industry. The retrieved cases vary greatly between the systems, especially on features that are unspecified in the ``new case''. If many features of the ``new case'' are specified, the new system performs better, according to an evaluation by a domain expert.

Introduction

Data Mining and Case-Based Reasoning are methods used for decision support; that is, for organizing and processing information to make it available for improving the quality of decisions. It is likely that integration of the two methods will lead to better usage of information. Here, we give a quick introduction to the methods, briefly describe existing integrated systems, and outline a framework for different integration approaches. Then we describe an implemented system that uses a Bayesian network to compute similarity metrics for retrieving cases.

Data Mining (DM) has become a popular method for extracting information, in the form of patterns, from large databases. The patterns can be informative or predictive, and some DM methods are classification, regression, clustering, dependency modeling, and change and deviation analysis. The whole process of discovering knowledge in databases is referred to as Knowledge Discovery in Databases [Fayyad, Piatetsky-Shapiro, & Smyth1996].

Case-Based Reasoning (CBR) is a method for solving problems by comparing a problem situation - a case - to previously experienced ones. The aim is to store information about earlier situations and, when a new one arrives, retrieve the most similar stored situation and reuse it, or revise it to match the new problem if the most similar one does not match sufficiently. If the new case provides new insight, it should be retained. An introduction to CBR is given in [Aamodt & Plaza1994].
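As a rough sketch of this cycle (in Python), the control flow can be written as follows; the case representation, similarity measure, adaptation, and evaluation are caller-supplied placeholders here, not taken from any particular system:

    def cbr_cycle(new_case, case_base, similarity, adapt, evaluate):
        """One pass of the retrieve-reuse-revise-retain cycle (control flow only)."""
        # Retrieve: find the stored case most similar to the new problem.
        retrieved = max(case_base, key=lambda case: similarity(new_case, case))
        # Reuse: adapt the retrieved case's solution to the new problem.
        proposed = adapt(retrieved, new_case)
        # Revise: check the proposed solution against the new problem,
        # repairing it if it does not match sufficiently.
        solution = evaluate(proposed, new_case)
        # Retain: store the new experience (here unconditionally; a real
        # system would keep it only if it provides new insight).
        case_base.append((new_case, solution))
        return solution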

At present, there are some integrated systems under development, like Case-Method, developed by the NEC corporation for handling corporate memory [Kitano, Shimazu, & Shibata1993], and a system for forecasting epidemics, developed at the University of Rostock [Bull, Kundt, & Gierl1997]. Most of the research has been on integrating Bayesian networks with CBR. Microsoft has developed two prototype systems with the code name ``Aladdin'' to diagnose problems in customer support [Breese & Heckerman1995]. A system named INBANCA (Integrating Bayes networks with CBR for Planning) for planning in a simulated soccer environment has been outlined in [Aha & Chang1996]. At the University of Salford, a system using Bayesian networks for indexing cases has been developed [Rodriguez, Vadera, & Sucar1997], and at the University of Helsinki, CBR and Bayesian networks are used for classification [Tirri, Kontkanen, & Myllymäksi1996]. Tools for combining CBR and decision trees were developed in the Esprit project INRECA [Althof et al. 1995b].

Integration Framework

Integration of DM and CBR can be done with either method as the master and the other as the slave, depending on which method uses information provided by the other. First we outline methods for integration with CBR as the master, then with DM as the master. We assume that there exists a casebase (a database of cases) and, in some scenarios, also an external database from which further information can be mined.

With CBR as the master, we can:

With DM as the master:

We have now given the status of research on integrating DM and CBR and outlined a framework of different integration methods, which point to areas for further research. To demonstrate that integration of the two methods can lead to better usage of information, and to better results, we now describe an implemented prototype system which combines CBR with domain knowledge mined from a database in the form of a Bayesian network. The network is used for doing causal inference.

Specification of an Integrated System

To demonstrate an integrated system, we propose the following algorithm which we will denote CBRDM:

The similarity metric is then:

\begin{displaymath}
  Similarity(x,y) = -\sqrt{\sum_{i=1}^{p} f(x_{i},y_{i})}
\end{displaymath}

where x and y are two cases, p is the number of features, and f is defined in Figure 1.


  
Figure 1: The Function f Used in the Similarity Metric (piecewise definition; see the discussion under ``Why Differences Occur'').
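
Read as code, the metric negates the square root of summed per-feature terms. A minimal Python sketch, with f passed in as a parameter since its full definition from Figure 1 is not reproduced here:

    import math

    def similarity(x, y, f):
        """Similarity(x, y) = -sqrt(sum_i f(x_i, y_i)).

        x and y are two cases given as equal-length feature sequences,
        and f is the per-feature function of Figure 1 (supplied by the caller).
        """
        return -math.sqrt(sum(f(xi, yi) for xi, yi in zip(x, y)))

If f is read as a per-feature distance term that is zero for an exact match, a less negative similarity then corresponds to a more similar case.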

The procedure for constructing the Bayesian network and the casebase, and an overview of the system, are given in Figure 2. The system is described in further detail in [Dingsøyr1998].

To test the new system, we used a database from the oil industry, OREDA, which contains data about inventory items, failures, and maintenance on platform equipment. We focused on data about compressors, which gave 4646 cases described by nine features, some of which had missing values. We then used the Bayesian Knowledge Discoverer [Ramoni & Sebastini1997] to construct the Bayesian network, and the case-based reasoning tool KATE[*] to obtain results to compare with those of the new system (see [Althof et al. 1995a] for a description of this and other industrial CBR tools).


  
Figure 2: Architecture of the Proposed Integrated System (CBRDM).

We constructed 45 queries (``new cases''), which we divided into nine series. Each series thus consisted of five queries which had a similar number of feature values given, and where certain feature values were varied within the series. Other feature values were set to missing. The number of given feature values was increased for the later series.

An example part of a query is shown below (for the first five features):

 
Feature   1    2    3        4        5
Query1    NA   NNF  MATFAIL  GASLEAK  NA
KATE      2    NNF  MATFAIL  NA       MINOR
CBRDM     2    NFI  MATFAIL  NA       MINOR

The rows labeled KATE and CBRDM show the best case retrieved by each system, where ``NA'' denotes a feature value that is not available.

Results

The differences in the returned cases are shown in Table 1, which gives the number of features with different values when the cases retrieved by KATE and by CBRDM are compared to the query. It also shows the difference between the KATE and CBRDM results. For instance, the first CBRDM-KATE line in the table states that in Series 1 there is one case that differs by one feature between CBRDM and KATE, two cases that differ by four features, and two cases that differ by five features.


 
Table 1: Number of Differences Between Cases, for All Series.

                      Number of differing features
Series  Comparison    0  1  2  3  4  5  6  7  8  9
1       CBRDM-KATE    -  1  -  -  2  2  -  -  -  -
1       KATE-Query    2  3  -  -  -  -  -  -  -  -
1       CBRDM-Query   3  2  -  -  -  -  -  -  -  -
2       CBRDM-KATE    -  -  1  1  3  -  -  -  -  -
2       KATE-Query    -  5  -  -  -  -  -  -  -  -
2       CBRDM-Query   -  3  2  -  -  -  -  -  -  -
3       CBRDM-KATE    -  -  -  3  2  -  -  -  -  -
3       KATE-Query    -  5  -  -  -  -  -  -  -  -
3       CBRDM-Query   -  5  -  -  -  -  -  -  -  -
4       CBRDM-KATE    1  -  -  1  2  -  1  -  -  -
4       KATE-Query    -  5  -  -  -  -  -  -  -  -
4       CBRDM-Query   -  5  -  -  -  -  -  -  -  -
5       CBRDM-KATE    -  -  -  2  2  -  1  -  -  -
5       KATE-Query    -  5  -  -  -  -  -  -  -  -
5       CBRDM-Query   -  5  -  -  -  -  -  -  -  -
6       CBRDM-KATE    -  -  -  3  1  -  -  1  -  -
6       KATE-Query    -  5  -  -  -  -  -  -  -  -
6       CBRDM-Query   -  5  -  -  -  -  -  -  -  -
7       CBRDM-KATE    -  -  -  -  -  2  2  -  -  1
7       KATE-Query    1  3  1  -  -  -  -  -  -  -
7       CBRDM-Query   3  2  -  -  -  -  -  -  -  -
8       CBRDM-KATE    -  -  -  -  1  1  2  -  -  2
8       KATE-Query    3  -  2  -  -  -  -  -  -  -
8       CBRDM-Query   3  2  -  -  -  -  -  -  -  -
9       CBRDM-KATE    -  -  -  1  -  1  1  2  -  -
9       KATE-Query    -  1  1  2  1  -  -  -  -  -
9       CBRDM-Query   1  1  2  1  -  -  -  -  -  -

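The counts in Table 1 can be reproduced mechanically: for each query in a series, count the features on which two cases disagree, and tally a histogram. A minimal Python sketch (the treatment of missing values is an assumption, not something stated in the text):

    from collections import Counter

    def feature_differences(case_a, case_b):
        """Count the features of case_a (a dict of feature -> value) whose
        value differs from the corresponding value in case_b.  Features
        whose value is None ("NA") in case_a are skipped -- an assumption
        about how unspecified values were handled."""
        return sum(1 for feature, value in case_a.items()
                   if value is not None and case_b.get(feature) != value)

    def difference_histogram(case_pairs):
        """Tally how many pairs differ by 0, 1, 2, ... features.

        case_pairs: e.g. the (query, retrieved case) pairs of one series,
        or the (KATE result, CBRDM result) pairs for the CBRDM-KATE rows."""
        return Counter(feature_differences(a, b) for a, b in case_pairs)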
An expert on the OREDA database was given a list of the processed queries, with the information on which tool had produced which result removed. The order in which the results were presented was picked at random. The expert was asked to choose which of the two retrieved cases was most similar to the new case. The results from the expert are given in Figure 3, where a point was awarded if the result could be said to be better for one system. The expert did not evaluate the first four series.


  
Figure 3: KATE and CBRDM were Given a Point by a Domain Expert if the Result was Better for one Method.

Why Differences Occur

The differences between the query and the cases retrieved by CBRDM, and between the query and the cases retrieved by KATE, are never higher than four features; the average is 1.02, and 17 out of 45 retrievals are exact matches. The difference between the results from CBRDM and KATE is always larger. This indicates that the differences come mostly from feature values other than the ones given by the user in the ``new'' case.

Why does this occur? In the similarity metric for CBRDM, we add 1 for each feature value that matches. If a feature value does not match, we add the probability (less than 1) of having the matching value, given the rest of the feature values of the case. In that way, if we cannot find an exact hit, we choose the case that has the highest probability of occurring in the future (assuming that the frequency in the casebase approximates the probability when the number of cases is large).
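A minimal Python sketch of this contribution scheme follows. The probabilities are estimated here by conditional frequency counting in the casebase, standing in for the values CBRDM reads from the Bayesian network (the paper itself notes that casebase frequencies approximate these probabilities for a large casebase); the treatment of unspecified query values is an assumption:

    def conditional_probability(case_base, feature, wanted, evidence):
        """Estimate P(feature = wanted | evidence) by counting cases.

        evidence is a dict of the other feature values of the stored case.
        In CBRDM this probability comes from the learned Bayesian network;
        frequency counting over the casebase is only a stand-in here.
        """
        matching = [case for case in case_base
                    if all(case.get(f) == v for f, v in evidence.items())]
        if not matching:
            return 0.0
        return sum(1 for case in matching
                   if case.get(feature) == wanted) / len(matching)

    def cbrdm_score(query, case, case_base):
        """Sum the per-feature contributions described above: 1 for a
        matching feature value, otherwise the probability of the matching
        value given the rest of the stored case's feature values.
        Features unspecified (None / "NA") in the query are skipped."""
        score = 0.0
        for feature, wanted in query.items():
            if wanted is None:
                continue
            if case.get(feature) == wanted:
                score += 1.0
            else:
                evidence = {f: v for f, v in case.items()
                            if f != feature and v is not None}
                score += conditional_probability(case_base, feature,
                                                 wanted, evidence)
        return score

Retrieval then amounts to ranking stored cases by this score, for example max(case_base, key=lambda case: cbrdm_score(query, case, case_base)).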

The difference between the query and the retrieved cases is relatively stable over the different series, but the difference between the CBRDM and KATE results varies greatly with the series. This is not surprising, as there are nodes in the Bayesian network (not shown here) that are not connected, and other nodes that have several connections. The similarity contributions from features that have connections will be higher than from those without connections.

Evaluation of Results

Series 5 to 8 have approximately the same number of feature values given in the query, while more feature values are given in the queries of Series 9. It seems that CBRDM produces better results when more feature values are given in the query, and when the given feature values correspond to sub-nodes in the Bayesian network.

Conclusion

From the discussion, we can draw the following conclusions:

Acknowledgments

The work reported here was done in the Esprit NOEMIE project, with professor Agnar Aamodt at the Norwegian University of Science and Technology (NTNU) and professor Edwin Diday at Universite Dauphine as supervisors. I would like to thank Helge Langseth at Sintef, Moufida Massrali at Universite Dauphine and M. Letizia Jaccheri at NTNU for comments.

References

Aamodt & Plaza1994
Aamodt, A., and Plaza, E.
1994.
Case-based reasoning: Foundational issues, methodological variations, and system approaches.
AI Communications 7(1):39-59.

Aha & Chang1996
Aha, D. W., and Chang, L. W.
1996.
Cooperative bayesian and case-based reasoning for solving multiagent planning tasks.
Technical report, Navy Center for Applied Research in AI, Naval Research Laboratory, Washington, DC, USA.

Althof et al. 1995a
Althof, K.-D.; Auriol, E.; Barlette, R.; and Manago, M.
1995a.
A Review of Industrial Case-Based Reasoning Tools.
AI Intelligence.

Althof et al. 1995b
Althof, K.-D.; Auriol, E.; Traphöner, R.; and Wess, S.
1995b.
Inreca - a seamlessly integrated system based on inductive inference and case-based reasoning.
In Aamodt, A., and Veloso, M., eds., Case-Based Reasoning Research and Development, ICCBR-95, 371-380.

Breese & Heckerman1995
Breese, J. S., and Heckerman, D.
1995.
Decision-theoretic case-based reasoning.
In Proceedings of the Fifth International Workshop on Artificial Intelligence and Statistics, 56-63.

Bull, Kundt, & Gierl1997
Bull, M.; Kundt, G.; and Gierl, L.
1997.
Discovering of health risks and case-based forecasting of epidemics in a health surveillance system.
In Komorowski, J., and Zytkow, J., eds., Principles of Data Mining and Knowledge Discovery. Proceedings, 68-77.

Dingsøyr1998
Dingsøyr, T.
1998.
Integration of data mining and case-based reasoning.
Technical report, Norwegian University of Science and Technology, 7034 Trondheim, NORWAY.

Fayyad, Piatetsky-Shapiro, & Smyth1996
Fayyad, U. M.; Piatetsky-Shapiro, G.; and Smyth, P.
1996.
Advances in Knowledge Discovery and Data Mining.
AAAI/MIT Press.
chapter From Data Mining to Knowledge Discovery: An overview.

Kitano, Shimazu, & Shibata1993
Kitano, H.; Shimazu, H.; and Shibata, A.
1993.
Case-method: A methodology for building large-scale case-based systems.
In Proceedings of the AAAI, 303-308.

Ramoni & Sebastini1997
Ramoni, M., and Sebastini, P.
1997.
Discovering bayesian networks in incomplete databases.
Technical report, Knowledge Media Institute, The Open University.

Rodriguez, Vadera, & Sucar1997
Rodriguez, A. F.; Vadera, S.; and Sucar, L. E.
1997.
A probabilistic model for case-based reasoning.
In Smith, I., and Faltings, B., eds., Case-Based Reasoning Research and Development, ICCBR-97. Proceedings, 623-632.

Tirri, Kontkanen, & Myllymäksi1996
Tirri, H.; Kontkanen, P.; and Myllymäksi, P.
1996.
A bayesian framework for case-based reasoning.
In Smith, I., and Faltings, B., eds., Advances in Case-Based Reasoning, EWCBR-96, 413-427.

Appendix

1. Integration name/category: CBRDM / Integration of CBR and Data Mining.
2. Performance Task: Retrieval.
3. Integration Objective: Solution quality. Select cases that are evaluated as better by a domain expert.
4. Reasoning Components: Bayesian network.
5. Control Architecture: CBR as master.
6. CBR Cycle Step(s) Supported: Retrieval.
7. Representations: Cases.
8. Additional Reasoning Components: -
9. Integration Status: Proposed.
10. Priority future work: Application of the technique to experience databases for software process improvement.


Footnote: KATE from Acknosoft was selected because it is an industrial CBR tool and was easily available through the project NOEMIE, in which this work was conducted.
