<?xml version="1.0" ?>
<tei>
	<teiHeader>
		<fileDesc xml:id="-1"/>
	</teiHeader>
	<text xml:lang="en">
<front> Submitted to the Future Generation Computer Systems special issue on Data Mining. <lb/> Using Neural Networks for Data Mining <lb/> Mark W. Craven <lb/>School of Computer Science <lb/>Carnegie Mellon University <lb/>Pittsburgh, PA 15213-3891 <lb/> mark.craven@cs.cmu.edu <lb/> Jude W. Shavlik <lb/>Computer Sciences Department <lb/>University of Wisconsin-Madison <lb/>Madison, WI 53706-1685 <lb/> shavlik@cs.wisc.edu <lb/> Abstract <lb/> Neural networks have been successfully applied in a wide range of supervised and unsuper-<lb/>vised learning applications. Neural-network methods are not commonly used for data-mining <lb/>tasks, however, because they often produce incomprehensible models and require long training <lb/>times. In this article, we describe neural-network learning algorithms that are able to produce <lb/>comprehensible models, and that do not require excessive training times. Specifically, we discuss <lb/>two classes of approaches for data mining with neural networks. The first type of approach, <lb/>often called rule extraction, involves extracting symbolic models from trained neural networks. <lb/>The second approach is to directly learn simple, easy-to-understand networks. We argue that, <lb/>given the current state of the art, neural-network methods deserve a place in the tool boxes of <lb/>data-mining specialists. <lb/> Keywords: machine learning, neural networks, rule extraction, comprehensible <lb/>models, decision trees, perceptrons <lb/></front>
			
<body>1 Introduction <lb/> The central focus of the data-mining enterprise is to gain insight into large collections of data. <lb/>Often, achieving this goal involves applying machine-learning methods to inductively construct <lb/>models of the data at hand. In this article, we provide an introduction to the topic of using neural-<lb/>network methods for data mining. Neural networks have been applied to a wide variety of problem <lb/>domains to learn models that are able to perform such interesting tasks as steering a motor vehicle, <lb/>recognizing genes in uncharacterized DNA sequences, scheduling payloads for the space shuttle, <lb/>and predicting exchange rates. Although neural-network learning algorithms have been successfully <lb/>applied to a wide range of supervised and unsupervised learning problems, they have not often been <lb/>applied in data-mining settings, in which two fundamental considerations are the comprehensibility <lb/>of learned models and the time required to induce models from large data sets. We discuss new <lb/>developments in neural-network learning that effectively address the comprehensibility and speed <lb/>issues which often are of prime importance in the data-mining community. Specifically, we describe <lb/>algorithms that are able to extract symbolic rules from trained neural networks, and algorithms <lb/>that are able to directly learn comprehensible models. <lb/>Inductive learning is a central task in data mining since building descriptive models of a col-<lb/>lection of data provides one way of gaining insight into it. Such models can be learned by either <lb/>supervised or unsupervised methods, depending on the nature of the task. In supervised learning, <lb/>the learner is given a set of instances of the form $\langle \vec{x}, y \rangle$, where y represents the variable that we <lb/>want the system to predict, and $\vec{x}$ is a vector of values that represent features thought to be relevant <lb/>to determining y. The goal in supervised learning is to induce a general mapping from $\vec{x}$ vectors to <lb/>
y values. That is, the learner must build a model, $\hat{y} = f(\vec{x})$, of the unknown function f, that allows <lb/>it to predict y values for previously unseen examples. In unsupervised learning, the learner is also <lb/>given a set of training examples but each instance consists only of the $\vec{x}$ part; it does not include <lb/>the y value. The goal in unsupervised learning is to build a model that accounts for regularities in <lb/>the training set. <lb/>In both the supervised and unsupervised case, learning algorithms differ considerably in how <lb/>they represent their induced models. Many learning methods represent their models using languages <lb/>that are based on, or closely related to, logical formulae. Neural-network learning methods, on the <lb/>other hand, represent their learned solutions using real-valued parameters in a network of simple <lb/>processing units. We do not provide an introduction to neural-network models in this article, but <lb/>instead refer the interested reader to one of the good textbooks in the field (e.g., Bishop, 1996). <lb/>A detailed survey of real-world neural-network applications can be found elsewhere (Widrow et al., <lb/>1994). <lb/>The rest of this article is organized as follows. In the next section, we consider the applicability <lb/>of neural-network methods to the task of data mining. Specifically, we discuss why one might want <lb/>to consider using neural networks for such tasks, and we discuss why trained neural networks are <lb/>usually hard to understand. The two succeeding sections cover two different types of approaches for <lb/>learning comprehensible models using neural networks. Section 3 discusses methods for extracting <lb/> comprehensible models from trained neural networks, and Section 4 describes neural-network learn-<lb/>ing methods that directly learn simple, and hopefully comprehensible, models. Finally, Section 5 <lb/>provides conclusions. <lb/> 2 The Suitability of Neural Networks for Data Mining <lb/> Before describing particular methods for data mining with neural networks, we first make an ar-<lb/>gument for why one might want to consider using neural networks for the task. The essence of <lb/>the argument is that, for some problems, neural networks provide a more suitable inductive bias <lb/> than competing algorithms. Let us briefly discuss the meaning of the term inductive bias. Given <lb/>a fixed set of training examples, there are infinitely many models that could account for the data, <lb/>and every learning algorithm has an inductive bias that determines the models that it is likely <lb/>to return. There are two aspects to the inductive bias of an algorithm: its restricted hypothesis <lb/>space bias and its preference bias. The restricted hypothesis space bias refers to the constraints <lb/>that a learning algorithm places on the hypotheses that it is able to construct. For example, the <lb/>hypothesis space of a perceptron is limited to linear discriminant functions. The preference bias <lb/>of a learning algorithm refers to the preference ordering it places on the models that are within <lb/>its hypothesis space. For example, most learning algorithms initially try to fit a simple hypothesis <lb/>to a given training set and then explore progressively more complex hypotheses until they find an <lb/>acceptable fit. <lb/>In some cases, neural networks have a more appropriate restricted hypothesis space bias than <lb/>other learning algorithms.
For example, sequential and temporal prediction tasks represent a class of <lb/>problems for which neural networks often provide the most appropriate hypothesis space. Recurrent <lb/>networks, which are often applied to these problems, are able to maintain state information from <lb/>one time step to the next. This means that recurrent networks can use their hidden units to learn <lb/>derived features relevant to the task at hand, and they can use the state of these derived features <lb/>at one instant to help make a prediction for the next instance. <lb/>In other cases, neural networks are the preferred learning method not because of the class <lb/>of hypotheses that they are able to represent, but simply because they induce hypotheses that <lb/>

			<page>2 <lb/></page>

generalize better than those of competing algorithms. Several empirical studies have pointed out <lb/>that there are some problem domains in which neural networks provide superior predictive accuracy <lb/>to commonly used symbolic learning algorithms (e.g., Shavlik et al., 1991). <lb/>Although neural networks have an appropriate inductive bias for a wide range of problems, <lb/>they are not commonly used for data-mining tasks. As stated previously, there are two primary <lb/>explanations for this fact: trained neural networks are usually not comprehensible, and many <lb/>neural-network learning methods are slow, making them impractical for very large data sets. We <lb/>discuss these two issues in turn before moving on to the core part of the article. <lb/>The hypothesis represented by a trained neural network is defined by (a) the topology of the <lb/>network, (b) the transfer functions used for the hidden and output units, and (c) the real-valued <lb/>parameters associated with the network connections (i.e., the weights) and units (e.g., the biases <lb/>of sigmoid units). Such hypotheses are difficult to comprehend for several reasons. First, typical <lb/>networks have hundreds or thousands of real-valued parameters. These parameters encode the <lb/>relationships between the input features, $\vec{x}$, and the target value, y. Although single-parameter <lb/>encodings of this type are usually not hard to understand, the sheer number of parameters in a <lb/>typical network can make the task of understanding them quite difficult. Second, in multi-layer <lb/>networks, these parameters may represent nonlinear, nonmonotonic relationships between the input <lb/>features and the target values. Thus it is usually not possible to determine, in isolation, the effect <lb/>of a given feature on the target value, because this effect may be mediated by the values of other <lb/>features. <lb/>These nonlinear, nonmonotonic relationships are represented by the hidden units in a network, <lb/>which combine the inputs of multiple features, thus allowing the model to take advantage of depen-<lb/>dencies among the features. Hidden units can be thought of as representing higher-level, &quot;derived <lb/>features.&quot; Understanding hidden units is often difficult because they learn distributed representa-<lb/>tions. In a distributed representation, individual hidden units do not correspond to well understood <lb/>features in the problem domain. Instead, features which are meaningful in the context of the prob-<lb/>lem domain are often encoded by patterns of activation across many hidden units. Similarly, each <lb/>hidden unit may play a part in representing numerous derived features. <lb/>Now let us consider the issue of the learning time required for neural networks. The process of <lb/>learning, in most neural-network methods, involves using some type of gradient-based optimization <lb/>method to adjust the network&apos;s parameters. Such optimization methods iteratively execute two <lb/>basic steps: calculating the gradient of the error function (with respect to the network&apos;s adjustable <lb/>parameters), and adjusting the network&apos;s parameters in the direction suggested by the gradient. <lb/>Learning can be quite slow with such methods because the optimization procedure often involves a <lb/>large number of small steps, and the cost of calculating the gradient at each step can be relatively <lb/>expensive.
<lb/>One appealing aspect of many neural-network learning methods, however, is that they are <lb/> on-line algorithms, meaning that they update their hypotheses after every example is presented. <lb/>Because they update their parameters frequently, on-line neural-network learning algorithms often <lb/>converge much faster than batch algorithms. This is especially the case for large data sets. Often, <lb/>a reasonably good solution can be found in only one pass through a large training set! For this <lb/>reason, we argue that training-time performance of neural-network learning methods may often <lb/>be acceptable for data-mining tasks, especially given the availability of high-performance desktop <lb/>computers. <lb/>
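To make the batch/on-line contrast concrete, here is a minimal sketch (our own code, not from the article) of the two update styles for a linear unit trained by least squares; the on-line version applies a gradient step after every example rather than once per pass:

```python
import numpy as np

def batch_epoch(w, X, y, lr=0.1):
    """One batch step: a single update from the gradient averaged over all examples."""
    grad = X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
    return w - lr * grad

def online_epoch(w, X, y, lr=0.1):
    """One on-line pass: update the weights after every example."""
    for xi, yi in zip(X, y):
        w = w - lr * (xi @ w - yi) * xi
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true
w = online_epoch(np.zeros(5), X, y)
print(np.round(w, 2))   # close to w_true after a single pass through the data
```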

			<page>3 <lb/></page>

3 Extraction Methods <lb/> One approach to understanding a hypothesis represented by a trained neural network is to translate <lb/>the hypothesis into a more comprehensible language. Various approaches using this strategy have <lb/>been investigated under the rubric of rule extraction. In this section, we give an overview of various <lb/>rule-extraction approaches, and discuss a few of the successful applications of such methods. <lb/>The methods that we discuss in this section differ along several primary dimensions: <lb/> • Representation language: the language that is used by the extraction method to describe <lb/>the neural network&apos;s learned model. The languages that have been used by various methods <lb/>include conjunctive (if-then) inference rules, m-of-n rules, fuzzy rules, decision trees, and finite <lb/>state automata. <lb/> • Extraction strategy: the strategy used by the extraction method to map the model repre-<lb/>sented by the trained network into a model in the new representation language. Specifically, <lb/>how does the method explore a space of candidate descriptions, and what level of description <lb/>does it use to characterize the given neural network? That is, do the rules extracted by the <lb/>method describe the behavior of the network as a whole, the behavior of individual units in <lb/>the network, or something in between these two cases? We use the term global methods to <lb/>refer to the first case, and the term local methods to refer to the second case. <lb/> • Network requirements: the architectural and training requirements that the extraction <lb/>method imposes on neural networks. In other words, the range of networks to which the <lb/>method is applicable. <lb/>Throughout this section, as we describe various rule-extraction methods, we will evaluate them <lb/>with respect to these three dimensions. <lb/> 3.1 The Rule-Extraction Task <lb/> Figure 1 illustrates the task of rule extraction with a very simple network. This one-layer network <lb/>has five Boolean inputs and one Boolean output. Any network, such as this one, which has discrete <lb/>output classes and discrete-valued input features, can be exactly described by a finite set of symbolic <lb/> if-then rules, since there is a finite number of possible input vectors. The extracted symbolic rules <lb/>specify conditions on the input features that, when satisfied, guarantee a given output state. In our <lb/>example, we assume that the value false for a Boolean input feature is represented by an activation <lb/>of 0, and the value true is represented by an activation of 1. Also we assume that the output unit <lb/>employs a threshold function to compute its activation: <lb/> $a_y = 1$ if $\sum_i w_i a_i + \theta &gt; 0$, and $a_y = 0$ otherwise, <lb/> where $a_y$ is the activation of the output unit, $a_i$ is the activation of the ith input unit, $w_i$ is the <lb/>weight from the ith input to the output unit, and $\theta$ is the threshold parameter of the output <lb/>unit. We use $x_i$ to refer to the value of the ith feature, and $a_i$ to refer to the activation of the <lb/>corresponding input unit. For example, if $x_i$ = true then $a_i = 1$. <lb/>Figure 1 shows three conjunctive rules which describe the most general conditions under which <lb/>the output unit has an activation of unity. Consider the rule: <lb/> $y \leftarrow x_1 \wedge x_2 \wedge \neg x_5$. <lb/>

			<page> 4 <lb/></page>

Figure 1: A network and extracted rules. The network has five input units representing five Boolean features <lb/>(the weights shown in the figure are 6, 4, 4, 0, and −4 for inputs $x_1, \ldots, x_5$, with output threshold $\theta = -9$). <lb/>The rules describe the settings of the input features that result in the output unit having an activation of 1. <lb/>Extracted rules: $y \leftarrow x_1 \wedge x_2 \wedge x_3$; $y \leftarrow x_1 \wedge x_2 \wedge \neg x_5$; $y \leftarrow x_1 \wedge x_3 \wedge \neg x_5$. <lb/> This rule states that when $x_1$ = true, $x_2$ = true, and $x_5$ = false, then the output unit representing <lb/> y will have an activation of 1 (i.e., the network predicts y = true). To see that this is a valid rule, <lb/>consider that for the cases covered by this rule: <lb/> $a_1 w_1 + a_2 w_2 + a_5 w_5 + \theta = 1$. <lb/>Thus, the weighted sum exceeds zero. But what effect can the other features have on the output <lb/>unit&apos;s activation in this case? It can be seen that: <lb/> $0 \le a_3 w_3 + a_4 w_4 \le 4$. <lb/>No matter what values the features $x_3$ and $x_4$ have, the output unit will have an activation of 1. <lb/>Thus the rule is valid; it accurately describes the behavior of the network for those instances that <lb/>match its antecedent. To see that the rule is maximally general, consider that if we drop any one of <lb/>the literals from the rule&apos;s antecedent, then the rule no longer accurately describes the behavior of <lb/>the network. For example, if we drop the literal $\neg x_5$ from the rule, then for the examples covered <lb/>by the rule: <lb/> $-3 \le \sum_i a_i w_i + \theta \le 5$, <lb/>and thus the network does not predict that y = true for all of the covered examples. <lb/>So far, we have defined an extracted rule in the context of a very simple neural network. What <lb/>does a &quot;rule&quot; mean in the context of networks that have continuous transfer functions, hidden <lb/>units, and multiple output units? Whenever a neural network is used for a classification problem, <lb/>there is always an implicit decision procedure that is used to decide which class is predicted by <lb/>the network for a given case. In the simple example above, the decision procedure was simply to <lb/>predict y = true when the activation of the output unit was 1, and to predict y = false when it <lb/>was 0. If we used a logistic transfer function instead of a threshold function at the output unit, <lb/>then the decision procedure might be to predict y = true when the activation exceeds a specified <lb/>value, say 0.5. If we were using one output unit per class for a multi-class learning problem (i.e., <lb/>a problem with more than two classes), then our decision procedure might be to predict the class <lb/>associated with the output unit that has the greatest activation. In general, an extracted rule <lb/>(approximately) describes a set of conditions under which the network, coupled with its decision <lb/>procedure, predicts a given class. <lb/>
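The validity test sketched above amounts to a worst-case bound on the weighted sum. A small Python sketch (our own helper names, with the weights as we read them off Figure 1) makes the test explicit:

```python
def rule_is_valid(weights, theta, rule):
    """rule maps feature index -> required activation (1 for x_i, 0 for ~x_i).
    The rule is valid if the unit fires for *every* completion of the
    unspecified inputs, i.e. even in the worst case."""
    worst = theta
    for i, w in enumerate(weights):
        if i in rule:
            worst += w * rule[i]      # activation fixed by the rule
        else:
            worst += min(w, 0)        # free input: assume its worst-case value
    return worst > 0

weights, theta = [6, 4, 4, 0, -4], -9
print(rule_is_valid(weights, theta, {0: 1, 1: 1, 4: 0}))  # y <- x1 ^ x2 ^ ~x5: True
print(rule_is_valid(weights, theta, {0: 1, 1: 1}))        # drop ~x5: False
```

Running it confirms the discussion above: the full rule is valid, while dropping the literal on x5 admits covered examples for which the unit does not fire (the weighted sum can fall to −3).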

			<page>5 <lb/></page>

Figure 2: The local approach to rule extraction. A multi-layer neural network is decomposed into a set of single-<lb/>layer networks. Rules are extracted to describe each of the constituent networks, and the rule sets are combined to <lb/>describe the multi-layer network. <lb/>Extracted rules: $y \leftarrow h_1 \vee h_2 \vee h_3$; $h_1 \leftarrow x_1 \wedge x_2$; $h_2 \leftarrow x_2 \wedge x_3 \wedge x_4$; $h_3 \leftarrow x_5$. <lb/> Figure 3: A rule search space. Each node in the space represents a possible rule antecedent. Edges between nodes <lb/>indicate specialization relationships (in the downward direction). The thicker lines depict one possible search tree for <lb/>this space. <lb/> As discussed at the beginning of this section, one of the dimensions along which rule-extraction <lb/>methods can be characterized is their level of description. One approach is to extract a set of global <lb/> rules that characterize the output classes directly in terms of the inputs. An alternative approach <lb/>is to extract local rules by decomposing the multi-layer network into a collection of single-layer <lb/>networks. A set of rules is extracted to describe each individual hidden and output unit in terms of <lb/>the units that have weighted connections to it. The rules for the individual units are then combined <lb/>into a set of rules that describes the network as a whole. The local approach to rule extraction is <lb/>illustrated in Figure 2. <lb/> 3.2 Search-Based Rule-Extraction Methods <lb/> Many rule-extraction algorithms have set up the task as a search problem which involves exploring <lb/>a space of candidate rules and testing individual candidates against the network to see if they <lb/>are valid rules. In this section we consider both global and local methods which approach the <lb/>rule-extraction task in this way. <lb/>Most of these algorithms conduct their search through a space of conjunctive rules. Figure 3 <lb/>

			<page>6 <lb/></page>

shows a rule search space for a problem with three Boolean features. Each node in the tree <lb/>corresponds to the antecedent of a possible rule, and the edges indicate specialization relationships <lb/>(in the downward direction) between nodes. The node at the top of the graph represents the most <lb/>general rule (i.e., all instances are members of the class y), and the nodes at the bottom of the tree <lb/>represent the most specific rules, which cover only one example each. Unlike most search processes, <lb/>which continue until the first goal node is found, a rule-extraction search continues until all (or <lb/>most) of the maximally-general rules have been found. <lb/>Notice that rules with more than one literal in their antecedent have multiple ancestors in the <lb/>graph. Obviously, when exploring a rule space, it is inefficient for the search procedure to visit a <lb/>node multiple times. In order to avoid this inefficiency, we can impose an ordering on the literals, <lb/>thereby transforming the search graph into a tree. The thicker lines in Figure 3 depict one possible <lb/>search tree for the given rule space. <lb/>One of the problematic issues that arises in search-based approaches to rule extraction is that <lb/>the size of the rule space can be very large. For a problem with n binary features, there are $3^n$ <lb/> possible conjunctive rules (since each feature can be absent from a rule antecedent, or it can occur <lb/>as a positive or a negative literal in the antecedent). To address this issue, a number of heuristics <lb/>have been employed to limit the combinatorics of the rule-exploration process. <lb/>Several rule-extraction algorithms manage the combinatorics of the task by limiting the number <lb/>of literals that can be in the antecedents of extracted rules (Saito &amp; Nakano, 1988; Gallant, 1993). <lb/>For example, Saito and Nakano&apos;s algorithm uses two parameters, $k_{pos}$ and $k_{neg}$, that specify the <lb/>maximum number of positive and negative literals respectively that can be in an antecedent. By <lb/>restricting the search to a depth of k, the rule space considered is limited to a size given by the <lb/>following expression: <lb/> $\sum_{i=0}^{k} \binom{n}{i}\, 2^i$. <lb/> For fixed k, this expression is polynomial in n, but obviously, it is exponential in the depth k. This <lb/>means that exploring a space of rules might still be intractable since, for some networks, it may be <lb/>necessary to search deep in the tree in order to find valid rules. <lb/>The second heuristic employed by Saito and Nakano is to limit the search to combinations of <lb/>literals that occur in the training set used for the network. Thus, if the training set did not contain <lb/>an example for which $x_1$ = true and $x_2$ = true, then the rule search would not consider the rule <lb/> $y \leftarrow x_1 \wedge x_2$ or any of its specializations. <lb/>Exploring a space of candidate rules is only one part of the task for a search-based rule-extraction <lb/>method. The other part of the task is testing candidate rules against the network. The method <lb/>developed by Gallant operates by propagating activation intervals through the network. The first <lb/>step in testing a rule using this method is to set the activations of the input units that correspond <lb/>to the literals in the candidate rule. The next step is to propagate activations through the network.
<lb/>The key idea of this second step, however, is the assumption that input units whose activations are <lb/>not specified by the rule could possibly take on any allowable value, and thus intervals of activations <lb/>are propagated to the units in the next layer. Effectively, the network computes, for the examples <lb/>covered by the rule, the range of possible activations in the next layer. Activation intervals are <lb/>then further propagated from the hidden units to the output units. At this point, the range of <lb/>possible activations for the output units can be determined and the procedure can decide whether <lb/>to accept the rule or not. Although this algorithm is guaranteed to accept only rules that are valid, <lb/>it may fail to accept maximally general rules, and instead may return overly specific rules. The <lb/>reason for this deficiency is that in propagating activation intervals from the hidden units onward, <lb/>the procedure assumes that the activations of the hidden units are independent of one another. In <lb/>

			<page>7 <lb/></page>

most networks this assumption is unlikely to hold. <lb/>Thrun (1995) developed a method called validity interval analysis (VIA) that is a generalized <lb/>and more powerful version of this technique. Like Gallant&apos;s method, VIA tests rules by propagating <lb/>activation intervals through a network after constraining some of the input and output units. <lb/>The key difference is that Thrun frames the problem of determining validity intervals (i.e., valid <lb/>activation ranges for each unit) as a linear programming problem. This is an important insight <lb/>because it allows activation intervals to be propagated backward, as well as forward through the <lb/>network, and it allows arbitrary linear constraints to be incorporated into the computation of <lb/>validity intervals. Backward propagation of activation intervals enables the calculation of tighter <lb/>validity intervals than forward propagation alone. Thus, Thrun&apos;s method will detect valid rules that <lb/>Gallant&apos;s algorithm is not able to confirm. The ability to incorporate arbitrary linear constraints <lb/>into the extraction process means that the method can be used to test rules that specify very <lb/>general conditions on the output units. For example, it can extract rules that describe when one <lb/>output unit has a greater activation than all of the other output units. Although the VIA approach <lb/>is better at detecting general rules than Gallant&apos;s algorithm, it may still fail to confirm maximally <lb/>general rules, because it also assumes that the hidden units in a layer act independently. <lb/>The rule-extraction methods we have discussed so far extract rules that describe the behavior <lb/>of the output units in terms of the input units. Another approach to the rule-extraction problem is <lb/>to decompose the network into a collection of networks, and then to extract a set of rules describing <lb/>each of the constituent networks. <lb/>There are a number of local rule-extraction methods for networks that use sigmoidal transfer <lb/>functions for their hidden and output units. In these methods, the assumption is made that <lb/>the hidden and output units can be approximated by threshold functions, and thus each unit <lb/>can be described by a binary variable indicating whether it is &quot;on&quot; (activation ≈ 1) or &quot;off&quot; <lb/>(activation ≈ 0). Given this assumption, we can extract a set of rules to describe each individual <lb/>hidden and output unit in terms of the units that have weighted connections to it. The rules for <lb/>each unit can then be combined into a single rule set that describes the network as a whole. <lb/>If the activations of the input and hidden units in a network are limited to the interval [0, 1], <lb/>then the local approach can significantly simplify the rule search space. The key fact that simplifies <lb/>the search combinatorics in this case is that the relationship between any input to a unit and its <lb/>output is a monotonic one. That is, we can look at the sign of the weight connecting the ith input <lb/>to the unit of interest to determine how this variable influences the activation of the unit. If the sign <lb/>is positive, then we know that this input can only push the unit&apos;s activation towards 1, it cannot <lb/>push it away from 1. Likewise, if the sign of the weight is negative, then the input can only push <lb/>the unit&apos;s activation away from 1.
Thus, if we are extracting rules to explain when the unit has an <lb/>activation of 1, we need to consider $\neg x_i$ literals only for those inputs $x_i$ that have negative weights, <lb/>and we need to consider non-negated $x_i$ literals only for those inputs that have positive weights. When <lb/>a search space is limited to including either $x_i$ or $\neg x_i$, but not both, the number of rules in the <lb/>space is $2^n$ for a task with n binary features. Recall that when this monotonicity condition does <lb/>not hold, the size of the rule space is $3^n$. <lb/>Figure 4 shows a rule search space for the network in Figure 1. The shaded nodes in the graph <lb/>correspond to the extracted rules shown in Figure 1. Note that this tree exploits the monotonicity <lb/>condition, and thus does not show all possible conjunctive rules for the network. <lb/>A number of research groups have developed local rule-extraction methods that search for <lb/>conjunctive rules (Fu, 1991; Gallant, 1993; Sethi &amp; Yoo, 1994). Like the global methods described <lb/>previously, the local methods developed by Fu and Gallant manage search combinatorics by limiting <lb/>the depth of the rule search. When the monotonicity condition holds, the number of rules considered <lb/>

			<page>8 <lb/></page>

Figure 4: A search tree for the network in Figure 1. Each node in the space represents a possible rule <lb/>antecedent. Edges between nodes indicate specialization relationships (in the downward direction). Shaded nodes <lb/>correspond to the extracted rules shown in Figure 1. <lb/> in a search of depth k is bounded above by: <lb/> $\sum_{i=0}^{k} \binom{n}{i}$. <lb/> There is another factor that simplifies the rule search when the monotonicity condition is true. <lb/>Because the relationship between each input and the output unit in a perceptron is specified by a <lb/>single parameter (i.e., the weight on the connection between the two), we know not only the sign of <lb/>the input&apos;s contribution to the output, but also the possible magnitude of the contribution. This <lb/>information can be used to order the search tree in a manner that can save effort. For example, <lb/>when searching the rule space for the network in Figure 1, after determining that $y \leftarrow x_1$ is not <lb/>a valid rule, we do not have to consider other rules that have only one literal in their antecedent. <lb/>Since the weight connecting $x_1$ to the output unit is larger than the weight connecting any other <lb/>input unit, we can conclude that if $x_1$ alone cannot guarantee that the output unit will have an <lb/>activation of 1, then no other single input unit can do it either. Sethi and Yoo (1994) have shown <lb/>that, when this heuristic is employed, the number of nodes explored in the search is: <lb/> $O\!\left(\sqrt{2/(\pi n)}\; 2^n\right)$. <lb/> Notice that even with this heuristic, the number of nodes that might need to be visited in the <lb/>search is still exponential in the number of variables. <lb/>It can be seen that one advantage of local search-based methods, in comparison to global <lb/>methods, is that the worst-case complexity of the search is less daunting. Another advantage of <lb/>local methods is that the process of testing candidate rules is simpler. <lb/>A local method developed by Towell and Shavlik (1993) searches not for conjunctive rules, but <lb/>instead for rules that include m-of-n expressions. An m-of-n expression is a Boolean expression <lb/>that is specified by an integer threshold, m, and a set of n Boolean literals. Such an expression <lb/>is satisfied when at least m of its n literals are satisfied. For example, suppose we have three <lb/>Boolean features, $x_1$, $x_2$, and $x_3$; the m-of-n expression 2-of-$\{x_1, \neg x_2, x_3\}$ is logically equivalent <lb/>to $(x_1 \wedge \neg x_2) \vee (x_1 \wedge x_3) \vee (\neg x_2 \wedge x_3)$. <lb/>
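The pieces discussed above (the monotonicity restriction, the depth limit, and the worst-case validity test) fit together in a short search loop. The sketch below (our own code, reusing the weights we read off Figure 1) enumerates antecedents over the monotonicity-restricted literals up to a depth limit and keeps the maximally general valid rules; on this network it recovers exactly the three rules of Figure 1:

```python
from itertools import combinations

def candidate_literals(weights):
    """Monotonicity: a positive weight admits only x_i, a negative weight only ~x_i,
    so each input contributes at most one literal (a 2^n space instead of 3^n)."""
    return [(i, 1 if w > 0 else 0) for i, w in enumerate(weights) if w != 0]

def worst_case(weights, theta, rule):
    """Smallest possible weighted sum over all completions of the free inputs."""
    fixed = dict(rule)
    return theta + sum(w * fixed[i] if i in fixed else min(w, 0)
                       for i, w in enumerate(weights))

def extract_rules(weights, theta, max_depth):
    """Return maximally general valid rules with at most max_depth literals."""
    lits, found = candidate_literals(weights), []
    for k in range(1, max_depth + 1):
        for combo in combinations(lits, k):
            if any(set(r) <= set(combo) for r in found):
                continue                   # already implied by a more general rule
            if worst_case(weights, theta, combo) > 0:
                found.append(combo)
    return found

for rule in extract_rules([6, 4, 4, 0, -4], -9, max_depth=3):
    print(" ^ ".join(f"x{i+1}" if v else f"~x{i+1}" for i, v in rule))
```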

			<page>9 <lb/></page>

There are two advantages to extracting m-of-n rules instead of conjunctive rules. The first <lb/>advantage is that m-of-n rule sets are often much more concise and comprehensible than their <lb/>conjunctive counterparts. The second advantage is that, when using a local approach, the combi-<lb/>natorics of the rule search can be simplified. The approach developed by Towell and Shavlik extracts <lb/> m-of-n rules for a unit by first clustering weights and then treating weight clusters as equivalence <lb/>classes. This clustering reduces the search problem from one defined by n weights to one defined <lb/>by c ($c \ll n$) clusters. This approach, which assumes that the weights are fairly well clustered after <lb/>training, was initially developed for knowledge-based neural networks (Towell &amp; Shavlik, 1993), in <lb/>which the initial weights of the network are specified by a set of symbolic inference rules. Since <lb/>they correspond to symbolic rules, the weights in these networks are initially well clustered, and <lb/>empirical results indicate that the weights remain fairly clustered after training. The applicability <lb/>of this approach was later extended to ordinary neural networks by using a special cost function <lb/>for network training (Craven &amp; Shavlik, 1993). <lb/> 3.3 A Learning-Based Rule-Extraction Method <lb/> In contrast to the previously discussed methods, we have developed a rule-extraction algorithm <lb/>called Trepan (Craven &amp; Shavlik, 1996; Craven, 1996), that views the problem of extracting a <lb/>comprehensible hypothesis from a trained network as an inductive learning task. In this learning <lb/>task, the target concept is the function represented by the network, and the hypothesis produced <lb/>by the learning algorithm is a decision tree that approximates the network. <lb/> Trepan differs from other rule-extraction methods in that it does not directly test hypothesized <lb/>rules against a network, nor does it translate individual hidden and output units into rules. Instead, <lb/> Trepan&apos;s extraction process involves progressively refining a model of the entire network. The <lb/>model, in this case, is a decision tree which is grown in a best-first manner. <lb/>The Trepan algorithm, as shown in Table 1, is similar to conventional decision-tree algorithms, <lb/>such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993), which learn directly from a training <lb/>set. These algorithms build decision trees by recursively partitioning the input space. Each internal <lb/>node in such a tree represents a splitting criterion that partitions some part of the input space, and <lb/>each leaf represents a predicted class. <lb/>As Trepan grows a tree, it maintains a queue of leaves which are expanded into subtrees as <lb/>they are removed from the queue. With each node in the queue, Trepan stores (i) a subset of <lb/>the training examples, (ii) another set of instances which we shall refer to as query instances, and <lb/>(iii) a set of constraints. The stored subset of training examples consists simply of those examples <lb/>that reach the node. The query instances are used, along with the training examples, to select the <lb/>splitting test if the node is an internal node, or to determine the class label if it is a leaf. The <lb/>constraint set describes the conditions that instances must satisfy in order to reach the node; this <lb/>information is used when drawing a set of query instances for a newly created node.
<lb/>Although Trepan has many similarities to conventional decision-tree algorithms, it is substan-<lb/>tially different in a number of respects, which we detail below. <lb/> Membership Queries and the Oracle. When inducing a decision tree to describe the given <lb/>network, Trepan takes advantage of the fact that it can make membership queries. A membership <lb/>query is a question to an oracle that consists of an instance from the learner&apos;s instance space. Given <lb/>a membership query, the role of the oracle is to return the class label for the instance. Recall that, <lb/>in this context, the target concept we are trying to learn is the function represented by the network. <lb/>Thus, the network itself serves as the oracle, and to answer a membership query it simply classifies <lb/>the given instance. <lb/>The instances that Trepan uses for membership queries come from two sources. First, the <lb/>

			<page>10 <lb/></page>

Table 1: The Trepan algorithm. <lb/>
Trepan <lb/>
Input: Oracle(), training set S, feature set F, min_sample
  initialize the root of the tree, R, as a leaf node
  /* get a sample of instances */
  use S to construct a model M_R of the distribution of instances covered by node R
  q := max(0, min_sample − |S|)
  query_instances_R := a set of q instances generated using model M_R
  /* use the network to label all instances */
  for each example x ∈ (S ∪ query_instances_R)
      class label for x := Oracle(x)
  /* do a best-first expansion of the tree */
  initialize Queue with tuple ⟨R, S, query_instances_R, {}⟩
  while Queue is not empty and global stopping criteria not satisfied
      /* make node at head of Queue into an internal node */
      remove ⟨node N, S_N, query_instances_N, constraints_N⟩ from head of Queue
      use F, S_N, and query_instances_N to construct a splitting test T at node N
      /* make children nodes */
      for each outcome, t, of test T
          make C, a new child node of N
          constraints_C := constraints_N ∪ {T = t}
          /* get a sample of instances for the node C */
          S_C := members of S_N with outcome t on test T
          construct a model M_C of the distribution of instances covered by node C
          q := max(0, min_sample − |S_C|)
          query_instances_C := a set of q instances generated using model M_C and constraints_C
          for each example x ∈ query_instances_C
              class label for x := Oracle(x)
          /* make node C a leaf for now */
          use S_C and query_instances_C to determine class label for C
          /* determine if node C should be expanded */
          if local stopping criteria not satisfied then
              put ⟨C, S_C, query_instances_C, constraints_C⟩ into Queue
Return: tree with root R <lb/>
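The pseudocode above leaves Oracle() and the distribution models M abstract. A minimal Python sketch of those two ingredients (our own reading and interface assumptions; Trepan itself fits richer distribution models and uses m-of-n splits) might look like this:

```python
import random

def oracle(network, x):
    """Membership query: the trained network itself labels the instance."""
    return network.predict(x)

def draw_query_instances(training_rows, constraints, q, rng=random.Random(0)):
    """Sample q instances from independent per-feature marginals, rejecting
    any instance that violates the node's path constraints."""
    instances = []
    for _ in range(1000 * q):              # rejection sampling with a bail-out
        if len(instances) == q:
            break
        x = [rng.choice([row[j] for row in training_rows])
             for j in range(len(training_rows[0]))]
        if all(x[j] == v for j, v in constraints):
            instances.append(x)
    return instances

class MajorityNet:       # toy stand-in "network" so the sketch runs end to end
    def predict(self, x):
        return int(sum(x) >= 2)

data = [[0, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 1]]
net = MajorityNet()
queries = draw_query_instances(data, constraints=[(0, 1)], q=5)
print([(x, oracle(net, x)) for x in queries])
```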

			<page> 11 <lb/></page>

examples that were used to train the network are used as membership queries. Second, Trepan <lb/> also uses the training data to construct models of the underlying data distribution, and then uses <lb/>these models to generate new instances (the query instances) for membership queries. The ability <lb/>to make membership queries means that whenever Trepan selects a splitting test for an internal <lb/>node or selects a class label for a leaf, it is able to base these decisions on large samples of data. <lb/> Tree Expansion. Unlike most decision-tree algorithms, which grow trees in a depth-first <lb/>manner, Trepan grows trees using a best-first expansion. The notion of the best node, in this <lb/>case, is the one at which there is the greatest potential to increase the fidelity of the extracted tree <lb/>to the network. By fidelity, we mean the extent to which the tree agrees with the network in its <lb/>classifications. The function used to evaluate node N is: <lb/> f(N) = reach(N) × (1 − fidelity(N)), <lb/> where reach(N) is the estimated fraction of instances that reach N when passed through the <lb/>tree, and fidelity(N) is the estimated fidelity of the tree to the network for those instances. The <lb/>motivation for expanding an extracted tree in a best-first manner is that it gives the user a fine <lb/>degree of control over the size of the tree to be returned: the tree-expansion process can be stopped <lb/>at any point. <lb/> Splitting Tests. Like some of the rule-extraction methods discussed earlier, Trepan exploits <lb/> m-of-n expressions to produce more compact extracted descriptions. Specifically, Trepan uses a <lb/>heuristic search process to construct m-of-n expressions for the splitting tests at its internal nodes <lb/>in a tree. <lb/> Stopping Criteria. Trepan uses three criteria to decide when to stop growing an extracted <lb/>tree. First, Trepan uses a statistical test to decide if, with high probability, a node covers only <lb/>instances of a single class. If it does, then Trepan does not expand this node further. Second, <lb/> Trepan employs a parameter that allows the user to place a limit on the size of the tree that <lb/> Trepan can return. This parameter, which is specified in terms of internal nodes, gives the user <lb/>some control over the comprehensibility of the tree produced by enabling a user to specify the <lb/>largest tree that would be acceptable. Third, Trepan can use a validation set, in conjunction <lb/>with the size-limit parameter, to decide on the tree to be returned. Since Trepan grows trees in <lb/>a best-first manner, it can be thought of as producing a nested sequence of trees in which each <lb/>tree in the sequence differs from its predecessor only by the subtree that corresponds to the node <lb/>expanded at the last step. When given a validation set, Trepan uses it to measure the fidelity of <lb/>each tree in this sequence, and then returns the tree that has the highest level of fidelity to the <lb/>network. <lb/>The principal advantages of the Trepan approach, in comparison to other rule-extraction <lb/>methods, are twofold. First, Trepan can be applied to a wide class of networks. The generality of <lb/> Trepan derives from the fact that its interaction with the network consists solely of membership <lb/>queries. Since answering a membership query involves simply classifying an instance, Trepan does <lb/>not require a special network architecture or training method. In fact, Trepan does not even <lb/>require that the model be a neural network.
Trepan can be applied to a wide variety of hard-to-<lb/>understand models including ensembles (or committees) of classifiers that act in concert to produce <lb/>predictions. <lb/>The other principal advantage of Trepan is that it gives the user fine control over the complexity <lb/>of the hypotheses returned by the rule-extraction process. This capability derives from the fact <lb/>that Trepan represents its extracted hypotheses using decision trees, and it expands these trees <lb/>in a best-first manner. Trepan first extracts a very simple (i.e., one-node) description of a trained <lb/>network, and then successively refines this description to improve its fidelity to the network. In <lb/>

			<page>12 <lb/></page>

Table 2: Test-set accuracy (%) and tree complexity (# feature references). <lb/>
                                        accuracy                  tree complexity
problem domain                      networks  Trepan  C4.5       Trepan   C4.5
protein-coding region recognition     94.1     93.1   90.4        70.5   153.3
heart-disease diagnosis               84.5     83.2   74.6        24.4    15.5
promoter recognition                  90.6     87.4   85.0       105.5     7.0
telephone-circuit fault diagnosis     65.3     63.3   60.7        26.3    35.0
exchange-rate prediction              61.6     60.6   54.6        14.0    53.0
<lb/>this way, Trepan explores increasingly more complex, but higher-fidelity, descriptions of the given <lb/>network. <lb/> Trepan has been used to extract rules from networks trained in a wide variety of problem <lb/>domains including: gene and promoter identification in DNA, telephone-circuit fault diagnosis, <lb/>exchange-rate prediction, and elevator control. Table 2 shows test-set accuracy and tree complexity <lb/>results for five such problem domains. The table shows the predictive accuracy of feed-forward <lb/>neural networks, decision trees extracted from the networks using Trepan, and decision trees <lb/>learned directly from the data using the C4.5 algorithm (Quinlan, 1993). It can be seen that, for <lb/>every data set, neural networks provide better predictive accuracy than the decision trees learned <lb/>by C4.5. This result indicates that these are domains for which neural networks have a more <lb/>suitable inductive bias than C4.5. Indeed, these problem domains were selected for this reason, <lb/>since it is in cases where neural networks provide superior predictive accuracy to symbolic learning <lb/>approaches that it makes sense to apply a rule-extraction method. Moreover, for all five domains, <lb/>the trees extracted from the neural networks by Trepan are more accurate than the C4.5 trees. <lb/>This result indicates that in a wide range of problem domains in which neural networks provide <lb/>better predictive accuracy than conventional decision-tree algorithms, Trepan is able to extract <lb/>decision trees that closely approximate the hypotheses learned by the networks, and thus provide <lb/>superior predictive accuracy to trees learned directly by algorithms such as C4.5. <lb/> The two rightmost columns in Table 2 show tree complexity measurements for the trees produced <lb/>by Trepan and C4.5 in these domains. The measure of complexity used here is the number of <lb/> feature references used in the splitting tests in the trees. An ordinary, single-feature splitting test, <lb/>like those used by C4.5, is counted as one feature reference. An m-of-n test, like those used at <lb/>times by Trepan, is counted as n feature references, since such a split lists n feature values. We <lb/>contend that this measure of syntactic complexity is a good indicator of the comprehensibility of <lb/>trees. The results in this table indicate that, in general, the trees produced by the two algorithms <lb/>are roughly comparable in terms of size. The results presented in this table are described in greater <lb/>detail elsewhere (Craven, 1996). <lb/> 3.4 Finite State Automata Extraction Methods <lb/> One specialized case of rule extraction is the extraction of finite state automata (FSA) from recur-<lb/>rent neural networks. A recurrent network is one that has links from a set of its hidden or output <lb/>units to a set of its input units. Such links enable recurrent networks to maintain state information <lb/>from one input instance to the next.
Like a finite state automaton, each time a recurrent network <lb/>is presented with an instance, it calculates a new state which is a function of both the previous <lb/>state and the given instance. A &quot;state&quot; in a recurrent network is not a predefined, discrete entity, <lb/>but instead corresponds to a vector of activation values across the units in the network that have <lb/>

			<page>13 <lb/></page>

Figure 5: The correspondence between a recurrent network and an FSA. Depicted on the left is a recurrent <lb/>network that has three input units and two state units. The two units that are to the right of the input units <lb/>represent the activations of the state units at time $t-1$. Shown in the middle is the two-dimensional, real-valued <lb/>space defined by the activations of the two hidden units. The path traced in the space illustrates the state changes <lb/>of the hidden-unit activations as the network processes some sequence of inputs. Each of the three arrow styles <lb/>represents one of the possible inputs to the recurrent network. Depicted on the right is a finite state automaton that <lb/>corresponds to the network when the state space is discretized as shown in the middle figure. The shade of each node <lb/>in the FSA represents the output value produced by the network when it is in the corresponding state. <lb/> outgoing recurrent connections, the so-called state units. Another way to think of such a state is <lb/>as a point in an s-dimensional, real-valued space defined by the activations of the s state units. <lb/>Recurrent networks are usually trained on sequences of input vectors. In such a sequence, the <lb/>order in which the input vectors are presented to the network represents a temporal order, or <lb/>some other natural sequential order. As a recurrent network processes such an input sequence, its <lb/>state-unit activations trace a path in the s-dimensional state-unit space. If similar input sequences <lb/>produce similar paths, then the continuous-state space can be closely approximated by a finite state <lb/>space in which each state corresponds to a region, as opposed to a point, in the space. This idea <lb/>is illustrated in Figure 5, which shows a finite state automaton and the two-dimensional state-unit <lb/>space of a recurrent network trained to accept the same strings as the FSA. The path traced in the <lb/>space illustrates the state changes of the state-unit activations as the network processes a sequence <lb/>of inputs. The non-shaded regions of the space correspond to the states of the FSA. <lb/>Several research groups have developed algorithms for extracting finite state automata from <lb/>trained recurrent networks. The key issue in such algorithms is deciding how to partition the <lb/> s-dimensional real-valued space into a set of discrete states. The method of Giles et al. (Giles <lb/>et al., 1992), which is representative of this class of algorithms, proceeds as follows. First, the <lb/>algorithm partitions each state unit&apos;s activation range into q intervals of equal width, thus dividing <lb/>the s-dimensional space into $q^s$ partitions. The method initially sets q = 2, but increases its value <lb/>if it cannot extract an FSA that correctly describes the network&apos;s training set. The next step is <lb/>to run the input sequences through the network, keeping track of (i) the state transitions, (ii) the <lb/>input vector associated with each transition, and (iii) the output value produced by the network. It <lb/>is then a simple task to express this record of the network&apos;s processing as a finite state automaton. <lb/>Finally, the FSA can be minimized using standard algorithms. <lb/>Recently, this approach has been applied to the task of exchange-rate prediction. Lawrence et <lb/>al.
(1997) trained recurrent neural networks to predict the daily change in five foreign exchange <lb/>rates, and then extracted finite state automata from these networks in order to characterize the <lb/>learned models. More specifically, the task they addressed was to predict the next (log-transformed) <lb/>change in a daily exchange rate x(t + 1), given the previous four values of the same time series: <lb/> $X(t) = (x(t),\, x(t-1),\, x(t-2),\, x(t-3))$. <lb/>
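A minimal sketch of this style of FSA extraction (our own code and interface assumptions, with a toy stand-in for the trained recurrent network) quantizes each state unit's activation range into q equal-width intervals and records the transitions observed while the network processes the input sequences:

```python
from collections import defaultdict

def region(state, q):
    """Map a vector of state-unit activations in [0, 1] to a discrete region
    by cutting each dimension into q equal-width intervals."""
    return tuple(min(int(a * q), q - 1) for a in state)

def extract_fsa(network_step, sequences, q=2):
    """Record discretized state transitions while the network runs.
    network_step(state, symbol) must return (new_state, output); this
    interface is an assumption of the sketch, not the published method."""
    transitions, outputs = defaultdict(dict), {}
    for seq in sequences:
        state = (0.0, 0.0)                     # assumed initial state
        for symbol in seq:
            new_state, out = network_step(state, symbol)
            transitions[region(state, q)][symbol] = region(new_state, q)
            outputs[region(new_state, q)] = out
            state = new_state
    return dict(transitions), outputs

def toy_step(state, symbol):
    """Toy stand-in for a trained recurrent net with two state units."""
    s0 = 0.9 if symbol == 'a' else 0.1
    s1 = min(0.99, 0.6 * state[1] + 0.4 * s0)
    return (s0, s1), int(s1 > 0.5)

fsa, outs = extract_fsa(toy_step, ['abba', 'baab'], q=2)
print(fsa)
```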

			<page>14 <lb/></page>

Their solution to this task involves three main components. The first component is a neural network <lb/>called a self-organizing map (SOM) (Kohonen, 1995) which is trained by an unsupervised learning <lb/>process. An SOM learns a mapping from its input space to its output space that preserves the <lb/>topological ordering of points in the input space. That is, the similarity of points in the input space, <lb/>as measured using a metric, is preserved in the output space. In their exchange-rate prediction <lb/>architecture, Lawrence et al. used SOMs to map from a continuous, four-dimensional input space <lb/>into a discrete output space. The input space, in this case, represents X(t), and the output space <lb/>represents a three-valued discrete variable that characterizes the trend in the exchange rate. <lb/>The second component of the system is a neural network that has a set of recurrent connections <lb/>from each of its hidden units to all of the other hidden units. The input to the recurrent network is <lb/>a three-dimensional vector consisting of the last three discrete values output by the self-organizing <lb/>map. The outputs of the network are the predicted probabilities that the next daily movement of <lb/>the exchange rate will be upward or downward. In other words, the recurrent network learns a <lb/>mapping from the SOM&apos;s discrete characterization of the time series to the predicted direction of <lb/>the next value in the time series. <lb/>The third major part of the system is the rule-extraction component. Using the method de-<lb/>scribed above, finite state automata are extracted from the recurrent networks. The states in the <lb/>FSA correspond to regions in the space of activations of the state units. Each state is labeled by the <lb/>corresponding network prediction (up or down), and each state transition is labeled by the value of <lb/>the discrete variable that characterizes the time series at time t. <lb/> After extracting automata from the recurrent networks, Lawrence et al. compared their predic-<lb/>tive accuracy to that of the neural networks and found that the FSA were only slightly less accurate. <lb/>On average, the accuracy of the recurrent networks was 53.4%, and the accuracy of the finite state <lb/>automata was 53.1% (both of which are statistically distinguishable from random guessing). <lb/> 3.5 Discussion <lb/> As we stated at the beginning of this section, there are three primary dimensions along which rule-<lb/>extraction methods differ: representation language, extraction strategy, and network requirements. <lb/>The algorithms that we have discussed in this section provide some indication of the diversity of <lb/>rule-extraction methods with respect to these three aspects. <lb/>The representation languages used by the methods we have covered include conjunctive infer-<lb/>ence rules, m-of-n inference rules, decision trees with m-of-n tests, and finite state automata. In <lb/>addition to these representations, there are rule-extraction methods that use fuzzy rules, rules with <lb/>confidence factors, &quot;majority-vote&quot; rules, and condition/action rules that perform rewrite opera-<lb/>tions on string-based inputs. This multiplicity of languages is due to several factors. One factor is <lb/>that different representation languages are well suited for different types of networks and tasks. A <lb/>
A <lb/>second reason is that researchers in the eld have found that it is often hard to concisely describe <lb/>the concept represented by a neural network to a high level of delity. Thus, some of the described <lb/>representations, such as m-of-n rules, have gained currency because they often help to simplify <lb/>extracted representations. <lb/>The extraction strategies employed by various algorithms also exhibit similar diversity. As <lb/>discussed earlier, one aspect of extraction strategy that distinguishes methods is whether they <lb/>extract global or local rules. Recall that global methods produce rules which describe a network as <lb/>a whole, whereas local methods extract rules which describe individual hidden and output units in <lb/>the network. Another key aspect of extraction strategies is the way in which they explore a space of <lb/>rules. In this section we described (i) methods that use search-like procedures to explore rule spaces, <lb/>(ii) a method that iteratively reenes a decision-tree description of a network, and (iii) a method <lb/>

			<page>15 <lb/></page>

that extracts finite state automata by first clustering unit activations and then mapping the clusters <lb/>into an automaton. In addition to these rule-exploration strategies, there are also algorithms that <lb/>extract rules by matching the network&apos;s weight vectors against templates representing canonical <lb/>rules, and methods that are able to directly map hidden units into rules when the networks use <lb/>transfer functions, such as radial basis functions, that respond to localized regions of their input <lb/>space. <lb/>Another key dimension we have considered is the extent to which methods place requirements on <lb/>the networks to which they can be applied. Some methods require that a special training procedure <lb/>be used for the network. Other methods impose restrictions on the network architecture, or require <lb/>that hidden units use sigmoidal transfer functions. Some of the methods we have discussed place <lb/>restrictions on both the network&apos;s architecture and its training regime. Another limitation of many <lb/>rule-extraction methods is that they are designed for problems that have only discrete-valued <lb/>features. The tradeoff that is involved in these requirements is that, although they may simplify <lb/>the rule-extraction process, they reduce the generality of the rule-extraction method. <lb/>Readers who are interested in more detailed descriptions of these rule-extraction methods, as <lb/>well as pointers to the literature, are referred elsewhere (Andrews et al., 1995; Craven, 1996). <lb/> 4 Methods that Learn Simple Hypotheses <lb/> The previous section discussed methods that are designed to extract comprehensible hypotheses <lb/>from trained neural networks. An alternative approach to data mining with neural networks is <lb/>to use learning methods that directly learn comprehensible hypotheses by producing simple neural <lb/>networks. Although we have assumed in our discussion so far that the hypotheses learned by neural <lb/>networks are incomprehensible, the methods we present in this section are different in that they learn <lb/>networks that have a single layer of weights. In contrast to multi-layer networks, the hypotheses <lb/>represented by single-layer networks are usually much easier to understand because each parameter <lb/>describes a simple (i.e., linear) relationship between an input feature and an output category. <lb/> 4.1 A Supervised Method <lb/> There is a wide variety of methods for learning single-layer neural networks in a supervised learn-<lb/>ing setting. In this section we focus on one particular algorithm that is appealing for data-mining <lb/>applications because it incrementally constructs its learned networks. This algorithm, called BBP <lb/> (Jackson &amp; Craven, 1996), is unlike traditional neural-network methods in that it does not in-<lb/>volve training with a gradient-based optimization method. The hypotheses it learns, however, are <lb/>perceptrons, and thus we consider it to be a neural-network method. <lb/>The BBP algorithm is shown in Table 3. The basic idea of the method is to repeatedly add <lb/>new input units to a learned hypothesis, using different probability distributions over the training <lb/>set to select each one. Because the algorithm adds weighted inputs to hypotheses incrementally, <lb/>the complexity of these hypotheses can be easily controlled. <lb/>The inputs incorporated by BBP into a hypothesis represent Boolean functions that map to <lb/> $\{-1, +1\}$.
In other words, the inputs are binary units that have an activation of either −1 or +1. These inputs may correspond directly to Boolean features, or they may represent tests on nominal or numerical features (e.g., color = red, x1 > 0.8), or logical combinations of features (e.g., [color = red] ∧ [shape = round]). Additionally, the algorithm may also incorporate an input representing the identically true function. The weight associated with such an input corresponds to the threshold of the perceptron.
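To make the role of these candidate inputs concrete, the following is a minimal Python sketch of how such ±1-valued tests might be represented. The helper names (nominal_test, threshold_test, and so on), the feature names, and the threshold are our own illustrative choices, not constructs from the BBP paper.

    # Each candidate input is a function mapping an example (a dict of
    # feature values) to +1 or -1, as the text describes.

    def nominal_test(feature, value):
        """Candidate input: +1 if example[feature] == value, else -1."""
        return lambda ex: 1 if ex[feature] == value else -1

    def threshold_test(feature, theta):
        """Candidate input: +1 if example[feature] > theta, else -1."""
        return lambda ex: 1 if ex[feature] > theta else -1

    def conjunction(*tests):
        """Candidate input: +1 only if every constituent test is +1."""
        return lambda ex: 1 if all(t(ex) == 1 for t in tests) else -1

    def true_input():
        """The identically true input; its weight plays the role of the
        perceptron threshold."""
        return lambda ex: 1

    # The kinds of tests mentioned in the text:
    candidates = [
        nominal_test("color", "red"),                    # color = red
        threshold_test("x1", 0.8),                       # x1 > 0.8
        conjunction(nominal_test("color", "red"),
                    nominal_test("shape", "round")),     # [color=red] and [shape=round]
        true_input(),
    ]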

			<page>16 <lb/></page>

			Table 3: The BBP algorithm. <lb/></body>

<div type="annex"> BBP
Input: training set S of m examples, set of candidate inputs C that map to {−1, +1},
       number of iterations T

/* set the initial distribution to be uniform */
for all x ∈ S
    D_1(x) := 1/m

for t := 1 to T do
    /* add another feature */
    h_t := argmax_{c_i ∈ C} | E_{D_t}[ f(x) · c_i(x) ] |
    /* determine the error of this feature */
    ε_t := 0
    for all x ∈ S
        if h_t(x) ≠ f(x) then ε_t := ε_t + D_t(x)
    /* update the distribution */
    β_t := ε_t / (1 − ε_t)
    for all x ∈ S
        if h_t(x) = f(x) then D_{t+1}(x) := β_t D_t(x)
        else D_{t+1}(x) := D_t(x)
    /* re-normalize the distribution */
    Z_t := Σ_x D_{t+1}(x)
    for all x ∈ S
        D_{t+1}(x) := D_{t+1}(x) / Z_t

Return: h(x) ≡ sign( Σ_{i=1}^{T} −ln(β_i) h_i(x) ) </div>
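To make the pseudocode in Table 3 concrete, here is a minimal Python sketch of the BBP loop under some simplifying assumptions: candidates are functions mapping an example to −1/+1 (as in the earlier sketch), labels are −1/+1, and a small numerical guard keeps β_t finite when a feature is perfect on the training set. This is our own illustration, not the authors' implementation.

    import math

    def bbp(examples, labels, candidates, T):
        m = len(examples)
        D = [1.0 / m] * m            # D_1: uniform distribution over the training set
        hypothesis = []              # list of (weight, input function) pairs
        for t in range(T):
            # add another feature: the candidate whose correlation
            # E_D[f(x) * c(x)] has the largest magnitude under D
            def corr(c):
                return sum(D[i] * labels[i] * c(examples[i]) for i in range(m))
            h = max(candidates, key=lambda c: abs(corr(c)))
            # determine the error of this feature under D
            eps = sum(D[i] for i in range(m) if h(examples[i]) != labels[i])
            eps = min(max(eps, 1e-6), 1.0 - 1e-6)   # guard so beta stays finite
            beta = eps / (1.0 - eps)
            hypothesis.append((-math.log(beta), h))  # weight = -ln(beta_t)
            if eps <= 1e-6:          # the feature is perfect on the training set
                break
            # update the distribution: down-weight correctly predicted examples
            D = [D[i] * beta if h(examples[i]) == labels[i] else D[i]
                 for i in range(m)]
            Z = sum(D)               # re-normalize
            D = [d / Z for d in D]
        def predict(x):
            s = sum(w * h(x) for w, h in hypothesis)
            return 1 if s > 0 else -1    # the sign function from the text
        return hypothesis, predict

Note that an anti-correlated feature (error above one half) needs no special case: β_t then exceeds one, so the feature simply receives a negative weight, which is the desired behavior.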

			<page> 17 <lb/></page>

<body>On each iteration of the BBP algorithm, an input is selected from the pool of candidates and added to the hypothesis under construction. BBP measures the correlation of each input with the target function being learned, and then selects the input whose correlation has the greatest magnitude. The correlation between a given candidate and the target function varies from iteration to iteration because it is measured with respect to a changing distribution over the training examples. Initially, the BBP algorithm assumes a uniform distribution over the training set. That is, when selecting the first input to be added to a perceptron, BBP assigns equal importance to the various instances of the training set. After each input is added, however, the distribution is adjusted so that more weight is given to the examples that the input did not correctly predict. In this way, the learner's attention is focused on those examples that the current hypothesis does not explain well.

The algorithm stops adding weighted inputs to the hypothesis after a pre-specified number of iterations has been reached, or after the training-set error has been reduced to zero. Since only one input is added to the network on each iteration, the size of the final perceptron can be controlled by limiting the number of iterations. The hypothesis returned by BBP is a perceptron in which the weight associated with each input is a function of the error of the input. The perceptron uses the sign function to decide which class to return:

    sign(x) = 1 if x > 0,
              −1 if x ≤ 0.

The BBP algorithm has two primary limitations. First, it is designed for learning binary classification tasks. The algorithm can be applied to multi-class learning problems, however, by learning a perceptron for each class. The other limitation of the method is that it assumes that the inputs are Boolean functions. As discussed above, however, domains with real-valued features can be handled by discretizing the features.

The BBP method is based on an algorithm called AdaBoost (Freund &amp; Schapire, 1996), which is a hypothesis-boosting algorithm. Informally, a boosting algorithm learns a set of constituent hypotheses and then combines them into a composite hypothesis in such a way that, even if each of the constituent hypotheses is only slightly more accurate than random guessing, the composite hypothesis has an arbitrarily high level of accuracy on the training data. In short, a set of weak hypotheses is boosted into a strong one. This is done by carefully determining the distribution over the training data that is used for learning each weak hypothesis. The weak hypotheses in a BBP perceptron are simply the individual inputs. Although the more general AdaBoost algorithm can use arbitrarily complex functions as its weak hypotheses, BBP uses very simple functions for its weak hypotheses in order to facilitate comprehensibility.
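Continuing the hypothetical sketch above, a short usage example on invented toy data illustrates how limiting the iteration count T directly bounds the size of the learned perceptron; the data, labels, and features here are purely illustrative.

    examples = [{"color": "red",  "shape": "round",  "x1": 0.9},
                {"color": "blue", "shape": "square", "x1": 0.2},
                {"color": "red",  "shape": "square", "x1": 0.7},
                {"color": "blue", "shape": "round",  "x1": 0.95}]
    labels = [1, -1, 1, -1]

    # At most 3 weighted inputs; here the "color = red" test predicts the
    # labels perfectly, so training stops after a single iteration.
    hypothesis, predict = bbp(examples, labels, candidates, T=3)
    print([round(w, 2) for w, _ in hypothesis])   # one weight per selected input
    print(predict({"color": "red", "shape": "round", "x1": 0.5}))   # -> 1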
Table 4 shows test-set accuracy and hypothesis-complexity results for the BBP algorithm, ordinary feed-forward neural networks, and C4.5 applied to three problem domains in molecular biology. As the table indicates, simple neural networks, such as those induced by BBP, can provide accuracy comparable to that of multi-layer neural networks in some domains. Moreover, in two of the three domains, the accuracy of the BBP hypotheses is significantly superior to that of decision trees learned by C4.5.

The three rightmost columns of Table 4 show one measure of the complexity of the models learned by the three methods: the total number of features incorporated into their hypotheses. These results illustrate that, like decision-tree algorithms, the BBP algorithm is able to selectively incorporate input features into its hypotheses. Thus, in many cases, the BBP hypotheses use significantly fewer features, and have significantly fewer weights, than ordinary multi-layer networks. It should also be emphasized that multi-layer networks are usually much more difficult to interpret than BBP hypotheses because their weights may encode nonlinear, nonmonotonic relationships between the input features and the class predictions.

			<page>18 <lb/></page>

Table 4: Test-set accuracy (%) and hypothesis complexity (# features used).

                                             accuracy             hypothesis complexity
    problem domain                     networks  BBP   C4.5      networks  BBP   C4.5
    protein-coding region recognition    93.6    93.6  84.9        464     171    150
    promoter recognition                 90.6    92.7  85.0         57      30     10
    splice-junction recognition          95.4    94.6  94.5         60      56     22

In summary, these results suggest the utility of the BBP algorithm for data-mining tasks: it provides good predictive accuracy on a variety of interesting, real-world tasks, and it produces syntactically simple hypotheses, thereby facilitating human comprehension of what it has learned. Additional details concerning these experiments can be found elsewhere (Craven, 1996).

4.2 An Unsupervised Method

As stated in the Introduction, unsupervised learning involves the use of inductive methods to discover regularities that are present in a data set. Although there is a wide variety of neural-network algorithms for unsupervised learning, we discuss only one of them here: competitive learning (Rumelhart &amp; Zipser, 1985). Competitive learning is arguably the unsupervised neural-network algorithm that is most appropriate for data mining, and it is illustrative of the utility of single-layer neural-network methods.

The learning task addressed by competitive learning is to partition a given set of training examples into a finite set of clusters. The clusters should represent regularities present in the data, such that similar examples are mapped into the same class.

The variant of competitive learning that we consider here, which is sometimes called simple competitive learning, involves learning in a single-layer network. The input units in such a network represent the relevant features of the problem domain, and the k output units represent the k classes into which examples are clustered.

The net input to each output unit in this method is a linear combination of the input activations:

    net_i = Σ_j w_ij a_j .

Here, a_j is the activation of the jth input unit, and w_ij is the weight linking the jth input unit to the ith output unit. The name competitive learning derives from the process used to determine the activations of the output units. The output unit that has the greatest net input is deemed the winner, and its activation is set to one. The activations of the other output units are set to zero:

    a_i = 1 if Σ_j w_ij a_j > Σ_j w_hj a_j for all output units h ≠ i,
    a_i = 0 otherwise.

The training process for competitive learning involves minimizing the cost function:

    C = (1/2) Σ_i Σ_j a_i (a_j − w_ij)²

where a_i is the activation of the ith output unit, a_j is the activation of the jth input unit, and w_ij is the weight from the jth input unit to the ith output unit. The update rule for the weights is then:

    Δw_ij = −η ∂C/∂w_ij = η a_i (a_j − w_ij)

where η is a learning-rate parameter.
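As a concrete illustration of these equations, the following is a minimal Python sketch of simple competitive learning; the function name, learning rate, epoch count, and toy data are our own assumptions. It applies the winner-take-all rule and the weight update given above, updating after every example in keeping with the algorithm's on-line character discussed below.

    import random

    def competitive_learning(data, k, eta=0.1, epochs=10, seed=0):
        rng = random.Random(seed)
        n = len(data[0])
        # initialize each weight vector to a randomly chosen training example
        weights = [list(rng.choice(data)) for _ in range(k)]
        for _ in range(epochs):
            for x in data:                  # on-line: update after each example
                # net input of each output unit: net_i = sum_j w_ij * x_j
                nets = [sum(w[j] * x[j] for j in range(n)) for w in weights]
                winner = max(range(k), key=lambda i: nets[i])
                # only the winner (a_i = 1) moves its weight vector
                # toward the example: delta_w = eta * (x_j - w_ij)
                for j in range(n):
                    weights[winner][j] += eta * (x[j] - weights[winner][j])
        return weights    # each row approximates the centroid of one cluster

    # e.g., two well-separated toy clusters:
    centroids = competitive_learning(
        [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]], k=2)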

			<page>19 <lb/></page>

The basic idea of competitive learning is that each output unit takes "responsibility" for a subset of the training examples. Only one output unit is the winner for a given example, and the weight vector for the winning unit is moved towards the input vector for this example. As training progresses, therefore, the weight vector of each output unit moves towards the centroid of the examples for which the output has taken responsibility. After training, each output unit represents a cluster of examples, and the weight vector for the unit corresponds to the centroid of the cluster.

Competitive learning is closely related to the statistical method known as k-means clustering. The principal difference between the two methods is that competitive learning is an on-line algorithm, meaning that during training it updates the network's weights after every example is presented, instead of after all of the examples have been presented. The on-line nature of competitive learning makes it more suitable for very large data sets, since on-line algorithms usually converge to a solution faster in such cases.

5 Conclusion

We began this article by arguing that neural-network methods deserve a place in the tool box of the data miner. Our argument rests on the premise that, for some problems, neural networks have a more suitable inductive bias (i.e., they do a better job of learning the target concept) than other commonly used data-mining methods. However, neural-network methods are thought to have two limitations that make them poorly suited to data-mining tasks: their learned hypotheses are often incomprehensible, and their training times are often excessive. As the discussion in this article shows, however, there is a wide variety of neural-network algorithms that avoid one or both of these limitations.

Specifically, we discussed two types of approaches that use neural networks to learn comprehensible models. First, we described rule-extraction algorithms. These methods promote comprehensibility by translating the functions represented by trained neural networks into languages that are easier to understand. A broad range of rule-extraction methods has been developed. The primary dimensions along which these methods vary are (i) their representation languages, (ii) their strategies for mapping networks into the representation language, and (iii) the range of networks to which they are applicable.

In addition to rule-extraction algorithms, we described both supervised and unsupervised methods that directly learn simple networks. These networks are often humanly interpretable because they are limited to a single layer of weighted connections, thereby ensuring that the relationship between each input and each output is a simple one. Moreover, some of these methods, such as BBP, have a bias towards incorporating relatively few weights into their hypotheses.

We have not attempted to provide an exhaustive survey of the available neural-network algorithms that are suitable for data mining. Instead, we have described a subset of these methods, selected to illustrate the breadth of relevant approaches as well as the key issues that arise in applying neural networks in a data-mining setting.
It is our hope that our discussion of neural-network approaches will serve to inspire some interesting applications of these methods to challenging data-mining problems.
		
		</body>

		<back>

<div type="acknowledgement">Acknowledgements

The authors have been partially supported by ONR grant N00014-93-1-0998 and NSF grant IRI-9502990. Mark Craven is currently supported by DARPA grant F33615-93-1-1330. </div>

			<page>20 <lb/></page>

<listBibl> References

Andrews, R., Diederich, J., &amp; Tickle, A. B. (1995). A survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems, 8(6).

Bishop, C. M. (1996). Neural Networks for Pattern Recognition. Oxford University Press, Oxford, England.

Breiman, L., Friedman, J., Olshen, R., &amp; Stone, C. (1984). Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA.

Craven, M. &amp; Shavlik, J. (1993). Learning symbolic rules using artificial neural networks. In Proceedings of the Tenth International Conference on Machine Learning, (pp. 73-80), Amherst, MA. Morgan Kaufmann.

Craven, M. W. (1996). Extracting Comprehensible Models from Trained Neural Networks. PhD thesis, Computer Sciences Department, University of Wisconsin, Madison, WI. Available as CS Technical Report 1326. Available by WWW as ftp://ftp.cs.wisc.edu/machine-learning/shavlik-group/craven.thesis.ps.Z.

Craven, M. W. &amp; Shavlik, J. W. (1996). Extracting tree-structured representations of trained networks. In Touretzky, D., Mozer, M., &amp; Hasselmo, M., editors, Advances in Neural Information Processing Systems (volume 8). MIT Press, Cambridge, MA.

Freund, Y. &amp; Schapire, R. E. (1996). Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, (pp. 148-156), Bari, Italy. Morgan Kaufmann.

Fu, L. (1991). Rule learning by searching on adapted nets. In Proceedings of the Ninth National Conference on Artificial Intelligence, (pp. 590-595), Anaheim, CA. AAAI/MIT Press.

Gallant, S. I. (1993). Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA.

Giles, C. L., Miller, C. B., Chen, D., Chen, H. H., Sun, G. Z., &amp; Lee, Y. C. (1992). Learning and extracting finite state automata with second-order recurrent neural networks. Neural Computation, 4:393-405.

Jackson, J. C. &amp; Craven, M. W. (1996). Learning sparse perceptrons. In Touretzky, D., Mozer, M., &amp; Hasselmo, M., editors, Advances in Neural Information Processing Systems (volume 8). MIT Press, Cambridge, MA.

Kohonen, T. (1995). Self-Organizing Maps. Springer-Verlag, Berlin, Germany.

Lawrence, S., Giles, C. L., &amp; Tsoi, A. C. (1997). Symbolic conversion, grammatical inference and rule extraction for foreign exchange rate prediction. In Abu-Mostafa, Y., Weigend, A. S., &amp; Refenes, P. N., editors, Neural Networks in the Capital Markets. World Scientific, Singapore.

Quinlan, J. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.

Rumelhart, D. E. &amp; Zipser, D. (1985). Feature discovery by competitive learning. Cognitive Science, 9:75-112. </listBibl>

			<page>21 <lb/></page>

<listBibl>Saito, K. &amp; Nakano, R. (1988). Medical diagnostic expert system based on PDP model. In Proceedings of the IEEE International Conference on Neural Networks, (pp. 255-262), San Diego, CA. IEEE Press.

Sethi, I. K. &amp; Yoo, J. H. (1994). Symbolic approximation of feedforward neural networks. In Gelsema, E. S. &amp; Kanal, L. N., editors, Pattern Recognition in Practice (volume 4). North-Holland, New York, NY.

Shavlik, J., Mooney, R., &amp; Towell, G. (1991). Symbolic and neural net learning algorithms: An empirical comparison. Machine Learning, 6:111-143.

Thrun, S. (1995). Extracting rules from artificial neural networks with distributed representations. In Tesauro, G., Touretzky, D., &amp; Leen, T., editors, Advances in Neural Information Processing Systems (volume 7). MIT Press, Cambridge, MA.

Towell, G. &amp; Shavlik, J. (1993). Extracting refined rules from knowledge-based neural networks. Machine Learning, 13(1):71-101.

Widrow, B., Rumelhart, D. E., &amp; Lehr, M. A. (1994). Neural networks: Applications in industry, business, and science. Communications of the ACM, 37(3):93-105. </listBibl>

			<page>22 </page>

		</back>
	</text>
</tei>