<?xml version="1.0" ?>
<tei>
	<teiHeader>
		<fileDesc xml:id="_ims"/>
	</teiHeader>
	<text xml:lang="en">
			<front> Intrinsic Motivation Systems for Autonomous <lb/>Mental Development <lb/> Pierre-Yves Oudeyer, Frédéric Kaplan, Verena V. Hafner <lb/>Sony Computer Science Lab, Paris <lb/>6 rue Amyot 75005 Paris <lb/> {oudeyer, kaplan, hafner}@csl.sony.fr <lb/>http://www.csl.sony.fr <lb/> Abstract <lb/> Exploratory activities seem to be intrinsically rewarding for children <lb/>and crucial for their cognitive development. Can a machine be endowed <lb/>with such an intrinsic motivation system? This is the question we study <lb/>in this paper, presenting a number of computational systems that try to <lb/>capture this drive towards novel or curious situations. After discussing <lb/>related research coming from developmental psychology, neuroscience, de-<lb/>velopmental robotics and active learning, this article presents the mech-<lb/>anism of Intelligent Adaptive Curiosity, an intrinsic motivation system <lb/>which pushes a robot towards situations in which it maximizes its learning <lb/>progress. This drive makes the robot focus on situations which are neither <lb/>too predictable nor too unpredictable, thus permitting autonomous mental <lb/>development. The complexity of the robot&apos;s activities autonomously in-<lb/>creases and complex developmental sequences self-organize without being <lb/>constructed in a supervised manner. Two experiments are presented illus-<lb/>trating the stage-like organization emerging with this mechanism. In one <lb/>of them, a physical robot is placed on a baby play mat with objects that <lb/>it can learn to manipulate. Experimental results show that the robot first <lb/>spends time in situations which are easy to learn, then shifts its attention <lb/>progressively to situations of increasing difficulty, avoiding situations in <lb/>which nothing can be learnt. Finally, these various results are discussed <lb/>in relation to more complex forms of behavioural organization and data <lb/>coming from developmental psychology. <lb/> Keywords: intrinsic motivation, curiosity, values, development, learn-<lb/>ing, autonomy, epigenetic robotics, behaviour, developmental trajectory, <lb/>complexity, active learning. <lb/></front> 
			
			<body>1 The challenge of autonomous mental develop-<lb/>ment <lb/> All humans develop in an autonomous open-ended manner through life-long <lb/>learning. So far, no robot has this capacity. Building such a robot is one of the <lb/>greatest challenges to robotics today, and is the long-term goal of the growing <lb/>field of developmental robotics ([1, 2]). This article explores a possible route <lb/>towards such a goal. Our approach is inspired by developmental psychology and <lb/>

			<page>2 <lb/></page>

			our ambition is to build systems featuring some of the fundamental aspects of <lb/>an infant&apos;s development. More precisely, two remarkable properties of human <lb/>infant development inspire us. <lb/> 1.1 Development is progressive and incremental <lb/> First of all, development involves the progressive increase of the complexity of <lb/>the activities of children with an associated increase of their capabilities. More-<lb/>over, infants&apos; activities always have a complexity which is well-fitted to their <lb/>current capabilities. Children undergo a developmental sequence during which <lb/>each new skill is acquired only when associated cognitive and morphological <lb/>structures are ready. For example, children first learn to roll over, then to <lb/>crawl and sit, and only when these skills are operational do they begin to learn <lb/>how to stand. Development is progressive and incremental. Taking inspiration <lb/>from these observations, some roboticists argue that learning a given task could <lb/>be made much easier for a robot if it followed a developmental sequence (e.g. <lb/> &quot;Learning from easy missions&quot; [3]). But very often, the developmental sequence <lb/>is crafted by hand: roboticists manually build simpler versions of a complex task <lb/>and put the robot successively in versions of the task of increasing complexity. <lb/>For example, if they want to teach a robot the grammar of a language, they first <lb/>give it examples of very simple sentences with few words, and progressively they <lb/>add new types of grammatical constructions and complications such as nested <lb/>subordinates ([4]). This technique is useful in many cases, but has shortcomings <lb/>which limit our capacity to build robots that develop in an open-ended manner. <lb/>

			<page>3 <lb/></page>

			Indeed, this is not practical: for each task that one wants the robot to learn, <lb/>one has to design versions of this task of increasing complexity, and one also has <lb/>to design manually a reward function dedicated to this particular task. This <lb/>might be all right if one is interested in only one or two tasks, but a robot <lb/>capable of life-long learning should eventually be able to perform thousands of <lb/>tasks. And even if one were to engage in such a daunting task of designing manu-<lb/>ally thousands of specific reward functions, there is another limit. The robot is <lb/>equipped with a learning machine whose learning biases are often not intuitive: <lb/>this means that it is also conceptually difficult most of the time to think of <lb/>simpler versions of a task that might help the robot. It is often the case that <lb/>a task that one considers to be easier for a robot might turn out in fact to be <lb/>more difficult. <lb/> 1.2 Development is autonomous and active <lb/> This leads us to a second property of child development from which we should be <lb/>inspired: it is autonomous and active. Of course, adults help by scaffolding their <lb/>environment, but this is just a help: eventually, infants decide by themselves <lb/>what they do, what they are interested in, and what their learning situations <lb/>are. They are not forced to learn the tasks suggested by adults, they can invent <lb/>their own. Thus, they construct by themselves their developmental sequence. <lb/>Anyone who has ever played with an infant in its first year knows for example <lb/>that it is extremely difficult to get the child to play with a toy that is chosen by <lb/>the adult if other toys and objects are around. In fact, most often the toys that <lb/>

			<page>4 <lb/></page>

			 we think are adapted to them and will please them are not the ones they pre-<lb/>fer: they can have much more fun and instructive play experiences with adult <lb/>objects, such as magazines, keys, or flowers. Also, most of the time, infants <lb/>engage in particular activities for their own sake, rather than as steps towards <lb/>solving practical problems. This is indeed the essence of play. This suggests the <lb/>existence of a kind of intrinsic motivation system, as proposed by psychologists <lb/>like White ([5]), which provides internal rewards during these play experiences. <lb/>Such internal rewards are obviously useful, since they are incentives to learn <lb/>many skills that will potentially be readily available later on for challenges and <lb/>tasks which are not yet foreseeable. <lb/>In order to develop in an open-ended manner, robots should certainly be <lb/>equipped with capacities for autonomous and active development, and in par-<lb/>ticular with intrinsic motivation systems, forming the core of a system for task-<lb/>independent learning. Yet, this crucial issue is still largely underinvestigated. <lb/>The rest of the article is organized in the following way. The next section <lb/>presents a general discussion of research related to intrinsic motivation in the <lb/>domain of psychology, neuroscience, developmental robotics and active learn-<lb/>ing. Section III presents a critical review and a classification of existing intrin-<lb/>sic motivation systems and determines key characteristics important to permit <lb/>autonomous mental development. Section IV describes in detail the algorithm <lb/>of Intelligent Adaptive Curiosity. Section V discusses methodological issues <lb/>for characterizing the behaviour and performances of such systems. Section <lb/>

			<page>5 <lb/></page>

			VI presents a first experiment using Intelligent Adaptive Curiosity with a sim-<lb/>ple simulated robot. Section VII presents a second more complex experiment <lb/>involving a physical robot discovering affordances about entities in its environ-<lb/>ment. Section VIII discusses the results obtained in these two experiments in <lb/>relation to more complex issues associated with behavioural organization and <lb/>observation in infant development. <lb/> 2 Background <lb/> 2.1 Psychology <lb/> White ([5]) presents an argument explaining why basic forms of motivation <lb/>such as those related to the need for food, sex or physical integrity maintenance <lb/>cannot account for an animal&apos;s exploratory behaviour, and in particular for hu-<lb/>mans. He proposed rather that exploratory behaviours can be by themselves a <lb/>source of rewards. Some experiments have been conducted showing that explo-<lb/>ration for its own sake is an activity which is not always a secondary reinforcer: <lb/>it is certainly a built-in primary reinforcer. The literature on education and <lb/>development also stresses the distinction between intrinsic and extrinsic moti-<lb/>vations ([6]). Psychologists have proposed possible mechanisms which explain <lb/>the kind of exploratory behaviour that humans, for example, show. Berlyne ([7]) <lb/>proposed that exploration might be triggered and rewarded by situations which <lb/>include novelty, surprise, incongruity and complexity. He also refined this idea <lb/>by observing that the most rewarding situations were those with an interme-<lb/>

			<page>6 <lb/></page>

			diate level of novelty, between already familiar and completely new situations. <lb/>This theory has strong resonance points with the theory of flow developed by <lb/>Csikszentmihalyi ([8]) which argues that a crucial source of internal rewards <lb/>for humans is the self-engagement in activities which require skills just above <lb/>their current level. Thus, for Csikszentmihalyi, exploratory behaviour can be <lb/>explained by an intrinsic motivation for reaching situations which represent a <lb/>learning challenge. Internal rewards are provided when a situation which was <lb/>previously not mastered becomes mastered within an amount of time and effort <lb/>which must be not too small but also not too large. Indeed, in analogy with <lb/>Berlyne ([7]), Csikszentmihalyi insists that the internal reward is maximal when <lb/>the challenge is not too easy but also not too difficult. <lb/> 2.2 Neuroscience <lb/> Recent discoveries showing a convergence between patterns of the activity in the <lb/>midbrain dopamine neurons and computational models of reinforcement learning <lb/>have led to a considerable amount of speculation about learning activities in the <lb/>brain ([9]). Central to some of these models is the idea that dopamine cells re-<lb/>port the error in predicting expected reward delivery. Most experiments in this <lb/>domain focus on the involvement of dopamine for predicting extrinsic (or exter-<lb/>nal) reward (e.g. food). Yet, recently some researchers provided ground for the <lb/>idea that dopamine might also be involved in the processing of types of intrinsic <lb/>motivation associated with novelty and exploration ([10], [11]). In particular, <lb/>some studies suggest that dopamine responses could be interpreted as report-<lb/>

			<page>7 <lb/></page>

			 ing &quot;prediction error&quot; (and not only &quot;reward prediction error&quot;) ([12]). These <lb/>findings support the idea that intrinsic motivation systems could be present in <lb/>the brain in some form or another and that signals reporting prediction error <lb/>could play a critical role in this context. <lb/> 2.3 Developmental robotics <lb/> Given this background, a way to implement an intrinsic motivation system might <lb/>be to build a mechanism which can evaluate operationally the degree of &quot;nov-<lb/>elty&quot;, &quot;surprise&quot;, &quot;complexity&quot; or &quot;challenge&quot; that different situations provide <lb/>from the point of view of a learning robot, and then to measure an associated re-<lb/>ward, ideally maximal when these features are at an intermediate level, as <lb/>proposed by Berlyne ([7]) and Csikszentmihalyi ([13]). Autonomous and active <lb/>exploratory behaviour can then be achieved by acting so as to reach situations <lb/>which maximize this measure. The difficult task is then to find a sensible man-<lb/>ner to operationalize the concepts behind the words &quot;novelty&quot;, &quot;complexity&quot;, <lb/> &quot;surprise&quot; or &quot;challenge&quot;, which are only verbally described and often vaguely <lb/>defined in the psychology literature. <lb/>Only a few researchers have suggested such implementations, and even fewer <lb/>have tested them on real robots. Typically, they call these systems of au-<lb/>tonomous and active exploratory behaviour &quot;artificial curiosity&quot;. Schmidhuber, <lb/>Thrun and Hermann ([14], [15], and [16]) provided initial implementations of <lb/>artificial curiosity, but they did not integrate this concept within the problematics <lb/>of developmental robotics, in the sense that they were not concerned with <lb/>

			<page>8 <lb/></page>

			the emergent developmental sequence and with the increase of the complexity of <lb/>their machines (and they did not use robots, but learning machines on some <lb/>abstract problems). They were only concerned with how far artificial curiosity can <lb/>speed up the acquisition of knowledge. The first integrated view of developmen-<lb/>tal robotics that incorporated a proposal for a novelty drive was described by <lb/>Weng and colleagues ([17]; [18]). Then, Kaplan and Oudeyer proposed an im-<lb/>plementation of artificial curiosity within a developmental framework ([19]), and <lb/>Marshall, Blank and Meeden as well as Barto, Singh and Chentanez suggested <lb/>variations on the novelty drive ([20], [21]). As we will explain later on in the <lb/>paper, these pioneering systems have a number of limitations making them im-<lb/>possible to use on real robots in real uncontrolled environments. Furthermore, <lb/>to our knowledge, it has not yet been shown how they could successfully lead to <lb/>the autonomous formation of a developmental sequence comprising more than <lb/>one stage. This means that typically they have allowed for the development and <lb/>emergence of one level of behavioural patterns, but did not show how new levels <lb/>of more complex behavioural patterns could emerge without the intervention of <lb/>a human or a change in the environment provoked by a human. <lb/> 2.4 Active Learning <lb/> Interestingly, the mechanisms developed in these papers devoted to the imple-<lb/>mentation of artificial curiosity have strong similarities with mechanisms devel-<lb/>oped in the field of statistics, where it is called &quot;optimal experiment design&quot; <lb/>([22]), and in machine learning, where it is called &quot;active learning&quot; ([23], [24]). <lb/>

			<page>9 <lb/></page>

			In these contexts, the problem is summarized with the question: how to choose <lb/>the next example for a learning machine in order to minimize the number of <lb/>examples necessary to achieve a given level of performance in generalization? <lb/>Or, said another way: how to choose the next example so that the gain in in-<lb/>formation for the machine learner will be maximal? A number of techniques <lb/>developed in active learning have proved to significantly speed up the learning <lb/>of machines (e.g. [25], [26], [27], [28], [29], [30], [31]) and even to allow generalization per-<lb/>formance which is not possible with passive learning ([32]). <lb/>Yet, these techniques were developed for applications in which the mapping <lb/>to be learnt was clean and typically presented as pre-processed well-prepared <lb/>datasets. They are also typically based on a mathematical theory like Optimal <lb/>Experiment Design, which assumes that the noise is independently normally dis-<lb/>tributed ([33]). On the contrary, the domain that real robots have to investigate <lb/>is the real unconstrained world, which is a highly complicated and &quot;muddy&quot; <lb/>structure, as pointed out by Weng ([34]), full of very different kinds of inter-<lb/>twined non-Gaussian inhomogeneous noise. As a consequence, these methods <lb/>cannot be used directly in the developmental robotics domain, and there is no <lb/>obvious way to extend them in this direction. Moreover, there exists no effi-<lb/>cient implementation for methods like optimal experiment design in continuous <lb/>spaces, and already in discrete spaces the computational cost is high ([35]). <lb/>

			<page>10 <lb/></page>

			3 Existing intrinsic motivation systems <lb/> Existing computational approaches to intrinsic motivations and artificial cu-<lb/>riosity are typically based on an architecture which comprises a machine which <lb/>learns to anticipate the consequences of the robot&apos;s actions, and in which these <lb/>actions are actively chosen according to some internal measures related to the <lb/>novelty or predictability of the anticipated situation. Thus, the robots in these <lb/>approaches can be described as having two modules: 1) one module implements <lb/>a learning machine M which learns to predict the sensorimotor consequences <lb/>when a given action is executed in a given sensorimotor context; 2) another <lb/>module is a meta learning machine metaM which learns to predict the errors <lb/>that machine M makes in its predictions: these meta-predictions are then used <lb/>as the basis of a measure of the potential interestingness of a given situation. <lb/>The existing approaches can be divided into three groups, according to the way <lb/>action-selection is made depending on the predictions of M and metaM. <lb/> 3.1 Group 1: Error maximization <lb/> In the first group (e.g. [18]; [15]; [20], [21]) robots directly use the error pre-<lb/>dicted by metaM to choose which action to do  1  . The action that they choose at <lb/>each step is the one for which metaM predicts the largest error in prediction of <lb/>M. This has been shown to be efficient when the machine M has to learn a mapping <lb/>which is learnable, deterministic and with homogeneous Gaussian noise ([32]; <lb/> 
			
			<note place="footnote">1  Of course, we are only talking about the &quot;novelty&quot; drive here: their robots are sometimes <lb/>equipped with other competing drives or can respond to external human-based reward sources. <lb/></note>

			<page> 11 <lb/></page>

			[15]; [17]; [21]). But this method shows limitations when used in a real uncon-<lb/>trolled environment. Indeed, in such a case, the mapping that M has to learn is <lb/>no longer deterministic, and the noise is vastly inhomogeneous. Practically, <lb/>this means that a robot using this method will, for example, get stuck on white <lb/>noise or more generally in situations which are inherently too complex for its <lb/>learning machinery or situations for which the causal variables are not perceiv-<lb/>able or observable by the robot. For example, a robot equipped with a drive <lb/>which pushes it towards situations which are maximally unpredictable might <lb/>discover and stay focused on movement sequences like running fast against a <lb/>wall, the shock resulting in an unpredictable bounce (in principle, the bounce is <lb/>predictable since it obeys the deterministic laws of classical mechanics but in prac-<lb/>tice this prediction requires the perfect knowledge of all the physical properties <lb/>of the robot body as well as those of the wall, which is typically far from being <lb/>the case for a robot). So, in uncontrolled environments, a robot equipped with <lb/>this intrinsic motivation system will get stuck and display behaviours which do <lb/>not lead to development and that can sometimes even be dangerous. <lb/> 3.2 Group 2: Progress maximization <lb/> A second group of models tried to avoid getting stuck in the presence of pure <lb/>noise or unlearnable situations by using more indirectly the prediction of the <lb/>error of M (e.g. [16, 19]). In these models a third module that we call KGA <lb/> for Knowledge Gain Assessor is added to the architecture. Figure 1 shows an <lb/>illustration of these systems. This new module enhances the capabilities of the <lb/>

			<page>12 <lb/></page>

			meta-machine metaM: KGA predicts the mean error rate of M in the close <lb/>future and in the next sensorimotor contexts. KGA also stores the recent mean <lb/>error rate of M in the most recent sensorimotor contexts. The crucial point of <lb/>these models is that the interestingness of candidate situations is evaluated <lb/>using the difference between the expected mean error rate of the predictions <lb/>of M in the close future, and the mean error rate in the close past. For each <lb/>situation that the robot encounters, it is given an internal reward which is equal <lb/>to the opposite of this difference (which also corresponds to the local derivative of <lb/>the error rate curve of M). This internal reward is positive when the error rate <lb/>decreases, and negative when it increases. The motivation system of the robot <lb/>is then a system in which the action chosen is that for which KGA predicts <lb/>that it will lead to the greatest decrease of the mean error rate of M. This <lb/>ensures that the robot will not stay in front of white noise for a long time or in <lb/>unlearnable situations because this does not lead to a decrease of its errors in <lb/>prediction. <lb/>However, this method has only been tested in spaces in which the robot <lb/>can do only one kind of activity, such as for example moving the head and <lb/>learning to predict the position of high luminance points ([19]). But the ideal <lb/>characteristic of a developmental robot is that it may engage in various kinds of <lb/>activities, such as learning to walk, learning to grip things in its hand, learning <lb/>to track a visual target, learning to catch the attention of other social beings, <lb/>learning to vocalize, etc. In such cases, the robot can typically switch rapidly <lb/>from one activity to the other: for example, making a trial at gripping an object <lb/>

			<page>13 <lb/></page>

			that it sees and suddenly shifting to trying to track the movement of another <lb/>being in its environment. In such a case, measuring the evolution in time of its <lb/>performance in predicting what happens will lead to a measure which is hardly <lb/>interpretable. Indeed, using the method we described in the last paragraph will <lb/>make the robot compare its error rate in anticipation while it is trying to grip <lb/>an object with its error rate in anticipation while it is trying to anticipate the <lb/>reaction of the other being when it vocalizes, if these two kinds of activities <lb/>are sequenced. Thus, it will often lead the robot to compare its performances <lb/>for activities which are of a different kind, which has no obvious meaning. And <lb/>indeed, using this direct measure of the decrease in the error rate in prediction <lb/>will provide the robot with internal rewards when shifting from an activity with <lb/>a high mean error rate to activities with a lower mean error rate, which can <lb/>be higher than the rewards corresponding to an effective increase of the skills <lb/>of the robot in one of the activities. This will push the robot towards unstable <lb/>behaviour, in which it focuses on the sudden shifts between different kinds of <lb/>activities rather than concentrating on the actual activities. <lb/> 3.3 Group 3: Similarity-based progress maximization <lb/> Changes are needed so that methods based on the decrease of the error rate in <lb/>prediction can still work in a realistic complex developmental robotics set-up. <lb/>It is necessary that the robot monitors the evolution of its error rate in predic-<lb/>tion in situations which are of the same kind. It will no longer compare its <lb/>current error rate with its error rate in the close past, whatever the current sit-<lb/>

			<page>14 <lb/></page>

			 uation and the situation in the close past are. The similarity between situations <lb/>must be taken into account. Building a system which can do that correctly <lb/>represents a big challenge. Indeed, a developmental robot will not be given <lb/>an innate mechanism with a pre-programmed set of kinds of situations and a <lb/>mechanism for categorizing each particular situation into one of these kinds. A <lb/>developmental robot has to be able to build by itself a measure of the similarity <lb/>of situations and ultimately an organization of the infinite continuous space of <lb/>particular situations into higher-level categories (or kinds) of situations. For <lb/> example, a developmental robot does not know initially that on the one hand <lb/>there can be the &quot;gripping objects&quot; kind of activity and on the other hand the <lb/> &quot;vocalizing to others&quot; kind of activity. Initially, the world is just a continuous <lb/>stream of sensations and low-level motor commands for the robot. <lb/>A related approach, but with an active learning point of view rather than a <lb/>developmental robotics point of view, was proposed, presenting an implementa-<lb/>tion of the idea of evaluating the learning progress by monitoring the evolution <lb/>of the error rate in similar situations ([14]). The implementation described was <lb/>tested for discrete environments like a two-dimensional grid virtual world on <lb/>which an agent could move and do one of four discrete actions. The similar-<lb/>ity of two situations was evaluated by a binary function stating whether they <lb/>correspond exactly to the same discrete state or not. From an active learning <lb/>point of view, it was shown that in this case the system can significantly speed <lb/>up the learning, even if some parts of the space are pure noise. This system <lb/>was not studied under the developmental robotics point of view: it was not <lb/>
			
			<page>15 <lb/> </page>
			
			shown whether this allowed for a self-organization of the behaviour of the robot <lb/>into a developmental sequence featuring clearly several stages of increasing com-<lb/>plexity. Moreover, because the system was only tested on a discrete simulated <lb/>environment, it is difficult to generalize the results to the general case in which <lb/>the environment and action spaces are continuous, and where two situations are <lb/>never numerically exactly the same. Nevertheless, this article suggests a possi-<lb/>ble manner to use this method in continuous spaces. It is based on the use of <lb/>a learning machine such as a feed-forward neural network which takes as input <lb/>a particular situation and predicts the error associated with the anticipation of <lb/>the consequence of a given action in this situation. This measure is then used in <lb/>a formula to evaluate the learning progress. Thanks to the generalization prop-<lb/>erties of a machine like a neural network, the author claims that the mechanism <lb/>will correctly generalize the evaluation of learning progress from one situation <lb/>to similar situations. Yet, it is not clear how this will work in practice since the <lb/>error function, and thus the learning progress function, is locally highly non-<lb/>stationary. This creates a risk of over-generalization. Another limit of this <lb/>work resides in the particular formula that is used to evaluate the learning <lb/>progress associated with a candidate situation, which consists in taking the <lb/>difference between the error in the anticipation of this situation before it has <lb/>been experienced and the error in the anticipation of exactly the same situation <lb/>after it has been experienced. On the one hand, this can only work for a learn-<lb/>ing machine with a low learning rate, as pointed out by the author, and will <lb/>not work with, for example, the one-shot learning of memory-based methods. On the <lb/>
			
			<page>16 <lb/></page> 
			
			other hand, considering the state of the learning machine just before and just <lb/>after one single experience can possibly be sensitive to stochastic fluctuations. <lb/>The next section will present a system which provides an implementation <lb/>of the idea of evaluating the learning progress by comparing similar situations. <lb/>This system is made to work in continuous spaces, and we will actually show <lb/>that this system works both in a virtual robot set-up and in a real robotic set-<lb/>up with continuous motor and/or perceptual spaces. One of its crucial features <lb/>is that it introduces a mechanism of situation categorization, which splits the <lb/>space incrementally and autonomously into different regions which correspond <lb/>to different kinds of activities from the point of view of the robot. This makes it possible <lb/>to compare the similarity of two situations not directly based on their intrinsic <lb/>metric distance, but on their belonging to a given situation category. Another <lb/>feature is the fact that we monitor in each of these regions the evolution of <lb/>the error rate in prediction for an extended period of time, which makes it possible to use <lb/>smoothing procedures and avoid problems due to stochastic fluctuations. The <lb/> &quot;regional&quot; evaluation of similarity combined with the smoothing of the error rate <lb/>curve is a way to cope with the non-stationarity of the learning progress function. <lb/>Another feature is that it makes no presupposition on the learning rate of the <lb/>learning machines, and thus can be used with one-shot learning methods like <lb/>nearest neighbours algorithms as well as with slowly learning neural networks, <lb/>for example. <lb/>
			
			<page>17 <lb/></page> 
			
			4 Intelligent Adaptive Curiosity <lb/> The system described in this section is called Intelligent Adaptive Curiosity <lb/> (IAC): <lb/> • it is a motivation, or drive, in the same sense that food level maintenance <lb/>or heat maintenance are drives, but instead of being about the mainte-<lb/>nance of a physical variable, the IAC drive is about the maintenance of <lb/>an abstract dynamic cognitive variable: the learning progress, which <lb/>must be kept maximal. This definition makes it an intrinsic motivation. <lb/> • it is called curiosity because maximizing the learning progress pushes (as <lb/>a side effect) the robot towards novel situations in which things can be <lb/>learnt. <lb/> • it is adaptive because the situations that are attractive change over time: <lb/>indeed, once something is learnt, it will not provide learning progress <lb/>anymore. <lb/> • it is called intelligent because it keeps, as a side effect, the robot away <lb/>both from situations which are too predictable and from situations which <lb/>are too unpredictable (i.e. the edge of order and chaos in the cognitive <lb/>dynamics). Indeed, thanks to the fact that one evaluates the learning <lb/>progress by comparing situations which are similar and in a &quot;regional&quot; <lb/>manner, the pathological behaviours that we described in the previous sec-<lb/>tion are avoided. <lb/>
			
			<page>18 <lb/></page> 
			
			We will now describe how this system can be fully implemented. This imple-<lb/>mentation can be varied in many ways, for example by replacing the imple-<lb/>mentation of the learning machines M, metaM and KGA. The one we provide <lb/>is basic and was developed for its practical efficiency. Also, it will be clear to <lb/>the reader that in an efficient implementation, the machines M, metaM and <lb/> KGA are not easily separable (keeping them as separate entities in the previous <lb/>paragraphs was simply to make the explanation easier to understand). <lb/> 4.1 Summary <lb/> IAC relies on a memory which stores all the experiences encountered by the <lb/>robot in the form of vector exemplars. There is a mechanism which incremen-<lb/>tally splits the sensorimotor space into regions, based on these exemplars. Each <lb/>region is characterized by its exclusive set of exemplars. Each region is also <lb/>associated with its own learning machine, which we call an expert. This expert <lb/>is trained with the exemplars available in its region. When a prediction corre-<lb/>sponding to a given situation has to be made by the robot, then the expert of <lb/>the region which covers this situation is picked up and used for the prediction. <lb/>Each time an expert makes a prediction associated with an action which is actually <lb/>executed, its error in prediction is measured and stored in a list which is associ-<lb/>ated with its region. Each region has its own list. This list is used to evaluate the <lb/>potential learning progress that can be gained by going to a situation covered <lb/>by its associated region. This is done based on a smoothing of the list of errors, <lb/>and on an extrapolation of the derivative. When in a given situation, the robot <lb/>
			
			<page>19 <lb/></page> 
			
			creates a list of possible actions and chooses the one which it evaluates will <lb/>lead to a situation with maximal expected learning progress  2  . <lb/> 4.2 Sensorimotor apparatus <lb/> The robot has a number of real-valued sensors s_i(t) which are here summarized <lb/>by the vector S(t). Its actions are controlled by the setting of the real number <lb/>values of a set of action/motor parameters m_i(t), which we summarize using <lb/>the vector M(t). These action parameters can potentially be very low level <lb/>(for example the speed of motors) or of a higher level (for example the control <lb/>parameters of motor primitives such as the biting or bashing movement that we <lb/>will describe in the section devoted to the &quot;Playground Experiment&quot;). We de-<lb/>note the sensorimotor context SM(t) as the vector which summarizes the values <lb/>of all the sensors and the action parameters at time t (it is the concatenation <lb/>of S(t) and M(t)). In all that follows, there is an internal clock in the robot <lb/>which discretizes time, and new actions are chosen at every time step. <lb/> 
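As a minimal illustration of this sensorimotor interface (a sketch in Python; the function name and the example dimensionalities are our own assumptions, not taken from the paper), the context SM(t) is simply the concatenation of the sensor vector S(t) and the motor vector M(t):

import numpy as np

def sensorimotor_context(S_t, M_t):
    """Return SM(t), the concatenation of the sensor vector S(t)
    and the action/motor parameter vector M(t)."""
    return np.concatenate([np.asarray(S_t, dtype=float),
                           np.asarray(M_t, dtype=float)])

# Example with arbitrary dimensionalities: one distance sensor and three
# motor parameters (e.g. left wheel speed, right wheel speed, sound frequency).
SM_t = sensorimotor_context([0.42], [0.1, -0.3, 0.7])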
			
			<note place="footnote">2  A variant of this system is the use of only one monolithic learning system, keeping the <lb/>mechanism of region construction by incremental space splitting. In this case, for each pre-<lb/>diction of the single learning system, its error is stored in the list corresponding to the region <lb/>covering the associated situation. The evaluation of the expected learning progress of a candi-<lb/>date situation is the same as in the system presented here. Yet, we prefer to use one learning <lb/>system per region in order to avoid forgetting problems which are typical of monolithic learning <lb/>machines when used in a life-long learning set-up with various kinds of situations. <lb/></note> 
			
			<page>20 <lb/></page> 
			
			4.3 Regions <lb/> IAC equips the robot with a memory of all the exemplars (SM(t), S(t + 1)) <lb/> which have been encountered by the robot. There is a mechanism which incre-<lb/>mentally splits the sensorimotor space into regions, based on these exemplars. <lb/>Each region is characterized by its exclusive set of exemplars. At the beginning, <lb/>there is only one region R_1. Then, when a criterion C_1 is met, this region is <lb/>split into two regions. This is done recursively. A very simple criterion C_1 can <lb/>be used: when the number of exemplars associated with the region is above a <lb/>threshold T = 250, then split. This criterion guarantees a low number <lb/>of exemplars in each leaf, which renders the prediction and learning mechanism <lb/>that we will describe in the next paragraphs computationally efficient. The <lb/>downside is that it will lead to systems with many regions which are not <lb/>easily interpretable from a human point of view. <lb/>When a split has been decided, then another criterion C_2 must be used <lb/>to find out how the region will be split. Again, the choice of this criterion was <lb/>made so that it is computationally and experimentally efficient. The idea is that <lb/>we split the set of exemplars into two sets so that the sum of the variances of <lb/>the S(t + 1) components of the exemplars of each set, weighted by the number of <lb/>exemplars in each set, is minimal. Let us explain this mathematically. Let us <lb/>denote by <lb/> Γ_n = {(SM(t), S(t + 1))_i} <lb/> the set of exemplars possessed by region R_n. Let us denote by j a cutting dimension <lb/>and by v_j an associated cutting value. Then the split of Γ_n into Γ_{n+1} and Γ_{n+2} is <lb/>
			
			<page>21 <lb/></page> 
			
			done by choosing a j and a v_j such that (criterion C_2): <lb/> • all the exemplars (SM(t), S(t + 1))_i of Γ_{n+1} have the j-th component of <lb/>their SM(t) smaller than v_j; <lb/> • all the exemplars (SM(t), S(t + 1))_i of Γ_{n+2} have the j-th component of <lb/>their SM(t) greater than v_j; <lb/> • the quantity <lb/> |Γ_{n+1}| · σ({S(t + 1) | (SM(t), S(t + 1)) ∈ Γ_{n+1}}) + |Γ_{n+2}| · σ({S(t + 1) | (SM(t), S(t + 1)) ∈ Γ_{n+2}}) <lb/> is minimal, where <lb/> σ(S) = ( Σ_{v ∈ S} || v − ( Σ_{v ∈ S} v ) / |S| ||² ) / |S| <lb/> is the variance of a set of vectors S, and |S| denotes the cardinality of S. <lb/> Then recursively and for each region, if the criterion C_1 is met, the region is <lb/>split into two regions with the criterion C_2. This is illustrated in figure 2. <lb/>Each region stores all the cutting dimensions and the cutting values that <lb/>were used in its generation as well as in the generation of its parent regions. As <lb/>a consequence, when a prediction of the consequences of SM(t) has to be made, <lb/>it is easy to find the expert specialized in this case: it is the one for which <lb/> SM(t) satisfies all the cutting tests (and there is always a single expert which <lb/>corresponds to each SM(t)). <lb/>
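To make the two criteria concrete, here is a minimal sketch in Python (the function names and the brute-force search over candidate cutting values are our own choices, not from the paper): a region is split once it holds more than T = 250 exemplars (criterion C_1), and the cut (j, v_j) is the one minimizing the weighted sum of variances of the S(t + 1) parts of the two child sets (criterion C_2).

import numpy as np

T = 250  # criterion C1: split a region once it holds more than T exemplars

def weighted_variance(targets):
    """|set| * variance of a set of S(t+1) vectors (0 for an empty set)."""
    targets = np.atleast_2d(np.asarray(targets, dtype=float))
    if targets.size == 0:
        return 0.0
    deviations = targets - targets.mean(axis=0)
    return float(np.sum(deviations ** 2))  # equals |set| * sigma(set)

def best_split(SM, S_next):
    """Criterion C2: return the (j, v_j) minimizing the weighted sum of the
    variances of the S(t+1) components of the two child exemplar sets."""
    SM, S_next = np.asarray(SM, dtype=float), np.asarray(S_next, dtype=float)
    best_j, best_v, best_cost = None, None, np.inf
    for j in range(SM.shape[1]):                  # candidate cutting dimensions
        for v in np.unique(SM[:, j]):             # candidate cutting values
            left, right = SM[:, j] <= v, SM[:, j] > v
            if left.all() or right.all():
                continue                          # both children must be non-empty
            cost = weighted_variance(S_next[left]) + weighted_variance(S_next[right])
            if cost < best_cost:
                best_j, best_v, best_cost = j, v, cost
    return best_j, best_v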
			
			<page>22 <lb/></page> 
			
			4.4 Experts <lb/> To each region R_n, there is an associated learning machine E_n, called an expert. <lb/>A given expert E_n is responsible for the prediction of S(t + 1) given SM(t) <lb/> when SM(t) is a situation which is covered by its associated region R_n. Each <lb/>expert E_n is trained on the set of exemplars which is possessed by its associated <lb/>region R_n. An expert can be a neural network, a support-vector machine or a <lb/>Bayesian machine for example. For all learning machines whose training can <lb/>be incremental, such as neural networks, support-vector machines, or memory-<lb/>based methods, the system is efficient since it is not necessary to re-train <lb/>each expert on all the exemplars of each region, but just to update one single <lb/>expert by feeding the new exemplar to it. Still, when a region is split, one <lb/>cannot directly use the &quot;parent&quot; expert to implement the two child experts. <lb/>Each child expert is typically a fresh expert re-trained with the exemplars that <lb/>its associated region has inherited. The computational cost associated with this <lb/>is limited thanks to the fact that the number of exemplars is never higher than <lb/> T = 250 as guaranteed by the C_1 criterion.  3 <lb/> 
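As a concrete illustration (a sketch of our own; the footnote below notes that the experiments in the paper implement the expert with the nearest-neighbours algorithm), an expert can simply store its region's exemplars and predict S(t + 1) from the stored exemplar whose SM(t) is closest to the query:

import numpy as np

class NearestNeighbourExpert:
    """Expert E_n for one region: predicts S(t+1) from SM(t) using the stored
    exemplar whose SM(t) is closest to the query (1-nearest-neighbour).
    Assumes at least one exemplar has been added before predict() is called."""

    def __init__(self):
        self.SM = []      # stored sensorimotor contexts SM(t)
        self.S_next = []  # corresponding observed S(t+1)

    def add_exemplar(self, sm, s_next):
        self.SM.append(np.asarray(sm, dtype=float))
        self.S_next.append(np.asarray(s_next, dtype=float))

    def predict(self, sm):
        sm = np.asarray(sm, dtype=float)
        distances = [np.linalg.norm(sm - stored) for stored in self.SM]
        return self.S_next[int(np.argmin(distances))]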
			
			<note place="footnote">3  Even computationally demanding learning machines such as non-linear support vector <lb/>machines require only a few dozen milliseconds on a standard computer to be trained with 250 <lb/>examples, even if these examples have several hundred dimensions ([36]). In the experiments <lb/>described in the next sections, we use a very simple learning algorithm for implementing the <lb/>expert: the nearest-neighbours algorithm. In this case, there is not even a need for re-training <lb/>the expert, since the expert is the set of exemplars. In general, the use of the nearest-<lb/>neighbour algorithm is computationally costly when used at the prediction stage, since it <lb/>requires as many computations of distances as there are exemplars. Again, the criterion C_1 <lb/> guarantees that the number of exemplars is always low and allows for a fast computation of <lb/></note> 
			
			<page>23 <lb/></page> 
			
			4.5 Evaluation of learning progress <lb/> This partition of the sensorimotor space into different regions is the basis of <lb/>our regional evaluation of learning progress. Each time an action is executed <lb/>by the robot in a given sensorimotor context SM(t) covered by the region R_n, <lb/>the robot can measure the discrepancy between the sensory state S̃(t + 1) that <lb/>the expert E_n predicted and the actual sensory state S(t + 1) that it measures. <lb/>This provides a measure of the error of the prediction of E_n at time t + 1: <lb/> e_n(t + 1) = ||S(t + 1) − S̃(t + 1)||² <lb/> 
			
			<note place="footnote">the closest exemplar. It is also interesting to note that if one were to use a monolithic learning <lb/>system with only one global expert, which is a variation of IAC mentioned earlier, then the <lb/> use of the nearest neighbours algorithm would soon become computationally very expensive <lb/>since a life-long learning robot can accumulate millions of exemplars. On the contrary, using <lb/>local experts to which access is computed with a tree of cheap numerical comparisons (see <lb/>figure 2) makes it possible to compute approximately correct global nearest neighbours with a logarithmic <lb/>complexity (O(log(N))) rather than with a linear complexity (O(N)). And in fact, using a <lb/>tree structure with local experts not only speeds up the nearest neighbours algorithm, <lb/>but also increases the performance in generalization. In practice, this means that <lb/>the system we present in this paper, when used for example with the nearest neighbours <lb/>algorithm, can update itself as well as make predictions when it already possesses 3,000,000 <lb/>exemplars in a few milliseconds on a personal computer, since in this case it requires about 17 <lb/>scalar comparisons (depth of the corresponding balanced tree) and 250 distance computations <lb/>between points. Admittedly, this requires a lot of memory, but it is interesting to note that <lb/>the collection of 3,000,000 exemplars composed of, for example, 20 dimensions, which would take <lb/>approximately 34 days to collect in the case of the robots presented in the &quot;Playground Experiment&quot; <lb/>section, would require about 230 MB in memory, which is much less than the capacity of most <lb/>handheld computers nowadays. <lb/></note>

			<page> 24 <lb/></page>

			This squared error is added to the list of past squared errors of E_n, which are <lb/>stored in association with the region R_n. We denote this list: <lb/> e_n(t), e_n(t − 1), e_n(t − 2), ..., e_n(0) <lb/>Note that here t denotes a time which is specific to the expert, and not to the <lb/>robot: this means that e_n(t − 1) might correspond to the error made by the <lb/>expert E_n in an action performed at t − 10 for the robot, and that no actions <lb/>corresponding to this expert were performed by the robot since that time. These <lb/>lists associated with the regions are then used to evaluate the learning progress that <lb/>has been achieved after an action M(t) has been executed in sensory context <lb/> S(t), leading to a sensory context S(t + 1). The learning progress that has been <lb/>achieved through the transition from the SM(t) context, covered by region R_n, <lb/>to the context with a perceptual vector S(t + 1) is computed as the smoothed <lb/>derivative of the error curve of E_n corresponding to the acquisition of its recent <lb/>exemplars. Mathematically, the computation involves two steps: <lb/> • the mean error rate in prediction is computed at t + 1 and t + 1 − τ: <lb/> ⟨e_n(t + 1)⟩ = ( Σ_{i=0}^{θ} e_n(t + 1 − i) ) / (θ + 1) <lb/> ⟨e_n(t + 1 − τ)⟩ = ( Σ_{i=0}^{θ} e_n(t + 1 − τ − i) ) / (θ + 1) <lb/>where τ is a time window parameter typically equal to 15, and θ a smooth-<lb/>ing parameter typically equal to 25. <lb/> • the actual decrease in the mean error rate in prediction is defined as: <lb/> D(t + 1) = ⟨e_n(t + 1)⟩ − ⟨e_n(t + 1 − τ)⟩ <lb/> (1) <lb/>
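A minimal sketch of this computation in Python (function names are ours; the code assumes the region's error list is stored in chronological order and that at least τ + θ + 1 errors have been recorded); the learning progress defined by equation (2) just below is simply the negation of D(t + 1):

def mean_error_rate(errors, end_index, theta=25):
    """<e_n> at end_index: mean of the theta+1 most recent squared errors
    e_n(end_index), e_n(end_index - 1), ..., e_n(end_index - theta)."""
    window = errors[max(0, end_index - theta): end_index + 1]
    return sum(window) / len(window)

def learning_progress(errors, tau=15, theta=25):
    """L(t+1) = -D(t+1), where D(t+1) is the difference between the smoothed
    mean error rate at t+1 and at t+1-tau (equations (1) and (2))."""
    t1 = len(errors) - 1  # index of the most recent error e_n(t+1)
    d = mean_error_rate(errors, t1, theta) - mean_error_rate(errors, t1 - tau, theta)
    return -d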

			<page>25 <lb/></page>

			We can then define the actual learning progress as <lb/> L(t + 1) = −D(t + 1) <lb/>(2) <lb/>Eventually, when a region is split into two regions, both new regions inherit the <lb/>list of past errors from their parent region, which allows them to evaluate <lb/>learning progress right from the time of their creation. <lb/> 4.6 Action selection <lb/> We now have in place a prediction machinery and a mechanism which provides <lb/>an internal reward (positive or negative) <lb/> r(t) = L(t) <lb/> each time an action is performed in a given context, depending on how much <lb/>learning progress has been achieved  4  . The goal of the intrinsically motivated <lb/>robot is then to maximize the amount of internal reward that it gets. Mathemat-<lb/>ically, this can be formulated as the maximization of future expected rewards <lb/>(i.e. maximization of the return), that is <lb/> E{ Σ_{t ≥ t_n} γ^{t − t_n} r(t) } <lb/> 
			
			<note place="footnote">4  To integrate reward resulting from learning progress with other kinds of (possibly extrin-<lb/>sic) rewards, a weighted sum can be used. A parameter α_i specifies the relative weight of each <lb/>reward type: <lb/> r(t) = Σ_i α_i · r_i(t) <lb/>(3) <lb/></note>

			<page> 26 <lb/></page>

			where γ (0 ≤ γ ≤ 1) is the discount factor, which assigns less weight to the <lb/>rewards expected in the far future. <lb/>This formulation corresponds to a reinforcement learning problem formula-<lb/>tion [37] and thus the techniques developed in this field can be used to implement <lb/>an action selection mechanism which will allow the robot to maximize future <lb/>expected rewards efficiently. Indeed, in reinforcement learning models, a con-<lb/>troller chooses which action a to take in a context s based on rewards provided <lb/>by a critic. Traditional models view the critic as being external to the agent. <lb/>Such situations correspond to extrinsically motivated forms of learning. But the <lb/>critic can just as well be part of the agent itself (as clearly argued by Sutton and <lb/>Barto [37] p.51-54). As a consequence, the algorithm described in this section <lb/>can be interpreted as a critic capable of producing internal rewards r(t) in order <lb/>to guide the agent in its development. Thus, any existing reinforcement learning <lb/>technique can be associated with the IAC drive. <lb/>One simple example would be to use Watkins&apos; Q-learning [38]. The algorithm <lb/>learns an action-value function Q(s, a), estimating how good it is to perform a <lb/>given action a (M(t) in our context) in a given contextual state s (S(t) in our <lb/>context). &quot;Good&quot; actions are expected to lead to more future rewards (e.g. <lb/>more future learning progress in our context). The algorithm can be described <lb/>in the following procedural form: <lb/> • Initialise Q(s, a) with small random uniform values <lb/> • Repeat <lb/>
			
			<page>27 <lb/></page> 
			
			– In situation s, choose a using a policy derived from Q. For instance, <lb/>choose the a that maximizes Q in most cases, but every once in a while, <lb/>with a probability ε, instead select an action at random, uniformly <lb/>(this is called an ε-greedy action selection rule [37]) <lb/> – Perform action a, observe r and the resulting state s′ <lb/> – Q(s, a) ← Q(s, a) + α[r + γ · max_{a′} Q(s′, a′) − Q(s, a)] <lb/> – s ← s′ <lb/> where the parameter α is the learning rate controlling how fast the action-<lb/>value function is updated by experience. Of course, all the complex issues <lb/>traditionally encountered in reinforcement learning, like the trade-off between ex-<lb/>ploration and exploitation, remain crucial for systems using internal rewards based <lb/>on intrinsic motivation. <lb/>The purpose of this article is to focus on the study and understanding <lb/>of the learning progress definition that we presented. Using a complex re-<lb/>inforcement machinery brings complexity and biases which are specific to a <lb/>particular method, especially concerning the way they process delayed rewards. <lb/>While using such a method with intrinsic motivation systems will surely be <lb/>useful in the future, and is in fact an entire subject of research as illustrated <lb/>by the work of Barto, Singh and Chentanez ([21]) who have studied the use <lb/>of sophisticated reinforcement learning techniques on a simple novelty-based <lb/>intrinsic motivation system, we will now make a simplification which will allow <lb/>us not to use such sophisticated reinforcement learning methods so that the <lb/>
			
			<page>28 <lb/></page> 
			
			results we will present in the experiment section can be interpreted more easily. <lb/>Indeed, this is a necessary step since our intrinsic motivation system involves <lb/>a non-trivial measure of learning progress which must be carefully understood. <lb/>This simplification consists in having the system try to maximize only the ex-<lb/>pected reward it will receive at t + 1, i.e. E{r(t + 1)}. This avoids <lb/>problems related to delayed rewards and it makes it possible to use a simple <lb/>prediction system which can predict r(t + 1), and so evaluate E{r(t + 1)}, and <lb/>then be used in a straightforward action selection loop. The method we use to <lb/>evaluate E{r(t + 1)} given a sensory context S(t) and a candidate action M̃(t), <lb/> constituting a candidate sensorimotor context S̃M(t) covered by region R_n, is <lb/>straightforward but proved to be efficient: it is equal to the learning progress <lb/>that was achieved in R_n with the acquisition of its recent exemplars, i.e. <lb/> E{r(t + 1)} ≈ L(t − θ_{R_n}) <lb/>(4) <lb/>where t − θ_{R_n} is the time corresponding to the last time region R_n and ex-<lb/>pert E_n processed a new exemplar. <lb/>Based on this predictive mechanism, one can deduce a straightforward mech-<lb/>anism which manages action selection in order to maximize the expected reward <lb/>at t + 1: <lb/> • in a given sensory context S(t), the robot makes a list of the possible <lb/>actions M̃(t) which it can do; if this list is infinite, which is often the case <lb/>since we work in continuous action spaces, a sample of candidate actions <lb/>
			
			<page>29 <lb/></page> 
			
			is generated; <lb/> • each of these candidate actions M̃(t) associated with the context makes a <lb/>candidate vector S̃M(t) for which the robot finds out the corresponding <lb/>region R_n; then the formula we just described is used to evaluate the <lb/>expected learning progress E{r(t + 1)} that might be the result of executing <lb/>the candidate action M̃(t); <lb/> • the action for which the system expects the maximal learning progress <lb/>is chosen and executed, except in some cases when a random action is <lb/>selected (ε-greedy action selection rule). In the following experiments ε <lb/> is typically 0.35. <lb/> • after the action has been executed and the consequences measured, the <lb/>system is updated. <lb/> 5 Methodological issues for measuring behavioural <lb/>complexity <lb/> From a developmental robotics point of view, intrinsic motivation systems are <lb/>interesting as a way to achieve a continuous increase in behavioural complex-<lb/>ity. This raises issues for finding adequate methods to evaluate such systems. <lb/>Evaluations based on the performance level for a set of predefined tasks are the most <lb/>common way to assess the learning progress of adaptive robots. However, as intrin-<lb/>sic motivation systems are designed to result in task-independent autonomous <lb/>
			
			<page>30 <lb/></page> 
			
			development, using an evaluation paradigm coming from task-oriented design is <lb/>not well adapted. Moreover, such evaluation methods are associated with the <lb/>tempting anthropomorphic bias to evaluate how well robots manage to learn <lb/>the tasks that humans can learn. <lb/>The issue is therefore to evaluate the increase of a robot&apos;s behavioural com-<lb/>plexity during a developmental sequence. It is important to stress that there is <lb/>no single objective way of assessing the increase of complexity of a system. <lb/>Complexity is always related to a given observer ([39]). Three complementary <lb/>approaches can be envisioned. <lb/> • First, it is possible to evaluate the increase in complexity from the robot&apos;s <lb/> point of view. This means measuring internal variables that account for <lb/>the open-endedness of its development (e.g. cumulative amount of learning <lb/>progress, evolution of the performance of anticipations, evolution of the <lb/>way sensorimotor situations are categorized and represented). <lb/> • Second, behavioural complexity can be measured from an external point <lb/>of view based on various complexity measures (information-theoretical <lb/>measures such as the ones presented by Sporns and Pegors could be used <lb/>in that respect ([40])). The increase in behavioural complexity is assessed <lb/>by pattern changes in these measures. <lb/> • Finally, the experimenter can adopt a method more similar to that used by <lb/>a psychologist, interpreting developmental sequences as a set of successive <lb/> stages. The stages of development introduced by Piaget are among the <lb/>
			
			<page>31 <lb/></page> 
			
			most famous examples of such qualitative descriptions [41]. Each tran-<lb/>sition between stages corresponds to a broad change in the structure or <lb/>logic of children&apos;s intelligence and/or behaviour. Based on clinical obser-<lb/>vations, dialogues and small-scale experiments, the psychologist tries to <lb/>interpret the signs of an internal reorganization. Therefore, the issue is to <lb/>map external observations to a series of pre-existing interpretative models. <lb/>Transitions are most of the time progressive, and cutting a developmental <lb/>sequence into sharp divisions is usually difficult. <lb/>The following experiments will illustrate how a combination of some of these <lb/>methods can be used to assess the development of a robot with an intrinsic <lb/>motivation system. <lb/> 6 A first experiment with a simple simulated <lb/>robot <lb/> We present here a robotic simulation implemented with the Webots simulation <lb/>software ([42]). The purpose of this initial simulated experiment is to show and <lb/>understand in detail the working of the IAC system in a continuous sensorimotor <lb/>environment in which there are parts which are clearly inhomogeneous from the <lb/>learning point of view: there is a part of the space which is easy to learn, a part <lb/>of the space which contains more complex structures which can be learnt, and <lb/>a part of the space which is unlearnable. <lb/>
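For reference, here is a minimal sketch of such a toy environment in Python (the zone boundaries and toy behaviours follow sections 6.1-6.3 below; the noise model, magnitudes and the handling of the robot's own motion are our own simplifying assumptions):

import random

def toy_step(distance, left_speed, right_speed, frequency, noise=0.01):
    """One step of the toy's behaviour as described in the text:
    f1 = [0, 0.33]    -> the toy moves randomly (unpredictable distance),
    f2 = [0.34, 0.66] -> the toy stops; the distance then only changes through
                         the robot's own motion (crudely simplified here),
    f3 = [0.67, 1]    -> the toy jumps onto the robot (distance ~ 0 plus noise)."""
    if frequency <= 0.33:
        new_distance = random.uniform(0.0, 1.0)
    elif frequency <= 0.66:
        # crude stand-in for the robot's displacement relative to a motionless toy
        new_distance = distance - 0.05 * (left_speed + right_speed)
        new_distance += random.gauss(0.0, noise)
    else:
        new_distance = random.gauss(0.0, noise)
    return max(0.0, new_distance)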
			
			<page>32 <lb/></page> 
			
			6.1 Motor control <lb/> The robot is a box with two wheels (see figure 3). Each wheel can be controlled <lb/>by setting its speed (real number between -1 and 1). The robot can also emit <lb/>a sound of a particular frequency. The action space is 3-dimensional and con-<lb/>tinuous, and deciding on an action consists in setting the values of the motor <lb/>vector M(t): <lb/> M(t) = (l, r, f) <lb/>where l is the speed of the motor on the left, r the speed of the motor on the <lb/>right, and f the frequency of the emitted sound. The robot moves in a room. <lb/>There is a toy in this room that can also move. This toy moves randomly if the <lb/>sound emitted by the robot has a frequency belonging to zone f_1 = [0; 0.33]. It <lb/>stops moving if the sound is in zone f_2 = [0.34; 0.66]. The toy jumps onto the <lb/>robot if the sound is in zone f_3 = [0.67; 1]. <lb/> 6.2 Perception <lb/> The robot perceives the distance to the toy with simulated infra-red sensors, so <lb/>its sensory vector S(t) is one-dimensional: <lb/> S(t) = (d) <lb/>where d is the distance between the robot and the toy at time t. <lb/> 6.3 Action perception loop <lb/> As a consequence, the mapping that the robot is trying to learn is: <lb/>
			
			<page>33 <lb/></page> 
			
f : SM(t) = (l, r, f, d) ↦ S(t + 1) = (d̃) <lb/> Using the IAC algorithm, the robot will thus act in order to maximize its <lb/>learning progress in terms of predicting the next toy distance. The robot has <lb/>no prior knowledge, and in particular it does not know that there is a qualitative <lb/>difference between setting the speed of the wheels and setting the sound <lb/>frequency (for the robot, these are unlabeled motor channels). It does not know <lb/>that there are three zones of the sensorimotor space of different complexities: <lb/>the zone corresponding to sounds in f 1 , where the distance to the toy cannot <lb/>be predicted since its movement is random; the zone with sounds in f 3 , where <lb/>the distance to the toy is easy to learn and predict (it is always 0 plus a noise <lb/>component, because Webots simulates the imprecision of sensors and actuators); <lb/>and the zone with sounds in f 2 , where the distance to the toy is predictable (and <lb/>learnable) but complex and dependent on the setting of the wheel speeds. <lb/>Yet, we will now show that the robot manages to autonomously discover <lb/>these three zones, evaluate their relative complexity, and exploit this information <lb/>for organizing its own behaviour. <lb/> 6.4 Results <lb/> First of all, one can study the behaviour of the robot during a simulation from an <lb/>external point of view. A way to do that is to use our knowledge of the structure <lb/>of the environment in which the robot lives and to build corresponding relevant <lb/>measures characterizing the behaviour of the robot within a given period of <lb/>
			
			<page>34 <lb/></page> 
			
time: 1) the frequency of situations in which it emits a sound within f 1 ; 2) the <lb/>frequency of situations in which it emits a sound within f 2 ; 3) the frequency of <lb/>situations in which it emits a sound within f 3 . Figure 4 shows the evolution of <lb/>these measures for 5000 time steps. Several phases can be identified: <lb/> Stage 1: Initially, the robot produces all kinds of actions with a uniform <lb/>probability, and in particular produces sounds with frequencies within the <lb/>whole [0, 1] spectrum. <lb/> Stage 2: After the first 250 steps, the robot concentrates on emitting <lb/>sounds within f 3 , and emits sounds with frequencies within f 1 or f 2 very <lb/>rarely. <lb/> Stage 3: There is then a phase in which the robot concentrates on emitting <lb/>sounds within f 2 , and emits sounds with frequencies within f 1 or f 3 <lb/>very rarely. <lb/>This shows that the robot consistently avoids the situations in which nothing <lb/>can be learnt, and begins with easy situations before shifting autonomously to a <lb/>more complex situation. <lb/>We can now study what happens from the robot&apos;s point of view. Figure 5 <lb/>shows a representation of the successive values of &lt; e n (t) &gt; for all the regions <lb/> R n constructed by the robot at a given time t. As time is here defined <lb/>internally as the number of action selection loops, it corresponds to the number <lb/>of actions that have been chosen by the robot, and to the number of exemplars <lb/>
			
			<page>35 <lb/></page> 
			
that have been provided to it. The graph appears as a tree, which corresponds <lb/>to the successive splitting of the space into regions. For example, between t = 0 <lb/>and t = 250, there is only one curve because during that time there was only <lb/>one region R 1 . This initial curve is the sequence of values of &lt; e 1 (t) &gt;. Then, <lb/>because the criterion C 1 was met, this region splits into two regions R 2 and R 3 , <lb/>which also splits the curve into two curves, one corresponding to the successive <lb/>values of &lt; e 2 (t) &gt; and the other corresponding to the successive values of <lb/> &lt; e 3 (t) &gt;. Then the curves split again when their associated regions split, etc. <lb/>By looking at the trace of the simulation and the definitions of the regions <lb/>associated with each curve, it is possible to figure out what the regions which <lb/>are iteratively created look like. It appears that the first split, appearing at <lb/> t = 250, corresponds to a split between situations in which the robot emits <lb/>sounds with a frequency within f 3 (R 2 on the graph), and situations in which <lb/>the robot emits sounds with a frequency within f 1 or f 2 (R 3 on the graph). <lb/>To be exact, the system made the split by using the 3rd dimension of SM(t), <lb/> i.e. the frequency f , and using the cut value 0.35, which means that the region <lb/> R 2 possibly includes a small portion of situations with a sound in f 2 , since f 2 <lb/>begins at 0.34 5 . Now, we observe that the curve corresponding to R 2 shows a <lb/>sharp decrease in its error rate, while the curve of R 3 shows an increase in the <lb/>error rate. This explains why during this period the robot emits sounds with <lb/>frequencies within f 3 : indeed, this corresponds to situations which are internally <lb/>
			
<note place="footnote">5 This also shows that the splitting criteria C 1 and C 2 that we presented operate efficiently, <lb/>since the system finds out by itself that the f dimension is the most relevant one for <lb/>cutting the space at the beginning of development. <lb/></note> 
			
			<page>36 <lb/></page> 
			
evaluated as providing the highest amount of learning progress at this point <lb/>of its development. Nevertheless, as the robot sometimes performs random <lb/>actions, the region R 3 accumulates some more exemplars, and we observe that <lb/>around t = 320 it splits into R 4 and R 5 . Looking at the trace shows that R 4 <lb/>corresponds to situations with sounds within f 2 and R 5 to situations with sounds <lb/>within f 1 . We observe that the error rate continues to increase until a plateau is reached <lb/>for R 5 , while it begins to decrease for R 4 . During that time, the robot finally <lb/>predicts perfectly well the situations with sounds with a frequency within f 3 , <lb/>associated with R 2 (it still takes a while because of the noise), and a plateau <lb/>close to 0 in the error rate is reached. This is why at some point the robot <lb/>shifts to situations in which it emits sounds with frequencies within f 2 , which <lb/>are a higher source of learning progress at this point in its <lb/>development. The robot then tries to vary its motor speeds within this subspace <lb/>of sounds with frequencies in f 2 , learning to predict how these speeds <lb/>affect the distance to the toy. The accumulation of new exemplars pushes the <lb/>robot to split R 4 into more regions, which is a refinement of its categorization <lb/>of this kind of situation. Now the system splits the space using the l and r <lb/>dimensions, and the robot figures out how to explore efficiently, in terms of learning <lb/>progress, the subspace of situations with sounds with frequencies within f 2 . <lb/> 6.5 Performance in terms of active learning <lb/> The efficiency of the exploration of this subspace of situations with sounds in <lb/> f 2 , where interesting things can be learnt, can be evaluated if we reformulate <lb/>
			
			<page>37 <lb/></page> 
			
IAC within the same problem setting as active learning. This will also allow us to <lb/>evaluate the efficiency of the IAC algorithm from the point of view of active <lb/>learning. Indeed, as we explained in the introduction, in the field of machine <lb/>learning and data mining there is a growing interest in methods which reduce the <lb/>number of examples needed to achieve a given level of performance in <lb/>generalization for a machine which learns an input-output mapping <lb/>(here the input is SM(t) and the output is S(t + 1)). While IAC was <lb/>designed as a system for driving the development of a robot, it can also be <lb/>considered as a pure active learning algorithm, and in this respect it is interesting <lb/>to evaluate how it compares with standard existing algorithms. Thus, we will <lb/>use two reference algorithms to evaluate the performance of IAC. The first one <lb/>follows the most common idea in the field of active learning ([25], [15], [24]): <lb/>the next action (also called query or experiment depending on the <lb/>authors) is chosen so that it corresponds to an input-output pair for which the <lb/>machine evaluates that its prediction will be maximally wrong as <lb/>compared to its predictions for other possible pairs. It is easy to adapt this idea <lb/>using the same algorithmic architecture as the one used for IAC: when the <lb/>robot has to decide on an action in a given context, it makes the list of possible <lb/>actions within that context, then for each of them evaluates the expected <lb/>error in prediction using the quantity &lt; e mean (t) &gt; defined earlier, and finally <lb/>chooses the action for which this quantity is maximal. Everything else is equal. <lb/>We will call this algorithm &quot;MAX&quot;. The second reference algorithm that we use <lb/>is the &quot;RANDOM&quot; algorithm, which simply consists of random action selection <lb/>
			
			<page>38 <lb/></page> 
			
(and so is not an active learning algorithm, but serves as a baseline). <lb/>IAC, MAX and RANDOM will be compared in terms of their performance <lb/>in generalization when predicting the consequence of actions characterized by a <lb/>frequency within the f 2 zone. This means that we will evaluate each of them in <lb/>the part of the space which we know is interesting. Yet, the whole space with all ranges of <lb/>frequencies is made available to the robot, which, as before, does not know that <lb/>there is a particular zone where it can actually learn non-trivial things. <lb/>For a given simulation using a given algorithm among IAC, MAX and RANDOM, <lb/>we evaluate every 100 actions the performance in generalization of the <lb/>current learning machine. To do that, we initially ran a simulation with random <lb/>action selection and collected a database of input-output pairs by storing the <lb/>experienced (SM(t), S(t + 1)) couples for which the action included the emission <lb/>of a sound with a frequency within f 2 . This provides an independent test <lb/>set which we used to test the prediction capacity that the robot had acquired at <lb/>a given time in its development. For this test, which is done every 100 actions, <lb/>we freeze the learning machine and make it predict the output corresponding <lb/>to all the inputs in the test database. The freezing ensures that the machine <lb/>does not learn while it is tested. The prediction accuracy is measured using the <lb/>mean squared error over the database. After evaluating the performance, we <lb/>unfreeze the system until the next evaluation. <lb/>Figure 6 shows typical resulting curves for the three algorithms. We see that <lb/>initially, the algorithm which learns fastest is the RANDOM algorithm. This <lb/>is normal since MAX spends time in uninteresting situations, and IAC at the <lb/>
			
			<page>39 <lb/></page> 
			
beginning spends time in the easy situation, so RANDOM is the algorithm which <lb/>initially provides the highest number of examples related to the production of <lb/>sounds with frequencies within f 2 (33 percent of examples are of this type <lb/>in this case). Then, after 3000 actions, the curve corresponding to the IAC <lb/>algorithm suddenly drops: this corresponds to the shift of attention of <lb/>the robot towards situations with sounds with frequencies within f 2 . Now the <lb/>robot spends 85 percent of its time in situations with sounds with frequencies <lb/>within f 2 (and not 100 percent, because of the 0.15 probability of performing a random <lb/>action). Quickly, the IAC curve falls significantly below the RANDOM curve, <lb/>and reaches a low plateau around 5000 actions (where the mean prediction <lb/>error stays around 0.09). The RANDOM curve reaches a low plateau much <lb/>later (this is not represented on this figure), after about 11000 actions. The <lb/>value of this plateau, interestingly, is higher than with the IAC algorithm: it is <lb/>0.096. We repeated the experiment 100 times in order to see whether this had <lb/>some statistical significance. In each simulation, we measured the time at which a <lb/>plateau was reached (defined as 500 successive points where the mean squared <lb/>error has a variance smaller than 0.0001), and what the mean squared error <lb/>was at that time. It turned out that the plateau was reached at t = 4583 on <lb/>average for IAC, with a standard deviation of 452, and at t = 11980 on average, <lb/>with a standard deviation of 561, for RANDOM. The mean squared error was <lb/> e = 0.089 on average with a standard deviation of 0.009 for IAC, and was e = 0.096 <lb/>with a standard deviation of 0.004 for RANDOM. As a consequence, we can <lb/>consistently say that IAC allows the robot to learn the interesting part of the mapping <lb/>
			
			<page>40 <lb/></page> 
			
about 2.6 times faster, and with a higher performance in generalization, than <lb/>the RANDOM algorithm. This increase in generalization performance <lb/>is similar to what has already been described for other active learning algorithms <lb/>([32]). <lb/> 6.6 Summary <lb/> With this experiment we have shown a first embodiment of the IAC system <lb/>within a simulated robot. This has allowed us to show how IAC could manage <lb/>the development of the robot in an inhomogeneous sensorimotor environment <lb/>with parts which were not learnable by the robot. We have shown how the robot <lb/>consistently avoided the unlearnable zone and, on the other hand, autonomously <lb/>explored sensorimotor situations of increasing complexity. This simple <lb/>set-up also allowed us to detail the evolution of the internal structures built by <lb/>the IAC system. We could explain, for example, the progressive formation of <lb/>regions with varying potentials for learning progress. Finally, this set-up not <lb/>only allowed us to show the interest of IAC as an intrinsic motivation system <lb/>which can self-organize the behaviour of a robot in a developmental manner; <lb/>it also showed that IAC is an efficient and robust active learning system. <lb/>Indeed, we showed that it was faster than both the RANDOM algorithm and <lb/>traditional active learning methods, which are not suited to mappings with strong <lb/>inhomogeneities and even unlearnable parts. <lb/>Yet, the simplicity of this set-up did not allow us to show how a developmental <lb/>sequence with more than one transition could self-organize autonomously (here, <lb/>
			
			<page>41 <lb/></page> 
			
there was only one transition, between a stage in which the robot focused on actions <lb/>with sounds in f 3 and a stage in which the robot focused on actions with <lb/>sounds in f 2 ). We are now going to present a more complex experiment in which <lb/>we will show that multiple sequential levels of self-organization of the behaviour <lb/>of the robot can happen. <lb/> 7 The Playground Experiment: the discovery of sensorimotor affordances <lb/> This new experimental set-up is called &quot;The Playground Experiment&quot;. It <lb/>involves a physical robot as well as a more complex sensorimotor system and <lb/>environment. We use a Sony AIBO robot which is put on a baby play mat with <lb/>various toys that can be bitten, bashed or simply visually detected (see figure <lb/>7). The environment is very similar to the ones in which two- or three-month-old <lb/>children learn their first sensorimotor skills, although the sensorimotor apparatus <lb/>of the robot is here much more limited. We have developed a web site which <lb/>presents pictures and videos of this set-up: http://playground.csl.sony.fr/. <lb/> 7.1 Motor control <lb/> The robot is initially equipped only with simple motor primitives. In particular <lb/>it is not able to walk around. There are three basic motor primitives: turning the <lb/>head, bashing and crouch biting. Each of them is controlled by a number of real-valued <lb/>parameters, which are the action parameters that the robot controls. <lb/>
			
			<page>42 <lb/></page> 
			
The &quot;turning head&quot; primitive is controlled with the pan and tilt parameters of <lb/>the robot&apos;s head. The &quot;bashing&quot; primitive is controlled with the strength and <lb/>the angle of the leg movement (a lower-level automatic mechanism takes care of <lb/>setting the individual motors controlling the leg). The &quot;crouch biting&quot; primitive <lb/>is controlled by the depth of crouching (the robot crouches in the direction <lb/>in which it is looking, which is determined by the pan and tilt parameters). <lb/>To summarize, choosing an action consists in setting the parameters of the <lb/>5-dimensional continuous vector M(t): <lb/> M(t) = (p, t, b s , b a , d) <lb/> where p is the pan of the head, t the tilt of the head, b s the strength of <lb/>the bashing primitive, b a the angle of the bashing primitive, and d the depth <lb/>of the crouching of the robot for the biting motor primitive. All values are <lb/>real numbers between 0 and 1, plus the value -1, which is a convention used <lb/>for not using a motor primitive: for example, M(t) = (0.3, 0.95, -1, -1, 0.29) <lb/>corresponds to the combination of turning the head with parameters p = 0.3 <lb/>and t = 0.95 with the biting primitive with the parameter d = 0.29, but with no <lb/>bashing movement. <lb/> 7.2 Perception <lb/> The robot is equipped with three high-level sensors based on lower-level sensors. <lb/>The sensory vector S(t) is thus 3-dimensional: <lb/>
			
			<page>43 <lb/></page> 
			
S(t) = (O v , B i , O s ) <lb/>where: <lb/> • O v is the binary value of an object visual detection sensor: it takes the <lb/>value 1 when the robot sees an object, and 0 otherwise. In the <lb/>playground, we use simple visual tags that we stick on the toys and that are <lb/>easy to detect from the image processing point of view. These tags are <lb/>black and white patterns similar to the Cybercode system developed by <lb/>Rekimoto ([43]). <lb/> • B i is the binary value of a biting sensor: it takes the value 1 when the <lb/>robot has something in its mouth and 0 otherwise. We use the cheek <lb/>sensor of the AIBO. <lb/> • O s is the binary value of an oscillation sensor: it takes the value 1 when <lb/>the robot detects that there is something oscillating in front of it, and 0 <lb/>otherwise. We use the infra-red distance sensor of the AIBO to implement <lb/>this high-level sensor. This sensor can detect, for example, when there is <lb/>an object that has been bashed in the direction of the robot&apos;s gaze, but <lb/>it can also detect events due to humans walking around the playground (we <lb/>do not control the environment). <lb/>It is crucial to note that initially the robot knows nothing about sensorimotor <lb/>affordances. For example, it does not know that the values of the object visual <lb/>detection sensor are correlated with the values of its pan and tilt. It does not <lb/>
			
			<page>44 <lb/></page> 
			
know that the values of the biting or object oscillation sensors can become 1 <lb/>only when biting or bashing actions are performed towards an object. It does <lb/>not know that some objects are more prone to provoke changes in the values of <lb/>the B i and O s sensors only when certain kinds of actions are performed in their <lb/>direction. It does not know, for example, that to get a change in the value of the <lb/>oscillation sensor, bashing in the correct direction is not enough, because it also <lb/>needs to look in the right direction (since its oscillation sensor is on the front <lb/>of its head). These remarks make it easy to understand that a random strategy <lb/>will not be efficient in this environment. If the robot performed random action <lb/>selection, in the vast majority of cases nothing would happen (especially for the <lb/> B i and O s sensors). <lb/> 7.3 The action perception loop <lb/> To summarize, the mapping that the robot has to learn is: <lb/> f : SM(t) = (p, t, b s , b a , d, O v , B i , O s ) ↦ S(t + 1) = (Õ v , B̃ i , Õ s ) <lb/>The robot is equipped with the Intelligent Adaptive Curiosity system, and thus <lb/>chooses its actions according to the potential learning progress that they can provide <lb/>to one of its experts. In this experiment, the action perception loop is rather <lb/>long: when the robot chooses and executes an action, it waits until all its motor <lb/>primitives have finished their execution, which lasts approximately one second, <lb/>
			
			<page>45 <lb/></page> 
			
before choosing the next action. This is how the internal clock for the IAC <lb/>system is implemented. On the one hand, this allows the robot to make all the <lb/>measurements necessary for determining adequate values of (O v , B i , O s ). On the <lb/>other hand, and most importantly, this allows the environment to come back <lb/>to its &quot;resting state&quot;. This means that the environment has no memory: after an <lb/>action has been executed by the robot, all the objects are back in the same <lb/>state. For example, if the object that can be bashed has actually been bashed, <lb/>then it has stopped oscillating before the robot tries a new action. Having an <lb/>environment with no memory is a deliberate choice: while keeping all <lb/>the advantages, the constraints and the complexity of a physical embodiment, <lb/>it makes the mapping from actions to perceptions learnable in a reasonable <lb/>time. This is crucial if one wants to run several experiments (already in this case, <lb/>each experiment lasts for nearly one day). Furthermore, introducing an environment <lb/>with memory would frame the problem of maximizing internal reward <lb/>as a delayed-reward reinforcement learning problem, for which there exist powerful <lb/>but complicated techniques whose biases would certainly make the results more <lb/>complex and more difficult to interpret. <lb/> 7.4 Results <lb/> During an experiment we continuously measure a number of features which help <lb/>us characterize the dynamics of the robot&apos;s development. First, we measure the <lb/>frequency of the different kinds of actions that the robot performs in a given <lb/>time window. More precisely: <lb/>
			
			<page>46 <lb/></page> 
			
• the percentage of actions, in the last 100 actions, which involve neither the biting <lb/>nor the bashing motor primitive (i.e. the robot&apos;s action boils down <lb/>to &quot;just looking&quot; in a given direction); <lb/> • the percentage of actions which involve the biting motor primitive in the <lb/>last 100 actions; <lb/> • the percentage of actions which involve the bashing motor primitive in the <lb/>last 100 actions. <lb/>Then, we track the gaze of the robot and at each action measure whether it is <lb/>looking towards 1) the biteable object, 2) the bashable object, or 3) no object. <lb/>This is possible since from an external point of view we know where the objects <lb/>are, so it is easy to derive this information from the head position. <lb/>Third, we measure the evolution of the frequency of successful biting actions <lb/>and the evolution of the frequency of successful bashing actions. A successful biting action is <lb/>defined as an action which provokes a &quot;1&quot; value on the B i sensor (an object has <lb/>actually been bitten). A successful bashing action is defined as an action which <lb/>provokes an oscillation in the O s sensor. <lb/>Figure 8 shows an example result, showing the evolution of the three kinds <lb/>of measures on three different levels. A striking feature of these curves is the <lb/>formation of sequences of peaks. Each of these peaks basically means that at the <lb/>moment it occurs the robot is focusing its activity and its attention on a small <lb/>subset of the sensorimotor space. So it is qualitatively different from random <lb/>action selection, for which the curves would be stationary and rather flat. By <lb/>looking in detail at these peaks and at their co-occurrence (or not) within the <lb/>
			
			<page>47 <lb/></page>
			 
different kinds of measures, we can describe the evolution of the <lb/>robot&apos;s behaviour. On figure 8, we have marked a number of such peaks with <lb/>letters from A to G. We can see that before the first peak, there is an initial <lb/>phase during which all actions are produced equally often, most often no <lb/>object is seen, and a successful bite or bash happens only extremely rarely. <lb/>This corresponds to a phase of random action selection. Indeed, initially the <lb/>robot categorizes the sensorimotor space using only one big region (and so there <lb/>is only one category), and so all actions in any context are equally interesting. <lb/>Then we observe a peak (A) in the &quot;just looking&quot; curve: this means that for <lb/>a while, the robot stops biting and bashing, and focuses on just moving its <lb/>head around. This means that at this point the robot has split the space into <lb/>several regions, and that a region corresponding to the sensorimotor loop of <lb/> &quot;just looking around&quot; is associated with the highest learning progress from the <lb/>robot&apos;s point of view. Then, the next peak (B) corresponds to a focus on the <lb/>biting action primitive (with various continuous parameters), but it does not <lb/>co-occur with looking towards the biteable object. This means that the robot <lb/>is trying to bite in basically all directions around it: it has not yet discovered <lb/>the affordances of the biting action with particular objects. The next peak (C) <lb/>corresponds to a focus on the bashing action primitive (with various continuous <lb/>parameters), but again the robot does not look in a particular direction. <lb/>As the only way to discover that a bashing action can make an object move is <lb/>by looking in the direction of this object (because the IR sensor is on the cheek), <lb/>this means that at this point the robot does not use the bashing primitive with <lb/>
			
			<page>48 <lb/></page> 
			
the right affordances. The next peak (D) corresponds to a period within which <lb/>the robot again stops biting and bashing and concentrates on moving the head, <lb/>but this time we observe that the robot focuses these &quot;looking&quot; movements on <lb/>a narrow part of the visual field: it is basically looking around one of the <lb/>objects, learning how it disappears/reappears in its field of view. Then, there <lb/>is a peak (E) corresponding to a focus on the biting action, which is this time <lb/>coupled with a peak in the curve monitoring the looking direction towards the <lb/>biteable object, and a peak in the curve monitoring the success in biting. It <lb/>means that during this period the robot uses the action primitive with the right <lb/>affordances, and manages to bite the biteable object quite often. This peak is <lb/>then repeated a little bit later (F). Finally, a co-occurrence of peaks (G) <lb/>appears that corresponds to a period during which the robot concentrates on <lb/>using the bashing primitive with the right affordances, managing to actually <lb/>bash the bashable object quite often. <lb/>This example shows that several interesting phenomena appear in <lb/>this run of the experiment. First of all, the presence and co-occurrence of peaks <lb/>of various kinds shows a self-organization of the behaviour of the robot, which <lb/>focuses on particular sensorimotor loops at different periods in time. Second, <lb/>when we observe these peaks, we see that they are not random peaks, but <lb/>show a progressive increase in the complexity of the behaviour to which they <lb/>correspond. Indeed, one has to remember that the intrinsic dimensionality of the <lb/> &quot;just looking&quot; behaviour (pan and tilt) is lower than that of the &quot;biting&quot; behaviour <lb/>(which adds the depth of the crouching movement), which is itself lower than <lb/>
			
			<page>49 <lb/></page> 
			
that of the &quot;bashing&quot; behaviour (which adds the angle and the strength dimensions). <lb/>The order of appearance of the periods within which the robot focuses on one <lb/>of these activities is precisely the same. If we look in more detail, we also see <lb/>that the biting behaviour appears first in a non-affordant version (the robot <lb/>tries to bite things which cannot be bitten), and only later in the affordant <lb/>version (where it tries to bite the biteable object). The same observation holds <lb/>for the bashing behaviour: first it appears without the right affordances, and <lb/>then it appears with the right affordances. The formation of focused activities <lb/>whose properties evolve and are refined with time can be used to describe the <lb/>developmental trajectories that are generated in terms of stages: indeed, one <lb/>can define that a new stage begins when a co-occurrence of peaks that never <lb/>occurred before happens (and which thus denotes a novel kind of focused activity). <lb/>We ran the experiment several times with the real robot, and whereas each <lb/>particular experiment produced curves which differed in the details, some <lb/>regularities in the patterns of peak formation, and thus in the stage sequences, <lb/>seemed to be present. We then proceeded to more experiments in <lb/>order to assess precisely the statistical properties of these self-organized developmental <lb/>trajectories. Because each experiment with the real robot lasts several <lb/>hours, and in order to be able to run many experiments (200), we developed a <lb/>model of the experimental set-up. Thanks to the fact that the physical environment <lb/>was memoryless after each action of the robot, it was possible to make <lb/>an accurate model of it using the following procedure: we let the robot perform <lb/>several thousand actions and we recorded each time SM(t) and S(t + 1). Then, <lb/>
			
			<page>50 <lb/></page> 
			
from this database of examples we trained a prediction machine based on locally <lb/>weighted regression [44]. This machine was then used as a model of the physical <lb/>environment, and the IAC algorithm of the robot was directly plugged into it. <lb/>Using this simulated-world set-up, we ran 200 experiments, each time monitoring <lb/>the evolution using the same measures as above. We then constructed <lb/>higher-level measures for each of the runs, based on the structure of the <lb/>peak sequence. Peaks were here defined using a threshold on the height and <lb/>width of the bumps in the curves (one possible way of computing such peak-based <lb/>measures is sketched below). These measures correspond to the answers to <lb/>the following questions: <lb/> • (Measure 1) number of peaks?: How many peaks are there in the action <lb/>curves (top curves)? <lb/> • (Measure 2) complete scenario?: Is the following developmental scenario <lb/>matched: first there is a &quot;just looking&quot; peak, then there is a peak <lb/>corresponding to &quot;biting&quot; with the wrong affordances which appears before <lb/>a peak corresponding to &quot;biting&quot; with the right affordances, and there <lb/>is a peak corresponding to &quot;bashing&quot; with the wrong affordances which appears <lb/>before a peak corresponding to &quot;bashing&quot; with the right affordances <lb/>(and the relative order between &quot;biting&quot;-related peaks and &quot;bashing&quot;-related <lb/>peaks is ignored)? Biting with the right affordances is here defined <lb/>as the co-occurrence of a peak in the &quot;biting&quot; curve and a peak in the <lb/> &quot;seeing the biteable object&quot; curve, and biting with the wrong affordances <lb/>is defined as all other situations. The corresponding definition applies to <lb/> &quot;bashing&quot;. <lb/>
			
			<page>51 <lb/></page> 
			
• (Measure 3) nearly complete scenario?: Is the following less constrained <lb/>developmental scenario matched: there is a peak corresponding <lb/>to &quot;biting&quot; with the wrong affordances which appears before a peak <lb/>corresponding to &quot;biting&quot; with the right affordances, and there is a peak <lb/>corresponding to &quot;bashing&quot; with the wrong affordances which appears before <lb/>a peak corresponding to &quot;bashing&quot; with the right affordances (and <lb/>the relative order between &quot;biting&quot;-related peaks and &quot;bashing&quot;-related <lb/>peaks is ignored)? <lb/> • (Measure 4) non-affordant bite before affordant bite?: Is there a <lb/>peak corresponding to &quot;biting&quot; with the wrong affordances which appears <lb/>before a peak corresponding to &quot;biting&quot; with the right affordances? <lb/> • (Measure 5) non-affordant bash before affordant bash?: Is there <lb/>a peak corresponding to &quot;bashing&quot; with the wrong affordances which appears <lb/>before a peak corresponding to &quot;bashing&quot; with the right affordances? <lb/> • (Measure 6) period of systematic successful bite? Does the robot <lb/>succeed in biting systematically and often at some point (i.e. is there a peak <lb/>in the &quot;successful bite&quot; curve)? <lb/> • (Measure 7) period of systematic successful bash? Does the robot <lb/>succeed in bashing systematically and often at some point (i.e. is there a peak <lb/>in the &quot;successful bash&quot; curve)? <lb/> • (Measure 8) bite before bash? Is there a focus on biting which appears <lb/>before a focus on bashing (independently of affordance)? <lb/>
			
			<page>52 <lb/></page> 
			
• (Measure 9) successful bite before successful bash? Is there a focus <lb/>on successful biting which appears before a focus on successful bashing? <lb/>The numerical results of these measures are summarized in table 1. This <lb/>table shows that some structural and statistical regularities do arise in the <lb/>self-organized developmental trajectories. First of all, one has to note that the <lb/>complex and structured trajectory described by Measure 2 appears in 34 percent <lb/>of the cases, which is high given the number of possible co-occurrences of peaks, <lb/>which defines a combinatorial variety of possible trajectories. Furthermore, if we remove <lb/>the test on &quot;just looking&quot;, we see that in the majority of experiments there is <lb/>a systematic sequencing from non-affordant to affordant actions for both biting <lb/>and bashing. This shows an organized and progressive increase in the complexity <lb/>of the behaviour. Another measure confirms this increase of complexity from <lb/>another point of view: if we compare the relative order of appearance of periods <lb/>of focused biting or bashing, then we find that &quot;focused bite&quot; appears in the large <lb/>majority of the cases before &quot;focused bash&quot;, which corresponds to their <lb/>relative intrinsic dimensions (3 for biting and 4 for bashing). Finally, one can <lb/>note that in 100 percent of the experiments the robot reaches a period during <lb/>which it repeatedly manages to bite the biteable object, and in 78 percent of <lb/>the experiments it reaches a period during which it repeatedly manages to bash <lb/>the bashable object. This last point is interesting since the robot was not <lb/>pre-programmed to achieve this particular task. <lb/>
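To make the peak-based measures concrete, the following sketch shows one possible way of computing them; it is hypothetical illustration code, not the analysis scripts actually used for table 1. A peak is taken, as described above, to be a bump in a frequency curve that exceeds a height threshold for a minimal number of consecutive time windows, and an ordering measure such as Measure 4 (non-affordant bite before affordant bite) then reduces to comparing the onsets of the first peaks of two curves. The example curves and thresholds are made-up values, and, for simplicity, the affordant and non-affordant activities are assumed to have already been separated into two curves, whereas in the experiments the affordant case is defined by the co-occurrence of peaks in two curves. <lb/>

```python
# Hypothetical sketch of the peak-based measures; not the authors' analysis code.

def peaks(curve, height=0.5, min_width=3):
    """Return start indices of bumps where the curve stays >= height for >= min_width steps."""
    found, run_start = [], None
    for i, v in enumerate(curve + [0.0]):      # trailing sentinel closes an open run
        if v >= height and run_start is None:
            run_start = i
        elif v < height and run_start is not None:
            if i - run_start >= min_width:     # keep only bumps that are wide enough
                found.append(run_start)
            run_start = None
    return found

def first_peak(curve, **kw):
    """Onset of the first peak, or None if the curve never peaks."""
    p = peaks(curve, **kw)
    return p[0] if p else None

# Made-up per-window frequencies of "biting while not seeing the biteable toy"
# versus "biting while seeing the biteable toy".
bite_non_affordant = [0.1, 0.2, 0.7, 0.8, 0.7, 0.2, 0.1, 0.1, 0.1, 0.1]
bite_affordant     = [0.0, 0.0, 0.1, 0.1, 0.2, 0.6, 0.7, 0.8, 0.2, 0.1]

na, a = first_peak(bite_non_affordant), first_peak(bite_affordant)
measure_4 = na is not None and a is not None and na < a
print("non-affordant bite peak before affordant bite peak:", measure_4)
```

Measures 5, 8 and 9 can be computed in exactly the same way by choosing the appropriate pair of curves. <lb/>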
			
			<page>53 <lb/></page> 
			
These experiments show how the intrinsic motivation system which is implemented <lb/>(IAC) drives the robot into a self-organized developmental trajectory <lb/>in which periods of focused sensorimotor activity of progressively increasing <lb/>complexity arise. We have seen that a number of structural regularities arose in <lb/>the system, such as the tendency for non-affordant behaviour to be explored before <lb/>affordant behaviour, or the tendency to explore a certain kind of behaviour <lb/>(biting) before another kind (bashing). Yet, one also has to stress that these <lb/>regularities are only statistical: two developmental trajectories are never exactly <lb/>the same, and, more importantly, some particular trajectories <lb/>observed in some experiments differ qualitatively from the mean. Figure 9 illustrates <lb/>this point. The figures in the top-left and top-right corners present runs <lb/>which are very typical and correspond to the &quot;complete scenario&quot; described <lb/>by Measure 2. On the contrary, the runs presented in the bottom-left and <lb/>bottom-right corners correspond to atypical results. The experiment whose <lb/>curves are presented in the bottom-left corner shows a case where the focused <lb/>exploration of bashing was performed before the focused exploration of biting. <lb/>Nevertheless, in this case the regularity &quot;non-affordant before affordant&quot; is preserved. <lb/>In the bottom-right corner, we observe a run in which the affordant <lb/>bashing activity appears very early and before any other focused activity. This <lb/>balance between statistical regularities and diversity has parallels in infant sensorimotor <lb/>development [45]: there are some strong structural regularities, but <lb/>from individual to individual there can be substantial differences (e.g. <lb/>some infants learn how to crawl before they can sit and others do the reverse). <lb/>
			
			<page>54 <lb/></page> 
			
8 Discussion <lb/> 8.1 Developing complex behavioural schemas <lb/> We have discussed how to design a source of internal rewards suited for active <lb/>and autonomous development. Such an intrinsic motivation system permits <lb/>an efficient active exploration of a given sensorimotor space. In the <lb/>experiments described, we deliberately considered simple spaces. Enhancing the <lb/>complexity of perception and motor spaces seems crucial if the <lb/>emergence of more complex forms of behaviour is to be expected. However, designing suitable <lb/>spaces that can lead to complex behavioural patterns raises several difficult <lb/>issues. <lb/>A first issue is whether perception and motor spaces should be considered <lb/>as two independent spaces. The intrinsic links that bind perception to action <lb/>have been stressed by many authors. In some circumstances, relevant information <lb/>about a given environment arises from sensorimotor trajectories rather <lb/>than from a simple analysis of perceptual data. Several experiments have shown <lb/>that agents can simplify problems of categorizing situations by actively modifying <lb/>their own position or orientation with respect to the environment or by <lb/>modifying the environment itself. In the same manner, certain environmental <lb/>regularities can be detected only by producing particular stereotyped behaviour <lb/>(e.g. [46, 47]). The fact that perception is fundamentally active naturally leads <lb/>one to consider abstractions like behavioural schemas as relevant units for <lb/>understanding development. <lb/>
			
			<page>55 <lb/></page> 
			
Schemas are famously known as central elements of Piaget&apos;s developmental <lb/>psychology, but the term has also been used in neurology, cognitive psychology <lb/>and motor control ([48] p.36–40), and related notions appeared in artificial <lb/>intelligence under names like frames or scripts [49, 50]. In Piaget&apos;s theory, children&apos;s <lb/>development can be interpreted as the incremental organization of a set <lb/>of schemas. Schemas are skills that serve both for perceiving the environment <lb/>and for acting upon it. Piaget calls assimilation the ability to make sense of a situation <lb/>in terms of a current set of schemas, and accommodation the way in which <lb/>schemas are updated when the expectations based on assimilation are not met. The <lb/>child starts with basic sensorimotor schemas such as suckling, grasping and some <lb/>primary forms of eye-hand coordination. Through accommodation and assimilation, <lb/>new schemas are created, and sets of existing schemas get coordinated. <lb/>The child makes progressively more complex abstract inferences about the environment, <lb/>leading eventually to language and logic, forms of abstract thought <lb/>that are no longer directly grounded in particular sensorimotor situations. The <lb/>whole developmental trajectory can be interpreted as an extension from a simple <lb/>sensorimotor space to an elaborate mental space. The space changes, but the <lb/>fundamental dynamics of accommodation and assimilation that actively drive <lb/>the child&apos;s behaviour remain the same. <lb/>It is important to stress that schemas are primarily functional units. In that <lb/>sense, they are a priori distinct from structural units that can be identified in <lb/>the organization of the organism or the machine that produces the observed <lb/>behaviour. However, many artificial intelligence models make use of internal <lb/>
			
			<page>56 <lb/></page> 
			
explicit schema structures. In such systems, there is a one-to-one mapping <lb/>between these internal structures and the functional operations that the agent <lb/>can perform. For instance, Drescher describes a system inspired by Piaget&apos;s <lb/>theories in which a developing agent explicitly creates, modifies and merges <lb/>schema structures in order to interact with a simple simulated environment <lb/>[51]. Using explicit schema structures has several advantages: such structures <lb/>can be manipulated via symbolic operations, the creation of new skills can easily <lb/>be monitored by following the creation of new schemas, and so on. <lb/>Other systems do not rely on such explicit representations. These are typically <lb/>subsymbolic systems, using continuous representations of their environment. <lb/>Nevertheless, such systems may display some organized forms of behaviour <lb/>in which clear functional units can be identified. Their developmental trajectories <lb/>can also be interpreted as a progressive organization of schemas. For <lb/>instance, the developmental trajectories produced by the typical experiments <lb/>of section 7 can be interpreted as assimilation and accommodation phases. <lb/>In these typical runs, the robot &quot;discovers&quot; the biting and bashing schemas by <lb/>producing repeated sequences of these kinds of behaviour, but initially these actions <lb/>are not systematically oriented towards the biteable or the bashable object. <lb/>This stage corresponds to &quot;assimilation&quot;. It is only later that &quot;accommodation&quot; <lb/>occurs, as biting and bashing start to be associated with their respective appropriate <lb/>contexts of use. Our experiments show that functional organization can <lb/>emerge even in the absence of explicit internal schema structures. However, the <lb/>current limitations of such a system may appear when considering more complex <lb/>
			
			<page>57 <lb/></page> 
			
forms of behavioural organization, such as the formation of hierarchical structures <lb/>and the emergence of goals. <lb/> 8.1.1 Hierarchical organization <lb/> Complex behaviour patterns are hierarchically organized. For instance, a complex <lb/>motor program is often described as an abstract event sequence at a high <lb/>level and as a detailed motor program at a lower level. Therefore, the possibility of <lb/>forming level structures is a key issue. Different authors have already tried <lb/>to tackle the question of how combinations of primitives could be autonomously <lb/>organized into higher-level structures. Option theory offers an interesting mathematical <lb/>framework for addressing the hierarchical organization of systems using explicit schema <lb/>structures [52]. Options are like subroutines associated with closed-loop control <lb/>structures. They can invoke other options as components. Barto, Singh and <lb/>Chentanez have recently illustrated in a simple environment how options could <lb/>be used to develop a hierarchical collection of skills [21]. Hierarchical organization <lb/>of explicit schemas is also illustrated by the work of Drescher, among <lb/>others [51]. But can hierarchically-organized behaviour appear in the absence of <lb/>explicit schemas? Different attempts have been made in this direction. A multiple <lb/>model-based reinforcement learning system capable of decomposing a task based <lb/>on predictability levels was proposed by Doya, Samejima, Katagiri and Kawato <lb/>[53]. Tani and Nolfi presented a system capable of combining local experts using <lb/>gated modules [54]. However, in all these studies explicit level structures <lb/>
			
			<page>58 <lb/></page> 
			
are predetermined by the network architecture. The question of whether hierarchical <lb/>structures can simply self-organize without being explicitly programmed <lb/>remains open. <lb/> 8.1.2 Goal-directedness <lb/> Complex behaviour patterns are also associated with intentionally directed processes. <lb/>This means that they are performed by an agent trying to achieve a <lb/>particular desirable situation that constitutes its aim or goal (e.g. reducing <lb/>hunger, following someone, learning something). The agent&apos;s behaviour reflects <lb/>its intention, that is, the plan of action that the agent chooses for realizing <lb/>this particular goal. This plan includes both the means and the pursued goal <lb/>[55]. Once again, systems using explicit schema structures embed these notions <lb/>of goals and means as explicit symbolic representations. Such explicit goals <lb/>can be created, updated, deleted and, more importantly, easily monitored. This <lb/>has led to numerous systems in classical artificial intelligence, and research in <lb/>this area has strongly influenced the way we consider decision making and <lb/>planning. More recently, research on agent architectures [56] has put a major <lb/>emphasis on the same issues. However, these models do not give much insight <lb/>into the developmental and cognitive mechanisms that lead to the notion <lb/>of intentionally-directed behaviour. Can goals and means simply emerge out of <lb/>subsymbolic dynamics? This is one of the most challenging issues that developmental <lb/>approaches to cognition have to face [57]. To some extent, certain reinforcement <lb/>
			
			<page>59 <lb/></page> 
			
learning models have demonstrated that the organization of behaviour into goals <lb/>and subgoals can be interpreted as an emergent feature resulting from simpler drives <lb/>[37]. But no subsymbolic system currently matches the performance and the <lb/>flexibility of systems using explicit goal-directed schemas. <lb/> 8.1.3 Generalization, transfer, analogy <lb/> Generalization, transfer and analogies between schemas are also thought to be <lb/>central for the emergence of complex behaviour patterns (see [58] for a general <lb/>discussion of the issue of transfer in cognition). Skills do not develop independently <lb/>from one another. The ones that have structural relationships bootstrap <lb/>each other. In particular, processes of analogy and metaphor are crucial <lb/>for transferring know-how developed in sensorimotor contexts to more abstract <lb/>spaces [59]. There is a substantial literature on how to compare explicit schema <lb/>structures (e.g. [60]), but many authors have argued that generalization and <lb/>transfer of skills could also be (maybe even more) efficient in the absence of <lb/>symbolic representations [61]. This debate bears some resemblance to the opposition <lb/>between localist and distributed kinds of representation. Systems with <lb/>explicit schema structures, but also many subsymbolic systems using memories <lb/>organized into local structures (e.g. sets of experts), are called localist. <lb/>In this scheme, learning a new behaviour schema corresponds to the addition <lb/>of a template to an existing set of modules. The independence of the modules <lb/>facilitates incremental learning, as each addition does not cause interference <lb/>
			
			<page>60 <lb/></page> 
			
with the existing memory contents. However, extension to unknown patterns <lb/>must be realized with ad-hoc processes that specify the way similarity should be <lb/>computed. In the same manner, generalization across a large set of local representations <lb/>is intrinsically difficult. On the contrary, in systems with distributed <lb/>representations, behaviour schemas are not assigned to particular modules but <lb/>are memorized in a distributed manner (e.g. as the synaptic weights of a global <lb/>neural network). This means that each schema can only exist in relation to others. <lb/>Self-organized generalization processes are facilitated in such a context [62]. <lb/>Developmental trajectories of intrinsically motivated agents are constrained <lb/>by many factors. We have briefly discussed some of the important issues for <lb/>designing systems capable of developing reusable, goal-directed, hierarchically-organized <lb/>behavioural schemas. Investigating the dynamics resulting from <lb/>intrinsic motivation systems embedded in such more complex spaces <lb/>will be the topic of future research. <lb/> 8.2 Relation to developmental psychology <lb/> Our research takes clear inspiration from developmental psychology, both conceptually <lb/>(the notion of intrinsic motivation originally comes from psychology) <lb/>and methodologically (analysis of development in terms of qualitative sequences <lb/>of different kinds of behavioural patterns). Could our model be interesting <lb/>in return for interpreting the processes underlying infants&apos; development? <lb/>More precisely: <lb/>
			
			<page>61 <lb/></page> 
			
• Can we interpret particular developmental processes as being the result <lb/>of a progress drive, an intrinsic motivation system driving the infant into <lb/>situations expected to result in maximal learning progress? <lb/> • Can operant models of intrinsic motivation provide useful abstractions that <lb/>address the complexity of infants&apos; development? <lb/>Some initial attempts have been made to start answering these questions. <lb/>Building on preliminary experimental results, we discussed in [63] a scenario <lb/>presenting the putative role of the progress drive in the development of <lb/>early imitation. We argue in particular that progress-driven learning could help <lb/>explain why children focus on specific imitative activities at a certain <lb/>age and how they progressively organize preferential interactions with particular <lb/>entities present in their environment. <lb/> 8.2.1 Progress niches <lb/> To facilitate interpretation, we introduced the notion of progress niches to characterize <lb/>the behaviour of our model. The progress drive pushes the agent to discover <lb/>and focus on situations which lead to maximal learning progress. These <lb/>situations, neither too predictable nor too difficult to predict, are &quot;progress <lb/>niches&quot;. Progress niches are not intrinsic properties of the environment. They <lb/>result from a relation between a particular environment, a particular embodiment <lb/>(sensors, actuators, feature detectors and techniques used by the prediction <lb/>algorithms) and a particular time in the developmental history of the <lb/>agent. Once discovered, progress niches progressively disappear as they become <lb/>
			
			<page>62 <lb/></page> 
			
more predictable. The notion of progress niches is related to Vygotsky&apos;s zone of <lb/>proximal development, where the adult deliberately challenges the child&apos;s level <lb/>of understanding. Adults push children to engage in activities beyond their <lb/>current mastery level, but not too far beyond, so that they remain comprehensible <lb/>[64]. We could interpret the zone of proximal development as a set of <lb/>potential progress niches organized by the adult in order to help the child learn. <lb/>But it should be clear that, independently of the adults&apos; efforts, what is and <lb/>what is not a progress niche is ultimately defined from the child&apos;s point of view. <lb/>Progress niches also share similarities with Csikszentmihalyi&apos;s flow experiences <lb/> [8]. Csikszentmihalyi argues that some activities are autotelic when challenges <lb/>are appropriately balanced with the skills required to cope with them (see also <lb/>[65]). We prefer to use the term progress niche, by analogy with ecological niches, <lb/>as we refer to a transient state in the evolution of a complex &quot;ecological&quot; system <lb/>involving the embodied agent and its environment. <lb/> 8.2.2 Self-other distinction <lb/> Using this terminology, the computational model presented in this paper shows <lb/>how an agent can (1) separate its sensorimotor space into zones of different <lb/>predictability levels and (2) choose to focus on the one which leads to maximal <lb/>learning progress, called a &quot;progress niche&quot;. With this kind of operant model, it <lb/>could be speculated that meaningful sensorimotor distinctions (self, others and <lb/>objects in the environment) may be the result of discriminations constructed <lb/>
			
			<page>63 <lb/></page> 
			
during a progress-driven process. We can more specifically offer an interpretation <lb/>of several fundamental stages characterizing infants&apos; development during <lb/>their first year. <lb/> • Stage 1: Like-me stance (0-1m). Simple forms of imitative behaviour <lb/>have been argued to be present just after birth. They could constitute a <lb/>process of early identification. Some totally or partially nativist explanations <lb/>could account for this early &quot;like-me stance&quot; [66, 67]. This would <lb/>suggest the possibility of an early distinction between persons and things. <lb/>If an intermodal mapping facilitating the match between what is seen and <lb/>what is felt exists, the hypothesis of a progress drive would suggest that <lb/>infants will indeed create a discrimination between such easily predictable <lb/>couplings (interaction with peers) and unpredictable situations (all the <lb/>other cases), and that they will focus on the first zone of their sensorimotor <lb/>space, which constitutes a &quot;progress niche&quot;. Neonatal imitation (when <lb/>it occurs) would be the result of the exploitation of the most predictable <lb/>coupling present just after birth. <lb/> • Stage 2: Circular reactions (1-2m). During the first two months of <lb/>their life, infants perform repeated body motions. They kick their legs <lb/>repeatedly, they wave their arms. This process is sometimes referred to as <lb/> &quot;body babbling&quot;. However, nothing indicates that this exploratory behaviour <lb/>is randomly organised. Rochat argues that children are in fact <lb/>performing self-imitation, trying to imitate themselves [68]. This would <lb/>mean that children are structuring their own behaviour in order to make <lb/>
			
			<page>64 <lb/></page> 
			
it more predictable and in this way form &quot;circular reactions&quot; [69, 41]. Such <lb/>self-imitative behaviours can be well explained by the progress drive hypothesis. <lb/>Sensorimotor trajectories directed towards the child&apos;s own body <lb/>can be easily discriminated from trajectories directed towards other people <lb/>by comparing how difficult they are to predict. In many respects, <lb/>making progress in understanding primary circular reactions is easier than <lb/>in the cases involving other agents: self-centered types of behaviour are <lb/> &quot;progress niches&quot;. In such a scenario the &quot;self&quot; emerges as a meaningful <lb/>discrimination for achieving better predictability. Once this distinction is <lb/>made, progress in predicting the effects of self-centered actions can be <lb/>made rapidly. <lb/> • Stage 3: Self-other interactions (2-4m). After two months, infants <lb/>become more attentive to the external world and particularly to people. <lb/>Parental scaffolding plays a critical role in making the interaction with <lb/>the child more predictable [70]. Parents adapt their own responses so that <lb/>interactions with the child follow the normal social rules that characterize <lb/>communicative exchanges (e.g. turn taking). Moreover, if an adult imitates <lb/>an infant&apos;s own actions, it can trigger continued activity in the infant. <lb/>This early imitative behaviour is referred to as &quot;pseudo-imitation&quot; by Piaget <lb/>[71]. Pseudo-imitation and the focus on scaffolded adult behaviour could be <lb/>seen as predictable effects of the progress drive. As the self-centered trajectories <lb/>start to be well mastered (and do not constitute &quot;progress niches&quot; <lb/>anymore), the child&apos;s focus shifts to another branch of the discrimination <lb/>
			
			<page>65 <lb/></page>
			 
tree, the &quot; self-other &quot; zone. <lb/> • Stage 4: Interactions with objects (5-7m). After five months, attention shifts again from people to objects. Children gain increased control over the manipulation of some objects on which they discover &quot; affordances &quot; [72]. Parents recognize this shift and initiate interactions about those affordant objects. However, children do not easily alternate their attention between the object and their caregiver. A progress-driven process can account for this discrimination between affordant objects and unmastered aspects of the environment. Although this stage is typically not seen as imitative, it could be argued that the exploratory process involved in the discovery of object affordances shares several common features with the one involved in self-centered activities: the child structures its world looking for &quot; progress niches &quot;. <lb/> We have to stress that the system discussed in this paper is not meant to re-enact precisely the infant&apos;s developmental sequence, and is not a model of human development. For instance, the Playground Experiment focuses directly on the discovery of object affordances. Yet, in addition to the developmental robotics engineering techniques that it explores, we think that this system, as well as other existing artificial intrinsic motivation systems, can also be used as a &quot; tool for thoughts &quot; in developmental psychology. In that sense, it may help formulate new concepts useful for the interpretation of the dynamics underlying children&apos;s development. For example, the existence of a progress drive could explain why certain types of imitative behaviour are <lb/>
			
			<page>66 <lb/></page> 
			
produced by children at a certain age and cease to be produced later on. It could also explain how the discrimination between actions oriented towards the self, towards others and towards the environment may occur. However, we do not suggest that a drive for maximizing learning progress could be the only motivational principle driving children&apos;s development. The complete picture is likely to include a complex set of drives. Developmental dynamics are certainly the result of the interplay between intrinsic and extrinsic forms of motivation, particular learning biases, as well as embodiment and environmental constraints. We believe that computational and robotic approaches can help specify the contribution of these different components to the overall observed patterns, and shed new light on the particular role played by intrinsic motivation in these complex processes. <lb/> 9 Conclusion <lb/> Intrinsic motivation systems are likely to play a pivotal role in the future of developmental robotics. In this paper, we have presented the relevant background in developmental psychology, neuroscience, and machine learning. We showed that current efforts in the developmental robotics community are approaching the construction of intrinsic motivation systems through the operationalization and implementation of concepts such as &quot; novelty &quot;, &quot; surprise &quot; or, more generally, &quot; curiosity &quot;. We have reviewed some representative works in this direction, trying to classify them into different groups according to the way they operationalized <lb/>
			
			<page>67 <lb/></page> 
			
curiosity. Then we presented an intrinsic motivation system called Intelligent Adaptive Curiosity, which was conceived to drive the development of a robot in continuous, noisy, inhomogeneous environmental and sensorimotor spaces, permitting an autonomous self-organization of behaviour into a developmental trajectory with sequences of increasingly complex behavioural patterns. This was made possible thanks to the way the system evaluates its own learning progress, through the combination of a regional evaluation of the similarity of situations with a smoothing of the error-rate curves associated with each region (an illustrative sketch of this mechanism is given at the end of this section). <lb/> This system was tested in two robotic set-ups. In a first, simple simulated robotic set-up, we showed in detail how the system works and how it provokes both behavioural and cognitive development, by examining the traces of the simulation. This first set-up also showed how IAC can allow a robot to avoid situations which are not learnable by the system, and to engage in situations of progressively increasing complexity in terms of difficulty of learning, which leads to a self-organization of the behaviour. This first set-up also allowed us to show that our intrinsic motivation system could be used efficiently as an active learning algorithm robust in inhomogeneous spaces. Some currently ongoing work suggests that these results still hold in high-dimensional continuous spaces. If this is confirmed, it would allow us to attack real-world learning problems whose inhomogeneity has so far kept them out of reach of standard active learning methods [33]. In a second, real and more complex robotic set-up, we showed how IAC can drive the development of a robot through more than one developmental transition, and thus allows the robot to autonomously generate a <lb/>
			
			<page>68 <lb/></page> 
			
developmental sequence. Doing these experiments was also an opportunity to discuss methodological issues related to the evaluation of a developmental robot. Indeed, classical machine learning methods of evaluation, based on measuring the performance of a system on a given human-defined task, are not suited for developmental robots, since one of their key features is to be task-independent, as advocated by Weng ([34]). We explained that a developmental evaluation should be based on monitoring the evolution of the complexity of the system from different points of view, since complexity is an observer-dependent concept. For example, it is necessary to couple a measure of the evolution of the complexity from the robot&apos;s point of view with the monitoring of its behaviour on a long time scale, using methods inspired by the human sciences and developmental psychology. <lb/> We have also discussed the limits of the system as we presented it in this paper. Indeed, there are two kinds of limitations which will be the subject of future work. On the one hand, we deliberately made the simplification that what the system should optimise is the immediate reward (r(t + 1)). This allowed us to avoid complex reinforcement learning techniques and to limit the biases coming from the action selection procedure, in order to better understand the properties of our learning progress measure. Nevertheless, it will be necessary in the future to use such reinforcement learning techniques, since in the real world progress niches are not always readily accessible, and thus the problem of delayed rewards arises. This extension of our system should certainly be inspired by the work of Barto, Singh and Chentanez ([21]), who have presented <lb/>
			
			<page>69 <lb/></page> 
			
a study which is very complementary to ours, in which they experimented with the use of a complex reinforcement learning technique combined with a simple novelty-based intrinsic motivation system. <lb/> A second kind of limitation which characterizes the current system is the fact that the sensorimotor space is rather simple, in particular from the point of view of representation. It is an open issue to study how forms of representation more complex than scalar vectors, such as schemas for example, could be integrated within the Intelligent Adaptive Curiosity system. One of the potential problems to be solved, if several levels of representation are used, is how one can build measures of learning progress or knowledge gain which are homogeneous and allow the comparison of activities or sensorimotor contexts involving different representations. <lb/> Finally, we have seen that even if the primary goal of the system we presented is to allow the construction of a truly developmental robot, taking inspiration from human development, the system could in return be useful for developmental psychologists as a &quot; tool for thoughts &quot;. Indeed, we explained how it can help to formulate new concepts for the interpretation of the developmental dynamics involved in human infants&apos; development. <lb/>
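To make the learning-progress mechanism recapped in this conclusion more concrete, we give below a minimal, illustrative Python sketch; it is not the implementation used in the experiments reported above. Each region keeps a short history of prediction errors, learning progress is estimated as the smoothed decrease of the regional error rate, and the action with maximal expected immediate progress is chosen greedily, with occasional random exploration. The names Region, WINDOW, EPSILON and the helper region_of (assumed to map a candidate action, in the current sensorimotor context, to the index of the region covering it) are hypothetical placeholders; the prediction machine M and the recursive splitting of regions are omitted. <lb/>

import random
from collections import deque
from statistics import mean

WINDOW = 25        # smoothing window size (hypothetical value)
EPSILON = 0.1      # fraction of purely exploratory actions (hypothetical value)

class Region:
    """One region R_n of the sensorimotor space, with its own error history."""

    def __init__(self):
        # keep only the last 2*WINDOW prediction errors made in this region
        self.errors = deque(maxlen=2 * WINDOW)

    def add_error(self, error):
        self.errors.append(error)

    def learning_progress(self):
        """Smoothed decrease of the error rate: mean error over the older
        half of the window minus mean error over the most recent half."""
        if len(self.errors) == self.errors.maxlen:
            history = list(self.errors)
            return mean(history[:WINDOW]) - mean(history[WINDOW:])
        return 0.0   # not enough samples yet to estimate progress

def select_action(candidate_actions, region_of, regions):
    """Choose the action whose region currently shows maximal learning
    progress (greedy on the immediate estimate), with random exploration."""
    if random.random() > EPSILON:
        return max(candidate_actions,
                   key=lambda a: regions[region_of(a)].learning_progress())
    return random.choice(candidate_actions)

The greedy choice on the immediate estimate corresponds to the r(t + 1) simplification discussed above; replacing it with a full reinforcement learning scheme is the extension needed to cope with delayed rewards and with progress niches that are not readily accessible. <lb/>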
			
		</body>
			
<div type="acknowledgement">10 Acknowledgements <lb/> The authors would like to thank Andrew Whyte, whose help and programming skills were invaluable for conducting the experiments used to test <lb/>

			<page>70 <lb/></page>

intrinsic motivation systems (in particular, he designed the motor primitives used by the robot in the Playground Experiment), as well as Jean-Christophe Baillie for letting us use his URBI system ([73]) for programming the robot, and Luc Steels for relevant comments on this work. This research has been partially supported by the ECAGENTS project funded by the Future and Emerging Technologies programme (IST-FET) of the European Community under EU R&amp;D contract IST-2003-1940. <lb/></div>

			<listBibl> References <lb/> [1] J. Weng, J. McClelland, A. Pentland, O. Sporns, I. Stockman, M. Sur, <lb/>and E. Thelen, &quot; Autonomous mental development by robots and animals, &quot; <lb/> Science, vol. 291, pp. 599–600, 2001. <lb/>[2] M. Lungarella, G. Metta, R. Pfeifer, and G. Sandini, &quot; Developmental <lb/>robotics: A survey, &quot;  Connection Science, vol. 15, no. 4, pp. 151–190, 2003. <lb/>[3] M. Asada, S. Noda, S. Tawaratsumida, and K. Hosoda, &quot; Purposive be-<lb/>havior acquisition on a real robot by vision-based reinforcement learning, &quot; <lb/> Machine Learning, vol. 23, pp. 279–303, 1996. <lb/>[4] J. Elman, &quot; Learning and development in neural networks: The importance <lb/>of starting small, &quot;  Cognition, vol. 48, pp. 71–99, 1993. <lb/>[5] R. White, &quot; Motivation reconsidered: The concept of competence, &quot;  Psycho-<lb/>logical review, vol. 66, pp. 297–333, 1959. <lb/>

			<page>71 <lb/></page>

[6] E. Deci and R. Ryan, Intrinsic Motivation and Self-Determination in Human Behavior. Plenum Press, 1985. <lb/>[7] D. Berlyne, Conflict, Arousal and Curiosity. McGraw-Hill, 1960. <lb/>[8] M. Csikszentmihalyi, Flow: the psychology of optimal experience. Harper Perennial, 1991. <lb/>[9] W. Schultz, P. Dayan, and P. Montague, &quot; A neural substrate of prediction and reward, &quot; Science, vol. 275, pp. 1593–1599, 1997. <lb/>[10] P. Dayan and B. Balleine, &quot; Reward, motivation and reinforcement learning, &quot; Neuron, vol. 36, pp. 285–298, 2002. <lb/>[11] S. Kakade and P. Dayan, &quot; Dopamine: Generalization and bonuses, &quot; Neural Networks, vol. 15, pp. 549–559, 2002. <lb/>[12] J.-C. Horvitz, &quot; Mesolimbocortical and nigrostriatal dopamine responses to salient non-reward events, &quot; Neuroscience, vol. 96, no. 4, pp. 651–656, 2000. <lb/>[13] M. Csikszentmihalyi, Creativity: flow and the psychology of discovery and invention. Harper Perennial, 1996. <lb/>[14] J. Schmidhuber, &quot; Curious model-building control systems, &quot; in Proceedings of the International Joint Conference on Neural Networks, vol. 2. Singapore: IEEE, 1991, pp. 1458–1463. <lb/>[15] S. Thrun, &quot; Exploration in active learning, &quot; in Handbook of Brain Theory and Neural Networks, M. Arbib, Ed. Cambridge, MA: MIT Press, 1995. <lb/>

			<page>72 <lb/></page>

[16] J. Herrmann, K. Pawelzik, and T. Geisel, &quot; Learning predictive representations, &quot; Neurocomputing, vol. 32-33, pp. 785–791, 2000. <lb/>[17] J. Weng, &quot; A theory for mentally developing robots, &quot; in Second International Conference on Development and Learning. IEEE Computer Society Press, 2002. <lb/>[18] X. Huang and J. Weng, &quot; Novelty and reinforcement learning in the value system of developmental robots, &quot; in Proceedings of the 2nd international workshop on Epigenetic Robotics: Modeling cognitive development in robotic systems, C. Prince, Y. Demiris, Y. Marom, H. Kozima, and C. Balkenius, Eds. Lund University Cognitive Studies 94, 2002, pp. 47–55. <lb/>[19] F. Kaplan and P.-Y. Oudeyer, &quot; Motivational principles for visual know-how development, &quot; in Proceedings of the 3rd international workshop on Epigenetic Robotics: Modeling cognitive development in robotic systems, C. Prince, L. Berthouze, H. Kozima, D. Bullock, G. Stojanov, and C. Balkenius, Eds. Lund University Cognitive Studies 101, 2003, pp. 73–80. <lb/>[20] J. Marshall, D. Blank, and L. Meeden, &quot; An emergent framework for self-motivation in developmental robotics, &quot; in Proceedings of the 3rd International Conference on Development and Learning (ICDL 2004), Salk Institute, San Diego, 2004. <lb/>[21] A. Barto, S. Singh, and N. Chentanez, &quot; Intrinsically motivated learning of hierarchical collections of skills, &quot; in Proceedings of the 3rd International <lb/>

			<page> 73 <lb/></page>

Conference on Development and Learning (ICDL 2004), Salk Institute, San Diego, 2004. <lb/>[22] V. Fedorov, Theory of Optimal Experiment. New York, NY: Academic Press, 1972. <lb/>[23] D. Cohn, Z. Ghahramani, and M. Jordan, &quot; Active learning with statistical models, &quot; Journal of Artificial Intelligence Research, vol. 4, pp. 129–145, 1996. <lb/>[24] M. Hasenjager and H. Ritter, Active learning in neural networks, ser. Physica-Verlag Studies In Fuzziness And Soft Computing Series. Physica-Verlag GmbH, 2002, pp. 137–169. <lb/>[25] J. Denzler and C. Brown, &quot; Information theoretic sensor data selection for active object recognition and state estimation, &quot; IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 2, no. 24, pp. 145–157, 2001. <lb/>[26] M. Plutowski and H. White, &quot; Selecting concise training sets from clean data, &quot; IEEE Transactions on Neural Networks, vol. 4, pp. 305–318, 1993. <lb/>[27] T. Watkin and A. Rau, &quot; Selecting examples for perceptrons, &quot; Journal of Physics A: Mathematical and General, vol. 25, pp. 113–121, 1992. <lb/>[28] D. MacKay, &quot; Information-based objective functions for active data selection, &quot; Neural Computation, vol. 4, pp. 590–604, 1992. <lb/>

			<page>74 <lb/></page>

[29] M. Belue, K. Bauer, and D. Ruck, &quot; Selecting optimal experiments for multiple output multi-layer perceptrons, &quot; Neural Computation, vol. 9, pp. 161–183, 1997. <lb/>[30] G. Paas and J. Kindermann, &quot; Bayesian query construction for neural network models, &quot; in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds., vol. 7. MIT Press, 1995, pp. 443–450. <lb/>[31] M. Hasenjager, H. Ritter, and K. Obermayer, Active learning in self-organizing maps. Elsevier, 1999, pp. 57–70. <lb/>[32] D. Cohn, L. Atlas, and R. Ladner, &quot; Improving generalization with active learning, &quot; Machine Learning, vol. 15, no. 2, pp. 201–221, 1994. <lb/>[33] J. Poland and A. Zell, &quot; Different criteria for active learning in neural networks: A comparative study, &quot; in Proceedings of the 10th European Symposium on Artificial Neural Networks, M. Verleysen, Ed., 2002, pp. 119–124. <lb/>[34] J. Weng, &quot; Developmental robotics: Theory and experiments, &quot; International Journal of Humanoid Robotics, vol. 1, no. 2, pp. 199–236, 2004. <lb/>[35] N. Roy and A. McCallum, &quot; Towards optimal active learning through sampling estimation of error reduction, &quot; in Proc. 18th Intl Conf. Machine Learning, 2001. <lb/>[36] R. Collobert and S. Bengio, &quot; SVMTorch: Support vector machines for large-scale regression problems, &quot; Journal of Machine Learning Research, vol. 1, pp. 143–160, 2001. <lb/>

			<page>75 <lb/></page>

[37] R. Sutton and A. Barto, Reinforcement learning: an introduction. Cambridge, MA: MIT Press, 1998. <lb/>[38] C. Watkins and P. Dayan, &quot; Q-learning, &quot; Machine Learning, vol. 8, pp. 279–292, 1992. <lb/>[39] K. Kaneko and I. Tsuda, Complex systems: chaos and beyond. Springer, 2000. <lb/>[40] O. Sporns and T. Pegors, &quot; Information-theoretical aspects of embodied artificial intelligence, &quot; in Embodied artificial intelligence, ser. LNAI 3139, F. Iida, R. Pfeifer, L. Steels, and Y. Kuniyoshi, Eds. Springer, 2003, pp. 74–85. <lb/>[41] J. Piaget, The origins of intelligence in children. New York, NY: Norton, 1952. <lb/>[42] O. Michel, &quot; Webots: Professional mobile robot simulation, &quot; International Journal of Advanced Robotic Systems, vol. 1, no. 1, pp. 39–42, 2004. <lb/>[43] J. Rekimoto and Y. Ayatsuka, &quot; Cybercode: designing augmented reality environments with visual tags, &quot; in Proceedings of DARE 2000 on Designing augmented reality environments, 2000, pp. 1–10. <lb/>[44] S. Schaal, C. Atkeson, and S. Vijayakumar, &quot; Scalable techniques from nonparametric statistics for real-time robot learning, &quot; Applied Intelligence, vol. 17, no. 1, pp. 49–60, 2002. <lb/>

			<page>76 <lb/></page>

[45] E. Thelen and L. B. Smith, A dynamic systems approach to the development of cognition and action. Boston, MA, USA: MIT Press, 1994. <lb/>[46] R. D. Beer, &quot; The dynamics of active categorical perception in an evolved model agent, &quot; Adaptive Behavior, vol. 11, no. 4, pp. 209–243, 2003. <lb/>[47] S. Nolfi and J. Tani, &quot; Extracting regularities in space and time through a cascade of prediction networks, &quot; Connection Science, vol. 11, no. 2, pp. 129–152, 1999. <lb/>[48] M. Arbib, The handbook of brain theory and neural networks. Cambridge, MA: MIT Press, 2003. <lb/>[49] M. Minsky, &quot; A framework for representing knowledge, &quot; in The psychology of computer vision, P. Winston, Ed. New York: McGraw-Hill, 1975, pp. 211–277. <lb/>[50] R. Schank and R. Abelson, Scripts, plans, goals and understanding: An inquiry into human knowledge structures. Hillsdale, NJ: Lawrence Erlbaum Associates, 1977. <lb/>[51] G. L. Drescher, Made-up minds. Cambridge, MA: The MIT Press, 1991. <lb/>[52] R. Sutton, D. Precup, and S. Singh, &quot; Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning, &quot; Artificial Intelligence, vol. 112, pp. 181–211, 1999. <lb/>

			<page>77 <lb/></page>

[53] K. Doya, K. Samejima, K. Katagiri, and M. Kawato, &quot; Multiple model-based reinforcement learning, &quot; Neural Computation, vol. 14, pp. 1347–1369, 2002. <lb/>[54] J. Tani and S. Nolfi, &quot; Learning to perceive the world as articulated: An approach for hierarchical learning in sensory-motor systems, &quot; Neural Networks, vol. 12, pp. 1131–1141, 1999. <lb/>[55] M. Tomasello, M. Carpenter, J. Call, T. Behne, and H. Moll, &quot; Understanding and sharing intentions: the origins of cultural cognition, &quot; Behavioral and Brain Sciences (in press), 2004. <lb/>[56] F. Dignum and R. Conte, &quot; Intentional agents and goal formation, &quot; in LNCS 1365: Proceedings of the 4th International Workshop on Intelligent Agents IV, Agent Theories, Architectures, and Languages. London, UK: Springer-Verlag, 1997, pp. 231–243. <lb/>[57] F. Kaplan and V. Hafner, &quot; The challenges of joint attention, &quot; Interaction Studies, vol. 7, no. 2, pp. 128–134, 2006. <lb/>[58] A. Robins, &quot; Transfer in cognition, &quot; Connection Science, vol. 8, no. 2, pp. 185–204, 1996. <lb/>[59] G. Lakoff and M. Johnson, Philosophy in the flesh: the embodied mind and its challenge to Western thought. Basic Books, 1998. <lb/>[60] D. Gentner, K. Holyoak, and N. Kokinov, The analogical mind: perspectives from cognitive science. MIT Press, 2001. <lb/>

			<page>78 <lb/></page>

[61] L. Pratt and B. Jennings, &quot; A survey of connectionist network reuse through transfer, &quot; Connection Science, vol. 8, no. 2, pp. 163–184, 1996. <lb/>[62] J. Tani, M. Ito, and Y. Sugita, &quot; Self-organization of distributedly represented multiple behavior schema in a mirror system, &quot; Neural Networks, vol. 17, pp. 1273–1289, 2004. <lb/>[63] F. Kaplan and P.-Y. Oudeyer, &quot; The progress-drive hypothesis: an interpretation of early imitation, &quot; in Models and mechanisms of imitation and social learning: Behavioural, social and communication dimensions, K. Dautenhahn and C. Nehaniv, Eds. Cambridge University Press, to appear. <lb/>[64] L. Vygotsky, Mind in society: The development of higher psychological processes. Harvard University Press, 1978. <lb/>[65] L. Steels, &quot; The autotelic principle, &quot; in Embodied Artificial Intelligence, ser. Lecture Notes in AI, F. Iida, R. Pfeifer, L. Steels, and Y. Kuniyoshi, Eds. Berlin: Springer Verlag, 2004, vol. 3139, pp. 231–242. <lb/>[66] A. Meltzoff and A. Gopnik, &quot; The role of imitation in understanding persons and developing a theory of mind, &quot; in Understanding other minds, S. Baron-Cohen, H. Tager-Flusberg, and D. Cohen, Eds. Oxford, England: Oxford University Press, 1993, pp. 335–366. <lb/>[67] C. Moore and V. Corkum, &quot; Social understanding at the end of the first year of life, &quot; Developmental Review, vol. 14, pp. 349–372, 1994. <lb/>

			<page>79 <lb/></page>

[68] P. Rochat, &quot; Ego function of early imitation, &quot; in The Imitative Mind: Development, Evolution and Brain Bases, A. Meltzoff and W. Prinz, Eds. Cambridge University Press, 2002. <lb/>[69] J. Baldwin, Mental development in the child and the race. New York: The Macmillan Company, 1925. <lb/>[70] H. Schaffer, &quot; Early interactive development in studies of mother-infant interaction, &quot; in Proceedings of the Loch Lomond Symposium. New York: Academic Press, 1977, pp. 3–18. <lb/>[71] J. Piaget, Play, dreams and imitation in childhood. New York: Norton Press, 1962. <lb/>[72] J. Gibson, The ecological approach to visual perception. Lawrence Erlbaum Associates, 1986. <lb/>[73] J.-C. Baillie, &quot; URBI: Towards a universal robotic low-level programming language, &quot; in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2005), 2005. <lb/></listBibl>

			<page>80 <lb/></page>

<body> Figure 1: The architecture used in various models of group 2 and group 3: a module KGA monitors the derivative of the prediction errors of M, which is the basis of an evaluation of learning progress. Some systems (group 2) evaluate the learning progress by measuring the decrease of the error rate of M in the recent past, whatever the recent situations were. Other systems (group 3) evaluate the learning progress by measuring the decrease of the error rate of M in situations which are similar, but not necessarily close in time. <lb/>

			<page> 81 <lb/></page>

Figure 2: The sensorimotor space is iteratively and recursively split into sub-spaces, which we call &quot; regions &quot;. Each region Rn is responsible for monitoring the evolution of the error rate in anticipating the consequences of the robot&apos;s actions whose associated contexts are covered by this region. This list of regional error rates is used for learning progress evaluation. <lb/>
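As a complement to this caption, the fragment below sketches, under simplifying assumptions, one way such an iterative and recursive splitting into regions could be organized: a leaf region stores the sensorimotor contexts it covers and, once it holds more than a fixed number of exemplars, splits into two child regions along one cutting dimension. The threshold MAX_EXEMPLARS and the cutting criterion used here (the dimension with the largest spread, cut at its median) are illustrative placeholders rather than the criterion of the actual system; the per-region error-rate lists mentioned in the caption are likewise omitted. <lb/>

from statistics import median

MAX_EXEMPLARS = 50    # hypothetical split threshold

class RegionNode:
    """A node of the region tree; leaves cover a part of the sensorimotor space."""

    def __init__(self):
        self.exemplars = []    # context vectors currently covered by this leaf
        self.cut_dim = None    # splitting dimension, set when the node splits
        self.cut_val = None    # splitting value on that dimension
        self.children = None   # (lower, upper) child regions after a split

    def insert(self, context):
        if self.children is not None:
            self._child_for(context).insert(context)
        else:
            self.exemplars.append(context)
            if len(self.exemplars) > MAX_EXEMPLARS:
                self._split()

    def _child_for(self, context):
        lower, upper = self.children
        return upper if context[self.cut_dim] > self.cut_val else lower

    def _split(self):
        # Simplified criterion: cut the dimension with the largest spread at its median.
        dims = range(len(self.exemplars[0]))
        dim = max(dims, key=lambda d: max(x[d] for x in self.exemplars)
                  - min(x[d] for x in self.exemplars))
        val = median(x[dim] for x in self.exemplars)
        upper = [x for x in self.exemplars if x[dim] > val]
        if not upper or len(upper) == len(self.exemplars):
            return    # degenerate cut: keep this node as a single leaf for now
        self.cut_dim, self.cut_val = dim, val
        self.children = (RegionNode(), RegionNode())
        for x in self.exemplars:
            self._child_for(x).insert(x)
        self.exemplars = []

In the actual system, each leaf region would additionally carry the list of error rates from which its learning progress is evaluated, as described in the caption above. <lb/>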

			<page> 82 <lb/></page>

Figure 3: The robotic set-up: a two-wheeled robot moves in a room, and there is also an intelligent toy (represented by a sphere) which moves according to the sounds that the robot produces. The robot perceives the distance between itself and the toy. The robot tries to predict this distance after performing a given action, which is a setting of (left wheel speed, right wheel speed, sound frequency). It chooses the actions for which it predicts its learning progress will be maximal. <lb/>

			<page> 83 <lb/></page>

Figure 4: Evolution of the percentage of time spent in: 1) situations in which the emitted sounds have a frequency within f3 (continuous line); 2) situations in which the emitted sounds have a frequency within f2 (dotted line); 3) situations in which the emitted sounds have a frequency within f1 (dashed line). <lb/>

			<page> 84 <lb/></page>

Figure 5: Evolution of the successive values of &lt; en(t) &gt; for all the regions Rn constructed by the robot. <lb/>

			<page> 85 <lb/></page>

Figure 6: Evolution of the performance in generalization (mean squared prediction error) in situations in which the frequency of the emitted sound is within f2, respectively for the MAX algorithm (continuous line), the IAC algorithm (long-dashed line) and the RANDOM algorithm (short-dashed line). This allows one to compare how much the robot has learnt about the interesting situations after a given number of performed actions, when it uses a given action selection algorithm. <lb/>

			<page> 86 <lb/></page>

Figure 7: The Playground Experiment set-up (labelled elements: an object that can be bashed, an object that can be bitten, and a tag for visual object recognition). <lb/>

			<page> 87 <lb/></page>

Figure 8: Curves describing a run of the Playground Experiment. Top 3: frequencies of certain action types in windows 100 time steps wide. Mid 3: frequencies of gaze direction towards certain objects in windows 200 time steps wide: &quot; object 1 &quot; refers to the bitable object, and &quot; object 2 &quot; refers to the bashable object. Bottom 3: frequencies of successful bites and successful bashes in windows 200 time steps wide. <lb/>

			<page>88 <lb/></page>

Figure 9: Various runs of the simulated experiments. In the top squares, we observe two typical developmental trajectories corresponding to the &quot; complete scenario &quot; described by measure 1. In the bottom curve, we observe rare but existing developmental trajectories. <lb/>

			<page>89 <lb/></page>

Table 1: Statistical measures on the 200 simulation-based experiments. <lb/>(1) number of peaks? 9.67 <lb/>(2) complete scenario? Yes: 34 %, No: 66 % <lb/>(3) near complete scenario? Yes: 53 %, No: 47 % <lb/>(4) non-affordant bite before affordant bite? Yes: 93 %, No: 7 % <lb/>(5) non-affordant bash before affordant bash? Yes: 57 %, No: 43 % <lb/>(6) period of systematic successful bite? Yes: 100 %, No: 0 % <lb/>(7) period of systematic successful bash? Yes: 78 %, No: 11 % <lb/>(8) bite before bash? Yes: 92 %, No: 8 % <lb/>(9) successful bite before successful bash? Yes: 77 %, No: 23 % <lb/></body>

			<page>90 </page>

	</text>
</tei>