IN ORDER TO BE ABLE TO recognize which representation was best suited for our purpose, we used genetic algorithms in a non-conventional way. Normally a genetic algorithm uses a certain representation to find a solution to a certain problem. In our experiments, however, we used genetic algorithms to test whether a particular representation is suitable. To this end, we provided a number of problems whose solutions were already known in advance and checked how well the genetic algorithm performed on them. Most importantly, the representation we are looking for should lead to a solution for all presented problems. Moreover, the number of generations needed to find a suitable solution should be as low as possible.
During our experiments, which are explained in detail in chapter 11, we compared three different representations that seemed suitable: single-message classifier systems, multiple-message classifier systems, and Petri-nets. Only Petri-nets yielded satisfactory results. Therefore we will now focus on the Petri-net representation of our possible solutions.
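The evaluation procedure can be summarized as a small benchmark harness. The sketch below is illustrative only: it assumes a hypothetical run_ga(representation, problem, max_generations) function that returns the number of generations needed, or None when no solution is found within the generation budget.

# Benchmark harness: judge a representation by running a GA on problems
# whose solutions are known in advance (illustrative sketch).

def evaluate_representation(representation, problems, run_ga, max_generations=500):
    """Return (solved_all, average_generations) for one representation."""
    generations_needed = []
    for problem in problems:
        generations = run_ga(representation, problem, max_generations)
        if generations is None:          # no solution within the budget
            return False, None           # the representation is rejected outright
        generations_needed.append(generations)
    return True, sum(generations_needed) / len(generations_needed)

# A representation is retained only if it solves every benchmark problem;
# among those, the one with the lowest average generation count is preferred.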
The goal of the liveness module is to keep the client component alive in a non-intrusive way: messages dealing with the core functionality of the client component (i.e., logic requests) are simply forwarded by the module to the server component. Synchronization messages, however, can be used by the module to alter the synchronization behavior of the client component. This is visualized in figure 9.2. On the left side of the figure the client component is located. It communicates with the liveness module by means of two interfaces. Every interface is divided in two parts: one part for incoming messages, another part for outgoing messages. The liveness module itself is connected to the remainder of the adaptor by means of one port, again pictured as an incoming and an outgoing part. Internally, the liveness module makes use of two Petri-nets: one Petri-net describing and tracking the behavior of the client component, and a second net describing the behavior of the module itself.
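As an illustration of this division of labor, the sketch below shows the module's dispatch loop; the class and method names (LivenessModule, track, decide, and the message attribute kind) are assumptions made for the example, not part of the actual implementation.

def is_logic_request(message):
    # Assumption: messages carry a 'kind' attribute distinguishing logic
    # requests from synchronization messages.
    return message.kind == "logic"

class LivenessModule:
    def __init__(self, client_port, server_port, client_net, module_net):
        self.client_port = client_port    # interface towards the client component
        self.server_port = server_port    # port towards the remainder of the adaptor
        self.client_net = client_net      # Petri-net tracking the client's behavior
        self.module_net = module_net      # Petri-net describing the module's own behavior

    def handle_client_message(self, message):
        if is_logic_request(message):
            # Core functionality is not touched: simply forward the message.
            self.server_port.send(message)
        else:
            # Synchronization messages may be altered or answered by the module.
            self.client_net.track(message)
            for reply in self.module_net.decide(message):
                self.client_port.send(reply)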
From a more detailed point of view, the liveness module will continuously extend the existing client Petri-net with extra random transitions to introduce new synchronization behavior. The fitness of these random transitions is taken into account automatically by the learning algorithm, and as such only transitions that act correctly will survive. However, one should take care that the newly added transitions do not interfere with the original behavior of the client Petri-net. Without such care, the extended Petri-net may enable some of the original transitions that could never be enabled in the original Petri-net. For example, an extended Petri-net could allow a SetPosition when the position has not been locked yet, which is clearly undesired behavior. In general, the newly added transitions should never invalidate the preconditions of the original transitions.
We will denote the extended Petri-net by $P'$, while the original client Petri-net is denoted by $P$. Obviously, $P$ must be a subnet of $P'$. This means that the places, transitions and arcs of $P$ are contained in those of $P'$, and that $M_0$ coincides with $M_0'$ on the places of $P$, where $M_0$ and $M_0'$ are the initial markings obtained from $P$ and $P'$, respectively.
The requirement that $P'$ does not interfere with $P$ is formally expressed as follows:
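(Here $R(N, M_0)$ denotes the reachability set of a net $N$, as in equation 3.5, and $M'|_P$ the restriction of a marking $M'$ of $P'$ to the places of $P$; the formulation below is one way to capture the requirement.)

\[ \forall M' \in R(P', M_0') \,:\; M'|_P \in R(P, M_0) \]

In words: every marking reachable in the extended net must, when restricted to the original places, also be reachable in the original net, so the added transitions can never drive the original subnet into a marking it could not reach by itself.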
The definition of the multi-set of token elements was given in equation 3.1; the reachability definition was given in equation 3.5.
At runtime, the liveness module knows which markings occur in the original Petri-net $P$. This knowledge allows us to optimize the process of generating random transitions. If a situation occurs in which the original Petri-net fails to take action, the module generates a new transition specifically suited for the corresponding marking. This is called adding new behavior. Otherwise, a random marking from the past is selected and the transitions related to that marking are modified at random. This is called modifying behavior. In both cases, every new transition generated by the module is inserted in the extended Petri-net $P'$.
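The decision between the two modes of operation can be sketched as follows; the method and attribute names on the hypothetical liveness_module object (random_transition_for, history, transitions_for, mutate_outputs, extended_net) are illustrative only.

import random

def generate_transition(liveness_module, current_marking, stuck):
    """Insert a new random transition into the extended net P'.

    'stuck' is true when the original net fails to take any action
    for the current marking.
    """
    if stuck:
        # Add new behavior: tailor a fresh transition to the current marking.
        transition = liveness_module.random_transition_for(current_marking)
    else:
        # Modify behavior: pick a marking seen in the past and rework
        # one of the transitions that was generated for it.
        past_marking = random.choice(liveness_module.history)
        old = random.choice(liveness_module.transitions_for(past_marking))
        transition = liveness_module.mutate_outputs(old)
    liveness_module.extended_net.insert(transition)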
To add new behavior, we need to analyze the current (runtime) situation. This is done by retrieving the current marking and removing all source-places. The first reason for this is that a source-place can never be part of the state of a component, because it represents an incoming message. The second reason for ignoring the source-places is that every source-token is automatically transferred to a sink (or pseudo-sink) place. So if we need the information that a token is waiting, we can just as well look at the out-place instead of the in-place.
Removing all source-places yields a new marking that represents the (expected) state of the client component together with the messages that cannot be handled (these reside in the pseudo-sink places). From this new marking the random generator selects two places that will form the inputs of the new transition: one of them must belong to the original Petri-net, the other one must be a pseudo-sink place (illustrated in figure 9.3). To guarantee the non-interference requirement (equation 9.1), new transitions must restore all consumed tokens to their original places in the original Petri-net. Determining possible output places for the new transitions is also delicate: in the same way that we cannot allow a token to be removed from the original Petri-net, we cannot add new tokens to it. Therefore, only pseudo-source places can be used as output places. Of those, the only suitable candidates are the ones that enable a transition sending back a synchronization message to the client component.
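The construction of such a candidate transition can be summarized as in the sketch below; the marking is assumed to be a dictionary from places to token counts, and the predicate enables_sync_reply on pseudo-source places is a name invented for the example.

import random

def new_random_transition(marking, original_places, pseudo_sinks, pseudo_sources):
    """Build a candidate transition respecting the non-interference requirement."""
    # Inputs: one place of the original net and one pseudo-sink place,
    # both marked in the (source-free) current marking.  We assume such
    # candidates exist when this function is called.
    marked = [p for p in marking if marking[p] > 0]
    original_input = random.choice([p for p in marked if p in original_places])
    pseudo_input = random.choice([p for p in marked if p in pseudo_sinks])

    # Outputs: the consumed original token is put back where it came from,
    # so the original net never loses (or gains) tokens ...
    outputs = [original_input]
    # ... and a pseudo-source place is chosen that enables a transition
    # sending a synchronization message back to the client component.
    outputs.append(random.choice([p for p in pseudo_sources if p.enables_sync_reply]))

    return {"inputs": [original_input, pseudo_input], "outputs": outputs}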
The expressions placed on the arcs that connect the input places to the transition, and the transition to the output places, are mainly based on the available type information. This idea comes from [Mon93,HWSS95], who argue that type information is an advantage when randomly generating syntax trees.
To modify the behavior of an existing transition we simply keep the current input arcs but modify the output arcs and expressions in the same way as above. The original transition is then removed and replaced by the freshly created one.
The representation given in this section is the best representation we found for the solutions to our liveness problem. However, this representation now needs to be combined with an on-line learning algorithm. Such algorithms typically require feedback, and we have not yet explained how this feedback is obtained. Therefore we now turn our attention to the feedback problem.
THE PROBLEM WITH LIVENESS IS, as explained in section 7.2.3, that it is difficult to verify. Liveness can be defined in the formal Petri-net model, but this would not be a good measure because we do not actually know what the goals of our components are. Therefore we need some feedback from the component that quantifies its liveness.
A technique suitable to do so is check-pointing. A checkpoint is a static place in the code which gives rise to a reward when the code at that line is executed. Algorithm 25 illustrates this. By counting the occurrences of every checkpoint we can deduce a fitness measure. This fitness measure is specific to the component under investigation. Checkpoint (1a) will normally be reached only once. Checkpoints (1b) and (1c) will occur multiple times and should correlate in some way, because every point removed must once have been ours. Checkpoint (2) marks a situation we do not want to encounter too often, hence when obtaining a quantitative measure we reward it less than the other checkpoints. The green checkpoints (3a, 3b, 3c) measure how many lock operations are issued. By correlating the green checkpoints with the blue checkpoints we also obtain a measure for the underlying concurrency interface. This approach has a number of advantages and disadvantages.
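The mechanism itself amounts to little more than counting how often each marked line is reached. A minimal sketch is given below; the checkpoint identifiers mirror the ones in algorithm 25, and the weights are illustrative assumptions rather than the values actually used.

from collections import Counter

checkpoint_counts = Counter()

def checkpoint(identifier):
    """Called at statically placed positions in the component's code."""
    checkpoint_counts[identifier] += 1

# Illustrative weights: the undesired situation (2) is rewarded less
# than the ordinary checkpoints.
CHECKPOINT_WEIGHTS = {"1a": 1.0, "1b": 1.0, "1c": 1.0, "2": 0.1,
                      "3a": 1.0, "3b": 1.0, "3c": 1.0}

def fitness():
    """Derive a component-specific fitness measure from the counts."""
    return sum(CHECKPOINT_WEIGHTS.get(cp, 0.0) * n
               for cp, n in checkpoint_counts.items())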
In this section we focus on an important property of checkpoints and Petri-nets, namely that the checkpoints reached are statistically correlated with the marking of the Petri-net. We will need this property later on; although it does not hold in all situations, we will assume that in most practical situations such a correlation exists.
We want to show that there is a statistical correlation between the state of the Petri-net and the checkpoints reached. So, if we know the marking of the Petri-net we should be able to give a probability that a certain checkpoint will be reached. It is clear that this question can be answered exactly if we have all information available: if the entire state of the component were available, we could say with probability 1 which execution path will be taken, simply by simulating the execution of the program.
However, if we do not have all information available, this might not be possible. The information contained within the Petri-net is only enough to describe the concurrency strategy of the component. It does not contain a full internal description of the state of the component, hence we might in general not be able to link the two statistically. Typically, checkpoints that cannot be statistically linked to the Petri-nets involved make use of some extra information not available to the Petri-net. This information concerns either the concurrency strategy or some internal logic of the program. If the unknown information is part of the concurrency strategy, it should have been described in the Petri-net. If it is not, the information will still be correlated with the concurrency strategy, because the checkpoints are specifically placed at positions that are important with respect to the concurrency strategy. Therefore we will from now on assume that the checkpoints are statistically correlated with the marking of the Petri-net:
the checkpoints document the concurrency strategy in such a way that a statistical correlation exists between the checkpoints reached and the marking of the Petri-net.
In this section we have explained how rewards can be created by placing checkpoints in the component's source code. This is easy to implement and allows us to offer a reinforcement learning algorithm the necessary feedback.
THE PROBLEM OF KEEPING A COMPONENT ALIVE by selecting the correct action in a certain situation seems well suited for a reinforcement learning approach as described in section 4.4. The mapping of our problem to a reinforcement learning problem requires the definition of states $s$, rewards $r$ and actions $a$. We define the state to be the marking of the Petri-net adaptor. The actions a learner can take are the transitions within the net. However, some of the transitions are under the control of the learner while others are executed automatically by the underlying component. If an executed action is not controlled by the learner, it behaves exactly as if the learner had chosen it.
The rewards for the learner come from the underlying component and are assigned at the moment a checkpoint in the component's code is reached. Because a reinforcement learning approach requires a reward signal at every time-step, we define the reward to be zero unless the underlying component specifies otherwise. The underlying component, however, sends the reward for the last action back as an asynchronous message. This means that the learner must wait until a new message arrives before it can assign a reward of zero.
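The resulting reward signal can be produced as sketched below: checkpoint rewards arrive asynchronously, and a step whose reward has not been reported defaults to zero. The RewardChannel class and its method names are assumptions made for the sketch.

class RewardChannel:
    """Collects asynchronous reward messages sent by the underlying component."""

    def __init__(self):
        self.pending = []                      # rewards not yet consumed

    def post_reward(self, amount):
        self.pending.append(amount)            # called when a checkpoint fires

    def reward_for_last_action(self):
        # A step is rewarded only if the component reported something since
        # the action was taken; otherwise the reward defaults to zero.  In the
        # real module this decision is deferred until the component's next
        # message arrives.
        return self.pending.pop(0) if self.pending else 0.0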
This definition of our liveness problem as a reinforcement learning problem is straightforward; however, before we can be sure that it is a valid reinforcement learning problem we need to verify that it constitutes a Markov decision process.
We will now show that the stated liveness problem constitutes a Markov decision process. A Markov decision process is a process in which all necessary information from the past that has led to the current situation is retained in such a way that a decision can be taken solely on the basis of the current situation. Formally, $\Pr\{s_{t+1}=s', r_{t+1}=r\}$ should be known based only on the current state $s_t$ and the current action $a_t$.
To demonstrate this we show that the current marking of the Petri-net and the transition currently selected within the net define a) the next state and b) the reward that is returned. Proving a) is easy because the Petri-net describes the concurrency behavior of the underlying component exactly and consistently. Exact means that the next state is entirely dependent on the current state and the fired transition, as defined by the Petri-net firing rule. Consistent means that the underlying component will act upon what is described in the Petri-net. Formally, the firing rule assigns to every marking $s_t$ and enabled transition $a_t$ a unique successor marking. So $\Pr\{s_{t+1}=s' \mid s_t, a_t\}$ depends on nothing but the current state and the current action.
Showing b), that the reward depends on the current state and the chosen action, is in general more difficult because the underlying component can make use of memory which is not specified in the Petri-net and which is thus not visible to the learner. However, if we go back to the assumption that the checkpoints are correlated with the marking of the Petri-net, then we can conclude that a reward is immediately dependent on the currently selected action and the current state of the Petri-net. Therefore we conclude that, under the stated assumption, the given problem is Markov.
Finding out which branch of reinforcement learning algorithms is useful (Monte Carlo methods, temporal difference learning, dynamic programming) highly depends on the nature of the problem. Our liveness problem is a continuing task because we cannot require the underlying component (and all the components it is communicating with) to reset their behavior during execution. This prohibits us from using techniques such as Monte Carlo methods [SAG98]. The liveness problem also has no model available to foresee when a reward will be assigned in the future, therefore dynamic programming techniques [SAG98] are also of little interest to us. As such, we will investigate the use of temporal difference learning. A temporal difference learning algorithm approaches the problem of estimating the $V$ and $Q$ value-functions incrementally. Every time a certain action is selected, the resulting state $s'$ (or state-action couple $(s',a')$) passes a certain amount of reward back to the current state $s$ (or state-action couple $(s,a)$). The next state is responsible for recreating its amount of accumulated reward by choosing an appropriate action. As such, rewards are propagated backwards; if future rewards are accessed multiple times, the current situation will always profit from them. Below we investigate how the well-known learning technique Q-learning can be applied.
Q-learning in general works as follows:
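For reference, the standard one-step Q-learning update, with learning rate $\alpha$ and discount factor $\gamma$, is

\[ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \]

At every step the learner selects an action (for instance $\epsilon$-greedily with respect to $Q$), executes it, observes the reward and the next state, and applies this update.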
The problem with implementing Q-learning is that one needs to keep a memory of $Q(s,a)$ for every state-action couple. In our case this is not feasible because the number of possible markings is very large: Petri-nets can express concurrent processes in a compact way, hence even for small Petri-nets the number of markings can be very large. However, we will argue that a straightforward state-space compaction is possible by only assigning strengths to transitions enabled under a certain marking. This is a consistent choice that respects the structure of the problem involved. Below we first assume that we have an elementary Petri-net describing the behavior of an interface (see section 3.4.3). Afterward we will see how this naturally extends to colored Petri-nets.
When working with a reinforcement learning algorithm we have to define how ``good'' it is to be in a certain state and how ``good'' it is to select a certain action in a certain state. These two functions are called the state value function and the state-action value function. For our Petri-net the state value function is easily defined as the best future reward obtainable by any action enabled under that state, hence we define

\[ V(M) = \max_{t \text{ enabled under } M} Q(M, t) \]
This definition is useful because it states that the maximum reward that can be obtained from state $M$ is either zero (when no transition is enabled) or the maximum of the reward to be expected over all transitions enabled under $M$. The state-action value function can be deduced immediately from this definition, because the execution of a transition $t$ will always result in a state $M'$ for which we know the expected future reward through the state value function. Hence,

\[ Q(M, t) = r + \gamma\, V(M') \]

where $M'$ is the marking reached by firing $t$ in $M$, $r$ is the immediate reward and $\gamma$ the discount factor.
In the above two mutually dependent definitions, a marking must be remembered for every possible action. This is expensive because the number of possible markings can be very large. Therefore, we will now define an expected reward based solely on the messages being sent between the components. In the definition given below, $s(t)$ denotes the strength of transition $t$; we simply take

\[ Q(M, t) = s(t) \quad \text{for every marking } M \text{ that enables } t. \]
Implementing the Q-learning algorithm on top of this approximation is straightforward.
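A minimal sketch of one such update step is given below, assuming a net object with an enabled(marking) method returning the transitions enabled under a marking and a fire(marking, transition) method returning the successor marking; these names, the dictionary used for the strengths, the reuse of the RewardChannel sketched earlier, and the epsilon-greedy action selection are all assumptions of the example.

import random

def q_learning_step(net, marking, strength, reward_channel,
                    alpha=0.1, gamma=0.9, epsilon=0.1):
    """One Q-learning step in which Q(M, t) is stored as a single strength s(t)."""
    enabled = net.enabled(marking)
    if not enabled:
        return marking                           # nothing can fire under this marking

    # Epsilon-greedy action selection on the transition strengths.
    if random.random() < epsilon:
        transition = random.choice(enabled)
    else:
        transition = max(enabled, key=lambda t: strength.get(t, 0.0))

    next_marking = net.fire(marking, transition)
    reward = reward_channel.reward_for_last_action()

    # V(M') is the best strength among the transitions enabled under M'.
    best_next = max((strength.get(t, 0.0) for t in net.enabled(next_marking)),
                    default=0.0)

    # The single strength stored in the transition plays the role of Q(M, t).
    current = strength.get(transition, 0.0)
    strength[transition] = current + alpha * (reward + gamma * best_next - current)
    return next_marking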
With this definition we only have to store one strength per transition; we do not have to remember a possible reward for every $(M, t)$ couple. We will now explain why this high compaction ratio very often poses no problem. Given our definition of $Q$ and the assumption that every update to $s(t)$ aims to better approximate the possible future reward, we will argue that $s(t)$ indeed represents the expected future reward.
Although a seemingly trivial statement, the real problem comes from the fact that $s$ only stores one strength for every transition $t$ and not for every couple $(M, t)$. So an update to $Q(M_1, t_1)$ will very likely also modify $Q(M_2, t_2)$, which does not necessarily mean that an update to $Q(M_2, t_2)$ is appropriate. To demonstrate that an update to $Q(M_1, t_1)$ will never result in an inappropriate update of another couple, we show that when $t_1 \neq t_2$ the update does not interfere, and that when $t_1 = t_2$ the update does interfere, but the result is still what one would expect. In the case that $t_1 \neq t_2$, proving non-interference is trivial because $Q(M_1, t_1) = s(t_1)$ and this strength is stored in the transition itself. So the update does not modify $s(t_2)$, and thus does not modify any $Q(M_2, t_2)$. In the other case, $t_1 = t_2 = t$, an update on $Q(M_1, t)$ can only occur when $t$ was enabled under $M_1$; a modification of $Q(M_2, t)$ therefore only matters when $t$ is also enabled under $M_2$. If that is the case, the interference is not a problem because the expected reward under marking $M_1$ should also become the expected reward of $(M_2, t)$: firing $t$ is a means by which this possibly higher reward can be obtained. With this we have shown that our definition of $Q$ will always indicate what kind of reward we can expect in the future. Below we give an example of a situation where $M_1 \neq M_2$ and $t_1 = t_2$.
An example of such a situation is pictured in figure 9.4. We consider the case where two different markings both enable the same transition $t$. In the top left part of the figure we see marking $M_1$; in the top right part we see the same Petri-net with a current marking $M_2$. Both markings enable transition $t$. Marking $M_1$ receives a reward, but marking $M_2$ does not. In this case, after choosing action $t$ (executing the transition), a reward will be assigned to $s(t)$ in only one of the two cases. This might indicate that an interference between $Q(M_1, t)$ and $Q(M_2, t)$ is in place. However, as explained earlier, the Petri-net marking is correlated with the rewards received, so we can assume that the reason why a reward is assigned for $M_1$ is that a certain place $p$ contains a token in $M_1$, while it does not contain a token in $M_2$ (in this particular example). $s(t)$ will not preserve this information: it will not link state $M_1$ to a positive reward and state $M_2$ to no reward, but will instead oscillate around a certain value. The Q-learning algorithm, however, back-propagates rewards to any transition fired before. Hence, after a learning period, the transition that deposits a token in place $p$ will carry a stable strength, indicating the necessity of a token at place $p$. This illustrates how the necessary information is preserved.
The oscillation that occurs on transition $t$ comes from the fact that the compression of the state space (which in its uncompressed form actually constitutes a Markov decision process) loses too much information, so the compressed problem no longer represents a Markov decision process. To be able to handle these unwanted situations, together with the need to handle specific symbolic actions (such as: when [X, Y] arrives, send back [Y-1, X-1]), we have combined the above reinforcement learning algorithm with the situation recognizer described in section 9.1.
In the case of colored Petri-nets, the above definition of $Q$ loses even more information. For instance, it is possible to have a set of differently colored tokens that all enable the same transition in different ways; a colored Petri-net typically collapses such states when appropriate (as observed in the experiments in section 11.5). Our learning algorithm can make use of this structural information: instead of learning what to do for every individual position separately, the learner will learn what to do for a whole class of positions at once. To do so, the learner adds new transitions every once in a while, depending on the policy. Furthermore, when necessary, the loss of information in $Q$ can be compensated by creating new transitions that recognize new situations. Once such a transition is added, its strength will increase or decrease depending on its suitability.
TO VALIDATE WHETHER THE COMBINATION of Petri-net transitions, reinforcement learning and statically placed checkpoints works in practice, we conducted a number of experiments. Every experiment is based on the conflicts enumerated in chapter 6.
We now describe the behavior of our reinforcement learning algorithm by looking at some of the more interesting experiments. The experiments apply a learning rate of 0.1 and a discount factor $\gamma$. Every experiment is summarized in a figure that plots the value (strength) of a number of transitions over time. Not all transitions involved in every liveness module are presented, only those that helped in implementing a correct behavior. For every transition the Petri-net code is shown in the form of dynamic transitions, together with the number of times the transition has fired during the experiment. Every line in the graph has its own behavior and specifies a transition that can only be fired in certain circumstances. For instance, in figure 9.7 we see two lines: the red line (dynamic transition 0) is the locking behavior, the green line (dynamic transition 1) is the unlocking behavior. Now, some observations can be made:
The approach we have presented in this chapter uses a reinforcement learning technique as a means to solve the liveness problem. Because we cannot rely on the formal liveness definition, we introduced checkpoints as a means for the developer to specify which actions are favored. The technique presented here could also be used to learn how an adaptor can reach a certain state. To do so, a reward could be assigned at the moment a change in marking brings the net closer to the target marking; if the change moves it further away from the target marking, a negative reward (punishment) should be given. Implementing such a technique would be quite similar and it would allow us to bypass the concurrency strategy of a server component without the need to formally analyze the Petri-net.
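Under this alternative use, the checkpoint rewards would be replaced by a shaping signal derived from a distance between the current and the target marking. A simple sketch is given below, with the distance taken as the number of places whose token counts differ; both the distance measure and the dictionary representation of markings are assumptions of the example.

def marking_distance(marking, target):
    """Number of places whose token count differs from the target marking."""
    places = set(marking) | set(target)
    return sum(marking.get(p, 0) != target.get(p, 0) for p in places)

def shaping_reward(previous_marking, current_marking, target_marking):
    """Positive when the net moved closer to the target, negative otherwise."""
    before = marking_distance(previous_marking, target_marking)
    after = marking_distance(current_marking, target_marking)
    return float(before - after)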
The cautious reader may notice that there are a lot of similarities between the Petri-net transitions we use and classifier systems. Indeed: the messages in a classifier system can be compared to the tokens present in the marking of a Petri-net system. In that case, the strengths we assign to transitions play a role similar to the strengths of classifiers, which would suggest the bucket brigade algorithm (BBA) as a natural candidate to update them.
Nevertheless, there are some major differences which make the BBA useless in our case. The first is the fact that the bucket brigade algorithm needs to know how specific a certain rule is: depending on the specificity, a higher bid can be placed. This is something that cannot easily be transposed to Petri-nets. A second limitation of the BBA is that delayed rewards are implemented by means of a mean future reward. This effectively means that a possibly very good solution can be hidden behind a lot of bad alternatives. A third problem with the BBA is the initial bias which is often present when introducing new rules. Evolutionarily speaking, new rules can never prove themselves in an environment where systematically the worst rules, regardless of their age, are removed first. This makes balancing the exploration/exploitation trade-off very difficult.
IN THIS CHAPTER WE HAVE EXPLAINED how we can construct a liveness module. First, we explained which representation we use to recognize new situations; second, we explained how rewards can be obtained from the underlying component; and third, we explained how Q-learning can be applied to the given problem. In doing so, we developed a method to minimize the storage requirements of the value function needed for Q-learning: thanks to the locality property of Petri-nets we only need to store one strength $s(t)$ per transition instead of a value $Q(M, t)$ for every (marking, transition) couple, and by introducing state-recognizers as new transitions within the Petri-net we are able to exploit structural properties of the colored Petri-net involved.