9. Module 2: The Liveness Module

Figure 9.1: The keep-alive adaptor

IN CHAPTER WE have explained that we will construct an adaptor based on three requirements. Every requirement will be satisfied by one module. The enforce-action (explained in chapter 8) module allows the adaptor to satisfy the no-conflict requirement. The liveness module, presented in this chapter, will learn how to keep a client component (i.e., the one that requires a concurrency interface of the server component) alive. This is necessary to avoid concurrency adaptors from returning messages to the client that do not allow the client to proceed. E.g.: always returning lockFalse. To achieve this, the module makes use of a reinforcement learning algorithm that learns which actions are suitable in a certain situation. More specifically, actions are indirectly rewarded by the client component when suitable situations arise. How the rewards are assigned will be explained in section 9.2. The reinforcement learning algorithm will be linked together with a situation recognition system that allows for the storage of the necessary information to learn what to do. This situation recognition system will be Petri-net based. By constantly adding new random transitions and letting the reinforcement learning algorithm explore these new avenues, the correct rules will survive. This will be explained in section 9.1. Finally, in section 9.3, we show how everything maps to a $Q$-learning algorithm.

9.1 A Petri-Net Representation

IN ORDER TO BE ABLE TO recognize which representation was best suited for our purpose we use genetic algorithms in a non-conventional way. Normally a genetic algorithm uses a certain representation to find a solution to a certain problem. However, in our experiments we used the genetic algorithms to test whether a particular representation is suited. To this end, we provided a number of problems the solutions of which were already known in advance. We checked how well the genetic algorithm performed on these problems. Most importantly, the representation we are looking for should lead to a solution for all presented problems. Moreover, the number of generations needed to find a suitable solution should be as low as possible.

During our experiments, that are explained in detail in chapter 11, we compared three different representations that seemed to be suitable: single-message classifier systems, multiple-message classifier systems, and Petri-nets. Only Petri-nets yielded satisfactory results. Therefore we will now focus on the Petri-net representation of our possible solutions.

9.1.1 Requirements for a Petri-net Representation

Figure 9.2: Schematic of the interconnection between the component, the adaptor ports and the underlying Petri-nets.

The goal of the liveness module is to keep the client component alive in a non-intrusive way: messages dealing with the core functionality of the client component (i.e., logic requests) are simply forwarded by the module to the server component. Synchronization messages, however, can be used by the module to alter the synchronization behavior of the client component. This is visualized in figure 9.2: On the left side of the figure is a client component located. It communicates with the liveness module by means of two interfaces, one $logic$ interface, another $synchro$ interface. Every interface has been divided in two parts. One part for incoming messages, another part for outgoing messages. The liveness module itself is connected to the remainder of the adaptor by means of one $logic$ port, again pictured as an incoming and outgoing part. Internally, the liveness module will make use of two Petri-nets. One Petri-net describing and tracking the behavior of the client component, and a second net describing the behavior of the module.

From a more detailed point of view, the liveness module will continuously extend the existing client Petri-net with extra random transitions to introduce new synchronization behavior. The fitness of these random transitions will be taken into account automatically by the learning algorithm and as such only transitions that act correctly, will survive. However, one should take care that the newly added transitions do not interfere with the original behavior of the client Petri-net. For example, the extended Petri-net may enable some of the original transitions that could never be enabled in the original Petri-net. An extended Petri-net could allow a SetPosition when the position has not been locked yet. This is clearly undesired behavior. In general, the newly added transitions should never invalidate the preconditions of the original transitions

We will denote the extended Petri-net as $\pi_{e}=(\Sigma,P_{e},T_{e},A_{e},C_{e},G_{e},E_{e},I_{e})$, while the original client Petri-net is denoted by $\pi_{o}=(\Sigma,P_{o},T_{o},A_{o},C_{o},G_{o},E_{o},I_{o})$. Obviously, $\pi_{o}$ must be a subnet of $\pi_{e}$. This means that $P_{o}\subseteq P_{e},  T_{o}\subseteq T_{e},  A_{o}\subseteq A_{e},  C_{o}\subseteq C_{e},  G_{o}\subseteq G_{e},  E_{o}\subseteq E_{e}$, and $M_{o}\subseteq M_{e}$ where $M_{o}$ and $M_{e}$ are the initial markings obtained from $I_{o}$ and $I_{e}$, respectively.

The requirement that $\pi_{e}$ does not interfere with $\pi_{o}$ is formally expressed as follows:

\not\exists M \textrm{reachable from}  M_{e} \textrm{unde...
...TE_{MS}^{\pi_{o}}  reachable  from  M_{o}  under \pi_{o})
\end{displaymath} (9.1)

The definition of the multi-set of token elements $TE_{MS}^{\pi_{o}}$ was given on page [*] (equation 3.1). The reachability definition was given on page [*] (equation 3.5).

9.1.2 Runtime Creating of Transitions

Figure 9.3: How new transitions will be created. The big cloud at the right is the original Petri-net. The new transition is part of the extended Petri-net.

At runtime, the liveness module knows which markings occur in the original Petri-net $\pi_{o}$. This knowledge allows us to optimize the process of generating random transitions. If a situation occurs in which the original Petri-net fails to take action, the module generates a new transition specifically suited for the corresponding marking. This is called adding new behavior. Otherwise, a random marking from the past is selected and the transitions related to that marking are modified at random. This is called modifying behavior. In both cases, every new transition generated by the module is inserted in the extended Petri-net $\pi_{e}$.

To add new behavior, we need to analyze the current (runtime) situation. This is done by retrieving the current marking and removing all source-places. The reason behind this is that a source-place can never be part of the state of a component, because it represents an incoming message. The second reason for ignoring the source-places is that every source-token is automatically transferred to a sink (or pseudo-sink) place. So if we need the information that a token is waiting we can as well look at the out-place instead of the in-places.

Removing all source-places yields a new marking that represents the (expected) state of the client component and the messages that cannot be handled (these reside in the pseudo-sink places). From this new marking the random generator selects two places that will form the inputs for the new transition. One of both places must belong to the original Petri-net, the other one must be a pseudo-sink place. (Illustrated in figure 9.3). To guarantee the non interference requirement (equation 9.1), new transitions must restore all consumed tokens back to their original places in the original Petri-net. Determining possible output places for the new transitions is also delicate. In the same way that we cannot allow a token to be removed from the original Petri-net, we cannot add new tokens to it. Therefore, only pseudo-source places can be used as output places. Of those, the only suitable candidates are those that enable a transition that sends back a synchronization message to the client component.

The expressions placed on the arcs that connect the input places to the transition and the transition to the output places is mainly based on the available type information. This idea comes from [Mon93,HWSS95], who argues that type information is an advantage when faced with random generated syntax trees.

To modify the behavior of an existing transition we simply keep the current input-arcs but modify the output arcs and expressions in the same way as above. The original transition is removed and replaced by the freshly created transition.

The representation given in this section is the best representation we found to represent solutions for our liveness problem. However, this representation now needs to be combined with an on-line learning algorithm. Such algorithms typically require feedback, for which we did not yet explain how we would obtain it. Therefore we will now turn our attention to the feedback problem.

9.2 Runtime Feedback

9.2.1 Check-pointing the Component's Source

% latex2html id marker 4784
...s to be reached
otherwise the component is not tested entirely.}

THE PROBLEM WITH LIVENESS IS, as explained in section 7.2.3, that it is difficult to verify. In the formal Petri-net model liveness can be defined, but this wouldn't turn out to be a good measure because we don't actually know what the goal(s) of our components are. Therefore we need some feedback from the component that quantifies its liveliness.

A technique suitable to do so is check-pointing. A checkpoint is a static place in the code which will give rise to a reward when the code at that line is executed. Algorithm 25 illustrates this. By counting the occurrences of every checkpoint we can deduce a fitness measure. This fitness measure is specific for the component under investigation. Checkpoint (1a) will normally be reached only once. Checkpoints (1b) and (1c) will occur multiple times and should correlate in some way, because every point removed must have been once ours. Checkpoint (2) is a position which is a situation we don't want to encounter too much, hence when obtaining a quantitative measure we will reward this less than the other checkpoints. The green checkpoints (3a,b,c) measure how many lock operations are issued. By correlating the green checkpoints with the blue checkpoints we also have a measure for the underlying concurrency interface. This approach has a number of advantages and disadvantages. Pros Cons

9.2.2 Correlation

In this section we will focus on an important property of checkpoints and Petri-nets, namely that the checkpoints reached are statistically correlated to the marking of the Petri-net. Later on we will need this property, and although not the case in all situations we will assume that in most practical situations there is a statistical correlation between both.

We want to show that there is a statistical correlation between the state of the Petri-net and the checkpoints reached. So, if we know the marking of the Petri-net we should be able to give a probability that a certain checkpoint will be reached. It is clear that this question can be exactly answered if we have all information available. If we would have available the entire state of the component, then we can with a probability of 1 say which execution will be executed. To know this we can simply simulate the execution of the program.9.1

However, if we don't have all information available, this might not be possible. It is clear that the information contained within the Petri-net is only enough to describe the concurrency strategy of the component. It does not contain a full internal description of the state of the component, hence we might in general not be able to link these statistically. Typically, the checkpoints in components that cannot be statistically linked to the Petri-nets involved will use some extra information not available to the Petri-net. This information will be either information with respect to the concurrency strategy, or information with respect to some internal logic of the program. If the unknown information is part of the concurrency strategy, this should have been described in the Petri-net, if it isn't however, the information will still be in correlation with the concurrency strategy, because these checkpoints are specifically placed at important positions with respect to the concurrency strategy. Therefore we will further on assume that the checkpoints are statistically correlated to the marking of the Petri-net: Assumption:

the checkpoints help documenting the concurrency strategy in such a way that a statistical correlation exists between the checkpoints and the marking of the Petri-net.

In this section we have explained how rewards can be created by placing checkpoints in the component's source code. This is easy to implement and allows us to offer a reinforcement learning algorithm the necessary feedback.

9.3 Reinforcement Learning

THE PROBLEM OF KEEPING A COMPONENT ALIVE by selecting the correct action in a certain situation seems to be a suitable problem for a reinforcement learning approach as described in section 4.4. The mapping of our problem to a reinforcement learning problem requires the definition of states $s_{t}$, rewards $r_{t}$ and actions $a_{t}$. We will define the state to be the marking of the Petri-net adaptor. The actions a learner can take are defined as the transitions within the net. However, part of the transitions are under control of the learner while others are executed automatically by the underlying component. If the action executed is not controlled by the learner it behaves exactly as if the learner would have chosen it.


A(s)=A(M)=\{ a \vert  M[a\rangle\}
\end{displaymath} (9.2)

The rewards for the learner come from the underlying component and are assigned at the moment a checkpoint in the component's code is reached. Because a reinforcement learning approach requires a reward signal at every time-step we define the rewards $r_{t}$ to be zero unless the underlying component specifies otherwise. The underlying component will however send a reward for the last action back as an asynchronous message. This means that the learner must wait to assign a reward of zero until a new message arrives.

This definition of our liveness problem as a reinforcement learning problem is straightforward, however; before we can be sure that this problem is a valid reinforcement learning problem we need to be sure that the problem constitutes a Markov Decision process.

9.3.1 A Markov Decision Process

We will now show that the stated liveness problem constitutes a Markov decision process. A Markov decision process is a process in which all necessary information from the past, that has led to the current situation is retained in such a way that a decision can be taken solely on the base of the current situation. Formally, $P_{r}(s_{t+1}=s',  r_{t+1}=r')$ should be known9.2 based only on the current state $s_{t}$ and the current action $a_{t}$.

To demonstrate this we show that the current marking of the Petri-net and the current transition selected within the net define a) the next state and b) the reward that is returned. Proving a) is easy because the Petri-net describes exactly in a consistent way the concurrency behavior of the underlying component. Exact means that the next state $s_{t+1}$ is entirely dependent on the current state $s_{t}$ as defined by the Petri-net. Consistent means that the underlying component will act upon what is described in the Petri-net. Formally,



\left\{ \begin{array}{cc}
P(s_{t+1}=s')=1 & s_{t}[a_{t}\rangle s'\\
P(s_{t+1}=s')=0 & \textrm{otherwise}\end{array}\right.\end{displaymath}

Showing b), that the reward $r_{t+1}$ is dependent on the current state $s_{t}$ and the action $a_{t}$ chosen is in general more difficult because the underlying component can make use of a memory which is not specified in the Petri-net, and which is thus not visible to the learner. However, if we go back to the assumption that the checkpoints are correlated to the Petri-net (page [*]) then we can conclude that a reward is immediately dependent on the currently selected action and the current state of the Petri-net. Therefore we conclude that, under the stated assumption, the given problem is Markov.

9.3.2 $Q$-learning

Finding out which branch of reinforcement learning algorithms is useful (Monte Carlo, $TD(\lambda)$, Dynamic Programming) highly depends on the nature of the problem. The liveness problem we have is a continuous task because we cannot require the underlying component (and all the components it's communicating with) to reset their behavior during execution. This prohibits us from using techniques such as Monte-Carlo methods [SAG98]. The liveness problem also has no model available to foresee when a reward will be assigned in the future, therefore dynamic programming [SAG98] techniques are also of little interest to us. As such, we will investigate the use of temporal difference learning. A temporal difference learning algorithm approaches the problem of estimating the $Q(s,a)$ and $V(s)$ value-functions in an incremental way. Every time a certain action $a_{t}$ is selected the resulting state $s_{t+1}$ (or state-action couple $(s_{t+1},a_{t+1})$) will give a certain amount of reward to the current state $s_{t}$ (or state-action couple $(s_{t},a_{t})$). The next state is responsible for recreating its amount of accumulated reward by choosing an appropriate action. As such, rewards are propagated backwards. If future rewards are accessed multiple times, the current situation will always profit of it. Below we will investigate how the well known $TD(\lambda)$ learning technique $Q$-learning can be applied.

$Q$-learning in general works as follows:

  1. Choose action $a$ from $s$ using a policy ($\epsilon$-greedy for instance) and the $Q$ - value function.
  2. Execute action $a$, observe $r$ and new state $s'$.
  3. Change $Q(s,a):=Q(s,a)+\alpha(r+\gamma max_{a'}Q(s',a')-Q(s,a))$. With $0\leq\alpha\leq1$. A high value for $\alpha$ will make the learner learn quickly. $0\leq\gamma\leq1$ is the discount factor. This models the fact that future rewards are worth less than immediate rewards as explained on page [*] (equation 4.1).
  4. The new state becomes the current state. $s=s'$

9.3.3 State Space Compaction

The problem with implementing $Q$-learning is that both methods needs to keep a memory of $Q(s,a)$. In our case however this is not feasible because the number of possible markings is very large. The reason behind this is that Petri-nets can express in a compact way concurrent processes, hence, even for small Petri-nets the number of markings can be very large. However, we will argue that a straightforward state space compaction is possible by only assigning strengths to transitions enabled under a certain marking. This will be a good consistent choice that will respect the structure of the problem involved. Below we will first assume that we have an elementary Petri-net, describing the behavior of an interface (see section 3.4.3). Afterward we will see how this naturally extends to colored Petri-nets. Definition of $V(s)$ and $Q(s,a)$

When working with a reinforcement learning algorithm we have to define how ``good'' it is to be in a certain state and how ``good'' it is to select a certain action in a certain state. These two functions are called the state value and the state-action value functions. For our Petri-net the state value function is easily defined as the best future reward possible by any action enabled under that state, hence we define

V(s):=max(\{0\}\cup\{ Q(s,a) \vert  a\in s^{\bullet}\})
\end{displaymath} (9.3)

This definition is useful because it states that the maximum reward that can be obtained from state $s$ is either $0$ or the maximum of the reward to be expected from all enabled transitions under $s$. The state-action value function $Q(s,a)$ can be deduced immediately from this definition because the execution of $a$ will always result in state $s[a\rangle$ for which we know the expected future reward by the state value function. Hence,


In the above two mutual dependent definitions, a marking must be remembered for every possible action. This is expensive because the possible markings can be very large. Therefore, we will now define an expected reward solely on the messages being send between the components. In the definition given below, $Q(a)$ denotes the strength of transition $a$.


Implementing the $Q$-learning algorithm as such, is straightforward

Q(s,a)=Q(s,a)+\alpha(r+\gamma  max_{a'}Q(s',a')-Q(s,a))\end{displaymath}

Q(a)=Q(a)+\alpha(r+\gamma  max_{a'\in s'^{\bullet}}Q(a')-Q(a))\end{displaymath}

strength(a):=strength(a)+\alpha(r+\gamma  max_{a'\in s'^{\bullet}}strength(a')-strength(a))\end{displaymath}

With this definition we only have to store strengths in transitions, we don't have to remember possible rewards for every $(s,a)$ couple. We will now explain that very often this high compaction ratio forms no problem. Given our definition of $Q(s,a)$ and the assumption that all updates to $Q(s,a)$ aim to make a better approximation of the possible future reward, we will indicate that $Q(s,a)$ indeed defines the expected future reward.

Although a seemingly trivial statement, the real problem comes from the fact that $Q(s,a)$ only stores one strength for every $a$ and not for every couple $(s,a)$. So an update to $Q(s,a)$ will very likely modify $Q(s',a)$ too , which not necessarily means that an update to $Q(s',a)$ is appropriate. To demonstrate that an update to $Q(s,a)$ will never result in an inappropriate update of another couple $Q(s',a')$ we will demonstrate that when $a\neq a'$ the update does not interfere and that when $s'\neq s \wedge  a'=a$ such an update will interfere, but the result is still what one would expect. In the case that $a'\neq a$ proving non-interference is trivial because $Q(s,a)=Q(a)$ and this strength is stored in the transition $a$ itself. So this update does not modify $Q(a')$, and thus does not modify any $Q(s',a')$. In the other case when $s'\neq s \wedge  a'=a$ an update on $Q(s,a)$ can only occur when $a$ was enabled under $s$. Therefore a modification to $Q(s',a)$ will only occur when $a$ is also enabled under $s'$. If that is the case this inference is not a problem because the expected reward under marking $s'$ should also become the reward of $Q(s,a)$ because there is a means by which this possible higher reward can be obtained. With this we have shown that our definition of $Q(s,a)$ will always indicates what kind of a reward we can expect in the future. Below we give an example of a situation where $Q(s,a)$ and $Q(s',a)$ with $s\not=s'$.

Figure 9.4: Four Petri-net states. When a place contains a token it is colored yellow. Red transitions are disabled, green transitions are enabled.
width=1.0\columnwidth]{StateCompactionDemonstration.eps} Example

An example of such a situation is pictured in figure 9.4. We consider the case where two different markings both have the same transition enabled. In the top left part of the figure we see marking $s_{1}$. In the top right part of the figure we see a Petri-net with a current marking $s_{2}$. Both markings enable transition $t_{3}$. Marking $s_{4}$ (= $s_{2}[t_{3}\rangle$) receives a reward, but marking $s_{3}$ (= $s_{1}[t_{2}\rangle$) does not. In this case, after choosing action $t_{3}$ (executing transition $t_{3}$), a reward will be assigned to $t_{3}$ only in the bottom right case. This might indicate that an interference between $Q(s_{1},t_{3})$ and $Q(s_{2},t_{3})$ is in place. However, as explained earlier, the Petri-net marking is correlated to the rewards received, so we can assume that the reason why a reward is assigned for $s_{4}$ is because place $p_{1}$ contains a token, while it does not contain a token in $s_{3}$ (in this particular example). So $t_{3}$ will not preserve this information and not link state $s_{2}$ to a positive reward and state $s_{1}$ to no reward. Instead $t_{3}$ will oscillate around a certain value. The Q-learning algorithm will back propagate rewards to any transition fired before. Hence, after a learning period, transition $t_{1}$ will contain a constant value, indicating the necessity of a token at place $p_{1}$. This illustrates how the necessary information is preserved.

The oscillation that occurs on transition $t_{3}$ comes from the fact that the compression of all states (which actually constitutes a Markov Decision Process) loses too much information and no longer represents a Markov decision process. To be able to handle these unwanted situations, together with the need to handle specific symbolic actions (such as when [X, Y] arrives, send back [Y-1,X-1]) we have combined the above reinforcement learning algorithm with the situation recognizer described in section 9.1.

9.3.4 Structural Aspects of this Compaction

In the case of colored Petri-nets the above definition of $Q(s,a)$ will lose however even more information. For instance it is possible to have a set of differently colored tokens that all enable the same transition in different ways. Therefore, our definition of $Q(s,a)$ loses even more information, because a colored Petri-net typically collapses states when appropriate (as observed in the experiments in section 11.5). Our learning algorithm will be able to make use of this structural information. E.g. instead of learning what to do on position $(5,5)$, position $(6,8)$, position $(9,10)$ and so on, the learner will learn what to do on position $(x,y)$. To do so, the learner will add new transitions every once in a while, depending on the policy. Furthermore, when necessary, the loss of information in $Q(s,a)$ can be compensated by creating new transitions that recognize new situations. Once such a transition is added its strength will increase or decrease, depending on its suitability.

9.4 Discussion

9.4.1 Experiments

TO VALIDATE WHETHER THE COMBINATION of Petri-net transitions, reinforcement learning and statically placed checkpoints works in practice, we conducted a number of experiments. Every experiment is based on the conflicts enumerated in chapter 6.

Figure 9.5: Liveness module tested on floodactor 1 (non-waiting, nested, square, immediate transition).

(transition "dynamic-0" (7 times)

  (input "lock-out"-req [TMP2 11])

  (input "lock_true"-enabled [])

  (output "lock_true"-act []))

(transition "dynamic-1" (4 times)

  (input "unlock-out"-req [3 TMP2])

  (input "return_unlock"-enabled [])

  (output "return_unlock"-act []))

(transition "dynamic-9" (196 times)

  (input "unlock-out"-req [TMP2 TMP1])

  (input "return_unlock"-enabled [])

  (output "return_unlock"-act []))

(transition "dynamic-63" (4 times)

  (input "lock-out"-req [TMP1 12])

  (input "lock_true"-enabled [])

  (output "lock_true"-act []))

(transition "dynamic-111" (175 times)

  (input "lock-out"-req [TMP1 TMP2])

  (input "lock_true"-enabled [])

  (output "lock_true"-act []))

Figure 9.6: Liveness module tested on floodactor2 (non-waiting, non-nested, squares, immediate, different syntax)

(transition "dynamic-0" (12 times)

  (input "lock-out"-req [9 TMP1])

  (input "lock_true"-enabled [])

  (output "lock_true"-act []))

(transition "dynamic-1" (3 times)

  (input "unlock-out"-req [TMP1 14])

  (input "return_unlock_true"-enabled [])

  (output "return_unlock_true"-act []))

(transition "dynamic-10" (189 times)

  (input "unlock-out"-req [TMP1 TMP2])

  (input "return_unlock_true"-enabled [])

  (output "return_unlock_true"-act []))

(transition "dynamic-13" (182 times)

  (input "lock-out"-req [TMP1 TMP2])

  (input "lock_true"-enabled [])

  (output "lock_true"-act []))

(transition "dynamic-16" (2 times)

  (input "unlock-out"-req [TMP2 TMP1])

  (input "return_unlock_true"-enabled [])

  (output "return_unlock_true"-act []))

Figure 9.7: Liveness module tested on floodactor3 (waiting, non-nested field lock, immediate transition)

(transition "dynamic-0" (74 times)

  (input "enter-out"-req [])

  (input "enter_done"-enabled [])

  (output "enter_done"-act []))

(transition "dynamic-1" (73 times)

  (input "leave-out"-req [])

  (input "leave_done"-enabled [])

  (output "leave_done"-act []))

Figure 9.8: Liveness module tested on floodactor 4 (waiting, nested, square, immediate transition, normal syntax)

(transition "dynamic-2" (218 times)

  (input "lock-out"-req [TMP1 TMP2])

  (input "lock_done"-enabled [])

  (output "lock_done"-act []))

(transition "dynamic-3" (15 times)

  (input "unlock-out"-req [9 TMP1])

  (input "unlock_done"-enabled [])

  (output "unlock_done"-act []))

(transition "dynamic-13" (6 times)

  (input "unlock-out"-req [TMP2 8])

  (input "unlock_done"-enabled [])

  (output "unlock_done"-act []))

(transition "dynamic-20" (3 times)

  (input "lock-out"-req [TMP2 TMP1])

  (input "lock_done"-enabled [])

  (output "lock_done"-act []))

(transition "dynamic-23" (197 times)

  (input "unlock-out"-req [TMP2 TMP1])

  (input "unlock_done"-enabled [])

  (output "unlock_done"-act []))

Figure 9.9: Liveness module tested on floodactor 5 (non-waiting, nested, line locking)

(transition "dynamic-0" (294 times)

  (input "lock-out"-req [TMP2])

  (input "lock_false"-enabled [])

  (output "lock_false"-act []))

(transition "dynamic-2" (27 times)

  (input "lock-out"-req [TMP1])

  (input "lock_true"-enabled [])

  (output "lock_true"-act []))

(transition "dynamic-66" (39 times)

  (input "lock-out"-req [18])

  (input "lock_true"-enabled [])

  (output "lock_true"-act []))

Figure 9.10: Liveness modules tested on moving dot actor 6 (non waiting, non nested, square locking)

(transition "dynamic-2" (11 times)

  (input "unlock-out"-req [28 3])

  (input "return_unlock"-enabled [VAR0])

  (output "return_unlock"-act [VAR0]))

(transition "dynamic-27" (210 times)

  (input "lock-out"-req [28 TMP2])

  (input "lock_true"-enabled [])

  (output "lock_true"-act []))

(transition "dynamic-166" (141 times)

  (input "unlock-out"-req [VAR0 TMP1])

  (input "return_unlock"-enabled [VAR0])

  (output "return_unlock"-act [VAR0]))

Figure 9.11: Liveness modules tested on line actor 9 (2 layered locking strategy. The first layer is a waiting, nested square locking strategy, the second layer is a waiting nested field locking strategy)

(transition "dynamic-3" (215 times)

  (input "lock-out"-req [TMP1 TMP2])

  (input "lock_done"-enabled [])

  (output "lock_done"-act []))

(transition "dynamic-25" (206 times)

  (input "unlock-out"-req [TMP2 TMP1])

  (input "unlock_done"-enabled [])

  (output "unlock_done"-act []))

We now describe the behavior of our reinforcement learning algorithm by looking at some of the more interesting experiments. The experiments performed apply a learning rate $\alpha$ of 0.1 and a discount factor $\gamma$ of $0.9$. Every experiment is summarized in a figure that plots the $Q$ value of a number of transitions over time. Not all transitions involved in every liveness module are presented, only those that helped in implementing a correct behavior. For every transition the Petri-net code is shown under the form of dynamic transitions, together with the number of times the transition has fired during the experiment. Every line in the graphic has its own behavior, and specifies a transition that can only be fired in certain circumstances. For instance, in figure 9.7 we see 2 lines. The red line (dynamic transition 0) is the locking behavior, the green line (dynamic transition 1) is the unlocking behavior. Now, some observations can be made:

  1. Often transitions will be present with a fairly constant $Q$ value, such as the dynamic transitions 0 and 63 in figure 9.5. This typically occurs when the transition is very specific and only works on specific positions such as row 11 and row 12. Such transitions are therefore not often executed, hence the $Q$ value is not adapted as fast as the often used transitions.
  2. Sometimes transitions occur which share behavior. For instance in figure 9.6 the 16th dynamic transition implements exact the same behavior as the 10th dynamic transition. The state-recognizer generator currently avoid duplicates as much as possible, however in this case the names of the temporary variables are swapped and as such are not recognized as duplicates. However, such duplicates (dynamic transition 16) does not immediately take away the rewards from the original transitions (dynamic transition 10).
  3. In figure 9.5 the expected Q-values of the two most used transitions oscillate. This comes from the fact that flood actor 1 not only assigns rewards when it is able to lock or unlock a square, but also when a position can be claimed to be part of the actor, in that case a relatively high reward (10) is assigned. It is clear that this latter reward behavior does not conform to the stated requirement that rewards should only be used to specify concurrency behavior, however, aside from some oscillation this does not pose much problems. Figure 9.8 illustrates this even more clearly, here also the future reward after a lock operation keeps on oscillating because the rewards are positioned at different places.
  4. In figure 9.6 one of the first rules discovered (dynamic transition 0) is a rule that allow locks on column $9$, however it takes a long time to discover a general rule that also works on other columns of the whiteboard. Visually this could be perceived as a flood actor which, during its first 12 moves, migrates only vertically. However, when a general rule is found, it quickly spreads horizontally. A similar behavior is encountered in 9.9.
  5. In figure 9.9, a reward-less lock operation is executed 294 times. This happens because another transition (which knows how to lock column 18) hides the lack of reward. This comes from the fact that the quality of a certain marking is the maximum possible reward that can be obtained in the future. Given the fact that our colored Petri-nets collapses different states, and does not make a distinction between column 18 and all the other columns, transition 0 will be used to claim the future reward. However, because our representation of state-recognizers is well designed, this forms no problem, only a delay.
  6. Figure 9.11 shows an experiment in which we tested how future rewards are remembered and how fast they are forgotten. The actor involved is a line actor, which only assigns a reward at the moment an entire line could be locked or unlocked. As can be seen, the future reward is forgotten slowly and thus still remembered correctly at a later time to allow the algorithm to exploit this knowledge.
  7. In figure 9.11 the locking strategy contains two layers. The first layer is the layer in which we need to specify that we will start or stop with locking operations. The second layer can be used to lock individual squares. Because the offered Petri-nets do not allow much choice with respect to what to do when a StartLocking request arrives, these transitions are not included.
  8. In figure 9.10 the goal of the experiment was to learn that in 50% of the cases a LockFalse should be sent back and in 50% of the other cases a LockTrue should be sent back. Learning this behavior has clearly failed. Currently, the amount of reward that is expected to be received is marginal in comparison to what can be obtained. This can be seen in the quality value of dynamic transition 166, in which at time step 400, suddenly a high reward arrives because the algorithm has tried to send back a lockFalse. However, the reason for this high reward cannot be remembered and is quickly forgotten again. This is an example of a hidden correlation that cannot be learned.
Of all the components tested (16 concurrency strategies), 15 could be kept alive by the liveness module. The one failing is component pictured in 9.10. From these experiments, we can conclude that our liveness module works as expected, however, we can also conclude that it might be overkill and that it seems that most concurrency strategies are in essence very simple FSM's, in which we don't have to remember much of the previous history. We will come back on this issue in chapter 12.

9.4.2 Bypassing a Concurrency Strategy

The approach we have presented in this chapter uses a reinforcement learning technique as a means to solve the liveness problem. Because we cannot rely on the formal liveness definition, we introduced checkpoints as a means for the developer to specify which actions are favored. The technique presented here could also be used to learn how an adaptor can reach a certain state. Should we want to do this, the reward could be assigned at the moment a change in marking occurs that is closer to the target marking. If it is further away from the target marking, a negative reward (punishment) should be given. Implementing such a technique would be quite similar and it would allow us to bypass the concurrency strategy of a server component without the need to formally analyze the Petri-net.

9.4.3 Comparison with Classifier Systems and the BBA

The cautious reader may notice that there are a lot of similarities between the Petri-net transitions we use and classifier systems. Indeed: the messages in a classifier-system can be compared to the tokens present in the marking of a Petri-net system. In that case,

Aside from these similarities, both strategies were investigated in an off-line fashion, by means of a Pittsburgh approach (described in detail in chapter 11). This technique is inherently off-line and the genetic algorithm requires different episodes to verify the value of every individual. However, if we want to make this technique work on-line we might apply the Michigan-approach [LF93], in which the different classifier rules are considered to be competing individuals. To evolve such an on-line strategy and learn which rules are appropriate in which situations, the bucket brigade algorithm [Hol] (BBA) might be useful. Summarized, the BBA gives each classifier a strength. If a classifier is fired it passes on its strength to the ones that created the current situation. In the next step, the classifier will receive a reward 'from the future' if there is a reward that could be obtained.

Nevertheless there are some major differences which make the BBA useless in our case. First is the fact that the bucket brigade algorithm needs to know how specific a certain rule is. Depending on the specificity a higher bid can be placed. This is something that cannot easily be transposed to Petri-nets. A second limitation of the BBA is that delayed rewards are implemented by means of a mean future reward. This effectively means that a possible very good solution can be hidden behind a lot of bad alternatives. A third problem with the BBA is the initial bias which is often present when introducing new rules. Evolutionary speaking, new rules can never prove themselves in an environment where systematically the worst rules, regardless of their age, are removed first. This makes balancing the exploration/exploitation phase very difficult.

9.5 Summary

IN THIS CHAPTER WE HAVE EXPLAINED how we can construct a liveness module. First, we explained which representation we would use to recognize new situations, second we explained how rewards from the underlying component could be obtained. And third, we explained how $Q$-learning can be applied to the given problem. In this chapter we have developed a method to minimize the storage-requirements of the value function, needed for $Q$-learning. Thanks to the locality property of Petri-nets we only need to store $Q(a)$ instead of $Q(s,a)$ and by introducing state-recognizers as new transitions within the Petri-net we are able to exploit structural properties of the colored Petri-net involved.


One might object and say that we cannot be sure that the program will ever stop because it is Turing-complete. However, we know that it will stop because the components themselves are reactive and will finish always at some time.
... known9.2
$P_{r}$is the probability distribution
Werner 2004-03-07