Subsections

5. Introducing the Case: Concurrent Systems

OUR CASE FOCUSES ON THE CREATION of a concurrency adaptor. When working in an open distributed system, multiple components will require and provide all kinds of different locking strategies. If we want to interface with other components we need adaptors between the different synchronization approaches. The problem with this is that there are many possible synchronization approaches and that the number of combinations of different approaches is even larger. Hence we need an automatic concurrency adaptor. This chapter explains the concurrency problems we have in open distributed systems. We will gradually investigate a number of properties of concurrency guards in open distributed systems. This will enable us to identify a set of conflicts in chapter 6. The selection of these conflicts will be based on a number of variabilities of the different presented conflicts. Therefore we will present these at the end of this chapter.

5.1 Introduction

IN THIS CHAPTER WE INTRODUCE OUR CASE, which is concurrency. The motivation of choosing concurrency as a case, lies in the fact that we feel that concurrency is a problem that is often overlooked in open systems. As we will explain, every component that can be used by third party software, needs to offer a concurrency strategy. When multiple components offer different concurrency interfaces, conflicts might arise that have nothing to do with the core functionality of the component involved. This makes our case interesting because if we can mediate these conflicts automatically, we have actually removed a non-functional concern for developers.

For practical purposes the problem of concurrency in distributed systems will be simplified to its bare essentials. Instead of using real database servers or transaction servers and real clients, we will create our own mini-version of the problems present in these software systems. We do this for a number of reasons:

We want to offer an example of a client-server architecture, which covers all the essentials. So, our example should include concurrency problems, it should clearly illustrate the reason why one needs abstract techniques to manage concurrency and it should be able to express the notion of resources.
We want to minimize the core functionality of the program, because we this makes understanding all the different concurrency strategies more understandable.
To be able to use these concurrency strategies as a case we need to implement them under the form of components. Since it is hardly possible to find an existing combination of clients and servers which offer all kinds of different concurrency strategies but with the same core functionality, we had to create them ourselves.

The approach we use in this chapter to present concurrency is our own work. When appropriate we will point out papers that might be of interest to those who need more information. In this chapter we will focus on different concurrency strategies in such a way that it allows us to define a set of interesting conflicts in chapter 6.

5.2 Concurrency

THERE ARE TWO REASONS why distributed systems are concurrent systems. First, because they are running on multiple computers, where all those computers run in parallel and share resources. Secondly, because often distributed computations allow multiple users and have to support sessions. Concurrency might lead to a number of problems.

Race conditions: these occur when at least two sessions (directly or indirectly) read and write a shared variable in such a way that the order of the events determines the outcome of the application. The result of race conditions is that the application behaves indeterministically: depending on the order of execution different application states will occur. This indeterminism is hard to debug and all too often not wanted because it was not anticipated by the programmer. Often people tend to believe that race-conditions require a shared memory, however, this does not mean that race-conditions cannot occur in an event based system. For an example of this see section 2.7 on page .
Deadlocks: A group of components is deadlocked when there exists a closed cycle of components, each in turn waiting indefinitely for an event of the next component. Our definition is largely based on the notion of deadlocks within state-machines. When a Petri-net arrives in a situation where it still contains tokens, but no pre-condition holds for any transition, then it is in a deadlocked situation. (see 3.10 on page or [Mur89,EN94]). In comparison to the operating system definitions of deadlocks, we find all necessary and sufficient conditions. A deadlock in operating system terminology requires [Bro97,Dic00]:
1. mutual exclusion, which means that only one process at a time can access a resource. In our event based system this is the case because a component can only serve one application session at a time. However, if the components themselves do offer a concurrency strategy that keeps track of session id's and allows multiple sessions to access the same resource then we might not have a deadlock. This is consistent with the Petri-net definition, because in such a case the Petri-net would have enabled transitions to support the new session.
2. hold and wait, which means that a session may hold some allocated resources while waiting for others. In our case this depends on what kind of concurrency strategy is offered by the involved components. However, also here we can determine whether a situation is a deadlock or not by looking at the involved Petri-net markings.
3. no preemption, which means that no resource can be taken away by force. In our definition we assume that no component can be taken away because this would make it impossible for the application to execute.
4. circular wait, which means that a closed chain of processes exists, such that each process holds a resources required by the next process in the chain. In our definition this is simply translated to the notion of components.
Livelocks: when measure of control is taken against race-conditions under the form of non-waiting locks, or when a measure of control is taken against deadlocks, by means of transactions, we can end up in a situation in which the system livelocks. In such a situation two processes start locking but encounter a problem halfway and release their locks again. They both restart again, again to release their resources after a while. In such a situation, the system is not waiting, on the contrary, it is working very hard, but it is not doing anything useful.
Starvation: an extra problem in concurrent systems is the problem of scheduling. By allowing processes to lock and unlock resources, some processes (or a group of processes) can (each in turn) lock a resource resulting in this resource being locked for a relatively long time thereby exceedingly slowing down other processes.

Understanding concurrency problems can be hard, solving them can be even harder, especially if we are faced with the possibility of partial failures.

5.3 The Whiteboard Case

AS A CASE THROUGHOUT THE THESIS we will use a whiteboard. A whiteboard is an application on which a number of different actors (users or other computer programs) can put and get elements. Normally a whiteboard is used as a support tool in group discussion systems. We will use it as a means to illustrate concurrency problems graphically. First we will discuss the whiteboard and its actors without any concurrency primitives. Afterward we will explain the problems and a number of standard solutions.

**Figure 5.1:** The whiteboard: every actor is represented by a color. Actor 1, 2 and 3 are moving dots. Actor 4 is a moving line and actor 5 and 6 are flood fills.
$\includegraphics[% height=8cm]{BoardIntroduction.eps}$

5.3.1 The Interface

The whiteboard has a very rudimentary interface:

in JoinActor() is an incoming message for the whiteboard. As a result
out ReturnJoinActor(<Integer|Result>) is sent. The integer sent back to the client is the color that is assigned to that actor and can afterward be used to color squares on the board.
in IsFree(<Pos|Pos>) can be sent to check whether a certain position is empty or not. Pos is an , and coordinate. As a result out ReturnIsFree(<Pos|Pos>, <Boolean|Result>) is sent. When Result is true the position specified by Pos is free, otherwise it is not.
in SetPosition(<Pos|Pos>,<Integer|Actor>): to set a certain position to the actor's color, send SetPosition to the whiteboard. When done the whiteboard will send out ReturnSetPosition() back to the requester. If the supplied integer 'actor' is zero, then the position is emptied. For illustrative purposes, when a position is cleared, a residue, in a lighter color, will remain: if an actor puts yellow on a certain position and then clears it that position will become lighter yellow and will be free for others to put something in. This way a trail can be seen where the actor has already been. See figure 5.1.
out RETURN_SetPosition(): response of a SetPosition call.

The whole whiteboard itself is one component. It has one thread running and doesn't share data with other processes. Every actor on the whiteboard is also a component. There are 3 actors we will discuss: the moving dot actor; the moving line actor and the floodfill actor.

5.3.2 The Horizontally Moving Dot Actor

$\begin{algorithm} % latex2html id marker 2701 [!htp] \begin{algorithmic}[5] \par... ...lgorithmic}\par \caption{ The horizontally moving dot algorithm} \end{algorithm}$

The moving dot actor uses a simple algorithm to move over the whiteboard. It first checks whether the next position , left or right of the current position, is free. If it is it will mark that position and clear the old one. If the position is not free then it will change its direction. In figure 5.1, the red, orange and yellow actors are moving dots. Algorithm 6 covers the details. Please note that in all implementations we give we have to take care of the non blocking nature of the component system. In fact, we cannot wait until IsFree or SetPosition returns. However, for the sake of simplicity we omitted the original complex non-blocking code and replaced it with a more readable blocking version.

This simple actor requires and provides the following interface:

out JoinActor() when starting up the moving dot actor will join a whiteboard. After sending out this message the actor will passively wait for an
in RETURN_JoinActor(<Integer|Result>).
out IsFree(<Pos|Pos>) will be sent by the moving dot actor to check whether the next position to move to is free. The dot will wait until an
in RETURN_IsFree(<Pos|Pos>, <Boolean|Result>) arrives.
out SetPosition(<Pos|Pos>,<Integer|Actor>) is sent out to set and clear the position (hence, move the dot). After every SetPosition the moving dot will wait until an in RETURN_SetPosition() arrives.

5.3.3 The Moving Line Actor

$\begin{algorithm} % latex2html id marker 2741 [!htp] \begin{algorithmic}[5] \par... ... \par \end{algorithmic}\par \caption{ The moving line algorithm} \end{algorithm}$

Aside from this simple actor, there is a more interesting figure that can be moved around the board: the moving line. It is a line of 10 pixels high, with a trailing dot behind it. The standard algorithm of the line (algorithm 7) checks whether the next line is free (lines 9,10,11 in the algorithm). If it is, the old line will be removed (lines 17,18 ), the trail will be drawn (line 19) and the next position of the line drawn (lines 20,21). The line's trail goes up and down relative to the origin of the line (lines 14,15,16). If the line bumps into something then it will change direction (line 24). Before checking whether the next line is free, the trail is removed, otherwise the line would not be able to turn around. (line 7). The interface required and provided by the line actor is exactly the same as the one for the moving dot actor.

5.3.4 The Floodfill Actor

$\begin{algorithm} % latex2html id marker 2757 [!htp] \begin{algorithmic}[5] \par... ... \par \end{algorithmic}\par \caption{ The flood actor algorithm} \end{algorithm}$

The last actor we will discuss is the floodfill actor. This actor tries to fill the whiteboard by enlarging its own domain. The standard algorithm (algorithm 8) keeps a set of points that are owned by the flood actor, a set of border points and a set of seed points. A border point is a point owned by the floodfill actor, but with not all neighbors owned by the actor. A seed point is a set of points that are possible candidates to fill. They are not yet owned by the actor. Normally a point starts in the seed set. If the point is free, the point becomes a border and all 4 neighbor points are added to the seed set. If a point is solely surrounded by points owned by the actor then it becomes an owned point. Points that became owned are cleared on the whiteboard but remain owned.This can be seen in figure 5.1. In the figure the purple and turquoise actor are flood actors. The interface required and provided by the flood actor is exactly the same as the one for the moving dot actor.

5.4 Race Conditions

WHEN WE LOOK AT THE INTENDED BEHAVIOR of all actors, we see that it is of vital importance that no actor crosses the boundaries of another. Dots should not go through each other, lines should bump, dots should not enter floodfill actors and floodfill actors should not cross each other either. Furthermore a line should always have a trail that goes up and down relative to the origin of the line. In this section we will illustrate that none of these requirements are met if we don't use any concurrency primitives. Let's investigate some of the problems

5.4.1 A Selection of Race Conditions

**Figure 5.2:** How two moving dots can pass each other without bumping.
$\includegraphics[% width=1.0\textwidth]{ConcurrencyProblem1.eps}$

Moving dots can go through each other: Figure 5.2 illustrates how two dots (as specified in algorithm 6) can pass each other without bumping. The problem occurs when both dots detect that a certain position is free and then both take that position. Afterward both dots will simply continue their way and since no other dot forms an obstacle, they can proceed. The netto result is that they did not collide, a thing they are required to do.
Dots can enter a floodfill: As stated in algorithm 8, the floodfill actor will first enlarge its set of possible seeds, in the next iteration a seed will be taken and checked whether it is free, if it is free the seed becomes ours. It is in this last step that a concurrency problem can occur: between checking whether the position is free and actually taking it. Figure 5.3 illustrates this.
In the same way line actors can pass each other and floodfill actors can enter each other: They will check whether a position is free and then acquire that position, blind to the fact that another actor has already occupied that position.

**Figure 5.3:** A moving dot crossing the flood actors boundaries
$\includegraphics[% width=1.0\textwidth]{ConcurrencyProblem2.eps}$

5.4.2 Different Solutions towards Race Conditions

The behavior described above is typically called a race condition. It means that two (or more) concurrent processes try to get to the same resources and are actually 'racing' to be the first. To deal with this kind of problems, a number of alternate tracks exist.

Detecting concurrency problems: a big problem of concurrency problems is that they are very scheduler dependent. We can work for years with the same code and after changing the scheduling behavior of the kernel we notice how things start to fail sometimes. At the moment we see these spurious errors we probably want to reproduce the bug and start debugging. The problem here is that under a debugger the scheduling behavior of programs tends to be different and concurrency problems don't show up anymore. This is what is called an Heisenbug. There is work done to help people debug such systems by means of record and replay. During the record phase a debugger runs the program and simply remembers the order in which locks or acquired or released. Afterward the debugger can replay the original execution, with the same order of events [RB02,Gar97]. Within thread based system, Lamport clocks [Lam77] are typically used for this. Within event based systems we can simply record all events. [CL94] discusses how such a the ordering of events within a message passing system (such as the one we are using) can be implemented.
Formal verification: What we often want to do is to check programs for concurrency problems in a more formal way. Therefore, something often done to describe and detect concurrency problems is to specify pre- and post-conditions. In our case, for the moving dot actor, a possible precondition could be: the next position is free and the current position is mine. The postcondition after moving would then be: the next position is mine and the current position is free. It is clear that such a pre- and postconditions can be used to detect race conditions at runtime. If we want to check the possibility of race-conditions statically, formal techniques exist that will automatically deduce how race conditions can be created. [CG]

The problem with open distributed systems is that we cannot verify those pre- and postconditions if we don't have all participating actors at hand. Now, let's see how concurrency problems can be solved in open distributed systems.

5.5 Centralized Atomic Operations

NONE OF THE BEFORE MENTIONED RACE CONDITIONS would exists if we modify our server a little bit. Instead of offering only an IsFree and SetPosition operator, we could add an extra operator to the whiteboard:

in MoveWhenFree(<Position>, <Direction>): Position is the position of the point to move. Direction can be either left, right, up or down. This operator will check whether the point we're moving to is free. If it is it will swap its content with position.After doing so, a message out RetrunMoveWhenFree(true) will be sent. When the position was not free a message out ReturnMoveWhenFree(false) is sent. This operator will also return false (meaning not free), when the target position is used by the same color as the original requester.

**Figure:** Moving a line with the `MoveWhenFree` operator.
$\includegraphics[% height=5cm]{CentralisedSolution1.eps}$

For most actors (the moving dot actor, the line actor and possible others), this solution might work, however this solution cannot guarantee that every possible use of the component an be expressed in such a way that no race-conditions occur. For instance, with this operation, it would be very difficult to make the floodfill actor to work correctly, without race-conditions. The biggest problem is that the floodfill actor in essence doesn't move, instead it takes new points if they border to the actor. As such implementing a floodfill actor in a safe way would mean that we can only move dots away from a position that is bordered by three other flood-dots. This would require an initial seed of at least 5 positions and would require additional code to input new blocks in the interior of the floodfill.

This solution illustrates that extending a server to support every possible critical section in one message is no solution at all, especially not in open distributed systems. Instead, it is a very local solution only to solve the problem of specific clients. Therefore we need some better solution. We cannot for every possible critical section end up modifying the server.

5.6 Non-Waiting Atomic Operations & Starvation

IN OPEN DISTRIBUTED SYSTEMS we need some form of critical sections, otherwise different actors can change the internal state of a component, without taking into account other components. Placing all possible critical sections at the server is not good enough because this would not allow unanticipated behavior for other components. Therefore we need a more abstract way to specify our critical sections. In fact, we need to specify the beginning and the ending of a critical section in such a way that no component allows access from another component unless it has obtained an atomic operation id. In such a scenario every component would provide a certain synchronization interface, for instance, defined as:

in enter(<Integer|Id>)

out enter_ok(<Integer|Id>)

out enter_fail(<Integer|Id>)

in leave(<Integer|Id>)

out leave_ok(<Integer|Id>)

out leave_fail(<Integer|Id>)

The implementation of the component offering such an interface of course requires some changes. to ensure that the requester can execute operations upon a server component, every incoming message id should be verified. This should be checked for every action, but this can be done quite easily.

How the semantics are defined is a bit more difficult. Two orthogonal features need to be investigated. First there is the question whether those atomic operations are reentrant or not, second there is the problem whether an atomic operation waits or not when entering. In total this gives about 4 essentially different synchronization interfaces. We will explain two of them in more detail below. The other two are reserved for the next section. We will define the semantics of the different approaches by means of pseudo code.

5.6.1 Non-Reentrant Synchronization Semantics

$\begin{algorithm} % latex2html id marker 2853 [!htp] \begin{algorithmic}[5] \par... ...tion{ non-waiting, non reentrant, enter/leave locking semantics} \end{algorithm}$

Non-waiting, non reentrant enter/leave synchronization semantics. (Algorithm 9). This way of locking allows the interface requester to start an atomic operation on the interface provider by entering it. The enter takes an argument that should be a global unique identifier. In return the interface provider will return EnterOk when the atomic operation could be started, EnterFail otherwise. At the moment a component has an atomic operation initiated on the server it cannot re-enter the server, not even from within the same atomic operation. This is called non re-entrance.

5.6.2 Reentrant Synchronization Semantics

$\begin{algorithm} % latex2html id marker 2878 [!htp] \begin{algorithmic}[5] \par... ... \caption{ non-waiting, reentrant enter/leave locking semantics} \end{algorithm}$

Non-waiting, reentrant enter/leave synchronization semantics. (Algorithm 10). When such an interface is provided, the same interface client can start multiple nested atomic operations on the server. This is especially useful when working with recursive functions. The algorithm itself simply keeps a lock counter that is increased every time a client enters. When a client leaves an atomic operation, the lock counter is decreased and LeaveOk is sent back. At the moment the lock counter reaches zero lockedby is emptied and another actor can enter an atomic operation. Currently the reentrant semantics return LeaveOk when an atomic operation is ended. This is not necessarily always the case. It would also make sense to differentiate here. For example, return a LeaveNested when there are still other atomic operations running and a LeaveOk when the last atomic operation has ended. In the same way, the non-reentrant semantics could be changed to return EnteredAlready when such an atomic operation has already been started.

5.6.3 Discussion

The two algorithms presented here, require the client to supply a globally unique identifier to the server. This is clearly not a realistic requirement. Normally the server will choose a number of its own and return that as a unique identifier. In section 5.10 we will present more about this.

The two above algorithms are both non-waiting algorithms, this means that if the interface provider returns an EnterFail message, then the interface requester will need to retry again at a later time. This is a source of trouble, because, chances are high that the interface requester will immediately try to enter again until he finally could start his atomic operation. This leads us to the following problems:

The more components are accessing the server, the more useless network traffic will take place. This network traffic would mainly consists of enter requests and EnterFails.
Moreover, since there is no fairness involved, the client flooding the server the most will probably obtain the server lock first. Clients with a slow network connection, located at a distance will have no fair chance to lock the server.

5.7 Waiting Atomic Operations & Deadlocks

TO AVOID NETWORK CONGESTION and starvation of distant components, we will now investigate how the earlier defined API can be implemented with waiting semantics. We will again distinguish between reentrant and non-reentrant semantics.

5.7.1 Non-Reentrant Synchronization Semantics

$\begin{algorithm} % latex2html id marker 2915 [!htp] \begin{algorithmic}[5] \par... ... \caption{ Waiting, non reentrant enter/leave locking semantics} \end{algorithm}$

Waiting, non reentrant enter/leave locking semantics: algorithm 11). These semantics are the same as the non-waiting locking semantics. The biggest difference is when somebody wants to enter an atomic operation on the interface provider. At the moment there is already an atomic operation running, the requester will be placed in a queue. As such, an EnterFail is never returned, instead the possible EnterOk is held back for a later time. At the moment a leave request arrives with the correct id, the atomic operation is finished and the first in the waiting list is informed and can start with his set of atomic operations.

5.7.2 Reentrant Synchronization Semantics

$\begin{algorithm} % latex2html id marker 2936 [!htp] \begin{algorithmic}[5] \par... ...\par \caption{ Waiting, reentrant enter/leave locking semantics} \end{algorithm}$

Waiting, reentrant enter/leave synchronization semantics. (Algorithm 12). The reentrant version has a similar protocol behavior. The only difference is that this algorithm will increase and decrease a lock counter in response to Enter and Leave respectively. At the moment the lock counter reaches zero, the next waiting request is considered. Some variations could exists upon these semantics.

5.7.3 Discussion

The non-waiting atomic operations had the problem of starvation and network flooding. In the same way, waiting locking semantics have their own problems: deadlocks. A deadlock is technically speaking a situation in which multiple processes are waiting for each other to do something. So they are all virtually dead. There are two interesting situations.

A typical deadlock with the non-reentrant locking strategy is a deadlock within one component. If a client component tries to lock the server component twice, due to some nested function, it will simply stop and place the second component in the queue. Of course since the client component cannot proceed, because it still doesn't have the lock, it keeps on waiting, thereby never releasing the first atomic operation started by it. The result is a deadlock, due to non re-entrance.

**Figure 5.5:** Mutual exclusion of two components. All components offer a waiting locking strategy.
$\includegraphics[% width=0.60\columnwidth]{Deadlock1.eps}$

Another deadlock situation happens when two components try to lock each other (due to some external request). This is pictured in figure 5.5. Component starts a green atomic operation on component . Component accepts this. Component starts a purple atomic operation on component . Component accepts this. Component wants to act on component , as a result component needs to lock component . This is impossible, thus component is placed in wait. Component now requests component to act. Component needs to start an operation on component . This is also impossible, so now component is also placed in wait. As a result all components stop working.

Aside from the fact that deadlocks stop components from doing anything useful, there are some extra problems involved. Since some components simply stop working within a deadlock, other components that are dependent on those deadlocked components will eventually also cease to work. In our small example, component

and component

are in such a situation. Eventually, a whole application may end up deadlocked.

5.7.4 Different Solutions to Deadlocks

There are 3 different approaches to solve the problem of deadlocks:

detection of deadlocks: this is a technique in which deadlocks are detected at runtime and then broken by some means. Typically a detected deadlock is solved by removing one or all of the offending processes. This track will not be investigated further in this dissertation because it effectively renders the application useless. Another technique is rolling back the entire state of one (or all) of the involved components to an older state and restarting them from there. In our situation this is not possible because it is very difficult to roll back a distributed state to a previous moment in time. Also the possibility of extra user input might complete this scheme.
avoidance of deadlocks: this technique requires the ability to know the future locking behavior of the involved components. This knowledge can dynamically lead to a correct decision to avoid deadlocks. However, this information is difficult to obtain if the components are not written to specify which resources they will lock in the future.
prevention of deadlocks: this technique alters the variables of the surroundings in such a way that deadlocks doesn't occur. Different approaches are possible
- instead of waiting for locks, a component can return a lock_false message. By doing so, deadlocks will no longer occur. We have already discussed this approach in section 5.6. In the next section 5.8 we will discuss some problems involved with this technique.
- define a linear ordering on the locking requests such that circular waits are no longer possible. We will discuss this in more detail in section 5.8.3.

5.8 Locking Resources & Livelocks

IN THE ABOVE MENTIONED SYNCHRONIZATION SOLUTIONS, we specify the start of an atomic operation on the server with some form of message. Once an atomic operation is started no other actors can start an atomic operation. This is not always as efficient as it could be. Consider our whiteboard example. If a client wants to set the color of two squares, it has to start an atomic operation on the whole server. This is in essence the same as locking the full server. Suppose now that a second actor also wants to draw some pixels but on the opposite side of the whiteboard. It is clear that those two operations can coincide, but won't because both need a server lock.

5.8.1 Granular Operations: Locks

A solution to this problem is to introduce a lock for every square on the whiteboard. In doing so a client actor can request the server to lock a number of squares, which it can access afterward, but still concurrently with other actors that have locked other positions on the board. The API for such a lock is extended somewhat. Instead of declaring where we want to enter we specify what position to lock (or unlock). The semantics can be implemented in the same way as specified before. We can have a combination of a waiting/non-waiting strategy with a reentrant/non-reentrant locking strategy. With respect to terminology, such locks are typically called semaphores. A reentrant lock is sometimes called a counting semaphore, while a non reentrant lock is often called a binary semaphore. [Lea00]

We introduced a lock for each square on the whiteboard. We could also introduce a lock per -line or a lock per -line on the field, or a lock per 4 squares on the field. We choose to map our resources to squares.

$\begin{algorithm} % latex2html id marker 2989 [!htp] \begin{algorithmic}[5] \par... ...hm whereby a waiting lock protocol is expected from the server.} \end{algorithm}$

Using a waiting non reentrant locking strategy to solve the concurrency problems when moving dots, flooding the whiteboard or moving lines with their trail is easy. Algorithm 13 shows how this can be done. Before actually checking whether a position is free, we lock the current position as well as the next position. If the next is free, we move and release both locks. If the position is not free, we also release both locks but we turn around. The same thing can be done for the moving line actor and the floodfill actor.

5.8.2 Problems of Fine Grained Locks: Livelocks

With this solution there still exists a plethora of problems:

**Figure 5.6:** Mutual exclusion between two actors trying to lock the same area of the whiteboard. A square layered above another square denotes a lock. So the yellow square above the Red square means that the position is colored red, but is locked by yellow.
$\includegraphics[% height=5cm]{ConcurrencyProblem3.eps}$

When using a waiting lock strategy we have the problem of deadlocks. See figure 5.6. At the moment the yellow actor wants to lock its area, that is the current position and the next position, it will first try to obtain a lock at the next position (2,1) and then try to lock its own position. The problem arises when the red actor now has already locked its next position (3,1), our current position. In this situation both dots become deadlocked because they are both waiting to acquire the same lock, a lock that will stay locked until one actor gets the locks it is waiting for.

**Figure 5.7:** Illustration of a livelock. The acquiring of locks for green and orange will continue until one finally succeeds in getting a whole line locked. This can take some time.
$\includegraphics[% height=5cm]{ConcurrencyProblem4.eps}$

When using a non-waiting locking protocol the normal behavior of the client is to try to lock a certain position and if it doesn't succeed release all the locks acquired yet. This can result in a livelock (see figure 5.7): a situation in which a number of clients try to lock the same position over and over again and stumble every time over the same problem. In our example this can be best visualized by using the line actor. Suppose that the line actor locks all its squares from top to bottom when going left, but from bottom to top when going right. At the moment both lines encounter each other at the same -coordinate, then at a certain moment they both will need to acquire a position, common to both. Both will release all their locks and retry. This is called a livelock and results in starvation of both processes. The larger the overlapping part between the lines the longer this livelock will persist, simply because one of both lines needs to keep away from the shared part during the entire locking operation of the other actor. The larger the shared part, the more unlikely this becomes. In other terms: the more shared resources involved, the longer a livelock will persist, unless an absolutely unfair scheduling system is introduced.

5.8.3 Solutions to Livelocks

Looking at these problems, we see that there are a number of possible solutions when working with non nested locks.

A typical solution to this kind of problems is called staircasing. Here locks should be acquired in a certain order. In our example it would make sense to sort all squares from left to right and top to bottom. When doing so, deadlocks are impossible. When a lock is requested and cannot be assigned we are sure that the other actor can continue because it already has all the locks it needs with an order smaller than the one it is requesting. As such, we don't need to unlock any of our already acquired locks. When using staircasing it is important to acquire the locks in order.

$\begin{algorithm} % latex2html id marker 3028 [!htp] \begin{algorithmic}[5] \par... ...ocks can be acquired within one atomic operation on the server.} \end{algorithm}$

Another possible solution is to make locking and unlocking itself an atomic operation, so extending the lock and unlock operations with a set of locks to acquire or release. (See algorithm 14). The two green lines start and stop the 'atomic locking operation'. Within this atomic operation two locks are acquired. Please note that there is no need to release the locks within a critical section. This way we are sure that we have all the locks we need at once or none at all.

**Figure 5.8:** Synchronization Layers
$\includegraphics[% width=1.0\columnwidth]{SynchronisationLayers.eps}$

In fact we are now solving a synchronization problem between locks, and no longer between resources. We are specifying locks themselves as resources where the access to the resource (the acquiring of locks) should be guarded by another lock. This layering is pictured in figure 5.8.

Using the above techniques (staircasing or making locking an atomic operation by itself) is difficult when the programmer expects nested locking semantics. Both techniques rely on the fact that at a certain point in time the program knows all the locks it will eventually need. When working with subroutines (or subprograms) that autonomously acquire locks this can be very difficult. One workaround is first to ask all subroutines what they will eventually need, collect all those locks and then acquire them within one operation.

However, if it is impractical to require that a process specifies all the resources it needs at once, we might need to resort to transactions, which we will explain in the next section.

5.9 Transactions & Partial Failure

A TRANSACTION IS AN ABSTRACT ATOMIC operation within which changes can be made on a component. When the transaction is finished it can either commit or abort. Committing a transaction means that all the changes made within its critical section become true and visible to other participants. Aborting a transaction means that all the changes made within the critical section are undone.

Transactions are necessary in systems where we cannot order locks because either there is no obvious order, because we cannot force actors to respect that order, or because an actor cannot specify all it needs at once. Another situation where we need transactions is when we think about failures. What will happen when one of the actors on the whiteboard has locked certain positions and then dies. Can the server recover from such a situation ? If the server supports transactions and it can detect client-death (by means of a time-out, or by detecting a broken socket) it can rollback the transactions owned by that client.

$\begin{algorithm} % latex2html id marker 3051 [!htp] \begin{algorithmic}[5] \par... ...of a rudimentary non-waiting, non reentrant transaction system.} \end{algorithm}$

Transactions can allow the nesting of operations. When, within one transaction another is started and the locks required to execute the inner transaction cannot be obtained, the outer transaction might abort also. From the programmer's point of view this is what we want, also from the point of view from the server transactions are good because the server will always reside in a valid state, something that cannot be guaranteed without transactions.

The problem of livelocks still remains of course. Fortunately, this can be remedied at the server side. The problem is no longer solely dependent on the network traffic or the behavior of the clients, but on the scheduling behavior of the server. For instance a server may decide to set a lock request in wait until the lock can be obtained, or it can return a lock-failure.

Depending on the context within which people talk about transactions there can be a difference between read-locks, write-locks, and the way locks are treated. If smaller locks are defined then there is a greater flexibility to optimize the concurrent access to resources. Similar to the granularity of resources, we can consider a lock to be a resource on its own.

Implementing a simple transaction system can be done by keeping in memory which locks belongs to which transaction. Once a lock is released the lock keeps on belonging to the original transaction. Only, when the transaction commits, all its locks are released. When the transaction aborts, all its locks are released and all changes to the resources are rolled back. Algorithm 15 illustrates the rudimentary semantics of a non-waiting, non recursive transaction synchronization interface. Such transactions can be easily used from a programmers point of view. We illustrate this by making the line actor movable (Figure 16). The reader interested in transactions can read [OHE96,CDK94].

$\begin{algorithm} % latex2html id marker 3072 [!htp] \begin{algorithmic}[5] \par... ...oving line algorithm using a non-waiting transaction interface.} \end{algorithm}$

5.10 Peer to Peer Concurrency & Distributed Transactions

WITHIN OPEN PEER TO PEER NETWORKS concurrency control and guarding is fairly difficult, because in a peer to peer system several components together provide a certain global behavior. So if one wants to take this global behavior from one correct state to another correct state, one needs to take all participating components within one operation to this new state.

To do so we need a transaction that spans multiple components and can commit all components or abort all components. This is called a distributed transaction. Normally distributed transactions are provided by one server that contacts the necessary components and commits or rolls back transactions as necessary.

**Figure 5.9:** Distributed Transaction server.
$\includegraphics[% height=6cm]{DistributedTxServer.eps}$

In figure 5.9 we see how component needs to go through a transaction server before it can start certain actions on a certain set of components. The transaction server takes care of transmitting all incoming messages to the other communication partners, possible mapping different transaction id's to the same number. In essence, all locking logic should go through this transaction server. The transaction server is available per group resources needed per client, which means that all components taking part in a certain session need to go through the same transaction server. So in fact we added another extra layer to solve concurrency problems between multiple components. (Pictured in figure 5.10).

The transaction manager per group is not difficult to conceive. It needs to access the transaction ports on all participating components and needs to have a mapping. This mapping ensures that when a beginTransaction comes in, the transactions started on all participating components will be represented by one transaction id. When a lock request comes in, the supplied transaction id should for every component be mapped to the correct effective transaction id. This constitutes a problem because a component is normally not aware of some sort of transaction manager. This means that all components need to agree to use the specified session transaction manager.

**Figure 5.10:** Synchronization layers within peer to peer systems
$\includegraphics[% width=1.0\textwidth]{SynchronisationLayers2.eps}$

Of course, there are problems with this setup. Other concurrency strategies worked well because their behavior was unambiguous. A transaction was either committed or not, a lock was obtained or not and so on. The problem we have here is that with a distributed transaction we can have transactions that are partially committed. We can have a situation where the transaction server starts committing all its transactions but fails to commit the last transaction because the involved component has died. In such a situation the resulting global state is invalid, even when the component recovers its old state. There is not much that we can do about this, except introducing extra redundancy. Distributed transactions are discussed in [CDK94].

5.11 Commonalities & Variabilities

WITHIN THE PREVIOUS SECTIONS we have introduced a number of concurrency strategies. All these different concurrency strategies can be described by a set of commonalities and variabilities. The common issue is the fact that we are talking about resources and about critical sections: how can we ensure a valid state transition ? Below we will present a set of variabilities which can define a concurrency strategy.

Syntactical: how a component calls another component, with which parameters, with what kind of symbols and names. Syntax as a common term refers to the structural aspects of a language. In our case we will simply stick to the symbols and data structures at hand.
Control flow: in what sequence do we need to send messages ? Will requests wait until they can return, or will they return something like 'try again later'. We have seen two examples of this: the waiting locking strategy and non-waiting locking strategy
Re-entrance: can the same lock be obtained multiple times or not. If it can, such as in the counting semaphore, the locking strategy is reentrant, in the other case, such as the binary semaphore, the locking strategy is non-reentrant.
Resources: what are the resources we are talking about, and more specifically, what is their granularity ? Can we only lock the complete whiteboard, can we lock lines or can we lock individual squares ?
Transition: how is time defined. This is important with respect to the state transitions. Is a state transition always in effect immediately, or is the transition effective after committing a transaction. If so, can we go back in time (roll back) and can we go forward in time (roll forward after recovery)
Layering: most of the time multiple basic synchronization mechanisms are layered, how this is done is another variability.

**Figure 5.11:** Projection of a hypercube illustrating the variabilities of a one layered concurrency guarding strategy.
$\includegraphics[% height=8cm]{OneLayeredVariabilityHypercube.eps}$

The above variabilities can be used to describe a concurrent strategy. We must now investigate how they relate to each other, which variabilities can be modified without impact on other variabilities and which variabilities are influenced when another variability is shifted. If two variabilities do not correlate they are said to be orthogonal.

Control flow can be modified without any immediate impact on the reentrancy of the concurrency strategy, so these two variabilities are orthogonal. The resources covered by a concurrency strategy are also independent of how the locks are offered, so resources are orthogonal to control flow and reentrancy. Transition of locks, whether they can abort or are in effect immediately is independent of the resources, control flow or reentrancy. Therefore, this criterion is also orthogonal to all others. Syntax changes when the resource granularity changes, or when the control flow changes, or sometimes even when the reentrancy changes. Therefore syntax is partly defined by all other axes. Syntax is not a pure orthogonal variability. Figure 5.11 illustrates this.

**Figure 5.12:** Projection of a hypercube of layered concurrency guarding strategies.
$\includegraphics[% height=8cm]{TwoLayeredVariabilityHypercube.eps}$

If we now need to take into account the different concurrency strategies we must understand that layering influences everything. Syntax changes to support an extra layer, resources change since extra resources are added, control flow changes because different layers interact with each other in different ways. Transition changes since lower layers need to be rolled back when an upper layer decides to roll back. Figure 5.12 illustrates this. The red arrow in the figure indicates an extra axis/dimension for layering. The axis we will work on on the other hand is slanted in the other 5 dimensions because it is impossible to change layering without interfering with other properties. On the other hand, we cannot express other properties solely by means of control flow, transition, reentrancy and resources, because an extra layer needs to reason about other resources.

However, this view on the different variabilities, should not be considered to be complete. It suffices for our goal, which is to select a set of concurrency conflicts. For instance, syntax should not necessarily define a separate variability, while a missing variability might be timing. We have limited ourselves to these because they capture a great number of possible concurrency strategies and because the resulting concurrency strategies can occur in practice.

5.12 Summary

IN THIS CHAPTER WE HAVE IDENTIFIED COMMONLY occurring concurrency problems based on a whiteboard on which multiple different actors can draw any figure they want. We will continue to use the whiteboard as a running example throughout the thesis.

When concurrency is, with any technique available, managed at a central place, and everybody adheres to the concurrency strategy, there cannot be much concurrency problems. The reason behind this is that the whole program uses the same concurrency strategy and this strategy is, when well designed, suitable for the program in question. Unfortunately, we saw that this requirement does not hold for open distributed systems.

The problem with concurrency is that it cannot be modularized. We cannot easily say where the concurrency management should be placed, nor can we offer one interface to the outside world which hides all our internal concurrency problems. When we place two deadlock-free programs in the same environment, it is possible to have a deadlock between both programs.

We investigated a number of problems with their solutions, and saw that these solutions in turn give rise to new problems:

problem: race conditions $\rightarrow$ solution: atomic operations
- solution: non-waiting atomic operations $\rightarrow$ problem: starvation
- solution: waiting atomic operations $\rightarrow$ problem: deadlocks
problem: performance $\rightarrow$ solution: locking per resource $\rightarrow$ problem: livelocks
problem: partial failure $\rightarrow$ solution: transactions
problem: multiple components $\rightarrow$ solution: distributed transactions.

We identified 6 parameters, which are sufficient to describe all the different approaches we have seen. They are: syntax, control flow, reentrancy, resources, transition and layering. In the next chapter we will use these variabilities to select a number of interface conflicts.

Werner 2004-03-07