Multiprocessor Semaphore

Shared memory semaphores are basic tools for interprocessor synchronization. Although
self-imposed design constraints can often reduce synchronization requirements, semaphores offer significant flexibility to multiprocessor system designers. The implementation presented here illustrates some fundamental issues of multiprocessor concurrency and demonstrates the tremendous value of a multitasking OS like DSP/BIOS.
Many variations on this theme are possible. Most obviously, we can modify the semaphore to handle multiple tasks on one processor or tasks on more than two processors. And because our wait operation handles notification interrupts and task wake-up, we can implement any scheduling policies that make sense for more generalized versions.

MULTIPROCESSOR SEMAPHORE
==============================================
Multiprocessing architectures are becoming pervasive. DSPs are widely used in dense multiprocessor arrangements at the network edge, and systems-on-chip often include DSP cores to accelerate math-intensive computation. Although DSP/BIOS provides a standard, efficient, robust API for uniprocessor applications, we sometimes encounter situations where interprocessor synchronization mechanisms would be very useful. Here we will take a look at multiprocessor mutual exclusion and describe a method for implementing interprocessor semaphores using DSP/BIOS.
Many multiprocessor DSP systems are designed to share a physical pool of memory, in the sense that each processor sees the memory as directly addressable - a DSP might share a region of memory with a host processor or with other DSPs.
A common architecture uses a large region of single-port RAM shared by all devices, including the host, although arbitration issues complicate the hardware design. A second architecture uses dual-port RAM (DPRAM) between processors. The downside here is the relatively high cost and small storage capacity of these devices - large banks of expensive DPRAM are seldom practical. But in applications that use segmented data transport or small data sets, where small amounts of DPRAM are sufficient, this method is very useful. DPRAM is relatively fast, designer-friendly and, unlike FIFOs, can store shared data structures used for interprocessor communication.
Shared memory carries a caveat. When processors have on-chip cache or a system uses write posting, software designers must pay attention to shared-variable coherence. Depending on the processor, programmers can disable cache, use cache bypass, or flush cache to ensure that a shared location is in a proper state. The cache control API in TI's comprehensive Chip Support Library, for example, provides an ideal tool for cache subsystem management. Solutions to write-post delay problems are system-specific.
Our discussion here assumes that two processors use a common shared memory buffer to pass data or to operate cooperatively on a data set. In either case, one or more tasks on the processors might need to know the state of the buffer before accessing it, and possibly to block while the buffer is in use. As in the case of single-processor multitasking, we need a mutual exclusion mechanism to prevent inappropriate concurrent operations on the shared resource. Let's start with a quick review of mutual exclusion to better understand multiprocessor issues.
Shared resource management is a fundamental challenge of multitasking. A task (or thread, or process) needs the ability to execute sequences of instructions without interference so it can atomically manipulate shared data. These sequences, known as critical sections, are bracketed by entry and exit protocols that satisfy four properties - mutual exclusion, absence of deadlock, absence of unnecessary delay and eventual entry (no starvation). Our focus here is mutual exclusion - the remaining properties are detailed in any number of textbooks and will be satisfied by our multiprocessor semaphore.
Relative to a shared resource, mutual exclusion requires that only one task at a time execute in a critical section. Critical section entry and exit protocols use mechanisms such as polled flags (often called simple locks or spin locks) or more abstract entities such as blocking semaphores. Simple locks can be used to build protection mechanisms of greater complexity.
Binary semaphores
The semaphore is a system-level abstraction used for interprocess synchronization. It provides two atomic operations, wait (P) and signal (V), which are invoked to manipulate a non-negative integer within the semaphore data structure. The wait operation checks the value of the integer and either decrements it if positive or blocks the calling task. The signal operation either unblocks a task waiting on the semaphore or increments the semaphore if no tasks are waiting. A binary semaphore, with value limited to 0 and 1, can be used effectively by an application to guard critical sections.
A multiprocessor semaphore can be implemented by placing its data structure in shared memory and using RTOS services on each processor to handle blocking. Before outlining an implementation, let's look at two aspects of semaphores that cause complications in a multiprocessor environment. One is low-level mutual exclusion to protect shared data within a semaphore and the other is wake-up notification when a semaphore is released.
Low-level mutual exclusion
At its core, a semaphore has a count variable and possibly other data elements that must be manipulated atomically. System calls use simple mutual exclusion mechanisms to guard very short critical sections where the semaphore structure is accessed. This prevents incorrect results from concurrent modification of shared semaphore data.
In a uniprocessor environment, interrupt masking is a popular technique used to ensure that sequential operations occur without interference. Interrupts are disabled at the entrance to a critical section and re-enabled on exit. In a multiprocessor situation, however, this isn't an option. Even if one processor could disable the interrupts of another (rarely the case), the second processor would still execute an active thread and might inadvertently violate mutual exclusion requirements.
A second technique uses an atomic test-and-set (or similar) instruction to manipulate a variable. This variable might be the semaphore count itself or a simple lock used to guard critical sections where semaphore data is accessed. Either way, a specialized instruction guarantees atomic read-modify-write in a multitasking environment. Although this looks like a straightforward solution, test-and-set has disadvantages in both uniprocessor and multiprocessor scenarios. One drawback is dependence on machine instructions. These vary across processors, provide only a small number of atomic operations and are sometimes unavailable. A second problem is bus locking. If multiple processors share a common bus that doesn't support locking during test-and-set, processors might interleave accesses to a shared variable at the bus level while executing seemingly atomic test-and-set instructions. And a third problem is test-and-set behavior in multi-port RAM systems. Even if all buses can be locked, simultaneous test-and-set sequences at different ports might produce overlapped accesses.
Now consider two approaches that are very useful in shared memory scenarios. One relies on simple atomic hardware locks and the other is a general-purpose software solution known as Peterson's algorithm.
Hardware flags
In shared memory systems, hardware-assisted mutual exclusion can be implemented with special bit flags found in multi-port RAMs. DPRAM logic prevents overlap of concurrent operations on these hardware flags, forcing them to maintain correct state during simultaneous accesses. And because processors use standard-issue read/write instructions to manipulate the flags, special test-and-set-like instructions are not required. But this is still a limited solution - software engineers often encounter shared memory systems that lack this feature. So let's take one more step to arrive at a general-purpose hardware-independent method.
Peterson's Algorithm
Peterson's algorithm, published in 1981, provides an elegant software solution to the n-process critical section problem and has two distinct advantages over test-and-set spin locks. One is that atomic test-and-set is not required - the algorithm eliminates the need for special instructions and bus locking. The other is eventual entry - a task waiting for entry to a critical section won't starve in a weakly fair (typical) scheduling environment. Although Peterson's algorithm looks deceptively simple, it's a culmination of many attempts by researchers to solve the critical section problem.
The following pseudo-code shows the entry and exit protocols used to enforce two-process mutual exclusion. Note that Peterson adds a secondary "turn" variable - this prevents incorrect results from race conditions and also ensures that each waiting task will eventually enter the critical section.
Listing 1: Peterson's Algorithm

initialization
    P1_wants_entry = P2_wants_entry = FALSE
    turn = P1

task P1
    P1_wants_entry = TRUE                       /* set lock */
    turn = P2                                   /* grant turn to other task */
    loop while (P2_wants_entry and turn == P2)  /* busy-wait for lock */

    critical section                            /* execute critical section */

    P1_wants_entry = FALSE                      /* release lock */

task P2                                         /* same logic as P1 */
    P2_wants_entry = TRUE
    turn = P1
    loop while (P1_wants_entry and turn == P1)

    critical section

    P2_wants_entry = FALSE
We can easily imagine situations where more than two processes try to enter their critical sections concurrently. Peterson's algorithm can be generalized to n processes and used to enforce mutual exclusion between more than two tasks. And other n-process solutions, such as the bakery algorithm, are readily available in computer science textbooks. Our discussion here is limited to the two-process case only for clarity and brevity. Pseudo-code for the n-process Peterson algorithm can be found on the Electric Sand website.
Blocking mechanism
Now that we have a low-level mutual exclusion tool to safely manipulate shared data within a semaphore, consider the other key ingredient of semaphores, blocking. Assuming that each processor runs DSP/BIOS or another multitasking OS, we develop our wait operation using services that are already available on each individual processor. DSP/BIOS provides a flexible semaphore module (SEM) that we use in our implementation.
When the owner of a uniprocessor semaphore releases it with a signal system call, the local scheduler has immediate knowledge of the signal event and can unblock a task waiting on the semaphore. In contrast, a multiprocessor semaphore implies that the owner and the requestor can reside on different processors. Because a remote kernel has no implicit knowledge of signal calls to a local kernel, the remote kernel needs timely notification of local signal events. Our solution uses interprocessor interrupts to notify other processors of local activity involving a shared semaphore.
Semaphore implementation
This implementation of a multiprocessor binary semaphore (MBS) assumes that the hardware supports interprocessor interrupts and that a task won't try to acquire a semaphore while another task on the same processor owns it. The latter restriction simplifies the example and can easily be removed with some additional design work.
The wait operation
MBS_wait is invoked to acquire a shared memory semaphore. If the semaphore is available, MBS_wait decrements it and continues. If the semaphore is already owned, the requestor task blocks within MBS_wait until a release notification interrupt makes it ready-to-run. Once the interrupt occurs and higher priority tasks have relinquished the CPU, the task waiting on the semaphore wakes up within MBS_wait and loops to re-test it. Note that the task doesn't assume ownership immediately when unblocked. Because a remote task might re-acquire the semaphore by the time the requestor wakes up, MBS_wait loops to compete for the semaphore again.
When MBS_wait determines that a semaphore is unavailable, it sets a notification request flag in the shared semaphore data structure to indicate that the processor should be interrupted when the semaphore is released elsewhere in the system. To avoid a race condition known as the lost wake-up problem, MBS_wait atomically tests the semaphore and sets the notification request flag if the semaphore is unavailable.
Code for the wait operation is divided into two distinct parts. MBS_wait contains the blocking code and is called by an application. MBS_interrupt runs in response to the notification interrupt and posts a local signal to the task waiting on the semaphore. This is very similar to a device driver model, where the upper part of a driver suspends a task pending I/O service and the interrupt-driven lower part wakes it up.
The signal operation
MBS_signal releases a semaphore by incrementing its value and posting an interrupt to a processor that requested release notification. This causes MBS_interrupt to execute on the remote processor where a task is blocked on the semaphore. Note that this varies slightly from the uniprocessor signal operation described earlier where the semaphore is incremented only if no tasks are blocked.

Pseudo-code
Now that we have a notion of shared semaphore architecture, let's look at pseudo-code describing the wait and signal operations. Keep in mind that this example applies to a two-processor version where only one task at a time on each processor tries to acquire the semaphore. General implementations servicing more processors and sharing the semaphore between multiple tasks on each processor can be implemented by using the n-process Peterson algorithm and modifying the MBS operations.
Note that the critical sections, enforced by Peterson's algorithm (Peterson entry and Peterson exit), are very short instruction sequences used to manipulate the semaphore data structure. The details of Peterson's algorithm are not shown - these are implicit in the Peterson entry/exit operations. The lock and turn variables used in Peterson's algorithm are distinct from the semaphore data elements accessed in the critical sections.
The critical sections are preceded with DSP/BIOS TSK_disable calls to prevent task switching. A task switch during a critical section could cause another processor to spin indefinitely in Peterson entry if it tries to acquire the same semaphore. The critical sections should be executed as quickly as possible.
Also note that the example omits error checking, return values and timeouts. The pseudo-code is meant to highlight discussion topics rather than provide a detailed implementation template.

MBS_wait () {
    success = FALSE                    /* local variable, not part of semaphore */
    while (success == FALSE) {         /* repeat semaphore acquisition attempt */
        TSK_disable ()                 /* prevent DSP/BIOS task switch */
        Peterson entry                 /* Peterson's entry protocol */
        /* critical section begins */
        if (sem_value > 0) {
            sem_value = sem_value - 1
            success = TRUE
        }
        else {
            notification_request = TRUE
        }
        Peterson exit                  /* end critical section */
        TSK_enable ()                  /* re-enable DSP/BIOS scheduler */
        if (success == FALSE) {        /* local variable shows result */
            SEM_pend ()                /* sleep using DSP/BIOS semaphore */
        }
    }
}
MBS_interrupt () {
    SEM_post ()                        /* local wake-up signal using DSP/BIOS */
}
MBS_signal () {                        /* release the semaphore */
    TSK_disable ()
    Peterson entry
    /* critical section begins */
    sem_value = sem_value + 1          /* increment the semaphore */
    if (notification_request == TRUE) {    /* notify a remote task? */
        notification_request = FALSE   /* clear the request */
        send notification interrupt    /* yes - send an interrupt */
    }
    Peterson exit                      /* end critical section */
    TSK_enable ()
}
