CLASS 18

Subroutines

Extending the Instruction Set.

The modern computer is a layered set of virtual machines, as discussed in the first lecture. This idea of building virtual machines based on new instruction sets is as useful to the everyday programmer as it is to the computer system builder. In fact, it is so useful that any modern computer must give a programmer a way to create new instructions. These new instructions are called subroutines.

Using subroutines, programmers create new instructions, and therefore new virtual machines, that are specifically designed to solve the problem at hand. This allows the programmer to "raise the abstraction level" when solving a problem in the same way that system designers raise the abstraction level to bridge the semantic gap between gates and wires on one hand and high level languages on the other hand.

Subroutines are so important and are used so often that we have fairly complicated instructions in assembly and machine language to support them. That is, the mechanisms necessary to implement a subroutine are fairly complicated, and even though we want assembly language to be simple (flexible and fast), subroutines are important enough that we are willing to have fairly complex instructions to support them.

What mechanisms do we need to support subroutines? More fundamentally, what characteristics should a subroutine have?

A subroutine is intended to add a new instruction to our instruction set. Therefore, it should exhibit:

In-line execution. It should follow other instructions and precede other instructions in sequence. That is, control should flow naturally to and from it just like any other instruction. It should "look" like an instruction to the programmer.
No unintended side effects. Executing a subroutine shouldn't affect other parts of our program if we don't want it to. For example, the data that are in our registers shouldn't change just because we execute a subroutine.

Additionally, we like to generalize the notion of instruction a little bit for subroutines:

Subroutines should be allowed a varying number of arguments (rather that just 1, 2, or 3 as in SPARC assembly).

What do we need to support these three characteristics?

For in-line execution, we need a mechanism to jump to a new location, execute a bunch of instructions, and then jump right back to where we came from. Normal branch instructions don't provide this: we could always branch to the beginning of a subroutine, but when we get the end, how do we know where to branch back to? After all, the subroutine could be called from many different locations.
Need: a call/return mechanism
For isolating unintended side-effects, we need a way to protect the contents of our registers so that they are unchanged by subroutine execution. We can save and restore the contents of registers, or we can get new registers for the subroutine (which is what SPARC does).
Need: a register protection mechanism
For passing a variable number of arguments, we need an agreement between the caller and the callee on how arguments are to be passed.
Need: an argument passing convention

The solution that any particular machine architecture takes to these problems is called the machine's method of subroutine linkage. Note that subroutine linkage should and must be defined at the assembly language level. It should be defined there because subroutine linkage is important enough to require hardware implementation. It must be defined there because subroutines are common to many languages and we want to be able to call subroutines written in different languages.

To look at each of these three requirements, we will now describe subroutine linkage on the SPARC. The three requirements for implementing subroutines derive from their role as new instructions. They are:

a call/return mechanism
a register protection mechanism

an argument passing convention

Here's how these three requirements are met on the SPARC.

SPARC Subroutine Linkage

1. Call/Return mechanism. The SPARC architecture provides two special instructions to implement call/return -- named, oddly enough, "call" and "ret".

The call instruction does two things: it branches to the location specified, and it saves the current program counter into a special register, %o7. The call instruction is a control transfer instruction, so it is followed by a delay slot.

The ret instruction simply branches to the location stored in %o7 + 8. Why the "+ 8"? Each instruction is four bytes long, and we want to branch back the instruction following the delay slot following the original call.

2. Register protection mechanism. One way to protect the contents of registers during a subroutine call would be to save all of the registers by storing their contents in main memory before the subroutine starts, then copy them back from main memory when the subroutine finishes. This is the method that is used in most modern processors.

The SPARC architecture adopts a slightly different solution. Recall that the programmer has access to 32 registers in four groups: global, in, out, and local. In fact, these registers are only part of a much larger group, typically 128. This larger group is called the register file and is arranged into sets of 16 registers, called windows. At any point in time, the processor is using one particular register window. Each window has a number, and the number of the current window is stored in a special register called the CWP (current window pointer).

The scheme looks like this:

        ==================
        |                |    <- 8 registers
        -----window 7-----
        |                |    <- 8 registers
        ==================
        |                |    <- 8 registers
        -----window 6-----
        |                |    <- 8 registers
        ==================
        |                |    <- 8 registers
        -----window 5-----
        |                |    <- 8 registers
        ==================
                .
                .
                .

        ==================
        |                |    <- 8 registers
        -----window 0-----
        |                |    <- 8 registers
        ==================

Say the contents of the CWP is 6. Then, the registers being used are taken from window 6 and half of window 5:


        ==================
        |                |    
        -----window 7-----
        |                |    
        ==================
        |                |    <- in registers (%i0 - %i7)
CWP ==> -----window 6-----
        |                |    <- local registers (%l0 - %l7)
        ==================
        |                |    <- out registers (%o0 - %o7)
        -----window 5-----
        |                |    
        ==================
                .
                .
                .

        ==================
        |                |    
        -----window 0-----
        |                |    
        ==================

Register windows are used for register protection as follows. To get a new set of registers for use in a subroutine, one executes a "save" instruction. The save instruction decrements the CWP by one. The effect is to:

provide a new set of local registers
provide a new set of out registers
make the old "out" registers into the new "in" registers.

We'll discuss why the "out"s become the "in"s in a little bit. Note that what we have done with the save instruction is simply rename registers so that we are accessing new registers using the same old names.

Where are the global registers (%g0 - %g7)? They don't participate in this scheme at all. Therefore global registers are NOT protected across subroutine calls -- they may be changed unexpectedly by a subroutine call.

This process is reversed by the restore instruction, which increments the CWP by one. So upon entry to a subroutine, the program should execute a save instruction, which gets a new register set for use in the subroutine. Before returing from a subroutine, the program should execute a restore instruction, which returns to the old register set, with their contents still intact.

Thus, a subroutine call gets executed as follows:

        caller          callee
        ------          ------
          *
          *
        call foo
        <delay slot>
                        foo: save %sp, -64, %sp
                                *
                                *
                                *
                                *
                             ret
                             restore
          *
          *
          *

Don't worry about the arguments to "save" right now. Note that a "ret" is a control transfer, so has a delay slot, which is always conveniently filled by the "restore" instruction.

Window Overflow and Measurement-Based Design

The previous operation works fine, up to a point. We get sixteen new registers every time we execute a save instruction, and we get our old registers back just by executing a restore instruction. However ... what happens when we run out of register windows?

How would we run out of register windows? When the program is started, the ID of the last free register window (the one right before the first register window used) is stored in another special register, the WIM (window invalid mask). The CWP is always incremented and decremented modulo the number of windows -- that is, the CWP "wraps around" from 7 to 0 and from 0 to 7 whenever necessary. However, if incrementing the CWP will ever cause it to be equal to the WIM, then we have run out of register windows.

This could happen whenever subroutine calls are deeply nested. Thus it is quite likely to happen when executing recursive subroutines.

To handle window overflow, the CPU transfers control to the operating system. This is the first example we have encountered of an operating system function. We will discuss later on exactly what the operating system is, and how the request is made to handle the window overflow. For now, just note that the OS does the following:

it copies the contents of the register window pointed to by the WIM into main memory,
it increments the WIM, and
it returns control to the point in the program where the window overflow occurred.

So, window overflow is handled by storing register contents in memory and the letting the program continue, using the newly-emptied, new register window.

You can imagine that it takes a significant amount of time to handle a window overflow. The process of transferring control to the operating system, and storing register contents into memory takes time. If window overflow is slow, why use register windows?

The answer is that the designers of the SPARC architecture determined that window overflow would occur relatively rarely, if there were enough register windows. How many is enough? There is a tradeoff here. More register windows would make the processor execute faster in some cases, but more register windows would make the chip cost more also.

To answer this, the SPARC designers measured real programs. They determined the maximum nesting depth of subroutines in a majority of programs. This was determined to be about 8 (although later designs have more windows just for good measure). As mentioned, recursive routines will often exceed this nesting depth, but in practice recursive routines aren't all that common.

This sort of design is called measurement-based design. The SPARC designers measured real programs to guide their designs. Based on the measurements of real programs, they determined that 8 register windows would be enough to handle most programs most of the time. This is a very effective way to solve a difficult tradeoff decision, because it ensures that the resulting design will get the most performance for the least expense.

Note that using this mechanism, we allow subroutine calls to nest arbitrarily deep; it's just that subroutine calls that are nested less than 8 deep will occur much more quickly than those that are nested deeper. We are providing an "infinite" capability, but only making efficient a very limited, finite part of that capability. We gain the flexibility of an unlimited capability, that nearly always is very efficient, while only paying for a limited capability. In system design this is called optimizing the common case. This is a very important notion - we will return to it again and again.

3. Argument passing convention.

Finally, we need to look at the convention for passing arguments to subroutines (and getting return values) in the SPARC architecture.

In most architectures, argument passing is performed using main memory. That is, before a subroutine call, the arguments are placed in memory. Once the subroutine starts, it must then fetch the arguments from memory to operate on them.

The SPARC designers noted that, since data need to be in registers in order to be used by the CPU, it is wasteful to store data into memory before a subroutine call only to immediately fetch them back again from memory during the subroutine. To address this they designed register windows so that they "overlap".

As shown above, when a subroutine call is made, the "out" registers of the caller become the "in" registers of the callee. This is the principal method of passing arguments to subroutines. Before a subroutine call, the caller must place any arguments into registers %o0 through %o5. After the save instruction is executed, these become registers %i0 through %i5 to the subroutine. The subroutine simply uses those registers for its arguments.

Most languages allow a subroutine to have a return value. Return values are placed by the subroutine in register %i0, which becomes %o0 after the restore instruction. So to the caller, the return value from a subroutine appears in %o0.

By the way - only %o0 through %o5 are used to pass arguments to subroutines. %o6 is used for another purpose which we'll cover soon; and %o7 holds the return address of the subroutine as discussed above.

There is a convention for argument passing in C that you should know. The arguments are placed in %o0 through %o5 in order, when reading the C argument list from left to right.

Another Example of Optimizing the Common Case.

What about the case in which a subroutine has more than six arguments? In that case, the SPARC architecture specifies that the extra arguments be placed in memory before the call, and retrieved from there during the subroutine. So, if we are going to use memory anyway, why pay the cost of implementing argument passing in register windows -- why not use memory all the time?

The answer: "optimize the common case." The SPARC designers determined that most subroutines in most programs had six arguments or less. Therefore, almost all the time the arguments will fit into the six out registers, and no data will need to be stored into memory. Once again, we gain the flexibility of an unlimited capability, that nearly always is very efficient, while only paying for a limited capability.

Summary

The mechanisms for subroutine linkage are complicated, but the expense is justified because of the importance of subroutines as an abstraction mechanism. We want subroutines to be fast so that they will be widely used, and because there is a strong demand for them.

We also want subroutine linkage to be defined at the assembly language layer, rather than at a higher layer. This allows different programming languages to call subroutines ("virtual instructions") written in other languages. In fact, subroutines are the main method for combining code written in different languages.