Filling Delay Slots

Let's try our first attempts at optimization. We want to fill the delay slots, and in order to do so, we need to modify the order of the instructions, but at the same time ensure that we do not modify the logic of the program. Let's take one more look at the program first.s

/* first.s */ .global _main _main: save %sp, -64, %sp mov 9, %l0 sub %l0, 1, %o0 sub %l0, 7, %o1 call .mul nop ! wasted machine cycle sub %l0, 11, %o1 call .div nop ! wasted machine cycle mov %o0, %l1 mov 1, %g1 ta 0
We find that there are two wasted cycles in this program (represented by the two nops.) This program can be modified to be a bit more efficient.

.global _main _main: save %sp, -64, %sp mov 9, %l0 sub %l0, 1, %o0 call .mul sub %l0, 7, %o1 ! Delay slot filled call .div sub %l0, 11, %o1 ! Delay slot filled mov %o0, %l1 mov 1, %g1 ta 0

Branching in Sparc Assembly

Can we use any instruction to fill the delay slot? The answer is an obvious no. We cannot use any instruction that modifies the logic of the program, or affects the outcome of the branch. Branching in Sparc assembly is based on the value of certain condition codes. There are four condition codes in case of the SPARC architecture. They are :
Z  - set if the result was zero
N  - set if the result was negative
V  - set if execution resulted in number too large to store in register 
C  - if any carry out of the register after execution of instruction
Based on the value of each of these condition codes, a branch may be taken or not taken. There are only certain instructions in Sparc Assembly that set the condition codes. Once the condition codes are set, the branch outcome depends on the value of these condition codes. The condition codes remain in their state until they are overwritten by another instuction which modifies the condition codes. Hence, placing such an instruction in the delay slot of a branch or before the branch has actually taken place might not be a good idea.

The ordinary instructions do not set the condition codes. For example, the add instruction does not modify any condition codes. we need modified instructions. But we can use the modified addcc instruction which can be used to add two numbers, and at the same time, use the results of the addition to set the condition codes.

add r1, r2(or c), r3         ! no condition codes set

addcc r1, r2(or c), r3       ! condition codes are set


Let us assume for the sake of simplicity that the maximum number that can be stored in a register is 255, min. -256. i.e. We are assuming that the numbers are stored in a 9-bit register.

Which of the condition code variables (Z, C, V, N) will be set by the following statement:

addcc r1, r2(or c), r3

  1. If r1 is 10, r2 is -10
  2. If r1 is 20, r2 is 40
  3. If r1 is 240, c is 50
  4. If r1 is 2, r2 is -250

Control Structures.

High Level Languages contain a wide range of control structures; e.g., in C we have for loops, while loops, do loops, switch statements, and the goto statement. One aspect of Assembly Language's FLEXIBILITY is that it can implment all of these control structures easily. However, in order to have flexibility, we wind up with simplicity. The simplicity of Assembly Language can make it somewhat painful to program in.

Branching, Testing.

Branches are the control transfer instruction that you'll use most often. Branches are control transfers that NEVER return - they're like a "goto" in C.

Branch Instructions

The format is: b... label (b... is one of the branches testing integer condition codes)

The various branch mnemonics are:

ba              branch always, similar to goto

bn              branch never, similar to a nop

bl              branch on less than zero

ble             branch on less than or equal to

be              branch on equal to zero

bne             branch on not equal to zero

bge             branch on greater than or equal to zero

bg              branch on greater than zero
Note what complements (opposites) are:
Condition       Complement
---------       ----------
   bl              bge
   ble             bg
   be              bne
   bne             be
   bge             bl
   bg              ble
Less than what (for example)? The branches take effect based on condition codes stored in the control unit. These are bits that are modified by certain instructions; the ones you'll use most are
  addcc    reg,    reg_or_imm,    reg
  subcc    reg,    reg_or_imm,    reg
The condition codes are set based on the value of the last operand above. They are:
 Z - was it zero?
 N - was it negative?
 V - did it generate an overflow as number too large to be stored in register? 
 C - did it generate a carry-out? 

The Relationship between Assembly and C

As already mentioned, the role of assembly and machine language is to provide a small set of flexible operations that:

As such, it is an intermediary language. It's really not intended, in general, for programmer use.

The fact that SPARC machine language can be efficiently implemented in hardware is clear from the fact that processor clock speeds for SPARC chips are very fast. A typical SPARC CPU might run at 100 MHz. This means that the interpreter, running in hardware, can execute 100 million instructions each second.

To illustrate how SPARC assembly language is useful for implementing a wide range of control structures efficiently, we'll look at how common C control structure are implemented in SPARC assembly.

Along the way we'll get more flavor of assembly language programming and we'll see more examples on how to fill delay slots. Note that in each case, we start by filling each delay slot with a nop. Once we are sure the loop is correct, then we can move instructions around to replace nop's and fill the delay slots with useful instructions.

C control structures.

Here are the C Control structures that we'll examine:

Note: You can use these examples as templates for your own Assembly language programming; understand and copy them into your code when you want specific control structures.

Do....While loop

Consider the following C/C++ program:
/*  third.c  */ 

/* Calculate y= (x-1) * (x-7) / (x-11) for x = 0 to 10.  */ 

#define A2 1
#define A1 7
#define A0 11

   int x = 0, y;

   do {
        y = ((x-A2) * (x-A1) / (x-A0) ;
        x++ ;
        }   while (x < 11) ;

Now let's see the assembly code corresponding to this HLL code:
/*  third.m  */ 

        define(a2, 1)
        define(a1, 7)
        define(a0, 11)

        define(x_r, l0)
        define(y_r, l1)

        .global _main

        save    %sp, -64, %sp
        clr     %x_r

        .global loop
        sub     %x_r, a2, %o0
        call    .mul
        sub     %x_r, a1, %o1
        call    .div
        sub     %x_r, a0, %o1
        mov     %o0, %y_r

        add     %x_r, 1, %x_r
        subcc   %x_r, 11, %g0
        bl      loop

        mov     1, %g1
        ta      0

After passing through the macro-processor m4, we get the actual assembly code as shown below:

/*  third.s  */

        .global _main

        save    %sp, -64, %sp
        clr     %l0

        .global loop
        sub     %l0, 1, %o0
        call    .mul	       ! branch instruction
        sub     %l0, 7, %o1    ! delay slot	
        call    .div	       ! branch instruction
        sub     %l0, 11, %o    ! delay slot1

        mov     %o0, %l1
        add     %l0, 1, %l0    ! x++

        subcc   %l0, 11, %g0   ! check for x less than 11
        bl      loop           ! actual branch instruction

        mov     1, %g1
        ta      0

Problem: The final delay slot cannot be filled...WHY ???

Well, the answer is that there is no instruction that can be moved into the delay slot, which will not modify the logic of the program. The add instruction modifies the register l0 so it cannot be moved in the delay slot.

Solution: We can modify the order in which the instructions will be executed, as shown below:

/*  third.1.s  */

        .global _main
        save    %sp, -64, %sp
        clr     %l0

        .global loop

        sub     %l0, 1, %o0
        call    .mul
        sub     %l0, 7, %o1
        call    .div
        sub     %l0, 11, %o1

        add     %l0, 1, %l0
        subcc   %l0, 11, %g0

        bl      loop
        mov     %o0, %l1       ! nop eliminated

        mov     1, %g1
        ta      0
Here, the mov instruction has been moved in the delay slot, as it did not change the logic of the program.

While loop

While loops are the most common high level language (HLL) construct, but are actually somewhat tricky at the assembly language level. We will develop an efficient while loop in assembly language in a series of steps. Consider this fragment of C code:

Example from the book:

/*  fourth.c  */

while (a <=17)
   a = a + b ;
   c ++ ;

Corresponding assembly language code:

/*  fourth.s  */

        cmp    %a_r, 17
        bg       done
        add     %a_r, %b_r, %a_r
        add     %c_r, 1, %c_r
        ba       test

How many instructions are executed for each pass thru the loop? Answer: 7. First observation: we have two control transfer instructions, but only one loop. We should test the condition at the bottom of the loop, and leave the loop then.

To remove the extra unconditional branch instruction:

/*  fourth.1.s  */

        cmp     %a_r, 17
        bg      done
        add     %a_r, %b_r, %a_r
        add     %c_r, 1, %c_r
        cmp    %a_r, 17
        ble      loop

Now how many instructions in loop? 5. A 29% decrease in running time.

Now, why replicate code? Simply branch to the bottom of the loop to get started:

/* fourth.2.s */ ba test nop loop: add %a_r, %b_r, %a_r add %c_r, 1, %c_r test: cmp %a_r, 17 ble loop nop done:

Decreased code length *and* the loop is more maintainable this way. Now let's remove the nop.

Removal of the first nop

/* fourth.3.s */ ba test cmp %a_r, 17 loop: add %a_r, %b_r, %a_r add %c_r, 1, %c_r cmp %a_r, 17 test: ble loop nop done:

Now we would like to eliminate the nop inside the loop (more important than the first nop). Can we move an instruction from the loop body into the delay slot? Say, like this:

/* fourth.4.s */ ba test cmp %a_r, 17 loop: add %c_r, 1, %c_r cmp %a_r, 17 test: ble loop add %a_r, %b_r, %a_r done:
(A rule for debugging loops: always test what happens if the loop executes no times, one time, and two times.) The one time and two times case is OK, but the no times case results in a being modified, which it shouldn't.

To address loops, branch instructions can be annulled. An annulled branch will not execute the delay slot instr if the branch is not taken. The basic idea is the the delay slot can be part of the loop this way, and when the loop isn't taken, the delay slot isn't executed either.

So our final, correct and efficient version with second nop also removed is:

/* fourth.5.s */ ba test cmp %a_r, 17 loop: add %c_r, 1, %c_r cmp %a_r, 17 test: ble,a loop add %a_r, %b_r, %a_r done:

If the branch is not taken, then the statement after the branch (in the pipeline) is not executed, it is annulled.

Now we have four instructions inside the loop. This makes sense: two to do the work, one to set the condition, and one to conditionally branch.

It's probably simplest just to copy this example whenever you need a while loop --- use it as a template.

For loops

The for loop is defined in terms of a while loop:

for (ex1; ex2; ex3) st This is the same as : ex1; while ( ex2 ) { st ex3; }
So, to translate something like:

/* fifth.c */ for (a = 1; a <= b; a++) c *= a;
We would wind up with (after filling delay slots!):

/* fifth.s */ ba test mov 1, %a_r ! a = 1 loop: call .mul mov %c_r, %o1 mov %o0, %c_r add %a_r, 1, %a_r test: cmp %a_r, %b_r ble,a loop mov %a_r, %o0
Note that the overall structure of the embedded while loop is evident.

For class 8 notes, click here

For more information, contact me at