Floating Point

So far we have deferred the discussion of floating point; it is important enough to deserve a careful treatment of its own. We'll discuss it at some length now.

Integers are a great thing, but they have two main limitations:

  1. they cannot represent fractional quantities;
  2. they cannot represent a wide range of values (very large and very small magnitudes) without an enormous number of bits.

We'll first talk about how to address the fractional-quantity issue.

If we want to represent fractions, we could do so just like we do in the decimal system: include places with values less than 1. We do this by introducing a binary point (just like a decimal point) and then numbering positions to the right of the binary point with negative exponents:

                                         binary point
                                              |
         2^7  2^6  2^5  2^4  2^3  2^2  2^1  2^0 . 2^-1  2^-2  2^-3
Now we can represent, e.g., 2.5 as

        10.1 - which is (1 * 2) + (1 * 1/2).

The nice thing is that we can still store these numbers as integers -- we just change the way they are interpreted on input and output. Numbers that are stored internally as integers, but are given a binary point for input and output, are called fixed-point numbers. They are fixed point because we assume that the binary point is fixed at some position.

For example, we could have a 32-bit fixed-point representation in which the binary point is to the right of bit number 1, or a fixed-point representation in which we assume the binary point is to the left of bit number 31 (so that all 32 bits are fractional).

What are we giving up by getting the ability to represent fractions? We are giving up range. For example, in the last case, we can only represent quantities between zero and one. In fact, if the binary point is to the right of bit number B, then we can represent (using unsigned representation) the numbers from 0 up to just under 2^(32-B). To interpret any bit pattern, just take its value as an unsigned integer and divide it by 2^B. So one way to think of a fixed-point number is as an integer times some negative power of two:

                A * 2^(-B)
How is arithmetic done with fixed-point quantities? Addition is straightforward - you can simply add the two bit patterns as integers and the result has the binary point in the same place. However, multiplication is more tricky. Consider:
        x * 2^(-B)  *   y * 2^(-B)   =   xy * 2^(-2*B)
which means that the binary point is now at position 2B, rather than position B as before. Here is an example. Using a fixed-point representation in which the binary point is to the right of bit number one, multiply 3.5 by 2.5:
        11.1  = 3.5  (note that 111 = 7, and 3.5 = 7/2)
        10.1  = 2.5  (note that 101 = 5, and 2.5 = 5/2)

Now, multiplying 111 by 101, we get 100011, so the binary point needs to go at position 2 - 1000.11 - to give the right answer of 8.75.


Putting the binary point in position two changes our encoding, though, so we need to shift the result one place to the right -- dropping the low-order bit -- to get the binary point back in the right place:

        1000.11  ->  1000.1
This is called renormalization - putting the result of a calculation into the form expected for storage.

So the result we get is actually 1000.1, which is 8.5 - an error of 0.25. This occurs because with only one place to the right of the binary point, we can only represent 8.5 or 9.0 -- nothing in between.

Some terms:

        precision       the number of digits in a representation
        accuracy        how close a representation is to the true value
        ulp             unit in the last place: the value of a one in the
                        least-significant digit of a representation
        relative error  the absolute error divided by the true value

Note that our representations may not be accurate for a number of reasons: limited precision, limited range, and errors introduced by calculation.

Now we return to support for a wide range of values as well as fractional values.

Clearly representing numbers as different as 10^23 and 10^-23 using place value is very expensive - we'd need a lot of bits. So we take a lesson from the way scientists and engineers represent numbers - scientific notation. In scientific notation we represent a number as

        significand * base^exponent
Since the exponent is a much smaller number than the value itself, we save a lot of storage this way. In all floating point representations the base is fixed (almost always it is 2), so we don't need to store the base either.

Thus, another definition of precision would be: the number of digits in the significand.

To completely specify a floating point representation, we need four numbers: the base (B), the precision (p), e-min, and e-max (smallest and largest exponents).

Floating point representations are by nature inexact. There are two reasons for this inexactness:

  1. A real number may lie in between two floating point representations;
  2. A real number may be larger than the largest or smaller than the smallest f.p. representation.

Many representations could be used for the same number:

        3.00 * 10^2 
        0.30 * 10^3
        0.03 * 10^4
all of these encode the number 300. However, it is a bad thing to have multiple representations for the same number, because of equality comparisons. We want each number to have only one representation so that we can compare for equality simply by comparing bit patterns.

For this reason we define what is called normalized form for a f.p. number. A number is in normalized form if its leftmost digit is non-zero (of course in binary this means that the leftmost digit is 1). Normalized form allows our f.p. representations to be unique.

Of course, this makes it impossible to represent zero, so we define zero as a special case. We use the exponent (e-min - 1) to signify that the number is zero. This is a good choice, since it will be smaller than any number and so won't need special treatment during comparisons. Of course it will need special treatment during arithmetic -- if the exponent is equal to e-min - 1 the number has to be treated as if it were zero.

Floating Point Addition

How is addition performed in a floating point system? Remember that fp is just the combination of a fixed-point representation with a base and exponent. Addition and subtraction:
        1) Bring significands into alignment (same exponent)
        2) Do fixed-point (ie integer) addition / subtraction
        3) Do floating-point normalization

For example, using a three-digit decimal significand:
        x = 2.15 * 10^12 =   2.15                  * 10^12
      - y = 1.25 * 10^-5 = - 0.0000000000000000125 * 10^12
                             2.1499999999999999875 * 10^12
which, when normalized, yields 2.15 * 10^12.

Another example:

        x = 2.15 * 10^12 =   2.15                  * 10^12
      - y = 1.25 * 10^12 = - 1.25                  * 10^12
                             0.90                  * 10^12
which, when normalized, yields 9.00 * 10^11.

Calculation Error

Now, how about the case of 10.1 - 9.93? This is
        x = 1.01 * 10^1
      - y = 0.99 * 10^1
    x - y = 0.02 * 10^1
which, when normalized, is 2.00 * 10^-1. However, the true (accurate) answer is 1.70 * 10^-1. Our calculated answer is off by 30 ulps, and is wrong in every position!

What happened? Well, notice that when we aligned the significands, we were forced to drop a digit from the representation of y. We lost an important digit: because x and y are very close in value, they differ significantly only in their last digits.

Floating Point Numbers on the SPARC

Because fp operations are so complicated, designers usually set aside a special piece of hardware, outside the CPU, to do fp calculations. It is called the Floating Point Unit (FPU). This allows the CPU to run faster as long as it is performing non-fp operations. Communication between the CPU and FPU is by means of special fp registers, which are contained in the FPU but can be written and read by the CPU.

When the CPU fetches an fp instruction, it simply passes it to the FPU for execution. Most fp instructions take longer than a single cycle, so the CPU goes on ahead and fetches and executes subsequent instructions while the FPU is working.

This is called instruction-level parallelism. The CPU and the FPU can be working in parallel (that is, simultaneously) on different instructions.

Now, what happens if we try to read a register before the FPU is done, so the register doesn't yet have the value we expect? The FPU marks a register as invalid as soon as it is known to be the destination of an instruction. If the CPU tries to store from that register, it stalls (simply waits) until the FPU clears the invalid flag. So the programmer needn't worry about getting incorrect data, but the program might run slowly without the programmer being able to tell why.

The FPU can also compare fp numbers. It has its own set of condition codes, and the CPU can test those fp condition codes by means of special conditional branches, called floating-point branch instructions.

The FPU has 32 registers, %f0 through %f31. There are no register windows in the FPU. To access the registers, use the regular load and store instructions: ld and st. So we could write code such as:

        ld      [%fp + 64], %f2
        set     var_m, %o0
        st      %f0, [%o0]
There are single, double, and extended precision fp ops. We will just cover the single precision ops:
        fmovs           moves data between fp registers
        fnegs           negate
        fabss           absolute value  
        fitos           integer to single float
        fstoi           single float to integer
        fcmps           compare
There is an additional case to consider for comparisons. Suppose one of the operands is NaN. Then, we cannot establish an ordering, so there is an unordered case as well as a less than, equal, and greater than.

Final Remarks on Floating Point

ANSI C defines float, double, and long double. There is not guaranteed to be any relationship between these and the types defined in IEEE 754, because computers with C compilers may not necessarily support IEEE 754. However, on the SPARC, the correspondence is as follows:
        ANSI C          IEEE 754
        ------          --------
        float           single precision
        double          double precision
        long double     extended precision
Here is an example program that illustrates some floating point error issues, using ANSI C.
#include <stdio.h>

int main(void)
{
     long double d;
     float pi = 3.14159265358979323846f;
     double dpi = 3.14159265358979323846;
     long double ldpi = 3.14159265358979323846L;

     printf("        actual = 3.14159265358979323846\n");
     printf("      float pi = %.7f\n", pi);
     printf("     double pi = %.16f\n", dpi);
     printf("long double pi = %.35Lf\n", ldpi);

     printf("      float pi/10.0f = %.7f\n", pi/10.0f);
     printf("     double pi/10.0  = %.16f\n", dpi/10.0);
     printf("long double pi/10.0L = %.35Lf\n", ldpi/10.0L);

     printf("      float pi*0.1f  = %.7f\n", pi*0.1f);
     printf("     double pi*0.1   = %.16f\n", dpi*0.1);
     printf("long double pi*0.1L  = %.35Lf\n", ldpi*0.1L);

     d = ldpi*0.1L - ldpi/10.0L;
     printf("ldpi*0.1L - ldpi/10.0L = %.35Lg\n", d);
     return 0;
}
When we run this program on a SPARC machine, we get:
        actual = 3.14159265358979323846
      float pi = 3.1415927
     double pi = 3.1415926535897931
long double pi = 3.14159265358979323845999999999999997

      float pi/10.0f = 0.3141593
     double pi/10.0  = 0.3141592653589793
long double pi/10.0L = 0.31415926535897932384599999999999998

      float pi*0.1f  = 0.3141593
     double pi*0.1   = 0.3141592653589793
long double pi*0.1L  = 0.31415926535897932384600000000000003

ldpi*0.1L - ldpi/10.0L = 4.8148248609680896326399448564623183e-35
This example illustrates a number of points about IEEE 754 and ANSI C. First of all, the default type for a floating-point constant is double; to get a float or a long double you need to use a suffix (f or L). Second, there are two kinds of rounding errors shown:
  1. the variable has less precision than the input, so it is accurate to within 1/2 ulp;
  2. the variable has more precision than the input, but can't express the input exactly.
Finally, look at the final value calculated. The true value should be zero, but due to the inaccuracy in representing 0.1, the calculated value is nonzero. So the relative error is infinite, and the calculated value is wrong in every place.
