CLASS 10

ASCII Representation of Characters

To represent characters, we can use their octal values as specified in the ASCII table.

Any number preceded by 0 is in octal
Any number preceded by 0x is in hexa-decimal

Also, we can use single or double quotes to represent characters. e.g.

mov  0140, %r1   // All these instructions have the same effect

mov  `a', %r1    // this not used frequently, used by assembler 

mov  "a", %r1    // frequently used

Character codes, ASCII.

So far we've been working with representing numbers inside the computer, and the binary number system provides some logical basis to guide the decisions. However, we need to represent text also, and there is no logical basis to guide us. And it shows.

Many different character codes have been used over the years. People have had a hard time settling on how many bits to use, and given a set number of bits, what characters to encode, and what the mapping should be from bit patterns to characters. Character codes based on 5 bits, 6 bits, 7 bits, and 8 bits have been used at various times. There are about 47 keys on a typical typewriter (which the computer keyboard is closely modelled after). Why are 5 bit (Baudot) and 6 bit (Honeywell) codes not very useful then?

So that leaves 7 bit and 8 bit codes. Until recently, both were still in use. One was used by IBM, and one was used by everybody else. The 8 bit code was used by IBM, and was called EBCDIC (extended binary coded decimal information code). IBM gave it up in the workstation era. The one that everyone uses now is a 7 bit code, called ASCII (American Standard Code for Information Interchange).

Note the simple relationship between uppercase and lowercase characters -- convert a lowercase to uppercase by subtracting 0x20.

Seven bits means 128 characters. We only need about 96 characters to encode everything on the keyboard. What to do with the other 32 codes? "Control" characters. These have been used for various formatting and in-band signalling functions in the past. Currently, about their only use is for formatting and data communications.

Control characters are those below 0x20. We often refer to them by the character obtained by adding 0x40 -- eg, control-C Important control characters:

char	mnemonic   common use
nul	0x00 	   to terminate strings, esp in C
etx	^C	   interrupt character on Unix
eot	^D	   to indicate end of file on Unix
bel	^G   	   rings a bell on many terminals
bs	^H	   signals a backspace
ht	^I	   signals a tab
lf	^J	   signals a linefeed -- line separator in Unix
ff	^L	   formfeed
cr	^M	   carriage return
dc1	^Q 	   used for line control
dc3	^S	   used for line control
esc	^[	   escape -- used to introduce a different character mapping

del	not a control character -- used to mean back up & erase

The numerical encoding of strings in mainly important for sorting operations. It's very convenient for our notions of lexicographic order to coincide with the numerical ordering of character codes. In addition, this explains why Big Endian memory organization is often favored: instead of comparing bytes (expensive) we can compare words (cheap) to determine lexicographic order of string data.

ASCII meets this definition, which explains (perhaps) why TA comes before Tanuj in the phone book.

Bit-wise logical operations : and, or, xor, andn, orn, xnor

Rather than using a whole 32 bit word to store just one boolean variable (though it's often done!) we can use each individual bit as a boolean variable. So the SPARC instructions operate bitwise on 32 bit words. We discussed some of these instructions earlier in the course (in week 2). We will now elaborate on some of the instructions.

The basic instructions are:

 and     reg, reg_or_imm, reg
 or      reg, reg_or_imm, reg
 xor     reg, reg_or_imm, reg

Rather than having a simple "not" instruction, it is combined with each of the above:

 andn    reg, reg_or_imm, reg
 orn     reg, reg_or_imm, reg
 xnor    reg, reg_or_imm, reg

Each of these first negates the second operand, then performs the instruction.

Besides, there are also the versions of these instructions that set the condition codes, besides performing the desired operation. These instructions are:

 andcc     reg, reg_or_imm, reg
 orcc      reg, reg_or_imm, reg
 xorcc     reg, reg_or_imm, reg
 andncc    reg, reg_or_imm, reg
 orncc     reg, reg_or_imm, reg
 xnorcc    reg, reg_or_imm, reg

Synthetic Instructions

Many other instrcutions do not exist, but are synthesized using the %g0 operator. They are known as Synthetic instructions or pseudo-instructions. %g0 register has the special property that it always contains the value 0.

Examples of Synthetic Instructions:

mov %x_r , %y_r    ==>   or  %g0, %x_r, %y_r
not %x_r, %y_r     ==>   xnor  %x_r, %g0, %y_r
not %x_r           ==>   xnor  %x_r, %g0, %x_r
clr %x_r           ==>   or  %g0, %g0, %x_r
cmp %x_r, 0        ==>   subcc %x_r, %g0, %g0
tst  %x_r          ==>   orcc %x_r, %g0, %g0

We just use these synthetic instructions for our convenience while writing assembly code.

We can also test for individual flags (which are nothing but bits in a word) using synthetic instructions. There are 4 bytes per word, i.e. 32 bits. We can check any of them, or modify them. These are some synthetic instructions that are used:

        bset    =       or
        bclr    =       andn
        btog    =       xor
        btst    =       andcc

btst reg_or_imm, reg -> andcc reg, reg_or_imm, %g0
bset reg_or_imm, reg -> or reg, reg_or_imm, %reg
bclr reg_or_imm, reg -> andn reg, reg_or_imm, %reg
btog reg_or_imm, reg -> xor reg, reg_or_imm, %reg

Example: to test if flags (bits) 2 or 3 are set in %l0:

        mov    12, %l1
        btst   %l1, %l0
        be     clear
        nop

Another example: to check if flags (bits) 0x10 and 0x6 are set in %x_r:

	btst 0x16, %x_r
	be clear
	nop
set:
clear:

The complete list of Synthetic instructions and Psuedo Operations is given in Appendix D of the text.

Representations.

A computer can't manipulate real-world objects. Some real-world things are only abstractions, like integers or real numbers, and so can't be manipulated directly. Other real world objects, like an airplane, a photograph, or a hurricane, are too difficult to manipulate directly. Instead of manipulating real world objects, the essence of what a computer does is manipulate their representations.

Whenever we want to perform a computation on an object, we need to decide on how to represent that object inside the computer. As designers of both hardware and software, how do we decide on the best representation of an object? The answer is that we choose the representation that is most efficient. A representation is efficient if it allows a lot of capability for little cost. Cost here means both dollars and time, so a representation is efficient if it requires very little space (dollars), supports fast operations (time), and supports a wide range of operations (capability).

Some objects, like integers and real numbers, are so important that their representations are fixed, built into the computer's hardware. Other objects, like a photograph of a hurricane, are represented in different ways depending on the problem, because different operations may be required for different problems. This is another example of a hardware/software tradeoff. Operations on simple things like integers are so common that we want them to be as fast as possible and we are willing to sacrifice ease of modification. Operations on things like airplanes are much more rare, and ease of modification of the representation is much more important, so we leave such problems to software.

We will now look closely at the design decisions involved in implementing the basic data types: integers, text, instructions, and real numbers. We will use as our examples the implementation of these basic data types in the SPARC architecture. Before we do that however, we need to look at the raw material we have to work with.

Digital Logic for data representation.

To represent a static piece of data, we need a device that can store a particular state. For this purpose digital computers use BITS. A bit is the current state of a BISTABLE device. A bistable device is something like a coin. Whenever the coin is lying on a table, either the head is up or the tail is up. Likewise, a bistable device is a combination of two or three transistors that are wired together so that electricity flows in one of two paths. One state is written as '1', the other state is written as '0'.

To represent an object that can have more than two states, we concatenate bits. Two bits can represent an object with 4 states: 00, 01, 10, 11. In general, n bits can represent an object with 2^n states.

Concatenated bits of various sizes have names:

	8  bits is a BYTE
	16 bits is a HALFWORD
	32 bits is a WORD
	64 bits is a DOUBLEWORD

When we are talking about bit patterns, it's convenient not to have to write out all those 1s and 0s. Usually we write bit patterns either in octal or hexadecimal. In octal, we group bits into groups of threes and the write the number (0 to 7) for each group. For 16 bits:

	1 010 001 111 110 101 = 0121765
	1   2   1   7   6   5

Hexadecimal is base-16 notation. We group bits in sets of 4 and use 0-9, a-f for the numbers 0-15.

	1010 0011 1111 0101 = 0xa3f5
	   a    3    f    5

It's important to note at this point that these are just bit patterns. We don't know what they mean, if they mean anything, until we know what representation we are using. The bit pattern

binary 10010001101000101011001111000  =  octal 02215053170   =  hex 12345678

could represent the integer 305419896, the real number 5.69045661e-28, or the string ' 4Vx'. It all depends on what is being represented.

A computer has storage for bit patterns, and instructions that operate on those bit patterns AS IF they were particular representations. For example, there is a different instruction for adding floating point (real) numbers than the one we've already seen for adding integers (the "add" instruction). There is no check to make sure that you are using a bit pattern according to its intended representation. That job is performed by the compiler, or, if the programmer is writing assembly language, it must be performed by the programmer. For example, if register %l0 is intended to be storing a floating point number, there is nothing to stop you from using the register as argument to an (integer) "add" instruction. The bit pattern in $l0 will be interpreted as an integer, and so the results will be garbage, but you can do it if you want to.

It's an interesting example of design evolution that such representation checks have been built into computers in the past. However, these checks slowed down the processor and they weren't used very often, since compilers could do most of the job. So processor designers optimized the common case and just let the compiler do all the work associated with keeping representations straight. This makes the processor much faster in the long run.

Representing Integers.

As discussed above, our criterion for choosing a representation is that it be efficient: provide a lot of capability at little cost in time and space.

First of all, to make integer representations efficient, we allocate a fixed amount of space to them. This allows the operand fetch unit to know exactly how many bytes to fetch when fetching an operand, so it can be quite fast.

In the SPARC architecture integers are stored in sizes of one byte, one halfword, one word, and one doubleword. We will concentrate on the word-sized integers - these are 32 bits in size.

Consider the representation in which the number 7 is encoded as seven '1' bits. What is wrong with this representation? It's not space-efficient. To represent the number 32 we would need 32 bits. We have already noted that n bits can represent an object with 2^n states, so using 32 bits to encode only 32 numbers is not as good as we know we can do. We know that we can encode 2^32, about 4 million, numbers using 32 bits.

This has been a problem ever since neanderthals put scratches on the cave wall to count the number of mastodons killed. It was solved quite elegantly by the invention of positional notation, which is familiar to us because it's used in the decimal system. The idea, as used in the binary system, works like this. We can encode the numbers 0 and 1 using a single bit. So to encode the next number, 2, we move one space to the left. Now we use our '2' symbol along with the original bit to encode 2 and 3. We need a new number for 4, so we move one place to the left again. With the '4' symbol and the others we can encode numbers up to 7. And so on. This is called BASE TWO, or BINARY NOTATION.

To read a base two number, just add up the powers of two:

	1101 = 1*2^3 + 1*2^2 + 0*2^1 + 1*2^0 = 8 + 4 + 1 = 13.

Decimal Conversion of powers-of-two.

Decimal is what we think in. Luckily there's a quick conversion from powers-of-two to decimal. It's based on the lucky fact that 2^10 is 1024, which is almost 1000, or 10^3. In CS, we refer to 2^10 as 1K (kilo) (though that is not the correct meaning of kilo in the metric system). Anyway, 1K always means 1024. 2^20 ~= 10^6 = 1M (meg), about a million. 2^30 ~= 10^9 = 1G (gig), about a billion. 2^40 ~= 10^12 = 1T (tera) about a trillion.

Using these, and if you memorize the powers of two up to 1K, you can quickly convert any power of two to a decimal equivalent.

	2^2	= 4
	2^12	= 4K
	2^22	= 4M
	2^32	= 4G

	2^26	= 64M
	2^19	= 512K
	2^31	= 2G

Adding Integers

We want our representation to be efficient in time also. Let's look at whether a common operation -- addition -- can be implmented easily and quickly on integers stored in binary notation.

The addition operation will be quick if we can perform it using a small number of simple operations. Let's use the same approach to addition that we use every day.

Say we are adding two, 2-digit binary numbers:

	wx
      +	yz
	--
	ab

Just as in everyday addition, start on the right and work towards the left. We need to add two bits. What are the possibilities?

We can see that we can find the value of b using the "xor" logical operation. This operation can be implemented in a very simple digital circuit.

However, when x=1 and z=1, we need to carry a 1 to the next column. Once again, this can be implemented using a simple digital circuit, called "AND":


	x  z  C (the carry bit)
	---------
	0  0  0
	0  1  0
        1  0  0
        1  1  1

So: b = x XOR z, C = x AND z, a = (w XOR y) XOR C. Thus there are two circuits in the addition unit for each bit: an XOR circuit and an AND circuit. The AND circuit feeds into the XOR circuit of the next higher-order bit.

This can go on indefinitely. For a 32-bit integer, there are 32 of these bitwise adders. However, there is one problem: what happens to the output from carry circuit on the highest-order bit?

When we carry from the highest-order bit, it is because we have added two numbers whose sum is greater than 2^32-1. Thus we have calculated a number that is too big to be stored in 32 bits. This condition is called OVERFLOW. It's a bad thing to ignore an overflow, right? Think about what happens here.

Binary Addition: Two ways to carry out binary addition, first using a half adder, the second one uses a full adder. Details about the working of the half adder and the full adder are given in the text. At this point, it is also important to understand the meaning of the term modulus arithmetic. e.g. If someone says 2 modulo arithmetic, it means that there are only two numbers, 0 and 1, and any number will be represented as one of them. e.g.

Modulus arithmetic: Examples of addition modulo N: (where N = 2)

    7 % 2 = 1
    (3 + 4) % 2 = 1

An addition unit in the ALU always does arithmetic modulus some value, in the SPARC it's modulus 2^32. That is, when we add two numbers whose sum is greater than or equal to 2^32, the result is less than 2^32 because a bit gets thrown away (the carry-out bit). The fact that we can add two number and get a smaller one sounds like subtraction -- this suggested to computer designers a way to implement negative numbers.