4.8. Floating Point Numbers
Computers generally store data in fixed-sized chunks. Hardware can handle data more efficiently if it can assume that integers are represented with 32 bits, doubles with 64 bits, and so on. But with a fixed number of bits to store fractional values, we are left with a hard choice: how many bits should go on either side of the binary point?
Imagine we are only using 8 bits to store fractional numbers. If we do not worry about negative values and assume that there are always 4 bits on each side of the binary point - something like 1010.0110 - then the largest value we can represent is 15.9375 (1111.1111) and the smallest non-zero value is 0.0625 (0000.0001). If we instead use only 2 bits for the integer part and 6 for the fractional part - like 10.101101 - we can represent smaller values: 00.000001 is 0.015625. But with this scheme, the largest value we can represent drops to 3.984375 (11.111111).
Which of these two formats would be best: 4.4 or 2.6? There is no good answer. Sometimes we care about accurately representing small values and do not care about large ones. Other times, we need to represent larger values. A system with a fixed number of digits to the right of the ‘.’ locks us into one particular set of compromises.
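To make the trade-off concrete, here is a small sketch in Python (the function name fixed_point_value is our own, purely for illustration) that interprets the same 8 bits under the 4.4 and 2.6 layouts described above.

```python
def fixed_point_value(bits, int_bits):
    """Interpret an 8-character bit string as an unsigned fixed-point
    number with int_bits bits before the binary point."""
    assert len(bits) == 8
    whole = int(bits[:int_bits], 2)                    # integer part
    frac_bits = bits[int_bits:]
    frac = int(frac_bits, 2) / (2 ** len(frac_bits))   # fractional part
    return whole + frac

# The same trade-off described above:
print(fixed_point_value("11111111", 4))  # 15.9375  - largest 4.4 value
print(fixed_point_value("00000001", 4))  # 0.0625   - smallest non-zero 4.4 value
print(fixed_point_value("11111111", 2))  # 3.984375 - largest 2.6 value
print(fixed_point_value("00000001", 2))  # 0.015625 - smallest non-zero 2.6 value
```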
The alternative is to use a floating-point representation. You may not have heard of the term, but you have seen the same basic idea in scientific notation. When we write \(6.2 \times {10}^{12}\) instead of 6200000000000 or \(1.65 \times {10}^{-8}\) instead of 0.0000000165, we are condensing the representation of large and small values by shifting (or floating) the decimal point. Values are recorded as a decimal multiplied by some power of ten.
Computers use this same trick, but instead of representing values as decimals multiplied by powers of ten, they use binary numbers multiplied by a power of two. There are thus three things to represent: the sign of the number, the binary fraction, and the power of two to multiply it by. We will use the following scheme:
1 bit to represent the sign. 0 for positive, 1 for negative.
3 bits to represent the exponent - the power of two to multiply by. We need to represent both positive and negative exponents; to do so, we will subtract 4 from the value the exponent bits represent. For example, if the three exponent bits are 101, that means 5; we subtract 4 to get 1 and thus raise 2 to the 1st power. If the exponent bits were 001, representing 1, we would subtract 4 and get -3, which indicates we should raise 2 to the -3 power.
4 bits to represent the binary fraction (more formally known as the mantissa). We will always interpret these four bits as filling in the blanks of 0.XXXX. For example, if the four fraction bits are 0100 we would interpret that as \({0.0100}_{2}\) or \({0.25}_{10}\).
The final value is obtained by multiplying the binary fraction by the power of two indicated by the exponent.
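As a sketch of this decoding rule (our own illustration, using an assumed helper named decode_float8 rather than anything from the book's tools), the following Python function splits an 8-bit pattern into its sign, exponent, and fraction fields and applies the bias-of-4 and 0.XXXX rules just described.

```python
def decode_float8(bits):
    """Decode an 8-bit string under the 1/3/4 scheme described above:
    1 sign bit, 3 exponent bits (bias 4), 4 fraction bits read as 0.XXXX."""
    assert len(bits) == 8
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:4], 2) - 4      # subtract the bias of 4
    fraction = int(bits[4:], 2) / 16      # 0.XXXX as a binary fraction
    return sign * fraction * (2 ** exponent)
```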
It may sound a little complex, but remember that it is the same idea as scientific notation - work out a power of two and a binary fraction, then multiply them. Experiment with the floating-point decoder below. The row of boxes shows the bits of a number (initially all 0s); below that is an explanation of how that value would be decoded.
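If you want to experiment away from the interactive page, a few hand-picked patterns run through the decode_float8 sketch above illustrate the same decoding the widget performs:

```python
print(decode_float8("00000000"))   # 0.0  - all zeros, like the decoder's starting state
print(decode_float8("01010100"))   # 0.5  - exponent bits 101 give 2^1, fraction 0.0100 is 0.25
print(decode_float8("11010100"))   # -0.5 - the same value with the sign bit set
```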
Self Check
Q-1: What is the largest value you can make using the scheme above?

Q-2: What power of two is represented by exponent bits of 110?

- 2
- 3
- 4
- 5
- 6

Q-3: Say the exponent bits are 101. What bits are needed in the fraction to make a final value of 1.25?