Representing floating-point numbers

The Wikipedia articles on single-precision floating-point numbers (i.e. C’s float type) and double-precision floating-point numbers (double) are really good breakdowns of how they are represented in bits. There is a gentler introduction, using smaller bit widths and less detail all at once, in the CS160 material.

So, when faced with a number and wanting to represent it as a float, or with a float’s bit pattern and wanting to know what number it represents, how do you get started? I will give you two ways.

Piece by piece

The most direct way is to figure out which bits are from which part of the number, figure out what each represents independently, and combine them.
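For a 32-bit float, reading from the most significant bit, those parts are: 1 sign bit, then 8 exponent bits (stored with a bias of 127), then 23 significand bits (the fraction after an implied leading \(1.\)).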

Converting to float

For example, let’s say I want to represent \(35125.0\) as a float, writing the bit pattern in hexadecimal. First, I need my number in binary, so I will convert \(35125.0_{10}\) into \(1000100100110101.0_2\) (using any preferred technique for base conversion, or even a calculator).
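One way to do that conversion by hand is to find the powers of two that sum to the target: \(35125 = 32768 + 2048 + 256 + 32 + 16 + 4 + 1 = 2^{15} + 2^{11} + 2^{8} + 2^{5} + 2^{4} + 2^{2} + 2^{0}\), which places a \(1\) in bit positions 15, 11, 8, 5, 4, 2, and 0.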

Next I slide the decimal point (well, technically the binary point, or in general the ‘radix point’) over until the number starts with \(1.\), and keep track of how many places it moved; that count becomes the exponent when I write the number in scientific notation. For me, that’s \(1.000100100110101\times 2^{15}\).

Now I have all the pieces, and it’s time to put them together.

The sign bit is easy; since this number is not negative, it will be \(0\).

The exponent is written in 8 bits, with a bias of 127 (the number you write down will have 127 subtracted from it to compute the exponent). Since my exponent is \(15\), I want to represent \(15 + 127 = 142\) in binary in 8 bits (padding if necessary), which is \(10001110\).

The significand is written in 23 bits, without the leading \(1.\); my significand had 15 bits after the radix point, so I’ll need to pad it out with more zeroes until it is 23 bits long, getting \(00010010011010100000000\).

Putting the three parts (sign, exponent, significand) together, I get \(01000111000010010011010100000000\), my 32-bit float. Converting it to hexadecimal to make it easier to read and write, I get \(47093500_{16}\).
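(The hex comes from grouping the 32 bits into nibbles: \(0100\,0111\,0000\,1001\,0011\,0101\,0000\,0000\) reads off as \(4\), \(7\), \(0\), \(9\), \(3\), \(5\), \(0\), \(0\).)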

Converting from float

For an example converting the other way, let’s say I want to interpret the 32-bit value \(46948400_{16}\) as a floating-point number in decimal. First, I’ll expand the hexadecimal to binary, getting \(01000110100101001000010000000000_2\). That breaks apart into a 1-bit sign, an 8-bit exponent, and a 23-bit significand.
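(Each hexadecimal digit expands to four bits: \(4 \to 0100\), \(6 \to 0110\), \(9 \to 1001\), \(4 \to 0100\), \(8 \to 1000\), \(4 \to 0100\), and each \(0 \to 0000\).)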

The sign bit is \(0\), meaning non-negative.

The next 8 bits, \(10001101\), are the exponent field. Converting to decimal, \(10001101_2 = 128 + 8 + 4 + 1 = 141\). Applying the bias, \(141 - 127 = 14\) is my exponent.

The remaining 23 bits are the significand, with an implied \(1.\) in front, meaning overall my number is \(1.0010100100001\times 2^{14}\). I can slide the radix point over 14 places to account for the exponent to get \(100101001000010.0\), or \(19010.0\) in decimal.
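(As a check: \(100101001000010.0_2 = 2^{14} + 2^{11} + 2^{9} + 2^{6} + 2^{1} = 16384 + 2048 + 512 + 64 + 2 = 19010\).)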

Type punning

If you don’t understand how the conversions above work, you don’t yet have a complete grasp of how floating-point numbers are represented. However, I’d like to point out that when we write C programs using floating-point numbers, the machine is doing this sort of conversion all the time on our behalf. Can’t we ask it to help out now?

In C, pointers let us refer to memory locations, and work with the bits that reside there, in a very flexible way. They are powerful—possibly too powerful. A C pointer is typed: a pointer to an int has type int *, meaning that if you go look at those bits, you expect them to be representing an int; a float * is a pointer where, if you go look, you’d expect to see bits representing a float.

Under ordinary circumstances, mixing up what type is appropriate for interpreting those bits would be bad, and lead to all kinds of problems, but that’s exactly what we want to do here, on purpose.

To convert my \(35125.0\) to its representation in hexadecimal, I can write a little C program with that number in it, as a float, and then coerce the machine to read the same location in memory as though it were an unsigned int (which has the same width as float on my system, 32 bits) and print that in hexadecimal.

#include <stdio.h>

int main()
{
    float f = 35125.0;                     /* the number whose bits we want to see */
    unsigned int u = *(unsigned int *)&f;  /* read those same bits back as an unsigned int */
    printf("%x\n", u);                     /* print the bit pattern in hexadecimal */
}
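Run on a system where those assumptions hold, this should print 47093500, matching the hand conversion above.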

The critical piece, *(unsigned int *)&f, first finds the memory address of f (with &, ‘address-of’), then casts that pointer to be the same address but believing the bits should be seen as representing an unsigned int (with (unsigned int *)), and finally reads that value back out (with *, ‘dereference’).
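If the one-liner is hard to read, here is the same program with the cast split across named temporaries (pf and pu are names I’m introducing purely for illustration):

#include <stdio.h>

int main()
{
    float f = 35125.0;
    float *pf = &f;                         /* address of f, with type float * */
    unsigned int *pu = (unsigned int *)pf;  /* the same address, reinterpreted type */
    unsigned int u = *pu;                   /* read the bits there as an unsigned int */
    printf("%x\n", u);
}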

A similar bit of type punning can convert the other way.

#include <stdio.h>

int main()
{
    unsigned int u = 0x46948400;  /* the bit pattern we want to interpret */
    float f = *(float *)&u;       /* read those same bits back as a float */
    printf("%f\n", f);            /* print the resulting floating-point value */
}
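This one should print 19010.000000, agreeing with the piece-by-piece interpretation of \(46948400_{16}\) above.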

This is straight up CS205 code. It is entirely architecture-specific, and may not behave as hoped or even compile on a different computer. A lot of the work of designing good programming languages is making programs not depend on hardware architecture, and we’re running headlong the other way.
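For what it’s worth, the pointer cast above is the kind of thing compilers are allowed to be picky about. A commonly used alternative is to copy the bytes with memcpy instead of casting pointers; this is only a sketch, and it still depends on float and unsigned int having the same width.

#include <stdio.h>
#include <string.h>

int main()
{
    float f = 35125.0;
    unsigned int u;
    memcpy(&u, &f, sizeof u);  /* copy the bytes of f into u, rather than reinterpreting a pointer */
    printf("%x\n", u);
}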
