Exploring Width

So, C provides different widths of integers, giving some hints about which is bigger but not promising exactly how big any of them is, so that C can be compiled on different machines and adapt to their various architectures. We’re focusing on a particular architecture, x86-64, and we’ve seen some of its built-in width assumptions. So what should we expect?

Different compilers and operating systems make different choices, but restricting ourselves to gcc on Linux on x86-64, there is pretty much a particular way that the C types get mapped onto hardware register sizes. I don’t have to tell you what it is, though—you can find out for yourself!

C provides the sizeof keyword so programmers can make code that adapts to different machine characteristics as necessary. For example, if you want to find out how big an int is, compile and run the following code.

#include <stdio.h>

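int main(void)
{
    /* sizeof yields a size_t, which is unsigned, so print it with %zu. */
    printf("On this system, int is %zu bytes.\n", sizeof(int));
    return 0;
}

The %zu conversion specifier is a width-modified version of %u for unsigned int. There are width modifiers for all of the different integer types, but z is the one appropriate for size_t, the unsigned type of the result from sizeof. (The %zd you may see elsewhere is the signed counterpart.)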

I encourage you to play around with this example, adding more lines with other types to see how big each of the types you’ve heard of is on your system. When I run it, I find that int is 4 bytes, which Intel would call a ‘double word’. That fits in the 32-bit portions of the registers, the ones that came with the 32-bit extension and have names such as eax, ebx, etc.

Before we move on, it would be a good time now to review how integers are stored in binary. You can review the material from CS160 about binary integers, particularly non-negative integers, signed integers with two’s complement, and writing binary information in a shorter way with hexadecimal.

Now, you might be wondering: if the rax register from the 64-bit extension is just a bigger version of the eax register from the 32-bit extension, then when you assign a value to eax, what happens to the upper 32 bits of rax? There are a few possibilities. First, there could be no guarantee, and those bits could just become uninitialized garbage. More interestingly, though, we might want to keep the value of the number but make it wider. Similarly, what if you put an int value in C into a variable of type long? The computer needs a way to represent the same number in different widths.

For an unsigned number, putting extra zeroes in higher place values does the right thing. The numbers 1, 01, 001, 0001, etc. are all just one. That’s how, even though five (101) is only three bits, you can put it in a 32-bit integer: there are just a lot of leading zeroes. If you put those bits into an even wider variable, there just need to be more zeroes; this is called zero-extension. (You can also go the other way, assigning a wider variable’s value into a narrower variable; as long as the only things that get cut off are leading zeroes, the number will still be the same. Only numbers that are too big for the narrower type will have meaningful bits cut off.)

For a signed number (using two’s complement notation), the leading bit must be one for negative numbers, but it must be zero for positive numbers and zero. So, to widen a signed number, you must zero-extend positive numbers and zero, but fill in with ones for negative numbers. Another way of looking at it is that the highest-order bit, the sign bit, is what must be copied to fill in the extra width; this is called sign-extension.

You can play around with how this looks in assembly, without worrying too much about understanding all of the assembly code. Here’s a quick way to see how some C code looks in assembly. First, write a snippet of code in a function, and let’s say the file is test.c.

void test(void)
{
    int number = 5;
    short narrower = number;
    long wider = number;
}

Compile it without linking: just run make test.o to get the compiled object file. Now you can read it back in assembly using objdump by running objdump -M intel -d test.o. The -M intel option picks Intel assembly syntax (objdump defaults to AT&T syntax), and the -d option is for ‘disassemble’, i.e. reinterpret binary machine code as human-readable assembly code. For me, I get the following.

test.o:     file format elf64-x86-64


Disassembly of section .text:

0000000000000000 <test>:
   0:   f3 0f 1e fa             endbr64
   4:   55                      push   rbp
   5:   48 89 e5                mov    rbp,rsp
   8:   c7 45 f4 05 00 00 00    mov    DWORD PTR [rbp-0xc],0x5
   f:   8b 45 f4                mov    eax,DWORD PTR [rbp-0xc]
  12:   66 89 45 f2             mov    WORD PTR [rbp-0xe],ax
  16:   8b 45 f4                mov    eax,DWORD PTR [rbp-0xc]
  19:   48 98                   cdqe
  1b:   48 89 45 f8             mov    QWORD PTR [rbp-0x8],rax
  1f:   90                      nop
  20:   5d                      pop    rbp
  21:   c3                      ret

A lot of that is boilerplate, and most of it won’t make sense yet. But if you just scan for familiar landmarks, hopefully you will spot a few. There’s the 5, and if you change the C code, recompile, and disassemble, you’ll see that constant changing to match. I can tell you that the mov instruction, ‘move’, is the assembly analog of the assignment operator. So we’re assigning 5 into something; that must be int number = 5. Since int is a double word on this system, that explains the DWORD and the eax on the next line. Next in C, we’re assigning the value to a short, and sure enough, on the subsequent lines in assembly there is a WORD and the ax register, both indicating 2-byte values. Next there’s an instruction called cdqe, and anybody would be forgiven for not knowing what to make of that, but then we’re back to mov with rax, the 64-bit register, so that must have to do with setting the long in C.

(The cdqe instruction does sign-extension, turning the 32-bit signed integer in eax into a fully-fledged 64-bit signed integer in rax. It comes up a lot, but it doesn’t have a very obvious abbreviation.)
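One experiment worth trying is to disassemble a signed widening and an unsigned widening side by side. Here is a sketch you could save in a hypothetical file, say widen.c, and run through the same make and objdump steps. The function names are my own, and I am not promising what instructions you will see: that depends on your compiler version and flags.

```c
/* Signed widening: the compiler has to sign-extend somehow. */
long widen_signed(int number)
{
    return number;
}

/* Unsigned widening: the compiler only needs zeroes in the upper bits. */
unsigned long widen_unsigned(unsigned int number)
{
    return number;
}
```

Look for where the two functions differ. I would expect the sign-extension instruction to show up in only one of them.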

I hope you’ll spend some time experimenting with writing short C code and reading the disassembly. It is interesting in and of itself, and it is also good practice at wading into unfamiliar territory and looking for a few familiar landmarks to get your bearings.

(And by the end of this course, you’ll know what all of those instructions mean!)
