Exploring Width
So, C provides integer types of different widths. It gives some hints about which is bigger but does not promise exactly how big any of them is, so that C can be compiled on different machines and adapt to their various architectures. We’re focusing on one particular architecture, x86-64, and we’ve seen some of its built-in width assumptions. So what should we expect?
Different compilers and operating systems make different choices, but if we restrict ourselves to gcc on Linux on x86-64, there is essentially one standard way that the C types get mapped onto hardware register sizes. I don’t have to tell you what it is, though: you can find out for yourself!
C provides the sizeof operator so that programmers can write code that adapts to different machine characteristics as necessary. For example, if you want to find out how big an int is, compile and run the following code.
#include <stdio.h>
int main(void)
{
    printf("On this system, int is %zu bytes.\n", sizeof(int));
}
The %zu conversion specifier for printf is a width-modified version of %u, the specifier for unsigned int. There are width modifiers for all of the different integer types, but z is the appropriate one for the result of sizeof, which has the unsigned type size_t.
I encourage you to play around with this example, adding more lines with other types to see how big all of the various types you’ve heard of are on your system. When I run it, I find that int is 4 bytes, which Intel would call a ‘double word’. That fits in the 32-bit parts of the registers, the ones that came from the 32-bit extension, with names such as eax, ebx, etc.
Before we move on, now is a good time to review how integers are stored in binary. You can review the material from CS160 about binary integers, particularly non-negative integers, signed integers with two’s complement, and writing binary information in a shorter way with hexadecimal.
Now, you might be wondering: if the rax register from the 64-bit extension is just a bigger version of the eax register from the 32-bit extension, then when you assign a value to eax, what happens to the upper 32 bits of rax? There are a few possibilities. First, there could be no guarantee, and those bits could just become uninitialized garbage. More interestingly, though, we might want to keep the value of the number and just make it wider. Similarly, what if you put an int value in C into a variable of type long?
The computer needs a way to represent the same number in different widths.
For an unsigned number, putting extra zeroes in higher place values does the right thing. The numbers 1, 01, 001, 0001, etc. are all just one. That’s how, even though five (101) is only three bits, you can put it in a 32-bit integer: there are just a lot of leading zeroes. If you put those bits into an even wider variable, there just need to be more zeroes; this is called zero-extension. (You can also go the other way, assigning a wider variable’s value into a narrower variable; as long as the only things that get cut off are leading zeroes, the number will still be the same. Only numbers that are too big for the narrower type will have meaningful bits cut off.)
For a signed number (using two’s complement notation), the leading bit must be one for negative numbers, but it must be zero for positive numbers and zero. So, to widen a signed number, you must zero-extend positive numbers and zero, but fill in with ones for negative numbers. Another way of looking at it is that the highest-order bit, the sign bit, is what must be copied to fill in the extra width; this is called sign-extension.
You can play around with how this looks in assembly, without worrying too much about understanding all of the assembly code. Here’s a quick way to see how some C code looks in assembly. First, write a snippet of code in a function, and let’s say the file is test.c.
void test(void)
{
    int number = 5;
    short narrower = number;
    long wider = number;
}
Compile it, but don’t worry about linking: just run make test.o to get the compiled object file. Now you can read it back in assembly using objdump by running objdump -M intel -d test.o. The -M intel option picks the right assembly syntax, and the -d option is for ‘disassemble’, i.e. reinterpret binary machine code as human-readable assembly code. When I do this, I get the following.
test.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <test>:
   0:   f3 0f 1e fa             endbr64
   4:   55                      push   rbp
   5:   48 89 e5                mov    rbp,rsp
   8:   c7 45 f4 05 00 00 00    mov    DWORD PTR [rbp-0xc],0x5
   f:   8b 45 f4                mov    eax,DWORD PTR [rbp-0xc]
  12:   66 89 45 f2             mov    WORD PTR [rbp-0xe],ax
  16:   8b 45 f4                mov    eax,DWORD PTR [rbp-0xc]
  19:   48 98                   cdqe
  1b:   48 89 45 f8             mov    QWORD PTR [rbp-0x8],rax
  1f:   90                      nop
  20:   5d                      pop    rbp
  21:   c3                      ret
A lot of that is boilerplate, and most of it won’t make sense yet. But if you just scan for familiar landmarks, hopefully you will spot a few. There’s the 5, and if you change the C code, recompile, and disassemble, you’ll see that constant changing to match. I can tell you that the mov instruction, ‘move’, is the assembly analog of the assignment operator. So we’re assigning 5 into something; that must be int number = 5. Since int is a double word on this system, that explains the DWORD on that line and the eax on the next line. Next in C, we’re assigning the value to a short, and sure enough, on the subsequent lines in assembly there is a WORD and the ax register, both indicating 2-byte values. Next there’s an instruction called cdqe, and anybody would be forgiven for not knowing what to make of that, but then we’re back to mov with rax, the 64-bit register, so that must have to do with setting the long in C.
(The cdqe instruction does sign-extension, turning the 32-bit signed integer in eax into a fully-fledged 64-bit signed integer in rax. It comes up a lot, but it doesn’t have a very obvious abbreviation.)
I hope you’ll spend some time experimenting with writing short C code and reading the disassembly. It is interesting in and of itself, but it is also good practice for wading into unfamiliar territory and looking for a few familiar landmarks to get your bearings.
(And by the end of this course, you’ll know what all of those instructions mean!)