The steps of compilation¶

Compilation is normally encapsulated so that all of the necessary steps happen behind the scenes. Type the following C code into a file named count.c.

#include <stdio.h>
#include <stdlib.h>

#define TARGET 10

int main() {
    int i;
    for (i = 0; i < TARGET; ++i)
        printf("%d\n", i);
    return EXIT_SUCCESS;
}

You can compile this code all at once with make count, or by running the compiler directly with the following command.

gcc -o count count.c

You can verify with ls that there are now two files: count.c containing your source code, and count, the executable program. (If you’re used to Windows where programs have a .exe extension, you’ll have to get used to the tradition in UNIX that programs do not have an extension at all.) You can run the program and see it work by entering the following.

./count

The ./ part specifies to run the program named count that is in the current directory (. always refers to the current directory). Without the directory specified (i.e. if you just typed count instead of ./count) the shell would look for a file named count in a list of system directories and run the first one it found. Since the current directory is not typically one of those places it will look, you have to specify the ./.

If you’re curious where the system programs are, you can run echo $PATH to see the list of where the shell will look, separated by colons. If you’re curious where a particular program file is, you can use the which utility. For example, if you run which gcc, you will see something like /usr/bin/gcc. That is a file that happens to contain the executable program that is the compiler. (This is why programs on UNIX typically do not have extensions; the shell searches for the name you type, so we’d have to type a lot of extra .exes all the time.)

However, traditionally there were more steps to compilation that were visible to the user, and they all still happen behind-the-scenes when you run the compiler as you did above.

Preprocessing¶

The first step is preprocessing, which handles the directives such as #include and #define. Those are not actually part of the C language, but part of the preprocessor language. When you run the preprocessor on your source code, it doesn’t care at all about the C part. When it encounters an #include, it will search for the named file (it has its own search path for files specified with angle brackets such as <stdio.h>, and it looks for files in the current directory when they are specified with quote marks as in "myheader.h"); when it finds that file, the preprocessor copy-and-pastes the contents of that file to replace the #include line. When it encounters a #define directive, the preprocessor keeps track of what to replace with what, and whenever it encounters the defined symbol, replaces it textually with the given value. There are other things the preprocessor handles as well, while passing everything it doesn’t recognize on through.

To run the preprocessor directly, enter the following.

cpp -P count.c count.i

The program is named cpp for ‘C Pre Processor’. It was only long after this tool was named that C++ was invented and people started commonly using ‘CPP’ as an abbreviation of ‘C Plus Plus’.

The traditional extension for preprocessed source files is .i. Open the file up in a text editor and skim it. You should see the pasted-in contents of the headers that were #included (you can check by finding the originals in /usr/include, e.g. /usr/include/stdio.h). You should see the code you recognize from count.c at the end, with the macro TARGET replaced.

The -P option turns off what are called ‘linemarkers’, which are extra information for the compiler so it knows which files and line numbers the result of the preprocessor originally came from. You might try preprocessing without it to see the linemarkers; without them, if the compiler tried to print out warnings or errors about your code, it wouldn’t know where to tell you to look for the problem.

You can also run gcc but tell it to stop after preprocessing with the -E option.

gcc -E -P -o count.i count.c

That is equivalent to running cpp directly as above.

Compiling¶

Once the source has been preprocessed, the compiler can run, translating the C language to assembly language. The compiler is traditionally named cc, for ‘C Compiler’; the modern gcc is the Gnu Project’s version of cc, but you can still call it cc for old time’s sake. You can use the -S option to make it stop after compiling without moving on to assembling. (The -masm=intel option changes the style of assembly language produced to the one we will be using in class.)

cc -S -masm=intel -o count.s count.i

Note that we start from the preprocessed .i file and that the traditional extension for the assembly language output is .s. (A .S file, with a capital letter, indicates assembly language that should be preprocessed, whereas with a lower-case letter it indicates assembly language that is ready to be assembled.) Again, have a look inside the resulting file—you are not expected to understand assembly language yet, but we will be learning a lot about it in the coming weeks.

Assembling¶

The translation from assembly language to machine code is much simpler than the translation from C language to assembly language. The tool that does this step is not called a compiler, but an assembler, and the traditional name for this tool is as.

as -o count.o count.s

Equivalently, you can run gcc with the -c flag to stop after assembling.

gcc -c -o count.o count.s

The result is a compiled ‘object file’, with extension .o. This contains raw machine code that could be run on the CPU directly, but it is not ready for that yet because, for example, it still contains some references to other object files that need to be resolved by linking.

From here on out, the intermediate results will no longer make sense if you open them in a text editor, because they are raw binary. You can, however, translate them to a readable hexadecimal representation using the xxd utility.

xxd count.o

This will mostly not make sense to you without knowing more about object files, but you should be able to see some recognizable landmarks still. The file begins with the magic number (yes, it’s really called magic) ‘ELF’. Also, the format string from our C code, "%d\n", will still be in there as the hexadecimal 25640a, which is the ASCII representation of percent, d, and newline, respectively.

You can also use programs like objdump that can read the information in an object file and print human-readable tables summarizing it. We will be exploring that in detail later on.

Linking¶

C was designed to allow for ‘separate compilation’, in which each source file can go through preprocessing, compiling, and assembling separately, and then they are linked together at the end. Compiling in particular is a lot of work for the computer, and with separate compilation, if one file has to be changed, that file can be recompiled without needing to recompile every file in the project. This is also a way to reuse code as compiled libraries, which are linked in to the program that uses them.

Linking in the modern compiler has become very complex, so that even if you want to run the linker directly, you normally go through gcc.

gcc -o count count.o

This is different from the first, do-everything-in-one-go command line, because we are using count.o as the input rather than count.c. The gcc command will figure out what step to start on based on the extension of the input file.

It is still possible to access the linker directly through its traditional name, ld, but we have to know which system files to link in for the program to be complete.

ld -o count /usr/lib/x86_64-linux-gnu/crt1.o /usr/lib/x86_64-linux-gnu/crti.o /usr/lib/x86_64-linux-gnu/crtn.o count.o -lc

The files crt1.o, crti.o, and crtn.o contain extra code needed to set up the runtime environment for our program (‘CRT’ stands for ‘C Run Time’). The -lc flag links in the standard C library, which provides, for example, the implementation of the printf function our example code depends on.

The output of the linker is the finished executable, count. You might view count with xxd as well, and notice its similarities and differences of this linked version relative to count.o. A useful way to control the output of programs with lots of output like this is to pipe the output into less.

xxd count | less

Within less, the up and down arrows will scroll through the output, and q will quit.

Exploration¶

Work through the example above, or the same steps with different source code, for yourself. Try changing the file produced each of the steps and compiling the rest of the way in order to change what the program does. For example, in each of those files, it is possible to change the program to stop at a target other than 10, or to print the numbers out separated by spaces rather than newlines. If you let the preprocessor emit linemarkers, you could introduce syntax errors but manipulate where the compiler believes they are.

To edit the binary files, a good trick is to use xxd to dump them, edit the dump, and then use xxd ‘in reverse’ to patch the binary file.

xxd file dumped     # dump file into dumped
            # now edit dumped
xxd -r dumped file  # patch the changes back into the original

Note that although xxd includes an attempt at interpreting the raw hex as ASCII in a column on the right, editing it has no effect on xxd -r, which only considers the hex dump part.

You have attempted of activities on this page