How to write an assembly program

Let’s review the example assembly program to add integers, with an eye to approaching the task of writing a new program in assembly.

Just as in C, the program’s entry point is a function named main. (It is possible, in machine code, to bypass this convention and interact directly with the operating system, but we will stick to programs that use the C runtime, so they will start with main. A hint that you are dealing with a program that doesn’t use the runtime is that it will define _start instead.)

There are some lines before we see the word main, and they all start with periods. A line in our assembly that starts with a period does not correspond to a machine code instruction, but rather is a directive to the assembler itself, the program that will do the translating. First, we ask it to switch from its default AT&T syntax to the better-suited-to-our-needs Intel syntax.

.intel_syntax noprefix

Then, we tell it that the following instructions are to be placed in the text segment, where machine code goes.

.text

Now we see a mention of main, but this isn’t defining what main is; this tells the assembler that it should be globally visible within the program (we will see more about linkage later).

.global main

At long last, we define main.

main:   push    rbp

The bit that gives main a value is main: at the beginning of the line, which is called a label; the rest is a machine code instruction. Any line of assembly can be given a label, and lines without a label I usually indent so that the labels stand out in the first column. It is equally acceptable, and sometimes better when the label itself is long, to put it before the line it labels, like this.

main:
    push    rbp

The first two instructions are boilerplate involved in setting up a function’s local storage.

push    rbp
mov rbp, rsp

We will get into what they are and how they work in detail later, but for now think of them like the open brace that would begin the function in a C version of this code. Syntactically, they follow the same format as all instructions: first an operation, then some space, and then whatever arguments are to be used, separated by commas. So here we see the push operation being told to operate on rbp, and then mov being told to operate on rbp and rsp.

The things an operation operates on are called operands, and one of the most common kinds of operand is a register, which you specify simply by its name.

Another kind of operand is data from memory, as seen in the next instructions.

mov rax, [rip + a]
mov rcx, [rip + b]

The mov instruction is like the = assignment operator in C. It takes two operands, and it copies the value of the right operand into the left one. So, mov rbp, rsp above was setting rbp to save the value of rsp, and mov rax, [rip + a] is assigning a value into rax. The square brackets mark a memory access: there will be a variable named a, and its value is what we’re loading into the rax register. The rip + part has to do with how a is found in memory. The rip register is the instruction pointer, also called the program counter, which keeps track of where in the code the processor is executing. When the machine code has to specify the address of a, it does so as an offset from the instruction that’s looking for it (strictly, from the end of that instruction), which is called ‘rip-relative addressing’, and that is explicit in the assembly code.
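To make the offset idea concrete, here is a small C sketch of the arithmetic that the assembler and linker work out for us. On x86-64, the displacement stored in the instruction is measured from the end of the instruction, that is, from the address of the next one. The function name is hypothetical and the addresses in the note below are made up for illustration; real addresses depend on how the program is linked and loaded.

```c
#include <stdint.h>

/* Hypothetical helper: the signed displacement that would be encoded in an
   instruction located at insn_addr and insn_len bytes long, so that it
   reaches target. The result is assumed to fit in 32 bits. */
int32_t rip_relative_disp(uint64_t insn_addr, uint64_t insn_len, uint64_t target)
{
    /* rip holds the address of the next instruction while this one runs. */
    uint64_t next_insn = insn_addr + insn_len;
    return (int32_t)(int64_t)(target - next_insn);
}
```

For instance, if mov rax, [rip + a] (7 bytes in its usual encoding) sat at the made-up address 0x401000 and a lived at 0x404028, then rip_relative_disp(0x401000, 7, 0x404028) would give 0x3021.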

Sometimes in compiler-generated assembly or disassembly listings, you will see a memory access like [rip + a] given as a[rip]. The version I’m using is correct Intel syntax, and the alternative is a sort of AT&T–Intel hybrid that the GNU tools sometimes emit. The two are equivalent.

Now let’s do some math!

add rcx, rax

The add instruction is like the += operator in C; it increases the left operand by the amount of the right operand. So, add rcx, rax computes the sum of rcx and rax and puts the result into rcx. The result of these last three instructions is to put \(a + b\) into register rcx. But registers are like scratch paper—let’s save the answer back into memory.
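As a rough C analogy for the three instructions so far, the sketch below models each register as a local variable and gives each instruction one C statement. The function name load_and_add is hypothetical, and the globals a and b stand in for the labels of the same names, using the values from the example.

```c
#include <stdint.h>

/* Globals standing in for the labels a and b in the listing. */
static uint64_t a = 123;
static uint64_t b = 456;

/* Hypothetical helper: one C statement per instruction.
   The locals rax and rcx only model the registers of the same names. */
uint64_t load_and_add(void)
{
    uint64_t rax = a;   /* mov rax, [rip + a] */
    uint64_t rcx = b;   /* mov rcx, [rip + b] */
    rcx += rax;         /* add rcx, rax       */
    return rcx;         /* rcx now holds a + b */
}
```

Note that the add overwrites rcx: after it runs, the original value of b is no longer in any register, only the sum.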

mov [rip + c], rcx

Now \(c = a + b\). The final lines of main are more boilerplate—for now, think of them like the closing brace.

mov rax, 0
pop rbp
ret

But what about our variables, a, b, and c? They haven’t appeared yet because they don’t go in the text segment. We have to use directives to switch to another segment and fill it in appropriately. The input variables, a and b, are global variables, which go in the data segment (since we never write to them, it would also be reasonable to put them in the rodata segment, but in this example we use data).

.data

Once we’ve switched, we can use the .quad directive to produce quadword integers, and of course we will label them so they can be accessed by their names as we saw in the code for main.

a:  .quad   123
b:  .quad   456

The output variable, c, needs space set aside for it, but it has no meaningful initial value. There is a special segment just for such things, the bss segment. (The name ‘bss’ is from history and not particularly meaningful; just think of it as the uninitialized data segment.)

.bss

The .zero directive lets us set aside however many bytes are needed, to be filled in with zeroes when the program is being loaded. In this case, we want a quadword, i.e. 8 bytes.

c:  .zero   8

And that’s the end of the assembly listing.
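For comparison, here is a sketch of roughly what the whole listing corresponds to in C. It is an analogy, not the source the assembly was generated from. The function is given the hypothetical name add_integers so that it can be called from other code; in a standalone program its body would be the body of main. Where the globals end up is an assumption about a typical toolchain: initialized globals usually land in the data segment and zero-initialized ones in bss.

```c
#include <stdint.h>

uint64_t a = 123;   /* a:  .quad 123   initialized, typically placed in .data */
uint64_t b = 456;   /* b:  .quad 456 */
uint64_t c;         /* c:  .zero 8     no initializer, zero-filled, typically in .bss */

/* Hypothetical name; the body plays the role of main in the listing. */
int add_integers(void)
{
    c = a + b;   /* the two loads, the add, and the store back to memory */
    return 0;    /* mov rax, 0: the return value is handed back in rax */
}
```

The braces around the function body take the place of the push rbp / mov rbp, rsp opening and the pop rbp / ret closing.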
