How to write an assembly program
Let’s review the example assembly program to add integers, with an eye to approaching the task of writing a new program in assembly.
Just as in C, the program’s entry point is a function named main.
(It is possible, in machine code, to bypass this convention and interact
directly with the operating system, but we will stick to programs that
use the C runtime, so they will start with main. A hint that you
are dealing with a program that doesn’t use the runtime is that it will
define _start instead.)
There are some lines before we see the word main, and they all
start with periods. A line in our assembly that starts with a period does
not correspond to a machine code instruction, but rather is a directive
to the assembler itself that will do the translating. First, we ask it
to switch from its default AT&T syntax to the better-suited-to-our-needs
Intel syntax.
.intel_syntax noprefix
Then, we tell it that the following instructions are to be placed in the text segment, where machine code goes.
.text
Now we see a mention of main, but this isn’t defining what
main is; this tells the assembler that it should be globally
visible within the program (we will see more about linkage later).
.global main
At long last, we define main.
main: push rbp
The bit that gives main a value is main: at the
beginning of the line, which is called a label; the rest is a machine
code instruction. Any line of assembly can be given a label, and lines
without a label I usually indent so that the labels stand out in the
first column. It is equally acceptable, and sometimes better when the
label itself is long, to put it before the line it labels, like this.
main:
        push rbp
The first two instructions are boilerplate involved in setting up a function’s local storage.
push rbp
mov rbp, rsp
We will get into what they are and how they work in detail later, but for
now think of them like the open brace that would begin the function in a C
version of this code. Syntactically, they follow the same format as all
instructions: first, an operation, some space, and then whatever arguments
are to be used, separated by commas. So here we see the push
operation being told to operate on rbp, and then mov
being told to operate on rbp and rsp.
The things an operation operates on are called operands, and one of the most common kinds of operand is a register, which you specify simply by its name.
Another kind of operand is data from memory, as seen in the next instructions.
mov rax, [rip + a]
mov rcx, [rip + b]
The mov instruction is like the = assignment operator
in C. It takes two operands, and it copies the value of the right side
into the left side. So, mov rbp, rsp above was setting rbp
to save the value of rsp, and mov rax, [rip + a] is
assigning a value into rax. The operand in square brackets
is a memory access. There will be a variable named a, and that’s
what we’re loading into the rax register. The rip + part
has to do with how a is being found in memory; the rip
register is the instruction pointer, also called the program counter, which
keeps track of where execution currently is (strictly, it holds the address
of the next instruction). When the machine code has to specify the address
of a, it will be as an offset from the instruction that’s looking for it,
what’s called ‘rip-relative addressing’, and that is explicit in the
assembly code.
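To make the arithmetic behind rip-relative addressing concrete, here is a small C sketch. The function names, and the addresses used in the note after it, are hypothetical and chosen only for illustration. It relies on one detail of x86-64: the offset stored in the instruction is a signed 32-bit displacement, measured from the address of the instruction that follows the one doing the access.

```c
#include <stdint.h>

/* Hypothetical helper: the displacement an assembler would store in the
   instruction, given where the next instruction starts and where the
   variable lives. Assumes the distance fits in 32 bits. */
int32_t rip_displacement(uint64_t next_instruction, uint64_t target)
{
    return (int32_t)((int64_t)target - (int64_t)next_instruction);
}

/* Hypothetical helper: what the processor does at run time. It adds the
   sign-extended displacement to the address of the next instruction. */
uint64_t rip_effective_address(uint64_t next_instruction, int32_t disp)
{
    return next_instruction + (uint64_t)(int64_t)disp;
}
```

For instance, if the next instruction were at 0x401136 and a were at 0x404028 (both made-up addresses), the stored displacement would be 0x2ef2, and adding it back to 0x401136 recovers 0x404028. Because only the distance is stored, the code keeps working wherever the program is loaded.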
Sometimes in disassembly listings, you will see a memory access like
[rip + a] given as a[rip]. The version I’m using is correct
Intel syntax, and the alternative version is a sort of AT&T–Intel
hybrid the GNU tools sometimes emit. The two are equivalent.
Now let’s do some math!
add rcx, rax
The add instruction is like the += operator in C;
it increases the left operand by the amount of the right operand. So,
add rcx, rax computes the sum of rcx and rax
and puts the result into rcx. The result of these last three
instructions is to put \(a + b\) into register rcx. But
registers are like scratch paper, so let’s save the answer back into memory.
mov [rip + c], rcx
Now \(c = a + b\). The final lines of main are more
boilerplate; for now, think of them like the closing brace.
mov rax, 0
pop rbp
ret
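Since the text keeps comparing instructions to C operators, it may help to see the body of main written out that way. The following is only a sketch: the function name add_a_and_b is hypothetical (it stands in for main so the code can be called from elsewhere), the local variables rax and rcx merely imitate registers, and the globals are declared here just so the sketch compiles on its own.

```c
#include <stdint.h>

/* Declared here only to make the sketch self-contained; in the assembly
   listing these are defined outside the text segment. */
int64_t a = 123;
int64_t b = 456;
int64_t c;

/* Hypothetical stand-in for main, one C statement per instruction. */
int64_t add_a_and_b(void)
{                       /* push rbp / mov rbp, rsp: the "open brace" */
    int64_t rax, rcx;   /* locals imitating the two registers */

    rax = a;            /* mov rax, [rip + a] */
    rcx = b;            /* mov rcx, [rip + b] */
    rcx += rax;         /* add rcx, rax       */
    c = rcx;            /* mov [rip + c], rcx */

    rax = 0;            /* mov rax, 0 */
    return rax;         /* pop rbp / ret: the return value travels in rax */
}
```

A real compiler would not necessarily choose these registers or this order; the point is only that each line of the assembly has a direct, readable counterpart.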
But what about our variables, a, b, and c?
They haven’t appeared yet because they don’t go in the text segment.
We have to use directives to switch to another segment and fill it
in appropriately. The input variables, a and b, are
global variables, which go in the data segment (since we never write
to them, it would also be reasonable to put them in the rodata segment,
but in this example we use data).
.data
Once we’ve switched, we can use the .quad directive to produce
quadword integers, and of course we will label them so they can be accessed
by their names as we saw in the code for main.
a: .quad 123
b: .quad 456
The output variable, c, needs space set aside for it, but it
has no meaningful initial value. There is a special segment just for
such things, the bss segment. (The name ‘bss’ is a historical holdover and
not particularly meaningful; just think of it as the uninitialized
data segment.)
.bss
The .zero directive lets us set aside however many bytes are needed,
to be filled in with zeroes when the program is being loaded. In this case,
we want a quadword, i.e. 8 bytes.
c: .zero 8
And that’s the end of the assembly listing.
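For comparison, here is a sketch of how the same three variables might be declared in C. Where each one ends up is the compiler's decision, so the segment names in the comments describe typical behaviour rather than a guarantee, and store_sum is a hypothetical helper added only to show the variables in use.

```c
#include <stdint.h>

/* Initialized globals: typically placed in the data segment, like the
   .quad lines. Declaring them const would typically move them to
   rodata instead. */
int64_t a = 123;
int64_t b = 456;

/* No initializer: typically placed in bss and zero-filled when the
   program is loaded, like c: .zero 8. */
int64_t c;

/* Hypothetical helper: the one thing the program does with them. */
void store_sum(void)
{
    c = a + b;
}
```

Each int64_t occupies 8 bytes, matching the quadword size used by .quad and by .zero 8 in the listing.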