In this blog post we’re going to take a stab at implementing syscalls by hand. There really is no advantage to doing this; it’s just fun to learn the internals of Linux. We’re going to discuss user space and kernel space and finally get our hands dirty with some assembly.
#JustLinuxThings
This blog post is Linux-centric and deals exclusively with x64. I try to provide links for further information, but just be aware of that.
What are syscalls?
If you’ve ever written a program, chances are that you have already used syscalls. A normal computer runs an operating system, like Linux, and many applications. The OS’s job is to provide access to hardware interfaces like NICs, GPUs, HDDs or USB ports, manage the computer’s memory, schedule tasks and so on. It also deals with users and permissions.
Your program, on the other hand, probably needs to interface with the OS at some point: You might want to open, read or write files. You might want to use IPC primitives like pipes or mapped memory. Or your program might need network access and use sockets. The kernel and applications live in different memory regions (kernel space and user space). Generally speaking, a user space application may not just write into kernel space. Instead, the kernel exposes a set of functions that give applications access to the functionality mentioned above. These are called system calls, or syscalls for short [1].
Each syscall is assigned a unique number. Syscalls may have parameters, but they don’t have to.
Example: glibc’s read()
A prime example of a syscall would be read(fileDescriptor, buffer, numBytes). As the name of the function implies, it reads up to numBytes bytes from a fileDescriptor into a buffer. Let’s look at how the function is implemented in glibc. Note that read() is just an alias for __libc_read().
// Source: https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/read.c.html
ssize_t __libc_read (int fd, void *buf, size_t nbytes) {
    return SYSCALL_CANCEL (read, fd, buf, nbytes);
}
Wait. Stop. What’s SYSCALL_CANCEL? Turns out it is a macro that in turn uses other macros. If we expand all macro calls we get this [2][3]. I cleaned up the code to make it more readable (e.g., I removed various __typeof__ expressions).
// Source: https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/read.c.html
// (and subsequent includes)
ssize_t __libc_read (int fd, void *buf, size_t nbytes) {
    // Edited for brevity
    unsigned long int resultvar;
    size_t __arg3 = nbytes;
    void* __arg2 = buf;
    int __arg1 = fd;
    register size_t _a3 asm ("rdx") = __arg3;
    register void* _a2 asm ("rsi") = __arg2;
    register int _a1 asm ("rdi") = __arg1;
    asm volatile (
        "syscall\n\t"
        : "=a" (resultvar)
        : "0" (__NR_read), "r" (_a1), "r" (_a2), "r" (_a3)
        : "memory", "cc", "r11", "cx"
    );
    if ((unsigned long int)(resultvar) >= -4095L) {
        __set_errno(-(resultvar));
        resultvar = (unsigned long int) -1;
    }
    return resultvar;
}
Oooofff. That’s not particularly readable. So to actually figure out what’s happening here, let’s take a step back.
How do syscalls on x64 work, anyway?
To really understand what’s going on in the snippet above, let’s take a look at syscall().
syscall() is a convenience function that allows you to call any syscall with its number and arguments.
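To get a feeling for syscall() before we go lower, here is a minimal sketch. The helper names write_via_syscall and getpid_via_syscall are made up for this post; SYS_write and SYS_getpid are the syscall numbers that <sys/syscall.h> provides.

```c
#define _GNU_SOURCE
#include <stddef.h>       /* size_t */
#include <sys/syscall.h>  /* SYS_write, SYS_getpid */
#include <unistd.h>       /* syscall(), getpid() */

/* Hypothetical helper: write a buffer to a file descriptor through
   syscall() instead of going through the write() wrapper. */
static long write_via_syscall(int fd, const void *buf, size_t len)
{
    return syscall(SYS_write, fd, buf, len);
}

/* Hypothetical helper: a syscall without any parameters. */
static long getpid_via_syscall(void)
{
    return syscall(SYS_getpid);
}
```

Calling write_via_syscall(1, "hi\n", 3) should behave like write(1, "hi\n", 3), including the -1 and errno convention on failure, because syscall() is still a glibc function.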
If we read the man page carefully, we stumble upon this passage:
Each architecture ABI has its own requirements on how system call arguments are passed to the kernel. For system calls that have a glibc wrapper (e.g., most system calls), glibc handles the details of copying arguments to the right registers in a manner suitable for the architecture. However, when using syscall() to make a system call, the caller might need to handle architecture-dependent details[.]
And a little bit later:
Every architecture has its own way of invoking and passing arguments to the kernel. The details for various architectures are listed in the two tables below.
Arch/ ABI | Instruction | Syscall # | Return val | Return val 2 | Arg 1 | Arg 2 | Arg 3 | Arg 4 | Arg 5 | Arg 6 |
---|---|---|---|---|---|---|---|---|---|---|
x86-64 | syscall | rax | rax | rdx | rdi | rsi | rdx | r10 | r8 | r9 |
So here’s what’s happening: On x64, every syscall is invoked using the syscall assembly instruction. The instruction expects the system call number to be in rax. The result of the operation comes back in rax. The table also lists rdx for a second return value, which only matters for calls that return a pair; on some other architectures pipe() returns its two descriptors that way, but on x64 it writes both of them into an array you pass in. Parameters have to be in the right registers, namely the first argument in rdi, the second in rsi, the third in rdx, and so on through r10, r8 and r9.
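The table translates almost one-to-one into inline assembly. The following is a sketch, not how glibc does it: raw_getpid and raw_write are hypothetical names, and instead of register variables it uses GCC’s x86 constraint letters "D", "S" and "d" to pick rdi, rsi and rdx.

```c
#include <sys/syscall.h>  /* SYS_getpid, SYS_write */
#include <unistd.h>       /* getpid(), only needed to compare results */

/* Hypothetical helper: a syscall without arguments.
   The number goes in via rax and the result comes back in rax. */
static long raw_getpid(void)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)                /* result: rax */
                      : "0" ((long) SYS_getpid)   /* number: rax */
                      : "rcx", "r11", "memory", "cc");
    return ret;
}

/* Hypothetical helper: a syscall with three arguments in
   rdi ("D"), rsi ("S") and rdx ("d"). Returns the raw kernel value. */
static long raw_write(int fd, const void *buf, unsigned long len)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "0" ((long) SYS_write),
                        "D" ((long) fd), "S" (buf), "d" (len)
                      : "rcx", "r11", "memory", "cc");
    return ret;
}
```

Note that these helpers return whatever the kernel left in rax, so an error shows up as a small negative number rather than as -1 plus errno.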
Deciphering the assembly
We’re now ready to go through the code above step-by-step.
First, we copy the parameters passed to __libc_read() to local variables. I believe this mainly has to do with the macro shenaniganry, and the copies will almost certainly be optimized away by the compiler.
size_t __arg3 = nbytes;
void* __arg2 = buf;
int __arg1 = fd;
Next, variables with the register storage class are declared. While register was deprecated in C++11 and removed in C++17, it’s still alive and kicking in C. In GCC you can pick the exact CPU register by adding the asm keyword with the register name.
register size_t _a3 asm ("rdx") = __arg3;
register void* _a2 asm ("rsi") = __arg2;
register int _a1 asm ("rdi") = __arg1;
To summarize, after the previous six lines of code, we now have fd in rdi, buf in rsi, and nbytes in rdx. If you go back to the table, this is exactly what the calling convention mandates.
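Register variables earn their keep once a syscall needs arguments four to six. As far as I know, GCC has no constraint letters for r10, r8 and r9, so pinning a variable is the way to reach them. Here is a sketch under that assumption; raw_syscall6 and raw_mmap_anon are hypothetical names, and mmap is just a convenient syscall that takes six arguments.

```c
#define _GNU_SOURCE
#include <stddef.h>       /* NULL */
#include <sys/mman.h>     /* PROT_*, MAP_* */
#include <sys/syscall.h>  /* SYS_mmap, SYS_munmap */

/* Hypothetical helper: raw syscall with six arguments.
   Arguments 4 to 6 are pinned to r10, r8 and r9 as in the table. */
static long raw_syscall6(long nr, long a1, long a2, long a3,
                         long a4, long a5, long a6)
{
    long ret;
    register long arg4 __asm__ ("r10") = a4;
    register long arg5 __asm__ ("r8")  = a5;
    register long arg6 __asm__ ("r9")  = a6;
    __asm__ volatile ("syscall"
                      : "=a" (ret)
                      : "0" (nr), "D" (a1), "S" (a2), "d" (a3),
                        "r" (arg4), "r" (arg5), "r" (arg6)
                      : "rcx", "r11", "memory", "cc");
    return ret;
}

/* Hypothetical helper: map an anonymous, private, read-write region.
   Returns NULL if the kernel reported an error. */
static void *raw_mmap_anon(unsigned long len)
{
    long ret = raw_syscall6(SYS_mmap, 0, (long) len,
                            PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* Errors come back as values between -4095 and -1. */
    if ((unsigned long) ret >= (unsigned long) -4095L)
        return NULL;
    return (void *) ret;
}
```

The region can be released again with raw_syscall6(SYS_munmap, (long) p, len, 0, 0, 0, 0); the unused arguments are simply ignored by the kernel.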
Lastly, we can figure out what the inline assembly does by looking at the GCC Manual.
asm volatile (
    "syscall\n\t"
    : "=a" (resultvar)
    : "0" (__NR_read), "r" (_a1), "r" (_a2), "r" (_a3)
    : "memory", "cc", "r11", "cx"
);
We use volatile because our inline assembly has side-effects. Without it, GCC may delete the statement if it decides the output is unused, or move it out of a loop; volatile disables those optimizations.
The next line is the verbatim assembly that we want to use. In our case it’s one simple instruction: syscall.
The next line specifies output operands, C variables that are modified by the assembly. The general form is [ [asmSymbolicName] ] constraint (cvariablename). In our case we don’t use an asmSymbolicName. We have the constraint "=a". The equals sign means that the value is overwritten. On x64, "a" means the rax register.
Now we’re dealing with input operands. They follow the same syntax as output operands. "r" just means that the operand must be in a general-purpose register. Since we already pinned our variables to specific registers in the code before, we’re just telling the compiler that we’re going to use them. We use "0" as a constraint for the constant expression __NR_read (the syscall number of read()). "0" is a matching constraint: it puts the input in the same place as operand 0, our output, which lives in rax.
Finally, we have clobbers. Clobbers are registers that are neither input nor output operands, but might be changed anyway. For example, sometimes one needs temp registers or the processor might change registers as a side-effect. GCC knows two special clobbers: memory, which tells GCC that the assembly may read or write memory other than its listed operands, and cc, which tells GCC that the assembly might change the flags register. Finally, the r11 register and the cx register (rcx on x64) are clobbered. Both are overwritten by the syscall instruction itself: it saves the return address in rcx and the flags register in r11, so their old contents are gone when the kernel hands control back.
To summarize:
- We put the syscall number in rax by using the constant expression as an input operand.
- We put the arguments in their respective registers using input operands.
- We execute the syscall assembly instruction.
- The result from rax is saved to resultvar using output operands.
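Put together, the four steps fit into one small function. This is a sketch modelled on the expanded glibc code above, with the errno handling left out: my_read is a hypothetical name and it returns the raw value the kernel left in rax.

```c
#include <errno.h>        /* EBADF, for the usage note */
#include <sys/syscall.h>  /* __NR_read */
#include <unistd.h>       /* pipe(), write() for trying it out */

/* Hypothetical re-implementation of read() without error translation.
   Returns the number of bytes read, or a negative error code. */
static long my_read(int fd, void *buf, unsigned long nbytes)
{
    long resultvar;
    /* Step 2: arguments in rdi, rsi and rdx. */
    register long          arg1 __asm__ ("rdi") = fd;
    register void         *arg2 __asm__ ("rsi") = buf;
    register unsigned long arg3 __asm__ ("rdx") = nbytes;
    __asm__ volatile (
        "syscall"                      /* step 3: enter the kernel   */
        : "=a" (resultvar)             /* step 4: result from rax    */
        : "0" ((long) __NR_read),      /* step 1: number into rax    */
          "r" (arg1), "r" (arg2), "r" (arg3)
        : "memory", "cc", "r11", "rcx");
    return resultvar;
}
```

Reading from the read end of a pipe with my_read(fds[0], buf, 3) should return 3 once three bytes have been written to the other end, while my_read(-1, buf, 3) returns -EBADF instead of setting errno.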
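The last piece of code deals with errno. The kernel signals an error by returning a value between -4095 and -1, which is the negated error code. Because resultvar is unsigned, a single comparison against -4095L catches exactly that range. In that case we set errno to the positive error code and return -1.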
if ((unsigned long int)(resultvar) >= -4095L) {
    __set_errno(-(resultvar));
    resultvar = (unsigned long int) -1;
}
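The unsigned comparison is easy to misread, so here is the same check as a standalone sketch. raw_to_libc_result is a hypothetical name, and it assigns to errno directly because __set_errno is internal to glibc.

```c
#include <errno.h>  /* errno, EBADF */

/* Hypothetical helper: turn a raw kernel return value into the
   libc convention (-1 plus errno on failure, the value otherwise). */
static long raw_to_libc_result(unsigned long raw)
{
    /* (unsigned long) -4095L is 2^64 - 4095, so this is true exactly
       for the raw values that mean -4095 ... -1 when read as signed. */
    if (raw >= (unsigned long) -4095L) {
        errno = (int) -(long) raw;  /* positive error code */
        return -1;
    }
    return (long) raw;
}
```

A raw value of 3 passes through unchanged, (unsigned long) -EBADF becomes -1 with errno set to EBADF, and -4096 is already outside the error range and is returned as is.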
I hope this blog post could shed some light on how to do syscalls by hand. As always, if there’s a mistake in the text, I would be glad if you could notify me.
Footnotes
1. A list of all syscalls can be found here: https://man7.org/linux/man-pages/man2/syscalls.2.html
2. We’re assuming single-threaded mode here, so that everything is easier to read.
3. You can expand the macros in read.c by using gcc: gcc -E read.c. Just a neat trick.