Doing syscalls by hand

In this blog post we’re going to take a stab at implementing syscalls by hand. There is no real advantage to doing this; it’s just a fun way to learn the internals of Linux. We’re going to discuss user space and kernel space, and finally get our hands dirty with some assembly.

#JustLinuxThings

This blog post is Linux-centric and deals exclusively with x64. I try to provide links for further information, but just be aware of that.

What are syscalls?

If you’ve ever written a program, chances are that you have already used syscalls. A normal computer runs an operating system, like Linux, and many applications. The OS’s job is to provide access to hardware interfaces like NICs, GPUs, HDDs or USB ports, manage the computer’s memory, schedule tasks and so on. It also deals with users and permissions.

Your program, on the other hand, probably needs to interface with the OS at some point: You might want to open, read or write files. You might want to use IPC primitives like pipes or mapped memory. Or your program might need network access and use sockets. The kernel and applications live in different memory regions (kernel space and user space). Generally speaking, a user space application may not just read or write kernel space. Instead, the kernel exposes a set of functions that give applications access to the services mentioned above. These are called system calls, or syscalls for short¹.

Each syscall is assigned a unique number. Syscalls may have parameters, but they don’t have to.
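
To make the numbering concrete: on Linux, C programs can get at the numbers through the SYS_* constants in <sys/syscall.h>. The following is a minimal sketch, assuming x86-64 Linux with glibc; the helper name is made up for illustration, and the values in the comments are only valid on x86-64.

```c
#include <stdio.h>
#include <sys/syscall.h>   /* SYS_read, SYS_write, SYS_getpid, ... */

/* Hypothetical helper: prints a few syscall numbers for the architecture
   this file is compiled for. */
static void print_some_syscall_numbers(void) {
    printf("read   = %d\n", (int) SYS_read);    /* 0 on x86-64  */
    printf("write  = %d\n", (int) SYS_write);   /* 1 on x86-64  */
    printf("getpid = %d\n", (int) SYS_getpid);  /* 39 on x86-64 */
}
```

Compiling the same function for another architecture would print different numbers, which is one reason to use the constants instead of hard-coding values.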

Example: glibc’s `read()`

A prime example of a syscall would be read(fileDescriptor, buffer, numBytes). As the name of the function implies, it reads up to numBytes bytes from fileDescriptor into buffer. Let’s look at how the function is implemented in glibc. Note that read() is just an alias for __libc_read().

// Source: https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/read.c.html

ssize_t __libc_read (int fd, void *buf, size_t nbytes) {
    return SYSCALL_CANCEL (read, fd, buf, nbytes);
}

Wait. Stop. What’s SYSCALL_CANCEL? Turns out it is a macro that in turn uses other macros. If we expand all the macros, we get this²³. I cleaned up the code to make it more readable (e.g., I removed various __typeof__ expressions).

// Source: https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/read.c.html 
//         (and subsequent includes)

ssize_t __libc_read (int fd, void *buf, size_t nbytes) {
    //Edited for brevity

    unsigned long int resultvar;

    size_t __arg3 = nbytes; 
    void* __arg2 = buf; 
    int __arg1 = fd; 

    register size_t _a3 asm ("rdx") = __arg3; 
    register void* _a2 asm ("rsi") = __arg2; 
    register int _a1 asm ("rdi") = __arg1; 
    
    asm volatile ( 
          "syscall\n\t" 
        : "=a" (resultvar) 
        : "0" (__NR_read), "r" (_a1), "r" (_a2), "r" (_a3) 
        : "memory", "cc", "r11", "cx"
    ); 

    if ((unsigned long int)(resultvar) >= -4095L) {
      __set_errno(-(resultvar));
      resultvar = (unsigned long int) -1;
    }

    return resultvar;
}

Oooofff. That’s not particularly readable. So to actually figure out what’s happening here, let’s take a step back.

How do syscalls on x64 work, anyway?

To really understand what’s going on in the snippet above, let’s take a look at syscall(). syscall() is a convenience function that allows you to call any syscall with its number and arguments.

If we read the man page carefully, we stumble upon this passage:

Each architecture ABI has its own requirements on how system call arguments are passed to the kernel. For system calls that have a glibc wrapper (e.g., most system calls), glibc handles the details of copying arguments to the right registers in a manner suitable for the architecture. However, when using syscall() to make a system call, the caller might need to handle architecture-dependent details[.]

And a little bit later:

Every architecture has its own way of invoking and passing arguments to the kernel. The details for various architectures are listed in the two tables below.

Arch/ ABI	Instruction	Syscall #	Return val	Return val 2	Arg 1	Arg 2	Arg 3	Arg 4	Arg 5	Arg 6
x86-64	syscall	rax	rax	rdx	rdi	rsi	rdx	r10	r8	r9

So here’s what’s happening: On x64, every syscall is invoked using the syscall assembly instruction. The instruction expects the system call number to be in rax. It returns the result of the operation in rax. The table also lists rdx for a second return value, but as far as I can tell that only matters on other architectures: on x86-64, pipe() writes both file descriptors into an array passed as an argument. Parameters have to be in the right registers, namely the first argument in rdi, the second in rsi, the third in rdx, the fourth in r10, the fifth in r8 and the sixth in r9.
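
Before going down to assembly, here is what using syscall() looks like in practice. This is a minimal sketch, assuming Linux with glibc in its default (GNU) compilation mode; the two helper names are hypothetical. glibc’s syscall() takes care of moving the number and arguments into the registers from the table.

```c
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write, SYS_getpid */
#include <unistd.h>        /* syscall() */

/* Hypothetical helper: same job as write(), but goes through syscall(). */
static long write_via_syscall(int fd, const void *buf, size_t len) {
    return syscall(SYS_write, fd, buf, len);
}

/* Hypothetical helper: a syscall without any parameters. */
static long getpid_via_syscall(void) {
    return syscall(SYS_getpid);
}
```

Calling write_via_syscall(1, "hi\n", 3) should behave like write(1, "hi\n", 3), including the -1 and errno convention on failure, because syscall() applies the same error handling as the regular wrappers.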

Deciphering the assembly

We’re now ready to go through the code above step-by-step.

At first, we copy the parameters passed to read to local variables. I believe this mainly has to do with the macro shenaniganry and will almost certainly be optimized away by the compiler.

size_t __arg3 = nbytes; 
void* __arg2 = buf; 
int __arg1 = fd;

Next, variables with the register storage class are declared. While register was deprecated in C++11 and removed in C++17, it’s still alive and kicking in C. In GCC you can bind such a variable to a specific CPU register by using the asm keyword.

register size_t _a3 asm ("rdx") = __arg3; 
register void* _a2 asm ("rsi") = __arg2; 
register int _a1 asm ("rdi") = __arg1;

To summarize, after the previous six lines of code, we now have fd in rdi, buf in rsi, and nbytes in rdx. If you go back to the table, this is exactly what the calling convention mandates.

Lastly, we can figure out what the inline assembly does by looking at the GCC Manual.

asm volatile ( 
      "syscall\n\t" 
    : "=a" (resultvar) 
    : "0" (__NR_read), "r" (_a1), "r" (_a2), "r" (_a3) 
    : "memory", "cc", "r11", "cx"
);

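We use volatile because our inline assembly has side effects. Without it, GCC may delete the statement if the output looks unused, or move it somewhere else; volatile disables those optimizations, which could otherwise cause bugs.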

The next line is the verbatim assembly that we want to use. In our case it’s one simple instruction: syscall.

The next line specifies output operands, C variables that are modified by the assembly. The general form is [[asmSymbolicName]] constraint (cvariablename). In our case we don’t use an asmSymbolicName. We have the constraint "=a". The equals sign means that the operand is overwritten. For x64 "a" means the rax register.

Now we’re dealing with input operands. They follow the same syntax as output operands. "r" just means that the operand must be in a register. Since we already defined variables bound to the right registers in the code before, we’re just telling the compiler that we’re going to use them. For the constant expression __NR_read (the syscall number of read()) we use "0" as a constraint. This is a matching constraint: it tells the compiler to put the value in the same place as operand 0, which is the output operand in rax.

Finally, we have clobbers. Clobbers are registers that are neither input nor output operands, but might be changed anyway. For example, sometimes one needs temp registers or the processor might change registers as a side-effect. GCC knows two special clobbers: memory, which specifies that memory contents might have changed, and cc, which tells GCC that the assembly might have changed the flags register. Finally, the r11 register and the rcx register (written as "cx" in the clobber list) are clobbered. That is a side effect of the syscall instruction itself: the CPU saves the return address in rcx and the flags register in r11 before jumping into the kernel, so whatever was in those two registers is gone afterwards.

To summarize:

We put the syscall number in rax by using the constant expression as an input operand.
We put the arguments in their respective registers using input operands.
We execute the syscall assembly instruction.
The result from rax is saved to resultvar using output operands.

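The last piece of code deals with errno. The kernel signals an error by returning a negated error code between -4095 and -1. Interpreted as an unsigned number, such a value is greater than or equal to -4095L, which is exactly what the comparison checks. In that case we store the positive error code in errno and return -1.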

if ((unsigned long int)(resultvar) >= -4095L) {
    __set_errno(-(resultvar));
    resultvar = (unsigned long int) -1;
}
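
The same check can be written as a small function of its own, which makes the boundary easier to play with. This is a sketch under the assumption that long is 64 bits wide (as on x86-64 Linux); to_libc_result is a hypothetical name, and it sets the public errno instead of using glibc’s internal __set_errno.

```c
#include <errno.h>

/* Hypothetical helper: turns a raw kernel return value into the libc
   convention (-1 plus errno on failure, the value itself otherwise). */
static long to_libc_result(unsigned long raw) {
    /* -4095..-1 as unsigned are the 4095 largest possible values. */
    if (raw >= (unsigned long) -4095L) {
        errno = (int) -(long) raw;   /* e.g. -9 becomes EBADF (9) */
        return -1;
    }
    return (long) raw;
}
```

Feeding it (unsigned long) -9L yields -1 with errno set to EBADF, while a regular result such as 42 is passed through unchanged.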

I hope this blog post could shed some light on how to do syscalls by hand. As always, if there’s a mistake in the text I would be glad if you could notify me.

¹ A list of all syscalls can be found here: https://man7.org/linux/man-pages/man2/syscalls.2.html
² We’re assuming single-threaded mode here, so that everything is easier to read.
³ You can simplify read.c by using gcc: gcc -E read.c. Just a neat trick.