As most probably know, DMOJ uses a sandbox to protect itself from potentially malicious user submissions. An overview of the Linux sandbox has been published by my friend Tudor. However, it doesn’t go deep into the implementation details, many of which differ between Linux and FreeBSD.
At its core, the sandbox,
cptbox, uses the
ptrace(2) API to intercept system calls before and after they are executed, denying access and manipulating results. The core is written in C, hence the name
Perhaps the most obvious difference between Linux and FreeBSD is that on Linux,
ptrace(2) subfunctions are invoked as
ptrace(PTRACE_*), while on FreeBSD, it is
ptrace(PT_*). But this difference is rather superficial compared to the significant internal differences.
Processes and Threads
On Linux, threads are implemented as kernel-mode processes sharing the same address space. The concept of a user mode process is really a “thread group”. This can be seen in system calls such as
sys_exit which terminates a thread, and
sys_exit_group, which terminates all the threads in the process.
On FreeBSD, a process is treated as a single entity with potentially multiple “light-weight processes”, a.k.a. threads, running inside it.
In any case, this difference is abstracted away by the use of a process group. The advantage is that the main loop only has to perform a
wait4(2) on the process group, and it will receive events for all threads and child processes, regardless of implementation. However, the difference still has significant implications for how information is obtained.
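To make the process group idea concrete, here is a minimal sketch of waiting on a group rather than on one PID. The helper name run_in_group_and_wait is made up for illustration and is not taken from cptbox; the real main loop would also pass tracing-related options and keep looping over events.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start a child in its own process group and reap it
 * by waiting on the group instead of on a specific PID.
 * Returns the child's exit code, or -1 on error. */
int run_in_group_and_wait(int exit_code)
{
    assert(exit_code >= 0 && exit_code <= 255);

    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        setpgid(0, 0);           /* child becomes leader of a new group */
        _exit(exit_code);
    }
    /* Set the group from the parent too, so it exists whoever runs first.
     * A failure here is harmless if the child has already done it. */
    setpgid(child, child);

    int status;
    struct rusage ru;
    /* A PID below -1 means "any child whose process group is |pid|". */
    pid_t got = wait4(-child, &status, 0, &ru);
    if (got < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the wait targets the group, any further process that the child creates inside the same group would be reported by the same call, without the parent having to know its PID in advance.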
System call information
Perhaps the most critical job of the sandbox is obtaining the system call information, so that
cptbox may decide exactly what to do with the sandboxed process.
On Linux, the implementation seems rather sketchy. Calling
ptrace(PTRACE_SYSCALL) will cause a
SIGTRAP to be sent and suspend the child process so that
cptbox may examine the state.
However, most other events that
ptrace tracks are also reported via
SIGTRAP. The detailed information can be obtained by
ptrace(PTRACE_GETSIGINFO). It will tell you if the stop was caused by a system call enter/exit, but this is considered slow and expensive. Instead, there is an option called
PTRACE_O_TRACESYSGOOD which, once set, makes the kernel report the signal
SIGTRAP | 0x80 when a system call is entered or exited.
The catch is that it doesn’t tell you if this is a system call enter or exit. The only way to find out is to keep toggling a flag every time you receive a
SIGTRAP | 0x80. This flag must be kept specific to each thread as well, since different threads can run different system calls at the same time.
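The toggling can be sketched roughly as follows. This is not the cptbox code: the fixed-size table, MAX_TRACKED, and the function names are assumptions made for the example, and it assumes that the first stop seen for a thread is a system call enter.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>

#define MAX_TRACKED 64   /* arbitrary capacity for this sketch */

struct thread_state {
    pid_t tid;
    bool in_syscall;     /* true between an enter stop and its exit stop */
};

static struct thread_state threads[MAX_TRACKED];
static int thread_count;

/* True if a wait status is a system call stop under PTRACE_O_TRACESYSGOOD. */
bool is_syscall_stop(int status)
{
    return WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80);
}

/* Flip the flag of one thread.
 * Returns 1 for a syscall enter, 0 for a syscall exit, -1 if the table is full. */
int on_syscall_stop(pid_t tid)
{
    assert(tid > 0);
    for (int i = 0; i < thread_count; i++) {
        if (threads[i].tid == tid) {
            threads[i].in_syscall = !threads[i].in_syscall;
            return threads[i].in_syscall ? 1 : 0;
        }
    }
    if (thread_count == MAX_TRACKED)
        return -1;
    /* First stop for this thread: assumed to be an enter. */
    threads[thread_count].tid = tid;
    threads[thread_count].in_syscall = true;
    thread_count++;
    return 1;
}
```

Keying the flag by thread ID is what lets two threads sit at different points of different system calls without confusing each other's state.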
The system call information is available in the registers. However, the system call number is slightly special. In the Linux ABI, the register
rax (on x64, or
eax on x86), contains both the system call number (on enter) and the return value (on exit).
ptrace(2) on Linux lets you access the system call number through a “register” called
orig_eax), while the return value is found in the register
On FreeBSD, there is no
PTRACE_O_TRACESYSGOOD. Since everything is in a light-weight process anyway,
ptrace(PT_LWPINFO) can be used to obtain all the information you will ever need. It could be described as an equivalent of
ptrace(PTRACE_GETSIGINFO), in a sense. This copies the information into a
struct ptrace_lwpinfo, which we will call
The most convenient part is that
lwpi.pl_flags can contain the flags
PL_FLAG_SCX, telling you directly whether this is a syscall enter or exit. Having neither means this is some other event. This means we don’t have to play with toggling flags.
Another convenience is that
lwpi.pl_syscall_code contains the system call number directly, so we do not have to play with pseudo-registers. The catch is that this is only available starting in FreeBSD 10.3. Although that version had already been released when
cptbox was ported to FreeBSD, a significant chunk of testing was done on Debian GNU/kFreeBSD, which ships this version of the kernel but does not expose the field in its headers. The result is that there is an
#if that checks whether this field is available and, if it is not, stores the value of
rax on system call enter instead (there is no such thing as
orig_rax on FreeBSD).
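As a rough illustration of the FreeBSD side, the sketch below classifies a stop from pl_flags. PL_FLAG_SCE is the enter counterpart of PL_FLAG_SCX in FreeBSD's ptrace(2). The names classify_stop and query_stop are invented for this example, and the numeric stand-in values in the non-FreeBSD branch are assumptions that exist only so the classification logic can be compiled elsewhere.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#ifdef __FreeBSD__
#include <sys/ptrace.h>
#else
/* Stand-in values (assumptions for this sketch). On FreeBSD the real
 * definitions come from <sys/ptrace.h>. */
#define PL_FLAG_SCE 0x04
#define PL_FLAG_SCX 0x08
#endif

enum stop_kind { STOP_OTHER, STOP_SYSCALL_ENTER, STOP_SYSCALL_EXIT };

/* Decide what kind of stop this is from the pl_flags bit mask. */
enum stop_kind classify_stop(int pl_flags)
{
    if (pl_flags & PL_FLAG_SCE)
        return STOP_SYSCALL_ENTER;
    if (pl_flags & PL_FLAG_SCX)
        return STOP_SYSCALL_EXIT;
    return STOP_OTHER;   /* neither flag: some other event */
}

#ifdef __FreeBSD__
/* Ask the kernel about the stopped LWP and classify the stop.
 * Returns 0 on success, -1 if ptrace fails. */
int query_stop(pid_t pid, enum stop_kind *kind)
{
    struct ptrace_lwpinfo lwpi;

    assert(kind != NULL);
    if (ptrace(PT_LWPINFO, pid, (caddr_t)&lwpi, sizeof(lwpi)) < 0)
        return -1;
    *kind = classify_stop(lwpi.pl_flags);
    return 0;
}
#endif
```

No per-thread state is needed here, which is the whole point: the kernel already says which half of the system call the thread is in.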
Finally, on both platforms,
cptbox provides a table of operations to perform for each system call: allow, kill, or call a callback. Since system call numbers differ between platforms, the user of
cptbox, i.e. the Python side, will map the numbers accordingly.
On a related note, FreeBSD has over a hundred more system calls than Linux, and the table had to be increased in size to accommodate it.
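A table of this kind might look something like the sketch below. The names, the table size of 600, and the choice to kill on anything not explicitly listed are all assumptions for illustration; they are not the actual cptbox definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYSCALL 600  /* arbitrary size for this sketch, not cptbox's value */

enum handler_kind { HANDLER_KILL = 0, HANDLER_ALLOW, HANDLER_CALLBACK };

/* A callback returns nonzero to allow the call. */
typedef int (*syscall_callback)(int syscall_no);

struct handler {
    enum handler_kind kind;
    syscall_callback callback;
};

/* Zero-initialised, so every entry starts out as HANDLER_KILL. */
static struct handler handlers[MAX_SYSCALL];

void set_handler(int syscall_no, enum handler_kind kind, syscall_callback cb)
{
    assert(syscall_no >= 0 && syscall_no < MAX_SYSCALL);
    handlers[syscall_no].kind = kind;
    handlers[syscall_no].callback = cb;
}

/* Returns 1 if the call may proceed, 0 if the process should be killed. */
int check_syscall(int syscall_no)
{
    if (syscall_no < 0 || syscall_no >= MAX_SYSCALL)
        return 0;
    const struct handler *h = &handlers[syscall_no];
    switch (h->kind) {
    case HANDLER_ALLOW:
        return 1;
    case HANDLER_CALLBACK:
        return h->callback != NULL && h->callback(syscall_no) != 0;
    default:
        return 0;
    }
}

/* Example callback: permit only even-numbered calls (purely illustrative). */
int allow_even(int syscall_no)
{
    return syscall_no % 2 == 0;
}
```

The table itself knows nothing about which number means which call; the caller fills it with the numbers that are right for the platform it runs on.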
On Linux, setting the flags
PTRACE_O_TRACEVFORK will cause a
SIGTRAP to be sent when
vfork(2) are used, respectively. The high 16 bits of the status code returned by
wait4(2) will specify the event.
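Extracting the event on Linux could be done along these lines. The helper name ptrace_event_of is hypothetical; the shift by 16 follows the layout described above.

```c
#include <assert.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/* Returns the PTRACE_EVENT_* code carried by a wait status, or 0 if the
 * status is not a SIGTRAP stop carrying an event. */
int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    /* The event code lives above the low 16 bits of the status. */
    return (int)((unsigned int)status >> 16);
}
```

A caller would compare the result against constants such as PTRACE_EVENT_VFORK to decide how to treat the new child.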
On FreeBSD, no event is sent when a thread is created. It will be discoverable by seeing a new value for
ptrace(PT_FOLLOW_FORK) can be called to follow new processes created through
fork(2), in which case
lwpi.pl_flags will contain
PL_FLAG_CHILD when the child process is first trapped by
ptrace. Since we are not notified in advance about the new child process, using a process group is the only possible solution.
Also, on Linux, a process can be terminated while trapped by
SIGKILL. This is not true on FreeBSD. In fact, both
ptrace(PT_KILL) must be used in conjunction to reliably terminate the child process.
No system calls?
What if the child doesn’t make a system call for a long period of time? Then
wait4(2) will be waiting forever. This is solved by a thread that kills the child process once it reaches the time limit, which I named the “shocker”. Well, sort of.
The shocker cannot rely on wall clock time, since the sandbox has overhead. This requires it to keep track of the actual execution time, which is roughly measured by how long it takes for
wait4(2) to return after resuming the process. We must somehow force
wait4(2) to return periodically so that the timer can be updated.
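One way to do this bookkeeping is sketched below. It is an illustration under assumptions, not the cptbox implementation: the struct and function names are invented, and the use of CLOCK_MONOTONIC is my choice for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Accumulated time the child has spent running, in nanoseconds. */
struct exec_timer {
    struct timespec resumed_at;
    long long total_ns;
};

long long timespec_diff_ns(struct timespec start, struct timespec end)
{
    return (long long)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (end.tv_nsec - start.tv_nsec);
}

/* Call just before resuming the child. */
void timer_resume(struct exec_timer *t)
{
    clock_gettime(CLOCK_MONOTONIC, &t->resumed_at);
}

/* Call right after wait4(2) returns: only the interval in which the
 * child was actually running gets added to the total. */
void timer_stop(struct exec_timer *t)
{
    struct timespec now;

    clock_gettime(CLOCK_MONOTONIC, &now);
    t->total_ns += timespec_diff_ns(t->resumed_at, now);
    assert(t->total_ns >= 0);
}

/* Nonzero once the accumulated run time exceeds the limit. */
int over_limit(const struct exec_timer *t, long long limit_ns)
{
    return t->total_ns > limit_ns;
}
```

Time spent inside the sandbox between timer_stop and the next timer_resume is never counted, so the overhead of inspecting system calls is not charged to the submission.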
On Linux, this can be quite easily solved by sending a signal that happens to be ignored.
ptrace(2) will trap, happily notifying the parent of what is happening, and execution resumes as if nothing has happened.
SIGWINCH was chosen for this purpose.
On FreeBSD, this scheme falls apart, as
SIGWINCH, by virtue of being ignored by default, does not trap. Hence, on FreeBSD,
SIGSTOP is used, and the execution has to be resumed by
Once these challenges are solved, the implementation of the sandbox is basically complete, save some polishing work to make it more usable. In any case, the
cptbox source code is available on GitHub for your viewing pleasure.