Online Judging Sandbox: From Linux to FreeBSD
As most probably know, DMOJ uses a sandbox to protect itself from potentially malicious user submissions. An overview of the Linux sandbox has been published by my friend Tudor. However, it doesn’t go deep into the implementation details, many of which differ between Linux and FreeBSD.
At its core, the sandbox, cptbox, uses the ptrace(2) API to intercept system calls before and after they are executed, denying access and manipulating results. The core is written in C, hence the name cptbox.
Perhaps the most obvious difference between Linux and FreeBSD is that on Linux, ptrace(2) subfunctions are invoked as ptrace(PTRACE_*), while on FreeBSD, it is ptrace(PT_*). But this difference is rather superficial compared to the significant internal differences.
Processes and Threads
On Linux, threads are implemented as kernel-mode processes sharing the same address space. The concept of a user-mode process is really a “thread group”. This can be seen in system calls such as sys_exit, which terminates a thread, and sys_exit_group, which terminates all the threads in the process.
On FreeBSD, processes are treated as an entity with potentially multiple “light-weight processes” a.k.a. threads running inside.
In any case, this difference is abstracted away by the use of a process group. The advantage is that the main loop only has to perform a wait4(2) on the process group, and it will receive events for all threads and child processes, regardless of implementation. However, the difference has significant implications for how information is obtained.
System call information
The most critical part of the sandbox, perhaps, is obtaining the system call information, so that cptbox may decide exactly what to do with the sandboxed process.
On Linux, the implementation seems rather sketchy. Resuming the child with ptrace(PTRACE_SYSCALL) causes it to be stopped with a SIGTRAP at the next system call entry or exit, so that cptbox may examine the state.
However, most other events that ptrace tracks are also reported via SIGTRAP. The detailed information can be obtained with ptrace(PTRACE_GETSIGINFO), which will tell you if this is a system call enter/exit, but this is considered slow and expensive. Instead, there is an option called PTRACE_O_TRACESYSGOOD which can be set, and it causes the signal to be reported as SIGTRAP | 0x80 when a system call is entered or exited.
The catch is that it doesn’t tell you if this is a system call enter or exit. The only way to find out is to keep toggling a flag every time you receive a SIGTRAP | 0x80. This flag must be kept specific to each thread as well, since different threads can run different system calls at the same time.
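The per-thread toggle described above could be sketched roughly as follows. This is a minimal illustration rather than the actual cptbox code: the fixed-size table, MAX_TRACKED, and the names lookup_thread and on_syscall_stop are all hypothetical, and it assumes every thread is first seen at a system call enter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_TRACKED 64  /* hypothetical limit chosen for this sketch */

struct thread_state {
    pid_t tid;        /* thread id reported by wait4(2); 0 marks a free slot */
    bool in_syscall;  /* true between a syscall enter and its matching exit */
};

static struct thread_state threads[MAX_TRACKED];

/* Find the slot for tid, claiming a free one the first time tid is seen. */
static struct thread_state *lookup_thread(pid_t tid)
{
    struct thread_state *free_slot = NULL;

    assert(tid > 0);
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (threads[i].tid == tid)
            return &threads[i];
        if (threads[i].tid == 0 && free_slot == NULL)
            free_slot = &threads[i];
    }
    if (free_slot != NULL) {
        free_slot->tid = tid;
        free_slot->in_syscall = false;
    }
    return free_slot;  /* NULL when the table is full */
}

/* Call once per SIGTRAP | 0x80 stop: returns true on enter, false on exit. */
static bool on_syscall_stop(pid_t tid)
{
    struct thread_state *t = lookup_thread(tid);

    assert(t != NULL);
    t->in_syscall = !t->in_syscall;
    return t->in_syscall;
}
```

In a wait4(2) loop, on_syscall_stop would be called with the returned thread id whenever WIFSTOPPED(status) holds and WSTOPSIG(status) equals SIGTRAP | 0x80. Because each thread has its own slot, interleaved stops from different threads do not disturb each other's state.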
The system call information is available in the registers. However, the system call number is slightly special. In the Linux ABI, the register rax (on x64, or eax on x86) contains both the system call number (on enter) and the return value (on exit). ptrace(2) on Linux lets you access the system call number through a “register” called orig_rax (or orig_eax), while the return value is found in the register rax.
On FreeBSD, there is no PTRACE_O_TRACESYSGOOD. Since everything is in a light-weight process anyway, ptrace(PT_LWPINFO) can be used to obtain all the information you will ever need. It could be described as an equivalent of ptrace(PTRACE_GETSIGINFO), in a sense. This copies the information into a struct ptrace_lwpinfo, which we will call lwpi.
The most convenient part is that lwpi.pl_flags can contain the flags PL_FLAG_SCE and PL_FLAG_SCX, telling you directly whether this is a syscall enter or exit. Having neither means this is some other event. This means we don’t have to play with toggling flags.
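As a rough sketch of how the flags might be consumed, the classification can be written as a small pure function. The names stop_kind, classify_stop, and query_stop are hypothetical. The fallback values in the #else branch are an assumption copied from FreeBSD's <sys/ptrace.h> so that the sketch also compiles on other systems; on FreeBSD itself the real header is used.

```c
#include <assert.h>

#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/ptrace.h>
#else
/* Assumption: values mirrored from FreeBSD's <sys/ptrace.h> for illustration. */
#define PL_FLAG_SCE 0x04  /* stopped at system call enter */
#define PL_FLAG_SCX 0x08  /* stopped at system call exit */
#endif

enum stop_kind { STOP_OTHER, STOP_SYSCALL_ENTER, STOP_SYSCALL_EXIT };

/* Classify a stop from lwpi.pl_flags; no per-thread toggle is required. */
static enum stop_kind classify_stop(int pl_flags)
{
    assert(pl_flags >= 0);
    if (pl_flags & PL_FLAG_SCE)
        return STOP_SYSCALL_ENTER;
    if (pl_flags & PL_FLAG_SCX)
        return STOP_SYSCALL_EXIT;
    return STOP_OTHER;  /* neither flag set: some other event */
}

#ifdef __FreeBSD__
/* Fetch the LWP info for a stopped process and classify the stop. */
static enum stop_kind query_stop(pid_t pid)
{
    struct ptrace_lwpinfo lwpi;

    if (ptrace(PT_LWPINFO, pid, (caddr_t)&lwpi, sizeof(lwpi)) != 0)
        return STOP_OTHER;  /* treat a failed query as a non-syscall stop */
    return classify_stop(lwpi.pl_flags);
}
#endif
```

Keeping the flag test separate from the ptrace(2) call means the decision logic can be exercised without a traced process.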
Another convenience is that lwpi.pl_syscall_code contains the system call number directly, so we do not have to play with pseudo-registers. The catch is that this field is only available starting in FreeBSD 10.3. Although 10.3 had already been released when cptbox was ported to FreeBSD, a significant chunk of testing was done on Debian GNU/kFreeBSD, which has this version of the kernel but does not expose the field in its headers. The result is an #if that checks whether the field is available and, if it is not, stores the value of rax on system call enter instead (there is no such thing as orig_rax on FreeBSD).
Finally, on both platforms, cptbox provides a table of operations to perform for each system call: allow, kill, and call a callback. Since system call numbers are different between platforms, the user of cptbox, i.e. the Python side, will map the numbers accordingly.
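A table of this kind might look roughly like the sketch below. It is not the real cptbox layout: MAX_SYSCALL, the enum names, struct policy, and the callback signature are all hypothetical, and defaulting every entry to "kill" is a design assumption made here so that unlisted system calls are denied.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SYSCALL 600  /* hypothetical table size for this sketch */

enum handler { HANDLER_KILL = 0, HANDLER_ALLOW, HANDLER_CALLBACK };
enum verdict { VERDICT_KILL, VERDICT_CONTINUE };

/* Callback returns nonzero to allow the system call, zero to deny it. */
typedef int (*syscall_callback)(int nr, void *context);

struct policy {
    enum handler handlers[MAX_SYSCALL];  /* indexed by system call number */
    syscall_callback callback;           /* used for HANDLER_CALLBACK entries */
    void *context;                       /* opaque pointer handed to callback */
};

/* Start from a deny-everything policy. */
static void policy_init(struct policy *p)
{
    assert(p != NULL);
    for (size_t i = 0; i < MAX_SYSCALL; i++)
        p->handlers[i] = HANDLER_KILL;
    p->callback = NULL;
    p->context = NULL;
}

/* Example callback: allow only while the int that context points to is nonzero. */
static int allow_if_flag(int nr, void *context)
{
    (void)nr;
    return context != NULL && *(int *)context != 0;
}

/* Decide what to do with system call number nr at syscall enter. */
static enum verdict decide(const struct policy *p, int nr)
{
    assert(p != NULL);
    if (nr < 0 || nr >= MAX_SYSCALL)
        return VERDICT_KILL;  /* out of range: deny */
    switch (p->handlers[nr]) {
    case HANDLER_ALLOW:
        return VERDICT_CONTINUE;
    case HANDLER_CALLBACK:
        if (p->callback != NULL && p->callback(nr, p->context))
            return VERDICT_CONTINUE;
        return VERDICT_KILL;
    case HANDLER_KILL:
    default:
        return VERDICT_KILL;
    }
}
```

The caller fills in handlers using the platform's own system call numbers, which keeps the C side free of any Linux or FreeBSD numbering.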
On a related note, FreeBSD has over a hundred more system calls than Linux, and the table had to be increased in size to accommodate them.
Process management
On Linux, setting the flags PTRACE_O_TRACECLONE, PTRACE_O_TRACEFORK, and PTRACE_O_TRACEVFORK will cause a SIGTRAP to be sent when clone(2), fork(2), or vfork(2) are used, respectively. The high 16 bits of the status code returned by wait4(2) will specify the event.
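Extracting the event from the status could be sketched as follows on Linux. The helper names ptrace_event_of and is_new_task_event are hypothetical; the PTRACE_EVENT_* constants come from <sys/ptrace.h>.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

/*
 * Return the PTRACE_EVENT_* code carried in a wait4(2) status,
 * or 0 if the status is not a SIGTRAP stop carrying an event.
 */
static int ptrace_event_of(int status)
{
    if (!WIFSTOPPED(status) || WSTOPSIG(status) != SIGTRAP)
        return 0;
    return (status >> 16) & 0xffff;  /* the event lives in the high 16 bits */
}

/* True if the event announces a new thread or child process. */
static bool is_new_task_event(int event)
{
    assert(event >= 0);
    switch (event) {
    case PTRACE_EVENT_CLONE:
    case PTRACE_EVENT_FORK:
    case PTRACE_EVENT_VFORK:
        return true;
    default:
        return false;
    }
}
```

A plain SIGTRAP stop without an event yields 0, so the same check distinguishes ordinary traps from clone, fork, and vfork notifications.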
On FreeBSD, there are no events that are sent when a thread is created. A new thread is discovered by seeing a new value for lwpi.pl_lwpid. ptrace(PT_FOLLOW_FORK) can be called to follow new processes created through fork(2), in which case lwpi.pl_flags will contain PL_FLAG_CHILD when the child process is first trapped by ptrace. Since we are not notified in advance about the new child process, using a process group is the only possible solution.
Also, on Linux, a process can be terminated while trapped by ptrace(2) through SIGKILL. This is not true on FreeBSD. In fact, both kill(SIGKILL) and ptrace(PT_KILL) must be used in conjunction to reliably terminate the child process.
No system calls?
What if the child doesn’t make a system call for a long period of time? Then wait4(2) will be waiting forever. This is solved by a thread that kills the child process once it reaches the time limit, which I named the “shocker”. Well, sort of.
The shocker cannot rely on wall clock time, since the sandbox has overhead. This requires it to keep track of the actual execution time, which is roughly measured by how long it takes for wait4(2) to return after resuming the process. We must somehow force wait4(2) to return periodically so that the timer can be updated.
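The bookkeeping this implies could be sketched as below: only the intervals between resuming the child and the next return of wait4(2) are counted. The struct exec_timer and its functions are hypothetical names, and the timestamps are passed in explicitly (in practice they might come from clock_gettime(CLOCK_MONOTONIC, ...)) so that the logic is easy to check in isolation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

struct exec_timer {
    struct timespec resumed_at;  /* when the child was last resumed */
    int64_t elapsed_ns;          /* accumulated time spent running */
    int64_t limit_ns;            /* time limit for the submission */
};

/* Nanoseconds from start to end. */
static int64_t timespec_diff_ns(const struct timespec *start,
                                const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (int64_t)(end->tv_nsec - start->tv_nsec);
}

/* Record the moment just before the child is resumed. */
static void timer_on_resume(struct exec_timer *t, const struct timespec *now)
{
    assert(t != NULL && now != NULL);
    t->resumed_at = *now;
}

/*
 * Record the moment wait4(2) returned. Adds the interval since the last
 * resume and returns true once the accumulated time reaches the limit.
 */
static bool timer_on_stop(struct exec_timer *t, const struct timespec *now)
{
    assert(t != NULL && now != NULL);
    t->elapsed_ns += timespec_diff_ns(&t->resumed_at, now);
    return t->elapsed_ns >= t->limit_ns;
}
```

Time the sandbox spends inspecting a stopped child falls between a stop and the next resume, so it is never added to elapsed_ns.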
On Linux, this can be quite easily solved by sending a signal that happens to be ignored. ptrace(2) will trap, happily notifying the parent of what is happening, and execution resumes as if nothing has happened. SIGWINCH was chosen for this purpose.
On FreeBSD, this scheme falls apart, as SIGWINCH, by virtue of being ignored by default, does not trap. Hence, on FreeBSD, SIGSTOP is used, and the execution has to be resumed by cptbox.
Conclusion
Once these challenges are solved, the implementation of the sandbox is basically complete, save some polishing work to make it more usable. In any case, the cptbox source code is available on GitHub for your viewing pleasure.