The Containers and Checkpoint/Restore MC at Linux Plumbers is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.
Last year's edition covered a range of subjects, and a lot of progress has been made on all of them. There is now a working prototype for an ID-shifting filesystem, which some distributions already choose to include; proper support for running Android in containers via binderfs; seccomp-based syscall interception; and improved container migration through the userfaultfd patchsets.
Last year's success has prompted us to reprise the microconference this year. Topics we would like to cover include:
- Agreeing on an upstreamable approach to shiftfs
- Securing containers by rethinking parts of ptrace access permissions, restricting or removing the ability to re-open file descriptors through procfs with higher permissions than they were originally created with, and, in general, making procfs more secure or restricted
- Adoption and transition of cgroup v2 in container workloads
- Upstreaming the time namespace patchset
- Adding a new clone syscall
- Adoption and improvement of the new mount and pidfd APIs
- Improving the state of userfaultfd and its adoption in container runtimes
- Speeding up container live migration
- Address space separation for containers
- More to be added based on CfP for this microconference
If you are interested in participating in this microconference and have topics to propose, please use the CfP process.
CRIU restores processes only with the same PIDs they had at checkpoint time. Because there is no interface for creating a process with a specific PID, such as a fork_with_pid(), CRIU does the PID dance to restore each process with the same PID as before checkpointing.
The PID dance consists of
write()ing PID-1 to...
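As a rough illustration of the dance, here is a minimal Python sketch. The abstract is cut off before it names the write target, so the `ns_last_pid` sysctl path used here is an assumption, and `last_pid_value()` and `fork_with_pid()` are hypothetical helper names, not CRIU functions. Writing to the sysctl needs elevated privileges, and the approach is racy: another task may take the PID between the write and the fork, so the caller has to check the result.

```python
import os

# Assumption: the abstract elides the write target; this sysctl is a guess
# based on how the technique is commonly described, not taken from the text.
NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"


def last_pid_value(target_pid: int) -> bytes:
    """Return the bytes to write so the next allocated PID is target_pid."""
    if target_pid < 2:
        raise ValueError("target PID must be at least 2")
    return str(target_pid - 1).encode()


def fork_with_pid(target_pid: int, path: str = NS_LAST_PID) -> int:
    """Hypothetical helper: fork a child that should receive target_pid.

    Returns like os.fork(): 0 in the child, the child's PID in the parent.
    The parent must compare the returned PID with target_pid, because the
    write-then-fork sequence is not atomic.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, last_pid_value(target_pid))
    finally:
        os.close(fd)
    return os.fork()
```

A caller would treat a mismatch between the returned PID and `target_pid` as a failed attempt and either retry or abort the restore.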
Containers are generally perceived as less secure than virtual
machines. Without going into a theological argument about the actual
state of affairs, we suggest exploring the possibility of using
address space isolation inside the kernel to make containers even more
secure.
Assuming that kernel bugs, and therefore vulnerabilities, are inevitable,
it is worth isolating parts of the kernel to...
The kernel recently gained seccomp support for SECCOMP_RET_USER_NOTIF, which enables a process (the watchee) to retrieve an fd for its seccomp filter. This fd can then be handed to another, usually more privileged, process (the watcher). The watcher can then receive seccomp messages about the syscalls performed by the watchee.
We have integrated this feature into userspace and...
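The hand-off of the fd from watchee to watcher can be sketched in Python (3.9 or later). This is an illustration under stated assumptions, not the integration the abstract refers to: it assumes the fd is passed over a Unix domain socket with SCM_RIGHTS, which is one common way to hand an fd to another process, and it uses the write end of a pipe as a stand-in for the real seccomp notification fd. `send_fd()`, `recv_fd()` and `demo()` are hypothetical names.

```python
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Watchee side: pass one fd over a Unix socket (SCM_RIGHTS)."""
    socket.send_fds(sock, [b"x"], [fd])


def recv_fd(sock: socket.socket) -> int:
    """Watcher side: receive one fd; the returned fd is a new duplicate."""
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, 1)
    if not fds:
        raise RuntimeError("no file descriptor received")
    return fds[0]


def demo() -> bytes:
    """Pass a pipe's write end (stand-in for the notify fd) and use it."""
    watchee_sock, watcher_sock = socket.socketpair(
        socket.AF_UNIX, socket.SOCK_STREAM
    )
    read_end, write_end = os.pipe()
    try:
        send_fd(watchee_sock, write_end)
        received = recv_fd(watcher_sock)
        try:
            # The watcher now holds its own reference to the same open file.
            os.write(received, b"notified")
        finally:
            os.close(received)
        return os.read(read_end, 64)
    finally:
        os.close(read_end)
        os.close(write_end)
        watchee_sock.close()
        watcher_sock.close()
```

In a real setup the two socket ends would live in different processes, and the watcher would read notifications from the received fd rather than write to it.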
Over the last year we have worked on expanding task migration with CRIU at Google. The talk will discuss cases in which the kernel interfaces are lacking for the purpose of migration:
- Lack of support for reading the rseq configuration, which means that migrating rseq users properly requires userspace support.
- Lack of support for reading what cgroup events the users have...
Container runtimes, engines and orchestrators provide a production-grade, robust, high-performing, but also relatively self-managing, self-healing infrastructure using innovative open-source technologies.
CRIU allows the running state of containerised applications to be preserved as a collection of files that can be used to create an equivalent copy of the applications at a later time, and...
The Linux kernel has recently acquired a new API for creating mounts. This allows a greater range of parameters and parameter values to be specified, including, in the future, container-relevant information such as the namespaces that a mount should use.
Future developments of this API also need to work out how to deal with upcalling from the kernel to gain parameters not directly supplied,...
Since Canonical is now shipping it, I think we can all agree it solves a problem and we just need to get the patches into shape for upstream submission. Can we discuss a pathway for doing that?
Userspace has (for a long time) needed a mechanism to restrict path resolution. Obvious examples are FTP servers, web servers, archiving utilities, and now container runtimes. While the fundamental issue with privileged container runtimes opening paths within an untrusted rootfs has been known for many years, the recent CVEs (CVE-2018-15664 and CVE-2019-10152 being the most recent)...
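To make the problem concrete, here is a minimal Python sketch of the kind of purely userspace check a runtime might attempt: resolve the path, then verify that the result still lies under the rootfs. `resolves_inside()` is a hypothetical helper, not code from any runtime. The check is inherently racy, because the rootfs can change between the check and the later open, and it simply rejects absolute symlinks rather than reinterpreting them relative to the rootfs.

```python
import os


def resolves_inside(root: str, untrusted: str) -> bool:
    """Return True if root/untrusted resolves to a path inside root.

    Check-then-use only: a symlink swapped in after this returns can still
    redirect a later open() outside of root.
    """
    real_root = os.path.realpath(root)
    # Treat absolute untrusted paths as relative to the rootfs.
    joined = os.path.join(real_root, untrusted.lstrip("/"))
    candidate = os.path.realpath(joined)
    return os.path.commonpath([real_root, candidate]) == real_root
```

A symlink inside the rootfs that points at the host's `/`, or a path containing `..` components that climb out of the rootfs, makes the function return False.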
The kernel contains a keyrings facility for handling tokens for filesystems and other kernel services to use. These are frequently disabled for container environments, however, because they were not made namespace aware by the authors of the user-namespace and others.
Unfortunately, this lack prevents various things from working inside containers. To get around this, keys are now being...
We have cgroup v1 users who want to switch to cgroup v2, but there
currently isn't an upstream migration story for them. (Previous
LPC talks have focused on the issues of migrating from v1 to v2, but
no substantial upstream solution has come to fruition.)
The goal of this talk is to discuss the cgroup v1 to v2 migration
path and gauge community interest in a cgroup v1/v2...
We have a number of unsolved time- and vDSO-related issues in CRIU.
- Syscall restart: if checkpointing a task interrupted a syscall, on restore CRIU blindly restarts the syscall (executing the SYSCALL/SYSENTER/INT80/etc. instruction with the original register set). This works OKish, but not for time-blocking syscalls, e.g. poll(), nanosleep(), futex(), etc. For this purpose, glibc and vDSO use...