This moves the timeout handling from the go code to conmon, whic
removes some of the complexity from criod, and additionally it will
makes it possible to do the double-fork in the exec case too.
Signed-off-by: Alexander Larsson <alexl@redhat.com>
We need to do this, because otherwise we will continue and exit the
pid before systemd has a chance to look at it.
Signed-off-by: Alexander Larsson <alexl@redhat.com>
This patch also hides the profile under the debug flag as there's
runtime cost to enable the profiler.
This removes the old way of profiling (CPU) as that's not really
needed.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
This patch isn't adding a test for /etc/hosts as that requires host
network and we don't want to play with host's /etc/hosts when running
make localintegration on our laptops. That may change in the future
moving to some sort of in-container testing.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Currently, when creating containers we never call Wait on the
conmon exec.Command, which means that the child hangs around
forever as a zombie after it dies.
However, instead of doing this waitpid() in the parent we instead
do a double-fork in conmon, to daemonize it. That makes a lot of
sense, as conmon really is not tied to the launcher, but needs
to outlive it if e.g. the cri-o daemon restarts.
However, this makes even more obvious a race condition which we
already have. When crio-d puts the conmon pid in a cgroup there
is a race where conmon could already have spawned a child, and
it would then not be part of the cgroup. In order to fix this
we add another synchronization pipe to conmon, which we block
on before we create any children. The parent then makes sure the
pid is in the cgroup before letting it continue.
Signed-off-by: Alexander Larsson <alexl@redhat.com>
If we get a kubelet annotation about the sandbox trust level, we use it
to toggle our sandbox trust flag.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Container runtimes provide different levels of isolation, from kernel
namespaces to hardware virtualization. When starting a specific
container, one may want to decide which level of isolation to use
depending on how much we trust the container workload. Fully verified
and signed containers may not need the hardware isolation layer but e.g.
CI jobs pulling packages from many untrusted sources should probably not
run only on a kernel namespace isolation layer.
Here we allow CRI-O users to define a container runtime for trusted
containers and another one for untrusted containers, and also to define
a general, default trust level. This anticipates future kubelet
implementations that would be able to tag containers as trusted or
untrusted. When missing a kubelet hint, containers are trusted by
default.
A container becomes untrusted if we get a hint in that direction from
kubelet or if the default trust level is set to "untrusted" and the
container is not privileged. In both cases CRI-O will try to use the
untrusted container runtime. For any other cases, it will switch to the
trusted one.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We use a SOCK_SEQPACKET socket for the attach unix domain socket, which
means the kernel will ensure that the reading side only ever get the
data from one write operation. We use this for frameing, where the
first byte is the pipe that the next bytes are for. We have to make sure
that all reads from the socket are using at least the same size of buffer
as the write side, because otherwise the extra data in the message
will be dropped.
This also adds a stdin pipe for the container, similar to the ones we
use for stdout/err, because we need a way for an attached client
to write to stdin, even if not using a tty.
This fixes https://github.com/kubernetes-incubator/cri-o/issues/569
Signed-off-by: Alexander Larsson <alexl@redhat.com>
we were blindly applying RO mount options but net addons like calico
modify those files.
This patch sets RO only when container's rootfs is RO, same behavior as
docker.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
tmpfs'es can override whatever there's on the container rootfs. We just
mkdir the volume as we're confident kube manages volumes in container.
We don't need any tmpfs nor any complex volume handling for now.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>