This matches the current kube behavior. This will probably
be provided over the CRI eventually, at which point we won't have to
define a constant in cri-o code.
Signed-off-by: Mrunal Patel <mpatel@redhat.com>
If we get a kubelet annotation about the sandbox trust level, we use it
to toggle our sandbox trust flag.
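A minimal sketch of that toggle, assuming a hypothetical
"io.kubernetes.cri-o.TrustedSandbox" annotation key (the exact key
lives in the annotations code):

    // isTrusted sketches the hint handling: an explicit kubelet
    // annotation wins, otherwise we fall back to the default.
    func isTrusted(annotations map[string]string, trustedByDefault bool) bool {
        if v, ok := annotations["io.kubernetes.cri-o.TrustedSandbox"]; ok {
            return v == "true"
        }
        return trustedByDefault
    }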
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Container runtimes provide different levels of isolation, from kernel
namespaces to hardware virtualization. When starting a specific
container, one may want to decide which level of isolation to use
depending on how much the container workload is trusted. Fully verified
and signed containers may not need the hardware isolation layer, but
CI jobs pulling packages from many untrusted sources, for example,
should probably not rely on kernel namespace isolation alone.
Here we allow CRI-O users to define one container runtime for trusted
containers and another one for untrusted containers, and also to define
a general, default trust level. This anticipates future kubelet
implementations that would be able to tag containers as trusted or
untrusted. When the kubelet hint is missing, containers are trusted by
default.
A container becomes untrusted if we get a hint in that direction from
kubelet, or if the default trust level is set to "untrusted" and the
container is not privileged. In both cases CRI-O will try to use the
untrusted container runtime. In all other cases, it will switch to the
trusted one.
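A sketch of the resulting runtime selection, with illustrative config
field names (Runtime for the trusted runtime, RuntimeUntrustedWorkload
for the untrusted one, DefaultWorkloadTrust for the general level):

    type config struct {
        Runtime                  string // trusted runtime binary
        RuntimeUntrustedWorkload string // untrusted runtime binary
        DefaultWorkloadTrust     string // "trusted" or "untrusted"
    }

    // runtimePath picks the untrusted runtime when kubelet hinted that
    // the sandbox is untrusted, or when the default trust level is
    // "untrusted" and the container is not privileged.
    func runtimePath(c config, trusted, privileged bool) string {
        if (!trusted || (c.DefaultWorkloadTrust == "untrusted" && !privileged)) &&
            c.RuntimeUntrustedWorkload != "" {
            return c.RuntimeUntrustedWorkload
        }
        return c.Runtime
    }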
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We use a SOCK_SEQPACKET socket for the attach unix domain socket, which
means the kernel will ensure that the reading side only ever gets the
data from one write operation. We use this for framing, where the
first byte identifies the pipe that the following bytes are for. We have
to make sure that all reads from the socket use a buffer at least as
large as the one used on the write side, because otherwise the extra
data in the message will be dropped.
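A sketch of the reading side, with illustrative pipe byte values and
buffer size (the real constants live in the attach code):

    const (
        pipeStdout byte = 2 // illustrative framing byte values
        pipeStderr byte = 3
    )

    // readAttachFrames relies on the buffer being at least as large as
    // the biggest write on the other end: with SOCK_SEQPACKET, message
    // bytes that do not fit into the read buffer are silently dropped.
    func readAttachFrames(conn *net.UnixConn, stdout, stderr io.Writer) error {
        buf := make([]byte, 8192)
        for {
            n, err := conn.Read(buf)
            if err != nil {
                return err
            }
            if n < 2 {
                continue // framing byte only, no payload
            }
            switch buf[0] {
            case pipeStdout:
                stdout.Write(buf[1:n])
            case pipeStderr:
                stderr.Write(buf[1:n])
            }
        }
    }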
This also adds a stdin pipe for the container, similar to the ones we
use for stdout/err, because we need a way for an attached client
to write to stdin, even if not using a tty.
This fixes https://github.com/kubernetes-incubator/cri-o/issues/569
Signed-off-by: Alexander Larsson <alexl@redhat.com>
Some runtimes, like Clear Containers, need to interpret the CRI-O
annotations to distinguish the infra container from regular ones.
Here we export those annotations and use a more standard dotted
namespace for them.
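Roughly, the exported names could look like this (illustrative; the
exact keys live in the annotations package):

    const (
        // ContainerType distinguishes the infra (sandbox) container
        // from regular containers.
        ContainerType = "io.kubernetes.cri-o.ContainerType"
        // SandboxName records which sandbox a container belongs to.
        SandboxName = "io.kubernetes.cri-o.SandboxName"
    )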
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The ocid project was renamed to CRI-O months ago; it is time that we moved
all of the code to the new name. We want to eliminate the name ocid from
use and move fully to crio.
Also, cric is being renamed to crioctl for the time being.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
Two issues:
1) pod Namespace was always set to "", which prevents plugins from figuring out
what the actual pod is, and from getting more info about that pod from the
runtime via out-of-band mechanisms
2) the pod Name and ID arguments were switched, further preventing #1
Signed-off-by: Dan Williams <dcbw@redhat.com>
When RunPodSandbox fails after calling s.addSandbox(sb),
we're left with a sandbox in s.state.sandboxes even though the
sandbox was never created.
We fix that by adding removeSandbox() to the deferred cleanup
call.
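A self-contained sketch of the pattern (simplified types; the real
code lives in the server package):

    type sandbox struct{ id string }

    type server struct{ sandboxes map[string]*sandbox }

    func (s *server) addSandbox(sb *sandbox)  { s.sandboxes[sb.id] = sb }
    func (s *server) removeSandbox(id string) { delete(s.sandboxes, id) }

    // runPodSandbox uses a named return so the deferred cleanup can
    // undo addSandbox() whenever a later creation step fails.
    func (s *server) runPodSandbox(sb *sandbox) (err error) {
        s.addSandbox(sb)
        defer func() {
            if err != nil {
                s.removeSandbox(sb.id)
            }
        }()
        // ... creation steps that may fail and set err ...
        return nil
    }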
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Because kubelet will create broken symlinks for logPath, it is necessary
to remove those symlinks before we attempt to write to them. This is a
temporary workaround while the issue is fixed upstream.
Ref: https://issues.k8s.io/44043
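The workaround amounts to something like this (sketch, with a
hypothetical helper name):

    // removeDanglingLog drops a possibly broken symlink that kubelet
    // pre-created at logPath, so a later create/write cannot fail on it.
    func removeDanglingLog(logPath string) error {
        if err := os.Remove(logPath); err != nil && !os.IsNotExist(err) {
            return err
        }
        return nil
    }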
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This adds a very simple implementation of logging within conmon, where
every buffer read from the masterfd of the container is also written to
the log file (with errors during writing to the log file ignored).
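conmon itself is written in C; in Go terms, the loop amounts to roughly
this (sketch only):

    // pump copies every buffer read from the master fd to the container
    // stdio pipe and tees it into the log file; log write errors are
    // deliberately ignored so logging can never break the container I/O.
    func pump(masterFd io.Reader, out, logFile io.Writer) error {
        buf := make([]byte, 8192)
        for {
            n, err := masterFd.Read(buf)
            if n > 0 {
                out.Write(buf[:n])
                logFile.Write(buf[:n]) // errors ignored on purpose
            }
            if err != nil {
                return err
            }
        }
    }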
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Interleaving asynchronous updates with pod or container creations can
lead to unrecoverable races and corruption of the pod and container hash
tables. This is fixed by serializing updates against pod and container
creation operations, while pod and container creation operations can
still run in parallel with each other.
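A sketch of the locking scheme, assuming an RWMutex on the server
(illustrative names):

    type stateServer struct {
        updateLock sync.RWMutex
    }

    // update takes the write lock, so it never interleaves with a
    // creation that is still in flight.
    func (s *stateServer) update() {
        s.updateLock.Lock()
        defer s.updateLock.Unlock()
        // ... refresh the pod and container hash tables ...
    }

    // createContainer takes the read lock, so creations can still run
    // in parallel with one another.
    func (s *stateServer) createContainer() {
        s.updateLock.RLock()
        defer s.updateLock.RUnlock()
        // ... create the pod or container ...
    }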
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We want new sandboxes to be added to the sandbox hash table before
adding their ID to the pod Index registrar, in order to avoid potential
Update() races.
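In code, the ordering is simply (sketch, illustrative names):

    // Publish the sandbox to the hash table before the ID registrar,
    // so an Update() that resolves the ID always finds the sandbox.
    s.addSandbox(sb)
    if err := s.podIDIndex.Add(sb.id); err != nil {
        return err
    }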
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When we get a pod with DNS settings, we need to build
a resolv.conf file and mount it in all pod containers.
In order to do that, we have to track the built resolv.conf
file and store/load it.
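A sketch of building the file contents from the pod's DNS settings
(hypothetical helper):

    // buildResolvConf renders the pod DNS config into resolv.conf form.
    func buildResolvConf(searches, servers, options []string) string {
        var buf bytes.Buffer
        if len(searches) > 0 {
            fmt.Fprintf(&buf, "search %s\n", strings.Join(searches, " "))
        }
        for _, s := range servers {
            fmt.Fprintf(&buf, "nameserver %s\n", s)
        }
        if len(options) > 0 {
            fmt.Fprintf(&buf, "options %s\n", strings.Join(options, " "))
        }
        return buf.String()
    }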
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We have moved SELinux support out of opencontainers/runc into its
own package. This patch moves to using the new SELinux Go bindings.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
The pause container is creating an AVC since the /dev/null device
is not labeled correctly. It looks like we were only setting the label
of the process, not the label of the content inside of the container.
This change labels content in the pause container correctly and
eliminates the AVC.
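The fix boils down to relabeling the content, not just the process;
with the SELinux Go bindings that is roughly (sketch):

    import "github.com/opencontainers/selinux/go-selinux/label"

    // relabelDevNull applies the container mount label to the device
    // node content so access no longer raises an AVC.
    func relabelDevNull(path, mountLabel string) error {
        return label.Relabel(path, mountLabel, false /* not shared */)
    }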
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
The sandbox privileged flag is set to true only if either the
pod configuration privileged flag is set to true or any of the
pod namespaces are the host ones.
A container inherits its privileged flag from its sandbox, and
will be run by the privileged runtime only if that flag is set to true.
In other words, the privileged runtime (when defined) will be used
when one of the conditions below is true (see the sketch after the
list):
- The sandbox will be asked to run at least one privileged container.
- The sandbox requires access to either the host IPC or networking
namespaces.
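A sketch of both rules, with illustrative field names:

    type podSecurity struct {
        privileged                bool
        hostNet, hostPID, hostIPC bool
    }

    // sandboxPrivileged implements the sandbox rule; containers then
    // simply inherit the resulting flag from their sandbox.
    func sandboxPrivileged(p podSecurity) bool {
        return p.privileged || p.hostNet || p.hostPID || p.hostIPC
    }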
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Use containers/storage to store images, pod sandboxes, and containers.
A pod sandbox's infrastructure container has the same ID as the pod to
which it belongs, and all containers also keep track of their pod's ID.
The container configuration that we build using the data in a
CreateContainerRequest is stored in the container's ContainerDirectory
and ContainerRunDirectory.
We catch SIGTERM and SIGINT, and when we receive either, we gracefully
exit the grpc loop. If we also think that there aren't any container
filesystems in use, we attempt to do a clean shutdown of the storage
driver.
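A sketch of the signal path, assuming a *grpc.Server and a storage
shutdown hook (illustrative names):

    func serveUntilSignalled(gs *grpc.Server, shutdownStorage func() error) {
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGTERM, syscall.SIGINT)
        go func() {
            <-sig
            gs.GracefulStop() // gracefully exits the grpc loop
        }()
        // ... gs.Serve(lis) runs here and returns after GracefulStop ...

        // If no container filesystems appear to be in use, attempt a
        // clean shutdown of the storage driver; errors are non-fatal.
        _ = shutdownStorage()
    }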
The test harness now waits for ocid to exit before attempting to delete
the storage root directory.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
With the networking namespace code added, we were reaching a
gocyclo complexity of 52. By moving the container creation and
starting code path out, we're back to reasonable levels.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order for hypervisor-based container runtimes to be able to
fully prepare their pod virtual machines' networking interfaces,
this patch sets up the pod networking namespace before creating the
sandbox container.
Once the sandbox networking namespace is prepared, the runtime
can scan the networking namespace interfaces and build the pod VM
matching interfaces (typically TAP interfaces) at pod sandbox
creation time. Not doing so means those runtimes would have to
rely on all hypervisors to support networking interface hotplug.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Because they need to prepare the hypervisor networking interfaces
and have them match the ones created in the pod networking
namespace (typically to bridge TAP and veth interfaces),
hypervisor-based container runtimes need the sandbox pod networking
namespace to be set up before the sandbox itself is created. They can
then prepare and start
the hypervisor interfaces when creating the pod virtual machine.
In order to do so, we need to create per-pod persistent networking
namespaces that we pass to the CNI plugin. This patch leverages
the CNI ns package to create such namespaces under /var/run/netns,
and assign them to all pod containers.
The persistent namespace is removed when either the pod is stopped
or removed.
Since the StopPodSandbox() API can be called multiple times from
kubelet, we track the pod networking namespace state (closed or
not) so that we don't get a containernetworking/ns package error
when calling its Close() routine multiple times as well.
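A sketch of the idempotent close, assuming a handle from the
containernetworking ns package:

    type podNetNs struct {
        netNs  ns.NetNS // handle created under /var/run/netns
        closed bool
    }

    // close tolerates being called from both StopPodSandbox and
    // RemovePodSandbox: only the first call reaches ns.Close().
    func (p *podNetNs) close() error {
        if p.closed {
            return nil
        }
        if err := p.netNs.Close(); err != nil {
            return err
        }
        p.closed = true
        return nil
    }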
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
VM-based container runtimes (e.g. Clear Containers) will run each pod
in a VM and will create containers within that pod VM. Unfortunately
those runtimes will get called by ocid with the same commands
(create and start) for both the pause containers and subsequent
containers to be added to the pod namespace. Unless they work around
that by e.g. inferring that a container whose rootfs is under
"/pause" represents a pod, they have no way to decide if they
need to create/start a VM or if they need to add a container to an
already running VM pod.
This patch tries to formalize this difference through pod
annotations. When starting a container or a sandbox, we now add two
annotations for the container type (infrastructure or not) and the
sandbox name. This will allow VM-based container runtimes to handle
two things (see the sketch after this list):
- Decide if they need to create a pod VM or not.
- Keep track of which pod ID runs in a given VM, so that they
know to which sandbox they have to add containers.
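A sketch of how a VM-based runtime could consume these annotations
(illustrative key and value strings):

    func onCreate(annotations map[string]string) {
        switch annotations["io.kubernetes.cri-o.ContainerType"] {
        case "sandbox":
            // Infrastructure container: create and boot the pod VM.
        case "container":
            // Regular container: hot-add it to the running VM that the
            // sandbox name annotation points at.
            _ = annotations["io.kubernetes.cri-o.SandboxName"]
        }
    }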
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>