We use a SOCK_SEQPACKET socket for the attach unix domain socket, which
means the kernel will ensure that the reading side only ever get the
data from one write operation. We use this for frameing, where the
first byte is the pipe that the next bytes are for. We have to make sure
that all reads from the socket are using at least the same size of buffer
as the write side, because otherwise the extra data in the message
will be dropped.
This also adds a stdin pipe for the container, similar to the ones we
use for stdout/err, because we need a way for an attached client
to write to stdin, even if not using a tty.
This fixes https://github.com/kubernetes-incubator/cri-o/issues/569
Signed-off-by: Alexander Larsson <alexl@redhat.com>
we were blindly applying RO mount options but net addons like calico
modify those files.
This patch sets RO only when container's rootfs is RO, same behavior as
docker.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
tmpfs'es can override whatever there's on the container rootfs. We just
mkdir the volume as we're confident kube manages volumes in container.
We don't need any tmpfs nor any complex volume handling for now.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Vendor and use docker/pkg/pools.
pools are used to lower the number of memory allocations and reuse buffers when
processing large streams operations..
The use of pools.Copy avoids io.Copy's internal buffer allocation.
This commit replaces io.Copy with pools.Copy to avoid the allocation of
buffers in io.Copy.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
This was cluttering the logs on my clusters. The log should be just in
debug mode as we do for every request/response flow.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
This is an optimization of our image pull code path. It's basically
how docker handles pulls as well. Let's be smart and check the image in
pull code path as well.
This also matches docker behavior which first checks whether we're
allowed to actually pull an image before looking into local storage.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
This patch fixes the following command:
kubectl run -i --tty centos --image=centos -- sh
The command above use to fail with:
/usr/bin/sh: /usr/bin/sh: cannot execute binary file
That's because we were wrongly assembling the OCI processArgs.
Thanks @alexlarsson for spotting this.
This patch basically replicates what docker does when merging container
config and image config. It also replicates how docker sets processArgs
for the OCI runtime.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
A goroutine is started to forward terminal resize requests
from the resize channel. Also, data is copied back/forth
between stdin, stdout, stderr streams and the attach socket
for the container.
Signed-off-by: Mrunal Patel <mpatel@redhat.com>
The bug is silly if you have a master/node cluster where node is on a
different machine than the master.
The current behavior is to give our addresses like "0.0.0.0:10101". If
you run "kubectl exec ..." from another host, that's not going to work
since on a different host 0.0.0.0 resolves to localhost and kubectl
exec fails with:
error: unable to upgrade connection: 404 page not found
This patch fixes the above by giving our correct addresses for reaching
from outside.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Some runtimes like Clear Containers need to interpret the CRI-O
annotations, to distinguish the infra container from the regular one.
Here we export those annotations and use a more standard dotted
namespace for them.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
node-e2e tests were failing in RHEL because, if running a privileged
container, we get all capability in the spec. The spec generator wasn't
filtering caps based on actual host caps, it was just adding _everything_.
This patch makes spec generator host specific.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
`containerdID` is overridden in `s.ctrIDIndex.Get()`, if the ctr is not
found it's overridden by an empty string making the error return
totally unusable.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
During "Port forwarding" e2e tests, the following panic happened:
```
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x64981d]
goroutine 52788 [running]:
panic(0x1830ee0, 0xc4200100c0)
/usr/lib/golang/src/runtime/panic.go:500 +0x1a1
github.com/kubernetes-incubator/cri-o/oci.(*Runtime).UpdateStatus(0xc4202afc00,
0x0, 0x0, 0x0)
/home/amurdaca/go/src/github.com/kubernetes-incubator/cri-o/oci/oci.go:549
+0x7d
github.com/kubernetes-incubator/cri-o/server.streamService.PortForward(0xc42026e000,
0x0, 0x0, 0x0, 0x0, 0xc420d9af40, 0x40, 0xc400000050, 0x7fe660659a28,
0xc4201cd0e0, ...)
```
The issue is `streamService.PortForward` assumed the first argument to
be the sandbox's infra container ID, thus trying to get it from memory
store using `.state.containers.Get`. Since that ID is of the sandbox
itself, it fails to get the container object from memory and panics in
`UpdateStatus`.
Fix it by looking for the sandbox's infra container ID starting from a
sandbox ID.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
If we create a container using the image ID like
771cd5947d5ea4bf8e8f4900dd357dbb67e7b16486c270f8274087d182d457c6, then
a call to container_status will return that same ID for the "Image"
field in ContainerStatusResponse.
This patch matches dockershim behavior and return the first tagged name
if available from the image store.
This is also needed to fix a failure in k8s e2d tests.
Reference:
https://github.com/kubernetes/kubernetes/pull/39298/files#diff-c7dd39479fd733354254e70845075db5R369
Reference:
67a5bf8454/test/e2e/framework/util.go (L1941)
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
The ocid project was renamed to CRI-O, months ago, it is time that we moved
all of the code to the new name. We want to elminate the name ocid from use.
Move fully to crio.
Also cric is being renamed to crioctl for the time being.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>
Two issues:
1) pod Namespace was always set to "", which prevents plugins from figuring out
what the actual pod is, and from getting more info about that pod from the
runtime via out-of-band mechanisms
2) the pod Name and ID arguments were switched, further preventing #1
Signed-off-by: Dan Williams <dcbw@redhat.com>
Reusing k8s existing implementation. This could be re-written
to do port-forwarding natively rather than relying on socat/nsenter.
Signed-off-by: Mrunal Patel <mrunalp@gmail.com>
When powering off the system, we want the ocid service, to shutdown
all containers running on the system so they can cleanup properly
This patch will cleanup all pods on poweroff.
The ocid-shutdown.service drops a file /var/run/ocid.shutdown when the system
is shutting down. The ocid-shutdown.service should only be executed at system
shutdown.
On bootup sequence should be
start ocid.service
start ocid-shutdown.service (This is a NO-OP)
On system shutdown
stop ocid-shutdown.service (Creates /var/run/ocid.shutdown)
stop ocid.service (Notices /var/run/ocid.service and stops all pods before exiting.)
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Fix the following upstream k8s's e2e-node test:
```
should be able to pull from private registry with secret [Conformance]
```
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
When RunPodSandbox fails after calling s.addSandbox(sb),
we're left with a sandbox in s.state.sandboxes while the
sandbox is not created.
We fix that by adding removeSandbox() to the deferred cleanup
call
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
This patch prevents k8s's e2e_node tests from killing CRI-O (because of
a panic in marshaling nil responses). This will ensure tests keep
running and just logging the failure.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Because kubelet will create broken symlinks for logPath it is necessary
to remove those symlinks before we attempt to write to them. This is a
temporary workaround while the issue is fixed upstream.
Ref: https://issues.k8s.io/44043
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The main purpose of these tests is to make sure that the log actually
contains output from the container. We don't test the timestamps or the
stream that's stated at the moment.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
This adds a very simple implementation of logging within conmon, where
every buffer read from the masterfd of the container is also written to
the log file (with errors during writing to the log file ignored).
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Interleaving asynchronous updates with pod or container creations can
lead to unrecoverable races and corruptions of the pod or container hash
tables. This is fixed by serializing update against pod or container
creation operations, while pod and container creation operations can
run in parallel.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Now that the image package has fixes to support docker images v2s1,
we can remove our buildOCIProcessARgs() hack for empty image configs
and simplify this routine.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
We want new sandboxes to be added to the sandbox hash table before
adding their ID to the pod Index registrar, in order to avoid potential
Update() races.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
After change in `Makefile` at #304 where `PREFIX`
has changed from `/usr` to `/usr/local` these changes
has to be reflected in default `conmonPath`.
Signed-off-by: Suraj Deshmukh <surajssd009005@gmail.com>
When a pod sandbox comes with DNS settings, the resulting resolv.conf
file needs to be bind mounted in all pod containers under
/etc/resolv.conf.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When we get a pod with DNS settings, we need to build
a resolv.conf file and mount it in all pod containers.
In order to do that, we have to track the built resolv.conf
file and store/load it.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
containers/storage is defaulting to /var/lib/containers/storage
for image and containers storage. It is also defaulting to
/var/run/containers/storage for all runtime. The defaults
for CRI-O should match so that lots of other tools that use
containers/storage can share the same storage.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
We have moved selinux support out of opencontainers/runc into its
own package. This patch moves to using the new selinux go bindings.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
We need to support a 2x2 matrix of use cases with both
kubelet giving us (command, args) slices and the OCI
image config file giving us (ENTRYPOINT, CMD) slices.
Here we always prioritize the kubelet information over
the OCI image one, and use the latter when the former
is incomplete.
Not that this routine will be slightly simpler when
issue #395 is fixed.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
When starting pods or containers, we create the mount points
first. It seems natural to do something symetrical when stopping
pods or containers, i.e. removing the mount point at last.
Also, the current logic may not work with VM based containers as the
hypervisor may hold a reference on the mount point while we're trying to
remove them.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The way we build the OCI Process Args slice is incorrect.
With the current implementation we may for example end up building this
slice with only the entry point arguments, if the kubelet passed
information is missing the Command slice.
We also will end up building the Args slice with the Image config
process arguments, without the defined entry point, if kubelet does not
tell us anything about the container process command to be run.
This patch fixes that by favoring the kubelet ContainerConfig
information. If that is missing, we try to complete it with the
container image information. We always use ContainerConfig.Command[] or
ImageConfig.EntryPoint[] as the first OCI Process Args slice entries.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The pause container is creating an AVC since the /dev/null device
is not labeled correctly. Looks like we are only setting the label of
the process not the label of the content inside of the container.
This change will label content in the pause container correctly and
eliminate the AVC.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
The sandbox privileged flag is set to true only if either the
pod configuration privileged flag is set to true or when any
of the pod namespaces are the host ones.
A container inherit its privileged flag from its sandbox, and
will be run by the privileged runtime only if it's set to true.
In other words, the privileged runtime (when defined) will be
when one of the below conditions is true:
- The sandbox will be asked to run at least one privileged container.
- The sandbox requires access to either the host IPC or networking
namespaces.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
We add a privileged flag to the container and sandbox structures
and can now select the appropriate runtime path for any container
operations depending on that flag.
Here again, the default runtime will be used for non privileged
containers and for privileged ones in case there are no privileged
runtime defined.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Not all runtimes are able to handle some of the kubelet
security context options, in particular the ones granting
host privileges to containers.
By adding a host privileged runtime path configuration, we
allow ocid to use a different runtime for host privileged
operations like e.g. host namespaces access.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
kubelet sends a request to create a container with an image ID (as
opposed as an image name). That ID comes from the ImageStatus response.
This patch fixes that by setting the image ID as well as the image name
and fix the login to lookup for image ID as well.
Found while running `make test-e2e-node`.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
Previously ocicni did not have support for setting the plugin directory.
Now that it has grown support for it, use it to actually respect the
setting a user has provided for ocid.network.* options.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
The CRI doesn't expect us to implicitly pull an image if it isn't
already present before we're asked to use it to create a container, and
the tests no longer depend on us doing so, either.
Limit the logic which attempts to pull an image, if it isn't present, to
only pulling the configured "pause" image, since our use of that image
for running pod sandboxes is an implementation detail that our clients
can't be expected to know or care about. Include the name of the image
that we didn't pull in the error we return when we don't pull one.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
Use containers/storage to store images, pod sandboxes, and containers.
A pod sandbox's infrastructure container has the same ID as the pod to
which it belongs, and all containers also keep track of their pod's ID.
The container configuration that we build using the data in a
CreateContainerRequest is stored in the container's ContainerDirectory
and ContainerRunDirectory.
We catch SIGTERM and SIGINT, and when we receive either, we gracefully
exit the grpc loop. If we also think that there aren't any container
filesystems in use, we attempt to do a clean shutdown of the storage
driver.
The test harness now waits for ocid to exit before attempting to delete
the storage root directory.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
Vendor updated containers/image and containers/storage, along
with any new dependencies they drag in, and updated versions of other
dependencies that happen to get pulled in.
github.com/coreos/go-systemd/daemon/SdNotify() now takes a boolean to
control whether or not it unsets the NOTIFY_SOCKET variable from the
calling process's environment. Adapt.
github.com/opencontainers/runtime-tools/generate/Generator.AddProcessEnv()
now takes the environment variable name and value as two arguments, not
one. Adapt.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
The default configuration can only be accessed from the cmd/server
package, which cannot be imported (since it's a "package main").
This change promotes DefaultConfig() to the "server" package.
Closes: #315
Signed-off-by: Jonathan Yu <jawnsy@redhat.com>
When removing a pod sandbox or container, remove the ID of the item from
the corresponding ID index, so that we can correctly determine if it was
us or another actor that cleaned them up.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
We add 2 ocid options for choosing the CNI configuration and plugin
binaries directories: --cni-config-dir and --cni-plugin-dir.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Update the versions of containers/storage and containers/image, and add
new dependencies that they pull in.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
ns.Close() will not remove and unmount the networking namespace
if it's not currently marked as mounted.
When we restore a sandbox, we generate the sandbox netns from
ns.GetNS() which does not mark the sandbox as mounted.
There currently is a PR open to fix that in the ns package:
https://github.com/containernetworking/cni/pull/342
but meanwhile this patch fixes a netns leak when restoring a pod.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order to workaround a bug introduced with runc commit bc84f833,
we create a symbolic link to our permanent networking namespace so
that runC realizes that this is not the host namespace.
Although this bug is now fixed upstream (See commit f33de5ab4), this
patch works with pre rc3 runC versions.
We may want to revert that patch once runC 1.0.0 is released.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
With the networking namespace code added, we were reaching a
gocyclo complexitiy of 52. By moving the container creation and
starting code path out, we're back to reasonable levels.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
In order for hypervisor based container runtimes to be able to
fully prepare their pod virtual machines networking interfaces,
this patch sets the pod networking namespace before creating the
sandbox container.
Once the sandbox networking namespace is prepared, the runtime
can scan the networking namespace interfaces and build the pod VM
matching interfaces (typically TAP interfaces) at pod sandbox
creation time. Not doing so means those runtimes would have to
rely on all hypervisors to support networking interfaces hotplug.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
Because they need to prepare the hypervisor networking interfaces
and have them match the ones created in the pod networking
namespace (typically to bridge TAP and veth interfaces), hypervisor
based container runtimes need the sandbox pod networking namespace
to be set up before it's created. They can then prepare and start
the hypervisor interfaces when creating the pod virtual machine.
In order to do so, we need to create per pod persitent networking
namespaces that we pass to the CNI plugin. This patch leverages
the CNI ns package to create such namespaces under /var/run/netns,
and assign them to all pod containers.
The persitent namespace is removed when either the pod is stopped
or removed.
Since the StopPodSandbox() API can be called multiple times from
kubelet, we track the pod networking namespace state (closed or
not) so that we don't get a containernetworking/ns package error
when calling its Close() routine multiple times as well.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
And in particular make it not fail when removing an already removed
sandbox pod. According to the CRI spec:
[RemovePodSandbox] is idempotent, and must not return an error if
the sandbox has already been removed.
We now only print a warning instead of returning an error.
We still return an error when the passed pod ID is empty.
Fixes#240
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
VM base container runtimes (e.g. Clear Containers) will run each pod
in a VM and will create containers within that pod VM. Unfortunately
those runtimes will get called by ocid with the same commands
(create and start) for both the pause containers and subsequent
containers to be added to the pod namespace. Unless they work around
that by e.g. infering that a container which rootfs is under
"/pause" would represent a pod, they have no way to decide if they
need to create/start a VM or if they need to add a container to an
already running VM pod.
This patch tries to formalize this difference through pod
annotations. When starting a container or a sandbox, we now add 2
annotations for the container type (Infrastructure or not) and the
sandbox name. This will allow VM based container runtimes to handle
2 things:
- Decide if they need to create a pod VM or not.
- Keep track of which pod ID runs in a given VM, so that they
know to which sandbox they have to add containers.
Signed-off-by: Samuel Ortiz <sameo@linux.intel.com>
The gprc execsync client call doesn't populate `ExecSyncResponse` on
error at all. You just get an error.
This patch modifies the code to include command's streams, exit code
and error direcly into the error. `ocic` will then print useful
infomation in the cli, otherwise it won't.
Signed-off-by: Antonio Murdaca <runcom@redhat.com>
DNSConfig can pass "options" settings in now, so add them to the
resolv.conf that we're generating, too.
Signed-off-by: Nalin Dahyabhai <nalin@redhat.com>
While logDir isn't currently used (until the conmon implementation
lands) it's probably not a great idea to hardcode our defaults. The main
issue with this setting is that the kubelet can override it at will.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
It's a bit odd to have ImageStore be part of the config and yet we don't
allow people to modify it. However, leave it out of the commented
version because it's currently unused.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
infra container is used to implement the pod
sandbox, it should not be exported to user.
this patch stores infra container in sandbox
immediately, only the containers created by user
are stored into container store, this prevents user
from removing/stopping infra container incorrectly.
Signed-off-by: Gao feng <omarapazanadi@gmail.com>
This paves the way for having a configuration file that is loaded rather
than everything being set via the command-line.
Signed-off-by: Aleksa Sarai <asarai@suse.de>
SELinux will cause breakage when sharing these namespaces. For example it
will block processes inside of the container from looking at the hosts /proc
in hosts Pid Namespace. It will also block all access to semaphores and
shared memory defined outside of the pod.
Signed-off-by: Dan Walsh <dwalsh@redhat.com>