Signed-off-by: Stephen J Day <stephen.day@docker.com>
Development Report for Jan 27, 2017
This week we made a lot of progress on tools to work with local content storage and image distribution. These parts are critical in forming an end-to-end proof of concept, taking Docker/OCI images and turning them into bundles.
We have also defined a new GRPC protocol for interacting with the container-shim, which is used for robust container management.
Maintainers
Derek McGowan will be joining the containerd team as a maintainer. His extensive experience in graphdrivers and distribution will be invaluable to the containerd project.
Shim over GRPC
NAME:
containerd-shim -
__ _ __ __ _
_________ ____ / /_____ _(_)___ ___ _________/ / _____/ /_ (_)___ ___
/ ___/ __ \/ __ \/ __/ __ `/ / __ \/ _ \/ ___/ __ /_____/ ___/ __ \/ / __ `__ \
/ /__/ /_/ / / / / /_/ /_/ / / / / / __/ / / /_/ /_____(__ ) / / / / / / / / /
\___/\____/_/ /_/\__/\__,_/_/_/ /_/\___/_/ \__,_/ /____/_/ /_/_/_/ /_/ /_/
shim for container lifecycle and reconnection
USAGE:
containerd-shim [global options] command [command options] [arguments...]
VERSION:
1.0.0
COMMANDS:
help, h Shows a list of commands or help for one command
GLOBAL OPTIONS:
--debug enable debug output in logs
--help, -h show help
--version, -v print the version
This week we completed work on porting the shim over to GRPC. This gives us a more robust way to interface with the shim. It also allows us to have one shim per container, where previously we had one shim per process. This drastically reduces the memory usage for exec processes.
We also had a lot of code in the containerd core for syncing with the shims during execution. We needed ways to signal whether the shim was running, whether the container had been created, and whether any errors occurred on create or when starting the container's process. Getting this synchronization right was hard and required a lot of code. With the new flow, it is just function calls over RPC.
service Shim {
    rpc Create(CreateRequest) returns (CreateResponse);
    rpc Start(StartRequest) returns (google.protobuf.Empty);
    rpc Delete(DeleteRequest) returns (DeleteResponse);
    rpc Exec(ExecRequest) returns (ExecResponse);
    rpc Pty(PtyRequest) returns (google.protobuf.Empty);
    rpc Events(EventsRequest) returns (stream Event);
    rpc State(StateRequest) returns (StateResponse);
}
The GRPC service allows us to decouple the shim's lifecycle from the container's, and we get synchronous feedback from shim errors if the container fails to create, start, or exec.
The overhead of adding GRPC to the shim is actually less than that of the initial implementation. We already had a few pipes for controlling resizing of the pty master and for exit events; these are now all replaced by one unix socket. Unix sockets are cheap and fast, and we reduce our open fd count by not relying on multiple fifos.
We also added a subcommand to the ctr command for testing and interfacing with the shim. You can interact with a shim directly via ctr shim and get events, start containers, start exec processes.
Distribution Tool
- https://github.com/docker/containerd/pull/452
- https://github.com/docker/containerd/pull/472
- https://github.com/docker/containerd/pull/474
Last week, @stevvooe committed the first parts of the distribution tool. The main component provided there was the dist fetch command. This has been followed up by several other low-level commands that interact with content resolution and local storage that can be used together to work with parts of images.
With this change, we add the following commands to the dist tool:
- ingest: verify and accept content into storage
- active: display active ingest processes
- list: list content in storage
- path: provide a path to a blob by digest
- delete: remove a piece of content from storage
- apply: apply a layer to a directory
When this is more solidified, we can roll these up into higher-level operations that can be orchestrated through the dist tool or via GRPC.
As part of the Development Report, we thought it was a good idea to show these tools in depth. Specifically, we can show going from an image locator to a root filesystem with the current suite of commands.
Fetching Image Resources
The first component added to the dist tool is the fetch command. It is a low-level command for fetching image resources, such as manifests and layers. It operates around the concept of remotes. Objects are fetched by providing a locator and an object identifier. The locator, roughly analogous to an image name or repository, is a schema-less URL. The following is an example of a locator:
docker.io/library/redis
When we say the locator is a "schema-less URL", we mean that it starts with a hostname and has a path, representing some image repository. While the hostname may represent an actual location, we can pass it through arbitrary resolution systems to get the actual location. In that sense, it acts like a namespace.
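To make the two parts of a locator concrete, here is a minimal sketch in plain shell. The helper names locator_host and locator_path are hypothetical and are not part of the dist tool; they only show how the hostname (the namespace) can be separated from the repository path before any resolution happens.

```shell
#!/bin/sh
# Hypothetical helpers, not part of dist: split a schema-less locator
# into its hostname (the namespace) and its repository path.

locator_host() {
    # Strip the longest suffix starting at the first "/".
    printf '%s\n' "${1%%/*}"
}

locator_path() {
    # Strip the shortest prefix ending at the first "/".
    printf '%s\n' "${1#*/}"
}

locator_host docker.io/library/redis   # prints "docker.io"
locator_path docker.io/library/redis   # prints "library/redis"
```

A resolver could then dispatch on the hostname part alone, which is what lets the locator behave like a namespace rather than a fixed address.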
In practice, the locator can be used to resolve a remote. Object identifiers are then passed to this remote, along with hints, which are then mapped to the specific protocol and retrieved. By dispatching on this common identifier, we should be able to support almost any protocol and discovery mechanism imaginable.
The actual fetch command currently provides anonymous access to Docker Hub images, keyed by the locator namespace docker.io. With a locator, identifier and hint, the correct protocol and endpoints are resolved and the resource is printed to stdout. As an example, one can fetch the manifest for redis with the following command:
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json
Note that we have provided a mediatype "hint", nudging the fetch implementation to grab the correct endpoint. We can hash the output of that to fetch the same content by digest:
$ ./dist fetch docker.io/library/redis sha256:$(./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | shasum -a256)
The hint is now elided on the outer command, since we have pinned the content to a particular hash. The above effectively fetches by tag, then by hash, to demonstrate the equivalence when interacting with a remote.
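As a small illustration of how a digest identifier relates to the bytes it names, the sketch below computes one for arbitrary input. The helper name digest_of is hypothetical, and the sample input is just the string "hello" rather than a real manifest. It keeps only the hex field, since both sha256sum and shasum print a file name column after the hash.

```shell
#!/bin/sh
# Hypothetical helper, not part of dist: print the "sha256:<hex>"
# identifier for whatever arrives on stdin.
digest_of() {
    # Prefer sha256sum (coreutils); fall back to shasum where only that exists.
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum
    else
        shasum -a 256
    fi | awk '{print "sha256:" $1}'   # drop the trailing file name column
}

printf 'hello' | digest_of
# prints sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824
```

Because the identifier is derived only from the content, any remote that serves the same bytes yields the same identifier.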
This is just the beginning. We should be able to centralize configuration around fetch to implement a number of distribution methodologies that have been challenging or impossible up to this point.
Keep reading to see how this is used with the other commands to fetch complete images.
Fetching all the layers of an image
If you are not yet entertained, let's bring jq and xargs into the mix for maximum fun. Our first task will be to collect the layers into a local content store with the ingest command.
The following incantation fetches the manifest and downloads each layer:
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | \
jq -r '.layers[] | "./dist fetch docker.io/library/redis "+.digest + "| ./dist ingest --expected-digest "+.digest+" --expected-size "+(.size | tostring) +" docker.io/library/redis@"+.digest' | xargs -I{} -P10 -n1 sh -c "{}"
The above fetches a manifest and pipes it to jq, which assembles a shell pipeline to ingest each layer into the content store. Because the transactions are keyed by their digest, concurrent and repeated downloads of the same content are ignored. Each pipeline is then executed in parallel using xargs. If you run the above command twice, it will not download the layers again because those blobs are already present in the content store.
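To give a feel for what a digest-keyed, verified ingest involves, here is a rough sketch in plain shell. It is not how dist ingest is implemented: the function names, the STORE variable, the ingest staging directory and the use of a temporary directory are all assumptions made for illustration. Only the blobs/sha256/<hex> naming is taken from the dist path output shown in this report.

```shell
#!/bin/sh
# Sketch of a verified ingest under stated assumptions; not the dist tool.
# STORE is a hypothetical content store root (a temp dir by default).
STORE="${STORE:-$(mktemp -d)}"

sha256_hex() {
    # Print only the hex digest of the file given as $1.
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1"
    else
        shasum -a 256 "$1"
    fi | cut -d' ' -f1
}

# Usage: ingest_blob EXPECTED_DIGEST EXPECTED_SIZE < content
ingest_blob() {
    expected_digest=$1
    expected_size=$2
    hex=${expected_digest#sha256:}
    target="$STORE/blobs/sha256/$hex"

    # Blobs are keyed by digest, so content that is already present is skipped.
    if [ -f "$target" ]; then
        cat >/dev/null   # drain stdin so the producer is not cut off
        return 0
    fi

    mkdir -p "$STORE/ingest" "$STORE/blobs/sha256"
    tmp="$STORE/ingest/$hex"
    cat >"$tmp"

    actual_size=$(wc -c <"$tmp" | tr -d ' ')
    actual_hex=$(sha256_hex "$tmp")
    if [ "$actual_size" != "$expected_size" ] || [ "$actual_hex" != "$hex" ]; then
        rm -f "$tmp"
        echo "ingest: content does not match $expected_digest" >&2
        return 1
    fi

    # Publish the blob only after both checks pass.
    mv "$tmp" "$target"
}

printf 'hello' | ingest_blob sha256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824 5
ls "$STORE/blobs/sha256"
```

Staging the data under a separate directory and moving it into place only after verification means that readers of the blobs directory never observe partial or unverified content.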
What about status? Let's first remove our content so we can monitor a download.
dist list can be combined with xargs and dist delete to remove that content:
$ ./dist list -q | xargs ./dist delete
In a separate shell session, you can monitor the active downloads with the following:
$ watch -n0.2 ./dist active
For now, the content is downloaded into .content in the current working directory. To watch the contents of this directory, you can use the following:
$ watch -n0.2 tree .content
Now, run the fetch pipeline from above. You'll see the active downloads, keyed by locator and object, as well as the ingest transactions and the resulting blobs becoming available in the content store. This will help you understand what is going on internally.
Getting to a rootfs
While we haven't yet integrated full snapshot support for layer application, we can use the dist apply command to start building out a rootfs for inspection and testing. We'll build up a similar pipeline to unpack the layers and get an actual image rootfs.
To get access to the layers, you can use the path command:
$ ./dist path sha256:010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa
sha256:010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa /home/sjd/go/src/github.com/docker/containerd/.content/blobs/sha256/010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa
This returns a direct path to the blob to facilitate fast access. We can incorporate this into the apply command to get to a rootfs for redis:
$ mkdir redis-rootfs
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | \
jq -r '.layers[] | "sudo ./dist apply ./redis-rootfs < $(./dist path -q "+.digest+")"' | xargs -I{} -n1 sh -c "{}"
The above fetches the manifest, then passes each layer into the dist apply command, resulting in the full redis container root filesystem. We do not do this in parallel, since each layer must be applied sequentially. Also, note that we have to run apply with sudo, since the layers typically have resources with root ownership.
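The reason for applying layers strictly in order can be seen with a toy example. In the sketch below, plain tar stands in for dist apply, and the apply_layers helper, the two hand-built layers and the temporary work directory are all made up for illustration. Plain tar also does not interpret the whiteout entries that real image layers use to delete files, so this is not a substitute for the real command.

```shell
#!/bin/sh
# Sketch: apply layer tarballs to a rootfs one after another.
# apply_layers is a hypothetical helper; tar stands in for `dist apply`.
apply_layers() {
    rootfs=$1
    shift
    mkdir -p "$rootfs"
    # Order matters: each layer is extracted over the result of the previous one.
    for layer in "$@"; do
        tar -xf "$layer" -C "$rootfs" || return 1
    done
}

# Build two toy layers that both provide etc/motd.
work=$(mktemp -d)
mkdir -p "$work/l1/etc" "$work/l2/etc"
echo "base" >"$work/l1/etc/motd"
echo "updated" >"$work/l2/etc/motd"
tar -cf "$work/layer1.tar" -C "$work/l1" .
tar -cf "$work/layer2.tar" -C "$work/l2" .

apply_layers "$work/rootfs" "$work/layer1.tar" "$work/layer2.tar"
cat "$work/rootfs/etc/motd"   # prints "updated": the later layer wins
```

Swapping the two layer arguments leaves "base" in etc/motd instead, which is why the layers cannot be unpacked concurrently into the same directory.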
Alternatively, you can just read the manifest from the content store, rather than fetching it. We use fetch above to avoid having to look up the manifest digest for our demo.
Note that this is mostly a POC. This tool has a long way to go. Failed downloads and cleanup of abandoned downloads aren't quite handled yet. We'll probably make adjustments around how content store transactions are handled to address this. We still need to incorporate snapshotting, as well as the ability to calculate the ChainID under subsequent unpacking. Once we have some tools to play around with snapshotting, we'll be able to incorporate our rootfs.ApplyLayer algorithm that will get us a lot closer to a production-worthy system.
From here, we'll build out full image pull and create tooling to get runtime bundles from the fetched content.