containerd/reports/2017-01-27.md

11 KiB

Development Report for Jan 27, 2017

This week we made a lot of progress on tools to work with local content storage and image distribution. These parts are critical in forming an end to end proof of concept, taking docker/oci images and turning them into bundles.

We also have defined a new GRPC protocol for interacting with the container-shim, which is used for robust container management.

Maintainers

Derek McGowan will be joining the containerd team as a maintainer. His extensive experience in graphdrivers and distribution will be invaluable to the containerd project.

Shim over GRPC

NAME:
   containerd-shim - 
                    __        _                     __           __    _         
  _________  ____  / /_____ _(_)___  ___  _________/ /     _____/ /_  (_)___ ___ 
 / ___/ __ \/ __ \/ __/ __ `/ / __ \/ _ \/ ___/ __  /_____/ ___/ __ \/ / __ `__ \
/ /__/ /_/ / / / / /_/ /_/ / / / / /  __/ /  / /_/ /_____(__  ) / / / / / / / / /
\___/\____/_/ /_/\__/\__,_/_/_/ /_/\___/_/   \__,_/     /____/_/ /_/_/_/ /_/ /_/ 
                                                                                 
shim for container lifecycle and reconnection


USAGE:
   containerd-shim [global options] command [command options] [arguments...]

VERSION:
   1.0.0

COMMANDS:
     help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --debug        enable debug output in logs
   --help, -h     show help
   --version, -v  print the version

This week we completed work on porting the shim over to GRPC. This allows us to have a more robust way to interface with the shim. It also allows us to have one shim per container where previously we had one shim per process. This drastically reduces the memory usage for exec processes.

We also had a lot of code in the containerd core for syncing with the shims during execution. This was because we needed ways to signal if the shim was running, the container was created or any errors on create and then starting the container's process. Getting this right and syncing was hard and required a lot of code. With the new flow it is just function calls via rpc.

service Shim {
	rpc Create(CreateRequest) returns (CreateResponse);
	rpc Start(StartRequest) returns (google.protobuf.Empty);
	rpc Delete(DeleteRequest) returns (DeleteResponse);
	rpc Exec(ExecRequest) returns (ExecResponse);
	rpc Pty(PtyRequest) returns (google.protobuf.Empty);
	rpc Events(EventsRequest) returns (stream Event);
	rpc State(StateRequest) returns (StateResponse);
}

With the GRPC service it allows us to decouple the shim's lifecycle from the containers, in the way that we get synchronous feedback if the container failed to create, start, or exec from shim errors.

The overhead for adding GRPC to the shim is actually less than the initial implementation. We already had a few pipes that allowed you to control resizing of the pty master and exit events, now all replaced by one unix socket. Unix sockets are cheap and fast and we reduce our open fd count with way by not relying on multiple fifos.

We also added a subcommand to the ctr command for testing and interfacing with the shim. You can interact with a shim directly via ctr shim and get events, start containers, start exec processes.

Distribution Tool

Last week, @stevvooe committed the first parts of the distribution tool. The main component provided there was the dist fetch command. This has been followed up by several other low-level commands that interact with content resolution and local storage that can be used together to work with parts of images.

With this change, we add the following commands to the dist tool:

  • ingest: verify and accept content into storage
  • active: display active ingest processes
  • list: list content in storage
  • path: provide a path to a blob by digest
  • delete: remove a piece of content from storage
  • apply: apply a layer to a directory

When this is more solidified, we can roll these up into higher-level operations that can be orchestrated through the dist tool or via GRPC.

As part of the Development Report, we thought it was a good idea to show these tools in depth. Specifically, we can show going from an image locator to a root filesystem with the current suite of commands.

Fetching Image Resources

The first component added to the dist tool is the fetch command. It is a low-level command for fetching image resources, such as manifests and layers. It operates around the concept of remotes. Objects are fetched by providing a locator and an object identifier. The locator, roughly analogous to an image name or repository, is a schema-less URL. The following is an example of a locator:

docker.io/library/redis

When we say the locator is a "schema-less URL", we mean that it starts with hostname and has a path, representing some image repository. While the hostname may represent an actual location, we can pass it through arbitrary resolution systems to get the actual location. In that sense, it acts like a namespace.

In practice, the locator can be used to resolve a remote. Object identifiers are then passed to this remote, along with hints, which are then mapped to the specific protocol and retrieved. By dispatching on this common identifier, we should be able to support almost any protocol and discovery mechanism imaginable.

The actual fetch command currently provides anonymous access to Docker Hub images, keyed by the locator namespace docker.io. With a locator, identifier and hint, the correct protocol and endpoints are resolved and the resource is printed to stdout. As an example, one can fetch the manifest for redis with the following command:

$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json

Note that we have provided a mediatype "hint", nudging the fetch implementation to grab the correct endpoint. We can hash the output of that to fetch the same content by digest:

$ ./dist fetch docker.io/library/redis sha256:$(./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | shasum -a256)

The hint now elided on the outer command, since we have affixed the content to a particular hash. The above shows us effectively fetches by tag, then by hash to demonstrate the equivalence when interacting with a remote.

This is just the beginning. We should be able to centralize configuration around fetch to implement a number of distribution methodologies that have been challenging or impossible up to this point.

Keep reading to see how this is used with the other commands to fetch complete images.

Fetching all the layers of an image

If you are not yet entertained, let's bring jq and xargs into the mix for maximum fun. Our first task will be to collect the layers into a local content store with the ingest command.

The following incantation fetches the manifest and downloads each layer:

$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | \
   jq -r '.layers[] | "./dist fetch docker.io/library/redis "+.digest + "| ./dist ingest --expected-digest "+.digest+" --expected-size "+(.size | tostring) +" docker.io/library/redis@"+.digest' | xargs -I{} -P10 -n1 sh -c "{}"

The above fetches a manifest, pipes it to jq, which assembles a shell pipeline to ingest each layer into the content store. Because the transactions are keyed by their digest, concurrent downloads and downloads of repeated content are ignored. Each process is then executed parallel using xargs. If you run the above command twice, it will not download the layers because those blobs are already present in the content store.

What about status? Let's first remove our content so we can monitor a download. dist list can be combined with xargs and dist delete to remove that content:

$ ./dist list -q | xargs ./dist delete

In a separate shell session, could monitor the active downloads with the following:

$ watch -n0.2 ./dist active

For now, the content is downloaded into .content in the current working directory. To watch the contents of this directory, you can use the following:

$ watch -n0.2 tree .content

Now, run the fetch pipeline from above. You'll see the active downloads, keyed by locator and object, as well as the ingest transactions resulting blobs becoming available in the content store. This will help to understand what is going on internally.

Getting to a rootfs

While we haven't yet integrated full snapshot support for layer application, we can use the dist apply command to start building out rootfs for inspection and testing. We'll build up a similar pipeline to unpack the layers and get an actual image rootfs.

To get access to the layers, you can use the path command:

$./dist path sha256:010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa
sha256:010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa /home/sjd/go/src/github.com/containerd/containerd/.content/blobs/sha256/010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa

This returns the a direct path to the blob to facilitate fast access. We can incorporate this into the apply command to get to a rootfs for redis:

$ mkdir redis-rootfs
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | \
	jq -r '.layers[] | "sudo ./dist apply ./redis-rootfs < $(./dist path -q "+.digest+")"' | xargs -I{} -n1 sh -c "{}"

The above fetches the manifest, then passes each layer into the dist apply command, resulting in the full redis container root filesystem. We do not do this in parallel, since each layer must be applied sequentially. Also, note that we have to run apply with sudo, since the layers typically have resources with root ownership.

Alternatively, you can just read the manifest from the content store, rather than fetching it. We use fetch above to avoid having to lookup the manifest digest for our demo.

Note that this is mostly a POC. This tool has a long way to go. Things like failed downloads and abandoned download cleanup aren't quite handled. We'll probably make adjustments around how content store transactions are handled to address this. We still need to incorporate snapshotting, as well as the ability to calculate the ChainID under subsequent unpacking. Once we have some tools to play around with snapshotting, we'll be able to incorporate our rootfs.ApplyLayer algorithm that will get us a lot closer to a production worthy system.

From here, we'll build out full image pull and create tooling to get runtime bundles from the fetched content.