257 lines
11 KiB
Markdown
257 lines
11 KiB
Markdown
|
# Development Report for Jan 27, 2017
|
||
|
|
||
|
This week we made a lot of progress on tools to work with local content storage
|
||
|
and image distribution. These parts are critical in forming an end to end proof
|
||
|
of concept, taking docker/oci images and turning them into bundles.
|
||
|
|
||
|
We also have defined a new GRPC protocol for interacting with the
|
||
|
container-shim, which is used for robust container management.
|
||
|
|
||
|
## Maintainers
|
||
|
|
||
|
* https://github.com/docker/containerd/pull/473
|
||
|
|
||
|
Derek McGowan will be joining the containerd team as a maintainer. His
|
||
|
extensive experience in graphdrivers and distribution will be invaluable to the
|
||
|
containerd project.
|
||
|
|
||
|
## Shim over GRPC
|
||
|
|
||
|
* https://github.com/docker/containerd/pull/462
|
||
|
|
||
|
```
|
||
|
NAME:
|
||
|
containerd-shim -
|
||
|
__ _ __ __ _
|
||
|
_________ ____ / /_____ _(_)___ ___ _________/ / _____/ /_ (_)___ ___
|
||
|
/ ___/ __ \/ __ \/ __/ __ `/ / __ \/ _ \/ ___/ __ /_____/ ___/ __ \/ / __ `__ \
|
||
|
/ /__/ /_/ / / / / /_/ /_/ / / / / / __/ / / /_/ /_____(__ ) / / / / / / / / /
|
||
|
\___/\____/_/ /_/\__/\__,_/_/_/ /_/\___/_/ \__,_/ /____/_/ /_/_/_/ /_/ /_/
|
||
|
|
||
|
shim for container lifecycle and reconnection
|
||
|
|
||
|
|
||
|
USAGE:
|
||
|
containerd-shim [global options] command [command options] [arguments...]
|
||
|
|
||
|
VERSION:
|
||
|
1.0.0
|
||
|
|
||
|
COMMANDS:
|
||
|
help, h Shows a list of commands or help for one command
|
||
|
|
||
|
GLOBAL OPTIONS:
|
||
|
--debug enable debug output in logs
|
||
|
--help, -h show help
|
||
|
--version, -v print the version
|
||
|
|
||
|
```
|
||
|
|
||
|
This week we completed work on porting the shim over to GRPC. This allows us
|
||
|
to have a more robust way to interface with the shim. It also allows us to
|
||
|
have one shim per container where previously we had one shim per process. This
|
||
|
drastically reduces the memory usage for exec processes.
|
||
|
|
||
|
We also had a lot of code in the containerd core for syncing with the shims
|
||
|
during execution. This was because we needed ways to signal if the shim was
|
||
|
running, the container was created or any errors on create and then starting
|
||
|
the container's process. Getting this right and syncing was hard and required
|
||
|
a lot of code. With the new flow it is just function calls via rpc.
|
||
|
|
||
|
```proto
|
||
|
service Shim {
|
||
|
rpc Create(CreateRequest) returns (CreateResponse);
|
||
|
rpc Start(StartRequest) returns (google.protobuf.Empty);
|
||
|
rpc Delete(DeleteRequest) returns (DeleteResponse);
|
||
|
rpc Exec(ExecRequest) returns (ExecResponse);
|
||
|
rpc Pty(PtyRequest) returns (google.protobuf.Empty);
|
||
|
rpc Events(EventsRequest) returns (stream Event);
|
||
|
rpc State(StateRequest) returns (StateResponse);
|
||
|
}
|
||
|
```
|
||
|
|
||
|
With the GRPC service it allows us to decouple the shim's lifecycle from the
|
||
|
containers, in the way that we get synchronous feedback if the container failed
|
||
|
to create, start, or exec from shim errors.
|
||
|
|
||
|
The overhead for adding GRPC to the shim is actually less than the initial
|
||
|
implementation. We already had a few pipes that allowed you to control
|
||
|
resizing of the pty master and exit events, now all replaced by one unix
|
||
|
socket. Unix sockets are cheap and fast and we reduce our open fd count with
|
||
|
way by not relying on multiple fifos.
|
||
|
|
||
|
We also added a subcommand to the `ctr` command for testing and interfacing
|
||
|
with the shim. You can interact with a shim directly via `ctr shim` and get
|
||
|
events, start containers, start exec processes.
|
||
|
|
||
|
## Distribution Tool
|
||
|
|
||
|
* https://github.com/docker/containerd/pull/452
|
||
|
* https://github.com/docker/containerd/pull/472
|
||
|
* https://github.com/docker/containerd/pull/474
|
||
|
|
||
|
Last week, @stevvooe committed the first parts of the distribution tool. The main
|
||
|
component provided there was the `dist fetch` command. This has been followed
|
||
|
up by several other low-level commands that interact with content resolution
|
||
|
and local storage that can be used together to work with parts of images.
|
||
|
|
||
|
With this change, we add the following commands to the dist tool:
|
||
|
|
||
|
- `ingest`: verify and accept content into storage
|
||
|
- `active`: display active ingest processes
|
||
|
- `list`: list content in storage
|
||
|
- `path`: provide a path to a blob by digest
|
||
|
- `delete`: remove a piece of content from storage
|
||
|
- `apply`: apply a layer to a directory
|
||
|
|
||
|
When this is more solidified, we can roll these up into higher-level
|
||
|
operations that can be orchestrated through the `dist` tool or via GRPC.
|
||
|
|
||
|
As part of the _Development Report_, we thought it was a good idea to show
|
||
|
these tools in depth. Specifically, we can show going from an image locator to
|
||
|
a root filesystem with the current suite of commands.
|
||
|
|
||
|
### Fetching Image Resources
|
||
|
|
||
|
The first component added to the `dist` tool is the `fetch` command. It is a
|
||
|
low-level command for fetching image resources, such as manifests and layers.
|
||
|
It operates around the concept of `remotes`. Objects are fetched by providing a
|
||
|
`locator` and an object identifier. The `locator`, roughly analogous to an
|
||
|
image name or repository, is a schema-less URL. The following is an example of
|
||
|
a `locator`:
|
||
|
|
||
|
```
|
||
|
docker.io/library/redis
|
||
|
```
|
||
|
|
||
|
When we say the `locator` is a "schema-less URL", we mean that it starts with
|
||
|
hostname and has a path, representing some image repository. While the hostname
|
||
|
may represent an actual location, we can pass it through arbitrary resolution
|
||
|
systems to get the actual location. In that sense, it acts like a namespace.
|
||
|
|
||
|
In practice, the `locator` can be used to resolve a `remote`. Object
|
||
|
identifiers are then passed to this remote, along with hints, which are then
|
||
|
mapped to the specific protocol and retrieved. By dispatching on this common
|
||
|
identifier, we should be able to support almost any protocol and discovery
|
||
|
mechanism imaginable.
|
||
|
|
||
|
The actual `fetch` command currently provides anonymous access to Docker Hub
|
||
|
images, keyed by the `locator` namespace `docker.io`. With a `locator`,
|
||
|
`identifier` and `hint`, the correct protocol and endpoints are resolved and the
|
||
|
resource is printed to stdout. As an example, one can fetch the manifest for
|
||
|
`redis` with the following command:
|
||
|
|
||
|
```
|
||
|
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json
|
||
|
```
|
||
|
|
||
|
Note that we have provided a mediatype "hint", nudging the fetch implementation
|
||
|
to grab the correct endpoint. We can hash the output of that to fetch the same
|
||
|
content by digest:
|
||
|
|
||
|
```
|
||
|
$ ./dist fetch docker.io/library/redis sha256:$(./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | shasum -a256)
|
||
|
```
|
||
|
|
||
|
The hint now elided on the outer command, since we have affixed the content to
|
||
|
a particular hash. The above shows us effectively fetches by tag, then by hash
|
||
|
to demonstrate the equivalence when interacting with a remote.
|
||
|
|
||
|
This is just the beginning. We should be able to centralize configuration
|
||
|
around fetch to implement a number of distribution methodologies that have been
|
||
|
challenging or impossible up to this point.
|
||
|
|
||
|
Keep reading to see how this is used with the other commands to fetch complete
|
||
|
images.
|
||
|
|
||
|
### Fetching all the layers of an image
|
||
|
|
||
|
If you are not yet entertained, let's bring `jq` and `xargs` into the mix for
|
||
|
maximum fun. Our first task will be to collect the layers into a local content
|
||
|
store with the `ingest` command.
|
||
|
|
||
|
The following incantation fetches the manifest and downloads each layer:
|
||
|
|
||
|
```
|
||
|
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | \
|
||
|
jq -r '.layers[] | "./dist fetch docker.io/library/redis "+.digest + "| ./dist ingest --expected-digest "+.digest+" --expected-size "+(.size | tostring) +" docker.io/library/redis@"+.digest' | xargs -I{} -P10 -n1 sh -c "{}"
|
||
|
```
|
||
|
|
||
|
The above fetches a manifest, pipes it to jq, which assembles a shell pipeline
|
||
|
to ingest each layer into the content store. Because the transactions are keyed
|
||
|
by their digest, concurrent downloads and downloads of repeated content are
|
||
|
ignored. Each process is then executed parallel using xargs. If you run the
|
||
|
above command twice, it will not download the layers because those blobs are
|
||
|
already present in the content store.
|
||
|
|
||
|
What about status? Let's first remove our content so we can monitor a download.
|
||
|
`dist list` can be combined with xargs and `dist delete` to remove that
|
||
|
content:
|
||
|
|
||
|
```
|
||
|
$ ./dist list -q | xargs ./dist delete
|
||
|
```
|
||
|
|
||
|
In a separate shell session, could monitor the active downloads with the following:
|
||
|
|
||
|
```
|
||
|
$ watch -n0.2 ./dist active
|
||
|
```
|
||
|
|
||
|
For now, the content is downloaded into `.content` in the current working
|
||
|
directory. To watch the contents of this directory, you can use the following:
|
||
|
|
||
|
```
|
||
|
$ watch -n0.2 tree .content
|
||
|
```
|
||
|
|
||
|
Now, run the fetch pipeline from above. You'll see the active downloads, keyed
|
||
|
by locator and object, as well as the ingest transactions resulting blobs
|
||
|
becoming available in the content store. This will help to understand what is
|
||
|
going on internally.
|
||
|
|
||
|
### Getting to a rootfs
|
||
|
|
||
|
While we haven't yet integrated full snapshot support for layer application, we
|
||
|
can use the `dist apply` command to start building out rootfs for inspection
|
||
|
and testing. We'll build up a similar pipeline to unpack the layers and get an
|
||
|
actual image rootfs.
|
||
|
|
||
|
To get access to the layers, you can use the path command:
|
||
|
|
||
|
```
|
||
|
$./dist path sha256:010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa
|
||
|
sha256:010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa /home/sjd/go/src/github.com/docker/containerd/.content/blobs/sha256/010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa
|
||
|
```
|
||
|
|
||
|
This returns the a direct path to the blob to facilitate fast access. We can
|
||
|
incorporate this into the `apply` command to get to a rootfs for `redis`:
|
||
|
|
||
|
```
|
||
|
$ mkdir redis-rootfs
|
||
|
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | \
|
||
|
jq -r '.layers[] | "sudo ./dist apply ./redis-rootfs < $(./dist path -q "+.digest+")"' | xargs -I{} -n1 sh -c "{}"
|
||
|
```
|
||
|
|
||
|
The above fetches the manifest, then passes each layer into the `dist apply`
|
||
|
command, resulting in the full redis container root filesystem. We do not do
|
||
|
this in parallel, since each layer must be applied sequentially. Also, note
|
||
|
that we have to run `apply` with `sudo`, since the layers typically have
|
||
|
resources with root ownership.
|
||
|
|
||
|
Alternatively, you can just read the manifest from the content store, rather
|
||
|
than fetching it. We use fetch above to avoid having to lookup the manifest
|
||
|
digest for our demo.
|
||
|
|
||
|
Note that this is mostly a POC. This tool has a long way to go. Things like
|
||
|
failed downloads and abandoned download cleanup aren't quite handled. We'll
|
||
|
probably make adjustments around how content store transactions are handled to
|
||
|
address this. We still need to incorporate snapshotting, as well as the ability
|
||
|
to calculate the `ChainID` under subsequent unpacking. Once we have some tools
|
||
|
to play around with snapshotting, we'll be able to incorporate our
|
||
|
`rootfs.ApplyLayer` algorithm that will get us a lot closer to a production
|
||
|
worthy system.
|
||
|
|
||
|
From here, we'll build out full image pull and create tooling to get runtime
|
||
|
bundles from the fetched content.
|