
# Development Report for Jan 27, 2017

This week we made a lot of progress on tools to work with local content storage
and image distribution. These parts are critical in forming an end-to-end proof
of concept, taking Docker/OCI images and turning them into bundles.

We have also defined a new GRPC protocol for interacting with the
container-shim, which is used for robust container management.

## Maintainers

* https://github.com/docker/containerd/pull/473

Derek McGowan will be joining the containerd team as a maintainer. His
extensive experience in graphdrivers and distribution will be invaluable to the
containerd project.

## Shim over GRPC

* https://github.com/docker/containerd/pull/462

```
NAME:
   containerd-shim - 
                    __        _                     __           __    _         
  _________  ____  / /_____ _(_)___  ___  _________/ /     _____/ /_  (_)___ ___ 
 / ___/ __ \/ __ \/ __/ __ `/ / __ \/ _ \/ ___/ __  /_____/ ___/ __ \/ / __ `__ \
/ /__/ /_/ / / / / /_/ /_/ / / / / /  __/ /  / /_/ /_____(__  ) / / / / / / / / /
\___/\____/_/ /_/\__/\__,_/_/_/ /_/\___/_/   \__,_/     /____/_/ /_/_/_/ /_/ /_/ 
                                                                                 
shim for container lifecycle and reconnection


USAGE:
   containerd-shim [global options] command [command options] [arguments...]

VERSION:
   1.0.0

COMMANDS:
     help, h  Shows a list of commands or help for one command

GLOBAL OPTIONS:
   --debug        enable debug output in logs
   --help, -h     show help
   --version, -v  print the version

```

This week we completed work on porting the shim over to GRPC.  This allows us
to have a more robust way to interface with the shim.  It also allows us to
have one shim per container where previously we had one shim per process.  This
drastically reduces the memory usage for exec processes.

We also had a lot of code in the containerd core for syncing with the shims
during execution.  This was because we needed ways to signal whether the shim
was running, whether the container was created, and any errors from creating
and then starting the container's process.  Getting this syncing right was hard
and required a lot of code.  With the new flow, it is just function calls via
RPC.

```proto
service Shim {
	rpc Create(CreateRequest) returns (CreateResponse);
	rpc Start(StartRequest) returns (google.protobuf.Empty);
	rpc Delete(DeleteRequest) returns (DeleteResponse);
	rpc Exec(ExecRequest) returns (ExecResponse);
	rpc Pty(PtyRequest) returns (google.protobuf.Empty);
	rpc Events(EventsRequest) returns (stream Event);
	rpc State(StateRequest) returns (StateResponse);
}
```

The GRPC service allows us to decouple the shim's lifecycle from the
container's: we get synchronous feedback from shim errors if the container
fails to create, start, or exec.
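
As a rough illustration of that synchronous feedback, the following Python sketch models the call order with a plain in-process class. `FakeShim`, `ShimError`, and `run` are hypothetical names used only for this example; the real shim is reached over GRPC, and its request and response types are not shown in this report.

```python
class ShimError(Exception):
    """Hypothetical error type standing in for a failed RPC."""


class FakeShim:
    """In-process stand-in for the Shim service (illustration only)."""

    def __init__(self):
        self.state = "init"
        self.execs = []

    def create(self, bundle):
        # A failure here reaches the caller immediately, as an RPC error would.
        if not bundle:
            raise ShimError("create failed: empty bundle path")
        self.state = "created"

    def start(self):
        if self.state != "created":
            raise ShimError("start failed: container is " + self.state)
        self.state = "running"

    def exec(self, args):
        if self.state != "running":
            raise ShimError("exec failed: container is " + self.state)
        self.execs.append(list(args))
        return len(self.execs)  # hypothetical exec identifier


def run(shim, bundle):
    """Drive create and start, returning an error string instead of raising."""
    try:
        shim.create(bundle)
        shim.start()
    except ShimError as err:
        return str(err)
    return "ok"
```

The caller learns about a failed create or start from the call itself, with no separate signalling channel to watch.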

The overhead of adding GRPC to the shim is actually lower than that of the
initial implementation.  We previously had a few pipes for controlling resizing
of the pty master and for delivering exit events; these are now all replaced by
one unix socket.  Unix sockets are cheap and fast, and we reduce our open fd
count by not relying on multiple fifos.

We also added a subcommand to the `ctr` command for testing and interfacing
with the shim.  You can interact with a shim directly via `ctr shim` to get
events, start containers, and start exec processes.

## Distribution Tool

* https://github.com/docker/containerd/pull/452
* https://github.com/docker/containerd/pull/472
* https://github.com/docker/containerd/pull/474

Last week, @stevvooe committed the first parts of the distribution tool. The main
component provided there was the `dist fetch` command. This has been followed
up by several other low-level commands that interact with content resolution
and local storage that can be used together to work with parts of images.

With this change, we add the following commands to the dist tool:
    
- `ingest`: verify and accept content into storage
- `active`: display active ingest processes
- `list`: list content in storage
- `path`: provide a path to a blob by digest
- `delete`: remove a piece of content from storage
- `apply`: apply a layer to a directory

When this is more solidified, we can roll these up into higher-level
operations that can be orchestrated through the `dist` tool or via GRPC.

As part of the _Development Report_, we thought it was a good idea to show
these tools in depth. Specifically, we can show going from an image locator to
a root filesystem with the current suite of commands.

### Fetching Image Resources

The first component added to the `dist` tool is the `fetch` command. It is a
low-level command for fetching image resources, such as manifests and layers.
It operates around the concept of `remotes`. Objects are fetched by providing a
`locator` and an object identifier. The `locator`, roughly analogous to an
image name or repository, is a schema-less URL. The following is an example of
a `locator`:

```
docker.io/library/redis
```

When we say the `locator` is a "schema-less URL", we mean that it starts with a
hostname and has a path, representing some image repository. While the hostname
may represent an actual location, we can pass it through arbitrary resolution
systems to get the actual location. In that sense, it acts like a namespace.
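
A minimal sketch of that shape in Python: the helper below splits a locator into its hostname and path and rejects anything carrying a scheme. The function name `parse_locator` and its validation rules are assumptions made for illustration, not the parsing that the `dist` tool actually performs.

```python
def parse_locator(locator):
    """Split a schema-less locator into (hostname, path)."""
    if "://" in locator:
        raise ValueError("locator must not include a scheme: " + locator)
    # Everything before the first slash is treated as the hostname/namespace.
    host, sep, path = locator.partition("/")
    if not host or not sep or not path:
        raise ValueError("locator needs a hostname and a path: " + locator)
    return host, path
```

For the example above, `parse_locator("docker.io/library/redis")` yields the hostname `docker.io` and the path `library/redis`.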

In practice, the `locator` can be used to resolve a `remote`. Object
identifiers are then passed to this remote, along with hints, which are then
mapped to the specific protocol and retrieved.  By dispatching on this common
identifier, we should be able to support almost any protocol and discovery
mechanism imaginable.
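
The dispatch described above could be sketched as a lookup table keyed by the locator's hostname. Everything in this Python example is hypothetical: the `Remote` class, the `RESOLVERS` table, the `registry-v2` label, and the placeholder endpoint `registry.example.invalid` are illustrative stand-ins, not the tool's real types or endpoints.

```python
class Remote:
    """Hypothetical remote: maps an object identifier plus hints to a request."""

    def __init__(self, protocol, base):
        self.protocol = protocol
        self.base = base

    def describe(self, path, obj, hints=()):
        # Hints are "key:value" strings, e.g. "mediatype:application/...".
        hint_map = dict(h.split(":", 1) for h in hints)
        return {
            "protocol": self.protocol,
            "target": self.base + "/" + path,
            "object": obj,
            "mediatype": hint_map.get("mediatype"),
        }


# Hypothetical resolver table keyed by the locator's hostname.
RESOLVERS = {
    "docker.io": lambda: Remote("registry-v2", "registry.example.invalid"),
}


def resolve(locator):
    """Pick a remote for the locator's namespace; return (remote, path)."""
    host, _, path = locator.partition("/")
    try:
        factory = RESOLVERS[host]
    except KeyError:
        raise LookupError("no remote registered for " + host) from None
    return factory(), path
```

Supporting another protocol or discovery mechanism would then, under this sketch, amount to registering another entry in the table.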

The actual `fetch` command currently provides anonymous access to Docker Hub
images, keyed by the `locator` namespace `docker.io`. With a `locator`,
`identifier` and `hint`, the correct protocol and endpoints are resolved and the
resource is printed to stdout. As an example, one can fetch the manifest for
`redis` with the following command:
    
```
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json
```

Note that we have provided a mediatype "hint", nudging the fetch implementation
to grab the correct endpoint. We can hash the output of that to fetch the same
content by digest:
    
```
$ ./dist fetch docker.io/library/redis sha256:$(./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | shasum -a256 | cut -d' ' -f1)
```
    
The hint is now elided on the outer command, since we have affixed the content
to a particular hash. The above effectively fetches by tag, then by hash, to
demonstrate the equivalence when interacting with a remote.
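
The same hashing step can be sketched in Python with the standard `hashlib` module. The helper names `digest_of` and `verify` are made up for this example; only the `sha256:<hex>` identifier form is taken from the commands above.

```python
import hashlib


def digest_of(content):
    """Return the 'sha256:<hex>' digest string for a bytes payload."""
    return "sha256:" + hashlib.sha256(content).hexdigest()


def verify(content, identifier):
    """Check that content matches a 'sha256:<hex>' identifier."""
    algorithm, _, expected = identifier.partition(":")
    if algorithm != "sha256":
        raise ValueError("unsupported digest algorithm: " + algorithm)
    return hashlib.sha256(content).hexdigest() == expected
```

Because the identifier is derived from the bytes themselves, a fetch by digest can be checked locally without trusting the remote.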
 
This is just the beginning. We should be able to centralize configuration
around fetch to implement a number of distribution methodologies that have been
challenging or impossible up to this point.

Keep reading to see how this is used with the other commands to fetch complete
images.

### Fetching all the layers of an image

If you are not yet entertained, let's bring `jq` and `xargs` into the mix for
maximum fun. Our first task will be to collect the layers into a local content
store with the `ingest` command.

The following incantation fetches the manifest and downloads each layer:

```
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | \
	jq -r '.layers[] | "./dist fetch docker.io/library/redis "+.digest + "| ./dist ingest --expected-digest "+.digest+" --expected-size "+(.size | tostring) +" docker.io/library/redis@"+.digest' | xargs -I{} -P10 -n1 sh -c "{}"
```

The above fetches a manifest and pipes it to jq, which assembles a shell
pipeline to ingest each layer into the content store. Because the transactions
are keyed by their digest, concurrent downloads and downloads of repeated
content are ignored. Each pipeline is then executed in parallel using xargs.
If you run the above command twice, it will not download the layers again
because those blobs are already present in the content store.
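
For readers less familiar with jq, here is a rough Python equivalent of the command assembly step. It only builds the command strings and does not run them. The function name `layer_commands` is an assumption for illustration; the `layers`, `digest`, and `size` fields and the `dist` flags are the ones used in the pipeline above.

```python
import json


def layer_commands(manifest_json, locator):
    """Build one 'fetch | ingest' shell pipeline per layer in a manifest."""
    manifest = json.loads(manifest_json)
    commands = []
    for layer in manifest["layers"]:
        digest, size = layer["digest"], layer["size"]
        commands.append(
            "./dist fetch {loc} {d} | ./dist ingest --expected-digest {d} "
            "--expected-size {s} {loc}@{d}".format(loc=locator, d=digest, s=size)
        )
    return commands
```

Each returned string corresponds to one line that xargs would hand to `sh -c`.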

What about status? Let's first remove our content so we can monitor a download.
`dist list` can be combined with xargs and `dist delete` to remove that
content:

```
$ ./dist list -q | xargs ./dist delete
```

In a separate shell session, you can monitor the active downloads with the following:
    
```
$ watch -n0.2 ./dist active
```
    
For now, the content is downloaded into `.content` in the current working
directory. To watch the contents of this directory, you can use the following:
    
```
$ watch -n0.2 tree .content
```

Now, run the fetch pipeline from above. You'll see the active downloads, keyed
by locator and object, as well as the blobs resulting from the ingest
transactions becoming available in the content store. This will help you
understand what is going on internally.
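
To make the relationship between active ingests and committed blobs concrete, here is a toy in-memory model in Python. The `ContentStore` class and its method names are hypothetical and do not mirror the real content store; the sketch only assumes what the text states, namely that transactions are keyed by a reference, verified against an expected digest and size, and skipped when the content already exists.

```python
import hashlib


class ContentStore:
    """Toy in-memory content store (illustration only)."""

    def __init__(self):
        self.blobs = {}   # digest -> bytes, committed content
        self.active = {}  # ref -> bytearray, ingests in progress

    def begin(self, ref, expected_digest):
        """Open an ingest; return False if there is nothing to do."""
        if expected_digest in self.blobs or ref in self.active:
            return False
        self.active[ref] = bytearray()
        return True

    def write(self, ref, chunk):
        self.active[ref].extend(chunk)

    def commit(self, ref, expected_digest, expected_size):
        """Verify size and digest, then publish the blob.

        The ingest is dropped from the active set whether or not
        verification succeeds.
        """
        data = bytes(self.active.pop(ref))
        actual = "sha256:" + hashlib.sha256(data).hexdigest()
        if len(data) != expected_size or actual != expected_digest:
            raise ValueError("ingest of " + ref + " failed verification")
        self.blobs[expected_digest] = data
```

In this model, the keys of `active` play the role of the `dist active` listing, and the keys of `blobs` play the role of `dist list`.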
 
### Getting to a rootfs

While we haven't yet integrated full snapshot support for layer application, we
can use the `dist apply` command to start building out rootfs for inspection
and testing. We'll build up a similar pipeline to unpack the layers and get an
actual image rootfs.

To get access to the layers, you can use the path command: 

```
$ ./dist path sha256:010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa
sha256:010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa /home/sjd/go/src/github.com/docker/containerd/.content/blobs/sha256/010c454d55e53059beaba4044116ea4636f8dd8181e975d893931c7e7204fffa
```

This returns a direct path to the blob to facilitate fast access. We can
incorporate this into the `apply` command to get to a rootfs for `redis`:
    
```
$ mkdir redis-rootfs
$ ./dist fetch docker.io/library/redis latest mediatype:application/vnd.docker.distribution.manifest.v2+json | \
	jq -r '.layers[] | "sudo ./dist apply ./redis-rootfs < $(./dist path -q "+.digest+")"' | xargs -I{} -n1 sh -c "{}"
```

The above fetches the manifest, then passes each layer into the `dist apply`
command, resulting in the full redis container root filesystem. We do not do
this in parallel, since each layer must be applied sequentially. Also, note
that we have to run `apply` with `sudo`, since the layers typically have
resources with root ownership.
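
The reason order matters can be seen in a small Python sketch that unpacks tar layers one after another, so that a file in a later layer replaces the same path from an earlier one. This is a deliberately simplified stand-in for `dist apply`, not its implementation: the names `apply_layer` and `apply_layers` are hypothetical, and the sketch handles only regular files and directories, ignoring ownership, symlinks, devices, and whiteout entries (which is also why it does not need `sudo`).

```python
import io
import os
import tarfile


def apply_layer(rootfs, layer_bytes):
    """Unpack regular files and directories from one tar layer into rootfs."""
    # The default mode detects gzip/bzip2/xz compression transparently.
    with tarfile.open(fileobj=io.BytesIO(layer_bytes)) as tar:
        for member in tar:
            # Refuse entries that would escape the target directory.
            if os.path.isabs(member.name) or ".." in member.name.split("/"):
                raise ValueError("unsafe path in layer: " + member.name)
            target = os.path.join(rootfs, member.name)
            if member.isdir():
                os.makedirs(target, exist_ok=True)
            elif member.isfile():
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with open(target, "wb") as out:
                    out.write(tar.extractfile(member).read())


def apply_layers(rootfs, layers):
    """Apply layers strictly in order; later layers overwrite earlier ones."""
    for layer_bytes in layers:
        apply_layer(rootfs, layer_bytes)
```

Applying the same layers in a different order could leave an older version of a file in place, which is why this step is not parallelized.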

Alternatively, you can just read the manifest from the content store, rather
than fetching it. We use fetch above to avoid having to look up the manifest
digest for our demo.
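
Reading the manifest from the store might look like the following Python sketch. It assumes the on-disk layout visible in the `dist path` output above (`<root>/blobs/sha256/<hex>`), which may change as the tool evolves; `blob_path` and `read_manifest` are hypothetical helper names.

```python
import json
import os


def blob_path(root, digest):
    """Map 'sha256:<hex>' to <root>/blobs/sha256/<hex>."""
    algorithm, sep, hex_part = digest.partition(":")
    if not sep or not algorithm or not hex_part:
        raise ValueError("malformed digest: " + digest)
    return os.path.join(root, "blobs", algorithm, hex_part)


def read_manifest(root, digest):
    """Load and parse a manifest blob stored under the given digest."""
    with open(blob_path(root, digest), "rb") as f:
        return json.loads(f.read().decode("utf-8"))
```

With the demo's defaults, `root` would be the `.content` directory in the current working directory.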

Note that this is mostly a POC. This tool has a long way to go. Things like
failed downloads and abandoned download cleanup aren't quite handled. We'll
probably make adjustments around how content store transactions are handled to
address this. We still need to incorporate snapshotting, as well as the ability
to calculate the `ChainID` under subsequent unpacking. Once we have some tools
to play around with snapshotting, we'll be able to incorporate our
`rootfs.ApplyLayer` algorithm that will get us a lot closer to a
production-worthy system.
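
As a point of reference for the `ChainID` calculation, the sketch below assumes the definition from the OCI image specification: the first layer's ChainID is its DiffID, and each subsequent ChainID is the digest of the previous ChainID, a space, and the next DiffID. Whether containerd's eventual implementation matches this exactly is not stated in this report, and `chain_ids` is a hypothetical helper name.

```python
import hashlib


def chain_ids(diff_ids):
    """Return the ChainID for each prefix of a list of layer DiffIDs."""
    chain = []
    for diff_id in diff_ids:
        if not chain:
            # The bottom layer's ChainID is simply its DiffID.
            chain.append(diff_id)
        else:
            joined = (chain[-1] + " " + diff_id).encode("ascii")
            chain.append("sha256:" + hashlib.sha256(joined).hexdigest())
    return chain
```

Because each entry depends on the one before it, the identifier of an unpacked layer captures the full stack beneath it, which is what makes it useful as a snapshot key.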
   
From here, we'll build out full image pull and create tooling to get runtime
bundles from the fetched content.

reports: development report for 2017-01-27. Signed-off-by: Stephen J Day <stephen.day@docker.com>, 2017-01-27 20:04:23 +00:00
