From 7c1b9831df0eb80501d05bf3d1500fecd7dbb82b Mon Sep 17 00:00:00 2001 From: Vincent Batts Date: Fri, 24 Oct 2014 16:23:50 -0400 Subject: [PATCH 1/4] pkg/tarsum: specification on TarSum checksum Signed-off-by: Vincent Batts --- tarsum/tarsum_spec.md | 228 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 228 insertions(+) create mode 100644 tarsum/tarsum_spec.md diff --git a/tarsum/tarsum_spec.md b/tarsum/tarsum_spec.md new file mode 100644 index 0000000..bffd44a --- /dev/null +++ b/tarsum/tarsum_spec.md @@ -0,0 +1,228 @@ +page_title: TarSum checksum specification +page_description: Documentation for algorithm used in the TarSum checksum calculation +page_keywords: docker, checksum, validation, tarsum + +# TarSum Checksum Specification + +## Abstract + +This document describes the algorithms used in performing the TarSum checksum +calculation on file system layers, the need for this method over existing +methods, and the versioning of this calculation. + + +## Introduction + +The transportation of file systems, regarding docker, is done with tar(1) +archives. Types of transpiration include distribution to and from a registry +endpoint, saving and loading through commands or docker daemon APIs, +transferring the build context from client to docker daemon, and committing the +file system of a container to become an image. + +As tar archives are used for transit, but not preserved in many situations, the +focus of the algorithm is to ensure the integrity of the preserved file system, +while maintaining a deterministic accountability. This includes neither +constrain the ordering or manipulation of the files during the creation or +unpacking of the archive, nor include additional metadata state about the file +system attributes. + + +## Intended Audience + +This document is outlining the methods used for consistent checksum calculation +for file systems transported via tar archives. + +Auditing these methodologies is an open and iterative process. 
This document +should accommodate the review of source code. Ultimately, this document should +be the starting point of further refinements to the algorithm and its future +versions. + + +## Concept + +The checksum mechanism must ensure the integrity and confidentiality of the +file system payload. + + +## Checksum Algorithm Profile + +A checksum mechanism must define the following operations and attributes: + +* associated hashing cipher - used to checksum each file payload and attribute + information. +* checksum list - each file of the file system archive has its checksum + calculated from the payload and attributes of the file. The final checksum is + calculated from this list, with specific ordering. +* version - as the algorithm adapts to requirements, there are behaviors of the + algorithm to manage by versioning. +* archive being calculated - the tar archive having its checksum calculated + + +## Elements of TarSum checksum + +The calculated sum output is a text string. The elements included in the output +of the calculated sum comprise the information needed for validation of the sum +(TarSum version and block cipher used) and the expected checksum in hexadecimal +form. + +There are two delimiters used: +* '+' separates TarSum version from block cipher +* ':' separates calculation mechanics from expected hash + +Example: + + "tarsum.v1+sha256:220a60ecd4a3c32c282622a625a54db9ba0ff55b5ba9c29c7064a2bc358b6a3e" + | | \ | + | | \ | + |_version_|_cipher__|__ | + | \ | + |_calculation_mechanics_|______________________expected_sum_______________________| + + +## Versioning + +Versioning was introduced [0] to accommodate differences in calculation needed, +and ability to maintain reverse compatibility. + +The general algorithm will be describe further in the 'Calculation'. + +### Version0 + +This is the initial version of TarSum. 
+ +Its element in the checksum "tarsum" + + +### Version1 + +Its element in the checksum "tarsum.v1" + +The notable changes in this version: +* exclusion of file mtime from the file information headers, in each file + checksum calculation +* inclusion of extended attributes (xattrs. Also seen as "SCHILY.xattr." prefixed Pax + tar file info headers) keys and values in each file checksum calculation + +### VersionDev + +*Do not use unless validating refinements to the checksum algorithm* + +Its element in the checksum "tarsum.dev" + +This is a floating place holder for a next version. The methods used for +calculation are subject to change without notice. + +## Ciphers + +The official default and standard block cipher used in the calculation mechanic +is "sha256". This refers to SHA256 hash algorithm as defined in FIPS 180-4. + +Though the algorithm itself is not exclusively bound to this single block +cipher, and support for alternate block ciphers was later added [1]. Presently +use of this is for isolated use-cases and future-proofing the TarSum checksum +format. + +## Calculation + +### Requirement + +As mentioned earlier, the calculation is such that it takes into consideration +the life and cycle of the tar archive. In that the tar archive is not an +immutable, permanent artifact. Otherwise options like relying on a known block +cipher checksum of the archive itself would be reliable enough. Since the tar +archive is used as a transportation medium, and is thrown away after its +contents are extracted. Therefore, for consistent validation items such as +order of files in the tar archive and time stamps are subject to change once an +image is received. + + +### Process + +The method is typically iterative due to reading tar info headers from the +archive stream, though this is not a strict requirement. + +#### Files + +Each file in the tar archive have their contents (headers and body) checksummed +individually using the designated associated hashing cipher. 
The ordered +headers of the file are written to the checksum calculation first, and then the +payload of the file body. + +The resulting checksum of the file is appended to the list of file sums. The +sum is encoded as a string of the hexadecimal digest. Additionally, the file +name and position in the archive is kept as reference for special ordering. + +#### Headers + +The following headers are read, in this +order ( and the corresponding representation of its value): +* 'name' - string +* 'mode' - string of the base10 integer +* 'uid' - string of the integer +* 'gid' - string of the integer +* 'size' - string of the integer +* 'mtime' (_Version0 only_) - string of integer of the seconds since 1970-01-01 00:00:00 UTC +* 'typeflag' - string of the char +* 'linkname' - string +* 'uname' - string +* 'gname' - string +* 'devmajor' - string of the integer +* 'devminor' - string of the integer + +For >= Version1, the extented attribute headers ("SCHILY.xattr." prefixed pax +headers) included after the above list. These xattrs key/values are first +sorted by the keys. + + +#### Header Format + +The ordered headers are written to the hash in the format of + + "{.key}{.value}" + +with no newline. + + +#### Body + +After the order headers of the file have been added to the checksum for the +file, then the body of the file is written to the hash. + + +#### List of file sums + +The list of file sums is sorted by the string of the hexadecimal digest. + +If there are two files in the tar with matching paths, the order of occurrence +for that path is reflected for the sums of the corresponding file header and +body. + + +#### Final Checksum + +Using an initialize hash of the associated hash cipher, if there is additional +payload to include in the TarSum calculation for the archive, it is written +first. Then each checksum from the ordered list of files sums is written to the +hash. 
The resulting digest is formatted per the Elements of TarSum checksum, +including the TarSum version, the associated hash cipher and the hexadecimal +encoded checksum digest. + + +## Security Considerations + +The initial version of TarSum has undergone one update that could invalidate +handcrafted tar archives. The tar archive format supports appending of files +with same names as prior files in the archive. The latter file will clobber the +prior file of the same path. Due to this the algorithm now accounts for + + +## Footnotes + +* [0] Versioning https://github.com/docker/docker/commit/747f89cd327db9d50251b17797c4d825162226d0 +* [1] Alternate ciphers https://github.com/docker/docker/commit/4e9925d780665149b8bc940d5ba242ada1973c4e + +## Acknowledgements + +Joffrey F (shin-) and Guillaume J. Charmes (creack) on the initial work of the +TarSum calculation. + From 0597513d59a142483100796c771b092c091a13fb Mon Sep 17 00:00:00 2001 From: Vincent Batts Date: Wed, 12 Nov 2014 09:25:46 -0500 Subject: [PATCH 2/4] pkg/tarsum: review amendments (separate commit to preserve github conversation) Signed-off-by: Vincent Batts --- tarsum/tarsum_spec.md | 35 +++++++++++++++++++++-------------- 1 file changed, 21 insertions(+), 14 deletions(-) diff --git a/tarsum/tarsum_spec.md b/tarsum/tarsum_spec.md index bffd44a..aa5065d 100644 --- a/tarsum/tarsum_spec.md +++ b/tarsum/tarsum_spec.md @@ -14,8 +14,10 @@ methods, and the versioning of this calculation. ## Introduction The transportation of file systems, regarding docker, is done with tar(1) -archives. Types of transpiration include distribution to and from a registry -endpoint, saving and loading through commands or docker daemon APIs, +archives. There are a variety of tar serialization formats [2], and a key +concern here is ensuring a repeatable checksum given a set of inputs from a +generic tar archive. 
Types of transportation include distribution to and from a +registry endpoint, saving and loading through commands or docker daemon APIs, transferring the build context from client to docker daemon, and committing the file system of a container to become an image. @@ -40,7 +42,7 @@ versions. ## Concept -The checksum mechanism must ensure the integrity and confidentiality of the +The checksum mechanism must ensure the integrity and assurance of the file system payload. @@ -62,11 +64,11 @@ A checksum mechanism must define the following operations and attributes: The calculated sum output is a text string. The elements included in the output of the calculated sum comprise the information needed for validation of the sum -(TarSum version and block cipher used) and the expected checksum in hexadecimal +(TarSum version and hashing cipher used) and the expected checksum in hexadecimal form. There are two delimiters used: -* '+' separates TarSum version from block cipher +* '+' separates TarSum version from hashing cipher * ':' separates calculation mechanics from expected hash Example: @@ -114,11 +116,11 @@ calculation are subject to change without notice. ## Ciphers -The official default and standard block cipher used in the calculation mechanic +The official default and standard hashing cipher used in the calculation mechanic is "sha256". This refers to SHA256 hash algorithm as defined in FIPS 180-4. -Though the algorithm itself is not exclusively bound to this single block -cipher, and support for alternate block ciphers was later added [1]. Presently +Though the algorithm itself is not exclusively bound to this single hashing +cipher, and support for alternate hashing ciphers was later added [1]. Presently use of this is for isolated use-cases and future-proofing the TarSum checksum format. @@ -128,7 +130,7 @@ format. As mentioned earlier, the calculation is such that it takes into consideration the life and cycle of the tar archive. 
In that the tar archive is not an -immutable, permanent artifact. Otherwise options like relying on a known block +immutable, permanent artifact. Otherwise options like relying on a known hashing cipher checksum of the archive itself would be reliable enough. Since the tar archive is used as a transportation medium, and is thrown away after its contents are extracted. Therefore, for consistent validation items such as @@ -200,10 +202,12 @@ body. #### Final Checksum -Using an initialize hash of the associated hash cipher, if there is additional -payload to include in the TarSum calculation for the archive, it is written -first. Then each checksum from the ordered list of files sums is written to the -hash. The resulting digest is formatted per the Elements of TarSum checksum, +Begin with a fresh or initial state of the associated hash cipher. If there is +additional payload to include in the TarSum calculation for the archive, it is +written first. Then each checksum from the ordered list of file sums is written +to the hash. + +The resulting digest is formatted per the Elements of TarSum checksum, including the TarSum version, the associated hash cipher and the hexadecimal encoded checksum digest. @@ -213,13 +217,16 @@ encoded checksum digest. The initial version of TarSum has undergone one update that could invalidate handcrafted tar archives. The tar archive format supports appending of files with same names as prior files in the archive. The latter file will clobber the -prior file of the same path. Due to this the algorithm now accounts for +prior file of the same path. Due to this the algorithm now accounts for files +with matching paths, and orders the list of file sums accordingly [3]. 
## Footnotes * [0] Versioning https://github.com/docker/docker/commit/747f89cd327db9d50251b17797c4d825162226d0 * [1] Alternate ciphers https://github.com/docker/docker/commit/4e9925d780665149b8bc940d5ba242ada1973c4e +* [2] Tar http://en.wikipedia.org/wiki/Tar_%28computing%29 +* [3] Name collision https://github.com/docker/docker/commit/c5e6362c53cbbc09ddbabd5a7323e04438b57d31 ## Acknowledgements From 9a45c4235a5f0ff13f2b112587776e67e9148cb0 Mon Sep 17 00:00:00 2001 From: Vincent Batts Date: Thu, 13 Nov 2014 13:09:05 -0500 Subject: [PATCH 3/4] pkg/tarsum: review cleanup Signed-off-by: Vincent Batts --- tarsum/tarsum_spec.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tarsum/tarsum_spec.md b/tarsum/tarsum_spec.md index aa5065d..b51e5b1 100644 --- a/tarsum/tarsum_spec.md +++ b/tarsum/tarsum_spec.md @@ -188,7 +188,7 @@ with no newline. #### Body After the order headers of the file have been added to the checksum for the -file, then the body of the file is written to the hash. +file, the body of the file is written to the hash. 
#### List of file sums From bd9c676bb7b609d71bbb01288766a72585c3856c Mon Sep 17 00:00:00 2001 From: Vincent Batts Date: Thu, 20 Nov 2014 15:46:15 -0500 Subject: [PATCH 4/4] tarsum: updates for jamtur01 comments Signed-off-by: Vincent Batts --- tarsum/tarsum_spec.md | 82 +++++++++++++++++++------------------------ 1 file changed, 36 insertions(+), 46 deletions(-) diff --git a/tarsum/tarsum_spec.md b/tarsum/tarsum_spec.md index b51e5b1..7a6f8ed 100644 --- a/tarsum/tarsum_spec.md +++ b/tarsum/tarsum_spec.md @@ -1,5 +1,5 @@ page_title: TarSum checksum specification -page_description: Documentation for algorithm used in the TarSum checksum calculation +page_description: Documentation for algorithms used in the TarSum checksum calculation page_keywords: docker, checksum, validation, tarsum # TarSum Checksum Specification @@ -7,58 +7,54 @@ page_keywords: docker, checksum, validation, tarsum ## Abstract This document describes the algorithms used in performing the TarSum checksum -calculation on file system layers, the need for this method over existing +calculation on filesystem layers, the need for this method over existing methods, and the versioning of this calculation. ## Introduction -The transportation of file systems, regarding docker, is done with tar(1) +The transportation of filesystems, regarding Docker, is done with tar(1) archives. There are a variety of tar serialization formats [2], and a key concern here is ensuring a repeatable checksum given a set of inputs from a generic tar archive. Types of transportation include distribution to and from a -registry endpoint, saving and loading through commands or docker daemon APIs, -transferring the build context from client to docker daemon, and committing the -file system of a container to become an image. +registry endpoint, saving and loading through commands or Docker daemon APIs, +transferring the build context from client to Docker daemon, and committing the +filesystem of a container to become an image. 
As tar archives are used for transit, but not preserved in many situations, the -focus of the algorithm is to ensure the integrity of the preserved file system, +focus of the algorithm is to ensure the integrity of the preserved filesystem, while maintaining a deterministic accountability. This includes neither -constrain the ordering or manipulation of the files during the creation or +constraining the ordering or manipulation of the files during the creation or unpacking of the archive, nor include additional metadata state about the file system attributes. - ## Intended Audience This document is outlining the methods used for consistent checksum calculation -for file systems transported via tar archives. +for filesystems transported via tar archives. Auditing these methodologies is an open and iterative process. This document should accommodate the review of source code. Ultimately, this document should be the starting point of further refinements to the algorithm and its future versions. - ## Concept The checksum mechanism must ensure the integrity and assurance of the -file system payload. - +filesystem payload. ## Checksum Algorithm Profile A checksum mechanism must define the following operations and attributes: -* associated hashing cipher - used to checksum each file payload and attribute +* Associated hashing cipher - used to checksum each file payload and attribute information. -* checksum list - each file of the file system archive has its checksum +* Checksum list - each file of the filesystem archive has its checksum calculated from the payload and attributes of the file. The final checksum is calculated from this list, with specific ordering. -* version - as the algorithm adapts to requirements, there are behaviors of the +* Version - as the algorithm adapts to requirements, there are behaviors of the algorithm to manage by versioning. 
-* archive being calculated - the tar archive having its checksum calculated - +* Archive being calculated - the tar archive having its checksum calculated ## Elements of TarSum checksum @@ -73,13 +69,14 @@ There are two delimiters used: Example: +``` "tarsum.v1+sha256:220a60ecd4a3c32c282622a625a54db9ba0ff55b5ba9c29c7064a2bc358b6a3e" | | \ | | | \ | |_version_|_cipher__|__ | | \ | |_calculation_mechanics_|______________________expected_sum_______________________| - +``` ## Versioning @@ -92,51 +89,50 @@ The general algorithm will be describe further in the 'Calculation'. This is the initial version of TarSum. -Its element in the checksum "tarsum" - +Its element in the TarSum checksum string is `tarsum`. ### Version1 -Its element in the checksum "tarsum.v1" +Its element in the TarSum checksum is `tarsum.v1`. The notable changes in this version: -* exclusion of file mtime from the file information headers, in each file +* Exclusion of file `mtime` from the file information headers, in each file checksum calculation -* inclusion of extended attributes (xattrs. Also seen as "SCHILY.xattr." prefixed Pax +* Inclusion of extended attributes (`xattrs`. Also seen as `SCHILY.xattr.` prefixed Pax tar file info headers) keys and values in each file checksum calculation ### VersionDev *Do not use unless validating refinements to the checksum algorithm* -Its element in the checksum "tarsum.dev" +Its element in the TarSum checksum is `tarsum.dev`. -This is a floating place holder for a next version. The methods used for -calculation are subject to change without notice. +This is a floating place holder for a next version and grounds for testing +changes. The methods used for calculation are subject to change without notice, +and this version is for testing and not for production use. ## Ciphers The official default and standard hashing cipher used in the calculation mechanic -is "sha256". This refers to SHA256 hash algorithm as defined in FIPS 180-4. +is `sha256`. 
This refers to SHA256 hash algorithm as defined in FIPS 180-4. -Though the algorithm itself is not exclusively bound to this single hashing -cipher, and support for alternate hashing ciphers was later added [1]. Presently -use of this is for isolated use-cases and future-proofing the TarSum checksum -format. +Though the TarSum algorithm itself is not exclusively bound to the single +hashing cipher `sha256`, support for alternate hashing ciphers was later added +[1]. Use cases for alternate cipher could include future-proofing TarSum +checksum format and using faster cipher hashes for tar filesystem checksums. ## Calculation ### Requirement As mentioned earlier, the calculation is such that it takes into consideration -the life and cycle of the tar archive. In that the tar archive is not an -immutable, permanent artifact. Otherwise options like relying on a known hashing -cipher checksum of the archive itself would be reliable enough. Since the tar -archive is used as a transportation medium, and is thrown away after its -contents are extracted. Therefore, for consistent validation items such as -order of files in the tar archive and time stamps are subject to change once an -image is received. - +the lifecycle of the tar archive. In that the tar archive is not an immutable, +permanent artifact. Otherwise options like relying on a known hashing cipher +checksum of the archive itself would be reliable enough. The tar archive of the +filesystem is used as a transportation medium for Docker images, and the +archive is discarded once its contents are extracted. Therefore, for consistent +validation items such as order of files in the tar archive and time stamps are +subject to change once an image is received. ### Process @@ -175,7 +171,6 @@ For >= Version1, the extented attribute headers ("SCHILY.xattr." prefixed pax headers) included after the above list. These xattrs key/values are first sorted by the keys. 
- #### Header Format The ordered headers are written to the hash in the format of @@ -184,13 +179,11 @@ The ordered headers are written to the hash in the format of with no newline. - #### Body After the order headers of the file have been added to the checksum for the file, the body of the file is written to the hash. - #### List of file sums The list of file sums is sorted by the string of the hexadecimal digest. @@ -199,7 +192,6 @@ If there are two files in the tar with matching paths, the order of occurrence for that path is reflected for the sums of the corresponding file header and body. - #### Final Checksum Begin with a fresh or initial state of the associated hash cipher. If there is @@ -211,7 +203,6 @@ The resulting digest is formatted per the Elements of TarSum checksum, including the TarSum version, the associated hash cipher and the hexadecimal encoded checksum digest. - ## Security Considerations The initial version of TarSum has undergone one update that could invalidate @@ -220,7 +211,6 @@ with same names as prior files in the archive. The latter file will clobber the prior file of the same path. Due to this the algorithm now accounts for files with matching paths, and orders the list of file sums accordingly [3]. - ## Footnotes * [0] Versioning https://github.com/docker/docker/commit/747f89cd327db9d50251b17797c4d825162226d0
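
Review note on the 'Calculation' section: the steps it describes (ordered headers, then body, per-file sums sorted by hex digest, then a final hash) can be sketched as follows. This is a minimal illustration written from the spec text alone, in Python rather than the Go of the actual `pkg/tarsum` code, and it is not expected to reproduce Docker's digests byte for byte (Python's `tarfile` may normalize names differently, for example). The function names `ordered_headers`, `file_sum`, `tarsum_v1`, `split_tarsum` and `make_tar` are hypothetical. Keeping the full `SCHILY.xattr.` key when hashing is an assumption, since the text does not say whether the prefix is part of the key. The matching-path ordering rule is not modeled.

```python
import hashlib
import io
import tarfile

XATTR_PREFIX = "SCHILY.xattr."


def ordered_headers(ti):
    """Return (key, value) string pairs in the order listed for Version1."""
    pairs = [
        ("name", ti.name),
        ("mode", str(ti.mode)),
        ("uid", str(ti.uid)),
        ("gid", str(ti.gid)),
        ("size", str(ti.size)),
        # 'mtime' is left out on purpose: the spec lists it for Version0 only.
        ("typeflag", ti.type.decode("latin-1")),
        ("linkname", ti.linkname),
        ("uname", ti.uname),
        ("gname", ti.gname),
        ("devmajor", str(ti.devmajor)),
        ("devminor", str(ti.devminor)),
    ]
    # Assumption: xattr keys are hashed as stored, including the prefix.
    xattrs = sorted(
        (k, v) for k, v in ti.pax_headers.items() if k.startswith(XATTR_PREFIX)
    )
    return pairs + xattrs


def file_sum(tf, ti):
    """Hash one entry: "{key}{value}" for each header, then the file body."""
    h = hashlib.sha256()
    for key, value in ordered_headers(ti):
        h.update((key + value).encode("utf-8", "surrogateescape"))
    if ti.isreg():
        body = tf.extractfile(ti)
        for chunk in iter(lambda: body.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def tarsum_v1(fileobj, extra=b""):
    """Sketch of the final checksum: extra payload first, then sorted sums."""
    with tarfile.open(fileobj=fileobj, mode="r:*") as tf:
        sums = [file_sum(tf, ti) for ti in tf]
    sums.sort()  # sorted by the string of the hexadecimal digest
    final = hashlib.sha256(extra)
    for s in sums:
        final.update(s.encode("ascii"))
    return "tarsum.v1+sha256:" + final.hexdigest()


def split_tarsum(label):
    """Split a checksum string on ':' and then '+' into its three elements."""
    mechanics, expected = label.split(":", 1)
    version, cipher = mechanics.split("+", 1)
    return version, cipher, expected


def make_tar(entries, mtime):
    """Demo helper: build an in-memory tar from (name, bytes) pairs."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tf:
        for name, data in entries:
            ti = tarfile.TarInfo(name)
            ti.size = len(data)
            ti.mtime = mtime
            tf.addfile(ti, io.BytesIO(data))
    buf.seek(0)
    return buf


# Same files, different archive order and timestamps.
files = [("etc/hosts", b"127.0.0.1 localhost\n"), ("bin/true", b"")]
sum_a = tarsum_v1(make_tar(files, mtime=0))
sum_b = tarsum_v1(make_tar(list(reversed(files)), mtime=1416000000))
print(sum_a == sum_b, split_tarsum(sum_a)[:2])
```

In this sketch the two archives differ in entry order and `mtime` yet yield the same string, which is the property the 'Requirement' section asks for. Changing any file body or any listed header changes the result.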