diff --git a/tarsum/tarsum_spec.md b/tarsum/tarsum_spec.md new file mode 100644 index 0000000..bffd44a --- /dev/null +++ b/tarsum/tarsum_spec.md @@ -0,0 +1,228 @@ +page_title: TarSum checksum specification +page_description: Documentation for algorithm used in the TarSum checksum calculation +page_keywords: docker, checksum, validation, tarsum + +# TarSum Checksum Specification + +## Abstract + +This document describes the algorithms used in performing the TarSum checksum +calculation on file system layers, the need for this method over existing +methods, and the versioning of this calculation. + + +## Introduction + +The transportation of file systems, regarding docker, is done with tar(1) +archives. Types of transpiration include distribution to and from a registry +endpoint, saving and loading through commands or docker daemon APIs, +transferring the build context from client to docker daemon, and committing the +file system of a container to become an image. + +As tar archives are used for transit, but not preserved in many situations, the +focus of the algorithm is to ensure the integrity of the preserved file system, +while maintaining a deterministic accountability. This includes neither +constrain the ordering or manipulation of the files during the creation or +unpacking of the archive, nor include additional metadata state about the file +system attributes. + + +## Intended Audience + +This document is outlining the methods used for consistent checksum calculation +for file systems transported via tar archives. + +Auditing these methodologies is an open and iterative process. This document +should accommodate the review of source code. Ultimately, this document should +be the starting point of further refinements to the algorithm and its future +versions. + + +## Concept + +The checksum mechanism must ensure the integrity and confidentiality of the +file system payload. + + +## Checksum Algorithm Profile + +A checksum mechanism must define the following operations and attributes: + +* associated hashing cipher - used to checksum each file payload and attribute + information. +* checksum list - each file of the file system archive has its checksum + calculated from the payload and attributes of the file. The final checksum is + calculated from this list, with specific ordering. +* version - as the algorithm adapts to requirements, there are behaviors of the + algorithm to manage by versioning. +* archive being calculated - the tar archive having its checksum calculated + + +## Elements of TarSum checksum + +The calculated sum output is a text string. The elements included in the output +of the calculated sum comprise the information needed for validation of the sum +(TarSum version and block cipher used) and the expected checksum in hexadecimal +form. + +There are two delimiters used: +* '+' separates TarSum version from block cipher +* ':' separates calculation mechanics from expected hash + +Example: + + "tarsum.v1+sha256:220a60ecd4a3c32c282622a625a54db9ba0ff55b5ba9c29c7064a2bc358b6a3e" + | | \ | + | | \ | + |_version_|_cipher__|__ | + | \ | + |_calculation_mechanics_|______________________expected_sum_______________________| + + +## Versioning + +Versioning was introduced [0] to accommodate differences in calculation needed, +and ability to maintain reverse compatibility. + +The general algorithm will be describe further in the 'Calculation'. + +### Version0 + +This is the initial version of TarSum. + +Its element in the checksum "tarsum" + + +### Version1 + +Its element in the checksum "tarsum.v1" + +The notable changes in this version: +* exclusion of file mtime from the file information headers, in each file + checksum calculation +* inclusion of extended attributes (xattrs. Also seen as "SCHILY.xattr." prefixed Pax + tar file info headers) keys and values in each file checksum calculation + +### VersionDev + +*Do not use unless validating refinements to the checksum algorithm* + +Its element in the checksum "tarsum.dev" + +This is a floating place holder for a next version. The methods used for +calculation are subject to change without notice. + +## Ciphers + +The official default and standard block cipher used in the calculation mechanic +is "sha256". This refers to SHA256 hash algorithm as defined in FIPS 180-4. + +Though the algorithm itself is not exclusively bound to this single block +cipher, and support for alternate block ciphers was later added [1]. Presently +use of this is for isolated use-cases and future-proofing the TarSum checksum +format. + +## Calculation + +### Requirement + +As mentioned earlier, the calculation is such that it takes into consideration +the life and cycle of the tar archive. In that the tar archive is not an +immutable, permanent artifact. Otherwise options like relying on a known block +cipher checksum of the archive itself would be reliable enough. Since the tar +archive is used as a transportation medium, and is thrown away after its +contents are extracted. Therefore, for consistent validation items such as +order of files in the tar archive and time stamps are subject to change once an +image is received. + + +### Process + +The method is typically iterative due to reading tar info headers from the +archive stream, though this is not a strict requirement. + +#### Files + +Each file in the tar archive have their contents (headers and body) checksummed +individually using the designated associated hashing cipher. The ordered +headers of the file are written to the checksum calculation first, and then the +payload of the file body. + +The resulting checksum of the file is appended to the list of file sums. The +sum is encoded as a string of the hexadecimal digest. Additionally, the file +name and position in the archive is kept as reference for special ordering. + +#### Headers + +The following headers are read, in this +order ( and the corresponding representation of its value): +* 'name' - string +* 'mode' - string of the base10 integer +* 'uid' - string of the integer +* 'gid' - string of the integer +* 'size' - string of the integer +* 'mtime' (_Version0 only_) - string of integer of the seconds since 1970-01-01 00:00:00 UTC +* 'typeflag' - string of the char +* 'linkname' - string +* 'uname' - string +* 'gname' - string +* 'devmajor' - string of the integer +* 'devminor' - string of the integer + +For >= Version1, the extented attribute headers ("SCHILY.xattr." prefixed pax +headers) included after the above list. These xattrs key/values are first +sorted by the keys. + + +#### Header Format + +The ordered headers are written to the hash in the format of + + "{.key}{.value}" + +with no newline. + + +#### Body + +After the order headers of the file have been added to the checksum for the +file, then the body of the file is written to the hash. + + +#### List of file sums + +The list of file sums is sorted by the string of the hexadecimal digest. + +If there are two files in the tar with matching paths, the order of occurrence +for that path is reflected for the sums of the corresponding file header and +body. + + +#### Final Checksum + +Using an initialize hash of the associated hash cipher, if there is additional +payload to include in the TarSum calculation for the archive, it is written +first. Then each checksum from the ordered list of files sums is written to the +hash. The resulting digest is formatted per the Elements of TarSum checksum, +including the TarSum version, the associated hash cipher and the hexadecimal +encoded checksum digest. + + +## Security Considerations + +The initial version of TarSum has undergone one update that could invalidate +handcrafted tar archives. The tar archive format supports appending of files +with same names as prior files in the archive. The latter file will clobber the +prior file of the same path. Due to this the algorithm now accounts for + + +## Footnotes + +* [0] Versioning https://github.com/docker/docker/commit/747f89cd327db9d50251b17797c4d825162226d0 +* [1] Alternate ciphers https://github.com/docker/docker/commit/4e9925d780665149b8bc940d5ba242ada1973c4e + +## Acknowledgements + +Joffrey F (shin-) and Guillaume J. Charmes (creack) on the initial work of the +TarSum calculation. +