tar-split
Pristinely disassembling a tar archive, and stashing needed raw bytes and offsets to reassemble a validating original archive.
Docs
Code API for libraries provided by tar-split:
- https://godoc.org/github.com/vbatts/tar-split/tar/asm
- https://godoc.org/github.com/vbatts/tar-split/tar/storage
- https://godoc.org/github.com/vbatts/tar-split/archive/tar
Install
The command line utility is installable via:
go get github.com/vbatts/tar-split/cmd/tar-split
Usage
For CLI usage, see its README.md. For the library, see the docs.
Demo
Basic disassembly and assembly
This demonstrates the tar-split command and how to assemble a tar archive from the tar-data.json.gz.
youtube video of basic command demo
Docker layer preservation
This demonstrates the tar-split integration for docker-1.8, which provides consistent tar archives for the image layer content.
youtube video of docker layer checksums
Caveat
Eventually this should detect TARs for which precise reassembly is not possible.
For example, stored sparse files that have "holes" in them will be read as a contiguous file, though the archive contents may be recorded in sparse format. Therefore, when adding the file payload to a reassembled tar, the payload would need to be precisely re-sparsified to achieve identical output. This is not something I seek to fix immediately; I would rather have an alert that precise reassembly is not possible. (see more http://www.gnu.org/software/tar/manual/html_node/Sparse-Formats.html)
Another caveat: while tar archives support having multiple file entries for the same path, we will not support this feature. If there is more than one entry with the same path, expect an err (like ErrDuplicatePath) or a resulting tar stream that does not validate your original checksum/signature.
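A quick pre-check for this caveat can be sketched with the standard library alone. The example below is an illustration, not tar-split's own detection logic: the names `duplicatePaths` and `sampleArchive` are hypothetical, and it compares entry names as exact strings, whereas a real implementation might normalize paths first.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"log"
)

// duplicatePaths scans a tar stream and returns every entry name that
// occurs more than once, listed in the order of its first repetition.
// Hypothetical helper: names are compared verbatim, without cleaning.
func duplicatePaths(r io.Reader) ([]string, error) {
	seen := make(map[string]int)
	var dups []string
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return dups, nil // end of archive
		}
		if err != nil {
			return nil, err
		}
		seen[hdr.Name]++
		if seen[hdr.Name] == 2 {
			dups = append(dups, hdr.Name)
		}
	}
}

// sampleArchive builds a small in-memory tar holding one empty regular
// file per name. It exists only to feed the demo below.
func sampleArchive(names ...string) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for _, name := range names {
		hdr := &tar.Header{Name: name, Mode: 0644, Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	buf, err := sampleArchive("a.txt", "b.txt", "a.txt")
	if err != nil {
		log.Fatal(err)
	}
	dups, err := duplicatePaths(buf)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(dups) // prints [a.txt]
}
```

Running such a check before disassembly gives an early warning that the archive is one this tool is not expected to round-trip.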
Contract
Do not break the API of stdlib archive/tar in our fork (ideally find an upstream mergeable solution).
Std Version
The version of golang stdlib archive/tar is from go1.4.1, and their master branch around a9dddb53f.
It is minimally extended to expose the raw bytes of the TAR, rather than just the marshalled headers and file stream.
Design
See the design.
Stored Metadata
Since the raw bytes of the headers and padding are stored, you may be wondering what the size implications are. The headers are at least 512 bytes per file (sometimes more), there are at least 1024 null bytes at the end of the archive, and each file payload is followed by padding. With a naive storage implementation, the stored metadata therefore grows linearly with the number of files.
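As a rough illustration of that arithmetic, the sketch below computes a lower bound on the non-payload bytes of an archive from its payload sizes. It assumes exactly one 512-byte header per entry, payload padding up to the next 512-byte boundary, and a 1024-byte end-of-archive marker. Real archives can be larger (extended headers, long names, record-size padding), and the function names here are hypothetical.

```go
package main

import "fmt"

// blockSize is the tar block size in bytes.
const blockSize = 512

// paddingFor returns how many null bytes follow a payload of the given
// size so that the next header starts on a 512-byte boundary.
func paddingFor(size int64) int64 {
	return (blockSize - size%blockSize) % blockSize
}

// minRawMetadata returns a lower bound on the bytes of an archive that
// are not file payload: one header block per entry, the padding after
// each payload, and the two null blocks that end the archive.
func minRawMetadata(payloadSizes []int64) int64 {
	total := int64(2 * blockSize) // end-of-archive marker
	for _, size := range payloadSizes {
		total += blockSize + paddingFor(size)
	}
	return total
}

func main() {
	sizes := []int64{0, 1, 512, 1000}
	fmt.Println(minRawMetadata(sizes)) // prints 3607
}
```

For these four payloads (1513 bytes in total) the bound is 3607 bytes, so the whole archive would be at least 5120 bytes, or ten 512-byte blocks.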
First we'll get an archive to work with. For repeatability, we'll make an archive from what you've just cloned:
$ git archive --format=tar -o tar-split.tar HEAD .
$ go get github.com/vbatts/tar-split/cmd/tar-split
$ tar-split checksize ./tar-split.tar
inspecting "tar-split.tar" (size 210k)
-- number of files: 50
-- size of metadata uncompressed: 53k
-- size of gzip compressed metadata: 3k
So assuming you've managed the extraction of the archive yourself, for reuse of the file payloads from a relative path, then the only additional storage implications are as little as 3kb.
But let's look at a larger archive, with many files.
$ ls -sh ./d.tar
1.4G ./d.tar
$ tar-split checksize ~/d.tar
inspecting "/home/vbatts/d.tar" (size 1420749k)
-- number of files: 38718
-- size of metadata uncompressed: 43261k
-- size of gzip compressed metadata: 2251k
Here, an archive with 38,718 files has a compressed metadata footprint of about 2 MB.
Folding the null bytes at the end of the archive into the per-file figure, we can estimate a bytes-per-file rate for the storage implications.
uncompressed | compressed |
---|---|
~1 kb per file | ~0.06 kb per file |
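The per-file rates can be rechecked from the two checksize runs above. The snippet below is only that division written out; `perFileKB` is a hypothetical helper, and the input numbers are copied from the output shown earlier.

```go
package main

import "fmt"

// perFileKB divides a metadata size (in kilobytes) by the number of
// files in the archive, giving an average kilobytes-per-file rate.
func perFileKB(metadataKB float64, files int) float64 {
	return metadataKB / float64(files)
}

func main() {
	// Small archive: 50 files, 53k uncompressed, 3k gzip compressed.
	fmt.Printf("small: %.2f kb/file raw, %.2f kb/file gzip\n",
		perFileKB(53, 50), perFileKB(3, 50))

	// Large archive: 38718 files, 43261k uncompressed, 2251k gzip compressed.
	fmt.Printf("large: %.2f kb/file raw, %.2f kb/file gzip\n",
		perFileKB(43261, 38718), perFileKB(2251, 38718))
}
```

Both archives land at a little over 1 kb per file uncompressed and roughly 0.06 kb per file compressed, which is where the figures in the table come from.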
What's Next?
- More implementations of storage Packer and Unpacker
- More implementations of FileGetter and FilePutter
- would be interesting to have an assembler stream that implements io.Seeker
License
See LICENSE