mirror of https://github.com/vbatts/tar-split.git synced 2025-10-26 16:30:57 +00:00
tar archive assembly/disassembly
Miloslav Trmač 99c8914877 Add tar/asm.IterateHeaders
This allows reading the metadata contained in tar-split
without expensively recreating the whole tar stream
including full contents.

We have two use cases for this:
- In a situation where tar-split is distributed along with
  a separate metadata stream, ensuring that the two are
  exactly consistent
- Reading the tar headers allows making a ~cheap check
  of consistency of on-disk layers, just checking that the
  files exist in expected sizes, without reading the full
  contents.

This can be implemented outside of this repo, but it's
not ideal:
- The function necessarily hard-codes some assumptions
  about how tar-split determines the boundaries of
  SegmentType/FileType entries (or, indeed, whether it
  uses FileType entries at all). That's best maintained
  directly beside the code that creates this.
- The ExpectedPadding() value is not currently exported,
  so the consumer would have to heuristically guess where
  the padding ends.

Signed-off-by: Miloslav Trmač <mitr@redhat.com>
2024-09-11 20:01:49 +02:00
.github/workflows *: mage, drop go1.1{5,6}, module updates, drop vendor 2023-03-26 14:01:33 -04:00
archive/tar Add tar/asm.IterateHeaders 2024-09-11 20:01:49 +02:00
cmd/tar-split chore: remove refs to deprecated io/ioutil 2023-07-20 23:00:46 +08:00
concept chore: remove refs to deprecated io/ioutil 2023-07-20 23:00:46 +08:00
tar Add tar/asm.IterateHeaders 2024-09-11 20:01:49 +02:00
go.mod Add tar/asm.IterateHeaders 2024-09-11 20:01:49 +02:00
go.sum Add tar/asm.IterateHeaders 2024-09-11 20:01:49 +02:00
LICENSE LICENSE: update LICENSE to BSD 3-clause 2015-12-03 15:45:57 -05:00
mage.go *: mage, drop go1.1{5,6}, module updates, drop vendor 2023-03-26 14:01:33 -04:00
mage_color.go *: mage, drop go1.1{5,6}, module updates, drop vendor 2023-03-26 14:01:33 -04:00
magefile.go magefile: attempting to recreate make file dependencies 2023-04-27 14:19:40 -04:00
README.md fix: utility typo 2023-08-26 16:23:45 +08:00

tar-split


Pristinely disassembling a tar archive, and stashing needed raw bytes and offsets to reassemble a validating original archive.
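To make the idea concrete, here is a minimal sketch of the split-and-reassemble round trip using only the Go standard library. It is not tar-split's implementation: the `entry` type, the `payload-N` keys, and the function names are hypothetical, and it assumes plain octal size fields (no base-256 sizes). Header and padding bytes are kept verbatim while payloads are stored separately, so concatenating everything back reproduces the original bytes.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"strconv"
	"strings"
)

const blockSize = 512

var zeroBlock [blockSize]byte

// entry is one step of the reassembly recipe (hypothetical type for this sketch):
// either raw bytes kept verbatim, or a key naming a payload stored elsewhere.
type entry struct {
	Raw        []byte
	PayloadKey string
}

// payloadSize reads the octal size field at bytes 124..135 of a header block.
// Base-256 sizes (very large files) are not handled in this sketch.
func payloadSize(hdr []byte) (int, error) {
	field := strings.Trim(string(hdr[124:136]), " \x00")
	if field == "" {
		return 0, nil
	}
	n, err := strconv.ParseInt(field, 8, 32)
	return int(n), err
}

// disassemble splits an archive into raw segments and payload references.
func disassemble(archive []byte) ([]entry, map[string][]byte, error) {
	var entries []entry
	payloads := map[string][]byte{}
	off := 0
	for off+blockSize <= len(archive) {
		hdr := archive[off : off+blockSize]
		if bytes.Equal(hdr, zeroBlock[:]) {
			break // the end-of-archive zero blocks start here
		}
		size, err := payloadSize(hdr)
		if err != nil {
			return nil, nil, err
		}
		start := off + blockSize
		end := start + size
		next := end + (blockSize-size%blockSize)%blockSize
		if next > len(archive) {
			return nil, nil, fmt.Errorf("truncated archive at offset %d", off)
		}
		entries = append(entries, entry{Raw: hdr})
		if size > 0 {
			key := fmt.Sprintf("payload-%d", len(payloads))
			payloads[key] = archive[start:end]
			entries = append(entries, entry{PayloadKey: key})
		}
		entries = append(entries, entry{Raw: archive[end:next]}) // padding
		off = next
	}
	// Whatever remains (normally the zero trailer) is kept verbatim.
	entries = append(entries, entry{Raw: archive[off:]})
	return entries, payloads, nil
}

// reassemble concatenates raw segments and payloads in recorded order.
func reassemble(entries []entry, payloads map[string][]byte) []byte {
	var out bytes.Buffer
	for _, e := range entries {
		if e.PayloadKey != "" {
			out.Write(payloads[e.PayloadKey])
		} else {
			out.Write(e.Raw)
		}
	}
	return out.Bytes()
}

// sampleArchive builds a small two-file archive in memory.
func sampleArchive() []byte {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	files := []struct{ name, body string }{
		{"a.txt", "hello"},
		{"dir/b.txt", strings.Repeat("x", 700)},
	}
	for _, f := range files {
		hdr := &tar.Header{Name: f.name, Mode: 0644, Size: int64(len(f.body))}
		if err := tw.WriteHeader(hdr); err != nil {
			panic(err)
		}
		if _, err := tw.Write([]byte(f.body)); err != nil {
			panic(err)
		}
	}
	if err := tw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	original := sampleArchive()
	entries, payloads, err := disassemble(original)
	if err != nil {
		panic(err)
	}
	rebuilt := reassemble(entries, payloads)
	fmt.Println("identical:", bytes.Equal(original, rebuilt))
}
```

Because the raw segments are never re-encoded, a checksum taken over the rebuilt bytes matches one taken over the original archive.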

Docs

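Code API for libraries provided by tar-split:

  • github.com/vbatts/tar-split/tar/asm
  • github.com/vbatts/tar-split/tar/storage
  • github.com/vbatts/tar-split/archive/tar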

Install

The command line utility is installable via:

go install github.com/vbatts/tar-split/cmd/tar-split@latest

Usage

For CLI usage, see its README.md. For the library, see the docs.

Demo

Basic disassembly and assembly

This demonstrates the tar-split command and how to assemble a tar archive from the tar-data.json.gz.

YouTube video of basic command demo

Docker layer preservation

This demonstrates the tar-split integration for docker-1.8, which provides consistent tar archives for the image layer content.

YouTube video of docker layer checksums

Caveat

Eventually, tar-split should detect tar archives for which precise reassembly is not possible.

For example, stored sparse files that have "holes" in them will be read as a contiguous file, though the archive contents may be recorded in sparse format. Therefore, when adding the file payload to a reassembled tar, the file payload would need to be precisely re-sparsified to achieve identical output. This is not something I seek to fix immediately; I would rather have an alert that precise reassembly is not possible. (See more: http://www.gnu.org/software/tar/manual/html_node/Sparse-Formats.html)

Another caveat: while tar archives support having multiple file entries for the same path, we will not support this feature. If there is more than one entry with the same path, expect an error (like ErrDuplicatePath) or a resulting tar stream that does not validate against your original checksum/signature.

Contract

Do not break the API of stdlib archive/tar in our fork (ideally find an upstream mergeable solution).

Std Version

The version of golang stdlib archive/tar is from go1.11. It is minimally extended to expose the raw bytes of the TAR, rather than just the marshalled headers and file stream.

Design

See the design.

Stored Metadata

Since the raw bytes of the headers and padding are stored, you may be wondering what the size implications are. The headers are at least 512 bytes per file (sometimes more), there are at least 1024 null bytes at the end of the archive, and then there is various padding. With a naive storage implementation, the stored metadata therefore grows linearly with the number of files.
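The arithmetic can be sketched as follows, assuming one plain 512-byte header per file (no extended header records) and exactly 1024 trailing null bytes. The function names are hypothetical. The estimate is compared against an archive written by the standard library archive/tar.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
)

const blockSize = 512

// metadataBytes estimates the non-payload bytes of an archive: one header
// block per file, padding up to the next block boundary, and a 1024-byte trailer.
func metadataBytes(sizes []int64) int64 {
	var total int64 = 2 * blockSize // end-of-archive null bytes
	for _, s := range sizes {
		total += blockSize                             // header
		total += (blockSize - s%blockSize) % blockSize // padding
	}
	return total
}

// measuredMetadataBytes builds a real archive and subtracts the payload bytes.
func measuredMetadataBytes(sizes []int64) int64 {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	var payload int64
	for i, s := range sizes {
		hdr := &tar.Header{Name: fmt.Sprintf("file-%d", i), Mode: 0644, Size: s}
		if err := tw.WriteHeader(hdr); err != nil {
			panic(err)
		}
		if _, err := tw.Write(make([]byte, s)); err != nil {
			panic(err)
		}
		payload += s
	}
	if err := tw.Close(); err != nil {
		panic(err)
	}
	return int64(buf.Len()) - payload
}

func main() {
	sizes := []int64{5, 700, 512}
	fmt.Println("estimated:", metadataBytes(sizes))
	fmt.Println("measured: ", measuredMetadataBytes(sizes))
}
```

For these three files both figures come to 3391 bytes: 1536 bytes of headers, 831 bytes of padding, and the 1024-byte trailer.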

First we'll get an archive to work with. For repeatability, we'll make an archive from what you've just cloned:

$ git archive --format=tar -o tar-split.tar HEAD .
$ go install github.com/vbatts/tar-split/cmd/tar-split@latest
$ tar-split checksize ./tar-split.tar
inspecting "tar-split.tar" (size 210k)
 -- number of files: 50
 -- size of metadata uncompressed: 53k
 -- size of gzip compressed metadata: 3k

So, assuming you've managed the extraction of the archive yourself so that the file payloads can be reused from a relative path, the only additional storage cost is as little as 3kb.

But let's look at a larger archive, with many files.

$ ls -sh ./d.tar
1.4G ./d.tar
$ tar-split checksize ~/d.tar 
inspecting "/home/vbatts/d.tar" (size 1420749k)
 -- number of files: 38718
 -- size of metadata uncompressed: 43261k
 -- size of gzip compressed metadata: 2251k

Here, an archive with 38,718 files has a compressed metadata footprint of about 2.2mb.

Rolling the null bytes at the end of the archive into the per-file figure, we can estimate a bytes-per-file rate for the storage implications.

uncompressed     | compressed
~1kb per file    | 0.06kb per file
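These rates follow directly from the checksize output for the larger archive. A small sketch of the arithmetic (`perFileKB` is a hypothetical helper):

```go
package main

import "fmt"

// perFileKB divides a total metadata size in kb by the number of files.
func perFileKB(totalKB float64, files int) float64 {
	return totalKB / float64(files)
}

func main() {
	const files = 38718 // from the d.tar checksize output
	fmt.Printf("uncompressed: %.2f kb per file\n", perFileKB(43261, files))
	fmt.Printf("compressed:   %.3f kb per file\n", perFileKB(2251, files))
}
```

The same calculation on the smaller archive (53k and 3k over 50 files) gives roughly 1.06kb and 0.06kb per file, which is consistent with the rates above.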

What's Next?

  • More implementations of storage Packer and Unpacker
  • More implementations of FileGetter and FilePutter
  • It would be interesting to have an assembler stream that implements io.Seeker

License

See LICENSE