# tar-split

[![Build Status](https://travis-ci.org/vbatts/tar-split.svg?branch=master)](https://travis-ci.org/vbatts/tar-split)

Pristinely disassembling a tar archive, and stashing needed raw bytes and offsets to reassemble a validating original archive.

## Docs

Code API for libraries provided by `tar-split`:

* https://godoc.org/github.com/vbatts/tar-split/tar/asm
* https://godoc.org/github.com/vbatts/tar-split/tar/storage
* https://godoc.org/github.com/vbatts/tar-split/archive/tar

## Install

The command line utility is installable via:

```bash
go get github.com/vbatts/tar-split/cmd/tar-split
```

## Usage

For CLI usage, see its [README.md](cmd/tar-split/README.md).
For the library, see the [docs](#docs).

## Demo

### Basic disassembly and assembly

![basic cmd demo thumbnail](https://i.ytimg.com/vi/vh5wyjIOBtc/2.jpg?time=1445027151805)
[youtube video of basic command demo](https://youtu.be/vh5wyjIOBtc)


## Caveat

Eventually this should detect tar archives that cannot be reassembled precisely.

For example, stored sparse files that have "holes" in them will be read as a
contiguous file, though the archive contents may be recorded in sparse format.
Therefore, when adding the file payload to a reassembled tar, the payload would
need to be precisely re-sparsified to achieve identical output. This is not
something I seek to fix immediately; I would rather have an alert that precise
reassembly is not possible.
(see more: http://www.gnu.org/software/tar/manual/html_node/Sparse-Formats.html)


Another caveat: while tar archives support having multiple file entries for the
same path, we will not support this feature. If there is more than one entry
with the same path, expect an err (like `ErrDuplicatePath`) or a resulting tar
stream that does not validate your original checksum/signature.
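
To check an archive for this condition up front, something like the following sketch may help. It uses only the Go stdlib `archive/tar`, not the `tar-split` API; `duplicatePaths` and `buildArchive` are hypothetical helpers written for this example, and paths are compared verbatim (so `./a.txt` and `a.txt` count as different).

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
)

// duplicatePaths is a hypothetical helper (not part of tar-split). It reads a
// tar stream and returns each path that appears more than once, in the order
// in which the first repeat was seen.
func duplicatePaths(r io.Reader) ([]string, error) {
	seen := map[string]int{}
	var dups []string
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return dups, nil
		}
		if err != nil {
			return nil, err
		}
		seen[hdr.Name]++
		if seen[hdr.Name] == 2 {
			dups = append(dups, hdr.Name)
		}
	}
}

// buildArchive writes one small regular file per name into an in-memory tar.
func buildArchive(names ...string) (*bytes.Buffer, error) {
	buf := new(bytes.Buffer)
	tw := tar.NewWriter(buf)
	for _, name := range names {
		body := []byte("payload")
		hdr := &tar.Header{Name: name, Mode: 0600, Size: int64(len(body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(body); err != nil {
			return nil, err
		}
	}
	return buf, tw.Close()
}

func main() {
	archive, err := buildArchive("a.txt", "b.txt", "a.txt")
	if err != nil {
		panic(err)
	}
	dups, err := duplicatePaths(archive)
	if err != nil {
		panic(err)
	}
	fmt.Println("duplicate paths:", dups)
}
```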

## Contract

Do not break the API of stdlib `archive/tar` in our fork (ideally find an upstream mergeable solution).

## Std Version

The version of golang stdlib `archive/tar` is from go1.4.1, and their master branch around [a9dddb53f](https://github.com/golang/go/tree/a9dddb53f).
It is minimally extended to expose the raw bytes of the TAR, rather than just the marshalled headers and file stream.


## Design

See the [design](concept/DESIGN.md).

## Stored Metadata

Since the raw bytes of the headers and padding are stored, you may be wondering
what the size implications are. Each file has a header of at least 512 bytes
(sometimes more), the archive ends with at least 1024 null bytes, and there is
various padding in between. With a naive storage implementation, the stored
metadata therefore grows linearly with the number of files.
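
The arithmetic above can be sketched as a lower bound. This is only an estimate under the stated minimums (one 512-byte header per file, payload padded to a 512-byte boundary, two trailing zero blocks); `minMetadataBytes` is a hypothetical helper, and real archives with longer headers or extra trailing padding will come out larger.

```go
package main

import "fmt"

const blockSize = 512 // tar archives are laid out in 512-byte blocks

// minMetadataBytes is a hypothetical helper giving a lower bound on the
// non-payload bytes of a tar archive holding files of the given sizes:
// one header block per file, zero padding up to the next block boundary,
// and two trailing zero blocks. Longer headers only add to this.
func minMetadataBytes(sizes []int64) int64 {
	total := int64(2 * blockSize) // end-of-archive marker
	for _, size := range sizes {
		total += blockSize // header block
		if rem := size % blockSize; rem != 0 {
			total += blockSize - rem // padding after the payload
		}
	}
	return total
}

func main() {
	sizes := []int64{0, 1, 512, 513}
	fmt.Println(minMetadataBytes(sizes)) // prints 4094
}
```

Here the four headers contribute 2048 bytes, the two unaligned payloads need 511 bytes of padding each, and the trailer adds 1024.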

First we'll get an archive to work with. For repeatability, we'll make an
archive from what you've just cloned:

```bash
git archive --format=tar -o tar-split.tar HEAD .
```

```bash
$ go get github.com/vbatts/tar-split/cmd/tar-split
$ tar-split checksize ./tar-split.tar
inspecting "tar-split.tar" (size 210k)
 -- number of files: 50
 -- size of metadata uncompressed: 53k
 -- size of gzip compressed metadata: 3k
```

So, assuming you've managed the extraction of the archive yourself and can
reuse the file payloads from a relative path, the only additional storage
cost is as little as 3 KB.

But let's look at a larger archive, with many files.

```bash
$ ls -sh ./d.tar
1.4G ./d.tar
$ tar-split checksize ~/d.tar 
inspecting "/home/vbatts/d.tar" (size 1420749k)
 -- number of files: 38718
 -- size of metadata uncompressed: 43261k
 -- size of gzip compressed metadata: 2251k
```

Here, an archive with 38,718 files has a compressed metadata footprint of about 2 MB.

Rolling the null bytes at the end of the archive into the total, we can assume
a per-file rate for the storage implications.

| uncompressed | compressed |
| :----------: | :--------: |
| ~ 1 KB per file | ~ 0.06 KB per file |
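
These rates follow from dividing the `checksize` figures for the larger archive by its file count. A small sketch of that arithmetic, assuming the `k` in the output means kilobytes (`perFileKB` is a hypothetical helper):

```go
package main

import "fmt"

// perFileKB is a hypothetical helper: metadata size in KB divided by the
// number of files in the archive.
func perFileKB(metadataKB float64, files int) float64 {
	return metadataKB / float64(files)
}

func main() {
	const files = 38718 // from the checksize output for d.tar
	fmt.Printf("uncompressed: %.2f KB per file\n", perFileKB(43261, files))
	fmt.Printf("compressed:   %.2f KB per file\n", perFileKB(2251, files))
}
```

This prints roughly 1.12 KB and 0.06 KB per file, matching the rounded values in the table.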


## What's Next?

* More implementations of storage `Packer` and `Unpacker`
* More implementations of `FileGetter` and `FilePutter`
* It would be interesting to have an assembler stream that implements `io.Seeker`


## License

See [LICENSE](LICENSE)