1
0
Fork 1
mirror of https://github.com/vbatts/tar-split.git synced 2025-01-11 22:37:08 +00:00
tar-split/README.md

130 lines
4.5 KiB
Markdown

Overview
========
Extend the upstream golang stdlib `archive/tar` library, to expose the raw
bytes of the TAR, rather than just the marshalled headers and file stream.
The goal being that by preserving the raw bytes of each header, padding bytes,
and the raw file payload, one could reassemble the original archive.
Caveat
------
Eventually this should detect TARs that this is not possible with.
For example stored sparse files that have "holes" in them, will be read as a
contiguous file, though the archive contents may be recorded in sparse format.
Therefore when adding the file payload to a reassembled tar, to achieve
identical output, the file payload would need be precisely re-sparsified. This
is not something I seek to fix imediately, but would rather have an alert that
precise reassembly is not possible.
(see more http://www.gnu.org/software/tar/manual/html_node/Sparse-Formats.html)
Other caveat, while tar archives support having multiple file entries for the
same path, we will not support this feature. If there are more than one entries
with the same path, expect an err (like `ErrDuplicatePath`) or a resulting tar
stream that does not validate your original checksum/signature.
Contract
--------
Do not break the API of stdlib `archive/tar`
Std Version
-----------
The version of golang stdlib `archive/tar` is from go1.4.1, and their master branch around [a9dddb53f](https://github.com/golang/go/tree/a9dddb53f)
Docs
----
* https://godoc.org/github.com/vbatts/tar-split/archive/tar
* https://godoc.org/github.com/vbatts/tar-split/tar/storage
* https://godoc.org/github.com/vbatts/tar-split/tar/asm
Example
-------
First we'll get an archive to work with. For repeatability, we'll make an
archive from what you've just cloned:
```
git archive --format=tar -o tar-split.tar HEAD .
```
Then build the example main.go:
```
go build ./main.go
```
Now run the example over the archive:
```
$ ./main tar-split.tar
2015/02/20 15:00:58 writing "tar-split.tar" to "tar-split.tar.out"
pax_global_header pre: 512 read: 52 post: 0
LICENSE pre: 972 read: 1075 post: 0
README.md pre: 973 read: 1004 post: 0
archive/ pre: 532 read: 0 post: 0
archive/tar/ pre: 512 read: 0 post: 0
archive/tar/common.go pre: 512 read: 7790 post: 0
archive/tar/example_test.go pre: 914 read: 1659 post: 0
archive/tar/reader.go pre: 901 read: 25303 post: 0
archive/tar/reader_test.go pre: 809 read: 17513 post: 0
archive/tar/stat_atim.go pre: 919 read: 414 post: 0
archive/tar/stat_atimespec.go pre: 610 read: 414 post: 0
archive/tar/stat_unix.go pre: 610 read: 716 post: 0
archive/tar/tar_test.go pre: 820 read: 6673 post: 0
archive/tar/testdata/ pre: 1007 read: 0 post: 0
archive/tar/testdata/gnu.tar pre: 512 read: 3072 post: 0
archive/tar/testdata/nil-uid.tar pre: 512 read: 1024 post: 0
archive/tar/testdata/pax.tar pre: 512 read: 10240 post: 0
archive/tar/testdata/small.txt pre: 512 read: 5 post: 0
archive/tar/testdata/small2.txt pre: 1019 read: 11 post: 0
archive/tar/testdata/sparse-formats.tar pre: 1013 read: 17920 post: 0
archive/tar/testdata/star.tar pre: 512 read: 3072 post: 0
archive/tar/testdata/ustar.tar pre: 512 read: 2048 post: 0
archive/tar/testdata/v7.tar pre: 512 read: 3584 post: 0
archive/tar/testdata/writer-big-long.tar pre: 512 read: 4096 post: 0
archive/tar/testdata/writer-big.tar pre: 512 read: 4096 post: 0
archive/tar/testdata/writer.tar pre: 512 read: 3584 post: 0
archive/tar/testdata/xattrs.tar pre: 512 read: 5120 post: 0
archive/tar/writer.go pre: 512 read: 11867 post: 0
archive/tar/writer_test.go pre: 933 read: 12436 post: 0
main.go pre: 876 read: 1568 post: 0
old.go pre: 992 read: 4918 post: 0
Size: 174080; Sum: 174080
```
Ideally the input tar and output `*.out`, will match:
```
$ sha1sum tar-split.tar*
ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar
ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar.out
```
What's Next?
------------
* Add tests for different types of tar options/extensions
* Package for convenience handling around collecting the RawBytes()
* Marshalling and storing index, ordering, file size and perhaps relative path of extracted files
- perhaps have an API to allow user to provided a `hash.Hash` to checksum and store for the file payloads
- though not enabled by default
- this way, users wanting to implement an on disk tree validation could do so
- but otherwise, we rely on the resulting re-assembled tar be validated
* Using stored index information, make an API for providing `io.Reader` and perhaps `tar.Reader` from re-assembled tar
License
-------
See LICENSE