# tar-split
[![Build Status](https://travis-ci.org/vbatts/tar-split.svg?branch=master)](https://travis-ci.org/vbatts/tar-split)

`tar-split` extends the upstream Go stdlib `archive/tar` library to expose the raw
bytes of the tar archive, rather than just the marshalled headers and file stream.
The goal is that, by preserving the raw bytes of each header, the padding bytes,
and the raw file payload, one can reassemble the original archive.

## Docs
Code API for libraries provided by `tar-split`:
* https://godoc.org/github.com/vbatts/tar-split/tar/asm
* https://godoc.org/github.com/vbatts/tar-split/tar/storage
* https://godoc.org/github.com/vbatts/tar-split/archive/tar
## Install
The command line utility is installable via:
```bash
go get github.com/vbatts/tar-split/cmd/tar-split
```
## Caveat
Eventually, this should detect tar archives for which precise reassembly is not possible.
For example, stored sparse files that have "holes" in them will be read as a
contiguous file, though the archive contents may be recorded in sparse format.
Therefore, when adding the file payload to a reassembled tar, the file payload
would need to be precisely re-sparsified to achieve identical output. This
is not something I seek to fix immediately; I would rather have an alert that
precise reassembly is not possible
(see more: http://www.gnu.org/software/tar/manual/html_node/Sparse-Formats.html).

Another caveat: while tar archives support having multiple file entries for the
same path, we will not support this feature. If there is more than one entry
with the same path, expect an error (like `ErrDuplicatePath`) or a resulting tar
stream that does not validate against your original checksum/signature.
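
Until such detection exists, both caveats can be checked for ahead of time. The following is a minimal sketch that uses only the Go stdlib `archive/tar`, not `tar-split` itself. The names `scanReport`, `scanTar`, `isSparse` and `buildTar` are hypothetical, and the sparse check assumes that sparse entries show up either with the old GNU sparse type flag or with PAX records under the `GNU.sparse.` prefix.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// scanReport lists entries that may prevent precise reassembly.
// The type and its field names are hypothetical, not part of tar-split.
type scanReport struct {
	Duplicates []string // paths that appear more than once
	Sparse     []string // entries that look like they use a sparse format
}

// scanTar walks every header in r and records duplicate paths and
// sparse-looking entries. Names are compared verbatim, without cleaning.
func scanTar(r io.Reader) (scanReport, error) {
	var rep scanReport
	seen := map[string]bool{}
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return rep, nil
		}
		if err != nil {
			return rep, err
		}
		if seen[hdr.Name] {
			rep.Duplicates = append(rep.Duplicates, hdr.Name)
		}
		seen[hdr.Name] = true
		if isSparse(hdr) {
			rep.Sparse = append(rep.Sparse, hdr.Name)
		}
	}
}

// isSparse reports whether hdr carries the old GNU sparse type flag or
// any PAX record in the GNU.sparse.* namespace (an assumption about how
// sparse entries surface through archive/tar).
func isSparse(hdr *tar.Header) bool {
	if hdr.Typeflag == tar.TypeGNUSparse {
		return true
	}
	for key := range hdr.PAXRecords {
		if strings.HasPrefix(key, "GNU.sparse.") {
			return true
		}
	}
	return false
}

// buildTar writes one small regular file per name, for demonstration only.
func buildTar(names ...string) (*bytes.Buffer, error) {
	buf := new(bytes.Buffer)
	tw := tar.NewWriter(buf)
	for _, name := range names {
		body := []byte("payload of " + name)
		hdr := &tar.Header{Name: name, Mode: 0644, Size: int64(len(body))}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(body); err != nil {
			return nil, err
		}
	}
	return buf, tw.Close()
}

func main() {
	// "a.txt" is written twice to trigger the duplicate-path case.
	buf, err := buildTar("a.txt", "b.txt", "a.txt")
	if err != nil {
		panic(err)
	}
	rep, err := scanTar(buf)
	if err != nil {
		panic(err)
	}
	fmt.Println("duplicates:", rep.Duplicates)
	fmt.Println("sparse:", rep.Sparse)
}
```

A non-empty report is only a hint that the reassembled stream may not match the original checksum; it does not replace verifying the result.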
## Contract
Do not break the API of stdlib `archive/tar` in our fork (ideally find an upstream mergeable solution).
## Std Version
The version of the Go stdlib `archive/tar` used here is from go1.4.1 and from the upstream master branch at around [a9dddb53f](https://github.com/golang/go/tree/a9dddb53f).
## Design
See the [design](concept/DESIGN.md).
## Stored Metadata
Since the raw bytes of the headers and padding are stored, you may be wondering
what the size implications are. Each file carries a header of at least 512 bytes
(sometimes more), the archive ends with at least 1024 null bytes, and file
payloads are padded out to 512-byte block boundaries. With a naive storage
implementation, the stored metadata therefore grows linearly with the number of files.
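
To make that arithmetic concrete, here is a rough sketch that estimates the minimum metadata size for a set of file sizes and compares it with an archive written by the Go stdlib `archive/tar`. The helper names `minMetadataBytes` and `archiveLen` are hypothetical, and the estimate assumes short file names and no extended headers, so real archives can be larger.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
)

const blockSize = 512 // tar archives are laid out in 512-byte blocks

// minMetadataBytes estimates the non-payload bytes of an archive holding
// regular files of the given sizes: one header block per file, zero
// padding up to the next block boundary, and two trailing zero blocks.
// Long names or extended headers add more blocks, so this is a lower bound.
func minMetadataBytes(sizes []int64) int64 {
	total := int64(2 * blockSize) // end-of-archive marker
	for _, size := range sizes {
		total += blockSize // header block
		if rem := size % blockSize; rem != 0 {
			total += blockSize - rem // payload padding
		}
	}
	return total
}

// archiveLen writes zero-filled files of the given sizes into an
// in-memory archive and returns the archive's total length in bytes.
func archiveLen(sizes []int64) (int64, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	for i, size := range sizes {
		hdr := &tar.Header{Name: fmt.Sprintf("file%d", i), Mode: 0644, Size: size}
		if err := tw.WriteHeader(hdr); err != nil {
			return 0, err
		}
		if _, err := tw.Write(make([]byte, size)); err != nil {
			return 0, err
		}
	}
	if err := tw.Close(); err != nil {
		return 0, err
	}
	return int64(buf.Len()), nil
}

func main() {
	sizes := []int64{0, 1, 512, 1000}
	var payload int64
	for _, size := range sizes {
		payload += size
	}
	total, err := archiveLen(sizes)
	if err != nil {
		panic(err)
	}
	fmt.Println("estimated metadata bytes:", minMetadataBytes(sizes))
	fmt.Println("actual metadata bytes:   ", total-payload)
}
```

For these four small files the two numbers agree, because every entry fits in a single header block.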

First, we'll get an archive to work with. For repeatability, we'll make an
archive from what you've just cloned:
```bash
git archive --format=tar -o tar-split.tar HEAD .
```
```bash
$ go get github.com/vbatts/tar-split/cmd/tar-split
$ tar-split checksize ./tar-split.tar
inspecting "tar-split.tar" (size 210k)
-- number of files: 50
-- size of metadata uncompressed: 53k
-- size of gzip compressed metadata: 3k
```
So, assuming you've managed the extraction of the archive yourself and can reuse
the file payloads from a relative path, the only additional storage cost is as
little as 3 KB.

But let's look at a larger archive, with many files.
```bash
$ ls -sh ./d.tar
1.4G ./d.tar
$ tar-split checksize ~/d.tar
inspecting "/home/vbatts/d.tar" (size 1420749k)
-- number of files: 38718
-- size of metadata uncompressed: 43261k
-- size of gzip compressed metadata: 2251k
```
Here, an archive with 38,718 files has a compressed metadata footprint of about 2 MB.

Rolling the null bytes at the end of the archive into the total, we can estimate
a per-file rate for the storage cost:

| uncompressed | compressed |
| :----------: | :--------: |
| ~1 KB per file | ~0.06 KB per file |
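
The per-file figures follow from dividing the `checksize` totals for `d.tar` by the number of files. A small sketch of that arithmetic, treating the reported `k` values as kilobytes (the helper name `perFileKB` is hypothetical):

```go
package main

import "fmt"

// perFileKB divides a metadata total (in KB) by the number of files.
func perFileKB(totalKB, files int) float64 {
	return float64(totalKB) / float64(files)
}

func main() {
	const files = 38718 // number of files reported for d.tar

	// 43261k uncompressed and 2251k gzip-compressed metadata.
	fmt.Printf("uncompressed: %.2f KB per file\n", perFileKB(43261, files))
	fmt.Printf("compressed:   %.2f KB per file\n", perFileKB(2251, files))
}
```

This prints roughly 1.12 KB and 0.06 KB per file, which the table rounds to about 1 KB and 0.06 KB.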
## What's Next?
* More implementations of storage Packer and Unpacker
  - could be a redis or mongo backend
* More implementations of FileGetter and FilePutter
  - could be a redis or mongo backend
* cli tooling to assemble/disassemble a provided tar archive
* would be interesting to have an assembler stream that implements `io.Seeker`
## License
See [LICENSE](LICENSE)