mirror of
https://github.com/vbatts/tar-split.git
synced 2024-11-22 08:05:39 +00:00
README: formatting and cleanup
parent f465e4720e
commit 779e824d76
3 changed files with 106 additions and 122 deletions
36 DESIGN.md
@@ -1,36 +0,0 @@
-Flow of TAR stream
-==================
-
-The underlying use of `github.com/vbatts/tar-split/archive/tar` is most similar
-to stdlib.
-
-
-Packer interface
-----------------
-
-For ease of storage and usage of the raw bytes, there will be a storage
-interface, that accepts an io.Writer (This way you could pass it an in memory
-buffer or a file handle).
-
-Having a Packer interface can allow configuration of hash.Hash for file payloads
-and providing your own io.Writer.
-
-Instead of having a state directory to store all the header information for all
-Readers, we will leave that up to user of Reader. Because we can not assume an
-ID for each Reader, and keeping that information differentiated.
-
-
-
-State Directory
----------------
-
-Perhaps we could deduplicate the header info, by hashing the rawbytes and
-storing them in a directory tree like:
-
-./ac/dc/beef
-
-Then reference the hash of the header info, in the positional records for the
-tar stream. Though this could be a future feature, and not required for an
-initial implementation. Also, this would imply an owned state directory, rather
-than just writing storage info to an io.Writer.
-
98 README.md
@@ -1,5 +1,4 @@
-tar-split
-========
+# tar-split

 [![Build Status](https://travis-ci.org/vbatts/tar-split.svg?branch=master)](https://travis-ci.org/vbatts/tar-split)

@@ -9,17 +8,13 @@ bytes of the TAR, rather than just the marshalled headers and file stream.
 The goal being that by preserving the raw bytes of each header, padding bytes,
 and the raw file payload, one could reassemble the original archive.

-
-Docs
-----
+## Docs

 * https://godoc.org/github.com/vbatts/tar-split/tar/asm
 * https://godoc.org/github.com/vbatts/tar-split/tar/storage
 * https://godoc.org/github.com/vbatts/tar-split/archive/tar

-
-Caveat
-------
+## Caveat

 Eventually this should detect TARs that this is not possible with.

@@ -37,85 +32,19 @@ same path, we will not support this feature. If there are more than one entries
 with the same path, expect an err (like `ErrDuplicatePath`) or a resulting tar
 stream that does not validate your original checksum/signature.

+## Contract

-Contract
---------
+Do not break the API of stdlib `archive/tar` in our fork (ideally find an upstream mergeable solution).

-Do not break the API of stdlib `archive/tar` in our fork (ideally find an
-upstream mergeable solution)
-
-
-Std Version
------------
+## Std Version

 The version of golang stdlib `archive/tar` is from go1.4.1, and their master branch around [a9dddb53f](https://github.com/golang/go/tree/a9dddb53f)

+## Concept

-Example
--------
+See the [design](concept/DESIGN.md).

-First we'll get an archive to work with. For repeatability, we'll make an
-archive from what you've just cloned:
-
-```
-git archive --format=tar -o tar-split.tar HEAD .
-```
-
-Then build the example main.go:
-
-```
-go build ./main.go
-```
-
-Now run the example over the archive:
-
-```
-$ ./main tar-split.tar
-2015/02/20 15:00:58 writing "tar-split.tar" to "tar-split.tar.out"
-pax_global_header pre: 512 read: 52
-.travis.yml pre: 972 read: 374
-DESIGN.md pre: 650 read: 1131
-LICENSE pre: 917 read: 1075
-README.md pre: 973 read: 4289
-archive/ pre: 831 read: 0
-archive/tar/ pre: 512 read: 0
-archive/tar/common.go pre: 512 read: 7790
-[...]
-tar/storage/entry_test.go pre: 667 read: 1137
-tar/storage/getter.go pre: 911 read: 2741
-tar/storage/getter_test.go pre: 843 read: 1491
-tar/storage/packer.go pre: 557 read: 3141
-tar/storage/packer_test.go pre: 955 read: 3096
-EOF padding: 1512
-Remainder: 512
-Size: 215040; Sum: 215040
-```
-
-*What are we seeing here?*
-
-* `pre` is the header of a file entry, and potentially the padding from the
-end of the prior file's payload. Also with particular tar extensions and pax
-attributes, the header can exceed 512 bytes.
-* `read` is the size of the file payload from the entry
-* `EOF padding` is the expected 1024 null bytes on the end of a tar archive,
-plus potential padding from the end of the prior file entry's payload
-* `Remainder` is the remaining bytes of an archive. This is typically deadspace
-as most tar implementations will return after having reached the end of the
-1024 null bytes. Though various implementations will include some amount of
-bytes here, which will affect the checksum of the resulting tar archive,
-therefore this must be accounted for as well.
-
-Ideally the input tar and output `*.out`, will match:
-
-```
-$ sha1sum tar-split.tar*
-ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar
-ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar.out
-```
-
-
-Stored Metadata
----------------
+## Stored Metadata

 Since the raw bytes of the headers and padding are stored, you may be wondering
 what the size implications are. The headers are at least 512 bytes per
@@ -163,8 +92,7 @@ bytes-per-file rate for the storage implications.
 | ~ 1kb per/file | 0.06kb per/file |


-What's Next?
-------------
+## What's Next?

 * More implementations of storage Packer and Unpacker
 - could be a redis or mongo backend
@@ -173,9 +101,7 @@ What's Next?
 * cli tooling to assemble/disassemble a provided tar archive
 * would be interesting to have an assembler stream that implements `io.Seeker`

-License
--------
-
-See LICENSE
+## License

+See [LICENSE](LICENSE)

94 concept/DESIGN.md Normal file
@@ -0,0 +1,94 @@
+# Flow of TAR stream
+
+## `./archive/tar`
+
+The import path `github.com/vbatts/tar-split/archive/tar` is a fork of the upstream golang stdlib [`archive/tar`](http://golang.org/pkg/archive/tar/).
+It adds plumbing to access the raw bytes of the tar stream as the headers and payload are read.
+
+## Packer interface
+
+For ease of storage and usage of the raw bytes, there will be a storage
+interface that accepts an io.Writer (this way you could pass it an in-memory
+buffer or a file handle).
+
+Having a Packer interface can allow configuration of hash.Hash for file payloads
+and providing your own io.Writer.
+
+Instead of having a state directory to store all the header information for all
+Readers, we will leave that up to the user of Reader, because we cannot assume an
+ID for each Reader to keep that information differentiated.
+
+## State Directory
+
+Perhaps we could deduplicate the header info by hashing the raw bytes and
+storing them in a directory tree like:
+
+./ac/dc/beef
+
+Then reference the hash of the header info in the positional records for the
+tar stream. This could be a future feature, though, and is not required for an
+initial implementation. Also, this would imply an owned state directory, rather
+than just writing storage info to an io.Writer.
+
+## Concept Example
+
+First we'll get an archive to work with. For repeatability, we'll make an
+archive from what you've just cloned:
+
+```
+git archive --format=tar -o tar-split.tar HEAD .
+```
+
+Then build the example main.go:
+
+```
+go build ./main.go
+```
+
+Now run the example over the archive:
+
+```
+$ ./main tar-split.tar
+2015/02/20 15:00:58 writing "tar-split.tar" to "tar-split.tar.out"
+pax_global_header pre: 512 read: 52
+.travis.yml pre: 972 read: 374
+DESIGN.md pre: 650 read: 1131
+LICENSE pre: 917 read: 1075
+README.md pre: 973 read: 4289
+archive/ pre: 831 read: 0
+archive/tar/ pre: 512 read: 0
+archive/tar/common.go pre: 512 read: 7790
+[...]
+tar/storage/entry_test.go pre: 667 read: 1137
+tar/storage/getter.go pre: 911 read: 2741
+tar/storage/getter_test.go pre: 843 read: 1491
+tar/storage/packer.go pre: 557 read: 3141
+tar/storage/packer_test.go pre: 955 read: 3096
+EOF padding: 1512
+Remainder: 512
+Size: 215040; Sum: 215040
+```
+
+*What are we seeing here?*
+
+* `pre` is the header of a file entry, and potentially the padding from the
+end of the prior file's payload. Also, with particular tar extensions and pax
+attributes, the header can exceed 512 bytes.
+* `read` is the size of the file payload from the entry
+* `EOF padding` is the expected 1024 null bytes on the end of a tar archive,
+plus potential padding from the end of the prior file entry's payload
+* `Remainder` is the remaining bytes of an archive. This is typically dead space,
+as most tar implementations will return after having reached the end of the
+1024 null bytes. However, various implementations will include some amount of
+bytes here, which will affect the checksum of the resulting tar archive;
+therefore this must be accounted for as well.
+
+Ideally the input tar and the output `*.out` will match:
+
+```
+$ sha1sum tar-split.tar*
+ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar
+ca9e19966b892d9ad5960414abac01ef585a1e22 tar-split.tar.out
+```
+
+
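The "State Directory" section of concept/DESIGN.md floats deduplicating header info by hashing the raw bytes and storing them in a tree like `./ac/dc/beef`. The sketch below shows one way such a path could be derived. The choice of SHA-256, the two-level fan-out, and the name `shardPath` are all assumptions; the design note does not specify a hash function or a layout beyond the example path.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"path/filepath"
)

// shardPath is a hypothetical helper: it hashes the raw header bytes and
// splits the hex digest into <root>/<2 chars>/<2 chars>/<rest>, mirroring the
// ./ac/dc/beef shape from the design note. SHA-256 is an assumption.
func shardPath(root string, raw []byte) string {
	sum := sha256.Sum256(raw)
	h := hex.EncodeToString(sum[:])
	return filepath.Join(root, h[:2], h[2:4], h[4:])
}

func main() {
	headerA := make([]byte, 512) // stand-in for one raw tar header block
	headerB := make([]byte, 512) // identical bytes, as if seen in another archive
	headerC := append([]byte("different"), make([]byte, 503)...)

	// Identical raw bytes map to the same path, which is what makes
	// deduplication possible; different bytes map elsewhere.
	fmt.Println(shardPath("state", headerA) == shardPath("state", headerB))
	fmt.Println(shardPath("state", headerA) == shardPath("state", headerC))
}
```

As the design note points out, this approach implies an owned state directory, unlike the plain `io.Writer` storage.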
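The `pre` and `EOF padding` numbers in the Concept Example follow from tar's 512-byte block size: each payload is padded with null bytes up to the next block boundary, and that padding is counted with the next entry's header. The sketch below redoes the arithmetic for a few entries from the listing. It assumes the following entry has a plain 512-byte header (no pax or other extended headers), and the helper names `padLen` and `nextPre` are illustrative only.

```go
package main

import "fmt"

const blockSize = 512 // tar block size in bytes

// padLen returns how many null bytes follow a payload of the given size so
// that the next header starts on a 512-byte boundary.
func padLen(size int64) int64 {
	return (blockSize - size%blockSize) % blockSize
}

// nextPre returns the expected `pre` value of the entry that follows a payload
// of the given size, assuming that entry has a single 512-byte header.
func nextPre(prevRead int64) int64 {
	return blockSize + padLen(prevRead)
}

func main() {
	// `read` values copied from the example listing.
	entries := []struct {
		name string
		read int64
	}{
		{"pax_global_header", 52},
		{".travis.yml", 374},
		{"DESIGN.md", 1131},
		{"LICENSE", 1075},
		{"README.md", 4289},
	}
	for _, e := range entries {
		fmt.Printf("%-18s read: %4d  padding: %3d  next pre: %d\n",
			e.name, e.read, padLen(e.read), nextPre(e.read))
	}
	// The last payload in the listing (3096 bytes) plus the 1024-byte
	// end-of-archive marker gives the `EOF padding` figure.
	fmt.Println("EOF padding:", 1024+padLen(3096))
}
```

Under that assumption the computed values line up with the listing: 972, 650, 917, 973 and 831 for the entries after each of the five payloads, and 1512 for `EOF padding`.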