README: formatting and cleanup

2025-10-05 05:31:01 +00:00 · 2015-08-10 15:24:51 -04:00 · 779e824d76
commit 779e824d76
parent f465e4720e
3 changed files with 106 additions and 122 deletions
--- a/DESIGN.md
+++ b/DESIGN.md
@@ -1,36 +0,0 @@
 Flow of TAR stream
 ==================
 The underlying use of `github.com/vbatts/tar-split/archive/tar` is most similar
 to stdlib.
 Packer interface
 ----------------
 For ease of storage and usage of the raw bytes, there will be a storage
 interface, that accepts an io.Writer (This way you could pass it an in memory
 buffer or a file handle).
 Having a Packer interface can allow configuration of hash.Hash for file payloads
 and providing your own io.Writer.
 Instead of having a state directory to store all the header information for all
 Readers, we will leave that up to the user of Reader, because we cannot assume an
 ID for each Reader with which to keep that information differentiated.
 State Directory
 ---------------
 Perhaps we could deduplicate the header info, by hashing the rawbytes and
 storing them in a directory tree like:
 	./ac/dc/beef
 Then reference the hash of the header info, in the positional records for the
 tar stream. Though this could be a future feature, and not required for an
 initial implementation. Also, this would imply an owned state directory, rather
 than just writing storage info to an io.Writer.
--- a/README.md
+++ b/README.md
@@ -1,5 +1,4 @@
-tar-split
+# tar-split
 ========
 [![Build Status](https://travis-ci.org/vbatts/tar-split.svg?branch=master)](https://travis-ci.org/vbatts/tar-split)
@@ -9,17 +8,13 @@ bytes of the TAR, rather than just the marshalled headers and file stream.
 The goal being that by preserving the raw bytes of each header, padding bytes,
 and the raw file payload, one could reassemble the original archive.
-
+## Docs
 Docs
 ----
 * https://godoc.org/github.com/vbatts/tar-split/tar/asm
 * https://godoc.org/github.com/vbatts/tar-split/tar/storage
 * https://godoc.org/github.com/vbatts/tar-split/archive/tar
-
+## Caveat
 Caveat
 ------
 Eventually this should detect TARs that this is not possible with.
@@ -37,85 +32,19 @@ same path, we will not support this feature. If there are more than one entries
 with the same path, expect an err (like `ErrDuplicatePath`) or a resulting tar
 stream that does not validate your original checksum/signature.
 ## Contract
-Contract
+Do not break the API of stdlib `archive/tar` in our fork (ideally find an upstream mergeable solution).
 --------
-Do not break the API of stdlib `archive/tar` in our fork (ideally find an
+## Std Version
 upstream mergeable solution)
 Std Version
 -----------
 The version of golang stdlib `archive/tar` is from go1.4.1, and their master branch around [a9dddb53f](https://github.com/golang/go/tree/a9dddb53f)
 ## Concept
-Example
+See the [design](concept/DESIGN.md).
 -------
-First we'll get an archive to work with. For repeatability, we'll make an
+## Stored Metadata
 archive from what you've just cloned:
 ```
 git archive --format=tar -o tar-split.tar HEAD .
 ```
 Then build the example main.go:
 ```
 go build ./main.go
 ```
 Now run the example over the archive:
 ```
 $ ./main tar-split.tar
 2015/02/20 15:00:58 writing "tar-split.tar" to "tar-split.tar.out"
 pax_global_header pre: 512 read: 52
 .travis.yml pre: 972 read: 374
 DESIGN.md pre: 650 read: 1131
 LICENSE pre: 917 read: 1075
 README.md pre: 973 read: 4289
 archive/ pre: 831 read: 0
 archive/tar/ pre: 512 read: 0
 archive/tar/common.go pre: 512 read: 7790
 [...]
 tar/storage/entry_test.go pre: 667 read: 1137
 tar/storage/getter.go pre: 911 read: 2741
 tar/storage/getter_test.go pre: 843 read: 1491
 tar/storage/packer.go pre: 557 read: 3141
 tar/storage/packer_test.go pre: 955 read: 3096
 EOF padding: 1512
 Remainder: 512
 Size: 215040; Sum: 215040
 ```
 *What are we seeing here?* 
 * `pre` is the header of a file entry, and potentially the padding from the
  end of the prior file's payload. Also with particular tar extensions and pax
  attributes, the header can exceed 512 bytes.
 * `read` is the size of the file payload from the entry
 * `EOF padding` is the expected 1024 null bytes on the end of a tar archive,
  plus potential padding from the end of the prior file entry's payload
 * `Remainder` is the remaining bytes of an archive. This is typically dead space,
  as most tar implementations will return after having reached the end of the
  1024 null bytes. Some implementations do write bytes here, though, and these
  affect the checksum of the resulting tar archive, so they must be accounted
  for as well.
 Ideally the input tar and output `*.out`, will match:
 ```
 $ sha1sum tar-split.tar*
 ca9e19966b892d9ad5960414abac01ef585a1e22  tar-split.tar
 ca9e19966b892d9ad5960414abac01ef585a1e22  tar-split.tar.out
 ```
 Stored Metadata
 ---------------
 Since the raw bytes of the headers and padding are stored, you may be wondering
 what the size implications are. The headers are at least 512 bytes per
@@ -163,8 +92,7 @@ bytes-per-file rate for the storage implications.
 | ~ 1kb per/file | 0.06kb per/file |
-What's Next?
+## What's Next?
 ------------
 * More implementations of storage Packer and Unpacker
 - could be a redis or mongo backend
@@ -173,9 +101,7 @@ What's Next?
 * cli tooling to assemble/disassemble a provided tar archive
 * would be interesting to have an assembler stream that implements `io.Seeker`
-License
+## License
 -------
 See LICENSE
 See [LICENSE](LICENSE)
--- a/concept/DESIGN.md
+++ b/concept/DESIGN.md
@@ -0,0 +1,94 @@
 # Flow of TAR stream
 ## `./archive/tar`
 The import path `github.com/vbatts/tar-split/archive/tar` is a fork of upstream golang stdlib [`archive/tar`](http://golang.org/pkg/archive/tar/).
 It adds plumbing to access raw bytes of the tar stream as the headers and payload are read.
 ## Packer interface
 For ease of storage and usage of the raw bytes, there will be a storage
 interface, that accepts an io.Writer (This way you could pass it an in memory
 buffer or a file handle).
 Having a Packer interface can allow configuration of hash.Hash for file payloads
 and providing your own io.Writer.
 Instead of having a state directory to store all the header information for all
 Readers, we will leave that up to the user of Reader, because we cannot assume an
 ID for each Reader with which to keep that information differentiated.
 ## State Directory
 Perhaps we could deduplicate the header info, by hashing the rawbytes and
 storing them in a directory tree like:
 	./ac/dc/beef
 Then reference the hash of the header info, in the positional records for the
 tar stream. Though this could be a future feature, and not required for an
 initial implementation. Also, this would imply an owned state directory, rather
 than just writing storage info to an io.Writer.
 ## Concept Example
 First we'll get an archive to work with. For repeatability, we'll make an
 archive from what you've just cloned:
 ```
 git archive --format=tar -o tar-split.tar HEAD .
 ```
 Then build the example main.go:
 ```
 go build ./main.go
 ```
 Now run the example over the archive:
 ```
 $ ./main tar-split.tar
 2015/02/20 15:00:58 writing "tar-split.tar" to "tar-split.tar.out"
 pax_global_header pre: 512 read: 52
 .travis.yml pre: 972 read: 374
 DESIGN.md pre: 650 read: 1131
 LICENSE pre: 917 read: 1075
 README.md pre: 973 read: 4289
 archive/ pre: 831 read: 0
 archive/tar/ pre: 512 read: 0
 archive/tar/common.go pre: 512 read: 7790
 [...]
 tar/storage/entry_test.go pre: 667 read: 1137
 tar/storage/getter.go pre: 911 read: 2741
 tar/storage/getter_test.go pre: 843 read: 1491
 tar/storage/packer.go pre: 557 read: 3141
 tar/storage/packer_test.go pre: 955 read: 3096
 EOF padding: 1512
 Remainder: 512
 Size: 215040; Sum: 215040
 ```
 *What are we seeing here?* 
 * `pre` is the header of a file entry, and potentially the padding from the
  end of the prior file's payload. Also with particular tar extensions and pax
  attributes, the header can exceed 512 bytes.
 * `read` is the size of the file payload from the entry
 * `EOF padding` is the expected 1024 null bytes on the end of a tar archive,
  plus potential padding from the end of the prior file entry's payload
 * `Remainder` is the remaining bytes of an archive. This is typically dead space,
  as most tar implementations will return after having reached the end of the
  1024 null bytes. Some implementations do write bytes here, though, and these
  affect the checksum of the resulting tar archive, so they must be accounted
  for as well.
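The numbers in the listing can be checked with a little block arithmetic. The sketch below assumes the standard 512-byte tar block and a plain 512-byte header for the following entry; the `padding` helper is made up for this illustration.

```go
package main

import "fmt"

// blockSize is the fixed tar block size in bytes.
const blockSize = 512

// padding returns how many null bytes follow a payload of the given size so
// that the next header starts on a block boundary.
func padding(size int64) int64 {
	return (blockSize - size%blockSize) % blockSize
}

func main() {
	// Payload sizes ("read") copied from the listing above.
	entries := []struct {
		name string
		read int64
	}{
		{"pax_global_header", 52},
		{".travis.yml", 374},
		{"DESIGN.md", 1131},
	}
	for _, e := range entries {
		pad := padding(e.read)
		// The next entry's "pre" is this padding plus that entry's header.
		fmt.Printf("%-18s read: %4d  padding: %3d  next pre: %d\n",
			e.name, e.read, pad, blockSize+pad)
	}
	// Padding after the last payload plus the 1024-byte end-of-archive marker.
	fmt.Println("EOF padding:", padding(3096)+2*blockSize)
}
```

For instance, the 52-byte `pax_global_header` payload is followed by 460 bytes of padding, which together with a 512-byte header accounts for the `pre: 972` reported for `.travis.yml`. Likewise, 488 bytes of padding after `packer_test.go` plus the 1024 null bytes give the `EOF padding: 1512` line.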
 Ideally the input tar and output `*.out`, will match:
 ```
 $ sha1sum tar-split.tar*
 ca9e19966b892d9ad5960414abac01ef585a1e22  tar-split.tar
 ca9e19966b892d9ad5960414abac01ef585a1e22  tar-split.tar.out
 ```