This allows reading the metadata contained in tar-split
without expensively recreating the whole tar stream,
including the full file contents.
We have two use cases for this:
- In a situation where tar-split is distributed along with
a separate metadata stream, ensuring that the two are
exactly consistent
- Reading the tar headers allows a ~cheap consistency check
  of on-disk layers, just verifying that the files exist with
  the expected sizes, without reading the full contents (see
  the sketch after this list).
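
For the second use case, a minimal sketch of such a check, assuming the
tar-split data is read with storage.NewJSONUnpacker (layerDir and the
error handling are illustrative, not the proposed API):

    import (
        "fmt"
        "io"
        "os"
        "path/filepath"

        "github.com/vbatts/tar-split/tar/storage"
    )

    // checkLayerSizes walks the tar-split entries and verifies that
    // each file exists on disk with the expected size, without reading
    // any file contents.
    func checkLayerSizes(tarSplit io.Reader, layerDir string) error {
        unpacker := storage.NewJSONUnpacker(tarSplit)
        for {
            entry, err := unpacker.Next()
            if err == io.EOF {
                return nil
            }
            if err != nil {
                return err
            }
            // SegmentType entries carry raw stream bytes; only
            // FileType entries describe files.
            if entry.Type != storage.FileType || entry.Size == 0 {
                continue
            }
            fi, err := os.Stat(filepath.Join(layerDir, entry.GetName()))
            if err != nil {
                return err
            }
            if fi.Size() != entry.Size {
                return fmt.Errorf("%s: size %d, expected %d",
                    entry.GetName(), fi.Size(), entry.Size)
            }
        }
    }
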
This can be implemented outside of this repo, but it's
not ideal:
- The function necessarily hard-codes some assumptions
about how tar-split determines the boundaries of
SegmentType/FileType entries (or, indeed, whether it
uses FileType entries at all). That's best maintained
directly beside the code that creates this.
- The ExpectedPadding() value is not currently exported,
so the consumer would have to heuristically guess where
the padding ends.
Signed-off-by: Miloslav Trmač <mitr@redhat.com>
Fixes #65
If the number of bytes read is 0, then don't even create the entry for
that padding.
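
A rough sketch of the guard (r, p, and paddingBuf are illustrative
names, not the actual variables):

    n, err := io.ReadFull(r, paddingBuf)
    if n > 0 {
        // Only record a padding entry when bytes were actually read;
        // otherwise a zero-length SegmentType entry would be created.
        if _, err := p.AddEntry(storage.Entry{
            Type:    storage.SegmentType,
            Payload: paddingBuf[:n],
        }); err != nil {
            return err
        }
    }
    if err != nil && err != io.ErrUnexpectedEOF {
        return err
    }
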
This sounds like the solution for the issue opened, but I haven't found
a reproducer for this issue yet. :-\
Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>
This function is used widely, and it's JSON. It was not written in
such a way as to have an exchangeable codec, per se.
So, maybe I'll just kick out the idea of using https://github.com/ugorji/go
Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>
The pointer to the pool may be useful, but I'm holding off on that
until I get benchmarks of memory use to show the benefit.
Signed-off-by: Vincent Batts <vbatts@hashbangbash.com>
io.Copy usually allocates a 32kB buffer, and due to the large
number of files processed by tar-split, this shows up in Go profiles
as a very large alloc_space total.
It doesn't seem to actually be a measurable problem in any way,
but we can allocate the buffer only once per tar-split creation,
at no additional cost to existing allocations, so let's do so,
and remove the distraction.
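
A minimal sketch of the idea, using io.CopyBuffer with a hypothetical
per-creation struct (the real type and field names may differ):

    import "io"

    // tarSplitter stands in for the per-creation state; copyBuffer is
    // allocated once and reused for every file, instead of letting
    // io.Copy allocate a fresh 32kB buffer per call.
    type tarSplitter struct {
        copyBuffer []byte
    }

    func (s *tarSplitter) copy(dst io.Writer, src io.Reader) (int64, error) {
        if s.copyBuffer == nil {
            s.copyBuffer = make([]byte, 32*1024)
        }
        return io.CopyBuffer(dst, src, s.copyBuffer)
    }
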
Signed-off-by: Miloslav Trmač <mitr@redhat.com>
To ensure we don't have regressions in our padding fix, add a test case
that attempts to crash the test by creating 20GB of random junk padding.
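
The test could look roughly like this (a sketch against the
disassembly API; the junk is streamed through a pipe so it never has
to fit in memory at once):

    import (
        "archive/tar"
        "crypto/rand"
        "io"
        "io/ioutil"
        "testing"

        "github.com/vbatts/tar-split/tar/asm"
        "github.com/vbatts/tar-split/tar/storage"
    )

    func TestLargeJunkPadding(t *testing.T) {
        pR, pW := io.Pipe()

        go func() {
            // An empty archive, followed by ~20GB of random junk
            // "padding" written in 1MiB chunks.
            tw := tar.NewWriter(pW)
            if err := tw.Close(); err != nil {
                pW.CloseWithError(err)
                return
            }
            for i := 0; i < 20*1024; i++ {
                if _, err := io.CopyN(pW, rand.Reader, 1024*1024); err != nil {
                    pW.CloseWithError(err)
                    return
                }
            }
            pW.Close()
        }()

        // Disassembly must not buffer the padding wholesale.
        rdr, err := asm.NewInputTarStream(pR, storage.NewJSONPacker(ioutil.Discard), nil)
        if err != nil {
            t.Fatal(err)
        }
        if _, err := io.Copy(ioutil.Discard, rdr); err != nil {
            t.Fatal(err)
        }
    }
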
Signed-off-by: Aleksa Sarai <asarai@suse.de>
Previously, we would read the entire padding in a given archive into
memory in order to store it in the packer. This would cause memory
exhaustion if a malicious archive was crafted with very large amounts of
padding. Since a given SegmentType is reconstructed losslessly, we can
simply chunk up any padding into large segments to avoid this problem.
Use a reasonable default of 1MiB to avoid changing the tar-split.json of
existing archives that are not malformed.
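
A minimal sketch of the chunking (variable names are illustrative;
padding is the number of padding bytes left in the raw stream):

    const paddingChunkSize = 1024 * 1024 // 1MiB cap per entry

    for remaining := padding; remaining > 0; {
        chunk := remaining
        if chunk > paddingChunkSize {
            chunk = paddingChunkSize
        }
        buf := make([]byte, chunk)
        n, err := io.ReadFull(r, buf)
        if n > 0 {
            // Each chunk becomes its own SegmentType entry;
            // concatenating them reproduces the padding losslessly.
            if _, err := p.AddEntry(storage.Entry{
                Type:    storage.SegmentType,
                Payload: buf[:n],
            }); err != nil {
                return err
            }
        }
        if err != nil {
            return err
        }
        remaining -= int64(n)
    }
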
Fixes: CVE-2017-14992
Signed-off-by: Aleksa Sarai <asarai@suse.de>
- A new writeTo method avoids creating an extra pipe.
- Copy with a pooled buffer instead of allocating a new buffer for
  each file (see the sketch after this list).
- Avoid extra object allocations inside the loop.
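
The pooled copy could look something like this (a sketch; the pool and
helper names are illustrative):

    import (
        "io"
        "sync"
    )

    // bufPool hands out reusable 32kB copy buffers.
    var bufPool = sync.Pool{
        New: func() interface{} {
            buf := make([]byte, 32*1024)
            return &buf
        },
    }

    // copyWithBuffer copies via a pooled buffer instead of allocating
    // a fresh one for every file.
    func copyWithBuffer(dst io.Writer, src io.Reader) (int64, error) {
        buf := bufPool.Get().(*[]byte)
        defer bufPool.Put(buf)
        return io.CopyBuffer(dst, src, *buf)
    }
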
Signed-off-by: Tonis Tiigi <tonistiigi@gmail.com>
When the entry name is not UTF-8 (for example, ISO-8859-1), store the
raw bytes instead.
To accommodate this, the entry's name now has getters and setters.
Since this most heavily affects the JSON marshalling, we'll
double-check the sanity of the name before storing it in the
JSONPacker.
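
A sketch of what the accessors might look like, keyed on UTF-8
validity (Entry is abbreviated to the name-related fields):

    import "unicode/utf8"

    type Entry struct {
        Name    string `json:"name,omitempty"`
        NameRaw []byte `json:"name_raw,omitempty"`
        // ... remaining fields elided ...
    }

    // SetName falls back to raw bytes when the name is not valid
    // UTF-8, so the JSON marshalling stays sane.
    func (e *Entry) SetName(name string) {
        if utf8.ValidString(name) {
            e.Name = name
        } else {
            e.NameRaw = []byte(name)
        }
    }

    // GetName prefers the raw bytes when they are set.
    func (e *Entry) GetName() string {
        if len(e.NameRaw) > 0 {
            return string(e.NameRaw)
        }
        return e.Name
    }
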
It uses slightly less memory and is more understandable.
Benchmark results:
benchmark           old ns/op     new ns/op     delta
BenchmarkPutter-4   57272         52375         -8.55%

benchmark           old allocs    new allocs    delta
BenchmarkPutter-4   21            19            -9.52%

benchmark           old bytes     new bytes     delta
BenchmarkPutter-4   19416         13336         -31.31%
Signed-off-by: Alexander Morozov <lk4d4@docker.com>