From e756d673db1ad850143872023a3a5e5277941794 Mon Sep 17 00:00:00 2001
From: James Bowes
Date: Wed, 22 Aug 2012 11:50:53 -0300
Subject: [PATCH] update algorithm

---
 ALGORITHM.md | 34 ++++++++++++++++++++++++++++++----
 README.md    |  4 ++--
 2 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/ALGORITHM.md b/ALGORITHM.md
index 7219a9a..cafe109 100644
--- a/ALGORITHM.md
+++ b/ALGORITHM.md
@@ -77,10 +77,10 @@ the original:
 +--+beta | | | | | | +-------+ |-------| | | | |jboss +--+ |
-+-------+-------------------+rhel | |
-| | +-------+ |
-|-------| |
-|rhel +--+-----------+ |
++-------+ +--+rhel | |
+| | | +-------+ |
+|-------| | |
+|rhel +--+-----------+-+ |
 +-------+ | | | |-----------| | +----------+$releasever| |
@@ -112,3 +112,29 @@ the original:
 +---+
 ```
 
+With this structure, we can begin creating the packed data. The first step is
+to build a Huffman coding for the string components of the path. In the
+example, all strings are used only once, except for rhel and source. We create
+a list of all the strings, ordered from least to most frequent. This list is
+the one written out in the path dictionary section of the binary packing. The
+ordering used in the list is what we then feed into a Huffman tree for paths.
+Thus, even though both os and debug occur just once in the above DAG, depending
+on the ordering in the list, one will be assigned a higher weight than the
+other (which helps the other side decode the binary format). The string Huffman
+tree also includes a special sentinel value to indicate end of node in the
+binary format. This value should not be written to the binary packing, should
+be given the highest weight, and should be some string that is not used in the
+path strings themselves (to avoid collision).
+
+Next, we order the nodes from the above DAG by order of reference from other
+nodes, from least to most references. This ensures the root of the DAG is
+always first, as it has no references. As with the strings, we create a
+Huffman tree for the nodes.
+
+We can then iterate over the node list, writing each one out to the binary
+packing. Within each node, for each string and node pair that it references, we
+look up each in its respective Huffman tree, and write the Huffman code to
+the binary packing.
+
+At the end of a node, we use the special sentinel value from the string Huffman
+tree to indicate end of node.
diff --git a/README.md b/README.md
index 2a4b30e..e8dac5b 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,8 @@ Overview
 A POC to take a list of content sets (basically a listing of directories) and
 pack them into a format optimized for space efficieny and reading.
 
-For details on the file format, please see `FORMAT.md`, and the included
-source.
+For details on the file format, please see `FORMAT.md`, `ALGORITHM.md`, and the
+included source.
 
 Compilation and Usage
 =====================
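
The packing steps this patch adds to `ALGORITHM.md` can be sketched roughly as follows. This is a minimal illustration, not the project's implementation. It makes several assumptions: each symbol's Huffman weight is its 1-based position in the ordered list, ties in the ordering are broken by name, the DAG is a plain dict mapping a node id to a list of `(string, child id)` pairs, and the output is a string of `0`/`1` characters rather than packed bytes. The names `END_OF_NODE`, `huffman_codes`, and `pack` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

# Hypothetical sentinel; assumed never to appear as a path string.
END_OF_NODE = "\x00END"


def huffman_codes(ordered):
    """Return {symbol: bit string}. Assumption: weight = 1-based list position."""
    if not ordered:
        return {}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), sym) for w, sym in enumerate(ordered, start=1)]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Internal tree nodes are tuples, so symbols themselves must not be tuples.
        heapq.heappush(heap, (w1 + w2, next(tiebreak), (left, right)))

    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes


def pack(dag):
    """Return (string list, node list, bit string) for a dict-based DAG."""
    # Strings ordered from least to most frequent; the sentinel goes last so
    # that it receives the highest weight, and it is left out of the returned list.
    string_counts = Counter(s for edges in dag.values() for s, _ in edges)
    strings = sorted(string_counts, key=lambda s: (string_counts[s], s))
    string_codes = huffman_codes(strings + [END_OF_NODE])

    # Nodes ordered from least to most referenced; the root has no references.
    ref_counts = Counter(child for edges in dag.values() for _, child in edges)
    nodes = sorted(dag, key=lambda n: (ref_counts[n], n))
    node_codes = huffman_codes(nodes)

    bits = []
    for node in nodes:
        for s, child in dag[node]:
            bits.append(string_codes[s] + node_codes[child])
        bits.append(string_codes[END_OF_NODE])  # marks end of node
    return strings, nodes, "".join(bits)
```

For example, `pack({0: [("content", 1)], 1: [("rhel", 2), ("jboss", 2)], 2: []})` puts node `0` first in the node list, since nothing references it. A reader holding the same two ordered lists can rebuild both trees and decode the bit string.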