path-packer-c/ALGORITHM.md

PACKING ALGORITHM
=================

Begin with a list of content set paths:
```
/content/dist/rhel/$releasever/$basearch/os
/content/dist/rhel/$releasever/$basearch/debug
/content/dist/rhel/$releasever/$basearch/source/SRPMS
/content/dist/jboss/source
/content/beta/rhel/$releasever/$basearch/os
/content/beta/rhel/$releasever/$basearch/debug
/content/beta/rhel/$releasever/$basearch/source/SRPMS
```

Create a prefix tree out of the paths, where each node in the tree contains the
names of its children, rather than the children containing its own name:

```
+-------+                              +------+
|       |                              |      |
|-------+                              |------|
|content+--+-------+                   |source+-+---+
+-------+  |       |                   +------+ |   |
           |-------|                   |        |---|
           |dist   +--------+-------+  |        +---+
        +--+beta   |        |       |  |
        |  +-------+        |-------|  |
        |                   |jboss  +--+
+-------+                   |rhel   +--+
|       |                   +-------+  |
|-------|                              |
|rhel   +--+-----------+               +-----------+
+-------+  |           |               |           |
           |-----------|               |-----------|
+----------+$releasever|               |$releasever+--+
|          +-----------+               +-----------+  |
|                                                     |
+---------+                                 +---------+
|         |                                 |         |
|---------|                                 |---------|
|$basearch+--+                           +--+$basearch|    +---+
+---------+  |                           |  +---------+    |   |
             |                           |                 |---|
             +--------+                  +-------+         +---+
             |        |                  |       |         |
             |--------|                  |-------|         |
+---+--------+os      |                  |os     +---------+
|   |     +--+debug   |                  |debug  +-----------+---+
|---|     |  |source  +--+               |source +--+        |   |
+---+     |  +--------+  |               +-------+  |        |---|
          |              |                          |        +---+
      +---+              +------+                   |
      |   |              |      |                   +-----+
      |---|              |------|                   |     |
      +---+           +--+SRPMS |                   |-----|
                      |  +------+                   |SRPMS+--+---+
                      |                             +-----+  |   |
                  +---+                                      |---|
                  |   |                                      +---+
                  |---|
                  +---+
```

This eliminates the duplication in the prefixes of our paths. We now only have
one instance of the word "content", for example. We now find any duplicate
subtrees, remove the duplicate, and point all references from the duplicate to
the original:

```
+-------+                              +------+
|       |                              |      |
|-------+                              |------|
|content+--+-------+                   |source+-+
+-------+  |       |                   +------+ |
           |-------|                   |        |
           |dist   +--------+-------+  |        |
        +--+beta   |        |       |  |        |
        |  +-------+        |-------|  |        |
        |                   |jboss  +--+        |
+-------+                +--+rhel   |           |
|       |                |  +-------+           |
|-------|                |                      |
|rhel   +--+-----------+-+                      |
+-------+  |           |                        |
           |-----------|                        |
+----------+$releasever|                        |
|          +-----------+                        |
|                                               |
+---------+                                     |
|         |                                     |
|---------|                                     |
|$basearch+--+                                  |
+---------+  |                                  |
             |                                  |
             +--------+                         |
             |        |                         |
             |--------|                         |
     +-------+os      |                         |
     |    +--+debug   |                         |
     |    |  |source  +--+                      |
     |    |  +--------+  |                      |
     |    |              |                      |
     |    |              +------+               |
     |    +-------+      |      |               |
     |            |      |------|               |
     |            |   +--+SRPMS |               |
     |            |   |  +------+               |
     |            |   |                         |
     +------------+---+-------------------------+
                  |   |
                  |---|
                  +---+
```

With this structure, we can begin creating the packed data. The first step is
to build a huffman coding for the string components of the path. In the
example, all strings are used only once, except for rhel and source. We create
a list of all the strings, ordered from least to most occurance. This list is
the one written out in the path dictionary section of the binary packing. The
ordering used in the list is what we then feed into a huffman tree for paths.
thus, even though both os and debug occur just once in the above DAG, depending
on the ordering in the list, one will be assigned a higher weight than the
other (which helps the other side decode the binary format). The string huffman
tree also includes a special sentinal value to indicate end of node in the
binary format. This value should not be written to the binary packing, should
be given the highest weight, and should be some string that is not used in the
path strings themselves (to avoid collision).

Next, we order the nodes from the above DAG by order of reference from other
nodes, from least to most references. This ensures the root of the DAG is
always first, as it has no references. Similar to the strings, we create a
huffman tree for the nodes.

We can then iterate over the node list, writing each one out to the binary
packing. Within each node, for each string and node pair that it references, we
look up each in their respective huffman trees, and write the huffman coding to
the binary packing.

At the end of a node, we use the special sentinal value from the string huffman
tree to indicate end of node.