Changeset: c7125e2745c9 for MonetDB
URL: https://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=c7125e2745c9
Modified Files:
        monetdb5/modules/mal/mkey.c
Branch: Oct2020
Log Message:

Don't lose documentation


diffs (111 lines):

diff --git a/monetdb5/modules/mal/mkey.c b/monetdb5/modules/mal/mkey.c
--- a/monetdb5/modules/mal/mkey.c
+++ b/monetdb5/modules/mal/mkey.c
@@ -7,6 +7,99 @@
  */
 
 /*
+- * @- The Problem
+- * When creating a join, we want to make a unique key of the attributes on 
both
+- * sides and then join these keys. Consider the following BATs.
+- *
+- * @verbatim
+- * orders                  customer                link
+- * ====================    =====================   ===========
+- *         zipcode h_nr            zipcode hnr    oid     cid
+- * o1      13      9       c1      11      10      o1      c5
+- * o2      11      10      c2      11      11      o2      c1
+- * o3      11      11      c3      12      2       o3      c2
+- * o4      12      5       c4      12      1       o4      nil
+- * o5      11      10      c5      13      9       o5      c1
+- * o6      12      2       c6      14      7       o6      c3
+- * o7      13      9                               o7      c5
+- * o8      12      1                               o8      c4
+- * o9      13      9                               o9      c5
+- * @end verbatim
+- *
+- * The current approach is designed to take minimal memory, as our previous
+- * solutions to the problem did not scale well. In case of singular keys,
+- * the link is executed by a simple join. Before going into the join, we
+- * make sure the end result size is not too large, which is done by looking
+- * at relation sizes (if the other key is unique) or, if that is not possible,
+- * by computing the exact join size.
+- *
+- * The join algorithm was also improved to do dynamic sampling to determine
+- * with high accuracy the join size, so that we can alloc in one go a memory
+- * region of sufficient size. This also reduces the ds\_link memory 
requirements.
+- *
+- * For compound keys, those that consist of multiple attributes, we now 
compute
+- * a derived column that contains an integer hash value derived from all
+- * key columns.
+- * This is done by computing a hash value for each individual key column
+- * and combining those by bitwise XOR and left-rotation. That is, for each
+- * column,we rotate the working hash value by N bits and XOR the hash value
+- * of the column over it. The working hash value is initialized with zero,
+- * and after all columns are processed, this working value is used as output.
+- * Computing the hash value for all columns in the key for one table is done
+- * by the command hash(). Hence, we do hash on both sides, and join
+- * that together with a simple join:
+- *
+- * @code{join(hash(keys), hash(keys.reverse);}
+- *
+- * One complication of this procedure are nil values:
+- * @table
+- * @itemize
+- * @item
+- * it may happen that the final hash-value (an int formed by a
+- * random bit pattern) accidentally has the value of int(nil).
+- * Notice that join never matches nil values.
+- * Hence these accidental nils must be replaced by a begin value (currently: 
0).
+- * @item
+- * in case any of the compound key values is nil, our nil semantics
+- * require us that those tuples may never match on a join. Consequently,
+- * during the hash() processing of all compound key columns for computing
+- * the hash value, we also maintain a bit-bat that records which tuples had
+- * a nil value. The bit-bat is initialized to false, and the results of the
+- * nil-check on each column is OR-ed to it.
+- * Afterwards, the hash-value of all tuples that have this nil-bit set to
+- * TRUE are forced to int(nil), which will exclude them from matching.
+- * @end itemize
+- *
+- * Joining on hash values produces a @emph{superset} of the join result:
+- * it may happen that  two different key combinations hash on the same value,
+- * which will make them match on the join (false hits). The final part
+- * of the ds\_link therefore consists of filtering out the false hits.
+- * This is done incrementally by joining back the join result to the original
+- * columns, incrementally one by one for each pair of corresponding
+- * columns. These values are compared with each other and we AND the
+- * result of this comparison together for each pair of columns.
+- * The bat containing these bits is initialized to all TRUE and serves as
+- * final result after all column pairs have been compared.
+- * The initial join result is finally filtered with this bit-bat.
+- *
+- * Joining back from the initial join-result to the original columns on
+- * both sides takes quite a lot of memory. For this reason, the false
+- * hit-filtering is done in slices (not all tuples at one time).
+- * In this way the memory requirements of this phase are kept low.
+- * In fact, the most memory demanding part of the join is the int-join
+- * on hash number, which takes N*24 bytes (where N= |L| = |R|).
+- * In comparison, the previous CTmultigroup/CTmultiderive approach
+- * took N*48 bytes. Additionally, by making it possible to use merge-sort,
+- * it avoids severe performance degradation (memory thrashing) as produced
+- * by the old ds\_link when the inner join relation would be larger than 
memory.
+- *
+- * If ds\_link performance is still an issue, the sort-merge join used here
+- * could be replaced by partitioned hash-join with radix-cluster/decluster.
+- *
+- * @+ Implementation
+- */
+
+/*
  * (c) Peter Boncz, Stefan Manegold, Niels Nes
  *
  * new functionality for the low-resource-consumption. It will
@@ -14,6 +107,7 @@
  * This hash value is computed by xoring and rotating individual hash
  * values together. We create a hash and rotate command to do this.
  */
+
 #include "monetdb_config.h"
 #include "mal.h"
 #include "mal_interpreter.h"
_______________________________________________
checkin-list mailing list
checkin-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/checkin-list

Reply via email to