[ANN] data.avl 0.0.10 -- fast sorted maps and sets with log-time rank queries

Michał Marczyk Sat, 04 Jan 2014 21:40:25 -0800

Hi,

I am pleased to announce the initial Clojure Contrib release of
data.avl (previously known as avl.clj), a library implementing drop-in
replacements for Clojure(Script)'s sorted maps and sets with faster
lookups, support for the transient API (leading to unambiguously
improved performance for many usage patterns) and log-time rank
queries (via clojure.core/nth and clojure.data.avl/rank-of).


There are two relevant trade-offs: higher memory usage (a pointer and
two ints per key) and somewhat slower single "updates". See below for
an extended discussion and some benchmark results.

For users of avl.clj 0.0.9: this release fixes a bug in rank-of and a
single reflective call I missed previously. Upgrading requires no
changes other than to the namespace name, which is now
clojure.data.avl.

The Maven artefact is now available from Maven Central; dependency
information is as follows (Leiningen / Maven / Gradle):

  [org.clojure/data.avl "0.0.10"]

  <dependency>
    <groupId>org.clojure</groupId>
    <artifactId>data.avl</artifactId>
    <version>0.0.10</version>
  </dependency>

  compile "org.clojure:data.avl:0.0.10"

The repository for this project is located here:

  https://github.com/clojure/data.avl

data.avl uses persistent AVL trees as the underlying data structure.
AVL trees tend to be significantly shallower than red-black trees;
since the built-in sorted collections use red-black trees, data.avl
maps and sets offer noticeably improved expected lookup times.

The trade-off here is that "updates" (assoc / dissoc / conj / disj)
must perform more work in rebalancing the trees, so they are typically
slower than with the built-ins. However, data.avl maps and sets
support the transient API, which can be used to speed up long chains
of "updates". Thanks to this, data.avl types tend to outperform the
built-ins for batch "updates". One particularly important case of this
type is that of the initial construction of large instances; this
tends to be noticeably faster with data.avl types (on a par with the
built-ins worst-case in my benchmarking so far).

(Of course the time to complete all operations varies with tree
structure. When comparing lookup times for nodes occupying the same
positions in both trees, the built-in red-black trees have a very
slight edge; see for example (get avl 0) and (get rb 0) in the
benchmark results below. "Updates" involving identically placed nodes
may be faster in either tree type depending on the overall shape of
the tree; see for example (dissoc avl 131071) and (dissoc rb 131071)
below, where the AVL map completes the operation twice as fast. On
average, however, performance is as described above.)

Besides supporting transients, data.avl types offer two additions to
the sorted collection API:

  (require '[clojure.data.avl :as avl])

  ;; nth can be used to look up the nth smallest item in a sorted
  ;; collection, as determined by the comparator being used:

  (nth (avl/sorted-set 0 2 4) 2)
  ;= 4

  (nth (avl/sorted-set-by > 0 2 4) 2)
  ;= 0

  ;; for maps, nth returns map entry objects:

  (nth (avl/sorted-map 0 0 2 2 4 4) 2)
  ;= [4 4]

  ;; clojure.data.avl/rank-of can be used to query for the index of a
  ;; given key in a data.avl collection:

  (avl/rank-of (avl/sorted-set 3 4 5) 4)
  ;= 1

  ;; if the key is not present in the collection, -1 is returned:
  (avl/rank-of (avl/sorted-set 3 4 5) 0)
  ;= -1

As mentioned above, the new functionality comes at a cost in memory
consumption. To support transients, each node in a data.avl tree must
carry an extra reference field -- that's one pointer per key. To
support efficient rank queries, each pointer needs to carry an extra
int field. Finally, another int field is used by the rebalancing
algorithm itself.

The library carries its own test suite used in both Clojure and
ClojureScript. Additionally, the Clojure version is stress-tested with
Zach Tellman's collection-check library.

I attach some Criterium benchmark results below.

Cheers,
Michał


Benchmark environment:

  OpenJDK 1.7, Clojure 1.5.1, Criterium 0.4.2.

Relevant defs:

  (def ks   (range 300000))
  (def ksks (interleave ks ks))

  (def rb  (apply sorted-map ksks))
  (def avl (apply avl/sorted-map ksks))

  (defn rb-rank-of [rb-map k]
    (if (contains? rb k)
      (count (take-while #(not= k (key %)) (seq rb)))
      -1))

Benchmark results:

(lookup-benchmarks)
===================
(c/bench (get avl 0))
WARNING: Final GC required 17.30960512486499 % of runtime
WARNING: Final GC required 1.487642242862071 % of runtime
Evaluation count : 184965780 in 60 samples of 3082763 calls.
             Execution time mean : 320.595187 ns
    Execution time std-deviation : 2.710112 ns
   Execution time lower quantile : 316.756609 ns ( 2.5%)
   Execution time upper quantile : 325.833642 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (get rb 0))
WARNING: Final GC required 1.481744139127142 % of runtime
Evaluation count : 192022740 in 60 samples of 3200379 calls.
             Execution time mean : 308.423733 ns
    Execution time std-deviation : 3.135581 ns
   Execution time lower quantile : 304.385589 ns ( 2.5%)
   Execution time upper quantile : 315.500500 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (get avl 131071))
WARNING: Final GC required 1.594644811795316 % of runtime
Evaluation count : 1833957600 in 60 samples of 30565960 calls.
             Execution time mean : 30.406098 ns
    Execution time std-deviation : 0.317429 ns
   Execution time lower quantile : 29.896215 ns ( 2.5%)
   Execution time upper quantile : 31.120450 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 3 outliers in 60 samples (5.0000 %)
low-severe 3 (5.0000 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (get rb 131071))
WARNING: Final GC required 1.623586044372867 % of runtime
Evaluation count : 1976983260 in 60 samples of 32949721 calls.
             Execution time mean : 27.838439 ns
    Execution time std-deviation : 0.287299 ns
   Execution time lower quantile : 27.346813 ns ( 2.5%)
   Execution time upper quantile : 28.375493 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (get avl 299999))
WARNING: Final GC required 1.433452208207925 % of runtime
Evaluation count : 156641220 in 60 samples of 2610687 calls.
             Execution time mean : 386.680734 ns
    Execution time std-deviation : 3.648226 ns
   Execution time lower quantile : 381.171378 ns ( 2.5%)
   Execution time upper quantile : 394.259276 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (get rb 299999))
WARNING: Final GC required 1.458498405132945 % of runtime
Evaluation count : 87642960 in 60 samples of 1460716 calls.
             Execution time mean : 683.691293 ns
    Execution time std-deviation : 6.293332 ns
   Execution time lower quantile : 676.164098 ns ( 2.5%)
   Execution time upper quantile : 697.942500 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(construction-benchmarks)
=========================
(c/bench (apply avl/sorted-map ksks))
WARNING: Final GC required 6.475014797576352 % of runtime
Evaluation count : 300 in 60 samples of 5 calls.
             Execution time mean : 196.151940 ms
    Execution time std-deviation : 2.434096 ms
   Execution time lower quantile : 193.202175 ms ( 2.5%)
   Execution time upper quantile : 202.256218 ms (97.5%)
                   Overhead used : 2.128084 ns

Found 3 outliers in 60 samples (5.0000 %)
low-severe 3 (5.0000 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (apply sorted-map ksks))
WARNING: Final GC required 4.8414421771766305 % of runtime
Evaluation count : 240 in 60 samples of 4 calls.
             Execution time mean : 314.285107 ms
    Execution time std-deviation : 2.687500 ms
   Execution time lower quantile : 310.117205 ms ( 2.5%)
   Execution time upper quantile : 319.825282 ms (97.5%)
                   Overhead used : 2.128084 ns
(assoc-benchmarks)
==================
(c/bench (assoc avl -1 -1))
WARNING: Final GC required 1.443316131643841 % of runtime
Evaluation count : 85057080 in 60 samples of 1417618 calls.
             Execution time mean : 704.100540 ns
    Execution time std-deviation : 5.756555 ns
   Execution time lower quantile : 696.151863 ns ( 2.5%)
   Execution time upper quantile : 716.721543 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (assoc rb -1 -1))
WARNING: Final GC required 1.473615175697951 % of runtime
Evaluation count : 94522860 in 60 samples of 1575381 calls.
             Execution time mean : 638.321250 ns
    Execution time std-deviation : 5.147774 ns
   Execution time lower quantile : 631.959478 ns ( 2.5%)
   Execution time upper quantile : 648.923646 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (assoc avl 131071.5 131071.5))
WARNING: Final GC required 1.475403685555393 % of runtime
Evaluation count : 87600660 in 60 samples of 1460011 calls.
             Execution time mean : 694.815846 ns
    Execution time std-deviation : 6.267862 ns
   Execution time lower quantile : 685.491226 ns ( 2.5%)
   Execution time upper quantile : 707.824187 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (assoc rb 131071.5 131071.5))
WARNING: Final GC required 1.500960406817446 % of runtime
Evaluation count : 96543600 in 60 samples of 1609060 calls.
             Execution time mean : 617.554662 ns
    Execution time std-deviation : 6.103529 ns
   Execution time lower quantile : 609.156583 ns ( 2.5%)
   Execution time upper quantile : 631.065400 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (assoc avl 300000 300000))
WARNING: Final GC required 1.515899576717984 % of runtime
Evaluation count : 51857640 in 60 samples of 864294 calls.
             Execution time mean : 1.142891 µs
    Execution time std-deviation : 10.147119 ns
   Execution time lower quantile : 1.129993 µs ( 2.5%)
   Execution time upper quantile : 1.165111 µs (97.5%)
                   Overhead used : 2.128084 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (assoc rb 300000 300000))
WARNING: Final GC required 1.508336024198179 % of runtime
Evaluation count : 65808300 in 60 samples of 1096805 calls.
             Execution time mean : 909.241190 ns
    Execution time std-deviation : 8.308224 ns
   Execution time lower quantile : 899.482479 ns ( 2.5%)
   Execution time upper quantile : 927.906490 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(dissoc-benchmarks)
===================
(c/bench (dissoc avl 0))
WARNING: Final GC required 1.423418517945197 % of runtime
Evaluation count : 93948600 in 60 samples of 1565810 calls.
             Execution time mean : 646.364053 ns
    Execution time std-deviation : 4.752838 ns
   Execution time lower quantile : 638.311265 ns ( 2.5%)
   Execution time upper quantile : 655.962057 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (dissoc rb 0))
WARNING: Final GC required 1.584824677744148 % of runtime
Evaluation count : 94848360 in 60 samples of 1580806 calls.
             Execution time mean : 631.947940 ns
    Execution time std-deviation : 3.817958 ns
   Execution time lower quantile : 625.144366 ns ( 2.5%)
   Execution time upper quantile : 638.920567 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (dissoc avl 131071))
WARNING: Final GC required 1.605081849882345 % of runtime
Evaluation count : 220124040 in 60 samples of 3668734 calls.
             Execution time mean : 266.365232 ns
    Execution time std-deviation : 2.013280 ns
   Execution time lower quantile : 263.177469 ns ( 2.5%)
   Execution time upper quantile : 270.351565 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (dissoc rb 131071))
WARNING: Final GC required 1.571888258342256 % of runtime
Evaluation count : 112493340 in 60 samples of 1874889 calls.
             Execution time mean : 533.037294 ns
    Execution time std-deviation : 3.334672 ns
   Execution time lower quantile : 527.605873 ns ( 2.5%)
   Execution time upper quantile : 539.581328 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (dissoc avl 299999))
WARNING: Final GC required 1.483980334798602 % of runtime
Evaluation count : 53342520 in 60 samples of 889042 calls.
             Execution time mean : 1.132967 µs
    Execution time std-deviation : 7.353171 ns
   Execution time lower quantile : 1.120979 µs ( 2.5%)
   Execution time upper quantile : 1.148367 µs (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (dissoc rb 299999))
WARNING: Final GC required 1.508118289184663 % of runtime
Evaluation count : 69107160 in 60 samples of 1151786 calls.
             Execution time mean : 881.091241 ns
    Execution time std-deviation : 6.621161 ns
   Execution time lower quantile : 868.480766 ns ( 2.5%)
   Execution time upper quantile : 892.546645 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(reduce-kv-benchmarks)
======================
(c/bench (reduce-kv + 0 avl))
WARNING: Final GC required 1.561011180350695 % of runtime
Evaluation count : 1620 in 60 samples of 27 calls.
             Execution time mean : 37.223528 ms
    Execution time std-deviation : 284.200552 µs
   Execution time lower quantile : 36.733677 ms ( 2.5%)
   Execution time upper quantile : 37.700421 ms (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (reduce-kv + 0 rb))
WARNING: Final GC required 1.498782422077256 % of runtime
Evaluation count : 1500 in 60 samples of 25 calls.
             Execution time mean : 41.694365 ms
    Execution time std-deviation : 332.607807 µs
   Execution time lower quantile : 41.083362 ms ( 2.5%)
   Execution time upper quantile : 42.348164 ms (97.5%)
                   Overhead used : 2.128084 ns
(select-benchmarks)
===================
(c/bench (nth avl 0))
WARNING: Final GC required 1.479982375647741 % of runtime
Evaluation count : 207875520 in 60 samples of 3464592 calls.
             Execution time mean : 289.801393 ns
    Execution time std-deviation : 2.361106 ns
   Execution time lower quantile : 286.809191 ns ( 2.5%)
   Execution time upper quantile : 293.622950 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (nth (seq rb) 0))
WARNING: Final GC required 1.46387748103801 % of runtime
Evaluation count : 280530660 in 60 samples of 4675511 calls.
             Execution time mean : 214.456442 ns
    Execution time std-deviation : 1.838107 ns
   Execution time lower quantile : 211.716850 ns ( 2.5%)
   Execution time upper quantile : 217.151388 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 3 outliers in 60 samples (5.0000 %)
low-severe 1 (1.6667 %)
low-mild 1 (1.6667 %)
high-mild 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (nth avl 131071))
WARNING: Final GC required 1.630328987905956 % of runtime
Evaluation count : 2479221180 in 60 samples of 41320353 calls.
             Execution time mean : 22.273116 ns
    Execution time std-deviation : 0.193119 ns
   Execution time lower quantile : 21.913883 ns ( 2.5%)
   Execution time upper quantile : 22.674207 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (nth (seq rb) 131071))
WARNING: Final GC required 1.501022611379085 % of runtime
Evaluation count : 15480 in 60 samples of 258 calls.
             Execution time mean : 3.868847 ms
    Execution time std-deviation : 30.435022 µs
   Execution time lower quantile : 3.815580 ms ( 2.5%)
   Execution time upper quantile : 3.919490 ms (97.5%)
                   Overhead used : 2.128084 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (nth avl 299999))
WARNING: Final GC required 1.590021994376285 % of runtime
Evaluation count : 68943060 in 60 samples of 1149051 calls.
             Execution time mean : 846.178065 ns
    Execution time std-deviation : 5.350532 ns
   Execution time lower quantile : 835.947697 ns ( 2.5%)
   Execution time upper quantile : 855.535318 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (nth (seq rb) 299999))
WARNING: Final GC required 1.4559626406394761 % of runtime
Evaluation count : 3540 in 60 samples of 59 calls.
             Execution time mean : 17.253646 ms
    Execution time std-deviation : 137.868346 µs
   Execution time lower quantile : 17.043361 ms ( 2.5%)
   Execution time upper quantile : 17.566532 ms (97.5%)
                   Overhead used : 2.128084 ns

Found 4 outliers in 60 samples (6.6667 %)
low-severe 4 (6.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(rank-of-benchmarks)
====================
(c/bench (avl/rank-of avl 0))
WARNING: Final GC required 1.435081391872642 % of runtime
Evaluation count : 206515440 in 60 samples of 3441924 calls.
             Execution time mean : 288.829471 ns
    Execution time std-deviation : 2.436205 ns
   Execution time lower quantile : 284.627916 ns ( 2.5%)
   Execution time upper quantile : 293.684025 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (rb-rank-of rb 0))
WARNING: Final GC required 1.5237700419512759 % of runtime
Evaluation count : 116885160 in 60 samples of 1948086 calls.
             Execution time mean : 515.495174 ns
    Execution time std-deviation : 4.405742 ns
   Execution time lower quantile : 509.372306 ns ( 2.5%)
   Execution time upper quantile : 525.137634 ns (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (avl/rank-of avl 131071))
WARNING: Final GC required 1.89767944295473 % of runtime
Evaluation count : 2006788020 in 60 samples of 33446467 calls.
             Execution time mean : 27.768944 ns
    Execution time std-deviation : 0.316862 ns
   Execution time lower quantile : 27.307884 ns ( 2.5%)
   Execution time upper quantile : 28.547999 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 4 outliers in 60 samples (6.6667 %)
low-severe 4 (6.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (rb-rank-of rb 131071))
WARNING: Final GC required 1.557940388859877 % of runtime
Evaluation count : 2700 in 60 samples of 45 calls.
             Execution time mean : 22.387215 ms
    Execution time std-deviation : 172.796714 µs
   Execution time lower quantile : 22.054730 ms ( 2.5%)
   Execution time upper quantile : 22.716453 ms (97.5%)
                   Overhead used : 2.128084 ns
(c/bench (avl/rank-of avl 299999))
WARNING: Final GC required 1.58577957292816 % of runtime
Evaluation count : 71499360 in 60 samples of 1191656 calls.
             Execution time mean : 846.167011 ns
    Execution time std-deviation : 6.919658 ns
   Execution time lower quantile : 834.322212 ns ( 2.5%)
   Execution time upper quantile : 860.323062 ns (97.5%)
                   Overhead used : 2.128084 ns

Found 1 outliers in 60 samples (1.6667 %)
low-severe 1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
(c/bench (rb-rank-of rb 299999))
WARNING: Final GC required 1.518370722058118 % of runtime
Evaluation count : 1200 in 60 samples of 20 calls.
             Execution time mean : 51.116182 ms
    Execution time std-deviation : 384.558939 µs
   Execution time lower quantile : 50.562968 ms ( 2.5%)
   Execution time upper quantile : 51.927406 ms (97.5%)
                   Overhead used : 2.128084 ns

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

[ANN] data.avl 0.0.10 -- fast sorted maps and sets with log-time rank queries

Reply via email to