Re: Thoughts on bags?

Richard Newman Tue, 09 Jun 2009 23:10:05 -0700

> I am very fond of the relational functions in Clojure. That was one of
> the first things that started winning me over actually.


Indeed, they're very nice to have!

> Forgive me if this is an obvious question, but what exactly is the
> disadvantage of the add-an-id approach?

It's largely aesthetic for me: I don't like the idea of having to  
generate some identifier and decorate my data with it. From my  
perspective it's a hack to turn a set into a multiset, which is the  
concept I'm really working with (an unordered collection which  
includes duplicates). One could argue that choosing a name for the ID  
is not obviously easy, and that this is an approach that only works  
well for maps/structs, but those problems don't apply in my case, so I  
won't argue those points!

I haven't done any timing to determine if it's an expensive hack: this  
is not time-critical code, so it doesn't matter much to me. For that  
reason I'll probably stick with this approach, albeit well-commented  
to explain to my future self why I'm temporarily introducing an  
otherwise-unused ID!
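
For concreteness, here is a minimal sketch of the add-an-id decoration. The helper name with-ids and the sample data are hypothetical, made up for illustration, and map-indexed assumes a Clojure version that ships it:

```clojure
(defn with-ids
  "Hypothetical helper: tag each record with a sequential :id so that
   otherwise-identical records stay distinct once placed in a set."
  [records]
  (set (map-indexed (fn [i record] (assoc record :id i)) records)))

;; Made-up sample data: two identical charges plus one different one.
(def raw-charges
  [{:charge 10 :client "Foocorp"}
   {:charge 10 :client "Foocorp"}
   {:charge 15 :client "Barcorp"}])

(count (set raw-charges))      ; the duplicates collapse: 2
(count (with-ids raw-charges)) ; the :id keeps them apart: 3
```

The :id carries no meaning of its own here; it exists only to defeat set semantics.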

I raised this whole issue not because I can't work around it, but  
because I like to use the right tool for the job if it exists, and  
maybe other people already built that tool. Who knows? Perhaps Rich  
has been considering spending an afternoon adding multisets to core,  
and this is additional motivation. After all, we now have sorted-sets,  
which is the other axis of set-hood...


> Or, another way, what would be
> substantially better about having multisets over just doing what
> you're doing? My understanding of relational theory and SQL (thanks
> largely to Joe Celko's books) makes me suspicious of needing
> cardinality—it sounds a lot like wanting access to the physical
> ordering on disk. Then again, a lot of my database tables wind up with
> a sort-order column or an auto-incrementing ID, I admit.

It depends on how "pure" your experience with relational algebra is :)

I've spent a lot of time with SPARQL, the RDF query language. It's
relational (much like SQL for the web), and by default it preserves
cardinality but not ordering. (It has REDUCED and DISTINCT keywords to
discard duplicates if desired or permitted.)

Some people think preserving cardinality is an odd choice, given that  
RDF is defined in terms of sets, not bags, but it has its uses.

Modeling event-like things (charges, in my case) in a pure relational  
system -- one with set semantics -- typically requires the addition of  
two things: a unique identifier to preserve otherwise-identical  
events; and some ordering attribute, to preserve sequentiality in an  
unordered system. Removing the "set-ness" (cardinality,
un-orderedness, or both) is another way to resolve the impedance mismatch.
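
As a rough illustration of dropping the uniqueness half of set-ness, a bag can be approximated as a map from each element to its count. This is only a sketch, not a real multiset type: the data is made up, bag-total is a hypothetical helper, and frequencies assumes a Clojure version that includes it:

```clojure
;; Made-up event data: two identical charges and one different one.
(def charge-events
  [{:client "Foocorp" :charge 10}
   {:client "Foocorp" :charge 10}
   {:client "Barcorp" :charge 15}])

;; Set semantics: the two identical events collapse into one.
(def as-set (set charge-events))

;; Bag semantics, approximated as a map of element -> occurrence count.
(def as-bag (frequencies charge-events))

(defn bag-total
  "Hypothetical helper: sum :charge over a bag, weighting by count."
  [bag]
  (reduce + (map (fn [[row n]] (* n (:charge row))) bag)))

(count as-set)                              ; 2
(get as-bag {:client "Foocorp" :charge 10}) ; 2
(bag-total as-bag)                          ; 35
```

No identifier is needed, but nothing in clojure.set knows how to treat such a map as a relation.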


> Of course, just because it violates relational theory doesn't mean it
> wouldn't be a great addition to the language. I'm curious.
>
> Would you mind sharing the code with the error for the calculation
> you're doing?

I'm afraid I can't share the exact code, but the simplified relational  
part is something like:

(use 'clojure.set)

(defn example-charges
   "Take a relation between charge and identifier, and a relation
    between identifier and client, and sum the charges for each client."
   [charges-rel clients]

   ;; 5. Produce a sum charge for each client in a single map.
   ;; No need to apply merge-with: the index has unique keys.
   (into {}
     (map
       ;; 4. Turn the index into a numeric sum for each client.
       (fn [[k v]]
         [(:client k)
          (reduce + (map :charge v))])

       (index
         (project
           ;; 1. Note that any identifiers not in the clients relation
           ;; will simply disappear at this point.
           (join
             charges-rel
             clients)

           ;; 2. Include :id in the projection to prevent set semantics.
           [:client :charge :id])

         ;; 3. Now index from client to the projected relations.
         #{:client}))))


E.g.,

(example-charges
   #{{:charge 10 :identifier "12345abcdef" :id 0}
     {:charge 10 :identifier "67890ghijkl" :id 1}
     {:charge 15 :identifier "12345poiuyt" :id 2}}
   #{{:identifier "12345abcdef" :client "Foocorp"}
     {:identifier "67890ghijkl" :client "Foocorp"}
     {:identifier "12345poiuyt" :client "Barcorp"}})

=> {"Foocorp" 20, "Barcorp" 15}


Omit the :id and we get this:

(example-charges
   #{{:charge 10 :identifier "12345abcdef"}
     {:charge 10 :identifier "67890ghijkl"}
     {:charge 15 :identifier "12345poiuyt"}}
   #{{:identifier "12345abcdef" :client "Foocorp"}
     {:identifier "67890ghijkl" :client "Foocorp"}
     {:identifier "12345poiuyt" :client "Barcorp"}})

=> {"Barcorp" 15, "Foocorp" 10}

Oops! We're going to under-charge Foocorp!

You get the same result if you omit the :id from the projection vector.
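
To show where the rows go missing, here is a small sketch using the id-less data from above. The join keeps all three rows, because :identifier still tells them apart; it is the projection that collapses them. The function sum-charges-by-client is a hypothetical variant for illustration, not the code I'm actually running:

```clojure
(require '[clojure.set :as set])

(def charges
  #{{:charge 10 :identifier "12345abcdef"}
    {:charge 10 :identifier "67890ghijkl"}
    {:charge 15 :identifier "12345poiuyt"}})

(def clients
  #{{:identifier "12345abcdef" :client "Foocorp"}
    {:identifier "67890ghijkl" :client "Foocorp"}
    {:identifier "12345poiuyt" :client "Barcorp"}})

(def joined (set/join charges clients))

(count joined)                                 ; 3 rows survive the join
(count (set/project joined [:client :charge])) ; 2 rows after projection

(defn sum-charges-by-client
  "Hypothetical variant: sum straight off the join, skipping project,
   so no rows are merged before the addition happens."
  [charges-rel clients-rel]
  (reduce (fn [totals {:keys [client charge]}]
            (assoc totals client (+ charge (get totals client 0))))
          {}
          (set/join charges-rel clients-rel)))

(sum-charges-by-client charges clients)
;; {"Foocorp" 20, "Barcorp" 15}
```

This only helps when something in the joined rows (here :identifier) already keeps them distinct. Two identical charges against the same identifier would collapse in the input set before the function ever saw them, which is exactly the case a multiset would handle.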

Thanks,

-R