Re: Thoughts on bags?

Rich Hickey Wed, 10 Jun 2009 04:40:19 -0700

On Jun 10, 2:09 am, Richard Newman <holyg...@gmail.com> wrote:
> > I am very fond of the relational functions in Clojure. That was one of
> > the first things that started winning me over actually.
>
> Indeed, they're very nice to have!
>
> > Forgive me if this is an obvious question, but what exactly is the
> > disadvantage of the add-an-id approach?
>
> It's largely aesthetic for me: I don't like the idea of having to
> generate some identifier and decorate my data with it. From my
> perspective it's a hack to turn a set into a multiset, which is the
> concept I'm really working with (an unordered collection which
> includes duplicates). One could argue that choosing a name for the ID
> is not obviously easy, and that this is an approach that only works
> well for maps/structs, but those problems don't apply in my case, so I
> won't argue those points!
>
> I haven't done any timing to determine if it's an expensive hack: this
> is not time-critical code, so it doesn't matter much to me. For that
> reason I'll probably stick with this approach, albeit well-commented
> to explain to my future self why I'm temporarily introducing an
> otherwise-unused ID!
>
> I raised this whole issue not because I can't work around it, but
> because I like to use the right tool for the job if it exists, and
> maybe other people already built that tool. Who knows? Perhaps Rich
> has been considering spending an afternoon adding multisets to core,
> and this is additional motivation. After all, we now have sorted-sets,
> which is the other axis of set-hood...
>
> > Or, another way, what would be
> > substantially better about having multisets over just doing what
> > you're doing? My understanding of relational theory and SQL (thanks
> > largely to Joe Celko's books) makes me suspicious of needing
> > cardinality—it sounds a lot like wanting access to the physical
> > ordering on disk. Then again, a lot of my database tables wind up with
> > a sort-order column or an auto-incrementing ID, I admit.
>
> It depends on how "pure" your experience with relational algebra is :)
>
> I've spent a lot of time with SPARQL, the RDF query language. It's
> relational (much like SQL for the web), and it preserves cardinality
> by default, though not ordering. (It has REDUCED and DISTINCT keywords to
> discard duplicates if desired or permitted.)
>
> Some people think preserving cardinality is an odd choice, given that
> RDF is defined in terms of sets, not bags, but it has its uses.
>
> Modeling event-like things (charges, in my case) in a pure relational
> system -- one with set semantics -- typically requires the addition of
> two things: a unique identifier to preserve otherwise-identical
> events; and some ordering attribute, to preserve sequentiality in an
> unordered system. Removing the "set-ness" (cardinality,
> unorderedness, or both) is another way to resolve the impedance mismatch.
>
> > Of course, just because it violates relational theory doesn't mean it
> > wouldn't be a great addition to the language. I'm curious.
>
> > Would you mind sharing the code with the error for the calculation
> > you're doing?
>
> I'm afraid I can't share the exact code, but the simplified relational
> part is something like:
>
> (use 'clojure.set)
>
> (defn example-charges
>    "Take a relation between charge and identifier, and a relation
>     between identifier and client, and sum the charges for each client."
>    [charges-rel clients]
>
>    ;; 5. Produce a sum charge for each client in a single map.
>    ;; No need to apply merge-with: the index has unique keys.
>    (into {}
>      (map
>        ;; 4. Turn the index into a numeric sum for each client.
>        (fn [[k v]]
>          [(:client k)
>           (reduce + (map :charge v))])
>
>        (index
>          (project
>            ;; 1. Note that any identifiers not in the clients relation
>            ;; will simply disappear at this point.
>            (join
>              charges-rel
>              clients)
>
>            ;; 2. Include :id in the projection to prevent set semantics.
>            [:client :charge :id])
>
>          ;; 3. Now index from client to the projected relations.
>          #{:client}))))
>
> E.g.,
>
> (example-charges
>    #{{:charge 10 :identifier "12345abcdef" :id 0}
>      {:charge 10 :identifier "67890ghijkl" :id 1}
>      {:charge 15 :identifier "12345poiuyt" :id 2}}
>    #{{:identifier "12345abcdef" :client "Foocorp"}
>      {:identifier "67890ghijkl" :client "Foocorp"}
>      {:identifier "12345poiuyt" :client "Barcorp"}})
>
> => {"Foocorp" 20, "Barcorp" 15}
>
> Omit the :id and we get this:
>
> (example-charges
>    #{{:charge 10 :identifier "12345abcdef"}
>      {:charge 10 :identifier "67890ghijkl"}
>      {:charge 15 :identifier "12345poiuyt"}}
>    #{{:identifier "12345abcdef" :client "Foocorp"}
>      {:identifier "67890ghijkl" :client "Foocorp"}
>      {:identifier "12345poiuyt" :client "Barcorp"}})
>
> => {"Barcorp" 15, "Foocorp" 10}
>
> Oops! We're going to under-charge Foocorp!
>
> You get the same result if you omit the :id from the projection vector.
>

While I think bags and multimaps would be nice additions to the core
collections, I am not sympathetic to this argument about the
relational ops. Allowing duplicates is not relational; it complicates
all logic involving relations. So, even if we had bags, the relational
ops would still be relational, i.e. relations are (true) sets.
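
To illustrate with the quoted data, here is a minimal sketch, not a
recommendation from core. The names charges, clients and sum-charges are
made up for this example; only join and project come from clojure.set.
It shows project collapsing the two Foocorp rows because its result is a
set, and one possible way to sum without an extra :id by aggregating the
joined rows before any projection. This assumes, as in the quoted data,
that each charge row is already distinct by :identifier.

```clojure
(require '[clojure.set :as set])

;; Hypothetical sample data, copied from the quoted example without :id.
(def charges
  #{{:charge 10 :identifier "12345abcdef"}
    {:charge 10 :identifier "67890ghijkl"}
    {:charge 15 :identifier "12345poiuyt"}})

(def clients
  #{{:identifier "12345abcdef" :client "Foocorp"}
    {:identifier "67890ghijkl" :client "Foocorp"}
    {:identifier "12345poiuyt" :client "Barcorp"}})

;; project returns a set, so the two identical
;; {:client "Foocorp" :charge 10} rows become one.
(def projected
  (set/project (set/join charges clients) [:client :charge]))
;; projected holds 2 rows, not 3

;; Hypothetical helper: sum over the joined rows directly. The join
;; keeps :identifier, so the rows are still distinct at this point.
(defn sum-charges
  [charges-rel clients-rel]
  (reduce (fn [acc {:keys [client charge]}]
            (assoc acc client (+ charge (get acc client 0))))
          {}
          (set/join charges-rel clients-rel)))

(sum-charges charges clients)
;; => {"Foocorp" 20, "Barcorp" 15}   (key order may vary)
```

If two charge rows were identical in every attribute, the input set
itself would already have merged them, so this sketch does not replace
an :id in that case.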

Rich