Re: [elephant-devel] Query System

Ian Eslick Sat, 10 May 2008 07:13:58 -0700


On May 10, 2008, at 12:26 AM, [EMAIL PROTECTED] wrote:

Hi Ian,
Thanks for your comments. They do make some points a bit clearer and bring others to the table. I'd like to see others comment as well before we continue moving forward.
In summary, I agree to follow your suggestion. However, one thing that still remains unclear to me is the type of result expected from the query system.

See my comments below, but I think we want to avoid consing large sets where possible. So once we've defined a set of objects using a query expression, we may want to:


1) Perform a non-consing map operation over it
2) Return it in a list
3) Intersect it with another query set
4) Return a list of slot values, or map over the slot values

#2 is easy to derive from #1 - that's what get-instances-by-value does today.


#3, to be efficient, may require lazy query evaluation.

#4 is what you discuss below.

I think the general idea is that a query defines a set of objects, but how you use it defines the time/space cost of evaluating that set. I actually like the Django notion of lazy query sets - you can perform operations like intersecting two query sets, but the actual query isn't executed until you need to operate on a member of the set. This gives the query compiler the maximum information to optimize.
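
To make the lazy query set idea concrete, here is a minimal sketch in plain Common Lisp. It is not Elephant code: query-set, make-query, query-intersect, map-query-set, query-list and list-source are all hypothetical names, and the "source" is assumed to be any function that calls its argument on every candidate object. Intersecting two queries only merges their constraints; nothing is scanned until the set is mapped.

```lisp
;; Hypothetical sketch; none of these names are part of Elephant.
(defstruct (query-set (:constructor %make-query-set (source predicates)))
  source       ; function of one argument FN that calls FN on every candidate
  predicates)  ; list of one-argument tests; an object matches if all hold

(defun make-query (source &rest predicates)
  "Describe a query over SOURCE without running it."
  (%make-query-set source predicates))

(defun query-intersect (a b)
  "Merge the constraints of A and B. Assumes both range over the same source."
  (%make-query-set (query-set-source a)
                   (append (query-set-predicates a)
                           (query-set-predicates b))))

(defun map-query-set (fn qs)
  "Run the query now, calling FN on each match without consing a result list."
  (funcall (query-set-source qs)
           (lambda (obj)
             (when (every (lambda (p) (funcall p obj))
                          (query-set-predicates qs))
               (funcall fn obj)))))

(defun query-list (qs)
  "Return the matches as a list, built on top of the mapping operation."
  (let ((acc '()))
    (map-query-set (lambda (obj) (push obj acc)) qs)
    (nreverse acc)))

(defun list-source (items)
  "Stand-in for a class or index scan: maps over an in-memory list."
  (lambda (fn) (mapc fn items)))

;; Usage: the scan happens once, inside QUERY-LIST.
(defparameter *numbers* (list-source '(1 2 3 4 5 6 7 8 9 10)))

(query-list (query-intersect (make-query *numbers* #'evenp)
                             (make-query *numbers* (lambda (n) (> n 4)))))
;; => (6 8 10)
```

A real query compiler would inspect the merged constraints to choose an index before scanning; this sketch only shows the deferred evaluation.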

I was under the assumption that we want to continue handling objects of the persistent classes in Elephant. Thus, the results from the query should be a set (list, etc) of objects matching the criteria. However, reading your comments makes me think that what you're suggesting is that the result set be something closer to what a SQL query would return. So, going back to my example, instead of receiving a set of Books, you would get back something like:

Wow, that was not the impression I intended to make! I definitely prefer an object-centric view by default. We may want to return pairs of objects, however. Returning values is easy to implement on top of a map, as the example below illustrates.

((book_oid book_title book_author publisher_oid publisher_name publisher_year)
 (value11 value12 value13 value14 value15 value16)
 (value21 value22 value23 value24 value25 value26)
 ...)
In other words, maybe something like a list that contains a list of the slot names returned and then the list of matching values, where the list of slots is either the concatenation of all the slots in the persistent classes (in the case of something like SELECT Books.*, Publishers.* ...) or the specific slots requested (e.g. SELECT Books.author, Publishers.name ...).

Of course it's easy to do the above with a map operation or wrapper with a closure that binds map-fn, the user's function, and slots, a list of slots to operate on.


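;; map-fn and slots are variables captured by the closure.
(lambda (obj)
  (apply map-fn
         ;; read each requested slot from the object
         (mapcar (lambda (slot) (slot-value obj slot))
                 slots)))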

So the user would see something like:

(map-query (:slots name city state country)
   (lambda (name city state country)
      ...))

It would mostly be a convenience function built on top of the basic query mechanism, and not essential to start with...
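
As a rough sketch of that convenience layer, the closure above can be produced by a small helper. This uses plain CLOS and slot-value rather than Elephant persistent classes; make-slot-mapper and the place class are hypothetical names invented for the illustration, and mapc over a list stands in for whatever map operation the query system ends up providing.

```lisp
;; Hypothetical helper, not part of Elephant.
(defclass place ()
  ((name    :initarg :name)
   (city    :initarg :city)
   (country :initarg :country)))

(defun make-slot-mapper (map-fn slots)
  "Return a function of one object that calls MAP-FN on the values of SLOTS."
  (lambda (obj)
    (apply map-fn
           (mapcar (lambda (slot) (slot-value obj slot))
                   slots))))

;; Usage: the user's function receives slot values, not objects.
(let ((rows '()))
  (mapc (make-slot-mapper (lambda (name city) (push (list name city) rows))
                          '(name city))
        (list (make-instance 'place :name "Alice" :city "Boston" :country "USA")))
  rows)
;; => (("Alice" "Boston"))
```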

And therefore, the result set would be treated as simply a set of values instead of a set of objects. Within the result set, you could include the OIDs so that you could eventually instantiate the particular object and work with it via the object model, or you could simply just use the values returned, which is what you originally asked for in the SELECT statement.

I do think the semantics of a query should be that query results are a set of objects, rather than only persistent objects. It means the user can work with a result list, or map over the results; working with whatever my query string specifies. That said, we should start by just extracting sets of objects for users to map over that satisfy the constraints. We definitely should be able to specify that the system only return oids.

If my assumption is correct from what you're saying, that certainly clarifies a lot of my concerns and doubts. I'm sure more details will arise along the way, but that could be a starting point to sketch the system.
Comments?

Thanks,
Daniel

On May 9, 2008, at 9:54 PM, Ian Eslick wrote:
Welcome back Daniel, we all know the work drill!

Here are a few thoughts to throw into the mix...
One advantage of the relational model is that you have implicit data structures (tables) that can be assembled from existing tables via the SQL query. This is nice because it means we don't have to explicitly create and maintain the structure for all these derived data structures. In a pure lisp model, you actually have to do all this maintenance yourself, especially the optimizations necessary for efficiency that add to complexity. I feel that Elephant should probably fall somewhere in-between. You maintain the data structures that you want to work with in your program logic, but the system can maintain pointers and indices and other relationships that make it easy and efficient to generate and work with subsets of objects (a user's inbox, for example).
Some of the limitations/frustrations with the current system may be caused by people trying to do familiar relational tasks in the OODB framework.
I also think that Robert's lisp-as-query-language works well for the prevalence model when all objects are in memory, but I think it's less practical in, say, BDB where you are going to disk a lot. However, it's a good discipline to consider - when does it make sense to add new syntax/apis and when does it make sense to use lisp directly.
You mentioned associations. The best way to think about associations is that they are an easy way to maintain back pointers. For example, if a message object has a slot that contains a reference to a user, we may also want the user object to have an accessor that provides quick and efficient access to the collection of messages that point to it. That's what associations are for. You could do this by declaring after methods on (setf (user message) value) that add the message to a pset sitting in a user instance slot, but that gets tedious. As Leslie says, we're trying to make common cases simple and reasonably efficient.
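
For reference, the manual version of such a back pointer looks roughly like the sketch below. It uses ordinary CLOS classes and a plain list where Elephant would use persistent classes and a pset; the class, slot and accessor names are assumptions made for the example. The :before method is an addition beyond the after method mentioned above, so that reassigning a message also removes it from its previous user.

```lisp
;; Illustrative only: in-memory CLOS objects, a list instead of a pset.
(defclass user ()
  ((name     :initarg :name :reader user-name)
   (messages :initform '() :accessor user-messages)))

(defclass message ()
  ((text :initarg :text :reader message-text)
   (user :initform nil :accessor message-user)))

;; Before the slot changes, drop the message from its previous user, if any.
(defmethod (setf message-user) :before (new-user (msg message))
  (declare (ignore new-user))
  (let ((old (message-user msg)))
    (when old
      (setf (user-messages old) (remove msg (user-messages old))))))

;; After the slot changes, record the back pointer on the new user.
(defmethod (setf message-user) :after ((new-user user) (msg message))
  (pushnew msg (user-messages new-user)))

;; Usage
(let ((u (make-instance 'user :name "u"))
      (m (make-instance 'message :text "hello")))
  (setf (message-user m) u)
  (length (user-messages u)))
;; => 1
```

Writing these two methods for every pair of related classes is the tedium that a declarative association would remove.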
So the approach I'd like to see taken to designing the query framework is to capture the use cases and metaphors that people are really interested in and are encountering in real-world use and pick the largest subset that fits nicely into a clean, theoretical conceptual model. There are already a good number (Leslie, Alex, etc) on the list that we could start with.
For example, I often find myself wanting to filter a set of objects by more than one parameter (messages from user U that are high priority between 4/1/08 and 5/1/08). What is the complexity of different approaches afforded by the existing Elephant implementation?
In increasing order of computational efficiency (I surmise):
1. Scan all messages and collect/operate on only those matching all criteria.
2. Scan an index on messages instead of all messages; pick the one likely to yield the smallest subset.
3. Intersection: scan two or more indexes for subsets represented as sequences of oids; instantiate, filter, and operate on the objects represented by the intersection.
4. Create an index that orders objects by all three parameters and just walk the matching set. Trade off space for time.
Any others?
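
The intersection approach (3) can be sketched as a merge over oid sequences. This is a minimal illustration, assuming each index scan has already been reduced to an ascending list of integer oids (which may require a sort, since an index scan is ordered by key rather than by oid); intersect-sorted-oids is a hypothetical name, not an Elephant function.

```lisp
(defun intersect-sorted-oids (a b)
  "Return the oids common to A and B, two ascending lists of integers."
  (let ((result '()))
    (loop while (and a b)
          do (let ((x (first a))
                   (y (first b)))
               (cond ((< x y) (pop a))        ; x cannot be in B, skip it
                     ((> x y) (pop b))        ; y cannot be in A, skip it
                     (t (push x result)       ; present in both
                        (pop a)
                        (pop b)))))
    (nreverse result)))

;; Usage: oids from a "user" index and a "priority" index.
(intersect-sorted-oids '(1 3 5 7 9) '(3 4 5 9 12))
;; => (3 5 9)
```

Only the oids in the result need to be instantiated, which is where this approach saves work over scanning a single index and filtering objects.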
The other consideration is the conceptual framework we want to use to approach the problem. Procedural? Constraint satisfaction? Logical form? Graph matching? There are some good examples of existing OODB systems in lisp out there (PLOB, AllegroStore/AllegroCache, Statice, etc). If you search the list archives, I think I've forwarded references in the past.
I tend to lean towards a constraint satisfaction approach, as my sketch demonstrates. "Operate on the set of objects that satisfy these constraints." There are a bunch of practical issues. Do we map query sets? Do we cache them? Do we represent them as lists? Are they lazily evaluated? If we don't have a DSL, but allow arbitrary lisp expressions, then there isn't enough information to automatically select indexes, perform intersections, etc.
My other strong suggestion, besides starting by capturing the major use cases, is that we begin by implementing a procedural approach by implementing the building blocks for filter, sort, intersect, etc. If we take the list of four filtering approaches above, we can start writing code that does these things and use them to implement some of the use cases. The common building blocks and problems that we discover will inform the additions we'll want to the MOP, new implicit data structures like associations, the most convenient query syntax, etc. Plus it will be useful in the meantime. This fits into the classic lisp bottom-up DSL development model (well proselytized by Paul Graham).
Ian



On May 9, 2008, at 6:02 PM, [EMAIL PROTECTED] wrote:
Hello everyone,
I apologize for being disconnected for so long. I had volunteered to help with the query system and should have made more progress by now. Unfortunately, the same as some (most or all) of you, putting food on the table for my family has a higher priority and my current job has demanded 110% of my time lately.
Enough excuses! I have been passively reading several of your email threads. I am convinced that a query system will bring a lot of value to Elephant. The question that still arises is whether or not people want a SQL-like syntax or a Lisp-like syntax.
As Ian has suggested, publicly and/or privately, we should start designing the query system in a very basic form. The most critical part would be query optimization, which I'd rather work on after we have the basic query system in place. But there are a lot of decisions to make before we get there and coming to a consensus on how it should look and how it should work is of critical importance.
From a simplistic point of view, a SQL-like syntax should allow for the execution of the basic relational algebraic operations (union, difference, cartesian product, projection, and selection). For the most part, these would not be difficult to implement. However, IMHO, there is an intrinsic "contradiction" in applying a SQL-like syntax on top of Elephant.
Assume you have the following Tables (relations) in a SQL world:

Books (
book_id,
title,
author
)

Publishers (
publisher_id,
name
)

BooksPublishers (
book_id,
publisher_id,
year
)
Suppose you wanted to get the cartesian product of all the books published in 2008; you could run a SQL query like:
SELECT Books.*, Publishers.*
FROM Books, Publishers, BooksPublishers
WHERE Books.book_id = BooksPublishers.book_id
  AND Publishers.publisher_id = BooksPublishers.publisher_id
  AND BooksPublishers.year = 2008
The result will be a concatenation of all the columns from the Books and Publishers tables. In a SQL-world, you would access these results in a key-value pair type mode (e.g. Books.book_id = 1, Books.title = "1984", etc). However, when you think in terms of Elephant (at least my understanding of it), you're dealing with objects and not key-value pairs from multiple tables. So, instead of getting a concatenation of all the columns, you "should" be getting just a list of Book objects (or Publisher objects) that met your query criteria, such that when you iterate thru them, you could "query" their Publishers (or the Books). So, if we had something like (please keep in mind this is no suggestion to syntax or correctness but just for illustrative purposes):
(defpclass book ()
  ((title :accessor book-title :index t)
   (author :accessor book-author :index t)
   (published_copies :accessor book-copies :initform (make-pset))))

(defpclass publisher ()
  ((name :accessor publisher-name :index t)))

;; Store each published copy as a (publisher year) list in the book's pset.
(defmethod add-published-copy ((bk book) (pb publisher) year)
  (insert-item (list pb year) (book-copies bk)))

(defmethod map-published-copies (fn (bk book))
  (map-pset fn (book-copies bk)))

;; SELECT and $bk are hypothetical query syntax, shown for illustration only.
(setq objs (select book
             :where ((map-published-copies
                      (lambda (item year) (= (second item) year))
                      $bk 2008))))
From then on, you could just iterate through the book objects in the result set for their respective published copies. The problem with this is that, ok, you get all the books that met your criteria but if you then wanted to get a list of all the published copies, you would need to apply the filter criteria again. The reason I think it "should behave" this way is because Elephant deals with sets of objects, and you use Lisp to navigate through the object space, whereas in a SQL-world you are not dealing with objects but with a result set that contains all the columns you asked for. If we were to emulate the same behavior in the query system, that would sort of defeat the purpose of Elephant. For that matter, you might as well use some of the other libraries (e.g. CL-SQL, cl-perec, cl-rdbms, etc).
The above example is a very simple example. We haven't looked at SORTING, LIMIT, OFFSET, etc. Things which will simply make this whole dilemma more difficult.
I haven't looked into Ian's association mechanism yet. Maybe the query system could/should be an extension to that with some specialized features to apply filter criteria instead (and possibly evolve into something similar to Ruby's ActiveRecord). I know the association mechanism is still being developed and I haven't really seen anyone comment much on it other than what Ian has mentioned. In one of Ian's comments, he said:
"A more general query language is probably the right solution for this interface. The query language would know about associations, derived indices, etc and perform query planning via introspection over the class objects."
At the same time, Robert said on another thread:
"One might philosophically prefer SQL. I personally vastly prefer to work in a powerful programming language to accomplish these things. Obviously, whether two classes that refer to each other stand in a "parent-child" relationship or not depends entirely on the circumstances. I prefer to write simple functions such as "delete-order" below, which both utilize and (in a sense) expand the power of LISP applied to persistent objects."
Leslie said on yet another thread:
"While I'm at it: OFFSET and LIMIT (a real limit which lets you specify an arbitrary Lisp expression) are things we definitely want to aim for in 1.0. They are not difficult to implement at all, but they don't work with GET-INSTANCES-BY-* and, worse, MAP-BTREE. This means everyone has to write their own version of these functions that take appropriate arguments and move the cursor around themselves instead of relying on a simple high-level API.
I'd have implemented these extensions myself, but I thought it better to wait for the integration of the query language to add it."
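
To illustrate what an OFFSET/LIMIT wrapper over a mapping operation might look like, here is a small sketch in plain Common Lisp. It does not use Elephant cursors: map-with-offset-limit is a hypothetical name, and the mapper argument is assumed to be any function that calls its argument on each item in order. A real implementation would move a cursor past the offset instead of visiting the skipped items.

```lisp
(defun map-with-offset-limit (fn mapper &key (offset 0) limit)
  "Call FN on the items produced by MAPPER, skipping the first OFFSET items
and stopping early once LIMIT items have been processed.
Returns the number of items passed to FN."
  (let ((seen 0)
        (taken 0))
    (block done
      (funcall mapper
               (lambda (item)
                 (cond ((and limit (>= taken limit))
                        (return-from done))
                       ((< seen offset)
                        (incf seen))
                       (t
                        (funcall fn item)
                        (incf taken)
                        ;; leave the scan as soon as the limit is reached
                        (when (and limit (>= taken limit))
                          (return-from done)))))))
    taken))

;; Usage: skip two items, then take three.
(let ((out '()))
  (map-with-offset-limit (lambda (x) (push x out))
                         (lambda (f) (mapc f '(a b c d e f)))
                         :offset 2 :limit 3)
  (nreverse out))
;; => (C D E)
```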
And Alex said:
"I think main problem is not how it looks, but that query language actually makes programming a lot easier."
All those comments make sense. There seems to be a group agreement that something is needed, but everyone has their own ideas of how it should work. Both the query language and the associations are still being developed, so if we get consensus on how these should work, it may give a better direction to both feature sets. If anyone has any comments or suggestions as to whether a query system would be of real interest/necessity and if so, which would be the preferred query syntax and expected behavior, that would really help.
I'm willing to work on this in as much as possible with my limited knowledge of Lisp and Elephant. However, given a clear direction of where this should go, I will be able to focus better and learn faster what I haven't learned so far.
Again, your feedback is much appreciated. I'm hopeful to be able to work more on this over the weekend, assuming I get some feedback from you guys.
Thanks
Daniel
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel