On May 10, 2008, at 12:26 AM, [EMAIL PROTECTED] wrote:
Hi Ian,
Thanks for your comments. They do make some points a bit clearer and
bring others to the table. I'd like to see others comment as well
before we continue moving forward.
In summary, I agree to follow your suggestion. However, one thing
that still remains unclear to me is the type of result expected from
the query system.
See my comments below, but I think we want to avoid consing large sets
where possible. So once we've defined a set of objects using a query
expressions we may want to:
1) Perform a non-consing map operation over it
2) Return it in a list
3) intersect it with another query set
4) returning a list of slot values or mapping over the slot value
#2 is easy to derive from #1 - that's what get-instances-by-value does
today.
#3, to be efficient, may require lazy query evaluation.
#4 is what you discuss below.
I think the general idea is that a query defines a set of objects, but
how you use it defines the time/space cost of evaluating that set. I
actually like the Django notion of lazy query sets - you can perform
operations like intersecting two query sets, but the actual query
isn't executed until you need to operate on a member of the set. This
gives the query compiler the maximum information to optimize.
I was under the assumption that we want to continue handling objects
of the persistent classes in Elephant. Thus, the results from the
query should be a set (list, etc) of objects matching the criteria.
However, reading your comments makes me think that what you're
suggesting is that the result set be something closer to what a SQL
query would return. So, going back to my example, instead of
receiving a set of Books, you would get back something like:
Wow, that was not the impression I intended to make! I definitely
prefer an object-centric view by default. We may want to return pairs
of objects, however. Returning values is easy to implement on top of
a map, as the example below illustrates.
((book_oid book_title book_author publisher_oid publisher_name
publisher_year) (value11 value12 value 13, value14, value15 value16)
(value21 value22 value23 value24 value25 value26) ... )
In other words, maybe something like a list that contains a list of
the slot names returned and then the list of matching values, where
the list of slots is either the concatenation of all the slots in
the persistent classes (in the case of something like SELECT
Books.*, Publishers.* ...) or the specific slots requested (e.g.
SELECT Books.author, Publishers.name ...).
Of course it's easy to do the above with a map operation or wrapper
with a closure that binds map-fn, the user's function, and slots, a
list of slots to operate on.
(lambda (obj)
(apply #'map-fn
(mapcar (lambda (slot) (funcall accessor obj))
slots)))
So the user would see something like:
(map-query (:slots name city state country)
(lambda (name city state country)
...))
It would mostly be a convenience function built on top of the basic
query mechanism, and not essential to start with...
And therefore, the result set would be treated as simply a set of
values instead of a set of objects. Within the result set, you could
include the OIDs so that you could eventually instantiate the
particular object and work in via the object model, or you could
simply just use the values returned, which is what you originally
asked for in the SELECT statement.
I do think the semantics of a query should be that query results are a
set of objects, rather than only persistent objects. It means the
user can work with a result list, or map over the results; working
with whatever my query string specifies. That said, we should start
by just extracting sets of objects for users to map over that satisfy
the constraints. We definitely should be able to specify that the
system only return oids.
If my assumption is correct from what you're saying, that certainly
clarifies a lot or my concerns and doubts. I'm sure more details
will arise along the way, but that could be a starting point to
sketch the system.
Comments?
Thanks,
Daniel
On May 9, 2008, at 9:54 PM, Ian Eslick wrote:
Welcome back Daniel, we all know the work drill!
Here are a few thoughts to throw into the mix...
One advantage of the relational model is that you have implicit
data structures (tables) that can be assembled from existing tables
via the SQL query. This is nice because it means we don't have to
explicitly create and maintain the structure for all these derived
data structures. In a pure lisp model, you actually have to do all
this maintenance yourself, especially the optimizations necessary
for efficiency that add to complexity. I feel that Elephant should
probably fall somewhere in-between. You maintain the data
structures that you want to work with in your program logic, but
the system can maintain pointers and indices and other
relationships that make it easy and efficient to generate and work
with subsets of objects efficiently (a user's inbox, for example).
Some of the limitations/frustrations with the current system may be
caused by people trying to do familiar relational tasks in the OODB
framework.
I also think that Robert's lisp-as-query-language works well for
the prevalence model when all objects are in memory, but I think
it's less practical in, say, BDB where you are going to disk alot.
However, it's a good discipline to consider - when does it makes
sense to add new syntax/apis and when does it make sense to use
lisp directly.
You mentioned associations. The best way to think about
associations is that it is an easy way to maintain back pointers.
For example, if a message object has a slot that contains a
reference to a user, we may also want the user object to have an
accessor that provides quick and efficient access to the collection
of messages that point to it. That's what associations are for.
You could do this by declaring after methods on (setf (user
message) value) that add the message to a pset sitting in a user
instance slot, but that gets tedious. As Leslie says, we're trying
to make common cases simple and reasonably efficient.
So the approach I'd like to see taken to designing the query
framework is to capture the use cases and metaphors that people are
really interested in and are encountering in real-world use and
pick the largest subset that fits nicely into a clean, theoretical
conceptual model. There are already a good number (Leslie, Alex,
etc) on the list that we could start with.
For example, I often find myself wanting to filter a set of objects
by more than one parameter (messages from user U that are high
priority between 4/1/08 and 5/1/08). What is the complexity of
different approaches afforded by the existing Elephant
implementation?
In order of computational efficiency (I surmise):
1. scan all messages and collect/operate on only those matching all
criteria
2. scan an index on messages instead of all messages; pick the one
likely to yield the smallest subset
3. intersection: scan two or more indexes for subsets represented
as sequences of oids, instantiate, filter and operate on the
objects represented by the intersection.
4. create an index that orders objects by all three parameters and
just walk the matching set. Trade off space for time.
Any others?
The other consideration is the conceptual framework we want to use
to approach the problem. Procedural? Constraint satisfaction?
Logical form? Graph matching? There are some good examples of
existing OODB systems in lisp out there (PLOB, AllegroStore/
AllegroCache, Statice, etc). If you search the list archives, I
think I've forwarded references in the past.
I tend to lean towards a constraint satisfaction approach, as my
sketch demonstrations. "Operate on the set of objects that satisfy
these constraints." There are a bunch of practical issues. Do we
map query sets? Do we cache them? Do we represent them as lists?
Are they lazily evaluated? If we don't have a DSL, but allow
arbitrary lisp expressions, then there isn't enough information to
automatically select indexes, perform intersections, etc.
My other strong suggestion, besides starting by capturing the major
use cases, is that we begin by implementing a procedural approach
by implementing the building blocks for filter, sort, intersect,
etc. If we take the list of four filtering approaches above, we
can start writing code that do these things and use them to
implement some of the use cases. The common building blocks and
problems that we discover will inform the additions we'll want to
the MOP, new implicit data structures like associations, the most
convenient query syntax, etc. Plus it will be useful in the
meantime. This fits into the classic lisp bottom-up DSL
development model (well proselytized by Paul. Graham).
Ian
On May 9, 2008, at 6:02 PM, [EMAIL PROTECTED] wrote:
Hello everyone,
I apologize for being disconnected for so long. I had volunteered
to help in the query system and should have done more progress by
now. Unfortunately, the same as some (most or all) of you, putting
food on the table for my family has a higher priority and my
current job has demanded 110% of my time lately.
Enough excuses! I have been passively reading several of your
email threads. I am convinced that a query system will bring a lot
of value to Elephant. The question that still arises is whether or
not people want a SQL-like syntax or a Lisp-like syntax.
As Ian has suggested, publicly and/or privately, we should start
designing the query system in a very basic form. The most critical
part would be query optimization, which I'd rather work on after
we have the basic query system in place. But there are a lot of
decisions to make before we get there and coming to a consensus of
how it should look and how it should work is of critical importance.
From a simplistic point of view, a SQL-like syntax should allow
for the execution of the basic relational algebraic operations
(union, difference, cartesian product, projection, and selection).
For the most part, these would not be difficult to implement.
However, IMHO, there is an intrinsic "contradiction" in applying a
SQL-like syntax on top of Elephant.
Assume you have the following Tables (relations) in a SQL world:
Books (
book_id,
title,
author
)
Publishers (
publisher_id,
name
)
BooksPublishers (
book_id,
publisher_id,
year
)
Suppose you wanted to get the cartesian product of all the books
published in 2008, you could run a SQL query like:
SELECT Books.*, Publishers.* FROM Books, Publishers,
BooksPublishers WHERE Books.book_id = BooksPublishers.book_id AND
Publishers.publisher_id = BooksPublishers.publisher_id AND
BooksPublishers.year = 2008
The result will be a concatenation of all the columns from the
Books and Publishers tables. In a SQL-world, you would access
these results in a key-value pair type mode (e.g. Books.book_id =
1, Books.title = "1984", etc). However, when you think in terms of
Elephant (at least my understanding of it), you're dealing with
objects and not key-value pairs from multiple tables. So, instead
of getting a concatenation of all the columns, you "should" be
getting just a list of Book objects (or Publisher objects) that
met your query criteria, such that when you iterate thru them, you
could "query" their Publishers (or the Books). So, if we had
something like (please keep in mind this is no suggestion to
syntax or correctness but just for illustrative purposes):
(defpclass book ()
((title :accessor book-title :index t)
(author :accessor book-author :index t)
(published_copies :accessor book-copies :initform (make-pset))))
(defpclass publisher ()
((name :accessor publisher-name :index t)))
(defmethod add-published-copy ((bk book) (pb publisher) year)
(insert-item '(pb year) (book-copies bk)))
(defmethod map-published-copies (fn (bk book))
(map-pset fn (book-copies bk)))
(setq objs (select book :where ((map-published-copies (lambda
(item year) (= (second item) year)) $bk 2008)))))
From then on, you could just iterate through the book objects in
the result set for their respective published copies. The problem
with this is that, ok, you get all the books that met your
criteria but if you then wanted to get a list of all the published
copies, you would need to apply the filter criteria again. The
reason I think it "should behave" this way is because Elephant
deals with sets of objects, and you use Lisp to navigate through
the object space, whereas in a SQL-world you are not dealing with
objects but with a result set that contains all the columns you
asked for. If we were to emulate the same behavior in the query
system, that would sort of defeat the purpose of Elephant. For
that matter, you might as well use some of the other libraries
(e.g. CL-SQL, cl-perec, cl-rdbms, etc).
The above example is a very simple example. We haven't looked at
SORTING, LIMIT, OFFSET, etc. Things which will simply make this
whole dilemma more difficult.
I haven't looked into Ian's association mechanism yet. Maybe the
query system could/should be an extension to that with some
specialized features to apply filter criteria instead (and
possibly evolve into something similar to Ruby's ActiveRecord). I
know the association mechanism is still being developed and I
haven't really seen anyone comment much on it other than what Ian
has mentioned. In one of Ian's comments, he said:
"A more general query language is probably the right solution
for this interface. The query language would know about
associations, derived indices, etc and perform query planning via
introspection over the class objects."
At the same time, Robert said on another thread:
"One might philosophically prefer SQL. I personally vaster
prefer to work in a powerful programming language to accomplish
these things. Obviously, whether two classes that refer to each
other stand in a "parent-child" relationship or not depends
entirely on the circumstances. I prefer to write simple functions
such as "delete-order" below, which both utilize and (in a sense)
expand the power of LISP applied to persistent objects."
Leslie said on yet another thread:
"While I'm at it: OFFSET and LIMIT (a real limit which lets you
specify an arbitrary Lisp expression) are things we definitely
want to aim for in 1.0. They are not difficult to implement at
all, but they don't work with GET-INSTANCES-BY-* and, worse, MAP-
BTREE. This means everyone has to write their own version of these
functions that take appropriate arguments and move the cursor
around themselves instead of relying on a simple high-level API.
I'd have implemented these extensions myself, but I thought it
better to wait for the integration of the query language to add it."
And Alex said:
"I think main problem is not how it looks, but that query
language actually makes programming a lot easier."
All those comments make sense. There seems to be a group agreement
that something is needed, but everyone has their own ideas of how
it should work. Both the query language and the associations are
still being developed, so if we get consensus no how these should
work, it may give a better direction to both feature sets. If
anyone has any comments or suggestion as to whether a query system
be of real interest/necessity and if so, which would be the
preferred query syntax and expected behavior, that would really
help.
I'm willing to work on this in as much as possible with my limited
knowledge of Lisp and Elephant. However, given a clear direction
of where this should go, I will be able to focus better and learn
faster what I haven't learned so far.
Again, your feedback is much appreciated. I'm hopeful to be able
to work more on this over the weekend, assuming I get some
feedback from you guys.
Thanks
Daniel
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel
_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel