Re: [elephant-devel] Design Suggestion Request

Daniel Salama Thu, 27 Jul 2006 10:45:57 -0700

Wow! What a great set of responses.

Thanks so much for the information.

Robert, I understand your approach. However, I don't know if usingDCM at my beginner's stage may be more complicated. I also thinkthat, although RAM is getting cheaper every day, there are justphysical limitations that machines have. I'm looking over myapplication's data and it's not that bad, considering that after 2years worth of data, I'm using approximately 1GB of hard drive space.So, independently of that, and if I understood you correctly, whenI'm abstracting and accessing my model through the director, itsounds to me that either I persist all on storage or in cache (foreach director). If I choose cache, then I have to manually write codeto persist changes, instead of it being "automatic".

I do like the whole DCM concept from a performance perspective but Icertainly have to study further that whole framework to see how toput it into good use.

Now, Ian, I think that what you explained addressed my immediateneeds, considering my current knowledge of Elephant. I guess when Imentioned collections in my original email I should have been morespecific to CLOS persistent objects, which would be more efficient.So, in a way, that's what I was looking for. What I didn't graspclear enough was the secondary indexing and querying model, whichcertainly seem to do what I initially have in mind.

I do think that I will need to implement more complex indices, suchas indices spanning over multiple slots, as you mentioned, so if youdon't mind expanding on what you started typing regarding the cursoriteration functions or models, I would really appreciate it.

The other question I have is: I couldn't find documentation on theelephant online manual/tutorial regarding defpclass, get-instances-by-value, get-instances-by-range, and select-if. Is there any formaldocumentation about it or were those presented by you as pseudocodesamples?

Overall, I couldn't estimate at this moment how much it would benefitme by using cache vs hitting the storage penalty. As I said, I thinkI'll start with Ian's suggestions and as I learn more of Elephant andDCM, I'll start migrating/integrating DCM into the picture.


Thanks again. I will keep you guys posted about my progress.

Daniel

On Jul 27, 2006, at 10:08 AM, Ian Eslick wrote:

Another way to get  reasonable reporting efficiency for queries with

multiple constraints is to use cursors over secondary indicesdirectly.


For your example I would create a CLOS class in Elephant instead of a
table.  Each instance of this class is like a record in a table.  To

link two records together you have slots that contain references tothe

contained objects.  You don't keep collections, those are to be stored
in secondary indices so they are ordered and readout can be linear in
the size of the result set.

(defpclass customer
  ((name ... :index t)))

(defpclass invoice
  ((number ... :index t)
   (date ... :index t)
   (customer  ... :index t)))

You can now easily do BTree lookups using secondary indices via the
0.6.0 indexing mechanism:

(get-instances-by-value 'invoice 'number <invoice-integer>) =>returns a

single invoice
(get-instances-by-range 'invoice 'date <start date> <end date>) =>
returns a list of invoices between two dates
(get-instances-by-value 'invoice 'customer

These are highly efficient calls although they do require going todisk

unless you are using DCM, but the performance
for reports is not going to be a real problem as the secondary indices
are ordered so in retrieving a set of invoices over
dates so performance is roughly  O( # invoices in range + log # total
invoices ) and for small ranges is effectively
log 2N disk accesses (one for index, one for deserializing object if
it's not cached).

The rub is when you want to do a query for a customer's invoicesbetweentwo date ranges using the customer's name. In SQL this is a joinquery

and the query compiler will typically optimize how this works.  If
either set is likely to be small (# of invoices per customer or # of
invoices in a date range) you can write your own query to fetch one
subset and filter it:

(defun get-customer-invoices-for-dates (name start-date end-date)

(let ((customer (car (get-instances-by-value 'customer 'namename))))(select-if (lambda (invoice) (in-date-range-p invoice start-date

end-date))
          (get-instances-by-value 'invoice 'customer customer))))

To handle joins over large collections, a more sophisticated function
using cursors would need to be written and in

some cases you'd need to generate some new index data structures toget

the time down to O ( N log M ) with N the target dataset and M the

largest # of instances for any class involved in the query. Aneasy way

to do this is to use derived indices which, for example, can use any

number of slots in an object to create an ordering - so you can pre-sortobjects by date and then by customer so you can skip over customer-date

pairs as you traverse the btree.  That might require some more
explanation but I dont' have the time just now.  :)

In a future release we hope to integrate DCM/prevalence style

functionality more directly into the persistent object system socommonlinear queries are fast and only reports over non-resident(infrequently

and not recently accessed) objects is expensive.

Good luck and let us know if you need more suggestions!

Ian

Robert L. Read wrote:

On Wed, 2006-07-26 at 17:36 -0400, Daniel Salama wrote:

The other approach I thought would be to model it similarly as tohow

I would do it in a relational database. Basically, I would create
separate collections of objects representing the tables I would have

in the relational database. Then, within each object, e.g. acustomer

object, I would create a reference to a collection that holds a
subset of invoices, for example. This would allow me to simply query
the invoices collection of a customer and that's it. At the same
time, I would be able to query the entire invoices collection.

Dear Daniel,
    I think this approach is much better than creating a very large
object.

    Personally, I have an opinion a lot of people disagree with --- I
use the "prevalence" model,
which is basically that I keep all of the objects in memory, and when
I change something I
write back to the data store.   This pretty much makes your reporting
efficiency issues
go away, because you can compute any report really, really fast.

    I have checked in, in the "contrib" directory, a packaged called
DCM, for Data Collection Management,
that does the in-memory management --- the responsibility of the user
is to call "register-object" whenever

an object needs to be back. DCM also supports the "reference"problem

that you mention --- that is,
instead of putting a whole object into a slot, you put the key there
and look it up in a separate object.

In this model, each class of object you would objectify(which is

very similar to the "tables" in
relational model or "entities" in the entity-relational model.)  Each
should class gets a "director", and
you operate against the director when you do something.  One of the
advantages of this approach is
that you can choose the "strategy" for each director --- so you can
choose to cache the objects in
memory, or to directly use the database store, or even to use a
generational system.

I think the tests of DCM could be considered a little bit of
pseudocode for what you want.

In considering whether or not things should be kept in memory,one

should do the math: the
maximum number of objects * their size vs. the free memory.  Memories
are big and getting bigger.

    Let me know if this addresses you ideas or if you have any other
questions; Ian and I had

previously agreed that the lack of a big example application isone of

the biggest weaknesses in
Elephant right now.

------------------------------------------------------------------------


_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel

_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel


_______________________________________________
elephant-devel site list
elephant-devel@common-lisp.net
http://common-lisp.net/mailman/listinfo/elephant-devel

Re: [elephant-devel] Design Suggestion Request

Reply via email to