On Oct 3, 1:16 pm, Ross Gayler <r.gay...@gmail.com> wrote:
> Thanks Michael.
>
> > This sounds very similar to NoSQL and Map/Reduce?
>
> I'm not so sure about that (which may be mostly due to my ignorance of
> NoSQL and Map/Reduce). The amount of data involved in my problem is
> quite small and any infrastructure aimed at massive scaling may bring
> a load of conceptual and implementation baggage that is unnecessary/
> unhelpful.
>
> Let me restate my problem:
>
> I have a bunch of statistician colleagues with minimal programming
> skills. (I am also a statistician, but with slightly better
> programming skills.) As part of our analytical workflow we take data
> sets and preprocess them by adding new variables that are typically
> aggregate functions of other values. We source the data from a
> database/file, add the new variables, and store the augmented data in
> a database/file for subsequent, extensive and extended (a couple of
> months) analysis with other tools (off the shelf statistical packages
> such as SAS and R).  After the analyses are complete, some subset of
> the preprocessing calculations need to be implemented in an
> operational environment. This is currently done by completely
> re-implementing them in yet another fairly basic imperative language.
>
> The preprocessing in our analytical environment is usually written in
> a combination of SQL and the SAS data manipulation language (think of
> it as a very basic imperative language with macros but no user-defined
> functions). The statisticians take a long time to get their
> preprocessing right (they're not good at nested queries in SQL and
> make all the usual errors iterating over arrays of values with
> imperative code). So my primary goal is to find/build a query language
> that minimises the cognitive impedance mismatch with the statisticians
> and minimises their opportunity for error.
>
> Another goal is that the same mechanism should be applicable in our
> statistical analytical environment and the corporate deployment
> environment(s). The most different operational environment is online
> and realtime. The data describing one case gets thrown at some code
> that (among other things) implements the preprocessing with some
> embedded imperative code. So, linking in some Java byte code to do the
> preprocessing on a single case sounds feasible, whereas replacing/
> augmenting the current corporate infrastructure with NoSQL and a CPU
> farm is more aggravation with corporate IT than I am paid for.
>
> The final goal is that the preprocessing mechanism should be no slower
> than the current methods in each of the deployment environments. The
> hardest one is probably in our statistical analysis environment, but
> there we do have the option of farming the work across multiple CPUs
> if needed.
>
> Let me describe the computational scale of the problem - it is really
> quite small.
>
> Data is organised as completely independent cases.  One case might
> contain 500 primitive values for a total size of ~1kb. Preprocessing
> might calculate another 500 values, each of those being an aggregate
> function of some subset (say, 20 values) of the original 500 values.
> Currently, all these new values are calculated independently of each
> other, but there is a lot of overlap of intermediate results and,
> therefore, potential for optimisation of the computational effort
> required to calculate the entire set of results within a single case.
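That per-case shape (a flat map of primitives, plus derived aggregates that share intermediate results) might look roughly like this in plain Clojure; the field names and the particular aggregates below are invented for illustration, not taken from the actual data:

```clojure
;; Sketch only -- the fields (:amt-1 ...) and aggregates are hypothetical.
(def case-data {:amt-1 100, :amt-2 250, :amt-3 75})

(defn derive-vars
  "Add derived variables to a case. Intermediate results (here the
  vector of amounts and its total) are computed once and shared."
  [c]
  (let [amts  ((juxt :amt-1 :amt-2 :amt-3) c) ; shared intermediate
        total (reduce + amts)]                ; reused by two results
    (assoc c
           :amt-total total
           :amt-max   (apply max amts)
           :amt-mean  (/ total (count amts)))))

(derive-vars case-data)
;; => {:amt-1 100, :amt-2 250, :amt-3 75,
;;     :amt-total 425, :amt-max 250, :amt-mean 425/3}
```

Since each case is independent and derive-vars is a pure function, pushing 1M cases through it in parallel (e.g. with pmap) would be straightforward, which fits the "farm the work across multiple CPUs if needed" fallback.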
>
> In our statistical analytical environment the preprocessing is carried
> out in batch mode. A large dataset might contain 1M cases (~1GB of
> data). We can churn through the preprocessing at ~300 cases/second on
> a modest PC.  Higher throughput in our analytical environment would be
> a bonus, but not essential.
>
> So I see the problem as primarily about the conceptual design of the
> query language, with some side constraints about implementation
> compatibility across a range of deployment environments and adequate
> throughput performance.
>
> As I mentioned in an earlier post, I'll probably assemble a collection
> of representative queries, express them in a variety of query
> languages, and try to assess how compatible the different query
> languages are with the way my colleagues want to think about the
> problem.

Seeing examples (perhaps quite a few of them) will certainly be
useful. (Due to my non-stats background) I may not have understood
your use-cases correctly, but are these helpful for you?

http://github.com/MrHus/rql

http://bitbucket.org/kumarshantanu/sqlrat/wiki/Clause

The SQLRat clause API will be part of the 0.2 release (expected very
soon).

Regards,
Shantanu

>
> Ross
>
> On Oct 3, 11:31 am, Michael Ossareh <ossa...@gmail.com> wrote:
>
>
>
> > On Fri, Oct 1, 2010 at 17:55, Ross Gayler <r.gay...@gmail.com> wrote:
> > > Hi,
>
> > > This is probably an abuse of the Clojure forum, but it is a bit
> > > Clojure-related and strikes me as the sort of thing that a bright,
> > > eclectic bunch of Clojure users might know about. (Plus I'm not really
> > > a software person, so I need all the help I can get.)
>
> > > I am looking at the possibility of finding/building a declarative data
> > > aggregation language operating on a small relational representation.
> > > Each query identifies a set of rows satisfying some relational
> > > predicate and calculates some aggregate function of a set of values
> > > (e.g. min, max, sum). There might be ~20 input tables of up to ~1k
> > > rows.  The data is immutable - it gets loaded and never changed. The
> > > results of the queries get loaded as new rows in other tables and are
> > > eventually used as input to other computations. There might be ~1k
> > > queries. There is no requirement for transaction management or any
> > > inherent concurrency (there is only one consumer of the results).
> > > There is no requirement for persistent storage - the aggregation is
> > > the only thing of interest. I would like the query language to map as
> > > directly as possible to the task (SQL is powerful enough, but can get
> > > very contorted and opaque for some of the queries). There is
> > > considerable scope for optimisation of the calculations over the total
> > > set of queries as partial results are common across many of the
> > > queries.
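One such query (identify the rows satisfying a predicate, then apply an aggregate function to one column) can be expressed directly over in-memory Clojure data; the payments table and its columns here are made up for illustration:

```clojure
;; Sketch only -- the table and column names are hypothetical.
;; A "table" is just a vector of maps; no persistence layer needed.
(def payments
  [{:account 1, :amount 120}
   {:account 1, :amount 80}
   {:account 2, :amount 45}])

(defn aggregate
  "Reduce `field` with `f` over the rows of `table` matching `pred`."
  [f field pred table]
  (->> table (filter pred) (map field) (reduce f)))

(aggregate max :amount #(= 1 (:account %)) payments) ;; => 120
(aggregate +   :amount #(= 1 (:account %)) payments) ;; => 200
```

For the relational side, clojure.set already provides select, project, and join over sets of maps, though it has no aggregation operators built in.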
>
> > > I would like to be able to do this in Clojure (which I have not yet
> > > used), partly for some very practical reasons to do with Java interop
> > > and partly because Clojure looks very cool.
>
> > > * Is there any existing Clojure functionality which looks like a good
> > > fit to this problem?
>
> > > I have looked at Clojure-Datalog. It looks like a pretty good fit
> > > except that it lacks the aggregation operators. Apart from that the
> > > deductive power is probably greater than I need (although that doesn't
> > > necessarily cost me anything). I know that there are other
> > > (non-Clojure) Datalog implementations that have been extended with
> > > aggregation operators (e.g. DLV
> > > http://www.mat.unical.it/dlv-complex/dlv-complex).
>
> > > Tutorial D (what SQL should have been:
> > > http://en.wikipedia.org/wiki/D_%28data_language_specification%29#Tuto...)
> > > might be a good fit, although once again, there is probably a lot of
> > > conceptual and implementation baggage (e.g. Rel
> > > http://dbappbuilder.sourceforge.net/Rel.php)
> > > that I don't need.
>
> > > * Is there a Clojure implementation of something like Tutorial D?
>
> > > If there is no implementation of anything that meets my requirements
> > > then I would be willing to look at the possibility of creating a
> > > domain-specific language. However, I am wary of launching straight
> > > into that because of the probability that anything I dreamed up would
> > > be an ad hoc kludge rather than a semantically complete and consistent
> > > language. Optimised execution would be a whole other can of worms.
>
> > > * Does anyone know of any DSLs/formalisms for declaratively specifying
> > > relational data aggregations?
>
> > > Thanks
>
> > > Ross
>
> > This sounds very similar to NoSQL and Map/Reduce?
> > http://www.basho.com/Riak.html
>
> > Where your predicate is a reduce fn?

