Hey Julian,

Most of the general discussion surrounding a high-level language for Samza can be found at:
https://issues.apache.org/jira/browse/SAMZA-390

Early mockups of Yi's work on what the lower-level APIs (the phases after AST rewriting) might look like can also be found at:
https://issues.apache.org/jira/browse/SAMZA-482

Much of the work is actually derived from CQL. However, CQL is a bit obscure, so in casual discussion the work is commonly compared to SQL, since SQL is more widely known.

We can always use another set of eyes on something this complex and would appreciate any comments you have.

-Jon

On Jan 27, 2015, at 5:34 PM, Julian Hyde <jh...@apache.org> wrote:

> Hi all,
>
> This is my first post to the Samza list. I heard from Chris and Jay
> that you guys were looking into putting a SQL interface on Samza, so I
> thought I'd take a look.
>
> My background is in the SQL world, most recently with Apache Calcite,
> (although I have quite a lot of experience with streaming too) so
> forgive me if I am speaking a foreign language or seem to be coming at
> this from a completely different direction. Also forgive me if I have
> missed preceding discussions and I am opening up areas that have been
> settled already.
>
> I was surprised that one of the first goals is to create a SQL API.
> SQL is a textual language; a lot of the nuance (e.g. scope of
> identifiers) is lost when you convert it to a linear builder API. Now,
> it definitely makes sense to have a SQL AST (abstract syntax tree),
> that can be created by hand-written code or by a parser. And you can
> create an AST builder, if you like. But there is not a simple mapping
> between true SQL and a data-flow graph that you can execute. If you
> imagine that there is a simple mapping, you will achieve great results
> with simple SELECT-FROM-WHERE queries but hit the wall when you hit
> the hard stuff. You will end up -- as so many others have -- with a
> SQL-like language. Close but no cigar.
>
> Case in point: Spark (and Spark-streaming) is a SQL-like language that
> looks similar to the proposed Samza API, and now they are building
> SparkSQL from the ground up.
>
> I think the way to approach this is to have a SQL parser and a logical
> algebra. The logical algebra looks very similar to relational algebra,
> maybe with one or two extensions for streaming. (A lot of SQL features
> -- such as query blocks, sub-queries, correlated variables, aliases,
> views and the HAVING clause -- are not present in the algebra.)
> Between the parser and the logical algebra is an AST, a validator, and
> a translator from the AST to the algebra. And then there is a physical
> algebra, which is Samza of course.
>
> Maybe the proposed SQL object model is in fact that logical algebra.
> But I'd recommend that you not call it SQL; in fact it should be a
> non-goal that an end-user would use that API and think that they are
> in any way creating a "SQL query".
>
> Julian
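
For readers following the thread, the layering Julian describes (AST, then validator, then translator, then logical algebra, then a physical layer) can be sketched in a few lines. This is only a toy illustration in Python, not Samza or Calcite code: every class and function name below (SelectAst, Scan, Filter, Project, resolve, translate, execute) is hypothetical, the "parser" is skipped by building the AST by hand, and the "physical algebra" is simply iteration over in-memory lists. The point it tries to show is that the alias exists only in the AST and is gone by the time the algebra is built.

```python
import operator
from dataclasses import dataclass
from typing import Optional


# --- AST: what a parser (or hand-written code) would produce. ---
@dataclass
class SelectAst:
    columns: list                  # possibly alias-qualified, e.g. "o.id"
    table: str
    alias: Optional[str] = None    # SQL-level concept; absent from the algebra
    where: Optional[tuple] = None  # (column, op, literal)


# --- Logical algebra: plain operators, no aliases or query blocks. ---
@dataclass
class Scan:
    table: str


@dataclass
class Filter:
    input: object
    column: str
    op: str
    value: object


@dataclass
class Project:
    input: object
    columns: list


def resolve(name, ast, catalog):
    """Validator step: map a possibly qualified name to a real column."""
    qualifier, _, column = name.rpartition(".")
    if qualifier and qualifier not in (ast.alias, ast.table):
        raise ValueError(f"unknown qualifier: {qualifier}")
    if column not in catalog[ast.table]:
        raise ValueError(f"unknown column: {column}")
    return column


def translate(ast, catalog):
    """Translator step: validated AST -> logical algebra tree."""
    if ast.table not in catalog:
        raise ValueError(f"unknown table: {ast.table}")
    plan = Scan(ast.table)
    if ast.where is not None:
        column, op, value = ast.where
        plan = Filter(plan, resolve(column, ast, catalog), op, value)
    return Project(plan, [resolve(c, ast, catalog) for c in ast.columns])


OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt}


def execute(plan, data):
    """Stand-in for a physical layer: evaluate the plan over in-memory rows."""
    if isinstance(plan, Scan):
        yield from data[plan.table]
    elif isinstance(plan, Filter):
        compare = OPS[plan.op]
        for row in execute(plan.input, data):
            if compare(row[plan.column], plan.value):
                yield row
    elif isinstance(plan, Project):
        for row in execute(plan.input, data):
            yield {c: row[c] for c in plan.columns}
    else:
        raise TypeError(f"unknown operator: {plan!r}")


if __name__ == "__main__":
    # Roughly: SELECT o.id FROM orders AS o WHERE o.amount > 100
    catalog = {"orders": {"id", "amount"}}
    query = SelectAst(["o.id"], "orders", alias="o", where=("o.amount", ">", 100))
    plan = translate(query, catalog)
    print(plan)
    print(list(execute(plan, {"orders": [{"id": 1, "amount": 50},
                                         {"id": 2, "amount": 150}]})))
```

Even in this toy, name resolution has to happen before the algebra is built, which is the kind of nuance a linear builder API tends to lose. Sub-queries, correlated variables and HAVING would need considerably more machinery in the validator and translator than is shown here.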