I think using nested fields would be a mistake. In SQL, you have two namespaces when joining two tables (or N namespaces when joining N tables), but what outputs the query is a flat namespace.
For example: select emp.name from ( select emp.*, dept.* from emp join dept on emp.deptno = dept.deptno) AS v is not legal SQL. Why? Because the “emp” qualifier does not come out of the sub-query. Only the column names are visible. The sub-query produces flat rows. This is good. It allows you to substitute a view: select v.emp_name from v that produces the same result by other means. Now, you can produce nested rows in modern SQL, if that is what you want. For example, an order record could contain a nested list of line-items. Nested rows are useful, especially for streaming queries. But you need to use explicit operators to create that nesting. Regular join produces flat rows. Some systems, such as Pig and Cascading, perform an operation similar to Join called Cogroup. The nested fields have their benefits, but they make it difficult to apply the association rule. In relational algebra and SQL, when you transform "(a join b) join c” into “a join (b join c)”, it has the same schema, but “cogroup(cogroup(a, b), c)” has a different schema than “cogroup(a, cogroup(b, c))”. Julian On Apr 9, 2015, at 6:58 PM, Felix GV <fville...@linkedin.com.INVALID> wrote: > I think the most straightforward mapping from SQL to Avro is probably to have > nested fields in the Avro record. > > This is pretty similar to SQL aliases in a join query, like in Julian's > example. > > With that strategy, the OP's schemas would be modeled as such: > > { > "name": "A", > "type": "record", > "namespace": "fubar", > "fields": [{"name": "a", "type" : "int"}] > } > > { > "name": "B", > "type": "record", > "namespace": "fubar", > "fields": [{"name": "b", "type" : "int"}] > } > > { > "name": "A_join_B", > "type": "record", > "namespace": "fubar", > "fields": [ > {"name": "A", "type" : "record", "fields": [{"name": "a", "type" : > "int"}]}, > {"name": "B", "type" : "record", "fields": [{"name": "b", "type" : "int"}]} > ] > } > > -- > > Felix GV > Data Infrastructure Engineer > Distributed Data Systems > LinkedIn > > f...@linkedin.com > linkedin.com/in/felixgv > > ________________________________________ > From: Roger Hoover [roger.hoo...@gmail.com] > Sent: Thursday, April 09, 2015 12:56 PM > To: dev@samza.apache.org > Subject: Re: Joining Avro records > > Thanks, Julian. Good point about needing aliasing for unique names in > SQL. I didn't know about array_agg...nice. > > On Thu, Apr 9, 2015 at 12:35 PM, Julian Hyde <jul...@hydromatic.net> wrote: > >> Much of this is about mapping from logical fields (i.e. the fields you can >> reference in SQL) down to the Avro representation; I’m no expert on that >> mapping, so I’ll focus on the SQL stuff. >> >> First, SQL doesn’t allow a record to have two fields of the same name, so >> you wouldn’t be allowed to have two “name” fields. When you do a join, you >> might need to alias output columns: >> >> select stream orders.id, products.id as productId >> from orders >> join products on orders.id = products.id; >> >> Second, JOIN isn’t the only SQL operator that combines records; GROUP BY >> also combines records. JOIN combines records from different streams, and >> they usually have different types (i.e. different numbers/types of fields), >> whereas GROUP BY combines records from the same stream. Use whichever best >> suits your purpose. >> >> select stream zipcode, floor(rowtime to hour), array_agg(orderid) as >> orderIds >> from orders >> group by zipcode, floor(rowtime to hour) >> >> (array_agg is an aggregate function, recently added to the SQL standard, >> that gathers input values into an array. See >> http://www.craigkerstiens.com/2013/04/17/array-agg/.) >> >> Output: >> >> { zipcode: “94705”, rowtime: “2015-04-09 11:00:00”, orderIds: [123, 156, >> 1056] }, >> { zipcode: “94117”, rowtime: “2015-04-09 11:00:00”, orderIds: [45, 777] }, >> { zipcode: “94705”, rowtime: “2015-04-09 12:00:00”, orderIds: [55] } >> >> Julian >> >> >> On Apr 9, 2015, at 12:07 PM, Yi Pan <nickpa...@gmail.com> wrote: >> >>> Hi, Roger, >>> >>> Good question on that. I am actually not aware of any "automatic" way of >>> doing this in Avro. I have tried to add generic Schema and Data interface >>> in samza-sql branch to address the morphing of the schemas from input >>> streams to the output streams. The basic idea is to have wrapper Schema >> and >>> Data classes on-top-of the deserialized objects to access the data fields >>> according to the schema w/o changing and copying the actual data fields. >>> Hence, when there is a need to morph the input data schemas into a new >>> output data schema, we just need an implementation of the new output data >>> Schema class that can read the corresponding data fields from the input >>> data and write them out in the output schema. An interface function >>> transform() is added in the Schema class for this exact purpose. >> Currently, >>> it only takes one input data and one example of "projection" >> transformation >>> can be found in the implementation of AvroSchema class. A join case as >> you >>> presented may well be a reason to have an implementation of "join" with >>> multiple input data. >>> >>> All the above solution is still experimental and please feel free to >>> provide your feedback and comments on that. If we agree that this >> solution >>> is good and suit for a broader use case, it can be considered to be used >>> outside the "SQL" context as well. >>> >>> Best regards! >>> >>> -Yi >>> >>> On Thu, Apr 9, 2015 at 8:55 AM, Roger Hoover <roger.hoo...@gmail.com> >> wrote: >>> >>>> Hi Milinda and others, >>>> >>>> This is an Avro question but since you guys are working on Avro support >> for >>>> stream SQL, I thought I'd ask you for help. >>>> >>>> If I have a two records of type A and B as below and want to join them >>>> similar to "SELECT *" in SQL to produce a record of type AB, is there an >>>> simple way to do this with Avro without writing code to copy each field >>>> individually? >>>> >>>> I appreciate any help. >>>> >>>> Thanks, >>>> >>>> Roger >>>> >>>> { >>>> "name": "A", >>>> "type": "record", >>>> "namespace": "fubar", >>>> "fields": [{"name": "a", "type" : "int"}] >>>> } >>>> >>>> { >>>> "name": "B", >>>> "type": "record", >>>> "namespace": "fubar", >>>> "fields": [{"name": "b", "type" : "int"}] >>>> } >>>> >>>> { >>>> "name": "AB", >>>> "type": "record", >>>> "namespace": "fubar", >>>> "fields": [{"name": "a", "type" : "int"}, {"name": "b", "type" : >> "int"}] >>>> } >>>> >> >>