Re: Joining Avro records

Julian Hyde Mon, 13 Apr 2015 11:42:06 -0700

I think using nested fields would be a mistake. In SQL, you have two namespaces 
when joining two tables (or N namespaces when joining N tables), but what 
outputs the query is a flat namespace.


For example:

  select emp.name
  from (
    select emp.*, dept.*
    from emp join dept on emp.deptno = dept.deptno) AS v

is not legal SQL. Why? Because the “emp” qualifier does not come out of the 
sub-query. Only the column names are visible. The sub-query produces flat rows. 
This is good. It allows you to substitute a view:

select v.emp_name
from v

that produces the same result by other means.

Now, you can produce nested rows in modern SQL, if that is what you want. For 
example, an order record could contain a nested list of line-items. Nested rows 
are useful, especially for streaming queries. But you need to use explicit 
operators to create that nesting. Regular join produces flat rows.

Some systems, such as Pig and Cascading, perform an operation similar to Join 
called Cogroup. The nested fields have their benefits, but they make it 
difficult to apply the association rule. In relational algebra and SQL, when 
you transform "(a join b) join c” into “a join (b join c)”, it has the same 
schema, but “cogroup(cogroup(a,  b), c)” has a different schema than 
“cogroup(a, cogroup(b, c))”.

Julian



On Apr 9, 2015, at 6:58 PM, Felix GV <[email protected]> wrote:

> I think the most straightforward mapping from SQL to Avro is probably to have 
> nested fields in the Avro record.
> 
> This is pretty similar to SQL aliases in a join query, like in Julian's 
> example.
> 
> With that strategy, the OP's schemas would be modeled as such:
> 
> {
>  "name": "A",
>  "type": "record",
>  "namespace": "fubar",
>  "fields": [{"name": "a", "type" : "int"}]
> }
> 
> {
>  "name": "B",
>  "type": "record",
>  "namespace": "fubar",
>  "fields": [{"name": "b", "type" : "int"}]
> }
> 
> {
>  "name": "A_join_B",
>  "type": "record",
>  "namespace": "fubar",
>  "fields": [
>    {"name": "A", "type" : "record", "fields": [{"name": "a", "type" : 
> "int"}]}, 
>    {"name": "B", "type" : "record", "fields": [{"name": "b", "type" : "int"}]}
>  ]
> }
> 
> --
> 
> Felix GV
> Data Infrastructure Engineer
> Distributed Data Systems
> LinkedIn
> 
> [email protected]
> linkedin.com/in/felixgv
> 
> ________________________________________
> From: Roger Hoover [[email protected]]
> Sent: Thursday, April 09, 2015 12:56 PM
> To: [email protected]
> Subject: Re: Joining Avro records
> 
> Thanks, Julian.  Good point about needing aliasing for unique names in
> SQL.  I didn't know about array_agg...nice.
> 
> On Thu, Apr 9, 2015 at 12:35 PM, Julian Hyde <[email protected]> wrote:
> 
>> Much of this is about mapping from logical fields (i.e. the fields you can
>> reference in SQL) down to the Avro representation; I’m no expert on that
>> mapping, so I’ll focus on the SQL stuff.
>> 
>> First, SQL doesn’t allow a record to have two fields of the same name, so
>> you wouldn’t be allowed to have two “name” fields. When you do a join, you
>> might need to alias output columns:
>> 
>> select stream orders.id, products.id as productId
>> from orders
>> join products on orders.id = products.id;
>> 
>> Second, JOIN isn’t the only SQL operator that combines records; GROUP BY
>> also combines records. JOIN combines records from different streams, and
>> they usually have different types (i.e. different numbers/types of fields),
>> whereas GROUP BY combines records from the same stream. Use whichever best
>> suits your purpose.
>> 
>> select stream zipcode, floor(rowtime to hour), array_agg(orderid) as
>> orderIds
>> from orders
>> group by zipcode, floor(rowtime to hour)
>> 
>> (array_agg is an aggregate function, recently added to the SQL standard,
>> that gathers input values into an array. See
>> http://www.craigkerstiens.com/2013/04/17/array-agg/.)
>> 
>> Output:
>> 
>> { zipcode: “94705”, rowtime: “2015-04-09 11:00:00”, orderIds: [123, 156,
>> 1056] },
>> { zipcode: “94117”, rowtime: “2015-04-09 11:00:00”, orderIds: [45, 777] },
>> { zipcode: “94705”, rowtime: “2015-04-09 12:00:00”, orderIds: [55] }
>> 
>> Julian
>> 
>> 
>> On Apr 9, 2015, at 12:07 PM, Yi Pan <[email protected]> wrote:
>> 
>>> Hi, Roger,
>>> 
>>> Good question on that. I am actually not aware of any "automatic" way of
>>> doing this in Avro. I have tried to add generic Schema and Data interface
>>> in samza-sql branch to address the morphing of the schemas from input
>>> streams to the output streams. The basic idea is to have wrapper Schema
>> and
>>> Data classes on-top-of the deserialized objects to access the data fields
>>> according to the schema w/o changing and copying the actual data fields.
>>> Hence, when there is a need to morph the input data schemas into a new
>>> output data schema, we just need an implementation of the new output data
>>> Schema class that can read the corresponding data fields from the input
>>> data and write them out in the output schema. An interface function
>>> transform() is added in the Schema class for this exact purpose.
>> Currently,
>>> it only takes one input data and one example of "projection"
>> transformation
>>> can be found in the implementation of AvroSchema class. A join case as
>> you
>>> presented may well be a reason to have an implementation of "join" with
>>> multiple input data.
>>> 
>>> All the above solution is still experimental and please feel free to
>>> provide your feedback and comments on that. If we agree that this
>> solution
>>> is good and suit for a broader use case, it can be considered to be used
>>> outside the "SQL" context as well.
>>> 
>>> Best regards!
>>> 
>>> -Yi
>>> 
>>> On Thu, Apr 9, 2015 at 8:55 AM, Roger Hoover <[email protected]>
>> wrote:
>>> 
>>>> Hi Milinda and others,
>>>> 
>>>> This is an Avro question but since you guys are working on Avro support
>> for
>>>> stream SQL, I thought I'd ask you for help.
>>>> 
>>>> If I have a two records of type A and B as below and want to join them
>>>> similar to "SELECT *" in SQL to produce a record of type AB, is there an
>>>> simple way to do this with Avro without writing code to copy each field
>>>> individually?
>>>> 
>>>> I appreciate any help.
>>>> 
>>>> Thanks,
>>>> 
>>>> Roger
>>>> 
>>>> {
>>>> "name": "A",
>>>> "type": "record",
>>>> "namespace": "fubar",
>>>> "fields": [{"name": "a", "type" : "int"}]
>>>> }
>>>> 
>>>> {
>>>> "name": "B",
>>>> "type": "record",
>>>> "namespace": "fubar",
>>>> "fields": [{"name": "b", "type" : "int"}]
>>>> }
>>>> 
>>>> {
>>>> "name": "AB",
>>>> "type": "record",
>>>> "namespace": "fubar",
>>>> "fields": [{"name": "a", "type" : "int"}, {"name": "b", "type" :
>> "int"}]
>>>> }
>>>> 
>> 
>>

Re: Joining Avro records

Reply via email to