For anyone following along: the chain went private for a bit, but there were still issues with the bytecode generation in the 2.0-preview, so this JIRA was created: https://issues.apache.org/jira/browse/SPARK-15786
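For anyone who wants to try reproducing it, a minimal sketch of the failing pattern described later in this thread might look like the following (illustrative data and names, against a local 2.0-preview build; the target type for as[] follows the joinWith result described below):

    import org.apache.spark.sql.SparkSession

    // Sketch of the reported failure: a left outer joinWith between two
    // Dataset[(Int, Int)], re-typed with as[] so one side becomes an Option.
    // This compiles, but on the affected versions it was reported to fail at
    // runtime when Catalyst generates bytecode for the decoders.
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, 10), (2, 20)).toDS().as("left")
    val right = Seq((1, 100)).toDS().as("right")

    val joined = left
      .joinWith(right, $"left._1" === $"right._1", "left_outer")
      .as[(Option[(Int, Int)], (Int, Int))]

    joined.show() // triggers analysis/codegen, where the failure surfaces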
On Mon, Jun 6, 2016 at 1:11 PM, Michael Armbrust <mich...@databricks.com> wrote:

That kind of stuff is likely fixed in 2.0. If you can get a reproduction working there, it would be very helpful if you could open a JIRA.

On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

A quick unit test attempt didn't get far replacing map with as[]. I'm only working against 1.6.1 at the moment, though; I was going to try 2.0, but I'm having a hard time building a working spark-sql jar from source. The only ones I've managed to make are intended for the full assembly fat jar.

Here is an example of the error from calling joinWith with left_outer and then .as[(Option[T], U)], where T and U are Int and Int:

[info] newinstance(class scala.Tuple2,decodeusingserializer(input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class scala.Tuple2),None)
[info] :- decodeusingserializer(input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true)
[info] :  +- input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))]
[info] +- decodeusingserializer(input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true)
[info]    +- input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))]

Cause: java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 32, Column 60: No applicable constructor/method found for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)"

The generated code is passing InternalRow objects into ByteBuffer.wrap. This starts from two Datasets of type Dataset[(Int, Int)] joined with the expression $"left._1" === $"right._1". I'll have to spend some time getting a better understanding of this analysis phase, but hopefully I can come up with something.

On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust <mich...@databricks.com> wrote:

Option should play nicely with encoders, but it's always possible there are bugs. I think those function signatures are slightly more expensive (one extra object allocation), and it's not as Java-friendly, so we probably don't want them to be the default.

That said, I would like to enable that kind of sugar while still taking advantage of all the optimizations going on under the covers. Can you get it to work if you use `as[...]` instead of `map`?

On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

Ah, thanks, I missed seeing the PR for https://issues.apache.org/jira/browse/SPARK-15441. If the rows become null objects, then I can implement methods that map those back to results that align more closely with the RDD interface.

As a follow-on, I'm curious about thoughts regarding enriching the Dataset join interface itself versus a package, or users sugaring for themselves. I haven't considered the implications of what the optimizations in Datasets, Tungsten, and/or bytecode generation can do now regarding joins, so I may be missing a critical benefit there around, say, avoiding Options in favor of nulls. If nothing else, I guess Option doesn't have a first-class Encoder or DataType yet, and maybe for good reasons.
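A strawman of what that user-side sugar could look like (a hypothetical sketch: leftOuterJoinWith is not an existing Dataset method, and it assumes unmatched right-side rows come back as null per SPARK-15441 and that an Option encoder is resolvable on your version):

    import org.apache.spark.sql.{Column, Dataset, Encoder}

    // Hypothetical enrichment restoring RDD-style Option semantics on top of
    // joinWith: unmatched right-side rows (null after SPARK-15441) are
    // translated back into None. On versions where Option encoders misbehave,
    // the map step may hit the same codegen failure shown earlier.
    object DatasetJoinSyntax {
      implicit class RichJoins[T](val left: Dataset[T]) {
        def leftOuterJoinWith[U](right: Dataset[U], condition: Column)(
            implicit enc: Encoder[(T, Option[U])]): Dataset[(T, Option[U])] =
          left.joinWith(right, condition, "left_outer")
            .map { case (l, r) => (l, Option(r)) }
      }
    }

Usage would then read much like the RDD version, e.g. left.leftOuterJoinWith(right, $"left._1" === $"right._1").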
I did find the RDD join interface elegant, though. In the ideal world, an API comparable to the following would be nice: https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06

On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote:

Thanks for the feedback. I think this will address at least some of the problems you are describing: https://github.com/apache/spark/pull/13425

On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

Hi,

I've been working on transitioning from RDD to Datasets in our codebase in anticipation of being able to leverage features of 2.0.

I'm having a lot of difficulty with the impedance mismatches between how outer joins worked with RDD versus Dataset. The Dataset joins feel like a big step backwards, IMO. With RDD, leftOuterJoin would give you Option types for the results from the right side of the join. This follows idiomatic Scala in avoiding nulls and was easy to work with.

Now with Dataset there is only joinWith, where you specify the join type, but it has lost all the semantics of identifying missing data from outer joins. I could write some enriched methods on Dataset with an implicit class to abstract the messiness away if Dataset nulled out all mismatched data from an outer join; however, the problem goes even further in that the values aren't always null. Integer, for example, defaults to -1 instead of null. Now it's completely ambiguous which data in the join was actually there versus populated via this atypical semantic.
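For contrast, a rough sketch of the RDD behavior being described (illustrative data; `sc` is assumed to be an existing SparkContext, and the -1 defaulting for Int is as reported above for 1.6):

    // RDD outer joins make the miss explicit in the result type:
    // leftOuterJoin returns RDD[(K, (V, Option[W]))].
    val l = sc.parallelize(Seq((1, "a"), (2, "b")))
    val r = sc.parallelize(Seq((1, 10)))
    l.leftOuterJoin(r).collect()
    // => Array((1, ("a", Some(10))), (2, ("b", None)))

    // Dataset's joinWith with "left_outer" carries no Option in its result
    // type, and a missing Int was observed to surface as -1 rather than null,
    // so a genuine -1 in the data is indistinguishable from a miss.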
Are there additional options available to work around this issue? I can convert to RDD and back to Dataset, but that's less than ideal.

Thanks,
--
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | Facebook <http://facebook.com/localytics> | LinkedIn <http://www.linkedin.com/company/1148792?trk=tyah>