For anyone following along: the chain went private for a bit, but there were still issues with the bytecode generation in the 2.0-preview, so this JIRA was created: https://issues.apache.org/jira/browse/SPARK-15786
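For anyone who wants to try reproducing it, a minimal sketch of the failing pattern described later in this thread might look like the following (illustrative data and names, against a local 2.0-preview build; the target type for as[] follows the joinWith result described below):

    import org.apache.spark.sql.SparkSession

    // Sketch of the reported failure: a left outer joinWith between two
    // Dataset[(Int, Int)], re-typed with as[] so one side becomes an Option.
    // This compiles, but on the affected versions it was reported to fail at
    // runtime when Catalyst generates bytecode for the decoders.
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, 10), (2, 20)).toDS().as("left")
    val right = Seq((1, 100)).toDS().as("right")

    val joined = left
      .joinWith(right, $"left._1" === $"right._1", "left_outer")
      .as[(Option[(Int, Int)], (Int, Int))]

    joined.show() // triggers analysis/codegen, where the failure surfaces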
On Mon, Jun 6, 2016 at 1:11 PM, Michael Armbrust <mich...@databricks.com> wrote:

That kind of stuff is likely fixed in 2.0. If you can get a reproduction working there, it would be very helpful if you could open a JIRA.

On Mon, Jun 6, 2016 at 7:37 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

A quick unit test attempt didn't get far replacing map with as[]. I'm only working against 1.6.1 at the moment, though; I was going to try 2.0, but I'm having a hard time building a working spark-sql jar from source. The only ones I've managed to make are intended for the full assembly fat jar.

Here is an example of the error from calling joinWith with left_outer and then .as[(Option[T], U)], where T and U are Int and Int:

[info] newinstance(class scala.Tuple2,decodeusingserializer(input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true),decodeusingserializer(input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true),false,ObjectType(class scala.Tuple2),None)
[info] :- decodeusingserializer(input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true)
[info] :  +- input[0, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))]
[info] +- decodeusingserializer(input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))],scala.Option,true)
[info]    +- input[1, StructType(StructField(_1,IntegerType,true), StructField(_2,IntegerType,true))]

Cause: java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 32, Column 60: No applicable constructor/method found for actual parameters "org.apache.spark.sql.catalyst.InternalRow"; candidates are: "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[])", "public static java.nio.ByteBuffer java.nio.ByteBuffer.wrap(byte[], int, int)"

The generated code is passing InternalRow objects into ByteBuffer.wrap. This starts from two Datasets of type Dataset[(Int, Int)] joined with the expression $"left._1" === $"right._1". I'll have to spend some time getting a better understanding of this analysis phase, but hopefully I can come up with something.

On Wed, Jun 1, 2016 at 3:43 PM, Michael Armbrust <mich...@databricks.com> wrote:

Option should play nicely with encoders, but it's always possible there are bugs. I think those function signatures are slightly more expensive (one extra object allocation), and it's not as Java-friendly, so we probably don't want them to be the default.

That said, I would like to enable that kind of sugar while still taking advantage of all the optimizations going on under the covers. Can you get it to work if you use `as[...]` instead of `map`?

On Wed, Jun 1, 2016 at 11:59 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

Ah, thanks, I missed seeing the PR for https://issues.apache.org/jira/browse/SPARK-15441. If the rows become null objects, then I can implement methods that map those back to results that align more closely with the RDD interface.

As a follow-on, I'm curious about thoughts regarding enriching the Dataset join interface itself versus a package, or users sugaring for themselves. I haven't considered the implications of what the optimizations in Datasets, Tungsten, and/or bytecode generation can do now regarding joins, so I may be missing a critical benefit there around, say, avoiding Options in favor of nulls. If nothing else, I guess Option doesn't have a first-class Encoder or DataType yet, and maybe for good reasons.
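A strawman of what that user-side sugar could look like (a hypothetical sketch: leftOuterJoinWith is not an existing Dataset method, and it assumes unmatched right-side rows come back as null per SPARK-15441 and that an Option encoder is resolvable on your version):

    import org.apache.spark.sql.{Column, Dataset, Encoder}

    // Hypothetical enrichment restoring RDD-style Option semantics on top of
    // joinWith: unmatched right-side rows (null after SPARK-15441) are
    // translated back into None. On versions where Option encoders misbehave,
    // the map step may hit the same codegen failure shown earlier.
    object DatasetJoinSyntax {
      implicit class RichJoins[T](val left: Dataset[T]) {
        def leftOuterJoinWith[U](right: Dataset[U], condition: Column)(
            implicit enc: Encoder[(T, Option[U])]): Dataset[(T, Option[U])] =
          left.joinWith(right, condition, "left_outer")
            .map { case (l, r) => (l, Option(r)) }
      }
    }

Usage would then read much like the RDD version, e.g. left.leftOuterJoinWith(right, $"left._1" === $"right._1").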
I did find the RDD join interface elegant, though. In the ideal world, an API comparable to the following would be nice: https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06

On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote:

Thanks for the feedback. I think this will address at least some of the problems you are describing: https://github.com/apache/spark/pull/13425

On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarsc...@localytics.com> wrote:

Hi,

I've been working on transitioning from RDD to Datasets in our codebase in anticipation of being able to leverage features of 2.0.

I'm having a lot of difficulty with the impedance mismatches between how outer joins worked with RDD versus Dataset. The Dataset joins feel like a big step backwards, IMO. With RDD, leftOuterJoin would give you Option types for the results from the right side of the join. This follows idiomatic Scala in avoiding nulls and was easy to work with.

Now with Dataset there is only joinWith, where you specify the join type, but it has lost all the semantics of identifying missing data from outer joins. I could write some enriched methods on Dataset with an implicit class to abstract the messiness away if Dataset nulled out all mismatched data from an outer join; however, the problem goes even further in that the values aren't always null. Integer, for example, defaults to -1 instead of null. Now it's completely ambiguous which data in the join was actually there versus populated via this atypical semantic.
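For contrast, a rough sketch of the RDD behavior being described (illustrative data; `sc` is assumed to be an existing SparkContext, and the -1 defaulting for Int is as reported above for 1.6):

    // RDD outer joins make the miss explicit in the result type:
    // leftOuterJoin returns RDD[(K, (V, Option[W]))].
    val l = sc.parallelize(Seq((1, "a"), (2, "b")))
    val r = sc.parallelize(Seq((1, 10)))
    l.leftOuterJoin(r).collect()
    // => Array((1, ("a", Some(10))), (2, ("b", None)))

    // Dataset's joinWith with "left_outer" carries no Option in its result
    // type, and a missing Int was observed to surface as -1 rather than null,
    // so a genuine -1 in the data is indistinguishable from a miss.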
Are there additional options available to work around this issue? I can convert to RDD and back to Dataset, but that's less than ideal.

Thanks,
--
*Richard Marscher*
Senior Software Engineer
Localytics
Localytics.com <http://localytics.com/> | Our Blog <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | Facebook <http://facebook.com/localytics> | LinkedIn <http://www.linkedin.com/company/1148792?trk=tyah>