Ah thanks, I missed seeing the PR for https://issues.apache.org/jira/browse/SPARK-15441. If the rows became null objects then I can implement methods that will map those back to results that align closer to the RDD interface.
As a follow on, I'm curious about thoughts regarding enriching the Dataset join interface versus a package or users sugaring for themselves. I haven't considered the implications of what the optimizations datasets, tungsten, and/or bytecode gen can do now regarding joins so I may be missing a critical benefit there around say avoiding Options in favor of nulls. If nothing else, I guess Option doesn't have a first class Encoder or DataType yet and maybe for good reasons. I did find the RDD join interface elegant, though. In the ideal world an API comparable the following would be nice: https://gist.github.com/rmarsch/3ea78b3a9a8a0e83ce162ed947fcab06 On Wed, Jun 1, 2016 at 1:42 PM, Michael Armbrust <mich...@databricks.com> wrote: > Thanks for the feedback. I think this will address at least some of the > problems you are describing: https://github.com/apache/spark/pull/13425 > > On Wed, Jun 1, 2016 at 9:58 AM, Richard Marscher <rmarsc...@localytics.com > > wrote: > >> Hi, >> >> I've been working on transitioning from RDD to Datasets in our codebase >> in anticipation of being able to leverage features of 2.0. >> >> I'm having a lot of difficulties with the impedance mismatches between >> how outer joins worked with RDD versus Dataset. The Dataset joins feel like >> a big step backwards IMO. With RDD, leftOuterJoin would give you Option >> types of the results from the right side of the join. This follows >> idiomatic Scala avoiding nulls and was easy to work with. >> >> Now with Dataset there is only joinWith where you specify the join type, >> but it lost all the semantics of identifying missing data from outer joins. >> I can write some enriched methods on Dataset with an implicit class to >> abstract messiness away if Dataset nulled out all mismatching data from an >> outer join, however the problem goes even further in that the values aren't >> always null. Integer, for example, defaults to -1 instead of null. Now it's >> completely ambiguous what data in the join was actually there versus >> populated via this atypical semantic. >> >> Are there additional options available to work around this issue? I can >> convert to RDD and back to Dataset but that's less than ideal. >> >> Thanks, >> -- >> *Richard Marscher* >> Senior Software Engineer >> Localytics >> Localytics.com <http://localytics.com/> | Our Blog >> <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | >> Facebook <http://facebook.com/localytics> | LinkedIn >> <http://www.linkedin.com/company/1148792?trk=tyah> >> > > -- *Richard Marscher* Senior Software Engineer Localytics Localytics.com <http://localytics.com/> | Our Blog <http://localytics.com/blog> | Twitter <http://twitter.com/localytics> | Facebook <http://facebook.com/localytics> | LinkedIn <http://www.linkedin.com/company/1148792?trk=tyah>