Thanks Michal - this is exactly what I need. On Mon, Feb 29, 2016 at 11:40 AM, Michał Zieliński < zielinski.mich...@gmail.com> wrote:
> Hi Kevin, > > This should help: > > https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html > > On 29 February 2016 at 16:54, Kevin Mellott <kevin.r.mell...@gmail.com> > wrote: > >> Fellow Sparkers, >> >> I'm trying to "flatten" my view of data within a DataFrame, and am having >> difficulties doing so. The DataFrame contains product information, which >> includes multiple levels of categories (primary, secondary, etc). >> >> *Example Data (Raw):* >> *Name Level Category* >> Baked Code Food 1 >> Baked Code Seafood 2 >> Baked Code Fish 3 >> Hockey Stick Sports 1 >> Hockey Stick Hockey 2 >> Hockey Stick Equipment 3 >> >> *Desired Data:* >> *Name Category1 Category2 Category3* >> Baked Code Food Seafood Fish >> Hockey Stick Sports Hockey Equipment >> >> *Approach:* >> After parsing the "raw" information into two separate DataFrames (called >> *products >> *and *categories*) and registering them as a Spark SQL tables, I was >> attempting to perform the following query to flatten this all into the >> "desired data" (depicted above). >> >> products.registerTempTable("products") >> categories.registerTempTable("categories") >> >> val productList = sqlContext.sql( >> " SELECT p.Name, " + >> " c1.Description AS Category1, " + >> " c2.Description AS Category2, " + >> " c3.Description AS Category3 " + >> " FROM products AS p " + >> " JOIN categories AS c1 " + >> " ON c1.Name = p.Name AND c1.Level = '1' " >> " JOIN categories AS c2 " + >> " ON c2.Name = p.Name AND c2.Level = '2' " >> " JOIN categories AS c3 " + >> " ON c3.Name = p.Name AND c3.Level = '3' " >> >> *Issue:* >> I get an error when running my query above, because I am not able to JOIN >> the *categories* table more than once. Has anybody dealt with this type >> of use case before, and if so how did you achieve the desired behavior? >> >> Thank you in advance for your thoughts. >> >> Kevin >> > >