RE: Best way to process lookup ETL with Dataframes

2017-01-04 Thread Sesterhenn, Mike
ght? Thanks, -Mike From: Nicholas Hakobian [mailto:nicholas.hakob...@rallyhealth.com] Sent: Friday, December 30, 2016 5:50 PM To: Sesterhenn, Mike Cc: ayan guha; user@spark.apache.org Subject: Re: Best way to process lookup ETL with Dataframes Yep, sequential joins is what I have done in the p

Re: Best way to process lookup ETL with Dataframes

2016-12-30 Thread Sesterhenn, Mike
row because bad data will result. Any other thoughts? From: Nicholas Hakobian Sent: Friday, December 30, 2016 2:12:40 PM To: Sesterhenn, Mike Cc: ayan guha; user@spark.apache.org Subject: Re: Best way to process lookup ETL with Dataframes It looks like Sp

Re: Best way to process lookup ETL with Dataframes

2016-12-30 Thread Sesterhenn, Mike
hat I need is to join after the first join fails. From: ayan guha Sent: Thursday, December 29, 2016 11:06 PM To: Sesterhenn, Mike Cc: user@spark.apache.org Subject: Re: Best way to process lookup ETL with Dataframes How about this - select a.*, nvl(b.col,nvl(

Best way to process lookup ETL with Dataframes

2016-12-29 Thread Sesterhenn, Mike
Hi all, I'm writing an ETL process with Spark 1.5, and I was wondering about the best way to do something. A lot of the fields I am processing require an algorithm similar to this: Join input dataframe to a lookup table. if (that lookup fails (the joined fields are null)) { Lookup into some
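
The lookup-with-fallback logic described in the message above can be modeled in a few lines of plain Python. This is only a sketch of the control flow, not Spark code: the message is cut off before it names the second lookup, so treating it as another lookup table is an assumption, and `lookup_with_fallback`, `primary`, `fallback`, and the `resolved` field are hypothetical names. In DataFrame terms this roughly corresponds to the sequential left joins and the `nvl` expression mentioned in the replies, where the first non-null joined value wins.

```python
def lookup_with_fallback(rows, key, primary, fallback, default=None):
    """Resolve row[key] against `primary`; if that lookup fails
    (no entry, or a null value), try `fallback`; otherwise use `default`.

    `rows` is a list of dicts standing in for an input dataframe;
    `primary` and `fallback` are dicts standing in for lookup tables.
    """
    resolved_rows = []
    for row in rows:
        k = row[key]
        value = primary.get(k)               # first lookup ("join")
        if value is None:                    # lookup failed: joined field is null
            value = fallback.get(k, default)  # assumed second lookup
        # Copy the row and attach the resolved value under a hypothetical name.
        resolved_rows.append({**row, "resolved": value})
    return resolved_rows


rows = [{"id": 1, "code": "a"}, {"id": 2, "code": "b"}, {"id": 3, "code": "c"}]
primary = {"a": "A"}
fallback = {"b": "B"}
result = lookup_with_fallback(rows, "code", primary, fallback)
```

Every input row is kept, and a row that matches neither table gets the default rather than being dropped, which mirrors the outer-join behaviour the thread is after.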

Re: Time-unit of RDD.countApprox timeout parameter

2016-10-04 Thread Sesterhenn, Mike
It only exists in the latest docs, not in versions <= 1.6. From: Sean Owen Sent: Tuesday, October 4, 2016 1:51:49 PM To: Sesterhenn, Mike; user@spark.apache.org Subject: Re: Time-unit of RDD.countApprox timeout parameter The API docs already say: "maxi

Re: Time-unit of RDD.countApprox timeout parameter

2016-10-04 Thread Sesterhenn, Mike
Never mind. Through testing, it seems the unit is MILLISECONDS. This should be added to the docs. From: Sesterhenn, Mike Sent: Tuesday, October 4, 2016 1:02:25 PM To: user@spark.apache.org Subject: Time-unit of RDD.countApprox timeout parameter Hi all, Does anyone
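
Given the finding above that the timeout is in milliseconds, one way to avoid unit mistakes is a small wrapper that accepts seconds and converts. This is a minimal sketch: `count_approx_seconds` is a hypothetical helper name, and it assumes the object passed in exposes `countApprox(timeout, confidence)` the way a PySpark RDD does.

```python
def count_approx_seconds(rdd, timeout_seconds, confidence=0.95):
    """Call rdd.countApprox with a timeout given in seconds.

    countApprox takes its timeout in milliseconds (per the testing
    reported in this thread), so convert before calling it.
    """
    timeout_ms = int(timeout_seconds * 1000)
    return rdd.countApprox(timeout_ms, confidence)
```

For example, `count_approx_seconds(rdd, 2.5)` would wait at most 2500 ms for the approximate count.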

Time-unit of RDD.countApprox timeout parameter

2016-10-04 Thread Sesterhenn, Mike
Hi all, Does anyone know what unit the 'timeout' parameter to the RDD.countApprox() function uses (i.e., seconds, milliseconds, nanoseconds, ...)? I was searching through the source but it got hairy pretty quickly. Thanks