UDF and native functions performance

2016-09-12 Thread assaf.mendelson
I am trying to create UDFs with improved performance, so I decided to compare several ways of doing it. In general, I created a DataFrame using range with 50M elements, cached it, and counted it to materialize it. I then implemented a simple predicate (x < 10) in four different ways, counted the elements

Re: Fwd: HANA data access from SPARK

2016-09-12 Thread whitefalcon
Hi Dushyant, I saw this same error with an older HANA JDBC driver, but the error went away when I tried a later ngdbc.jar driver file (dated May 2016). I've not tried to Here's an example I did using the later driver with Spark 1.6.2 running standalone. http://scn.sap.com/community/hana-in-memor

GZIP compression support for Spark internal data

2016-09-12 Thread Nasser Ebrahim
Hi, Can we use GZIP compression for internal data such as RDD partitions, broadcast variables and shuffle outputs, so that users have more choice beyond the available LZ4, LZF and Snappy codecs? Is there any specific reason we do not support the JDK's built-in compression? If not, shall I

Why CatalogImpl.makeDataset and SparkSession.createDataset?

2016-09-12 Thread Jacek Laskowski
Hi, I've stumbled upon CatalogImpl.makeDataset [1] -- the only private[sql] method in the CatalogImpl object -- that looks like SparkSession.createDataset [2]. What do you think about removing CatalogImpl.makeDataset? If not, what's so special about one over the other to keep them both? I'd appr

Re: GZIP compression support for Spark internal data

2016-09-12 Thread Takeshi Yamamuro
Hi, Have you seen https://issues.apache.org/jira/browse/SPARK-4633 ? // maropu On Mon, Sep 12, 2016 at 11:00 PM, Nasser Ebrahim wrote: > Hi, > > Can we use GZIP compression for internal data such as RDD partitions, > broadcast variables and shuffle outputs so that user will have more choice >

Re: Organizing Spark ML example packages

2016-09-12 Thread Nick Pentreath
Never actually got around to doing this - do folks still think it worthwhile? On Thu, 21 Apr 2016 at 00:10 Joseph Bradley wrote: > Sounds good to me. I'd request we be strict during this process about > requiring *no* changes to the example itself, which will make review easier. > > On Tue, Apr

Re: UDF and native functions performance

2016-09-12 Thread Takeshi Yamamuro
Hi, I think you'd be better off comparing the generated code of `df.filter` with your generated code by using .debugCodegen(). // maropu On Mon, Sep 12, 2016 at 7:43 PM, assaf.mendelson wrote: > I am trying to create UDFs with improved performance. So I decided to > compare several ways of doing it. > > I

Re: Organizing Spark ML example packages

2016-09-12 Thread Stephen Boesch
Yes: will you have cycles to do it? 2016-09-12 9:09 GMT-07:00 Nick Pentreath : > Never actually got around to doing this - do folks still think it > worthwhile? > > On Thu, 21 Apr 2016 at 00:10 Joseph Bradley wrote: > >> Sounds good to me. I'd request we be strict during this process about >> r

[build system] brief jenkins downtime this morning

2016-09-12 Thread shane knapp
our weekly backups failed due to a hung job. even though i tried to change the backup scheduler (internal to jenkins) to run tonite, it's still insisting that it needs to run immediately and is continually putting jenkins into quiet mode. short of killing all of the current jobs and restarting

RE: UDF and native functions performance

2016-09-12 Thread Mendelson, Assaf
I did, they look the same: scala> my_func.explain(true) == Parsed Logical Plan == Filter smaller#3L < 10 +- Project [id#0L AS smaller#3L] +- Range (0, 5, splits=1) == Analyzed Logical Plan == smaller: bigint Filter smaller#3L < 10 +- Project [id#0L AS smaller#3L] +- Range (0, 50

Re: [build system] brief jenkins downtime this morning

2016-09-12 Thread shane knapp
the backup is done and we're building again! On Mon, Sep 12, 2016 at 9:31 AM, shane knapp wrote: > our weekly backups failed due to a hung job. even though i tried to > change the backup scheduler (internal to jenkins) to run tonite, it's > still insisting that it needs to run immediately and i

Re: GZIP compression support for Spark internal data

2016-09-12 Thread Nasser Ebrahim
Thank you Takeshi for sharing the info. I agree with Patrick and you that there is no point in adding more codecs unless they show better performance results (at least with some workloads on some platforms). The performance of GZIP depends upon its implementation on each platform. Will do s

Re: UDF and native functions performance

2016-09-12 Thread Reynold Xin
Not sure if this is why but perhaps the constraint framework? On Tuesday, September 13, 2016, Mendelson, Assaf wrote: > I did, they look the same: > > > > scala> my_func.explain(true) > > == Parsed Logical Plan == > > Filter smaller#3L < 10 > > +- Project [id#0L AS smaller#3L] > >+- Range (0

RE: UDF and native functions performance

2016-09-12 Thread assaf.mendelson
What is the constraint framework? How would I add the same optimization to the sample function I created? From: rxin [via Apache Spark Developers List] [mailto:ml-node+s1001551n18932...@n3.nabble.com] Sent: Tuesday, September 13, 2016 3:37 AM To: Mendelson, Assaf Subject: Re: UDF and native func