Hi
Are there performance tests or microbenchmarks for Spark, especially ones
directed at the CPU-specific parts? I looked at spark-perf, but that doesn't
seem to have been updated recently.
Thanks
Prasun
Hi Andy,
Because the hash-based aggregate uses UnsafeRow as its aggregation state, the
aggregation buffer schema must contain only types that are mutable in UnsafeRow.
If you can implement your aggregation function with TypedImperativeAggregate,
Spark SQL has ObjectHashAggregateExec, which supports hash-based aggregation.
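For illustration, a rough spark-shell sketch of that route using the built-in
collect_list (a TypedImperativeAggregate); the config key and the expected plan
node are assumptions based on Spark 2.1-era internals, not something stated in
this thread:

import org.apache.spark.sql.functions._

// ObjectHashAggregateExec covers TypedImperativeAggregate functions whose
// buffers are not UnsafeRow-mutable; this flag is assumed to be the switch
// that enables it (default on).
spark.conf.set("spark.sql.execution.useObjectHashAggregateExec", "true")

val df = spark.range(100).selectExpr("id % 10 AS k", "id AS v")

// Expect ObjectHashAggregate rather than SortAggregate in the physical plan.
df.groupBy("k").agg(collect_list("v")).explain()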
Finished reviewing the list and it LGTM now (left comments in the
spreadsheet and Ryan already made corresponding changes).
Ryan - Thanks a lot for pushing this and making it happen!
Cheng
On 1/6/17 3:46 PM, Ryan Blue wrote:
Last month, there was interest in a Parquet patch release on PR #162
Maciej, this looks great.
Could you open a JIRA issue for improving the @udf decorator and possibly
sub-tasks for the specific features from the gist? Thanks!
rb
On Sat, Jan 7, 2017 at 12:39 PM, Maciej Szymkiewicz
wrote:
> Hi,
>
> I've been looking at the PySpark UserDefinedFunction and I have
Hi Takeshi,
Thanks for the answer. My UDAF aggregates data into an array of rows.
Apparently this makes it ineligible for the hash-based aggregate, based on the
logic at:
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationM
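For reference, the linked check essentially requires every field of the
aggregation buffer schema to be a mutable UnsafeRow type. A small spark-shell
sketch against those internals (the class and method names are assumptions
based on the link above):

import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types._

val fixedWidthBuffer = StructType(Seq(
  StructField("cnt", LongType), StructField("sum", DoubleType)))
val arrayBuffer = StructType(Seq(
  StructField("rows", ArrayType(StringType))))

// true: all fields are fixed-width and mutable in UnsafeRow -> hash-based aggregate
fixedWidthBuffer.forall(f => UnsafeRow.isMutable(f.dataType))

// false: ArrayType is not mutable in UnsafeRow -> sort-based fallback
arrayBuffer.forall(f => UnsafeRow.isMutable(f.dataType))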
Hi,
Spark uses hash-based aggregates whenever the types of the aggregated data are
supported there; otherwise it cannot use the hash-based one and falls back to
sort-based aggregation.
See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.s
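A quick way to see which strategy the planner picked is to look at the physical
plan; a spark-shell sketch with made-up column names:

import org.apache.spark.sql.functions._

val df = spark.range(100).selectExpr("id % 10 AS k", "id AS v")

// All buffer fields here are fixed-width, so expect HashAggregate in the plan.
df.groupBy("k").agg(sum("v"), avg("v")).explain()

// An aggregate whose buffer is not UnsafeRow-mutable (e.g. a UDAF with an
// ArrayType buffer) would instead show SortAggregate, i.e. the fallback above.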
Hi all,
It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
which is very inefficient for certain aggregations:
The code is very simple:
- I have a UDAF
- What I want to do is: dataset.groupBy(cols).agg(udaf).count()
The physical plan I got was:
*HashAggregate(keys=[], functions
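For context, a hypothetical UDAF of the kind described above (all names made
up): it collects values into an ArrayType buffer, which is not UnsafeRow-mutable,
so the plan falls back to sort-based aggregation:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CollectToArray extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(Seq(StructField("value", LongType)))
  // Array-typed buffer: this is what disqualifies the UnsafeRow hash aggregate.
  override def bufferSchema: StructType = StructType(Seq(StructField("values", ArrayType(LongType))))
  override def dataType: DataType = ArrayType(LongType)
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Long]

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[Long](0) :+ input.getLong(0)

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Long](0) ++ buffer2.getSeq[Long](0)

  override def evaluate(buffer: Row): Any = buffer.getSeq[Long](0)
}

// dataset.groupBy(cols).agg(new CollectToArray()(col("v"))) -- expect SortAggregate in explain()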
This could be true if you knew you were just going to scale the input to
StandardScaler and nothing else. It's probably more typical you'd scale
some other data. The current behavior is therefore the sensible default,
because the input is a sample of some unknown larger population.
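That pattern in code, as a sketch with made-up data and column names: the
scaler is fit on one (sample) dataset, and the fitted model is then reused to
scale other data:

import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors

val train = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 10.0)),
  (1, Vectors.dense(2.0, 20.0)),
  (2, Vectors.dense(3.0, 30.0)))).toDF("id", "features")

val other = spark.createDataFrame(Seq(
  (3, Vectors.dense(4.0, 40.0)))).toDF("id", "features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(false)   // defaults, shown explicitly
  .setWithStd(true)

// The statistics come from the training sample...
val model = scaler.fit(train)

// ...and are then applied unchanged to any other data.
model.transform(other).show()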
I think it does
On 7 Jan 2017, at 08:29, Felix Cheung <felixcheun...@hotmail.com> wrote:
Thanks Steve.
As you have pointed out, we have seen some issues related to cloud storage as a
"file system". I've been looking at checkpointing recently. What do you think
would be the improvement we could make for "non l
If you want to check whether it's your modifications or something already in
mainline, you can always check out mainline or stash your current changes and
rebuild (this is something I do pretty often when I run into bugs I don't think
I would have introduced).
On Mon, Jan 9, 2017 at 1:01 AM Liang-Chi Hsieh wrote:
Hi,
As Seq(0 to 8) is:
scala> Seq(0 to 8)
res1: Seq[scala.collection.immutable.Range.Inclusive] = List(Range(0, 1, 2,
3, 4, 5, 6, 7, 8))
Do you actually want to create a Dataset of Range? If so, I think
ScalaReflection, which the encoder relies on, doesn't currently support Range.
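If the intent was a Dataset of the numbers 0..8 rather than a Dataset holding a
single Range object, these should work in the same spark-shell session (no
Range encoder needed):

(0 to 8).toDF("value")   // Range is a Seq[Int], so the Int encoder applies
(0 to 8).toDS()
spark.range(0, 9).toDF("value")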
Jacek Laskowski
Hi Liang-Chi Hsieh,
Strange:
As the "toCarry" returned is the following when I tested your codes:
Map(1 -> Some(FooBar(Some(2016-01-04),lastAssumingSameDate)), 0 ->
Some(FooBar(Some(2016-01-02),second)))
For me it always looked like:
## carry
Map(2 -> None, 5 -> None, 4 -> No
Hi,
Just got this this morning using the fresh build of Spark
2.2.0-SNAPSHOT (with a few local modifications):
scala> Seq(0 to 8).toDF
scala.MatchError: scala.collection.immutable.Range.Inclusive (of class
scala.reflect.internal.Types$ClassNoArgsTypeRef)
at
org.apache.spark.sql.catalyst.ScalaR
The map "toCarry" will return you (partitionIndex, None) for empty
partition.
So I think line 51 won't fail. Line 58 can fail if "lastNotNullRow" is None.
You of course should check if an Option has value or not before you access
it.
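A small self-contained sketch of that check, with made-up stand-ins for the
thread's FooBar and toCarry:

case class FooBar(date: Option[String], name: String)

// What "toCarry" can look like: None for an empty (or all-null) partition.
val toCarry: Map[Int, Option[FooBar]] =
  Map(0 -> Some(FooBar(Some("2016-01-02"), "second")), 1 -> None)

def carried(partitionIndex: Int): FooBar =
  toCarry.get(partitionIndex).flatten match {
    case Some(row) => row                      // last non-null row to carry forward
    case None      => FooBar(None, "no-carry") // nothing to carry: handle explicitly
  }

carried(0)  // FooBar(Some(2016-01-02),second)
carried(1)  // FooBar(None,no-carry)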
As the "toCarry" returned is the following when I tested your