Hi
Are there performance tests or microbenchmarks for Spark, especially ones
directed at the CPU-specific parts? I looked at spark-perf, but that doesn't
seem to have been updated recently.
Thanks
Prasun
Hi Andy,
Because the hash-based aggregate uses UnsafeRow as its aggregation state, the
aggregation buffer schema must contain only types that are mutable in UnsafeRow.
If you can implement your aggregation function with TypedImperativeAggregate,
Spark SQL has ObjectHashAggregateExec, which supports hash-based aggregation.
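For illustration, a rough spark-shell sketch of that route using the built-in
collect_list (a TypedImperativeAggregate); the config key and the expected plan
node are assumptions based on Spark 2.1-era internals, not something stated in
this thread:

import org.apache.spark.sql.functions._

// ObjectHashAggregateExec covers TypedImperativeAggregate functions whose
// buffers are not UnsafeRow-mutable; this flag is assumed to be the switch
// that enables it (default on).
spark.conf.set("spark.sql.execution.useObjectHashAggregateExec", "true")

val df = spark.range(100).selectExpr("id % 10 AS k", "id AS v")

// Expect ObjectHashAggregate rather than SortAggregate in the physical plan.
df.groupBy("k").agg(collect_list("v")).explain()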
Finished reviewing the list and it LGTM now (left comments in the
spreadsheet and Ryan already made corresponding changes).
Ryan - Thanks a lot for pushing this and making it happen!
Cheng
On 1/6/17 3:46 PM, Ryan Blue wrote:
Last month, there was interest in a Parquet patch release on PR #162
Maciej, this looks great.
Could you open a JIRA issue for improving the @udf decorator and possibly
sub-tasks for the specific features from the gist? Thanks!
rb
On Sat, Jan 7, 2017 at 12:39 PM, Maciej Szymkiewicz
wrote:
> Hi,
>
> I've been looking at the PySpark UserDefinedFunction and I have
Hi Takeshi,
Thanks for the answer. My UDAF aggregates data into an array of rows.
Apparently this makes it ineligible for the hash-based aggregate, based on the
logic at:
https://github.com/apache/spark/blob/master/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeFixedWidthAggregationM
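For reference, the linked check essentially requires every field of the
aggregation buffer schema to be a mutable UnsafeRow type. A small spark-shell
sketch against those internals (the class and method names are assumptions
based on the link above):

import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.types._

val fixedWidthBuffer = StructType(Seq(
  StructField("cnt", LongType), StructField("sum", DoubleType)))
val arrayBuffer = StructType(Seq(
  StructField("rows", ArrayType(StringType))))

// true: all fields are fixed-width and mutable in UnsafeRow -> hash-based aggregate
fixedWidthBuffer.forall(f => UnsafeRow.isMutable(f.dataType))

// false: ArrayType is not mutable in UnsafeRow -> sort-based fallback
arrayBuffer.forall(f => UnsafeRow.isMutable(f.dataType))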
Hi,
Spark uses hash-based aggregates whenever the types of the aggregated data are
supported there; otherwise it cannot use the hash-based one and falls back to
sort-based aggregation.
See:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.s
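A quick way to see which strategy the planner picked is to look at the physical
plan; a spark-shell sketch with made-up column names:

import org.apache.spark.sql.functions._

val df = spark.range(100).selectExpr("id % 10 AS k", "id AS v")

// All buffer fields here are fixed-width, so expect HashAggregate in the plan.
df.groupBy("k").agg(sum("v"), avg("v")).explain()

// An aggregate whose buffer is not UnsafeRow-mutable (e.g. a UDAF with an
// ArrayType buffer) would instead show SortAggregate, i.e. the fallback above.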
Hi all,
It appears to me that Dataset.groupBy().agg(udaf) requires a full sort,
which is very inefficient for certain aggregations:
The code is very simple:
- I have a UDAF
- What I want to do is: dataset.groupBy(cols).agg(udaf).count()
The physical plan I got was:
*HashAggregate(keys=[], functions
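For context, a hypothetical UDAF of the kind described above (all names made
up): it collects values into an ArrayType buffer, which is not UnsafeRow-mutable,
so the plan falls back to sort-based aggregation:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class CollectToArray extends UserDefinedAggregateFunction {
  override def inputSchema: StructType = StructType(Seq(StructField("value", LongType)))
  // Array-typed buffer: this is what disqualifies the UnsafeRow hash aggregate.
  override def bufferSchema: StructType = StructType(Seq(StructField("values", ArrayType(LongType))))
  override def dataType: DataType = ArrayType(LongType)
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Long]

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    buffer(0) = buffer.getSeq[Long](0) :+ input.getLong(0)

  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Long](0) ++ buffer2.getSeq[Long](0)

  override def evaluate(buffer: Row): Any = buffer.getSeq[Long](0)
}

// dataset.groupBy(cols).agg(new CollectToArray()(col("v"))) -- expect SortAggregate in explain()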
This could be true if you knew you were just going to scale the input to
StandardScaler and nothing else. It's probably more typical you'd scale
some other data. The current behavior is therefore the sensible default,
because the input is a sample of some unknown larger population.
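That pattern in code, as a sketch with made-up data and column names: the
scaler is fit on one (sample) dataset, and the fitted model is then reused to
scale other data:

import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors

val train = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 10.0)),
  (1, Vectors.dense(2.0, 20.0)),
  (2, Vectors.dense(3.0, 30.0)))).toDF("id", "features")

val other = spark.createDataFrame(Seq(
  (3, Vectors.dense(4.0, 40.0)))).toDF("id", "features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(false)   // defaults, shown explicitly
  .setWithStd(true)

// The statistics come from the training sample...
val model = scaler.fit(train)

// ...and are then applied unchanged to any other data.
model.transform(other).show()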
I think it does
On 7 Jan 2017, at 08:29, Felix Cheung <felixcheun...@hotmail.com> wrote:
Thanks Steve.
As you have pointed out, we have seen some issues related to cloud storage as a
"file system". I've been looking at checkpointing recently. What do you think
would be the improvement we could make for "non l
If you want to check whether it's your modifications or something already in
mainline, you can always check out mainline or stash your current changes and
rebuild (this is something I do pretty often when I run into bugs I don't think
I would have introduced).
On Mon, Jan 9, 2017 at 1:01 AM Liang-Chi Hsieh wrote:
Hi,
As Seq(0 to 8) is:
scala> Seq(0 to 8)
res1: Seq[scala.collection.immutable.Range.Inclusive] = List(Range(0, 1, 2,
3, 4, 5, 6, 7, 8))
Do you actually want to create a Dataset of Range? If so, I think
ScalaReflection, which the encoder relies on, doesn't currently support Range.
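If the intent was a Dataset of the numbers 0..8 rather than a Dataset holding a
single Range object, these should work in the same spark-shell session (no
Range encoder needed):

(0 to 8).toDF("value")   // Range is a Seq[Int], so the Int encoder applies
(0 to 8).toDS()
spark.range(0, 9).toDF("value")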
Jacek Laskowski
Hi Liang-Chi Hsieh,
Strange:
As the "toCarry" returned is the following when I tested your codes:
Map(1 -> Some(FooBar(Some(2016-01-04),lastAssumingSameDate)), 0 ->
Some(FooBar(Some(2016-01-02),second)))
For me it always looked like:
## carry
Map(2 -> None, 5 -> None, 4 -> No
Hi,
Just got this this morning using the fresh build of Spark
2.2.0-SNAPSHOT (with a few local modifications):
scala> Seq(0 to 8).toDF
scala.MatchError: scala.collection.immutable.Range.Inclusive (of class
scala.reflect.internal.Types$ClassNoArgsTypeRef)
at
org.apache.spark.sql.catalyst.ScalaR
The map "toCarry" will return you (partitionIndex, None) for empty
partition.
So I think line 51 won't fail. Line 58 can fail if "lastNotNullRow" is None.
You of course should check if an Option has value or not before you access
it.
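A small self-contained sketch of that check, with made-up stand-ins for the
thread's FooBar and toCarry:

case class FooBar(date: Option[String], name: String)

// What "toCarry" can look like: None for an empty (or all-null) partition.
val toCarry: Map[Int, Option[FooBar]] =
  Map(0 -> Some(FooBar(Some("2016-01-02"), "second")), 1 -> None)

def carried(partitionIndex: Int): FooBar =
  toCarry.get(partitionIndex).flatten match {
    case Some(row) => row                      // last non-null row to carry forward
    case None      => FooBar(None, "no-carry") // nothing to carry: handle explicitly
  }

carried(0)  // FooBar(Some(2016-01-02),second)
carried(1)  // FooBar(None,no-carry)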
As the "toCarry" returned is the following when I tested your