...have been considered or whether this work is something that could be useful
to the wider community.
Regards
Mick Davies
--
Hi,
Regarding higher-order functions:
> Yes, we intend to contribute this to open source.
It doesn't look like this is in 2.3.0; at least I can't find it.
Do you know when it might reach open source?
Thanks
Mick
--
If I write unit tests that indirectly initialize org.apache.spark.util.Utils,
for example by using SQL types, but produce no logging, I get an unpleasant
stack trace in my test output.
This is caused by the Utils class adding a shutdown hook which logs the
message logDebug("Shutdown hook ca
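A possible workaround, sketched below under the assumption that log4j 1.2 is
the logging backend on the test classpath (as in the Spark versions of this
era): initialize log4j explicitly before any Spark class is loaded, so the
shutdown hook's logDebug call finds a configured, quiet logging system. The
object name and where you call it from are illustrative.

import org.apache.log4j.{BasicConfigurator, Level, Logger}

// Call TestLogging.init() from a test suite's constructor or beforeAll,
// before touching any Spark classes.
object TestLogging {
  private var initialized = false

  def init(): Unit = synchronized {
    if (!initialized) {
      BasicConfigurator.configure()             // attach a simple console appender
      Logger.getRootLogger.setLevel(Level.WARN) // keep test output quiet
      initialized = true
    }
  }
}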
--
Thanks - we have tried this and it works nicely.
--
I have put in a PR on Parquet to support dictionaries when filters are pushed
down, which should reduce binary conversion overhead when Spark pushes down
string predicates on columns that are dictionary encoded.
https://github.com/apache/incubator-parquet-mr/pull/117
It's blocked at the moment as
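To illustrate the principle (this is only a self-contained sketch, not the
parquet-mr API or the code in the PR): when a column chunk is dictionary
encoded, a pushed-down string predicate only needs to be evaluated once per
dictionary entry, and rows can then be filtered by dictionary id without
converting each binary value to a String. All names below are illustrative.

object DictionaryFilterSketch {
  def main(args: Array[String]): Unit = {
    // Per-column-chunk dictionary, and the column values stored as dictionary ids.
    val dictionary    = Array("AAPL", "GOOG", "MSFT")
    val encodedColumn = Array(0, 1, 0, 2, 0, 1)

    val predicate: String => Boolean = _ == "AAPL"

    // Evaluate the predicate once per dictionary entry (3 checks) rather than
    // once per row (6 checks), and never decode a binary value per row.
    val matchingIds = dictionary.zipWithIndex.collect {
      case (s, id) if predicate(s) => id
    }.toSet

    val matchingRows = encodedColumn.count(matchingIds.contains)
    println(s"rows matching: $matchingRows")
  }
}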
--
I have been working a lot recently with denormalised tables with lots of
columns, nearly 600. We are using this form to avoid joins.
I have tried to use cache table with this data, but it proves too expensive
as it seems to try to cache all the data in the table.
For data sets such as the one I
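One possible workaround, sketched below with the Spark 1.x SQLContext API and
made-up table and column names, and assuming an existing sqlContext: cache a
projection of just the columns a workload needs, rather than CACHE TABLE on
the full 600-column table.

// Register a narrow projection as a temporary table and cache that, so only
// the selected columns are held in the in-memory columnar cache.
val hot = sqlContext.sql(
  "SELECT trade_date, symbol, price, quantity FROM wide_table")
hot.registerTempTable("wide_table_hot")
sqlContext.cacheTable("wide_table_hot")

// Subsequent queries go against the narrow cached table.
sqlContext.sql(
  "SELECT symbol, sum(quantity) FROM wide_table_hot GROUP BY symbol").collect()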
--
http://succinct.cs.berkeley.edu/wp/wordpress/
Looks like a really interesting piece of work that could dovetail well with
Spark.
--
I have been trying recently to optimize some queries I have running on Spark
on top of Parquet, but the support from Parquet for predicate push down,
especially for dict
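For concreteness, the kind of query involved looks something like the sketch
below (table and column names are made up; the configuration key shown is the
Spark 1.x Parquet push-down flag, which is off by default in some releases).

// Enable Parquet filter push-down and run a query whose string predicate can
// be pushed into the Parquet reader.
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

sqlContext.parquetFile("/data/events").registerTempTable("events")
sqlContext.sql("SELECT count(*) FROM events WHERE country = 'GB'").collect()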
--
Looking at the Parquet code, it looks like hooks are already in place to
support this.
In particular, PrimitiveConverter has the methods hasDictionarySupport and
addValueFromDictionary for this purpose. These are not used by
CatalystPrimitiveConverter.
I think that it would be pretty straightforward to
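A sketch of what using those hooks could look like (this is not Spark's actual
CatalystPrimitiveConverter; the imports use the incubator-era parquet.* package
names, and the 'update' callback is made up): decode each dictionary entry to a
String once in setDictionary, so that addValueFromDictionary becomes an array
lookup instead of a per-row Binary -> String conversion.

import parquet.column.Dictionary
import parquet.io.api.{Binary, PrimitiveConverter}

class StringDictionaryConverter(update: String => Unit) extends PrimitiveConverter {
  private var decoded: Array[String] = _

  override def hasDictionarySupport(): Boolean = true

  // Decode every dictionary entry exactly once, up front.
  override def setDictionary(dictionary: Dictionary): Unit = {
    decoded = new Array[String](dictionary.getMaxId + 1)
    var id = 0
    while (id <= dictionary.getMaxId) {
      decoded(id) = dictionary.decodeToBinary(id).toStringUsingUTF8
      id += 1
    }
  }

  // Dictionary-encoded pages: just an array lookup per value.
  override def addValueFromDictionary(dictionaryId: Int): Unit =
    update(decoded(dictionaryId))

  // Fallback for pages that are not dictionary encoded.
  override def addBinary(value: Binary): Unit =
    update(value.toStringUsingUTF8)
}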
--
Here are some timings showing the effect of caching the last Binary->String
conversion. Query times are reduced significantly, and the reduction in timing
variation due to less garbage is also very significant.
Set of sample queries selecting various columns, applying some filtering and
then aggregating
Spark 1.2.0
Added a JIRA to track
https://issues.apache.org/jira/browse/SPARK-5309
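The change being timed is, in essence, a single-entry cache in the string
converter. A sketch of the idea follows (not the actual patch; 'update' is a
made-up callback): the converter remembers the previously seen Binary and its
decoded String, so runs of identical values reuse one String instead of
allocating a new one per row.

import parquet.io.api.{Binary, PrimitiveConverter}

class CachingStringConverter(update: String => Unit) extends PrimitiveConverter {
  private var lastBinary: Binary = _
  private var lastString: String = _

  override def addBinary(value: Binary): Unit = {
    if (value != lastBinary) {           // Binary.equals compares byte content
      // In a real reader the Binary instance may be reused by Parquet, so the
      // cached key might need a defensive copy; that detail is omitted here.
      lastBinary = value
      lastString = value.toStringUsingUTF8
    }
    update(lastString)
  }
}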
--
Hi,
It seems that a reasonably large proportion of query time using Spark SQL is
spent decoding Parquet Binary objects to produce Java Strings.
Has anyone considered trying to optimize these conversions, as many are
duplicated?
Details are outlined in the conversation in the user mailing
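As a rough way to see how much duplication is involved for a given column, the
row count can be compared with the distinct-value count (a sketch with made-up
table and column names, assuming an existing sqlContext).

// Each row currently pays for a Binary -> String decode, but only
// 'distinct_symbols' distinct Strings are actually needed.
sqlContext.sql(
  "SELECT count(*) AS total_rows, count(DISTINCT symbol) AS distinct_symbols " +
  "FROM trades").collect().foreach(println)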