Re: Optimize encoding/decoding strings when using Parquet

2015-02-13 Thread Mick Davies
I have put in a PR on Parquet to support dictionaries when filters are pushed down, which should reduce binary conversion overhead when Spark pushes down string predicates on columns that are dictionary-encoded. https://github.com/apache/incubator-parquet-mr/pull/117 It's blocked at the moment as
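(A minimal sketch of the idea behind such a change, not the PR's actual code: when a column chunk is dictionary-encoded, a pushed-down string predicate can be evaluated once per dictionary entry rather than once per row, so each Binary is converted to a String at most once per row group. The Dictionary and Binary methods used are real parquet-mr APIs, though import paths follow later releases; buildDictionaryMatcher is an illustrative name.)

    import org.apache.parquet.column.Dictionary
    import org.apache.parquet.io.api.Binary

    // Convert and test each distinct dictionary entry exactly once;
    // rows can then be filtered by their dictionary id alone.
    def buildDictionaryMatcher(dict: Dictionary, predicate: String => Boolean): Int => Boolean = {
      val matches: Array[Boolean] = Array.tabulate(dict.getMaxId + 1) { id =>
        predicate(dict.decodeToBinary(id).toStringUsingUTF8)
      }
      (dictionaryId: Int) => matches(dictionaryId)
    }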

Re: Optimize encoding/decoding strings when using Parquet

2015-01-23 Thread Michael Davies
Added PR https://github.com/apache/spark/pull/4139 - I think tests have been re-arranged, so a merge is necessary. Mick > On 19 Jan 2015, at 18:31, Reynold Xin wrote: > > Definitely go for a pull request! > > > On Mon, Jan 19, 2015 at 10:10 AM, Mick Dav

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Reynold Xin
Definitely go for a pull request! On Mon, Jan 19, 2015 at 10:10 AM, Mick Davies wrote: > > Looking at Parquet code - it looks like hooks are already in place to > support this. > > In particular PrimitiveConverter has methods hasDictionarySupport and > addValueFromDictionary for this purpose. T

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Looking at the Parquet code, it looks like hooks are already in place to support this. In particular, PrimitiveConverter has the methods hasDictionarySupport and addValueFromDictionary for this purpose. These are not used by CatalystPrimitiveConverter. I think that it would be pretty straightforward to
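(To illustrate those hooks, here is a minimal sketch of a converter that wires them up; it is not the actual Spark change. The overridden PrimitiveConverter methods are the real parquet-mr API, but the class name and the updateString callback are illustrative, and import paths follow later releases.)

    import org.apache.parquet.column.Dictionary
    import org.apache.parquet.io.api.{Binary, PrimitiveConverter}

    class DictionaryAwareStringConverter(updateString: String => Unit) extends PrimitiveConverter {
      private var dictionaryValues: Array[String] = _

      override def hasDictionarySupport(): Boolean = true

      // Decode each dictionary entry to a String once per row group.
      override def setDictionary(dictionary: Dictionary): Unit = {
        dictionaryValues = Array.tabulate(dictionary.getMaxId + 1) { id =>
          dictionary.decodeToBinary(id).toStringUsingUTF8
        }
      }

      // Dictionary-encoded pages hand back an id; reuse the pre-decoded String.
      override def addValueFromDictionary(dictionaryId: Int): Unit =
        updateString(dictionaryValues(dictionaryId))

      // Plain-encoded pages still convert per value.
      override def addBinary(value: Binary): Unit =
        updateString(value.toStringUsingUTF8)
    }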

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Here are some timings showing the effect of caching the last Binary->String conversion. Query times are reduced significantly, and the reduction in garbage greatly reduces the variation in timings. Set of sample queries selecting various columns, applying some filtering and then aggregating. Spark 1.2.0
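(For reference, a minimal sketch of the caching idea being measured here; CachingStringDecoder is an illustrative name, and where it plugs into the converter is assumed. It relies on Binary comparing byte content in equals, so repeated values decode only once.)

    import org.apache.parquet.io.api.Binary

    class CachingStringDecoder {
      private var lastBinary: Binary = null
      private var lastString: String = null

      // Columns often contain runs of repeated values, so remembering the
      // previous Binary and its decoded String avoids most UTF-8 decoding
      // and the garbage it creates.
      def decode(value: Binary): String = {
        if (value != lastBinary) {
          lastString = value.toStringUsingUTF8
          lastBinary = value
        }
        lastString
      }
    }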

Re: Optimize encoding/decoding strings when using Parquet

2015-01-19 Thread Mick Davies
Added a JIRA to track https://issues.apache.org/jira/browse/SPARK-5309

Re: Optimize encoding/decoding strings when using Parquet

2015-01-16 Thread Michael Armbrust
+1 to adding such an optimization to Parquet. The bytes are tagged specially as UTF8 in the parquet schema, so it seems like it would be possible to add this. On Fri, Jan 16, 2015 at 8:17 AM, Mick Davies wrote: > Hi, > > It seems that a reasonably large proportion of query time using Spark SQL >
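(For context, a small sketch of checking that UTF8 annotation via the parquet-mr schema API; the schema string is just an example, and import paths follow later releases.)

    import org.apache.parquet.schema.{MessageTypeParser, OriginalType, PrimitiveType}

    val schema = MessageTypeParser.parseMessageType(
      "message example { optional binary name (UTF8); }")
    val field = schema.getType("name").asPrimitiveType()
    // A binary column annotated with UTF8 is known to hold string data,
    // so a String-level optimization is safe to apply to it.
    val isUtf8String =
      field.getPrimitiveTypeName == PrimitiveType.PrimitiveTypeName.BINARY &&
      field.getOriginalType == OriginalType.UTF8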