Will nested field performance improve?

2016-04-15 Thread James Aley
Hello, I'm trying to make a call on whether my team should invest time adding a step to "flatten" our schema as part of our ETL pipeline to improve performance of interactive queries. Our data start out life as Avro before being converted to Parquet, and so we follow the Avro idioms of creating ou
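As a rough illustration of what such a flattening step could look like, here is a minimal sketch that assumes each record is available as a nested Python dict. The function name `flatten`, the underscore separator, and the sample record are all hypothetical; a real pipeline would apply the equivalent transformation to the Avro or Parquet schema.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "", sep: str = "_") -> Dict[str, Any]:
    """Turn nested dicts into a single level of prefixed column names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        # Build the column name from the path of enclosing field names.
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested records, carrying the path as the prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


# Hypothetical sample record, for illustration only.
example = {"id": 1, "user": {"name": "a", "geo": {"country": "GB"}}}
print(flatten(example))
```

Whether this helps query performance is exactly the open question of the thread; the sketch only shows the shape of the transformation.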

Unreachable dead objects permanently retained on heap

2015-09-25 Thread James Aley
Hi, We have an application that submits several thousand jobs within the same SparkContext, using a thread pool to run about 50 in parallel. We're running on YARN using Spark 1.4.1 and seeing a problem where our driver is killed by YARN due to running beyond physical memory limits (no Java OOM st

Java UDFs in GROUP BY expressions

2015-09-07 Thread James Aley
Hi everyone, I raised this JIRA ticket back in July: https://issues.apache.org/jira/browse/SPARK-9435 The problem is that it seems Spark SQL doesn't recognise columns we transform with a UDF when referenced in the GROUP BY clause. There's a minimal reproduction Java file attached to illustrate th

Re: Help optimising Spark SQL query

2015-06-30 Thread James Aley
> and then Spark wouldn't have to sift through all the billions of rows to get to the millions it needs to aggregate.
>
> Regards
> Sab
>
> On Tue, Jun 23, 2015 at 4:35 PM, James Aley wrote:
>
>> Thanks for the suggestions everyone, appreciate the advic

Re: Help optimising Spark SQL query

2015-06-23 Thread James Aley
>> not cast in the where part of a sql query. It is also not necessary in your case. Getting rid of casts in the whole query will also be beneficial.
>>
>> On Mon, 22 June 2015 at 17:29, James Aley wrote:
>>
>>> Hello,
>>>

Re: Help optimising Spark SQL query

2015-06-22 Thread James Aley
> grading if you are not already on 1.4.
>
> Cheers,
>
> Matthew
>
> *From:* Lior Chaga [mailto:lio...@taboola.com]
> *Sent:* 22 June 2015 17:24
> *To:* James Aley
> *Cc:* user
> *Subject:* Re: Help optimising Spark SQL query

Help optimising Spark SQL query

2015-06-22 Thread James Aley
Hello, A colleague of mine ran the following Spark SQL query: select count(*) as uses, count (distinct cast(id as string)) as users from usage_events where from_unixtime(cast(timestamp_millis/1000 as bigint)) between '2015-06-09' and '2015-06-16' The table contains billions of rows, but to
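The replies in this thread advise against casting inside the WHERE clause. One way to act on that, sketched here under assumptions, is to convert the date bounds to epoch milliseconds once, outside the query, and compare `timestamp_millis` directly. The helper name `day_start_millis`, the UTC time zone, and the half-open upper bound are assumptions made for illustration; `from_unixtime` uses the session time zone, so the bounds would need adjusting if that is not UTC.

```python
from datetime import datetime, timezone


def day_start_millis(day: str) -> int:
    """Epoch milliseconds for midnight UTC at the start of `day` (YYYY-MM-DD)."""
    dt = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1000


# Bounds are computed once, so the filter compares the raw column to constants.
lower = day_start_millis("2015-06-09")
upper = day_start_millis("2015-06-16")

# Assumption: counting distinct raw `id` values is acceptable without the string cast.
query = (
    "select count(*) as uses, count(distinct id) as users "
    "from usage_events "
    f"where timestamp_millis >= {lower} and timestamp_millis < {upper}"
)
print(query)
```

Comparing the stored column against constants leaves the engine free to skip the per-row conversion and, where the storage format supports it, to filter on the column's raw values.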

Re: Optimisation advice for Avro->Parquet merge job

2015-06-12 Thread James Aley
>> g(2.0)
>> for(s = 2; s < steps;s++) {
>>   int stride = n/(1 << s); // n/(2^s)
>>   for(int i = 0;i < stride;i++) {
>>     executor.submit(new Runnable() {
>>       public void run() {
>>         // union of i and i+n/2
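The quoted fragment appears to combine inputs pairwise over several rounds rather than folding them in one at a time. A minimal sketch of that general idea follows, assuming plain Python lists stand in for the datasets and list concatenation stands in for union. `tree_union` is a hypothetical name, neighbours are paired here for simplicity (the fragment pairs different indices), and this is not the code from the thread.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def tree_union(parts: Sequence[List[T]]) -> List[T]:
    """Combine parts pairwise, roughly halving the number of pieces each round."""
    if not parts:
        return []
    level = [list(p) for p in parts]
    while len(level) > 1:
        nxt = []
        # Pair neighbours; each pair is independent and could run in parallel.
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])
        # An unpaired last element is carried over to the next round unchanged.
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]


print(tree_union([[1], [2], [3], [4], [5]]))
```

With n inputs this takes about log2(n) rounds, which keeps the chain of nested unions shallow compared with a left-to-right fold.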

Re: Optimisation advice for Avro->Parquet merge job

2015-06-04 Thread James Aley
, any recommendations appreciated. Thanks for the help! James.
On 4 June 2015 at 15:00, Eugen Cepoi wrote:
> Hi
>
> 2015-06-04 15:29 GMT+02:00 James Aley:
>
>> Hi,
>>
>> We have a load of Avro data coming into our data systems in the form of relatively sma

Optimisation advice for Avro->Parquet merge job

2015-06-04 Thread James Aley
Hi, We have a load of Avro data coming into our data systems in the form of relatively small files, which we're merging into larger Parquet files with Spark. I've been following the docs and the approach I'm taking seemed fairly obvious, and pleasingly simple, but I'm wondering if perhaps it's not

[Spark SQL] Problems creating a table in specified schema/database

2015-04-28 Thread James Aley
Hey all, I'm trying to create tables from existing Parquet data in different schemata. The following isn't working for me: CREATE DATABASE foo; CREATE TABLE foo.bar USING com.databricks.spark.avro OPTIONS (path '...'); -- Error: org.apache.spark.sql.AnalysisException: cannot recognize input nea

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
> tables. The trade-off here would be more complexity, but less downtime due to the server restarting.
>
> On Tue, Apr 7, 2015 at 12:34 PM, James Aley wrote:
>
>> Hi Michael,
>>
>> Thanks so much for the reply - that really cleared a lot of things up for

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
Hi Michael, Thanks so much for the reply - that really cleared a lot of things up for me! Let me just check that I've interpreted one of your suggestions for (4) correctly... Would it make sense for me to write a small wrapper app that pulls in hive-thriftserver as a dependency, iterates my Parqu

Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
Hello, First of all, thank you to everyone working on Spark. I've only been using it for a few weeks now but so far I'm really enjoying it. You saved me from a big, scary elephant! :-) I was wondering if anyone might be able to offer some advice about working with the Thrift JDBC server? I'm tryi