Hello,
I'm trying to make a call on whether my team should invest time adding a
step to "flatten" our schema as part of our ETL pipeline, to improve the
performance of interactive queries.
Our data start out life as Avro before being converted to Parquet, and so
we follow the Avro idioms of creating ou
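For illustration, here is a minimal sketch of the kind of flattening step being weighed up, using Spark 1.4's DataFrame API; the field names and paths are invented:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class FlattenStep {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("flatten-step"));
    SQLContext sqlContext = new SQLContext(sc);

    // Hypothetical nested Parquet, converted from the original Avro records.
    DataFrame nested = sqlContext.read().parquet("/data/events_nested");

    // Promote the nested fields that interactive queries touch to top-level columns.
    DataFrame flat = nested.select(
        nested.col("user.id").as("user_id"),
        nested.col("event.type").as("event_type"),
        nested.col("event.timestamp_millis").as("timestamp_millis"));

    flat.write().parquet("/data/events_flat");
    sc.stop();
  }
}

Flat top-level columns generally cooperate better with Parquet column pruning than deeply nested structs, which is presumably the motivation for the extra ETL step.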
Hi,
We have an application that submits several thousand jobs within the same
SparkContext, using a thread pool to run about 50 in parallel. We're
running on YARN using Spark 1.4.1 and seeing a problem where our driver is
killed by YARN due to running beyond physical memory limits (no Java OOM
st
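A rough sketch of the submission pattern being described, assuming each task triggers an independent action on the shared context; the paths and job count are placeholders:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ParallelJobs {
  public static void main(String[] args) throws InterruptedException {
    final JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("parallel-jobs"));
    ExecutorService pool = Executors.newFixedThreadPool(50); // ~50 jobs in flight at once

    for (int i = 0; i < 5000; i++) {
      final String path = "/data/input/batch-" + i; // placeholder per-job input
      pool.submit(new Runnable() {
        public void run() {
          // Each count() is a separate Spark job scheduled on the one shared SparkContext.
          System.out.println(path + ": " + sc.textFile(path).count());
        }
      });
    }

    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.DAYS);
    sc.stop();
  }
}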
Hi everyone,
I raised this JIRA ticket back in July:
https://issues.apache.org/jira/browse/SPARK-9435
The problem is that it seems Spark SQL doesn't recognise columns we
transform with a UDF when referenced in the GROUP BY clause. There's a
minimal reproduction Java file attached to illustrate th
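To make the shape of the problem concrete, here is an untested sketch with a made-up input path and UDF (my_udf), plus a possible workaround of grouping on an aliased subquery column:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class UdfGroupBy {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("udf-group-by"));
    SQLContext sqlContext = new SQLContext(sc);

    // Made-up input and UDF standing in for the transformation from the ticket.
    sqlContext.read().parquet("/data/events").registerTempTable("events");
    sqlContext.udf().register("my_udf", new UDF1<Long, String>() {
      public String call(Long id) {
        return String.valueOf(id);
      }
    }, DataTypes.StringType);

    // Shape reported as failing: the UDF appears in both SELECT and GROUP BY, e.g.
    //   SELECT my_udf(id) AS k, count(*) FROM events GROUP BY my_udf(id)

    // Possible workaround (untested): materialise the transformed column in a
    // subquery and group on the plain alias in the outer query.
    DataFrame result = sqlContext.sql(
        "SELECT k, count(*) AS uses "
      + "FROM (SELECT my_udf(id) AS k FROM events) t "
      + "GROUP BY k");
    result.show();

    sc.stop();
  }
}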
and then Spark wouldn't have to sift
> through all the billions of rows to get to the millions it needs to
> aggregate.
>
> Regards
> Sab
>
> On Tue, Jun 23, 2015 at 4:35 PM, James Aley
> wrote:
>
>> Thanks for the suggestions everyone, appreciate the advic
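One way to avoid scanning every row, which seems to be what the quoted suggestion is driving at, is to partition the data on the filter column. A sketch, assuming the events can be written out partitioned by an event-date column (column and path names are invented):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class PartitionedWrite {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("partitioned-write"));
    SQLContext sqlContext = new SQLContext(sc);

    DataFrame events = sqlContext.read().parquet("/data/usage_events");

    // One Parquet directory per event_date, so a date-range filter on that
    // column only has to read the matching partitions.
    events.write().partitionBy("event_date").parquet("/data/usage_events_by_date");

    sc.stop();
  }
}

With the directory layout keyed on event_date, a date-range predicate only needs to read the matching partitions instead of sifting through every row.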
not cast in the where
>> part of a SQL query. It is also not necessary in your case. Getting rid of
>> casts in the whole query will also be beneficial.
>>
>> On Mon, 22 Jun 2015 at 17:29, James Aley
>> wrote:
>>
>>> Hello,
>>>
grading if you are not already on 1.4.
>
> Cheers,
>
> Matthew
>
> *From:* Lior Chaga [mailto:lio...@taboola.com]
> *Sent:* 22 June 2015 17:24
> *To:* James Aley
> *Cc:* user
> *Subject:* Re: Help optimising Spark SQL query
Hello,
A colleague of mine ran the following Spark SQL query:
select
  count(*) as uses,
  count(distinct cast(id as string)) as users
from usage_events
where
  from_unixtime(cast(timestamp_millis/1000 as bigint))
  between '2015-06-09' and '2015-06-16'
The table contains billions of rows, but to
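Following the advice quoted above about dropping the casts, a possible rewrite is sketched below; the epoch-millisecond bounds assume the timestamps are UTC, and it assumes id can be counted distinct without the string cast:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class UsageCounts {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("usage-counts"));
    SQLContext sqlContext = new SQLContext(sc);
    sqlContext.read().parquet("/data/usage_events").registerTempTable("usage_events");

    // 1433808000000 and 1434412800000 are 2015-06-09 and 2015-06-16 at 00:00 UTC, in millis.
    DataFrame counts = sqlContext.sql(
        "SELECT count(*) AS uses, count(DISTINCT id) AS users "
      + "FROM usage_events "
      + "WHERE timestamp_millis BETWEEN 1433808000000 AND 1434412800000");
    counts.show();

    sc.stop();
  }
}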
g(2.0)
>> for (int s = 2; s < steps; s++) {
>>     int stride = n / (1 << s); // n/(2^s)
>>     for (int i = 0; i < stride; i++) {
>>         executor.submit(new Runnable() {
>>             public void run() {
>>                 // union of i and i+n/2
>>
, any recommendations appreciated. Thanks for the help!
James.
On 4 June 2015 at 15:00, Eugen Cepoi wrote:
> Hi
>
> 2015-06-04 15:29 GMT+02:00 James Aley :
>
>> Hi,
>>
>> We have a load of Avro data coming into our data systems in the form of
>> relatively sma
Hi,
We have a load of Avro data coming into our data systems in the form of
relatively small files, which we're merging into larger Parquet files with
Spark. I've been following the docs and the approach I'm taking seemed
fairly obvious, and pleasingly simple, but I'm wondering if perhaps it's
not
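For what it's worth, a minimal sketch of the seemingly obvious approach, assuming the com.databricks.spark.avro data source and a coalesce to keep the output to a few large Parquet files; the paths and partition count are placeholders:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class AvroToParquet {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("avro-to-parquet"));
    SQLContext sqlContext = new SQLContext(sc);

    // Read the many small Avro files in one go via the spark-avro data source.
    DataFrame avro = sqlContext.read()
        .format("com.databricks.spark.avro")
        .load("/data/incoming/avro/*");

    // Coalesce to a handful of partitions so the output is a few large Parquet files.
    avro.coalesce(16).write().parquet("/data/merged/parquet");

    sc.stop();
  }
}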
Hey all,
I'm trying to create tables from existing Parquet data in different
schemata. The following isn't working for me:
CREATE DATABASE foo;
CREATE TABLE foo.bar
USING com.databricks.spark.avro
OPTIONS (path '...');
-- Error: org.apache.spark.sql.AnalysisException: cannot recognize input
nea
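If the parser is just rejecting the qualified table name, one workaround that may be worth trying (untested sketch, assuming a HiveContext is available) is to switch into the database first and create the table unqualified:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

public class CreateInSchema {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("create-in-schema"));
    HiveContext hiveContext = new HiveContext(sc.sc());

    hiveContext.sql("CREATE DATABASE IF NOT EXISTS foo");
    hiveContext.sql("USE foo");
    // Unqualified table name, so it should land in the current database.
    hiveContext.sql("CREATE TABLE bar USING com.databricks.spark.avro OPTIONS (path '...')");

    sc.stop();
  }
}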
tables. The trade-off here would be more complexity,
> but less downtime due to the server restarting.
>
> On Tue, Apr 7, 2015 at 12:34 PM, James Aley
> wrote:
>
>> Hi Michael,
>>
>> Thanks so much for the reply - that really cleared a lot of things up for
>>
Hi Michael,
Thanks so much for the reply - that really cleared a lot of things up for
me!
Let me just check that I've interpreted one of your suggestions for (4)
correctly... Would it make sense for me to write a small wrapper app that
pulls in hive-thriftserver as a dependency, iterates my Parqu
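To make that interpretation concrete, a rough sketch of such a wrapper (untested; it assumes Spark 1.4's HiveThriftServer2.startWithContext is reachable this way, and the paths and table names are invented):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2;

public class ThriftWrapper {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("thrift-wrapper"));
    HiveContext hiveContext = new HiveContext(sc.sc());

    // Iterate the Parquet datasets and expose each one as a temp table (names are invented).
    String[] tables = {"events", "users"};
    for (String t : tables) {
      hiveContext.read().parquet("/data/parquet/" + t).registerTempTable(t);
    }

    // Start the JDBC/ODBC (Thrift) server against the same context, so the
    // registered tables are visible to JDBC clients.
    HiveThriftServer2.startWithContext(hiveContext);
  }
}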
Hello,
First of all, thank you to everyone working on Spark. I've only been using
it for a few weeks now but so far I'm really enjoying it. You saved me from
a big, scary elephant! :-)
I was wondering if anyone might be able to offer some advice about working
with the Thrift JDBC server? I'm tryi