Thanks to all for the quick replies, they helped a lot. To answer a few of
the follow-up questions ...
> 1. How did you fix this performance issue, which I gather was done programmatically?
The main problem in my original code was that the logic was not being
executed when it should have been: the 'if' clause doing the check was not
correct.

> ... resulting in steady performance degradation. After investigation, I
> found that the check logic above was not correct, causing a new combined
> data frame to build on the lineage of the old data frame RDD that it is
> replacing. The Spark physical plan of many queries was becoming larger and
> larger because of 'filter' and 'union' transformations being added to the
> same data frame. I am not yet very familiar ...
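A minimal sketch of that pattern (hypothetical names and data, assuming Spark
2.1 or later in a spark-shell session, not the original job), together with
one way to keep the plan from growing by periodically checkpointing the
combined data frame:

import spark.implicits._
import org.apache.spark.sql.DataFrame

spark.sparkContext.setCheckpointDir("/tmp/lineage-checkpoints")

var combined: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "value")

for (i <- 3 to 100) {
  val update = Seq((i, s"v$i")).toDF("id", "value")
  // Each iteration appends a 'filter' and a 'union' to the existing plan of
  // `combined`, so the physical plan (and planning time) keeps growing.
  combined = combined.filter($"id" =!= i).union(update)

  // Checkpointing materializes the data and truncates the lineage, so the
  // replacement frame no longer drags the old RDD's plan along.
  if (i % 20 == 0) combined = combined.checkpoint()
}

println(combined.count())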
>>> ... found out that this query uses the network (network IO increases while
>>> the query is running). The strange part of this situation is that the
>>> number shown in the Spark UI's shuffle section is very small.
>>> How can I find out the root cause of this problem and solve it?
>>> Link on stackoverflow.com:
>>> https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache
Hi Ayan, thanks for the explanation.
I am aware of compression codecs.
How is the locality level set?
Is it done by Spark or YARN?
Please let me know,
Thanks,
Yesh
On Nov 22, 2016 5:13 PM, "ayan guha" wrote:
Hi
RACK_LOCAL = task running on the same rack, but not on the same node where
the data is.
NODE_LOCAL = task and data are co-located. Probably you were looking for this
one?
GZIP - the read goes through the GZIP codec, but because it is non-splittable
you can have at most 1 task reading a gzip file. ...
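The locality level itself is assigned by Spark's task scheduler, based on
where HDFS reports the blocks to be; YARN only provides the containers. How
long the scheduler waits for a slot at a better level before falling back
(PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY) is controlled by the
spark.locality.wait settings. A sketch of tuning them from application code
(the keys are real Spark settings, the values are illustrative only):

import org.apache.spark.SparkConf

// Wait up to 10s for a node-local slot before accepting a rack-local one,
// and up to 5s more before falling back to ANY (example values only).
val conf = new SparkConf()
  .set("spark.locality.wait", "10s")
  .set("spark.locality.wait.node", "10s")
  .set("spark.locality.wait.rack", "5s")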
Hi Ayan,
we have default rack topology.
-Yeshwanth
Can you Imagine what I would do if I could do all I can - Art of War
On Tue, Nov 22, 2016 at 6:37 AM, ayan guha wrote:
Because snappy is not splittable, a single task makes sense.
Are you sure about the rack topology? I.e. is 225 in a different rack than
227 or 228? What does your topology file say?
On 22 Nov 2016 10:14, "yeshwanth kumar" wrote:
Thanks for your reply,
I can definitely change the underlying compression format,
but I am trying to understand the Locality Level:
why did the executor run on a different node, where the blocks are not
present, when the Locality Level is RACK_LOCAL?
Can you shed some light on this?
Thanks,
Yesh
Use ORC, Parquet or Avro as the format, because they support any compression
type with parallel processing. Alternatively, split your file into several
smaller ones. Another alternative would be bzip2 (but slower in general) or
LZO (usually not included by default in many distributions).
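For instance, a one-off conversion job along these lines (a sketch only:
paths are made up, and it assumes Spark 2.x with the built-in CSV reader)
rewrites the table into a splittable format:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// Reading the single snappy-compressed CSV is still one task (the codec is
// not splittable), but that cost is paid only once during the conversion.
val df = spark.read
  .option("header", "true")
  .csv("hdfs:///data/raw/table_csv_snappy/")

// Parquet is splittable regardless of the compression used inside it, so
// later queries can run one task per block / row group.
df.write
  .option("compression", "snappy")
  .parquet("hdfs:///data/warehouse/table_parquet/")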
Try changing compression to bzip2 or lzo. For reference -
http://comphadoop.weebly.com
Thanks,
Aniket
On Mon, Nov 21, 2016, 10:18 PM yeshwanth kumar
wrote:
Hi,
we are running Hive on Spark. We have an external table over a
snappy-compressed CSV file of size 917.4 MB.
The HDFS block size is set to 256 MB.
As per my understanding, if I run a query over that external table it should
launch 4 tasks, one for each block,
but I am seeing one executor and one task.
On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
Hi,
Query1 is almost 25x faster in HIVE than in SPARK. What is happening in
Query2?
---
ENVIRONMENT:
---
Table A has 533 columns x 24 million rows and Table B has 2 columns x 3
million rows. Both the files are single gzipped csv files.
Both table A and B are external tables in AWS S3, created in HIVE and
accessed through SPARK using HiveContext.
EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
allowMaximumResource allocation and node types are c3.4xlarge).

--
QUERY1:
--
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4 mins in HIVE and 1.1 hours in SPARK.

--
QUERY 2:
--
select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4.5 mins in SPARK.

Regards,
Gourav Sengupta
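For reference, QUERY 2 expressed with the DataFrame API against HiveContext
would look roughly like this (a sketch only, untested; table and column names
taken from the mail, `sc` being the existing SparkContext). It makes explicit
that A is projected down to its join key before the join:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Mirror the (select PK from A) subquery: keep only the join key of A.
val a = hiveContext.table("A").select("PK")
val b = hiveContext.table("B")

val result = a.join(b, a("PK") === b("FK"), "left_outer")
  .where(b("FK").isNotNull)
  .select(a("PK"), b("FK"))

result.count()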
Seems like you have "hive.server2.enable.doAs" enabled; you can either
disable it, or configure hs2 so that the user running the service
("hadoop" in your case) can impersonate others.
See:
https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/Superusers.html
On Fri, Sep 25, 2015 ...
... exited with code 1.
-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Friday, September 25, 2015 1:12 PM
To: Garry Chen
Cc: Jimmy Xiang ; user@spark.apache.org
Subject: Re: hive on spark query error
On Fri, Sep 25, 2015 at 10:05 AM, Garry Chen wrote:
> In spark-defaults.conf the spark.master is spark://hostname:7077. From
> hive-site.xml
> <name>spark.master</name>
> <value>hostname</value>
>
That's not a valid value for spark.master (as the error indicates).
You should set it to "spark://hostname:7077"
In spark-defaults.conf the spark.master is spark://hostname:7077. From
hive-site.xml
<name>spark.master</name>
<value>hostname</value>
From: Jimmy Xiang [mailto:jxi...@cloudera.com]
Sent: Friday, September 25, 2015 1:00 PM
To: Garry Chen
Cc: user@spark.apache.org
Subject: Re: hive on spark query error
> Error: Master must start with yarn, spark, mesos, or local
What's your setting for spark.master?
On Fri, Sep 25, 2015 at 9:56 AM, Garry Chen wrote:
Hi All,
I am following
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started?
to set up Hive on Spark. After setup/configuration everything starts up and I
am able to show tables, but when executing a SQL statement within beeline I
get an error. Please help and ...
... https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
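Assuming Spark 1.5 or later, the built-in dayofyear function (the Spark SQL
counterpart of the Hive date UDFs linked above) gives this directly; a small
spark-shell style sketch with made-up data:

import org.apache.spark.sql.functions.dayofyear

val df = Seq("2015-07-09").toDF("event_date")

// 2015-07-09 is day 190 of the year.
df.select(dayofyear($"event_date".cast("date"))).show()

// The same thing in SQL: SELECT dayofyear('2015-07-09');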
At 2015-07-09 12:02:44, "Ravisankar Mani" wrote:
Hi everyone,
I can't get 'day of year' when using a Spark query. Can you help with any way
to achieve day of year?
Regards,
Ravi