Thanks to all for the quick replies, they helped a lot. To answer a few of
the follow-up questions ...
> 1. How did you fix this performance issue, which I gather was done programmatically?
The main problem in my original code was that the logic was not being
executed when it should have been: the 'if' clause doing the check was not
correct.

> ... resulting in steady performance degradation. After investigation, I
> found that the check logic above was not correct, causing a new combined
> data frame to build on the lineage of the old data frame RDD that it is
> replacing. The Spark physical plan of many queries was becoming larger and
> larger because of 'filter' and 'union' transformations being added to the
> same data frame. I am not yet very familiar ...
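A minimal sketch of that pattern (hypothetical names and data, assuming Spark
2.1 or later in a spark-shell session, not the original job), together with
one way to keep the plan from growing by periodically checkpointing the
combined data frame:

import spark.implicits._
import org.apache.spark.sql.DataFrame

spark.sparkContext.setCheckpointDir("/tmp/lineage-checkpoints")

var combined: DataFrame = Seq((1, "a"), (2, "b")).toDF("id", "value")

for (i <- 3 to 100) {
  val update = Seq((i, s"v$i")).toDF("id", "value")
  // Each iteration appends a 'filter' and a 'union' to the existing plan of
  // `combined`, so the physical plan (and planning time) keeps growing.
  combined = combined.filter($"id" =!= i).union(update)

  // Checkpointing materializes the data and truncates the lineage, so the
  // replacement frame no longer drags the old RDD's plan along.
  if (i % 20 == 0) combined = combined.checkpoint()
}

println(combined.count())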
>>> ... found out that this query uses the network (network IO increases while
>>> the query is running). The strange part of this situation is that the
>>> number shown in the Spark UI's shuffle section is very small.
>>> How can I find out the root cause of this problem and solve it?
>>> Link on stackoverflow.com:
>>> https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache
Hi Ayan, thanks for the explanation.
I am aware of compression codecs.
How is the locality level set?
Is it done by Spark or YARN?
Please let me know,
Thanks,
Yesh
On Nov 22, 2016 5:13 PM, "ayan guha" wrote:
Hi
RACK_LOCAL = task running on the same rack, but not on the same node where
the data is.
NODE_LOCAL = task and data are co-located. Probably you were looking for this
one?
GZIP - the read goes through the GZIP codec, but because it is non-splittable
you can have at most 1 task reading a gzip file. ...
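The locality level itself is assigned by Spark's task scheduler, based on
where HDFS reports the blocks to be; YARN only provides the containers. How
long the scheduler waits for a slot at a better level before falling back
(PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL -> ANY) is controlled by the
spark.locality.wait settings. A sketch of tuning them from application code
(the keys are real Spark settings, the values are illustrative only):

import org.apache.spark.SparkConf

// Wait up to 10s for a node-local slot before accepting a rack-local one,
// and up to 5s more before falling back to ANY (example values only).
val conf = new SparkConf()
  .set("spark.locality.wait", "10s")
  .set("spark.locality.wait.node", "10s")
  .set("spark.locality.wait.rack", "5s")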
Hi Ayan,
we have default rack topology.
-Yeshwanth
Can you Imagine what I would do if I could do all I can - Art of War
On Tue, Nov 22, 2016 at 6:37 AM, ayan guha wrote:
Because snappy is not splittable, a single task makes sense.
Are you sure about the rack topology? I.e. is 225 in a different rack than
227 or 228? What does your topology file say?
On 22 Nov 2016 10:14, "yeshwanth kumar" wrote:
Thanks for your reply,
I can definitely change the underlying compression format,
but I am trying to understand the Locality Level:
why did the executor run on a different node, where the blocks are not
present, when the Locality Level is RACK_LOCAL?
Can you shed some light on this?
Thanks,
Yesh
Use ORC, Parquet or Avro as the format, because they support any compression
type with parallel processing. Alternatively, split your file into several
smaller ones. Another alternative would be bzip2 (but slower in general) or
LZO (usually not included by default in many distributions).
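For instance, a one-off conversion job along these lines (a sketch only:
paths are made up, and it assumes Spark 2.x with the built-in CSV reader)
rewrites the table into a splittable format:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-to-parquet").getOrCreate()

// Reading the single snappy-compressed CSV is still one task (the codec is
// not splittable), but that cost is paid only once during the conversion.
val df = spark.read
  .option("header", "true")
  .csv("hdfs:///data/raw/table_csv_snappy/")

// Parquet is splittable regardless of the compression used inside it, so
// later queries can run one task per block / row group.
df.write
  .option("compression", "snappy")
  .parquet("hdfs:///data/warehouse/table_parquet/")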
Try changing compression to bzip2 or lzo. For reference -
http://comphadoop.weebly.com
Thanks,
Aniket
On Mon, Nov 21, 2016, 10:18 PM yeshwanth kumar
wrote:
Hi,
we are running Hive on Spark. We have an external table over a
snappy-compressed CSV file of size 917.4 MB.
The HDFS block size is set to 256 MB.
As per my understanding, if I run a query over that external table it should
launch 4 tasks, one for each block,
but I am seeing one executor and one task.
On Jun 9, 2016, at 07:14, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
Hi,
Query1 is almost 25x faster in HIVE than in SPARK. What is happening in
Query2?
---
ENVIRONMENT:
---
Table A has 533 columns x 24 million rows and Table B has 2 columns x 3
million rows. Both the files are single gzipped csv files.
Both table A and B are external tables in AWS S3, created in HIVE and
accessed through SPARK using HiveContext.
EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
allowMaximumResource allocation and node types are c3.4xlarge).

--
QUERY1:
--
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4 mins in HIVE and 1.1 hours in SPARK.

--
QUERY 2:
--
select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4.5 mins in SPARK.

Regards,
Gourav Sengupta
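For reference, QUERY 2 expressed with the DataFrame API against HiveContext
would look roughly like this (a sketch only, untested; table and column names
taken from the mail, `sc` being the existing SparkContext). It makes explicit
that A is projected down to its join key before the join:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Mirror the (select PK from A) subquery: keep only the join key of A.
val a = hiveContext.table("A").select("PK")
val b = hiveContext.table("B")

val result = a.join(b, a("PK") === b("FK"), "left_outer")
  .where(b("FK").isNotNull)
  .select(a("PK"), b("FK"))

result.count()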
Seems like you have "hive.server2.enable.doAs" enabled; you can either
disable it, or configure hs2 so that the user running the service
("hadoop" in your case) can impersonate others.
See:
https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-common/Superusers.html
On Fri, Sep 25, 2015 ...
... exited with code 1.
-Original Message-
From: Marcelo Vanzin [mailto:van...@cloudera.com]
Sent: Friday, September 25, 2015 1:12 PM
To: Garry Chen
Cc: Jimmy Xiang ; user@spark.apache.org
Subject: Re: hive on spark query error
On Fri, Sep 25, 2015 at 10:05 AM, Garry Chen wrote:
> In spark-defaults.conf the spark.master is spark://hostname:7077. From
> hive-site.xml
> <name>spark.master</name>
> <value>hostname</value>
>
That's not a valid value for spark.master (as the error indicates).
You should set it to "spark://hostname:7077"
In spark-defaults.conf the spark.master is spark://hostname:7077. From
hive-site.xml
<name>spark.master</name>
<value>hostname</value>
From: Jimmy Xiang [mailto:jxi...@cloudera.com]
Sent: Friday, September 25, 2015 1:00 PM
To: Garry Chen
Cc: user@spark.apache.org
Subject: Re: hive on spark query error
> Error: Master must start with yarn, spark, mesos, or local
What's your setting for spark.master?
On Fri, Sep 25, 2015 at 9:56 AM, Garry Chen wrote:
Hi All,
I am following
https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started?
to set up Hive on Spark. After setup/configuration everything starts up and I
am able to show tables, but when executing a SQL statement within beeline I
get an error. Please help and ...
... https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
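Assuming Spark 1.5 or later, the built-in dayofyear function (the Spark SQL
counterpart of the Hive date UDFs linked above) gives this directly; a small
spark-shell style sketch with made-up data:

import org.apache.spark.sql.functions.dayofyear

val df = Seq("2015-07-09").toDF("event_date")

// 2015-07-09 is day 190 of the year.
df.select(dayofyear($"event_date".cast("date"))).show()

// The same thing in SQL: SELECT dayofyear('2015-07-09');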
At 2015-07-09 12:02:44, "Ravisankar Mani" wrote:
Hi everyone,
I can't get 'day of year' when using a Spark query. Can you help with any way
to achieve day of year?
Regards,
Ravi