Hello all,
I am facing a FileNotFoundException for a shuffle index file when running a
job with large data. The same job runs fine with smaller datasets. These are
my cluster specifications:
Number of nodes - 19
Total cores - 380
Memory per executor - 32G
Spark 1.6 (MapR distribution)
spark.shuffle.service.enabled
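In case it helps frame the question, here is a minimal Scala sketch of the
kind of configuration in play; the property names are standard Spark
settings, but the specific values are illustrative assumptions, not a known
fix for this error:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("large-shuffle-job")
    // With the external shuffle service on, shuffle index/data files stay
    // readable even if the executor that wrote them is lost.
    .set("spark.shuffle.service.enabled", "true")
    // More shuffle partitions means smaller shuffle blocks per task on
    // large inputs. (Illustrative value.)
    .set("spark.sql.shuffle.partitions", "2000")
    // Shuffle files are written under spark.local.dir; it needs enough
    // free space for the large-data run. (Hypothetical path.)
    .set("spark.local.dir", "/data/spark-tmp")
  val sc = new SparkContext(conf)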
unsubscribe
From: S Malligarjunan
Sent: Saturday, December 3, 2016 11:55:41 AM
To: user@spark.apache.org
Subject: Re: Unsubscribe
Unsubscribe
Thanks and Regards,
Malligarjunan S.
On Saturday, 3 December 2016, 20:42, Sivakumar S wrote:
Unsubscribe
Ephemeral storage on SSD will be very painful to maintain, especially with
large datasets; we will pretty soon be somewhere in the PB range.
I am thinking of leveraging something like the project below, but I am not
sure how much performance gain we could get out of it.
https://github.com/stec-inc/EnhanceIO
On Sat, Dec 3
Hi,
I know this is a broad question. If this is not the right forum, I would
appreciate it if you could point me to other sites/areas that may be helpful.
Before posing this question, I did use our friend Google, but sifting the
results from the angle of my needs hasn't been easy.
Who I am:
- Have done data
What about ephemeral storage on SSD? If performance is required, it's
generally for production, so the cluster would never be stopped. A Spark
job to backup/restore on S3 then allows the cluster to be shut down completely.
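A minimal sketch of that backup/restore idea, Spark 1.6 style (the bucket
and paths are hypothetical, and it assumes the s3a connector and AWS
credentials are already configured):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("s3-backup-restore"))
  val sqlContext = new SQLContext(sc)

  // Backup: copy the dataset from fast ephemeral disks to S3 before
  // shutting the cluster down.
  sqlContext.read.parquet("/data/events")
    .write.mode("overwrite").parquet("s3a://my-backup-bucket/events")

  // Restore: on a fresh cluster, pull the data back onto local SSDs.
  sqlContext.read.parquet("s3a://my-backup-bucket/events")
    .write.mode("overwrite").parquet("/data/events")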
On 3 Dec 2016 1:28 PM, "David Mitchell" wrote:
> To get a node local read
Unsubscribe
To get a node local read from Spark to Cassandra, one has to use a read
consistency level of LOCAL_ONE. For some use cases, this is not an
option. For example, if you need to use a read consistency level
of LOCAL_QUORUM, as many use cases demand, then one is not going to get a
node local read.
A
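For reference, a minimal sketch of where that consistency level gets set
with the DataStax Spark Cassandra Connector (the keyspace and table names
are hypothetical):

  import org.apache.spark.{SparkConf, SparkContext}
  import com.datastax.spark.connector._

  val conf = new SparkConf()
    .setAppName("cassandra-local-read")
    .set("spark.cassandra.connection.host", "127.0.0.1")
    // LOCAL_ONE allows node-local reads; LOCAL_QUORUM, as described above,
    // forces the coordinator to consult replicas on other nodes.
    .set("spark.cassandra.input.consistency.level", "LOCAL_ONE")
  val sc = new SparkContext(conf)

  // cassandraTable comes from the connector's implicits.
  val rows = sc.cassandraTable("my_keyspace", "my_table")
  println(rows.count())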
Guys,
This is my suggestion: use Spark SQL instead of Impala on Hive tables to
get correct timestamp values all the time. The situation is explained below:
I have come across a situation where a multi-tenant cluster is being used
to read and write Parquet files.
This causes some issues, as I
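To make the suggestion concrete, a minimal Spark 1.6-style sketch of
reading the Parquet data through Spark SQL (the path and column name are
hypothetical; spark.sql.parquet.int96AsTimestamp is the Spark setting that
governs how Hive/Impala-written INT96 timestamps are decoded):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("parquet-timestamps"))
  val sqlContext = new SQLContext(sc)
  // Read INT96 values as timestamps (the default), so values written by
  // Hive/Impala come back as java.sql.Timestamp.
  sqlContext.setConf("spark.sql.parquet.int96AsTimestamp", "true")

  val df = sqlContext.read.parquet("/warehouse/events")
  df.select("event_time").show(5)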
On 3 Dec 2016, at 09:16, Manish Malhotra
<manish.malhotra.w...@gmail.com> wrote:
Thanks for sharing the numbers as well!
Nowadays even the network can have very high throughput and might outperform
the disk, but as Sean mentioned, data on the network will have other
dependencies like network
Hmm, GCE pretty much seems to follow the same model as AWS.
On Sat, Dec 3, 2016 at 1:22 AM, kant kodali wrote:
> GCE seems to have better options. Anyone had any experience with GCE?
>
> On Sat, Dec 3, 2016 at 1:16 AM, Manish Malhotra <
> manish.malhotra.w...@gmail.com> wrote:
>
>> thanks for sh
GCE seems to have better options. Anyone had any experience with GCE?
On Sat, Dec 3, 2016 at 1:16 AM, Manish Malhotra <
manish.malhotra.w...@gmail.com> wrote:
> Thanks for sharing the numbers as well!
>
> Nowadays even the network can have very high throughput and might
> outperform the disk, but
Thanks for sharing the numbers as well!
Nowadays even the network can have very high throughput and might
outperform the disk, but as Sean mentioned, data on the network will have
other dependencies like network hops, e.g. if it goes across racks, which
can have a switch in between.
But yes, people are discuss
Forgot to mention, my entire cluster is in one DC, so if it were across
multiple DCs then colocating would make sense in theory as well.
On Sat, Dec 3, 2016 at 1:12 AM, kant kodali wrote:
> Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX (receive
> throughput) on my Spark worker
Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX (receive
throughput) on my Spark worker machine when I run `sudo iftop -B`.
The problem with instance stores on AWS is that they are all ephemeral, so
placing Cassandra on top doesn't make a lot of sense. So in short, AWS
doesn't seem
I'm sure he meant that this is a downside to not colocating.
You are asking the right question. While networking is traditionally much
slower than disk, that changes a bit in the cloud, where attached storage
is remote too.
The disk throughput here is mostly achievable in normal workloads. However
I
Wait, how is that a benefit? Isn't that a bad thing, if you are saying
colocating leads to more latency and overall execution time is longer?
On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:
> You get more latency on reads, so overall execution time is l
You get more latency on reads, so overall execution time is longer.
On 3 Dec 2016 7:39 AM, "kant kodali" wrote:
>
> I wonder what benefits I really get if I colocate my Spark worker
> process and the Cassandra server process on each node?
>
> I understand the concept of moving compute towards