clue.
>
> On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin wrote:
>
>> Hello - I'm trying to deploy the Spark TimeSeries library in a new
>> environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
>> with Java 8 installed on all nodes but I'
Hello - I'm trying to deploy the Spark TimeSeries library in a new
environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
with Java 8 installed on all nodes but I'm getting the NoClassDef at
runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is
part of Java 8 I
It gets serialized once per physical container, instead of being serialized
once per task of every stage that uses it.
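For reference, here is a minimal sketch of the pattern under discussion; the lookup table and its contents are made up for illustration, and it assumes an existing SparkContext sc:

// Hypothetical small lookup table. Broadcasting ships it to each executor once
// rather than serializing it into the closure of every task.
val lookup = Map(1 -> "a", 2 -> "b", 3 -> "c")
val bcLookup = sc.broadcast(lookup)

val result = sc.parallelize(1 to 3)
  .map(k => bcLookup.value.getOrElse(k, "unknown")) // read via .value inside tasks
  .collect()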
On Sat, Feb 20, 2016 at 8:15 AM jeff saremi wrote:
> Is the broadcasted variable distributed to every executor or every worker?
> Now i'm more confused
> I thought it was suppose
The solution I normally use is to zipWithIndex() and then use the filter
operation. Filter is an O(m) operation where m is the size of your
partition, not an O(N) operation.
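As a rough sketch, assuming an existing RDD named rdd and an arbitrary index range:

// Pair each element with its index, then keep only the desired range.
// filter() scans each partition locally, so there is no shuffle of the data.
val indexed = rdd.zipWithIndex()                          // RDD[(T, Long)]
val slice = indexed
  .filter { case (_, idx) => idx >= 1000 && idx < 2000 }  // hypothetical range
  .keys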
-Ilya Ganelin
On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel wrote:
> Problem is I have RDD of about 10M rows and it ke
rk.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m
-XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+AggressiveOpts -XX:+UseCompressedOops" --master yarn-client
-Ilya Ganelin
Turning off replication sacrifices durability of your data, so if a node
goes down the data is lost - in case that's not obvious.
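If it helps, a hedged sketch of lowering the replication factor only for files written from a given Spark application (assumes an existing SparkContext sc; files that already exist would need hdfs dfs -setrep instead):

// Files written through this context will use replication factor 1.
// With a single replica, losing the datanode holding a block loses that block.
sc.hadoopConfiguration.set("dfs.replication", "1")
sc.parallelize(1 to 100).saveAsTextFile("hdfs:///tmp/replication-example") // made-up path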
On Wed, Nov 25, 2015 at 8:43 AM Alex Gittens wrote:
> Thanks, the issue was indeed the dfs replication factor. To fix it without
> entirely clearing out HDFS and reboo
Your Kerberos ticket is likely expiring. Check your expiration settings.
-Ilya Ganelin
On Mon, Nov 16, 2015 at 9:20 PM, Vipul Rai wrote:
> Hi Nikhil,
> It seems you have Kerberos enabled cluster and it is unable to
> authenticate using the ticket.
> Please check the Kerberos settin
Others feel free to chime in here, but the recommendation I've seen is not to
reduce the batch size below 500ms, since the Spark machinery gets in the way
below that point.
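For example, a minimal sketch of keeping the batch interval at that 500ms floor (the app name is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-example") // hypothetical app name
val ssc = new StreamingContext(conf, Milliseconds(500))    // batch interval >= 500 ms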
If you're looking for latency at the millisecond scale then I'd recommend
checking out Apache Apex, Storm, and Flink. All of those
bda x : (x[1] ,x[0])).partitionBy(30)
> ratings = bob.values()
> model = ALS.train(ratings, rank, numIterations)
>
>
> On Jun 28, 2015, at 8:24 AM, Ilya Ganelin wrote:
>
> You can also select pieces of your RDD by first doing a zipWithIndex and
> then doing a filter operatio
Oops - the code should be:
val a = rdd.zipWithIndex().filter(s => s._2 > 1 && s._2 < 100)
On Sun, Jun 28, 2015 at 8:24 AM Ilya Ganelin wrote:
> You can also select pieces of your RDD by first doing a zipWithIndex and
> then doing a filter operation on the second element of the RDD.
>
>>>>> On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya <
>>>>> ilya.gane...@capitalone.com> wrote:
>&g
All - this issue showed up when I was tearing down a spark context and
creating a new one. Often, I was unable to then write to HDFS due to this
error. I subsequently switched to a different implementation where instead
of tearing down and re initializing the spark context I'd instead submit a
sepa
Nope. It will just work when you call x.value.
On Fri, May 15, 2015 at 5:39 PM N B wrote:
> Thanks Ilya. Does one have to call broadcast again once the underlying
> data is updated in order to get the changes visible on all nodes?
>
> Thanks
> NB
>
>
> On Fri, May 1
The broadcast variable is like a pointer. If the underlying data changes
then the changes will be visible throughout the cluster.
On Fri, May 15, 2015 at 5:18 PM NB wrote:
> Hello,
>
> Once a broadcast variable is created using sparkContext.broadcast(), can it
> ever be updated again? The use cas
I believe the typical answer is that Spark is actually a bit slower.
On Mon, Apr 27, 2015 at 7:34 PM bit1...@163.com wrote:
> Hi,
>
> I am frequently asked why spark is also much faster than Hadoop MapReduce
> on disk (without the use of memory cache). I have no convincing answer for
> this quest
That's a much better idea :)
On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers wrote:
> Use KafkaRDD directly. It is in spark-streaming-kafka package
>
> On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora > wrote:
>
>> Hi
>>
>> I want to consume messages from kafka queue using spark batch program not
r of
> > partitions of the input RDD was low as well so the chunks were really too
> > big. Increased parallelism and repartitioning seems to be helping...
> >
> > Thanks!
> > Antony.
> >
> >
> > On Thursday, 19 February 2015, 16:45, Ilya Ganelin
>
Stupid question, but are you deleting the file from HDFS on the right node?
On Thu, Feb 19, 2015 at 11:31 AM Pavel Velikhov
wrote:
> Yeah, I do manually delete the files, but it still fails with this error.
>
> On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya
> wrote:
>
> When writing to hdf
By default you will have (file size in MB / 64) partitions. You can also set
the number of partitions when you read in a file with sc.textFile as an
optional second parameter.
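A quick sketch of both behaviors, assuming an existing SparkContext sc and a made-up path:

// Default: roughly one partition per HDFS block (64 MB on older clusters).
val defaultParts = sc.textFile("hdfs:///data/input.txt")

// The optional second argument requests a minimum number of partitions up front.
val moreParts = sc.textFile("hdfs:///data/input.txt", 200)

println(s"${defaultParts.partitions.length} vs ${moreParts.partitions.length}")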
On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli wrote:
> Hi All,
>
> Could you please help me understanding how Spark def
Hi Anthony - you are seeing a problem that I ran into. The underlying issue
is your default parallelism setting. What's happening is that within ALS
certain RDD operations end up changing the number of partitions you have of
your data. For example if you start with an RDD of 300 partitions, unless
Hi Shahab - if your data structures are small enough a broadcasted Map is
going to provide faster lookup. Lookup within an RDD is an O(m) operation
where m is the size of the partition. For RDDs with multiple partitions,
executors can operate on it in parallel so you get some improvement for
larger
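Roughly, the two options look like this (data and names are made up; assumes an existing SparkContext sc):

// Small table: broadcast it and do constant-time hash lookups inside tasks.
val small = sc.broadcast(Map("a" -> 1, "b" -> 2))
val keys = sc.parallelize(Seq("a", "b", "c"))
val viaBroadcast = keys.map(k => k -> small.value.get(k))

// Large table: keep it as a pair RDD and join, rather than looking up
// element-by-element (each element-wise lookup would cost a partition scan).
val large = sc.parallelize(Seq("a" -> 1, "b" -> 2))
val viaJoin = keys.map(k => k -> ()).leftOuterJoin(large).mapValues(_._2)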
Yep. The matrix model has two RDDs of feature vectors representing the
decomposed matrix. You can save these to disk and reuse them.
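A hedged sketch of that, assuming Spark 1.x MLlib, an existing SparkContext sc, an already-trained MatrixFactorizationModel named model, and made-up paths (newer releases also offer model.save and MatrixFactorizationModel.load):

// The model is essentially two RDDs of (id, Array[Double]) factors.
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// Later, in another job, load them back instead of retraining:
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")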
On Thu, Feb 19, 2015 at 2:19 AM Antony Mayi
wrote:
> Hi,
>
> when getting the model out of ALS.train it would be beneficial to store it
> (to disk) so the model can be reused
The number of records is not stored anywhere. You either need to save it at
creation time or step through the RDD.
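Both options, sketched with the Spark 1.x accumulator API and an existing SparkContext sc (the path and numbers are made up):

val rdd = sc.parallelize(1 to 1000000)

// Option 1: walk the RDD when you need the number (a full pass, but exact).
val n = rdd.count()

// Option 2: count as a side effect of a pass you were doing anyway.
val counter = sc.accumulator(0L)
val doubled = rdd.map { x => counter += 1L; x * 2 }
doubled.saveAsTextFile("hdfs:///tmp/doubled") // the action that actually runs the map
println(s"accumulated count = ${counter.value}")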
On Wed, Jan 14, 2015 at 1:46 PM Michael Segel
wrote:
> Sorry, but the accumulator is still going to require you to walk through
> the RDD to get an accurate count, right?
> Its not b
Welcome to Spark. What's more fun is that this setting controls memory on the
executors, but if you want to set a memory limit on the driver you need to
configure it as a parameter of the spark-submit script. You also set
num-executors and executor-cores on the spark-submit call.
See both the Spark tuning
Hi Patrick - is that cleanup present in 1.1?
The overhead I am talking about is with regards to what I believe is
shuffle related metadata. If I watch the execution log I see small
broadcast variables created for every stage of execution, a few KB at a
time, and a certain number of MB remaining of
Are you trying to do this in the shell? The shell is instantiated with a
SparkContext named sc.
-Ilya Ganelin
On Sat, Dec 27, 2014 at 5:24 PM, tfrisk wrote:
>
> Hi,
>
> Doing:
>val ssc = new StreamingContext(conf, Seconds(1))
>
> and getting:
>Only one SparkContex
Hello all - can anyone please offer any advice on this issue?
-Ilya Ganelin
On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya
wrote:
> Hi all, I have a long running job iterating over a huge dataset. Parts of
> this operation are cached. Since the job runs for so long, eventually the
>
Hi all. I've been working on a similar problem. One solution that is
straightforward (if suboptimal) is to do the following.
A.zipWithIndex().filter(x => x._2 >= range_start && x._2 < range_end). Lastly
just put that in a for loop. I've found that this approach scales very
well.
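Sketched out, with a made-up chunk size and assuming the RDD is called A:

val indexed = A.zipWithIndex().cache()   // index once, reuse for every chunk
val total = indexed.count()
val chunkSize = 1000000L                 // hypothetical chunk size

for (rangeStart <- 0L until total by chunkSize) {
  val rangeEnd = rangeStart + chunkSize
  val chunk = indexed
    .filter { case (_, idx) => idx >= rangeStart && idx < rangeEnd }
    .keys
  // ...process this chunk (e.g. a cartesian against another RDD, then save)...
}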
As Matei said another o
The split is something like 30 million into 2 million partitions. The reason
that it becomes tractable is that after I perform the Cartesian on the
split data and operate on it I don't keep the full results - I actually
only keep a tiny fraction of that generated dataset - making the overall
dataset
Hello Steve.
1) When you call new SparkConf you should get an object with the default
config values. You can reference the spark configuration and tuning pages
for details on what those are.
2) Yes. Properties set in this configuration will be pushed down to worker
nodes actually executing the s
In Spark, certain functions have an optional parameter to determine the
number of partitions (distinct, textFile, etc..). You can also use the
coalesce() or repartition() functions to change the number of partitions
for your RDD. Thanks.
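A short sketch of those knobs, assuming an existing SparkContext sc and a made-up path:

// Many operations accept a partition count directly.
val words = sc.textFile("hdfs:///data/words.txt", 100) // ask for at least 100 partitions
val uniques = words.distinct(50)                        // 50 output partitions

// Or change partitioning after the fact.
val fewer = uniques.coalesce(10)      // shrink, avoiding a full shuffle
val more  = uniques.repartition(200)  // full shuffle; can grow or shrink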
On Oct 28, 2014 9:58 AM, "shahab" wrote:
> Thanks for the u
-Ilya Ganelin
On Mon, Oct 27, 2014 at 6:12 PM, Xiangrui Meng wrote:
> Could you save the data before ALS and tr
in pinpointing the bug.
>
> Thanks,
> Burak
>
> - Original Message -
> From: "Ilya Ganelin"
> To: "user"
> Sent: Monday, October 27, 2014 11:36:46 AM
> Subject: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0
>
> Hello all -
"spark.akka.frameSize","50")
.set("spark.yarn.executor.memoryOverhead","1024")
Does anyone have any suggestions as to why I'm seeing the above error or
how to get around it?
It may be possible to upgrade to the latest version of Spark but the
mechanism for doing so in our environment isn't obvious yet.
-Ilya Ganelin
Hi all. Just upgraded our cluster to CDH 5.2 (with Spark 1.1) but now I can
no longer set the number of executors or executor-cores. No matter what
values I pass on the command line to spark they are overwritten by the
defaults. Does anyone have any idea what could have happened here? Running
on Sp
Hey Steve - the way to do this is to use the coalesce() function to
coalesce your RDD into a single partition. Then you can do a saveAsTextFile
and you'll wind up with outputDir/part-00000 containing all the data.
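In other words, something like the following, with a made-up output path:

// Pull everything into a single partition so the save produces one part file.
// The whole dataset has to fit comfortably on one executor for this to be safe.
rdd.coalesce(1).saveAsTextFile("hdfs:///user/me/outputDir")
// -> outputDir/part-00000 holds all of the records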
-Ilya Ganelin
On Mon, Oct 20, 2014 at 11:01 PM, jay vyas
wrote:
> sou
Check for any variables you've declared in your class. Even if you're not
calling them from the function they are passed to the worker nodes as part
of the context. Consequently, if you have something without a default
serializer (like an imported class) it will also get passed.
To fix this you ca
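A sketch of the kind of thing to look for; the class and field names here are invented for illustration:

import org.apache.spark.rdd.RDD

// Stand-in for an imported class with no serializer (does not extend Serializable).
class ThirdPartyClient { def ping(): String = "ok" }

class Pipeline {
  val client = new ThirdPartyClient()
  val multiplier = 3

  def run(rdd: RDD[Int]): RDD[Int] = {
    // Referring to `multiplier` directly captures `this`, dragging `client` into
    // the task closure and causing a NotSerializableException:
    //   rdd.map(x => x * multiplier)

    // Copying the needed value into a local val ships only that value.
    val m = multiplier
    rdd.map(x => x * m)
  }
}

Marking such fields @transient or constructing them inside the closure are other common ways out.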
Also - if you're doing a text file read you can pass the number of
resulting partitions as the second argument.
On Oct 17, 2014 9:05 PM, "Larry Liu" wrote:
> Thanks, Andrew. What about reading out of local?
>
> On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash wrote:
>
>> When reading out of HDFS it's
Hello all. Does anyone else have any suggestions? Even understanding where
this error comes from would help a lot.
On Oct 11, 2014 12:56 AM, "Ilya Ganelin" wrote:
> Hi Akhil - I tried your suggestions and tried varying my partition sizes.
> Reducing the number of partitions led
ven sending info about every map's output size
> to each reducer was a problem, so Reynold has a patch that avoids that if
> the number of tasks is large.
>
> Matei
>
> On Oct 10, 2014, at 10:09 PM, Ilya Ganelin wrote:
>
> > Hi Matei - I read your post with great inte
Because of how closures work in Scala, there is no support for nested
map/rdd-based operations. Specifically, if you have
Context a {
Context b {
}
}
Operations within context b, when distributed across nodes, will no longer
have visibility of variables specific to context a because that
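A hedged sketch of the usual workaround: materialize the inner dataset on the driver and broadcast it, rather than referencing one RDD inside another RDD's closure (assumes an existing SparkContext sc; the data is made up):

val outer = sc.parallelize(Seq(1, 2, 3))
val inner = sc.parallelize(Seq(10, 20, 30))

// Not supported: inner would be used inside outer's closure.
//   outer.map(x => inner.filter(_ > x).count())

// Workaround, assuming inner is small enough to fit in executor memory:
val innerLocal = sc.broadcast(inner.collect())
val counts = outer.map(x => innerLocal.value.count(_ > x))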
I would also be interested in knowing more about this. I have used the
cloudera manager and the spark resource interface (clientnode:4040) but
would love to know if there are other tools out there - either for post
processing or better observation during execution.
On Oct 9, 2014 4:50 PM, "Rohit Pu
Hi Matei - I read your post with great interest. Could you possibly comment
in more depth on some of the issues you guys saw when scaling up spark and
how you resolved them? I am interested specifically in spark-related
problems. I'm working on scaling up spark to very large datasets and have
been
the partitions
> from 1600 to 200.
>
> Thanks
> Best Regards
>
> On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin wrote:
>
>> Hi all – I could use some help figuring out a couple of exceptions I’ve
>> been getting regularly.
>>
>> I have been running on
. I faced this issue and it was gone when i dropped the partitions
> from 1600 to 200.
>
> Thanks
> Best Regards
>
> On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin wrote:
>
>> Hi all – I could use some help figuring out a couple of exceptions I’ve
>> been gettin
Hi all – I could use some help figuring out a couple of exceptions I’ve
been getting regularly.
I have been running on a fairly large dataset (150 gigs). With smaller
datasets I don't have any issues.
My sequence of operations is as follows – unless otherwise specified, I am
not caching:
Map a 3
On Oct 9, 2014 10:18 AM, "Ilya Ganelin" wrote:
Hi all – I could use some help figuring out a couple of exceptions I’ve
been getting regularly.
I have been running on a fairly large dataset (150 gigs). With smaller
datasets I don't have any issues.
My sequence of operation