Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ilya Ganelin
clue. > > On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin wrote: > >> Hello - I'm trying to deploy the Spark TimeSeries library in a new >> environment. I'm running Spark 1.6.1 submitted through YARN in a cluster >> with Java 8 installed on all nodes but I'

NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ilya Ganelin
Hello - I'm trying to deploy the Spark TimeSeries library in a new environment. I'm running Spark 1.6.1 submitted through YARN in a cluster with Java 8 installed on all nodes but I'm getting the NoClassDef at runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is part of Java 8 I

Re: Access to broadcasted variable

2016-02-20 Thread Ilya Ganelin
It gets serialized once per physical container, instead of being serialized once per task of every stage that uses it. On Sat, Feb 20, 2016 at 8:15 AM jeff saremi wrote: > Is the broadcasted variable distributed to every executor or every worker? > Now i'm more confused > I thought it was suppose
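
For context, a minimal sketch of the usage being described, assuming a spark-shell session (so sc is already defined); the lookup map and variable names are illustrative:

    // The map is shipped to each executor (physical container) once; every task running
    // there reads the same local copy via .value rather than receiving its own copy.
    val countryNames = Map("US" -> "United States", "DE" -> "Germany")
    val bcNames = sc.broadcast(countryNames)
    val codes = sc.parallelize(Seq("US", "DE", "US"))
    val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))
    resolved.collect().foreach(println)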

Re: How to efficiently Scan (not filter nor lookup) part of Paird RDD or Ordered RDD

2016-01-24 Thread Ilya Ganelin
The solution I normally use is to zipWithIndex() and then use the filter operation. Filter is an O(m) operation where m is the size of your partition, not an O(N) operation. -Ilya Ganelin On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel wrote: > Problem is I have RDD of about 10M rows and it ke
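
A sketch of that pattern, assuming a spark-shell session; the data and range bounds are placeholders:

    // Attach a stable index, then keep only the desired index range. filter visits each
    // partition but does no shuffle, so the cost per partition is proportional to its size.
    val data = sc.parallelize(1 to 1000000).sortBy(x => x)
    val start = 1000L
    val end = 2000L
    val slice = data.zipWithIndex().filter { case (_, idx) => idx >= start && idx < end }
    slice.map(_._1).take(10).foreach(println)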

Spark LDA

2016-01-22 Thread Ilya Ganelin
spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops" --master yarn-client -Ilya Ganelin

Re: Why does a 3.8 T dataset take up 11.59 Tb on HDFS

2015-11-25 Thread Ilya Ganelin
Turning off replication sacrifices durability of your data, so if a node goes down the data is lost - in case that's not obvious. On Wed, Nov 25, 2015 at 8:43 AM Alex Gittens wrote: > Thanks, the issue was indeed the dfs replication factor. To fix it without > entirely clearing out HDFS and reboo

Re: Spark Job is getting killed after certain hours

2015-11-16 Thread Ilya Ganelin
Your Kerberos cert is likely expiring. Check your expiration settings. -Ilya Ganelin On Mon, Nov 16, 2015 at 9:20 PM, Vipul Rai wrote: > Hi Nikhil, > It seems you have Kerberos enabled cluster and it is unable to > authenticate using the ticket. > Please check the Kerberos settin

Re: Spark Streaming Latency in practice

2015-10-13 Thread Ilya Ganelin
Others feel free to chime in here, the recommendation I've seen is to not reduce the batch size below 500ms since the Spark machinery gets in the way below that point. If you're looking for latency at the millisecond scale then I'd recommend checking out Apache Apex, Storm, and Flink. All of those
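
For reference, a rough sketch of how the batch interval is set; the app name, master, and socket source are placeholders, and 500 ms is used only to illustrate the lower bound mentioned above:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Milliseconds, StreamingContext}

    val conf = new SparkConf().setAppName("LowLatencyExample").setMaster("local[2]")
    // 500 ms batches - going much lower tends to be dominated by Spark's own scheduling overhead.
    val ssc = new StreamingContext(conf, Milliseconds(500))
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()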

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin
bda x : (x[1] ,x[0])).partitionBy(30) > ratings = bob.values() > model = ALS.train(ratings, rank, numIterations) > > > On Jun 28, 2015, at 8:24 AM, Ilya Ganelin wrote: > > You can also select pieces of your RDD by first doing a zipWithIndex and > then doing a filter operatio

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin
Oops - code should be: val a = rdd.zipWithIndex().filter(s => s._2 > 1 && s._2 < 100) On Sun, Jun 28, 2015 at 8:24 AM Ilya Ganelin wrote: > You can also select pieces of your RDD by first doing a zipWithIndex and > then doing a filter operation on the second element of the RDD. >

Re: Matrix Multiplication and mllib.recommendation

2015-06-28 Thread Ilya Ganelin
> — Sent from Mailbox <https://www.dropbox.com/mailbox> > On Thu, Jun 18, 2015 at 6:01 AM, Ganelin, Ilya <ilya.gane...@capitalone.com> wrote:

Re: SaveAsTextFile brings down data nodes with IO Exceptions

2015-05-16 Thread Ilya Ganelin
All - this issue showed up when I was tearing down a spark context and creating a new one. Often, I was unable to then write to HDFS due to this error. I subsequently switched to a different implementation where instead of tearing down and re initializing the spark context I'd instead submit a sepa

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread Ilya Ganelin
Nope. It will just work when you call x.value. On Fri, May 15, 2015 at 5:39 PM N B wrote: > Thanks Ilya. Does one have to call broadcast again once the underlying > data is updated in order to get the changes visible on all nodes? > > Thanks > NB > > > On Fri, May 1

Re: Broadcast variables can be rebroadcast?

2015-05-15 Thread Ilya Ganelin
The broadcast variable is like a pointer. If the underlying data changes then the changes will be visible throughout the cluster. On Fri, May 15, 2015 at 5:18 PM NB wrote: > Hello, > > Once a broadcast variable is created using sparkContext.broadcast(), can it > ever be updated again? The use cas

Re: Why Spark is much faster than Hadoop MapReduce even on disk

2015-04-27 Thread Ilya Ganelin
I believe the typical answer is that Spark is actually a bit slower. On Mon, Apr 27, 2015 at 7:34 PM bit1...@163.com wrote: > Hi, > > I am frequently asked why spark is also much faster than Hadoop MapReduce > on disk (without the use of memory cache). I have no convincing answer for > this quest

Re: spark with kafka

2015-04-18 Thread Ilya Ganelin
That's a much better idea :) On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers wrote: > Use KafkaRDD directly. It is in spark-streaming-kafka package > > On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora > wrote: > >> Hi >> >> I want to consume messages from kafka queue using spark batch program not
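
A rough sketch of the KafkaRDD approach via KafkaUtils.createRDD from the Spark 1.x spark-streaming-kafka package, assuming a spark-shell session; the broker, topic, and offsets are illustrative and the exact API varies with the Spark and Kafka versions in use:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Read a bounded slice of a topic as an ordinary batch RDD - no StreamingContext needed.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val offsetRanges = Array(OffsetRange("my-topic", 0, 0L, 1000L)) // topic, partition, from, until
    val messages = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)
    messages.map(_._2).take(5).foreach(println)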

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Ilya Ganelin
r of > > partitions of the input RDD was low as well so the chunks were really too > > big. Increased parallelism and repartitioning seems to be helping... > > > > Thanks! > > Antony. > > > > > > On Thursday, 19 February 2015, 16:45, Ilya Ganelin >

Re: Spark job fails on cluster but works fine on a single machine

2015-02-19 Thread Ilya Ganelin
The stupid question: are you deleting the file from HDFS on the right node? On Thu, Feb 19, 2015 at 11:31 AM Pavel Velikhov wrote: > Yeah, I do manually delete the files, but it still fails with this error. > > On Feb 19, 2015, at 8:16 PM, Ganelin, Ilya > wrote: > > When writing to hdf

Re: RDD Partition number

2015-02-19 Thread Ilya Ganelin
By default you will have (fileSize in MB / 64) partitions. You can also set the number of partitions when you read in a file with sc.textFile as an optional second parameter. On Thu, Feb 19, 2015 at 8:07 AM Alessandro Lulli wrote: > Hi All, > > Could you please help me understanding how Spark def
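
As a quick illustration, assuming a spark-shell session and a placeholder HDFS path:

    // With no second argument Spark creates roughly one partition per HDFS block (~64 MB by default).
    val logsDefault = sc.textFile("hdfs:///data/logs.txt")
    println(logsDefault.partitions.length)

    // The optional second argument requests a minimum number of partitions.
    val logs = sc.textFile("hdfs:///data/logs.txt", 200)
    println(logs.partitions.length)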

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Ilya Ganelin
Hi Antony - you are seeing a problem that I ran into. The underlying issue is your default parallelism setting. What's happening is that within ALS certain RDD operations end up changing the number of partitions you have of your data. For example if you start with an RDD of 300 partitions, unless
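
A sketch of pinning the parallelism up front; spark.default.parallelism is the setting I take this to refer to, and 300 plus the app name are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Shuffle operations inside ALS fall back to spark.default.parallelism when no partitioner
    // is inherited, so setting it explicitly keeps the partition count from drifting.
    val conf = new SparkConf()
      .setAppName("ALSJob")
      .set("spark.default.parallelism", "300")
    val sc = new SparkContext(conf)

    // Alternatively, repartition the input explicitly before calling ALS.train:
    // val ratings = rawRatings.repartition(300)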

Re: Why is RDD lookup slow?

2015-02-19 Thread Ilya Ganelin
Hi Shahab - if your data structures are small enough a broadcasted Map is going to provide faster lookup. Lookup within an RDD is an O(m) operation where m is the size of the partition. For RDDs with multiple partitions, executors can operate on it in parallel so you get some improvement for larger
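
To make the comparison concrete, a small sketch assuming a spark-shell session and toy data:

    // Option 1: lookup on a pair RDD - scans partition data, so roughly O(partition size)
    // per lookup unless the RDD has a known partitioner.
    val pairs = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
    val viaRdd = pairs.lookup(2)                              // Seq("b")

    // Option 2: broadcast a plain Map when it fits in memory - each lookup is then a local
    // hash-map access on the executor.
    val bcMap = sc.broadcast(Map(1 -> "a", 2 -> "b", 3 -> "c"))
    val viaMap = bcMap.value.get(2)                           // Some("b")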

Re: storing MatrixFactorizationModel (pyspark)

2015-02-19 Thread Ilya Ganelin
Yep - the matrix model holds two RDDs of feature vectors representing the decomposed matrix. You can save these to disk and reuse them. On Thu, Feb 19, 2015 at 2:19 AM Antony Mayi wrote: > Hi, > > when getting the model out of ALS.train it would be beneficial to store it > (to disk) so the model can be reused
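
The question was about pyspark, but the idea is the same in Scala; a rough sketch with placeholder paths and parameters (on Spark 1.3+ there is also model.save(sc, path) and MatrixFactorizationModel.load(sc, path)):

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

    val ratings = sc.parallelize(Seq(Rating(1, 10, 4.0), Rating(2, 10, 3.0)))
    val model = ALS.train(ratings, 10, 10)   // rank = 10, iterations = 10

    // The model is essentially two feature RDDs plus the rank; write them out like any RDD.
    model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
    model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

    // Later, read them back and rebuild an equivalent model.
    val userF = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
    val prodF = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
    val restored = new MatrixFactorizationModel(10, userF, prodF)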

Re: quickly counting the number of rows in a partition?

2015-01-14 Thread Ilya Ganelin
The number of records is not stored anywhere. You either need to save it at creation time or step through the RDD. On Wed, Jan 14, 2015 at 1:46 PM Michael Segel wrote: > Sorry, but the accumulator is still going to require you to walk through > the RDD to get an accurate count, right? > Its not b
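
A sketch of the "step through the RDD" approach, counting per partition without pulling the data back, assuming a spark-shell session:

    val rdd = sc.parallelize(1 to 1000, 8)

    // Count inside each partition; only one small (index, count) pair per partition
    // is returned to the driver.
    val countsPerPartition = rdd
      .mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size)))
      .collect()
    countsPerPartition.foreach { case (idx, n) => println(s"partition $idx: $n rows") }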

Re: spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Ilya Ganelin
Welcome to Spark. What's more fun is that setting controls memory on the executors but if you want to set memory limit on the driver you need to configure it as a parameter of the spark-submit script. You also set num-executors and executor-cores on the spark submit call. See both the Spark tuning
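
An illustrative spark-submit invocation showing where those flags live; the class, jar, and sizes are placeholders, not recommendations:

    spark-submit \
      --master yarn-client \
      --class com.example.MyJob \
      --driver-memory 4g \
      --executor-memory 8g \
      --num-executors 10 \
      --executor-cores 4 \
      my-job.jar

Driver memory has to go on the spark-submit call (or spark-defaults.conf) because, in client mode, the driver JVM is already running by the time application code gets to set anything on SparkConf.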

Re: Long-running job cleanup

2014-12-28 Thread Ilya Ganelin
Hi Patrick - is that cleanup present in 1.1? The overhead I am talking about is with regards to what I believe is shuffle related metadata. If I watch the execution log I see small broadcast variables created for every stage of execution, a few KB at a time, and a certain number of MB remaining of

Re: Problem with StreamingContext - getting SPARK-2243

2014-12-27 Thread Ilya Ganelin
Are you trying to do this in the shell? The shell is instantiated with a SparkContext named sc. -Ilya Ganelin On Sat, Dec 27, 2014 at 5:24 PM, tfrisk wrote: > > Hi, > > Doing: >val ssc = new StreamingContext(conf, Seconds(1)) > > and getting: >Only one SparkContex

Re: Long-running job cleanup

2014-12-25 Thread Ilya Ganelin
Hello all - can anyone please offer any advice on this issue? -Ilya Ganelin On Mon, Dec 22, 2014 at 5:36 PM, Ganelin, Ilya wrote: > Hi all, I have a long running job iterating over a huge dataset. Parts of > this operation are cached. Since the job runs for so long, eventually the >

Re: what is the best way to implement mini batches?

2014-12-11 Thread Ilya Ganelin
Hi all. I've been working on a similar problem. One solution that is straightforward (if suboptimal) is to do the following. A.zipWithIndex().filter(s => s._2 >= range_start && s._2 < range_end). Lastly just put that in a for loop. I've found that this approach scales very well. As Matei said another o
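
A sketch of that loop, assuming a spark-shell session; the contents of A and the batch size are illustrative:

    val A = sc.parallelize(1 to 100000)
    val indexed = A.zipWithIndex().cache()   // index once, reuse for every mini batch
    val total = indexed.count()
    val batchSize = 10000L

    for (rangeStart <- 0L until total by batchSize) {
      val rangeEnd = rangeStart + batchSize
      val batch = indexed
        .filter { case (_, i) => i >= rangeStart && i < rangeEnd }
        .map(_._1)
      // ...process or train on this mini batch...
      println(s"batch starting at $rangeStart has ${batch.count()} records")
    }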

Re: GC Issues with randomSplit on large dataset

2014-10-30 Thread Ilya Ganelin
The split is something like 30 million into 2 million partitions. The reason that it becomes tractable is that after I perform the Cartesian on the split data and operate on it I don't keep the full results - I actually only keep a tiny fraction of that generated dataset - making the overall dataset

Re: Questions about serialization and SparkConf

2014-10-29 Thread Ilya Ganelin
Hello Steve. 1) When you call new SparkConf you should get an object with the default config values. You can reference the spark configuration and tuning pages for details on what those are. 2) Yes. Properties set in this configuration will be pushed down to worker nodes actually executing the s

Re: How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread Ilya Ganelin
In Spark, certain functions have an optional parameter to determine the number of partitions (distinct, textFile, etc.). You can also use the coalesce() or repartition() functions to change the number of partitions for your RDD. Thanks. On Oct 28, 2014 9:58 AM, "shahab" wrote: > Thanks for the u

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-28 Thread Ilya Ganelin
[truncated link to an archived mailing-list message] -Ilya Ganelin On Mon, Oct 27, 2014 at 6:12 PM, Xiangrui Meng wrote: > Could you save the data before ALS and tr

Re: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Ilya Ganelin
in pinpointing the bug. > > Thanks, > Burak > > - Original Message - > From: "Ilya Ganelin" > To: "user" > Sent: Monday, October 27, 2014 11:36:46 AM > Subject: MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0 > > Hello all -

MLLib ALS ArrayIndexOutOfBoundsException with Scala Spark 1.1.0

2014-10-27 Thread Ilya Ganelin
"spark.akka.frameSize","50") .set("spark.yarn.executor.memoryOverhead","1024") Does anyone have any suggestions as to why i'm seeing the above error or how to get around it? It may be possible to upgrade to the latest version of Spark but the mechanism for doing so in our environment isn't obvious yet. -Ilya Ganelin

Num-executors and executor-cores overwritten by defaults

2014-10-21 Thread Ilya Ganelin
Hi all. Just upgraded our cluster to CDH 5.2 (with Spark 1.1) but now I can no longer set the number of executors or executor-cores. No matter what values I pass on the command line to spark they are overwritten by the defaults. Does anyone have any idea what could have happened here? Running on Sp

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread Ilya Ganelin
Hey Steve - the way to do this is to use the coalesce() function to coalesce your RDD into a single partition. Then you can do a saveAsTextFile and you'll wind up with outputDir/part-00000 containing all the data. -Ilya Ganelin On Mon, Oct 20, 2014 at 11:01 PM, jay vyas wrote: > sou
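
A sketch of that approach, assuming a spark-shell session and a placeholder output path:

    val results = sc.parallelize(1 to 1000).map(i => s"row-$i")

    // A single partition yields a single output file: outputDir/part-00000
    results.coalesce(1).saveAsTextFile("hdfs:///tmp/outputDir")

Note that coalescing to one partition funnels all the data through a single task, so this only makes sense when the output comfortably fits on one executor.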

Re: What's wrong with my spark filter? I get "org.apache.spark.SparkException: Task not serializable"

2014-10-19 Thread Ilya Ganelin
Check for any variables you've declared in your class. Even if you're not calling them from the function they are passed to the worker nodes as part of the context. Consequently, if you have something without a default serializer (like an imported class) it will also get passed. To fix this you ca
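
A common shape of the problem and the usual fix, sketched with illustrative class and field names:

    import org.apache.spark.rdd.RDD

    class Pipeline(val conn: java.sql.Connection) {   // a non-serializable field on the class
      val threshold = 10                              // the simple value we actually need

      // Broken shape: referencing `threshold` directly captures `this`; since Pipeline holds a
      // non-serializable Connection (and is not Serializable), the task fails to serialize.
      // def filterBroken(rdd: RDD[Int]): RDD[Int] = rdd.filter(_ > threshold)

      // Fix: copy the needed value into a local val so only that value is serialized.
      def filterFixed(rdd: RDD[Int]): RDD[Int] = {
        val localThreshold = threshold
        rdd.filter(_ > localThreshold)
      }
    }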

Re: input split size

2014-10-18 Thread Ilya Ganelin
Also - if you're doing a text file read you can pass the number of resulting partitions as the second argument. On Oct 17, 2014 9:05 PM, "Larry Liu" wrote: > Thanks, Andrew. What about reading out of local? > > On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash wrote: > >> When reading out of HDFS it's

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-14 Thread Ilya Ganelin
Hello all. Does anyone else have any suggestions? Even understanding what this error is from would help a lot. On Oct 11, 2014 12:56 AM, "Ilya Ganelin" wrote: > Hi Akhil - I tried your suggestions and tried varying my partition sizes. > Reducing the number of partitions led

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Ilya Ganelin
ven sending info about every map's output size > to each reducer was a problem, so Reynold has a patch that avoids that if > the number of tasks is large. > > Matei > > On Oct 10, 2014, at 10:09 PM, Ilya Ganelin wrote: > > > Hi Matei - I read your post with great inte

Re: How To Implement More Than One Subquery in Scala/Spark

2014-10-11 Thread Ilya Ganelin
Because of how closures work in Scala, there is no support for nested map/rdd-based operations. Specifically, if you have Context a { Context b { } } Operations within context b, when distributed across nodes, will no longer have visibility of variables specific to context a because that
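
A sketch of the limitation and two common workarounds, with toy data and a spark-shell session assumed; the commented-out line is the nested "context b inside context a" shape described above:

    val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    val purchases = sc.parallelize(Seq((1, 9.99), (2, 4.50), (1, 3.25)))

    // Broken shape: an RDD referenced inside another RDD's transformation. The inner RDD
    // is not usable on the executors, so this fails at runtime.
    // val broken = purchases.map { case (id, amt) => (users.lookup(id).head, amt) }

    // Workaround 1: express the relationship as a join, which Spark can distribute.
    val joined = purchases.join(users)   // RDD[(Int, (Double, String))]

    // Workaround 2: if one side is small, collect and broadcast it, then look it up map-side.
    val userMap = sc.broadcast(users.collectAsMap())
    val resolved = purchases.map { case (id, amt) => (userMap.value.getOrElse(id, "unknown"), amt) }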

Re: Debug Spark in Cluster Mode

2014-10-10 Thread Ilya Ganelin
I would also be interested in knowing more about this. I have used the cloudera manager and the spark resource interface (clientnode:4040) but would love to know if there are other tools out there - either for post processing or better observation during execution. On Oct 9, 2014 4:50 PM, "Rohit Pu

Re: Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Ilya Ganelin
Hi Matei - I read your post with great interest. Could you possibly comment in more depth on some of the issues you guys saw when scaling up spark and how you resolved them? I am interested specifically in spark-related problems. I'm working on scaling up spark to very large datasets and have been

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Ilya Ganelin
the partitions > from 1600 to 200. > > Thanks > Best Regards > > On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin wrote: > >> Hi all – I could use some help figuring out a couple of exceptions I’ve >> been getting regularly. >> >> I have been running on

Re: IOException and appcache FileNotFoundException in Spark 1.02

2014-10-10 Thread Ilya Ganelin
. I faced this issue and it was gone when i dropped the partitions > from 1600 to 200. > > Thanks > Best Regards > > On Fri, Oct 10, 2014 at 5:58 AM, Ilya Ganelin wrote: > >> Hi all – I could use some help figuring out a couple of exceptions I’ve >> been gettin

IOException and appcache FileNotFoundException in Spark 1.02

2014-10-09 Thread Ilya Ganelin
Hi all – I could use some help figuring out a couple of exceptions I’ve been getting regularly. I have been running on a fairly large dataset (150 gigs). With smaller datasets I don't have any issues. My sequence of operations is as follows – unless otherwise specified, I am not caching: Map a 3

IOException and appcache FileNotFoundException in Spark 1.02

2014-10-09 Thread Ilya Ganelin
On Oct 9, 2014 10:18 AM, "Ilya Ganelin" wrote: Hi all – I could use some help figuring out a couple of exceptions I’ve been getting regularly. I have been running on a fairly large dataset (150 gigs). With smaller datasets I don't have any issues. My sequence of operation