Re: OS killing Executor due to high (possibly off heap) memory usage

2016-12-08 Thread Aniket Bhatnagar
ors. > > now with dataframes the memory usage is both on and off heap, and we have > no way of limiting the off heap memory usage by spark, yet yarn requires a > maximum total memory usage and if you go over it yarn kills the executor. > > On Fri, Nov 25, 2016 at 12:14 PM, Anik

Re: OS killing Executor due to high (possibly off heap) memory usage

2016-11-25 Thread Aniket Bhatnagar
r node. Sent from my Windows 10 phone *From: *Rodrick Brown *Sent: *Friday, November 25, 2016 12:25 AM *To: *Aniket Bhatnagar *Cc: *user *Subject: *Re: OS killing Executor due to high (possibly off heap) memory usage Try setting spark.yarn.executor.memoryOverhead 1 On Thu, Nov 24, 2
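A minimal sketch of how this overhead setting is typically applied alongside the executor heap size (the values shown are illustrative, not taken from the thread):
  val conf = new org.apache.spark.SparkConf()
    .set("spark.executor.memory", "8g")                  // on-heap size requested per executor
    .set("spark.yarn.executor.memoryOverhead", "2048")   // extra MB YARN reserves for off-heap/native usage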

OS killing Executor due to high (possibly off heap) memory usage

2016-11-24 Thread Aniket Bhatnagar
Hi Spark users I am running a job that does join of a huge dataset (7 TB+) and the executors keep crashing randomly, eventually causing the job to crash. There are no out of memory exceptions in the log and looking at the dmesg output, it seems like the OS killed the JVM because of high memory usa

Re: RDD Partitions on HDFS file in Hive on Spark Query

2016-11-21 Thread Aniket Bhatnagar
Try changing compression to bzip2 or lzo. For reference - http://comphadoop.weebly.com Thanks, Aniket On Mon, Nov 21, 2016, 10:18 PM yeshwanth kumar wrote: > Hi, > > we are running Hive on Spark, we have an external table over snappy > compressed csv file of size 917.4 M > HDFS block size is se

Re: Very long pause/hang at end of execution

2016-11-16 Thread Aniket Bhatnagar
Also, how are you launching the application? Through spark submit or creating a spark context in your app? Thanks, Aniket On Wed, Nov 16, 2016 at 10:44 AM Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Thanks for sharing the thread dump. I had a look at them and couldn't f

Re: Very long pause/hang at end of execution

2016-11-16 Thread Aniket Bhatnagar
g thread dumps and I will see if I can > figure out what is going on. > > > On Sunday, November 6, 2016 10:02 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > > > I doubt it's GC as you mentioned that the pause is several minutes. Since > it'

Re: Dataset API | Setting number of partitions during join/groupBy

2016-11-11 Thread Aniket Bhatnagar
ns is less? Or you want to do it for a > perf gain? > > > > Also, what were your initial Dataset partitions and how many did you have > for the result of join? > > > > *From:* Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com] > *Sent:* Friday, November 11, 2016 9:2

Dataset API | Setting number of partitions during join/groupBy

2016-11-11 Thread Aniket Bhatnagar
Hi I can't seem to find a way to pass the number of partitions while joining 2 Datasets or doing a groupBy operation on the Dataset. There is an option of repartitioning the resultant Dataset but it's inefficient to repartition after the Dataset has been joined/grouped into the default number of partitions.
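A small sketch of the two options being weighed here, assuming a Spark 2.x SparkSession named spark and two Datasets named left and right (names are illustrative):
  spark.conf.set("spark.sql.shuffle.partitions", "400")  // global shuffle partition count used by joins/groupBy; value illustrative
  val joined = left.join(right, "id")                    // result has spark.sql.shuffle.partitions partitions
  val resized = joined.repartition(100)                  // the extra shuffle the mail calls inefficient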

Re: Very long pause/hang at end of execution

2016-11-06 Thread Aniket Bhatnagar
d I don't know how to > interpret what I saw... I don't think it could be my code directly, since > at this point my code has all completed? Could GC be taking that long? > > (I could also try grabbing the thread dumps and pasting them here, if that > would help?) > > O

Re: Very long pause/hang at end of execution

2016-11-06 Thread Aniket Bhatnagar
In order to know what's going on, you can study the thread dumps either from spark UI or from any other thread dump analysis tool. Thanks, Aniket On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson wrote: > I'm doing some processing and then clustering of a small dataset (~150 > MB). Everything seem

Re: Improvement proposal | Dynamic disk allocation

2016-11-06 Thread Aniket Bhatnagar
If people agree that is desired, I am willing to submit a SIP for this and find time to work on it. Thanks, Aniket On Sun, Nov 6, 2016 at 1:06 PM Aniket Bhatnagar wrote: > Hello > > Dynamic allocation feature allows you to add executors and scale > computation power. This is great

Improvement proposal | Dynamic disk allocation

2016-11-06 Thread Aniket Bhatnagar
Hello Dynamic allocation feature allows you to add executors and scale computation power. This is great, however, I feel like we also need a way to dynamically scale storage. Currently, if the disk is not able to hold the spilled/shuffle data, the job is aborted causing frustration and loss of tim

Re: RuntimeException: Null value appeared in non-nullable field when holding Optional Case Class

2016-11-03 Thread Aniket Bhatnagar
Issue raised - SPARK-18251 On Wed, Nov 2, 2016, 9:12 PM Aniket Bhatnagar wrote: > Hi all > > I am running into a runtime exception when a DataSet is holding an Empty > object instance for an Option type that is holding non-nullable field. For > instance, if we have the follo

RuntimeException: Null value appeared in non-nullable field when holding Optional Case Class

2016-11-02 Thread Aniket Bhatnagar
Hi all I am running into a runtime exception when a DataSet is holding an Empty object instance for an Option type that is holding a non-nullable field. For instance, if we have the following case class: case class DataRow(id: Int, value: String) Then, DataSet[Option[DataRow]] can only hold Some(D
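A minimal sketch that illustrates the shape of the problem (assuming a SparkSession named spark; this is not the exact code from the mail):
  import org.apache.spark.sql.Dataset
  import spark.implicits._
  case class DataRow(id: Int, value: String)             // id is primitive, so the encoded field is non-nullable
  val ds: Dataset[Option[DataRow]] = Seq(Some(DataRow(1, "a")), None).toDS()
  ds.collect()                                           // the None element is where the runtime exception described above surfaces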

Re: Execute function once on each node

2016-07-18 Thread Aniket Bhatnagar
to read the data in situ without a copy. > > I understand that manually assigning tasks to nodes reduces fault > tolerance, but the simulation codes already explicitly assign tasks, so a > failure of any one node is already a full-job failure. > > On Mon, Jul 18, 2016 at 3:43 PM Anik

Re: Execute function once on each node

2016-07-18 Thread Aniket Bhatnagar
You can't assume that the number of nodes will be constant as some may fail, hence you can't guarantee that a function will execute at most once or at least once on a node. Can you explain your use case in a bit more detail? On Mon, Jul 18, 2016, 10:57 PM joshuata wrote: > I am working on a spark

Re: Help me! Spark WebUI is corrupted!

2015-12-31 Thread Aniket Bhatnagar
Are you running on YARN or standalone? On Thu, Dec 31, 2015, 3:35 PM LinChen wrote: > *Screenshot1(Normal WebUI)* > > > > *Screenshot2(Corrupted WebUI)* > > > > As screenshot2 shows, the format of my Spark WebUI looks strange and I > cannot click the description of active jobs. It seems there is

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
NG_STATS); > >JavaPairDStream> newData = > stages.filter(NEW_STATS); > >newData.foreachRDD{ > rdd.forEachPartition{ >//Store to external storage. > } > } > > Without using updateStageByKey, I'm only have the stats of t

Re: Spark Streaming data checkpoint performance

2015-11-06 Thread Aniket Bhatnagar
Can you try storing the state (word count) in an external key value store? On Sat, Nov 7, 2015, 8:40 AM Thúy Hằng Lê wrote: > Hi all, > > Anyone could help me on this. It's a bit urgent for me on this. > I'm very confused and curious about Spark data checkpoint performance? Is > there any detail

Re: How to close connection in mapPartitions?

2015-10-22 Thread Aniket Bhatnagar
Are you sure RedisClientPool is being initialized properly in the constructor of RedisCache? Can you please copy paste the code that you use to initialize RedisClientPool inside the constructor of RedisCache? Thanks, Aniket On Fri, Oct 23, 2015 at 11:47 AM Bin Wang wrote: > BTW, "lines" is a DS
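For reference, a generic sketch (not the poster's RedisCache code) of opening and closing a client once per partition inside mapPartitions; createClient and process are placeholders:
  rdd.mapPartitions { records =>
    val client = createClient()                                          // placeholder: open one connection per partition
    val results = records.map(record => process(client, record)).toList // materialize before closing the client
    client.close()
    results.iterator
  }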

Re: HBase Spark Streaming giving error after restore

2015-10-16 Thread Aniket Bhatnagar
Can you try changing classOf[OutputFormat[String, BoxedUnit]] to classOf[OutputFormat[String, Put]] while configuring hconf? On Sat, Oct 17, 2015, 11:44 AM Amit Hora wrote: > Hi, > > Regrets for the delayed response > please find below full stack trace > > java.lang.ClassCastException: scala.runtime

Re: Launching EC2 instances with Spark compiled for Scala 2.11

2015-10-08 Thread Aniket Bhatnagar
Is it possible for you to use EMR instead of EC2? If so, you may be able to tweak EMR bootstrap scripts to install your custom spark build. Thanks, Aniket On Thu, Oct 8, 2015 at 5:58 PM Theodore Vasiloudis < theodoros.vasilou...@gmail.com> wrote: > Hello, > > I was wondering if there is an easy

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Aniket Bhatnagar
eStateByKey(ProbabilityCalculator.updateCountsOfProcessGivenRole, > new HashPartitioner(3), initialProcessGivenRoleRdd) > > I am getting an error -- 'missing arguments for method > updateCountsOfProcessGivenRole'. Looking at the method calls, the function > that is called for

Re: Example of updateStateByKey with initial RDD?

2015-10-08 Thread Aniket Bhatnagar
Here is an example: val interval = 60 * 1000 val counts = eventsStream.map(event => { (event.timestamp - event.timestamp % interval, event) }).updateStateByKey[Long](updateFunc = (events: Seq[Event], prevStateOpt: Option[Long]) => { val prevCount = prevStateOpt.getOrElse(0L) val newCount = p
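The preview above is cut off; a plausible completion of the same pattern (Event and eventsStream are the original poster's names, and the counting logic shown here is an assumption):
  val interval = 60 * 1000
  val counts = eventsStream
    .map(event => (event.timestamp - event.timestamp % interval, event))
    .updateStateByKey[Long]((events: Seq[Event], prevStateOpt: Option[Long]) => {
      val prevCount = prevStateOpt.getOrElse(0L)
      Some(prevCount + events.size)                      // assumed completion: add this batch's events to the running count
    })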

Re: spark-submit classloader issue...

2015-09-28 Thread Aniket Bhatnagar
Hi Rachna Can you just use the http client provided via spark transitive dependencies instead of excluding them? The reason user classpath first is failing could be because you have spark artifacts in your assembly jar that don't match with what is deployed (version mismatch or you built the version y

Re: word count (group by users) in spark

2015-09-20 Thread Aniket Bhatnagar
+ _) > > val output = wordCounts. >map({case ((user, word), count) => (user, (word, count))}). >groupByKey() > > By Aniket, if we group by user first, it could run out of memory when > spark tries to put all words in a single sequence, couldn't it? > > On Sat, S

Re: word count (group by users) in spark

2015-09-19 Thread Aniket Bhatnagar
Using scala API, you can first group by user and then use combineByKey. Thanks, Aniket On Sat, Sep 19, 2015, 6:41 PM kali.tumm...@gmail.com wrote: > Hi All, > I would like to achieve this below output using spark , I managed to write > in Hive and call it in spark but not in just spark (scala),

Re: question building spark in a virtual machine

2015-09-19 Thread Aniket Bhatnagar
Hi Eyal Can you check if your Ubuntu VM has enough RAM allocated to run a JVM of size 3gb? Thanks, Aniket On Sat, Sep 19, 2015, 9:09 PM Eyal Altshuler wrote: > Hi, > > I had configured the MAVEN_OPTS environment variable the same as you wrote. > My java version is 1.7.0_75. > I didn't customized

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-18 Thread Aniket Bhatnagar
; /Shahab > > On Fri, Sep 18, 2015 at 12:54 PM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> Can you try yarn-client mode? >> >> On Fri, Sep 18, 2015, 3:38 PM shahab wrote: >> >>> Hi, >>> >>> Probably I have

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-18 Thread Aniket Bhatnagar
Can you try yarn-client mode? On Fri, Sep 18, 2015, 3:38 PM shahab wrote: > Hi, > > Probably I have wrong zeppelin configuration, because I get the following > error when I execute spark statements in Zeppelin: > > org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't > running

Re: Checkpointing with Kinesis

2015-09-17 Thread Aniket Bhatnagar
You can perhaps set up a WAL that logs to S3? A new cluster should pick up the records that weren't processed due to the previous cluster's termination. Thanks, Aniket On Thu, Sep 17, 2015, 9:19 PM Alan Dipert wrote: > Hello, > We are using Spark Streaming 1.4.1 in AWS EMR to process records from > Kinesis.

Re: java.lang.NoSuchMethodError and yarn-client mode

2015-09-09 Thread Aniket Bhatnagar
Hi Tom There has to be a difference in classpaths in yarn-client and yarn-cluster mode. Perhaps a good starting point would be to print classpath as a first thing in SimpleApp.main. It should give clues around why it works in yarn-cluster mode. Thanks, Aniket On Wed, Sep 9, 2015, 2:11 PM Tom Sed
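A tiny sketch of that diagnostic (SimpleApp is the class name from the thread; the body shown here is illustrative):
  object SimpleApp {
    def main(args: Array[String]): Unit = {
      println(System.getProperty("java.class.path"))     // compare this output between yarn-client and yarn-cluster runs
      // ... rest of the job
    }
  }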

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2015-05-23 Thread Aniket Bhatnagar
test it out? > > > > On Thu, May 21, 2015 at 1:43 AM, Tathagata Das > wrote: > >> Thanks for the JIRA. I will look into this issue. >> >> TD >> >> On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar < >> aniket.bhatna...@gmail.com> wrote: >>

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2015-05-21 Thread Aniket Bhatnagar
I ran into one of the issues that are potentially caused because of this and have logged a JIRA bug - https://issues.apache.org/jira/browse/SPARK-7788 Thanks, Aniket On Wed, Sep 24, 2014 at 12:59 PM Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Hi all > > Readin

Re: OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Aniket Bhatnagar
the hashCode and equals method correctly. > > > > On Wednesday, April 15, 2015 at 3:10 PM, Aniket Bhatnagar wrote: > > > I am aggregating a dataset using combineByKey method and for a certain > input size, the job fails with the following error. I have enabled head > dumps

OOM in SizeEstimator while using combineByKey

2015-04-15 Thread Aniket Bhatnagar
I am aggregating a dataset using the combineByKey method and for a certain input size, the job fails with the following error. I have enabled heap dumps to better analyze the issue and will report back if I have any findings. Meanwhile, if you guys have any idea of what could possibly result in this er

Saprk 1.2.0 | Spark job fails with MetadataFetchFailedException

2015-03-19 Thread Aniket Bhatnagar
I have a job that sorts data and runs a combineByKey operation and it sometimes fails with the following error. The job is running on spark 1.2.0 cluster with yarn-client deployment mode. Any clues on how to debug the error? org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output

Re: No executors allocated on yarn with latest master branch

2015-02-12 Thread Aniket Bhatnagar
This is tricky to debug. Check the logs of YARN's node and resource managers to see if you can trace the error. In the past, I had to closely look at arguments getting passed to the YARN container (they get logged before attempting to launch containers). If I still didn't get a clue, I had to check the scri

Re: Spark Job running on localhost on yarn cluster

2015-02-04 Thread Aniket Bhatnagar
Have you set master in SparkConf/SparkContext in your code? Driver logs show in which mode the spark job is running. Double check if the logs mention local or yarn-cluster. Also, what's the error that you are getting? On Wed, Feb 4, 2015, 6:13 PM kundan kumar wrote: > Hi, > > I am trying to exec
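A sketch of the situation being asked about (names are illustrative): if the master is hard-coded in the SparkConf, the job runs locally regardless of how it is submitted.
  import org.apache.spark.{SparkConf, SparkContext}
  val conf = new SparkConf().setAppName("myJob").setMaster("local[*]")  // forces local mode regardless of spark-submit flags
  val sc = new SparkContext(conf)
  // Leaving setMaster out of the code and passing --master yarn-client/yarn-cluster to spark-submit avoids this.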

Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-02-02 Thread Aniket Bhatnagar
t;: "true", "spark.hadoop.fs.s3.consistent.metadata.tableName": "EmrFSMetadata", "spark.hadoop.fs.s3.consistent.metadata.read.capacity": "500", "spark.hadoop.fs.s3.consistent.metadata.write.capacity": "100", "spark.hadoop.fs.s3.consistent.fastList"

Re: Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Aniket Bhatnagar
e permissions for it. > -Sven > > On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> I am programmatically submit spark jobs in yarn-client mode on EMR. >> Whenever a job tries to save file to s3, it gives the below ment

Programmatic Spark 1.2.0 on EMR | S3 filesystem is not working when using

2015-01-30 Thread Aniket Bhatnagar
I am programmatically submitting spark jobs in yarn-client mode on EMR. Whenever a job tries to save a file to s3, it gives the below mentioned exception. I think the issue might be that EMR is not set up properly, as I have to set all hadoop configurations manually in SparkContext. However, I am not sure

Re: saving rdd to multiple files named by the key

2015-01-26 Thread Aniket Bhatnagar
This might be helpful: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job On Tue Jan 27 2015 at 07:45:18 Sharon Rapoport wrote: > Hi, > > I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k]. > I got them by combining many [k,v]
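One common approach from that Stack Overflow thread, sketched from memory rather than copied (old mapred API; pairRdd is a placeholder RDD of key/value pairs):
  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
      key.toString                                       // one output file per key
    override def generateActualKey(key: Any, value: Any): Any =
      NullWritable.get()                                 // keep only the value in the file contents
  }

  pairRdd.saveAsHadoopFile("/output/path", classOf[Any], classOf[Any], classOf[KeyBasedOutput])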

Re: ClosureCleaner should use ClassLoader created by SparkContext

2015-01-21 Thread Aniket Bhatnagar
(ThreadPoolExecutor.java:615) [na:1.7.0_71] at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71] On Wed Jan 21 2015 at 17:34:34 Aniket Bhatnagar wrote: > While implementing a spark server, I realized that Thread's context loader > must be set to any dynamically loaded classloa

ClosureCleaner should use ClassLoader created by SparkContext

2015-01-21 Thread Aniket Bhatnagar
While implementing a spark server, I realized that the Thread's context classloader must be set to any dynamically loaded classloader so that ClosureCleaner can do its thing. Should the ClosureCleaner not use the classloader created by SparkContext (that has all dynamically added jars via SparkContext.addJar)

Re: How to output to S3 and keep the order

2015-01-19 Thread Aniket Bhatnagar
When you repartition, ordering can get lost. You would need to sort after repartitioning. Aniket On Tue, Jan 20, 2015, 7:08 AM anny9699 wrote: > Hi, > > I am using Spark on AWS and want to write the output to S3. It is a > relatively small file and I don't want them to output as multiple parts.
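A minimal sketch of that ordering caveat (the record/key names and the output path are illustrative): sort into a single partition up front rather than repartitioning afterwards.
  val ordered = resultRdd.sortBy(record => record.key, ascending = true, numPartitions = 1)
  ordered.saveAsTextFile("s3n://bucket/output")          // single sorted part file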

Re: kinesis multiple records adding into stream

2015-01-16 Thread Aniket Bhatnagar
Sorry. I couldn't understand the issue. Are you trying to send data to kinesis from a spark batch/real time job? - Aniket On Fri, Jan 16, 2015, 9:40 PM Hafiz Mujadid wrote: > Hi Experts! > > I am using kinesis dependency as follow > groupId = org.apache.spark > artifactId = spark-streaming-kin

Re: Inserting an element in RDD[String]

2015-01-15 Thread Aniket Bhatnagar
Sure there is. Create a new RDD just containing the schema line (hint: use sc.parallelize) and then union both the RDDs (the header RDD and data RDD) to get a final desired RDD. On Thu Jan 15 2015 at 19:48:52 Hafiz Mujadid wrote: > hi experts! > > I hav an RDD[String] and i want to add schema li
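A minimal sketch of that suggestion (the schema line and names are illustrative):
  val header = sc.parallelize(Seq("id,name,value"), 1)   // single-partition RDD holding just the schema line
  val withHeader = header.union(dataRdd)                 // the header RDD's partitions come first, so the line leads the output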

Re: kinesis creating stream scala code exception

2015-01-15 Thread Aniket Bhatnagar
Are you using spark in standalone mode or yarn or mesos? If it's yarn, please mention the hadoop distribution and version. What spark distribution are you using (it seems 1.2.0 but compiled with which hadoop version)? Thanks, Aniket On Thu, Jan 15, 2015, 4:59 PM Hafiz Mujadid wrote: > Hi, Exper

Re: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-12 Thread Aniket Bhatnagar
d putting the > Spark assembly first in the classpath fixes the issue. I expect that the > version of Parquet that's being included in the EMR libs just needs to be > upgraded. > > > ~ Jonathan Kelly > > From: Aniket Bhatnagar > Date: Sunday, January 4, 2015 at

Re: spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-08 Thread Aniket Bhatnagar
> libraries are java-only (the scala version appended there is just for > helping the build scripts). > > But it does look weird, so it would be nice to fix it. > > On Wed, Jan 7, 2015 at 12:25 AM, Aniket Bhatnagar > wrote: > It seems that spark-network-yarn compiled for sc

spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-07 Thread Aniket Bhatnagar
It seems that spark-network-yarn compiled for scala 2.11 depends on spark-network-shuffle compiled for scala 2.10. This causes cross-version dependency conflicts in sbt. Seems like a publishing error? http://www.uploady.com/#!/download/6Yn95UZA0DR/3taAJFjCJjrsSXOR

Re: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-04 Thread Aniket Bhatnagar
Can you confirm your emr version? Could it be because of the classpath entries for emrfs? You might face issues with using S3 without them. Thanks, Aniket On Mon, Jan 5, 2015, 11:16 AM Adam Gilmore wrote: > Just an update on this - I found that the script by Amazon was the culprit > - not exact

Re: A spark newbie question

2015-01-04 Thread Aniket Bhatnagar
Go through spark API documentation. Basically you have to do group by (date, message_type) and then do a count. On Sun, Jan 4, 2015, 9:58 PM Dinesh Vallabhdas wrote: > A spark cassandra newbie question. Thanks in advance for the help. > I have a cassandra table with 2 columns message_timestamp(t

Re: sparkContext.textFile does not honour the minPartitions argument

2015-01-02 Thread Aniket Bhatnagar
of partitions value is to increase number of partitions > not reduce it from default value. > > On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> I am trying to read a file into a single partition but it seems like &g

sparkContext.textFile does not honour the minPartitions argument

2015-01-01 Thread Aniket Bhatnagar
I am trying to read a file into a single partition but it seems like sparkContext.textFile ignores the passed minPartitions value. I know I can repartition the RDD but I was curious to know if this is expected or if this is a bug that needs to be further investigated?
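As the reply above notes, minPartitions only raises the partition count; a sketch of getting a genuinely single partition afterwards (path is illustrative):
  val lines = sc.textFile("hdfs:///path/to/file", minPartitions = 1)  // may still produce one partition per input split
  val single = lines.coalesce(1)                                      // collapses to one partition without a shuffle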

Re: Host Error on EC2 while accessing hdfs from stadalone

2014-12-30 Thread Aniket Bhatnagar
Did you check firewall rules in security groups? On Tue, Dec 30, 2014, 9:34 PM Laeeq Ahmed wrote: > Hi, > > I am using spark standalone on EC2. I can access ephemeral hdfs from > spark-shell interface but I can't access hdfs in standalone application. I > am using spark 1.2.0 with hadoop 2.4.0 a

Re: action progress in ipython notebook?

2014-12-29 Thread Aniket Bhatnagar
adding a > facade / stable interface in front of JobProgressListener and thus has > little to no risk to introduce new bugs elsewhere in Spark. > > > > On Mon, Dec 29, 2014 at 3:08 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> Hi Josh >>

Re: action progress in ipython notebook?

2014-12-29 Thread Aniket Bhatnagar
Hi Josh Is there documentation available for status API? I would like to use it. Thanks, Aniket On Sun Dec 28 2014 at 02:37:32 Josh Rosen wrote: > The console progress bars are implemented on top of a new stable "status > API" that was added in Spark 1.2. It's possible to query job progress >

Re: Spark 1.2.0 Yarn not published

2014-12-29 Thread Aniket Bhatnagar
d61V1/Spark-yarn+1.2.0&subj=Re+spark+yarn_2+10+1+2+0+artifacts > > Cheers > > On Dec 28, 2014, at 11:13 PM, Aniket Bhatnagar > wrote: > > Hi all > > I just realized that spark-yarn artifact hasn't been published for 1.2.0 > release. Any particular reason

Spark 1.2.0 Yarn not published

2014-12-28 Thread Aniket Bhatnagar
Hi all I just realized that spark-yarn artifact hasn't been published for 1.2.0 release. Any particular reason for that? I was using it in my yet another spark-job-server project to submit jobs to a YARN cluster through convenient REST APIs (with some success). The job server was creating SparkCon

Re: Are lazy values created once per node or once per partition?

2014-12-17 Thread Aniket Bhatnagar
I would think that it has to be per worker. On Wed, Dec 17, 2014, 6:32 PM Ashic Mahtab wrote: > Hello, > Say, I have the following code: > > let something = Something() > > someRdd.foreachRdd(something.someMethod) > > And in something, I have a lazy member variable that gets created in > somethi

Re: Spark 1.1.0 does not spawn more than 6 executors in yarn-client mode and ignores --num-executors

2014-12-16 Thread Aniket Bhatnagar
Hi guys I am hoping someone might have a clue on why this is happening. Otherwise I will have to delve into the YARN module's source code to better understand the issue. On Wed, Dec 10, 2014, 11:54 PM Aniket Bhatnagar wrote: > I am running spark 1.1.0 on AWS EMR and I am running a batch

Re: Streaming | Partition count mismatch exception while saving data in RDD

2014-12-16 Thread Aniket Bhatnagar
It turns out that this happens when checkpoint is set to a local directory path. I have opened a JIRA SPARK-4862 for Spark streaming to output better error message. Thanks, Aniket On Tue Dec 16 2014 at 20:08:13 Aniket Bhatnagar wrote: > I am using spark 1.1.0 running a streaming job that u

Streaming | Partition count mismatch exception while saving data in RDD

2014-12-16 Thread Aniket Bhatnagar
I am using spark 1.1.0 running a streaming job that uses updateStateByKey and then (after a bunch of maps/flatMaps) does a foreachRDD to save data in each RDD by making HTTP calls. The issue is that each time I attempt to save the RDD (using foreach on RDD), it gives me the following exception: or

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Aniket Bhatnagar
e 397 in Playframework logs as > follows. > https://gist.github.com/TomoyaIgarashi/9688bdd5663af95ddd4d > > Is there any problem? > > > 2014-12-15 18:48 GMT+09:00 Aniket Bhatnagar : >> >> Try the workaround (addClassPathJars(sparkContext, >> this.getClass.getClas

Re: Spark with HBase

2014-12-15 Thread Aniket Bhatnagar
In case you are still looking for help, there has been multiple discussions in this mailing list that you can try searching for. Or you can simply use https://github.com/unicredit/hbase-rdd :-) Thanks, Aniket On Wed Dec 03 2014 at 16:11:47 Ted Yu wrote: > Which hbase release are you running ? >

Re: Serialization issue when using HBase with Spark

2014-12-15 Thread Aniket Bhatnagar
"The reason not using sc.newAPIHadoopRDD is it only support one scan each time." I am not sure is that's true. You can use multiple scans as following: val scanStrings = scans.map(scan => convertScanToString(scan)) conf.setStrings(MultiTableInputFormat.SCANS, scanStrings : _*) where convertScanT

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Aniket Bhatnagar
Try the workaround (addClassPathJars(sparkContext, this.getClass.getClassLoader) discussed in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAJOb8buD1B6tUtOfG8_Ok7F95C3=r-zqgffoqsqbjdxd427...@mail.gmail.com%3E Thanks, Aniket On Mon Dec 15 2014 at 07:43:24 Tomoya Igarashi < to

Re: Having problem with Spark streaming with Kinesis

2014-12-14 Thread Aniket Bhatnagar
The reason is because of the following code: val numStreams = numShards val kinesisStreams = (0 until numStreams).map { i => KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval, InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2) } In the above co
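The preview is cut off; a sketch of how such per-shard streams are typically combined before processing (each shard keeps its own receiver, which is why the cluster needs more cores than shards):
  val unionedStream = ssc.union(kinesisStreams)          // one DStream over all shards
  unionedStream.foreachRDD { rdd =>
    // process the combined records here (placeholder)
  }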

Spark 1.1.0 does not spawn more than 6 executors in yarn-client mode and ignores --num-executors

2014-12-10 Thread Aniket Bhatnagar
I am running spark 1.1.0 on AWS EMR and I am running a batch job that seems to be highly parallelizable in yarn-client mode. But spark stops spawning any more executors after spawning 6 executors even though the YARN cluster has 15 healthy m1.large nodes. I even tried providing '--num-executors 6

Re: Programmatically running spark jobs using yarn-client

2014-12-09 Thread Aniket Bhatnagar
your code (sbt > package will give you one inside target/scala-*/projectname-*.jar) and then > use it while submitting. If you are not using spark-submit then you can > simply add this jar to spark by > sc.addJar("/path/to/target/scala*/projectname*jar") > > Thanks

Programmatically running spark jobs using yarn-client

2014-12-08 Thread Aniket Bhatnagar
I am trying to create (yet another) spark as a service tool that lets you submit jobs via REST APIs. I think I have nearly gotten it to work barring a few issues. Some of these seem to be already fixed in 1.2.0 (like SPARK-2889), but I have hit a road block with the following issue. I have created a sim

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar
<https://twitter.com/ashrafuzzaman> | Blog > <http://jitu-blog.blogspot.com/> | Facebook > <https://www.facebook.com/ashrafuzzaman.jitu> > > Check out The Academy <http://newscred.com/theacademy>, your #1 source > for free content marketing resources > &

Re: Having problem with Spark streaming with Kinesis

2014-11-26 Thread Aniket Bhatnagar
What's your cluster size? For streaming to work, it needs shards + 1 executors. On Wed, Nov 26, 2014, 5:53 PM A.K.M. Ashrafuzzaman < ashrafuzzaman...@gmail.com> wrote: > Hi guys, > When we are using Kinesis with 1 shard then it works fine. But when we use > more that 1 then it falls into an infini

Re: Getting spark job progress programmatically

2014-11-19 Thread Aniket Bhatnagar
> https://github.com/apache/spark/pull/3009 > > If there is anything that needs to be added, please add it to those issues > or PRs. > > On Wed, Nov 19, 2014 at 7:55 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> I have for now submitted a JIRA ticket @ >

Re: Getting spark job progress programmatically

2014-11-19 Thread Aniket Bhatnagar
in the public API. > > Any other ideas? > > On Tue Nov 18 2014 at 4:03:35 PM Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> Thanks Andy. This is very useful. This gives me all active stages & their >> percentage completion but I am unable to

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Aniket Bhatnagar
; My reading of org/ apache/spark/Aggregator.scala is that your function > will always see the items in the order that they are in the input RDD. An > RDD partition is always accessed as an iterator, so it will not be read out > of order. > > On Wed, Nov 19, 2014 at 2:28 PM, Anik

Re: Is sorting persisted after pair rdd transformations?

2014-11-19 Thread Aniket Bhatnagar
so 8 is the first > number that demonstrates this.) > > On Wed, Nov 19, 2014 at 9:05 AM, Akhil Das > wrote: > >> If something is persisted you can easily see them under the Storage tab >> in the web ui. >> >> Thanks >> Best Regards >> >> On Tue

Re: Getting spark job progress programmatically

2014-11-18 Thread Aniket Bhatnagar
14 at 19:40:06 andy petrella wrote: > I started some quick hack for that in the notebook, you can head to: > https://github.com/andypetrella/spark-notebook/ > blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala > > On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatna

Is sorting persisted after pair rdd transformations?

2014-11-18 Thread Aniket Bhatnagar
I am trying to figure out if sorting is persisted after applying Pair RDD transformations and I am not able to decisively tell after reading the documentation. For example: val numbers = .. // RDD of numbers val pairedNumbers = numbers.map(number => (number % 100, number)) val sortedPairedNumbers
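A sketch completing the kind of example being described (the elided pieces are assumptions):
  val numbers = sc.parallelize(1L to 1000L)              // stand-in for the elided RDD of numbers
  val pairedNumbers = numbers.map(number => (number % 100, number))
  val sortedPairedNumbers = pairedNumbers.sortByKey()    // sorted here...
  val transformed = sortedPairedNumbers.mapValues(_ * 2) // ...is the ordering still guaranteed after this transformation?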

Getting spark job progress programmatically

2014-11-18 Thread Aniket Bhatnagar
I am writing yet another Spark job server and have been able to submit jobs and return/save results. I let multiple jobs use the same spark context but I set a job group while firing each job so that I can cancel jobs in the future. Further, what I desire to do is provide some kind of status update/prog

Re: Out of memory with Spark Streaming

2014-10-31 Thread Aniket Bhatnagar
may require some further investigation on my part. i > wanna stay on top of this if it's an issue. > > thanks for posting this, aniket! > > -chris > > On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> Hi all &

Re: Is SparkSQL + JDBC server a good approach for caching?

2014-10-24 Thread Aniket Bhatnagar
Just curious... Why would you not store the processed results in regular relational database? Not sure what you meant by persist the appropriate RDDs. Did you mean output of your job will be RDDs? On 24 October 2014 13:35, ankits wrote: > I want to set up spark SQL to allow ad hoc querying over

Spark doesn't retry task while writing to HDFS

2014-10-24 Thread Aniket Bhatnagar
Hi all I have written a job that reads data from HBASE and writes to HDFS (fairly simple). While running the job, I noticed that a few of the tasks failed with the following error. Quick googling on the error suggests that it's an unexplained error and is perhaps intermittent. What I am curious to

[Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2014-09-24 Thread Aniket Bhatnagar
Hi all Reading through Spark streaming's custom receiver documentation, it is recommended that onStart and onStop methods should not block indefinitely. However, looking at the source code of KinesisReceiver, the onStart method calls worker.run, which blocks until the worker is shut down (via a call to o

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-23 Thread Aniket Bhatnagar
run on the cluster? Thanks, Aniket On 22 September 2014 18:14, Aniket Bhatnagar wrote: > Hi all > > I was finally able to figure out why this streaming appeared stuck. The > reason was that I was running out of workers in my standalone deployment of > Spark. There was no feedback

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-22 Thread Aniket Bhatnagar
yarn-client mode and this is again giving the same problem. Is it possible to run out of workers in YARN mode as well? If so, how can I figure that out? Thanks, Aniket On 19 September 2014 18:07, Aniket Bhatnagar wrote: > Apologies in delay in getting back on this. It seems the Kinesis exam

Re: Bulk-load to HBase

2014-09-19 Thread Aniket Bhatnagar
can be slow. > > So, it would be nice if bulk-load could be used, since it bypasses the > write path. > > > > Thanks. > > > > *From:* Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com > ] > *Sent:* Friday, September 19, 2014 9:01 PM > *To:* innowireless TaeYun Kim

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-19 Thread Aniket Bhatnagar
> Also ccing, chris fregly who wrote Kinesis integration. > > TD > > > > > On Thu, Sep 11, 2014 at 4:51 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> Hi all >> >> I am trying to run kinesis spark streaming application on a standalone

Re: Bulk-load to HBase

2014-09-19 Thread Aniket Bhatnagar
I have been using saveAsNewAPIHadoopDataset but I use TableOutputFormat instead of HFileOutputFormat. But, hopefully this should help you: val hbaseZookeeperQuorum = s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath" val conf = HBaseConfiguration.create() conf.set("hbase.zookeeper.quorum", hbase
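A sketch of how this setup usually continues (the keys beyond the zookeeper quorum, the table name, and the RDD type are assumptions):
  import org.apache.hadoop.hbase.client.Put
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
  import org.apache.hadoop.mapreduce.Job

  conf.set("hbase.zookeeper.quorum", hbaseZookeeperQuorum)
  conf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")                    // illustrative table name
  val job = Job.getInstance(conf)
  job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])
  putsRdd.saveAsNewAPIHadoopDataset(job.getConfiguration)                 // putsRdd: RDD[(ImmutableBytesWritable, Put)]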

Re: Out of memory with Spark Streaming

2014-09-12 Thread Aniket Bhatnagar
t; Which version of spark are you running? > > If you are running the latest one, then could try running not a window but > a simple event count on every 2 second batch, and see if you are still > running out of memory? > > TD > > > On Thu, Sep 11, 2014 at 10:34 AM,

Re: Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-12 Thread Aniket Bhatnagar
ver.namenode.NameNode.(NameNode.java:720) > > I am searching the web already for a week trying to figure out how to make > this work :-/ > > all the help or hints are greatly appreciated > reinis > > > -- > -Original-Nachricht- >

Re: Re[2]: HBase 0.96+ with Spark 1.0+

2014-09-11 Thread Aniket Bhatnagar
Dependency hell... My fav problem :). I had run into a similar issue with hbase and jetty. I can't remember the exact fix, but here are excerpts from my dependencies that may be relevant: val hadoop2Common = "org.apache.hadoop" % "hadoop-common" % hadoop2Version excludeAll(
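The exclusion list itself is cut off above; a sketch of how such exclusions typically look in sbt (the specific rules shown are assumptions):
  val hadoop2Common = "org.apache.hadoop" % "hadoop-common" % hadoop2Version excludeAll(
    ExclusionRule(organization = "javax.servlet"),       // illustrative: drop conflicting servlet/jetty artifacts
    ExclusionRule(organization = "org.mortbay.jetty")
  )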

Re: Spark on Raspberry Pi?

2014-09-11 Thread Aniket Bhatnagar
Just curious... What's the use case you are looking to implement? On Sep 11, 2014 10:50 PM, "Daniil Osipov" wrote: > Limited memory could also cause you some problems and limit usability. If > you're looking for a local testing environment, vagrant boxes may serve you > much better. > > On Thu, S

Re: Out of memory with Spark Streaming

2014-09-11 Thread Aniket Bhatnagar
> You could set "spark.executor.memory" to something bigger than the > default (512mb) > > > On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar < > aniket.bhatna...@gmail.com> wrote: > >> I am running a simple Spark Streaming program that pulls in data from >&

Out of memory with Spark Streaming

2014-09-11 Thread Aniket Bhatnagar
I am running a simple Spark Streaming program that pulls in data from Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps data and persists to a store. The program is running in local mode right now and runs out of memory after a while. I am yet to investigate heap dumps but

Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-11 Thread Aniket Bhatnagar
Hi all I am trying to run a kinesis spark streaming application on a standalone spark cluster. The job works fine in local mode but when I submit it (using spark-submit), it doesn't do anything. I enabled logs for the org.apache.spark.streaming.kinesis package and I regularly get the following in worker

Using Spark's ActorSystem for performing analytics using Akka

2014-09-02 Thread Aniket Bhatnagar
Sorry about the noob question, but I was just wondering if we use Spark's ActorSystem (SparkEnv.actorSystem), would it distribute actors across worker nodes or would the actors only run in the driver JVM?

[Streaming] Triggering an action in absence of data

2014-09-01 Thread Aniket Bhatnagar
Hi all I am struggling to implement a use case wherein I need to trigger an action in case no data has been received for X amount of time. I haven't been able to figure out an easy way to do this. No state/foreach methods get called when no data has arrived. I thought of generating a 'tick' DStrea
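A sketch of the 'tick' idea mentioned at the end (the types and names are illustrative, and this checks per batch rather than over an arbitrary window of time):
  import org.apache.spark.streaming.dstream.ConstantInputDStream
  val ticks = new ConstantInputDStream(ssc, ssc.sparkContext.emptyRDD[Event])  // emits an empty RDD every batch interval
  dataStream.union(ticks).foreachRDD { rdd =>
    if (rdd.take(1).isEmpty) {
      // no real data arrived in this batch; trigger the timeout action here (placeholder)
    }
  }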
