ors.
>
> now with dataframes the memory usage is both on and off heap, and we have
> no way of limiting the off-heap memory usage by spark, yet yarn enforces a
> maximum total memory usage and kills the executor if you go over it.
>
> On Fri, Nov 25, 2016 at 12:14 PM, Anik
r node.
Sent from my Windows 10 phone
*From: *Rodrick Brown
*Sent: *Friday, November 25, 2016 12:25 AM
*To: *Aniket Bhatnagar
*Cc: *user
*Subject: *Re: OS killing Executor due to high (possibly off heap) memory
usage
Try setting spark.yarn.executor.memoryOverhead 1
On Thu, Nov 24, 2
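For reference, a minimal sketch of the memoryOverhead suggestion above (values are illustrative; the overhead reserves headroom inside the YARN container for off-heap and native memory):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "16g")                // on-heap executor memory (illustrative)
  .set("spark.yarn.executor.memoryOverhead", "4096")  // MB of headroom for off-heap usage (illustrative)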
Hi Spark users
I am running a job that does a join of a huge dataset (7 TB+) and the
executors keep crashing randomly, eventually causing the job to crash.
There are no out-of-memory exceptions in the logs, and looking at the dmesg
output, it seems like the OS killed the JVM because of high memory usa
Try changing compression to bzip2 or lzo. For reference -
http://comphadoop.weebly.com
Thanks,
Aniket
On Mon, Nov 21, 2016, 10:18 PM yeshwanth kumar
wrote:
> Hi,
>
> we are running Hive on Spark, we have an external table over snappy
> compressed csv file of size 917.4 M
> HDFS block size is se
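A minimal sketch of the compression suggestion above, assuming the CSV is re-compressed with bzip2 (which, unlike snappy on a plain text file, is splittable, so the scan is no longer limited to a single task); the path is illustrative:

val lines = sc.textFile("hdfs:///data/events.csv.bz2")   // hypothetical bzip2-compressed copy
println(lines.partitions.length)                          // should now reflect the HDFS block count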
Also, how are you launching the application? Through spark submit or
creating spark content in your app?
Thanks,
Aniket
On Wed, Nov 16, 2016 at 10:44 AM Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:
> Thanks for sharing the thread dump. I had a look at them and couldn't f
g thread dumps and I will see if I can
> figure out what is going on.
>
>
> On Sunday, November 6, 2016 10:02 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>
> I doubt it's GC as you mentioned that the pause is several minutes. Since
> it'
ns is less? Or you want to do it for a
> perf gain?
>
>
>
> Also, what were your initial Dataset partitions and how many did you have
> for the result of join?
>
>
>
> *From:* Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com]
> *Sent:* Friday, November 11, 2016 9:2
Hi
I can't seem to find a way to pass the number of partitions while joining 2
Datasets or doing a groupBy operation on a Dataset. There is an option of
repartitioning the resultant Dataset, but it's inefficient to repartition
after the Dataset has been joined/grouped into the default number of
partitions.
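A sketch of the usual workaround (assuming a Spark 2.x SparkSession named spark, where the post-shuffle partition count for joins/groupBy comes from spark.sql.shuffle.partitions; Dataset names and the value are illustrative):

spark.conf.set("spark.sql.shuffle.partitions", "400")   // set before the join instead of repartitioning after
val joined = leftDs.join(rightDs, Seq("id"))            // leftDs/rightDs: hypothetical Datasets sharing an "id" column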
d I don't know how to
> interpret what I saw... I don't think it could be my code directly, since
> at this point my code has all completed? Could GC be taking that long?
>
> (I could also try grabbing the thread dumps and pasting them here, if that
> would help?)
>
> O
In order to know what's going on, you can study the thread dumps either
from the Spark UI or from any other thread dump analysis tool.
Thanks,
Aniket
On Sun, Nov 6, 2016 at 1:31 PM Michael Johnson
wrote:
> I'm doing some processing and then clustering of a small dataset (~150
> MB). Everything seem
If people agree that is desired, I am willing to submit a SIP for this and
find time to work on it.
Thanks,
Aniket
On Sun, Nov 6, 2016 at 1:06 PM Aniket Bhatnagar
wrote:
> Hello
>
> Dynamic allocation feature allows you to add executors and scale
> computation power. This is great
Hello
The dynamic allocation feature allows you to add executors and scale
computation power. This is great; however, I feel like we also need a way
to dynamically scale storage. Currently, if the disk is not able to hold
the spilled/shuffle data, the job is aborted causing frustration and loss
of tim
Issue raised - SPARK-18251
On Wed, Nov 2, 2016, 9:12 PM Aniket Bhatnagar
wrote:
> Hi all
>
> I am running into a runtime exception when a DataSet is holding an Empty
> object instance for an Option type that is holding non-nullable field. For
> instance, if we have the follo
Hi all
I am running into a runtime exception when a DataSet is holding an empty
(None) instance of an Option type whose case class has a non-nullable field. For
instance, if we have the following case class:
case class DataRow(id: Int, value: String)
Then, DataSet[Option[DataRow]] can only hold Some(D
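A minimal sketch of the kind of code being described (the case class is from the snippet above; the rest is assumed), which reproduces the reported failure when a None element is encoded:

import spark.implicits._                           // assumes a SparkSession named spark

case class DataRow(id: Int, value: String)         // id is a non-nullable Int

val ds = Seq(Some(DataRow(1, "a")), None).toDS()   // Dataset[Option[DataRow]]
ds.collect()                                       // the None element is where the runtime exception was reported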
to read the data in situ without a copy.
>
> I understand that manually assigning tasks to nodes reduces fault
> tolerance, but the simulation codes already explicitly assign tasks, so a
> failure of any one node is already a full-job failure.
>
> On Mon, Jul 18, 2016 at 3:43 PM Anik
You can't assume that the number of nodes will be constant, as some may
fail, hence you can't guarantee that a function will execute at most once
or at least once on a node. Can you explain your use case in a bit more
detail?
On Mon, Jul 18, 2016, 10:57 PM joshuata wrote:
> I am working on a spark
Are you running on YARN or standalone?
On Thu, Dec 31, 2015, 3:35 PM LinChen wrote:
> *Screenshot1(Normal WebUI)*
>
>
>
> *Screenshot2(Corrupted WebUI)*
>
>
>
> As screenshot2 shows, the format of my Spark WebUI looks strange and I
> cannot click the description of active jobs. It seems there is
NG_STATS);
>
>JavaPairDStream> newData =
> stages.filter(NEW_STATS);
>
>newData.foreachRDD{
> rdd.forEachPartition{
>//Store to external storage.
> }
> }
>
> Without using updateStateByKey, I only have the stats of t
Can you try storing the state (word count) in an external key value store?
On Sat, Nov 7, 2015, 8:40 AM Thúy Hằng Lê wrote:
> Hi all,
>
> Could anyone help me on this? It's a bit urgent for me.
> I'm very confused and curious about Spark data checkpoint performance. Is
> there any detail
Are you sure RedisClientPool is being initialized properly in the
constructor of RedisCache? Can you please copy paste the code that you use
to initialize RedisClientPool inside the constructor of RedisCache?
Thanks,
Aniket
On Fri, Oct 23, 2015 at 11:47 AM Bin Wang wrote:
> BTW, "lines" is a DS
Can you try changing classOf[OutputFormat[String,
BoxedUnit]] to classOf[OutputFormat[String,
Put]] while configuring hconf?
On Sat, Oct 17, 2015, 11:44 AM Amit Hora wrote:
> Hi,
>
> Regrets for the delayed response
> please find below full stack trace
>
> java.lang.ClassCastException: scala.runtime
Is it possible for you to use EMR instead of EC2? If so, you may be able to
tweak EMR bootstrap scripts to install your custom spark build.
Thanks,
Aniket
On Thu, Oct 8, 2015 at 5:58 PM Theodore Vasiloudis <
theodoros.vasilou...@gmail.com> wrote:
> Hello,
>
> I was wondering if there is an easy
eStateByKey(ProbabilityCalculator.updateCountsOfProcessGivenRole,
> new HashPartitioner(3), initialProcessGivenRoleRdd)
>
> I am getting an error -- 'missing arguments for method
> updateCountsOfProcessGivenRole'. Looking at the method calls, the function
> that is called for
Here is an example:
val interval = 60 * 1000
val counts = eventsStream.map(event => {
(event.timestamp - event.timestamp % interval, event)
}).updateStateByKey[Long](updateFunc = (events: Seq[Event], prevStateOpt:
Option[Long]) => {
val prevCount = prevStateOpt.getOrElse(0L)
val newCount = p
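The tail of the example above is cut off in the archive; a fuller sketch of the same idea follows (the Event type and the exact update logic are assumptions):

case class Event(timestamp: Long)

val interval = 60 * 1000
val counts = eventsStream                                // eventsStream: DStream[Event], as in the example above
  .map(event => (event.timestamp - event.timestamp % interval, event))
  .updateStateByKey[Long]((events: Seq[Event], prevStateOpt: Option[Long]) => {
    val prevCount = prevStateOpt.getOrElse(0L)
    Some(prevCount + events.size)                        // presumably: add the new events to the running count
  })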
Hi Rachna
Can you just use the http client provided via spark's transitive dependencies
instead of excluding them?
The reason "user classpath first" is failing could be that you have spark
artifacts in your assembly jar that don't match what is deployed
(version mismatch or you built the version y
+ _)
>
> val output = wordCounts.
>map({case ((user, word), count) => (user, (word, count))}).
>groupByKey()
>
> By Aniket, if we group by user first, it could run out of memory when
> spark tries to put all words in a single sequence, couldn't it?
>
> On Sat, S
Using scala API, you can first group by user and then use combineByKey.
Thanks,
Aniket
On Sat, Sep 19, 2015, 6:41 PM kali.tumm...@gmail.com
wrote:
> Hi All,
> I would like to achieve this below output using spark , I managed to write
> in Hive and call it in spark but not in just spark (scala),
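A minimal sketch of the group-by-user-then-combineByKey idea (the (user, word) pair RDD and the map-based accumulator are assumptions based on the thread):

// userWords: RDD[(String, String)] of (user, word) pairs (hypothetical)
val wordCountsPerUser = userWords.combineByKey[Map[String, Int]](
  (word: String) => Map(word -> 1),                                          // create a per-user word-count map
  (acc: Map[String, Int], word: String) => acc.updated(word, acc.getOrElse(word, 0) + 1),
  (a: Map[String, Int], b: Map[String, Int]) =>
    b.foldLeft(a) { case (m, (w, c)) => m.updated(w, m.getOrElse(w, 0) + c) } // merge per-partition maps
)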
Hi Eval
Can you check if your Ubuntu VM has enough RAM allocated to run JVM of size
3gb?
thanks,
Aniket
On Sat, Sep 19, 2015, 9:09 PM Eyal Altshuler
wrote:
> Hi,
>
> I had configured the MAVEN_OPTS environment variable the same as you wrote.
> My java version is 1.7.0_75.
> I didn't customize
; /Shahab
>
> On Fri, Sep 18, 2015 at 12:54 PM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> Can you try yarn-client mode?
>>
>> On Fri, Sep 18, 2015, 3:38 PM shahab wrote:
>>
>>> Hi,
>>>
>>> Probably I have
Can you try yarn-client mode?
On Fri, Sep 18, 2015, 3:38 PM shahab wrote:
> Hi,
>
> Probably I have wrong zeppelin configuration, because I get the following
> error when I execute spark statements in Zeppelin:
>
> org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't
> running
You can perhaps set up a WAL that logs to S3? The new cluster should pick up
the records that weren't processed due to the previous cluster's termination.
Thanks,
Aniket
On Thu, Sep 17, 2015, 9:19 PM Alan Dipert wrote:
> Hello,
> We are using Spark Streaming 1.4.1 in AWS EMR to process records from
> Kinesis.
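A sketch of that kind of setup (the WAL flag is the one Spark Streaming uses for receiver-based sources; the batch interval and S3 path are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("s3n://my-bucket/streaming-checkpoint")   // WAL segments are written under the checkpoint directory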
Hi Tom
There has to be a difference in classpaths between yarn-client and yarn-cluster
mode. Perhaps a good starting point would be to print the classpath as the first
thing in SimpleApp.main. It should give clues as to why it works in
yarn-cluster mode.
Thanks,
Aniket
On Wed, Sep 9, 2015, 2:11 PM Tom Sed
test it out?
>
>
>
> On Thu, May 21, 2015 at 1:43 AM, Tathagata Das
> wrote:
>
>> Thanks for the JIRA. I will look into this issue.
>>
>> TD
>>
>> On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar <
>> aniket.bhatna...@gmail.com> wrote:
>>
I ran into one of the issues that are potentially caused by this
and have logged a JIRA bug -
https://issues.apache.org/jira/browse/SPARK-7788
Thanks,
Aniket
On Wed, Sep 24, 2014 at 12:59 PM Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:
> Hi all
>
> Readin
the hashCode and equals method correctly.
>
>
>
> On Wednesday, April 15, 2015 at 3:10 PM, Aniket Bhatnagar wrote:
>
> > I am aggregating a dataset using combineByKey method and for a certain
> input size, the job fails with the following error. I have enabled heap
> dumps
I am aggregating a dataset using combineByKey method and for a certain
input size, the job fails with the following error. I have enabled heap
dumps to better analyze the issue and will report back if I have any
findings. Meanwhile, if you guys have any idea of what could possibly
result in this er
I have a job that sorts data and runs a combineByKey operation and it
sometimes fails with the following error. The job is running on spark 1.2.0
cluster with yarn-client deployment mode. Any clues on how to debug the
error?
org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output
This is tricky to debug. Check the logs of the YARN node manager and resource manager to
see if you can trace the error. In the past I have had to closely look at the
arguments getting passed to the YARN container (they get logged before
attempting to launch containers). If I still didn't get a clue, I had to
check the scri
Have you set master in SparkConf/SparkContext in your code? Driver logs
show in which mode the spark job is running. Double check if the logs
mention local or yarn-cluster.
Also, what's the error that you are getting?
On Wed, Feb 4, 2015, 6:13 PM kundan kumar wrote:
> Hi,
>
> I am trying to exec
t;: "true",
"spark.hadoop.fs.s3.consistent.metadata.tableName": "EmrFSMetadata",
"spark.hadoop.fs.s3.consistent.metadata.read.capacity": "500",
"spark.hadoop.fs.s3.consistent.metadata.write.capacity": "100",
"spark.hadoop.fs.s3.consistent.fastList"
e permissions for it.
> -Sven
>
> On Fri, Jan 30, 2015 at 6:44 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> I am programmatically submitting spark jobs in yarn-client mode on EMR.
>> Whenever a job tries to save a file to S3, it gives the below ment
I am programmatically submitting spark jobs in yarn-client mode on EMR.
Whenever a job tries to save a file to S3, it gives the below mentioned
exception. I think the issue might be that EMR is not set up properly, as I
have to set all hadoop configurations manually in SparkContext. However, I
am not sure
This might be helpful:
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
On Tue Jan 27 2015 at 07:45:18 Sharon Rapoport wrote:
> Hi,
>
> I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k].
> I got them by combining many [k,v]
(ThreadPoolExecutor.java:615)
[na:1.7.0_71]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_71]
On Wed Jan 21 2015 at 17:34:34 Aniket Bhatnagar
wrote:
> While implementing a spark server, I realized that Thread's context loader
> must be set to any dynamically loaded classloa
While implementing a spark server, I realized that the thread's context classloader
must be set to any dynamically loaded classloader so that ClosureCleaner
can do its thing. Should ClosureCleaner not use the classloader created by
SparkContext (which has all dynamically added jars via SparkContext.addJar)
When you repartition, ordering can get lost. You would need to sort after
repartitioning.
Aniket
On Tue, Jan 20, 2015, 7:08 AM anny9699 wrote:
> Hi,
>
> I am using Spark on AWS and want to write the output to S3. It is a
> relatively small file and I don't want them to output as multiple parts.
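A minimal sketch of the sort-after-repartition advice (RDD name and output path are illustrative):

// rdd: RDD[String] (hypothetical)
val singleSorted = rdd
  .repartition(1)                                        // repartitioning can reorder records
  .sortBy(identity)                                      // so sort again afterwards
singleSorted.saveAsTextFile("s3n://my-bucket/output")    // hypothetical S3 path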
Sorry. I couldn't understand the issue. Are you trying to send data to
kinesis from a spark batch/real time job?
- Aniket
On Fri, Jan 16, 2015, 9:40 PM Hafiz Mujadid
wrote:
> Hi Experts!
>
> I am using kinesis dependency as follow
> groupId = org.apache.spark
> artifactId = spark-streaming-kin
Sure there is. Create a new RDD just containing the schema line (hint: use
sc.parallelize) and then union both the RDDs (the header RDD and data RDD)
to get a final desired RDD.
On Thu Jan 15 2015 at 19:48:52 Hafiz Mujadid
wrote:
> hi experts!
>
> I have an RDD[String] and I want to add schema li
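A minimal sketch of that suggestion (the header string and path are illustrative):

val data = sc.textFile("hdfs:///data/input")             // existing RDD[String] (hypothetical path)
val header = sc.parallelize(Seq("id,name,value"), 1)     // single-partition RDD holding the schema line
val withHeader = header.union(data)                      // the header partition comes first in the result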
Are you using spark in standalone mode, yarn, or mesos? If it's yarn,
please mention the hadoop distribution and version. What spark distribution
are you using (it seems 1.2.0 but compiled with which hadoop version)?
Thanks, Aniket
On Thu, Jan 15, 2015, 4:59 PM Hafiz Mujadid
wrote:
> Hi, Exper
d putting the
> Spark assembly first in the classpath fixes the issue. I expect that the
> version of Parquet that's being included in the EMR libs just needs to be
> upgraded.
>
>
> ~ Jonathan Kelly
>
> From: Aniket Bhatnagar
> Date: Sunday, January 4, 2015 at
t; libraries are java-only (the scala version appended there is just for
> helping the build scripts).
>
> But it does look weird, so it would be nice to fix it.
>
> On Wed, Jan 7, 2015 at 12:25 AM, Aniket Bhatnagar
> wrote:
> > It seems that spark-network-yarn compiled for sc
It seems that spark-network-yarn compiled for scala 2.11 depends on
spark-network-shuffle compiled for scala 2.10. This causes cross-version
dependency conflicts in sbt. Seems like a publishing error?
http://www.uploady.com/#!/download/6Yn95UZA0DR/3taAJFjCJjrsSXOR
Can you confirm your emr version? Could it be because of the classpath
entries for emrfs? You might face issues with using S3 without them.
Thanks,
Aniket
On Mon, Jan 5, 2015, 11:16 AM Adam Gilmore wrote:
> Just an update on this - I found that the script by Amazon was the culprit
> - not exact
Go through spark API documentation. Basically you have to do group by
(date, message_type) and then do a count.
On Sun, Jan 4, 2015, 9:58 PM Dinesh Vallabhdas
wrote:
> A spark cassandra newbie question. Thanks in advance for the help.
> I have a cassandra table with 2 columns message_timestamp(t
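A minimal sketch of that group-by-and-count (the RDD of (date, message_type) pairs is an assumption derived from the table description):

// messages: RDD[(String, String)] of (message_date, message_type), read from the Cassandra table (hypothetical)
val countsByDateAndType = messages
  .map { case (date, messageType) => ((date, messageType), 1) }
  .reduceByKey(_ + _)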
of partitions value is to increase number of partitions
> not reduce it from default value.
>
> On Thu, Jan 1, 2015 at 10:43 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> I am trying to read a file into a single partition but it seems like
I am trying to read a file into a single partition but it seems like
sparkContext.textFile ignores the passed minPartitions value. I know I can
repartition the RDD but I was curious to know if this is expected or if
this is a bug that needs to be further investigated?
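As noted in the quoted reply above, minPartitions only raises the partition count; a sketch of the usual way to force a single partition (the path is illustrative):

val singlePartition = sc.textFile("hdfs:///data/file.txt").coalesce(1)
println(singlePartition.partitions.length)   // 1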
Did you check firewall rules in security groups?
On Tue, Dec 30, 2014, 9:34 PM Laeeq Ahmed
wrote:
> Hi,
>
> I am using spark standalone on EC2. I can access ephemeral hdfs from
> spark-shell interface but I can't access hdfs in standalone application. I
> am using spark 1.2.0 with hadoop 2.4.0 a
adding a
> facade / stable interface in front of JobProgressListener and thus has
> little to no risk to introduce new bugs elsewhere in Spark.
>
>
>
> On Mon, Dec 29, 2014 at 3:08 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> Hi Josh
>>
Hi Josh
Is there documentation available for status API? I would like to use it.
Thanks,
Aniket
On Sun Dec 28 2014 at 02:37:32 Josh Rosen wrote:
> The console progress bars are implemented on top of a new stable "status
> API" that was added in Spark 1.2. It's possible to query job progress
>
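A minimal sketch of querying that status API via SparkContext (method names as exposed by Spark 1.2's SparkStatusTracker; the job group is hypothetical):

val tracker = sc.statusTracker
for (jobId <- tracker.getJobIdsForGroup("my-job-group"); info <- tracker.getJobInfo(jobId)) {
  println(s"job $jobId: ${info.status}")   // JobExecutionStatus: RUNNING, SUCCEEDED, FAILED or UNKNOWN
}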
d61V1/Spark-yarn+1.2.0&subj=Re+spark+yarn_2+10+1+2+0+artifacts
>
> Cheers
>
> On Dec 28, 2014, at 11:13 PM, Aniket Bhatnagar
> wrote:
>
> Hi all
>
> I just realized that spark-yarn artifact hasn't been published for 1.2.0
> release. Any particular reason
Hi all
I just realized that the spark-yarn artifact hasn't been published for the 1.2.0
release. Any particular reason for that? I was using it in my (yet another)
spark-job-server project to submit jobs to a YARN cluster through
convenient REST APIs (with some success). The job server was creating
SparkCon
I would think that it has to be per worker.
On Wed, Dec 17, 2014, 6:32 PM Ashic Mahtab wrote:
> Hello,
> Say, I have the following code:
>
> let something = Something()
>
> someRdd.foreachRdd(something.someMethod)
>
> And in something, I have a lazy member variable that gets created in
> somethi
Hi guys
I am hoping someone might have a clue as to why this is happening. Otherwise I
will have to delve into the YARN module's source code to better understand the
issue.
On Wed, Dec 10, 2014, 11:54 PM Aniket Bhatnagar
wrote:
> I am running spark 1.1.0 on AWS EMR and I am running a batch
It turns out that this happens when the checkpoint is set to a local directory
path. I have opened JIRA SPARK-4862 for Spark streaming to output a better
error message.
Thanks,
Aniket
On Tue Dec 16 2014 at 20:08:13 Aniket Bhatnagar
wrote:
> I am using spark 1.1.0 running a streaming job that u
I am using spark 1.1.0 running a streaming job that uses updateStateByKey
and then (after a bunch of maps/flatMaps) does a foreachRDD to save data in
each RDD by making HTTP calls. The issue is that each time I attempt to
save the RDD (using foreach on RDD), it gives me the following exception:
or
e 397 in Playframework logs as
> follows.
> https://gist.github.com/TomoyaIgarashi/9688bdd5663af95ddd4d
>
> Is there any problem?
>
>
> 2014-12-15 18:48 GMT+09:00 Aniket Bhatnagar :
>>
>> Try the workaround (addClassPathJars(sparkContext,
>> this.getClass.getClas
In case you are still looking for help, there have been multiple discussions
in this mailing list that you can try searching for. Or you can simply use
https://github.com/unicredit/hbase-rdd :-)
Thanks,
Aniket
On Wed Dec 03 2014 at 16:11:47 Ted Yu wrote:
> Which hbase release are you running ?
>
"The reason not using sc.newAPIHadoopRDD is it only support one scan each
time."
I am not sure if that's true. You can use multiple scans as follows:
val scanStrings = scans.map(scan => convertScanToString(scan))
conf.setStrings(MultiTableInputFormat.SCANS, scanStrings : _*)
where convertScanT
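A sketch of the helper being referenced (this mirrors what HBase's TableMapReduceUtil.convertScanToString does internally in the 0.98/1.x line; treat the exact classes as an assumption):

import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.Base64

def convertScanToString(scan: Scan): String =
  Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray)   // serialize the Scan and base64-encode it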
Try the workaround (addClassPathJars(sparkContext,
this.getClass.getClassLoader)) discussed in
http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAJOb8buD1B6tUtOfG8_Ok7F95C3=r-zqgffoqsqbjdxd427...@mail.gmail.com%3E
Thanks,
Aniket
On Mon Dec 15 2014 at 07:43:24 Tomoya Igarashi <
to
The reason is because of the following code:
val numStreams = numShards
val kinesisStreams = (0 until numStreams).map { i =>
  KinesisUtils.createStream(ssc, streamName, endpointUrl, kinesisCheckpointInterval,
    InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK_2)
}
In the above co
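The per-shard streams created above are typically unioned into a single DStream before processing (which is also why the cluster needs at least shards + 1 cores: one receiver per shard plus room for processing); a minimal sketch:

val unifiedStream = ssc.union(kinesisStreams)   // kinesisStreams: one receiver-backed DStream per shard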
I am running spark 1.1.0 on AWS EMR and I am running a batch job, in
yarn-client mode, that seems to be highly parallelizable. But spark
stops spawning any more executors after spawning 6 executors even though
the YARN cluster has 15 healthy m1.large nodes. I even tried providing
'--num-executors 6
your code (sbt
> package will give you one inside target/scala-*/projectname-*.jar) and then
> use it while submitting. If you are not using spark-submit then you can
> simply add this jar to spark by
> sc.addJar("/path/to/target/scala*/projectname*jar")
>
> Thanks
I am trying to create (yet another) spark-as-a-service tool that lets you
submit jobs via REST APIs. I think I have nearly gotten it to work barring a
few issues, some of which seem already fixed in 1.2.0 (like SPARK-2889), but
I have hit a roadblock with the following issue.
I have created a sim
<https://twitter.com/ashrafuzzaman> | Blog
> <http://jitu-blog.blogspot.com/> | Facebook
> <https://www.facebook.com/ashrafuzzaman.jitu>
>
> Check out The Academy <http://newscred.com/theacademy>, your #1 source
> for free content marketing resources
>
What's your cluster size? For streaming to work, it needs shards + 1
executors.
On Wed, Nov 26, 2014, 5:53 PM A.K.M. Ashrafuzzaman <
ashrafuzzaman...@gmail.com> wrote:
> Hi guys,
> When we are using Kinesis with 1 shard then it works fine. But when we use
> more than 1 then it falls into an infini
gt; https://github.com/apache/spark/pull/3009
>
> If there is anything that needs to be added, please add it to those issues
> or PRs.
>
> On Wed, Nov 19, 2014 at 7:55 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> I have for now submitted a JIRA ticket @
>
in the public API.
>
> Any other ideas?
>
> On Tue Nov 18 2014 at 4:03:35 PM Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> Thanks Andy. This is very useful. This gives me all active stages & their
>> percentage completion but I am unable to
; My reading of org/ apache/spark/Aggregator.scala is that your function
> will always see the items in the order that they are in the input RDD. An
> RDD partition is always accessed as an iterator, so it will not be read out
> of order.
>
> On Wed, Nov 19, 2014 at 2:28 PM, Anik
so 8 is the first
> number that demonstrates this.)
>
> On Wed, Nov 19, 2014 at 9:05 AM, Akhil Das
> wrote:
>
>> If something is persisted you can easily see them under the Storage tab
>> in the web ui.
>>
>> Thanks
>> Best Regards
>>
>> On Tue
14 at 19:40:06 andy petrella
wrote:
> I started some quick hack for that in the notebook, you can head to:
> https://github.com/andypetrella/spark-notebook/
> blob/master/common/src/main/scala/notebook/front/widgets/SparkInfo.scala
>
> On Tue Nov 18 2014 at 2:44:48 PM Aniket Bhatna
I am trying to figure out whether sort order is preserved after applying pair RDD
transformations, and I am not able to tell decisively after reading the
documentation.
For example:
val numbers = .. // RDD of numbers
val pairedNumbers = numbers.map(number => (number % 100, number))
val sortedPairedNumbers
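A sketch completing the kind of example being set up above (the tail was cut off; sortByKey is the natural next step, and the question is whether later pair transformations keep that order):

val numbers = sc.parallelize(1 to 1000)                            // illustrative RDD of numbers
val pairedNumbers = numbers.map(number => (number % 100, number))
val sortedPairedNumbers = pairedNumbers.sortByKey()
val mapped = sortedPairedNumbers.mapValues(_ * 2)                  // does this preserve the sort order?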
I am writing yet another Spark job server and have been able to submit jobs
and return/save results. I let multiple jobs use the same spark context, but
I set a job group while firing each job so that I can cancel jobs in the future.
Further, what I would like to do is provide some kind of status
update/prog
may require some further investigation on my part. i
> wanna stay on top of this if it's an issue.
>
> thanks for posting this, aniket!
>
> -chris
>
> On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> Hi all
&
Just curious... Why would you not store the processed results in a regular
relational database? Not sure what you meant by persisting the appropriate
RDDs. Did you mean the output of your job will be RDDs?
On 24 October 2014 13:35, ankits wrote:
> I want to set up spark SQL to allow ad hoc querying over
Hi all
I have written a job that reads data from HBase and writes to HDFS (fairly
simple). While running the job, I noticed that a few of the tasks failed
with the following error. Quick googling on the error suggests that it's an
unexplained error and is perhaps intermittent. What I am curious to
Hi all
Reading through Spark streaming's custom receiver documentation, it is
recommended that the onStart and onStop methods should not block indefinitely.
However, looking at the source code of KinesisReceiver, the onStart method
calls worker.run, which blocks until the worker is shut down (via a call to
o
run on the cluster?
Thanks,
Aniket
On 22 September 2014 18:14, Aniket Bhatnagar
wrote:
> Hi all
>
> I was finally able to figure out why this streaming appeared stuck. The
> reason was that I was running out of workers in my standalone deployment of
> Spark. There was no feedback
yarn-client mode and this is
again giving the same problem. Is it possible to run out of workers in YARN
mode as well? If so, how can I figure that out?
Thanks,
Aniket
On 19 September 2014 18:07, Aniket Bhatnagar
wrote:
> Apologies in delay in getting back on this. It seems the Kinesis exam
can be slow.
>
> So, it would be nice if bulk-load could be used, since it bypasses the
> write path.
>
>
>
> Thanks.
>
>
>
> *From:* Aniket Bhatnagar [mailto:aniket.bhatna...@gmail.com
> ]
> *Sent:* Friday, September 19, 2014 9:01 PM
> *To:* innowireless TaeYun Kim
> Also ccing, chris fregly who wrote Kinesis integration.
>
> TD
>
>
>
>
> On Thu, Sep 11, 2014 at 4:51 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> Hi all
>>
>> I am trying to run kinesis spark streaming application on a standalone
I have been using saveAsNewAPIHadoopDataset but I use TableOutputFormat
instead of HFileOutputFormat. But, hopefully this should help you:
val hbaseZookeeperQuorum =
s"$zookeeperHost:$zookeeperPort:$zookeeperHbasePath"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", hbase
t; Which version of spark are you running?
>
> If you are running the latest one, then could try running not a window but
> a simple event count on every 2 second batch, and see if you are still
> running out of memory?
>
> TD
>
>
> On Thu, Sep 11, 2014 at 10:34 AM,
ver.namenode.NameNode.(NameNode.java:720)
>
> I am searching the web already for a week trying to figure out how to make
> this work :-/
>
> all the help or hints are greatly appreciated
> reinis
>
>
> --
> -Original-Nachricht-
>
Dependency hell... My fav problem :).
I had run into a similar issue with hbase and jetty. I can't remember the
exact fix, but here are excerpts from my dependencies that may be relevant:
val hadoop2Common = "org.apache.hadoop" % "hadoop-common" % hadoop2Version
excludeAll(
Just curious... What's the use case you are looking to implement?
On Sep 11, 2014 10:50 PM, "Daniil Osipov" wrote:
> Limited memory could also cause you some problems and limit usability. If
> you're looking for a local testing environment, vagrant boxes may serve you
> much better.
>
> On Thu, S
> You could set "spark.executor.memory" to something bigger than the
> default (512mb)
>
>
> On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> I am running a simple Spark Streaming program that pulls in data from
>&
I am running a simple Spark Streaming program that pulls in data from
Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
data and persists to a store.
The program is running in local mode right now and runs out of memory after
a while. I am yet to investigate heap dumps but
Hi all
I am trying to run kinesis spark streaming application on a standalone
spark cluster. The job works fine in local mode, but when I submit it (using
spark-submit), it doesn't do anything. I enabled logs
for the org.apache.spark.streaming.kinesis package and I regularly get the
following in worker
Sorry about the noob question, but I was just wondering: if we use Spark's
ActorSystem (SparkEnv.actorSystem), would it distribute actors across
worker nodes, or would the actors only run in the driver JVM?
Hi all
I am struggling to implement a use case wherein I need to trigger an action
in case no data has been received for X amount of time. I haven't been able
to figure out an easy way to do this. No state/foreach methods get called
when no data has arrived. I thought of generating a 'tick' DStrea
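A minimal sketch of the 'tick' idea being described, using a ConstantInputDStream as the tick source (the timeout bookkeeping itself is only outlined):

import org.apache.spark.streaming.dstream.ConstantInputDStream

val ticks = new ConstantInputDStream(ssc, ssc.sparkContext.parallelize(Seq(1)))
ticks.foreachRDD { (_, time) =>
  // compare `time` with the timestamp of the last received record (tracked elsewhere, e.g. in the driver)
  // and trigger the no-data action if the gap exceeds the threshold
}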