I am looking to write an ETL job using Spark that reads data from the
source, performs transformations, and inserts it into the destination. I am
trying to understand how Spark deals with failures; I can't seem to find
the documentation. I am interested in learning about the following scenarios:
1. Source be
On Wed, Jun 8, 2016 at 3:42 AM, Jacek Laskowski wrote:
> On Wed, Jun 8, 2016 at 2:38 AM, Mohit Anchlia
> wrote:
> > I am looking to write an ETL job using Spark that reads data from the
> > source, performs transformations and inserts it into the destination.
>
> Is this
Is something similar to spark.streaming.receiver.writeAheadLog.enable
available on SparkContext? It looks like it only works for Spark Streaming.
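A minimal sketch of where each knob lives, with a placeholder app name and paths: the write-ahead log flag below is a Spark Streaming receiver setting, and for a plain SparkContext the nearest analogue is RDD checkpointing via setCheckpointDir.

    import org.apache.spark.{SparkConf, SparkContext}

    object FaultToleranceSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("wal-sketch")
          .setMaster("local[2]")
          // Only has an effect on Spark Streaming receivers, not on batch jobs.
          .set("spark.streaming.receiver.writeAheadLog.enable", "true")
        val sc = new SparkContext(conf)
        // For a batch SparkContext, fault tolerance comes from lineage plus
        // optional RDD checkpointing:
        sc.setCheckpointDir("/tmp/rdd-checkpoints")
        sc.stop()
      }
    }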
On Wed, Jun 8, 2016 at 3:14 PM, Mohit Anchlia
> wrote:
>
>> Is something similar to spark.streaming.receiver.writeAheadLog.enable
>> available on SparkContext? It looks like it only works for Spark Streaming.
>>
>
>
Does SparkR support all the algorithms that the R library supports?
I am running in yarn-client mode and trying to execute the network word count
example. When I connect through nc I see the following in the Spark app logs:
Exception in thread "main" java.lang.AssertionError: assertion failed: The
checkpoint directory has not been set. Please use
StreamingContext.checkpoi
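A minimal sketch of setting the checkpoint directory the assertion asks for, assuming a placeholder HDFS path and a local nc listener on port 9999; StreamingContext.getOrCreate rebuilds the context from the checkpoint on restart, which is the pattern the recoverable network word count example relies on.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointSketch {
      val checkpointDir = "hdfs:///tmp/wordcount-checkpoint"  // placeholder path

      def createContext(): StreamingContext = {
        val ssc = new StreamingContext(new SparkConf().setAppName("network-wordcount"), Seconds(1))
        ssc.checkpoint(checkpointDir)  // this is what the assertion is complaining about
        val lines = ssc.socketTextStream("localhost", 9999)  // e.g. started with `nc -lk 9999`
        lines.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }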
I am trying to run this program as a yarn-client. The job seems to be
submitting successfully, however I don't see any process listening on this
host on port
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaRecoverableNetworkWordCount.j
are running on a cluster, the listening is occurring on one of the
> executors, not in the driver.
>
> On Mon, Aug 10, 2015 at 10:29 AM, Mohit Anchlia
> wrote:
>
>> I am trying to run this program as a yarn-client. The job seems to be
>> submitting successfully, however I don't
spark-submit. The executors are running in the YARN cluster nodes, and the
> socket receiver listening on port is running in one of the executors.
>
> On Mon, Aug 10, 2015 at 11:43 AM, Mohit Anchlia
> wrote:
>
>> I am running as a yarn-client which probably means that the progr
submit your app in client mode on that to see whether you are seeing the
> process listening on or not.
>
> On Mon, Aug 10, 2015 at 12:43 PM, Mohit Anchlia
> wrote:
>
>> I've verified all the executors and I don't see a process listening on
>> the port. Ho
I do see this message:
15/08/10 19:19:12 WARN YarnScheduler: Initial job has not accepted any
resources; check your cluster UI to ensure that workers are registered and
have sufficient resources
On Mon, Aug 10, 2015 at 4:15 PM, Mohit Anchlia
wrote:
>
> I am using the same exac
I am seeing the following error. I think it's not able to find some other
associated classes, as I see "$2" in the exception, but I am not sure what I am
missing.
15/08/11 16:00:15 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0
(TID 50, ip-10-241-251-141.us-west-2.compute.internal):
java.lang.Cl
Please note I am using it in YARN
On Tue, Aug 11, 2015 at 1:52 PM, Mohit Anchlia
wrote:
> I am seeing the following error. I think it's not able to find some other
> associated classes, as I see "$2" in the exception, but I am not sure what I am
> missing.
>
>
> 15/08
How does partitioning in Spark work when it comes to streaming? What's the
best way to partition time series data grouped by a certain tag, like
categories of product (video, music, etc.)?
After changing to '--deploy-mode client' the program seems to work;
however, it looks like there is a bug in Spark when using --deploy-mode
'yarn'. Should I open a bug?
On Tue, Aug 11, 2015 at 3:02 PM, Mohit Anchlia
wrote:
> I see the following line in the log
partition too much.
>
> Best
> Ayan
>
> On Wed, Aug 12, 2015 at 9:06 AM, Mohit Anchlia
> wrote:
>
>> How does partitioning in spark work when it comes to streaming? What's
>> the best way to partition a time series data grouped by a certain tag like
>> cate
generated during the
> batchInterval are partitions of the RDD.
>
> Now if you want to repartition based on a key, a shuffle is needed.
>
> On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia
> wrote:
>
>> How does partitioning in spark work when it comes to streaming? What's
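A sketch of the key-based repartitioning described above, assuming (category, event) string pairs and an arbitrary partition count of 8; transform applies the shuffle to each batch RDD.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.dstream.DStream

    object RepartitionSketch {
      // Groups each batch by key (e.g. product category), at the cost of a shuffle per batch.
      def partitionByCategory(events: DStream[(String, String)]): DStream[(String, String)] =
        events.transform(_.partitionBy(new HashPartitioner(8)))
    }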
Is there a way to run spark streaming methods in standalone eclipse
environment to test out the functionality?
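One way to exercise streaming logic entirely inside an IDE is a local[2] master (one core for the receiver, one for processing) driven by queueStream with canned RDDs; the sample data and timeout below are placeholders.

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LocalStreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setMaster("local[2]").setAppName("local-streaming-test")
        val ssc = new StreamingContext(conf, Seconds(1))
        // Feed the pipeline with pre-built RDDs instead of a network receiver.
        val queue = mutable.Queue[RDD[String]](ssc.sparkContext.parallelize(Seq("a", "b", "a")))
        ssc.queueStream(queue).countByValue().print()
        ssc.start()
        ssc.awaitTerminationOrTimeout(5000)
        ssc.stop()
      }
    }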
I have this call trying to save to HDFS 2.6:
wordCounts.saveAsNewAPIHadoopFiles("prefix", "txt");
but I am getting the following:
java.lang.RuntimeException: class scala.runtime.Nothing$ not
org.apache.hadoop.mapreduce.OutputFormat
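The Nothing$ in the error suggests the output-format type parameter was never supplied and was inferred as Nothing; a sketch of the overload quoted later in this thread, naming every class explicitly (the prefix and suffix are placeholders):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.spark.streaming.dstream.DStream

    object SaveToHdfsSketch {
      def save(wordCounts: DStream[(String, Int)]): Unit =
        wordCounts
          .map { case (w, c) => (new Text(w), new IntWritable(c)) }  // convert to Writables
          .saveAsNewAPIHadoopFiles(
            "hdfs:///tmp/wordcounts/part", "txt",
            classOf[Text], classOf[IntWritable],
            classOf[TextOutputFormat[Text, IntWritable]])
    }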
I was able to get this working by using an alternative method, however I
only see 0-byte files in Hadoop. I've verified that the output does exist
in the logs, however it's missing from HDFS.
On Thu, Aug 13, 2015 at 10:49 AM, Mohit Anchlia
wrote:
> I have this call trying to sav
ee the data:
$ ls -ltr !$
ls -ltr /tmp/out
-rw-r--r-- 1 yarn yarn 5230 Aug 13 15:45 /tmp/out
On Fri, Aug 14, 2015 at 6:15 AM, Ted Yu wrote:
> Which Spark release are you using ?
>
> Can you show us snippet of your code ?
>
> Have you checked namenode log ?
>
> Thanks
>
prefix: String,
> suffix: String,
> keyClass: Class[_],
> valueClass: Class[_],
> outputFormatClass: Class[F]) {
>
> Did you intend to use outputPath as prefix ?
>
> Cheers
>
>
> On Fri, Aug 14, 2015 at 1:36 PM, Mohit Anchlia
> wrote:
>
I am running on YARN and have a question on how Spark runs executors on
different data nodes. Is that primarily decided based on the number of
receivers?
What do I need to do to ensure that multiple nodes are being used for data
processing?
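A sketch of one common approach, with a placeholder host/port and arbitrary counts: create several receivers so more than one executor ingests data, then union and repartition before processing.

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    object MultiReceiverSketch {
      def build(ssc: StreamingContext): DStream[String] = {
        // Each socketTextStream call occupies one executor core with a receiver.
        val receivers = (1 to 3).map(_ => ssc.socketTextStream("source-host", 9999))
        ssc.union(receivers).repartition(6)  // spread processing beyond the receiving executors
      }
    }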
Spark Streaming seems to be creating 0-byte files even when there is no data.
Also, I have 2 concerns here:
1) Extra unnecessary files are being created from the output
2) Hadoop doesn't work really well with too many files, and I see that it is
creating a directory with a timestamp every 1 second. Is
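A sketch of guarding the write so empty micro-batches produce no files at all; the output path is a placeholder, and rdd.isEmpty needs Spark 1.3 or later.

    import org.apache.spark.streaming.dstream.DStream

    object NonEmptyOutputSketch {
      def save(wordCounts: DStream[(String, Long)]): Unit =
        wordCounts.foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty())  // skip batches with no data
            rdd.saveAsTextFile(s"hdfs:///tmp/wordcounts-${time.milliseconds}")
        }
    }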
er-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167>
>> or have a separate program which will do the clean up for you.
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia
>> wr
applying your filters
> 3) Write this StringBuilder to File when you want to write (The duration
> can be defined as a condition)
>
> On Tue, Aug 18, 2015 at 11:05 PM, Mohit Anchlia
> wrote:
>
>> Is there a way to store all the results in one file and keep the file
>
Any help would be appreciated
On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia
wrote:
> My question was how to do this in Hadoop? Could somebody point me to some
> examples?
>
> On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY
> wrote:
>
>> Of course, Java or Scala can
file,
which seems like an extra overhead from a maintenance and I/O perspective.
On Mon, Aug 24, 2015 at 2:51 PM, Mohit Anchlia
wrote:
> Any help would be appreciated
>
> On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia
> wrote:
>
>> My question was how to do this in Hadoop?
When Spark writes a partition it writes it in the format:
<key>=<key value>/
Is there a way to have Spark write only the key value?
I am trying to understand the best way to handle the scenario where a null
array "[]" is passed. Can somebody suggest if there is a way to filter out
such records? I've tried numerous things, including using
dataframe.head().isEmpty, but PySpark doesn't recognize isEmpty even though
I see it in the API.
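One way to avoid collecting with head() is to filter on the array length instead; a sketch using the Scala DataFrame API (the same size function exists in pyspark.sql.functions), where the column name "items" is an assumption.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, size}

    object EmptyArrayFilterSketch {
      // Keeps only rows whose array column has at least one element.
      def dropEmptyArrays(df: DataFrame): DataFrame =
        df.filter(size(col("items")) > 0)
    }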
ect classpath.
>
> On Tue, Mar 10, 2015 at 5:54 PM, Mohit Anchlia
> wrote:
> > I am trying out streaming example as documented and I am using spark
> 1.2.1
> > streaming from maven for Java.
> >
> > When I add this code I get compilation error on and eclipse is not
is indeed insufficient and you have to
> download all the recursive dependencies. Maybe you should create a Maven
> project inside Eclipse?
>
> TD
>
> On Tue, Mar 10, 2015 at 11:00 AM, Mohit Anchlia
> wrote:
>
>> How do I do that? I haven't used Scala bef
Scala deps transitively and can import scala.*
> classes. However, it would be a little bit more correct to depend
> directly on the scala library classes, but in practice, easiest not to
> in simple use cases.
>
> If you're still having trouble look at the output of "mvn depende
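A sketch of the Maven coordinates under discussion, using the 1.2.1 version mentioned in these messages and assuming the Scala 2.10 artifact suffix:

    <!-- pulls in spark-core and the Scala library transitively -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.2.1</version>
    </dependency>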
f I should delete that file from local repo and re-sync
On Tue, Mar 10, 2015 at 1:08 PM, Mohit Anchlia
wrote:
> I ran the dependency command and see the following dependencies:
>
> I only see org.scala-lang.
>
> [INFO] org.spark.test:spak-test:jar:0.0.1-SNAPSHOT
>
> [
I am getting the following error. When I look at the sources it seems to be a
Scala source, but I am not sure why it's complaining about it.
The method map(Function) in the type JavaDStream is not
applicable for the arguments (new
PairFunction(){})
My code has been taken from the Spark examples site.
Hi,
I am trying to understand the Hadoop map method compared to the Spark map, and I
noticed that the Spark map only receives 3 arguments: 1) input value, 2) output
key, 3) output value; however, the Hadoop map has 4: 1) input key, 2)
input value, 3) output key, 4) output value. Is there any reason it was
des
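A sketch of the difference, with made-up records: in Spark the input key is simply part of the element (for example a (Long, String) offset/line pair), and an output key/value is just the tuple the function returns; there is no separate context or collector object.

    import org.apache.spark.{SparkConf, SparkContext}

    object MapComparisonSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("map-sketch").setMaster("local[2]"))
        // The "input key" (here a fake record offset) is just the first tuple element.
        val records = sc.parallelize(Seq((0L, "apple"), (6L, "banana"), (13L, "apple")))
        val pairs = records.map { case (_, word) => (word, 1) }  // emit (outputKey, outputValue)
        pairs.reduceByKey(_ + _).collect().foreach(println)
        sc.stop()
      }
    }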
works now. I should have checked :)
On Tue, Mar 10, 2015 at 1:44 PM, Sean Owen wrote:
> Ah, that's a typo in the example: use words.mapToPair
> I can make a little PR to fix that.
>
> On Tue, Mar 10, 2015 at 8:32 PM, Mohit Anchlia
> wrote:
> > I am getting following
Does Spark Streaming also support SQL? Something like how Esper does CEP.
Is there a good architecture doc that gives a sufficient overview of the
high-level and low-level details of Spark, with some good diagrams?
I am trying out the streaming example as documented, and I am using Spark 1.2.1
streaming from Maven for Java.
When I add this code I get a compilation error, and Eclipse is not able to
recognize Tuple2. I also don't see any "import scala.Tuple2" class.
http://spark.apache.org/docs/1.2.0/streaming-pr
It works after sync, thanks for the pointers
On Tue, Mar 10, 2015 at 1:22 PM, Mohit Anchlia
wrote:
> I navigated to maven dependency and found scala library. I also found
> Tuple2.class and when I click on it in eclipse I get "invalid LOC header
> (bad signature)"
>
> j
I am trying to look for documentation on partitioning, which I can't seem
to find. I am looking at Spark Streaming and was wondering how it
partitions RDDs in a multi-node environment. Where are the keys that are
used for partitioning defined? For instance, in the example below the keys seem to be
impli
I am running Spark Streaming standalone in EC2 and I am trying to run the
wordcount example from my desktop. The program is unable to connect to the
master; in the logs I see the following, which seems to be an issue with the hostname.
15/03/13 17:37:44 ERROR EndpointWriter: dropping message [class
akka.actor.ActorSele
docs). But use it at your own risk. If you modify the
> keys, and yet preserve partitioning, the partitioning would not make sense
> any more as the hash of the keys have changed.
>
> TD
>
>
>
> On Fri, Mar 13, 2015 at 2:26 PM, Mohit Anchlia
> wrote:
>
>> I am try
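A sketch of the caveat above, with arbitrary types and partition count: preservesPartitioning is only safe when the keys are left untouched; the values can be rewritten freely.

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    object PreservePartitioningSketch {
      def enrich(pairs: RDD[(String, Int)]): RDD[(String, Int)] =
        pairs
          .partitionBy(new HashPartitioner(4))
          // Keys are unchanged, so telling Spark the partitioning is preserved is valid here.
          .mapPartitions(_.map { case (k, v) => (k, v * 2) }, preservesPartitioning = true)
    }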
I am trying to understand how to load balance the incoming data to multiple
Spark Streaming workers. Could somebody help me understand how I can
distribute my incoming data from various sources such that the incoming data
goes to multiple Spark Streaming nodes? Is it done by the Spark client with
help
Is there a module in Spark Streaming that lets you listen to
alerts/conditions as they happen in the streaming module? Generally
Spark Streaming components will execute on a large set of clusters like HDFS
or Cassandra; however, when it comes to alerting you generally can't send it
directly from t
incoming rate, they
> may vary.
>
> I do not know exactly what the life cycle of the receiver is, but I don't
> think something actually happens when you create the DStream. My guess would be
> that the receiver is allocated when you call
> StreamingContext#startStreams(),
>
h which you could do:
>
> val data = ssc.textFileStream("sigmoid/")
> val dist = data.filter(_.contains("ERROR")).foreachRDD(rdd =>
> alert("Errors :" + rdd.count()))
>
> And the alert() function could be anything triggering an email or
I am facing the same issue as listed here:
http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a-spark-job-using-maven-td5615.html
Solution mentioned is here:
https://gist.github.com/prb/d776a47bd164f704eecb
However, I think I don't understand a few things:
1) Why are jars being split
I am trying to run the word count example, but for some reason it's not
working as expected. I start the "nc" server on port and then submit the
Spark job to the cluster. The Spark job gets submitted successfully, but I
never see any connection from Spark getting established. I also tried to
type words
What's the best way to troubleshoot inside spark to see why Spark is not
connecting to nc on port ? I don't see any errors either.
On Thu, Mar 26, 2015 at 2:38 PM, Mohit Anchlia
wrote:
> I am trying to run the word count example but for some reason it's not
> working
or netstat to make sure the Spark executor backend
> is connected to the nc process. Also grep the executor's log to see if
> there's a log line like "Connecting to " and "Connected to
> ", which shows that the receiver is correctly connected to the nc process.
>
> Thanks
I tried to file a bug in the Git repo, however I don't see a link to "open
issues".
On Fri, Mar 27, 2015 at 10:55 AM, Mohit Anchlia
wrote:
> I checked the ports using netstat and don't see any connections
> established on that port. Logs show only this:
>
> 15/03/27 1
> 15/03/27 13:50:53 IN
>
>
>
> On Thu, Mar 26, 2015 at 6:50 PM, Saisai Shao
> wrote:
>
>> Hi,
>>
>> Did you run the word count example in Spark local mode or another mode? In
>> local mode you have to set local[n], where n >= 2. For other modes, make sure
any cores are present in the workers allocated to the standalone
> cluster spark://ip-10-241-251-232:7077 ?
>
>
> On Fri, Apr 3, 2015 at 2:18 PM, Mohit Anchlia
> wrote:
>
>> If I use local[2] instead of *URL:* spark://ip-10-241-251-232:7077 this
>> seems to work. I don'
Interesting, I see 0 cores in the UI?
- *Cores:* 0 Total, 0 Used
On Fri, Apr 3, 2015 at 2:55 PM, Tathagata Das wrote:
> What does the Spark Standalone UI at port 8080 say about number of cores?
>
> On Fri, Apr 3, 2015 at 2:53 PM, Mohit Anchlia
> wrote:
>
>> [ec2-u
I am trying to start the worker by:
sbin/start-slave.sh spark://ip-10-241-251-232:7077
In the logs it's complaining about:
Master must be a URL of the form spark://hostname:port
I also have this in spark-defaults.conf
spark.master spark://ip-10-241-251-232:7077
Did I miss
I am seeing the following, is this because of my maven version?
15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
ip-10-241-251-232.us-west-2.compute.internal):
java.io.InvalidClassException: org.apache.spark.Aggregator; local class
incompatible: stream classdesc serialVers
Finally got it working by increasing the spark version in maven to 1.2.1
On Thu, Apr 9, 2015 at 12:30 PM, Mohit Anchlia
wrote:
> I changed the JDK to Oracle but I still get this error. Not sure what it
> means by "Stream class is incompatible with local class". I am using the
> I don't have experience with mixed JDK's.
> Can you try with using single JDK ?
>
> Cheers
>
> On Wed, Apr 8, 2015 at 3:26 PM, Mohit Anchlia
> wrote:
>
>> For the build I am using java version "1.7.0_65" which seems to be the
>> same as
> Cheers
>
> On Wed, Apr 8, 2015 at 12:43 PM, Mohit Anchlia
> wrote:
>
>> I am seeing the following, is this because of my maven version?
>>
>> 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
>> ip-10-241-251-232.us-wes