Re: Spark Streaming on top of Cassandra?

2015-05-21 Thread jay vyas
-- jay vyas

Spark jobs without a login

2016-06-16 Thread jay vyas
:675) I'm not too worried about this - but it seems like it might be nice if maybe we could specify a user name as part of Spark's context, or as part of an external parameter, rather than having to use the Java-based user/group extractor. -- jay vyas

JavaSparkContext: dependency on ui/

2016-06-27 Thread jay vyas
arkContext.<init>(JavaSparkContext.scala:58) -- jay vyas

Re: How to build Spark with my own version of Hadoop?

2015-07-22 Thread jay vyas
On Tue, Jul 21, 2015 at 11:11 PM, Dogtail Ray wrote: > Hi, > > I have modified some Hadoop code, and want to build Spark with the > modified version of Hadoop. Do I need to change the compilation dependency > files? How do I do that? Thanks! > -- jay vyas

Re: Amazon DynamoDB & Spark

2015-08-07 Thread Jay Vyas
In general the simplest way is to use the DynamoDB Java API as-is and call it inside a map(), using the asynchronous put() DynamoDB API call. > On Aug 7, 2015, at 9:08 AM, Yasemin Kaya wrote: > > Hi, > > Is there a way to use DynamoDB in a spark application? I have to persist my > res
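A minimal sketch of that approach, assuming the AWS SDK v1 DynamoDB client; the table name and attribute layout are made up, and foreachPartition is used rather than map() so the non-serializable client is created per partition instead of shipped with the closure:

    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient
    import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
    import scala.collection.JavaConverters._

    // rdd: RDD[(String, String)] of (id, payload) pairs -- a hypothetical shape.
    rdd.foreachPartition { records =>
      val client = new AmazonDynamoDBClient() // default credential chain; one client per partition
      records.foreach { case (id, payload) =>
        val item = Map(
          "id"      -> new AttributeValue().withS(id),
          "payload" -> new AttributeValue().withS(payload)
        ).asJava
        client.putItem(new PutItemRequest("myTable", item)) // "myTable" is an assumed table name
      }
    }

The asynchronous variant mentioned above (AmazonDynamoDBAsyncClient#putItemAsync) drops into the same spot.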

Re: Unit Testing

2015-08-13 Thread jay vyas
urantees that you'll have a producer and a consumer, so that you don't get a starvation scenario. On Wed, Aug 12, 2015 at 7:31 PM, Mohit Anchlia wrote: > Is there a way to run spark streaming methods in a standalone eclipse > environment to test out the functionality? > -- jay vyas
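One way to get that producer/consumer guarantee deterministically inside an IDE is to drive the context from an in-memory queue instead of a live source; a sketch, with batch interval and test data chosen arbitrarily:

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import scala.collection.mutable

    // local[2] so the receiver thread cannot starve the processing thread.
    val conf = new SparkConf().setMaster("local[2]").setAppName("streaming-test")
    val ssc  = new StreamingContext(conf, Seconds(1))

    val queue  = mutable.Queue[RDD[String]]()  // each queued RDD becomes one micro-batch
    val stream = ssc.queueStream(queue)
    stream.map(_.length).print()

    queue += ssc.sparkContext.makeRDD(Seq("a", "bb", "ccc"))
    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)        // bounded wait keeps a JUnit run from hanging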

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Jay Vyas
It's a very valid idea indeed, but... it's a tricky subject: since the entire ASF is run on mailing lists, there are many different but equally sound ways of looking at this idea, which conflict with one another. > On Jan 21, 2015, at 7:03 AM, btiernay wrote: > > I think this is a re

SparkSQL DateTime

2015-02-09 Thread jay vyas
hive dates just for dealing with time stamps. What's the simplest and cleanest way to map non-Spark time values into SparkSQL-friendly time values? - One option could be a custom SparkSQL type, I guess? - Any plan to have native Spark SQL support for Joda Time or (yikes) java.util.Calendar? -- jay vyas
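For reference, java.sql.Timestamp is the type 1.2-era SparkSQL natively understands, so one low-tech option is converting at the case-class boundary; a sketch, assuming a spark-shell-style sc and a made-up Event schema:

    import java.sql.Timestamp
    import org.apache.spark.sql.SQLContext
    import org.joda.time.DateTime

    case class Event(id: Int, ts: Timestamp) // Timestamp infers as SparkSQL's TimestampType

    // Keep joda / Calendar in application code; convert only at the schema boundary.
    def fromJoda(dt: DateTime): Timestamp = new Timestamp(dt.getMillis)
    def fromCalendar(c: java.util.Calendar): Timestamp = new Timestamp(c.getTimeInMillis)

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // implicit RDD -> SchemaRDD conversion (Spark 1.x)
    val events = sc.parallelize(Seq(Event(1, fromJoda(new DateTime))))
    events.registerTempTable("events") // timestamp predicates now work in SQL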

Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
}).join(ProductMetaData).by(product, meta => product.id = meta.id).toSchemaRDD ? I know the above snippet is totally wacky, but you get the idea :) -- jay vyas

Re: Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
Ah, never mind - I just saw http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language-integrated queries), which looks quite similar to what I was thinking about. I'll give that a whirl... On Wed, Feb 11, 2015 at 7:40 PM, jay vyas wrote: > Hi spark. is there anything in t
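The language-integrated style from that guide looks roughly like this; the Person schema and data are illustrative, and the symbol-based DSL belongs to the 1.2-era SchemaRDD API:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // assumes an existing SparkContext, as in the shell
    import sqlContext._                 // DSL implicits, incl. RDD -> SchemaRDD conversion

    case class Person(name: String, age: Int)
    val people = sc.parallelize(Seq(Person("holden", 15), Person("td", 30)))

    // 'age and 'name are Scala symbols, resolved against the schema at planning time.
    val teenagers = people.where('age >= 13).where('age <= 19).select('name)
    teenagers.collect().foreach(println)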

Re: Spark Streaming and message ordering

2015-02-18 Thread jay vyas
ons/spark-streaming.html > . > > With a kafka receiver that pulls data from a single kafka partition of a > kafka topic, are individual messages in the microbatch in the same order as > the kafka partition? Are successive microbatches originating from a kafka > partition executed in order? &

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Jay Vyas
- https://wiki.apache.org/incubator/IgniteProposal has, I think, been updated recently and has a good comparison. - Although GridGain has been around since the early Spark days, Apache Ignite is quite new and just getting started, I think, so you will probably want to reach out to the developers for

Re: Untangling dependency issues in spark streaming

2015-03-29 Thread jay vyas
actory.<init>(PoolingHttpClientConnectionManager.java:494)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138)
at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114)
-- jay vyas

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread jay vyas
Just the same as spark disrupted the hadoop ecosystem by changing the assumption that "you can't rely on memory in distributed analytics"... now maybe we are challenging the assumption that "big data analytics need to be distributed"? I've been asking the same question lately and have seen similarly t

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread jay vyas
-- jay vyas

Re: Re: spark streaming printing no output

2015-04-16 Thread jay vyas
> import org.apache.spark.streaming.StreamingContext._
> import org.apache.spark.streaming.dstream.DStream
> import org.apache.spark.streaming.Duration
> import org.apache.spark.streaming.Seconds
> val ssc = new StreamingContext(sc, Seconds(1))
> val lines = ssc.socketTextStream("hostname",)
> lines.print()
> ssc.start()
> ssc.awaitTermination()
>
> Jobs are getting created when I look at the web UI, but nothing gets printed on the console.
> I have started an nc script on the host and port, and can see messages typed on this port from another console.
> Please let me know if I am doing something wrong.
-- jay vyas

SPARK_WORKER_PORT (standalone cluster)

2014-07-14 Thread jay vyas
... if so, please just point me to the right documentation if I'm missing something obvious :) thanks! -- jay vyas

Re: SPARK_WORKER_PORT (standalone cluster)

2014-07-16 Thread jay vyas
the slaves can be ephemeral. Since the master is fixed, though, a new slave can reconnect at any time. On Mon, Jul 14, 2014 at 10:01 PM, jay vyas wrote: > Hi spark! > > What is the purpose of the randomly assigned SPARK_WORKER_PORT? > > From the documentation it says to "

Re: Error with spark-submit (formatting corrected)

2014-07-17 Thread Jay Vyas
I think I know what is happening to you. I've looked into this some just this week, so it's fresh in my brain :) hope this helps. When no workers are known to the master, IIRC, you get this message. I think this is how it works: 1) You start your master. 2) You start a slave, and give it m

RDD.pipe(...)

2014-07-20 Thread jay vyas
ed RDD capture the standard out from the process as its output (I assume that is the most common implementation)? Incidentally, I have not been able to use the pipe command to run an external process yet, so any hints on that would be appreciated. -- jay vyas

Re: RDD.pipe(...)

2014-07-20 Thread jay vyas
essentially an implementation of something analogous to Hadoop's streaming API. On Sun, Jul 20, 2014 at 4:09 PM, jay vyas wrote: > According to the api docs for the pipe operator, > def pipe(command: String): RDD > <http://spark.apache.org/docs/1.0.0/api/scala/org/apache/s
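For the record, pipe() forks the command once per partition, writes each element to the child's stdin as a line, and returns the child's stdout lines as a new RDD[String]; a sketch with a stock Unix command:

    val words = sc.parallelize(Seq("spark", "pipes", "text"), 2)
    // The Seq form avoids shell-tokenization surprises with quoted arguments.
    val upper = words.pipe(Seq("tr", "a-z", "A-Z"))
    upper.collect().foreach(println) // SPARK, PIPES, TEXT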

Spark over graphviz (SPARK-1015, SPARK-975)

2014-07-22 Thread jay vyas
this; possibly I could lend a hand if there are any loose ends needing to be tied. -- jay vyas

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread jay vyas
-- jay vyas

Re: pyspark script fails on EMR with an ERROR in configuring object.

2014-08-03 Thread jay vyas
at java.lang.Class.forName(Class.java:270)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:89)
... 64 more

On Sun, Aug 3, 2014 at 6:04 PM, Rahul Bhojwani wrote:
> Hi,
> I used to run Spark scripts on my local machine. Now I am porting my code to EMR, and I am facing lots of problems.
> The main one right now is that a Spark script which runs properly on my local machine gives an error when run on an Amazon EMR cluster. Here is the error. What can be the possible reason? Thanks in advance.
> Rahul K Bhojwani, about.me/rahul_bhojwani
-- jay vyas

Re: Spark inside Eclipse

2014-10-03 Thread jay vyas
bug and learn. > > thanks > > sanjay > -- jay vyas

Re: Does Ipython notebook work with spark? trivial example does not work. Re: bug with IPython notebook?

2014-10-10 Thread jay vyas
> Andy
>
> import sys
> from operator import add
> from pyspark import SparkContext
>
> # only stand-alone jobs should create a SparkContext
> sc = SparkContext(appName="pyStreamingSparkRDDPipe")
>
> data = [1, 2, 3, 4, 5]
> rdd = sc.parallelize(data)
>
> def echo(data):
>     print "python received: %s" % (data)  # output winds up in the shell console in my cluster (i.e. the machine I launched pyspark from)
>
> rdd.foreach(echo)
> print "we are done"
-- jay vyas

Streams: How do RDDs get Aggregated?

2014-10-11 Thread jay vyas
Hi spark! I don't quite understand the semantics of RDDs in a streaming context very well yet. Are there any examples of how to implement CustomInputDStreams, with corresponding Receivers, in the docs? I've hacked together a custom stream, which is being opened and is consuming data internal

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread jay vyas
ert a JavaRDD into an iterator or iterable over the entire data set without using collect or holding all data in memory. > In many problems where it is desirable to parallelize intermediate steps but use a single process for handling the final result, this could be very useful. -- jay vyas
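Two common ways to get a single-consumer result without collect(), sketched against a hypothetical rdd: RDD[String]; neither comes from the thread itself:

    // (1) One output file: funnel into a single partition before writing.
    //     Simple, but the final write is serialized through one task.
    rdd.coalesce(1).saveAsTextFile("hdfs:///tmp/single-out") // placeholder path

    // (2) Iterate on the driver without materializing everything: toLocalIterator
    //     fetches one partition at a time, bounding driver memory by partition size.
    rdd.toLocalIterator.foreach(println)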

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Hi Spark! I found out why my RDDs weren't coming through in my spark stream. It turns out that onStart() needs to return, it seems - i.e. you need to launch the worker part of your start process in a thread. For example: def onStartMock(): Unit = { val future = new Thread(new
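For anyone hitting the same wall, the non-blocking onStart() pattern looks roughly like this; a sketch in the spirit of the custom-receiver docs, with the data source faked:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
      // onStart must return promptly: do the actual receiving on its own thread.
      def onStart(): Unit = {
        new Thread("mock-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store("tick " + System.currentTimeMillis()) // hand records to Spark
              Thread.sleep(100)
            }
          }
        }.start()
      }
      def onStop(): Unit = {} // the loop above exits via isStopped()
    }

    // usage: ssc.receiverStream(new MockReceiver()).print()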

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
On Tue, Oct 21, 2014 at 11:02 AM, jay vyas wrote: > Hi Spark! I found out why my RDDs weren't coming through in my spark > stream. > > It turns out that onStart() needs to return, it seems - i.e. you > need to launch the worker part of your > start process

Re: real-time streaming

2014-10-28 Thread jay vyas
-- jay vyas

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread jay vyas
> On Oct 20, 2014, at 3:07 AM, Gerard Maas wrote:
>
> Pinging TD -- I'm sure you know :-)
>
> -kr, Gerard.
>
> On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas wrote:
>
>> Hi,
>>
>> We have been implementing several Spark Streaming jobs that are basically processing data and inserting it into Cassandra, sorting it among different keyspaces.
>>
>> We've been following the pattern:
>>
>> dstream.foreachRDD(rdd =>
>>   val records = rdd.map(elem => record(elem))
>>   targets.foreach(target => records.filter{record => isTarget(target,record)}.writeToCassandra(target,table))
>> )
>>
>> I've been wondering whether there would be a performance difference in transforming the dstream instead of transforming the RDD within the dstream, with regard to how the transformations get scheduled.
>>
>> Instead of the RDD-centric computation, I could transform the dstream until the last step, where I need an RDD to store. For example, the previous transformation could be written as:
>>
>> val recordStream = dstream.map(elem => record(elem))
>> targets.foreach{target => recordStream.filter(record => isTarget(target,record)).foreachRDD(_.writeToCassandra(target,table))}
>>
>> Would there be a difference in execution and/or performance? What would be the preferred way to do this?
>>
>> Bonus question: Is there a better (more performant) way to sort the data into different "buckets" instead of filtering the full collection once per bucket?
>>
>> thanks, Gerard.
-- jay vyas

Re: random shuffle streaming RDDs?

2014-11-03 Thread Jay Vyas
A use case would be helpful. Batches of RDDs from streams are going to have a temporal ordering in terms of when they are processed in a typical application..., but maybe you could shuffle the way batch iterations work. > On Nov 3, 2014, at 11:59 AM, Josh J wrote: > > When I'm outputting the

Embedding static files in a spark app

2014-11-08 Thread Jay Vyas
Hi spark. I have a set of text files that are dependencies of my app. They are less than 2MB in total size. What's the idiom for packaging text-file dependencies into a spark-based jar file? Class resources in packages? Or distributing them separately?
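Both options can be sketched in a few lines; the file names and paths below are made up:

    // (1) Bundle under src/main/resources and read off the classpath (baked into the jar):
    val in    = getClass.getResourceAsStream("/data/lookup.txt")
    val lines = scala.io.Source.fromInputStream(in).getLines().toList

    // (2) Ship separately with the job and resolve per-node via SparkFiles:
    import org.apache.spark.SparkFiles
    sc.addFile("/local/path/lookup.txt")    // distributed to every node
    val path = SparkFiles.get("lookup.txt") // node-local path on each executor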

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Jay Vyas
Yup, very important that n>1 for spark streaming jobs; if local, use local[2]. The thing to remember is that your spark receiver will take a thread to itself to produce data, so you need another thread to consume it. In a cluster manager like YARN or Mesos, the word thread is not used any
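Concretely, for local mode that advice comes down to one setting; a minimal sketch (app name and batch interval arbitrary):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // local[2]: the receiver pins one thread, the other runs the batches.
    // With local[1] the job appears to run but no batch is ever processed.
    val conf = new SparkConf().setMaster("local[2]").setAppName("kafka-consumer")
    val ssc  = new StreamingContext(conf, Seconds(2))
    // On YARN/Mesos the same rule shows up as cores: total executor cores
    // must exceed the number of receivers.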

Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread jay vyas
the periodic CPU spike - I had a reduceByKey, so was it > doing that only after all the batch data was in? > > Thanks > -- jay vyas

Re: Streaming: getting total count over all windows

2014-11-13 Thread jay vyas
-- jay vyas

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Jay Vyas
This seems pretty standard: your IntelliJ classpath isn't matched to the one actually used in the spark shell. Are you using the SBT plugin? If not, how are you putting deps into IntelliJ? > On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian > wrote: > > hey guys > > I am at AmpCamp 2014

Re: How to execute a custom python library on spark

2014-11-25 Thread jay vyas
and when to use spark-submit to execute python > scripts/modules. > Bonus points if one can point to an example library and how to run it :) > Thanks > -- jay vyas

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Jay Vyas
Here's an example of a Cassandra ETL that you can follow, which should exit on its own. I'm using it as a blueprint for building spark streaming apps on top of. For me, I kill the streaming app with System.exit after a sufficient amount of data is collected. That seems to work for most any scena

Re: Unit testing and Spark Streaming

2014-12-12 Thread Jay Vyas
https://github.com/jayunit100/SparkStreamingCassandraDemo On this note, I've built a framework which is mostly "pure", so that functional unit tests can be run composing mock data for Twitter statuses, with just regular JUnit... That might be relevant also. I think at some point we should come

Re: Spark Streaming Threading Model

2014-12-19 Thread jay vyas
awaiting processing or does it just process them? > > Asim > -- jay vyas

Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jay Vyas
Hi Enno. It might be worthwhile to cross-post this on dev@hadoop... Obviously a simple spark way to test this would be to change the URI to write to hdfs:// - or maybe you could do file:// - and confirm that the extra slash goes away. - If it's indeed a jets3t issue, we should add a new unit test for
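The scheme-swap test is a one-liner per scheme; stream, bucket, and paths below are placeholders:

    // Write the same DStream under three schemes: if the doubled slash only
    // shows up under s3n://, the jets3t layer is the suspect, not Spark.
    stream.saveAsTextFiles("s3n://my-bucket/bugcheck/out")
    stream.saveAsTextFiles("hdfs:///tmp/bugcheck/out")
    stream.saveAsTextFiles("file:///tmp/bugcheck/out")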

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Jay Vyas
I find importing a working SBT project into IntelliJ is the way to go. How did you load the project into IntelliJ? > On Jan 13, 2015, at 4:45 PM, Enno Shioji wrote: > > Had the same issue. I can't remember what the issue was, but this works: > > libraryDependencies ++= { > val sparkVers
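A complete minimal build.sbt along those lines; the versions are illustrative for the Spark 1.2 era, since Enno's actual values are cut off above:

    name := "spark-sandbox"

    scalaVersion := "2.10.4"

    libraryDependencies ++= {
      val sparkVersion = "1.2.0" // assumed; pick the version you build against
      Seq(
        "org.apache.spark" %% "spark-core"      % sparkVersion,
        "org.apache.spark" %% "spark-streaming" % sparkVersion
      )
    }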

Re: Benchmarking Spark with YCSB

2014-05-16 Thread Jay Vyas
-- Jay Vyas http://jayunit100.blogspot.com