:675)
I'm not too worried about this - but it seems like it might be nice if
we could specify a user name as part of the Spark context, or as an
external parameter, rather than having to
use the Java-based user/group extractor.
--
jay vyas
arkContext.(JavaSparkContext.scala:58)
--
jay vyas
On Tue, Jul 21, 2015 at 11:11 PM, Dogtail Ray wrote:
> Hi,
>
> I have modified some Hadoop code, and want to build Spark with the
> modified version of Hadoop. Do I need to change the compilation dependency
> files? If so, how? Thanks a lot!
>
--
jay vyas
In general, the simplest way is to use the DynamoDB Java API as-is and
call it inside a map(), using the asynchronous put() DynamoDB API call.
> On Aug 7, 2015, at 9:08 AM, Yasemin Kaya wrote:
>
> Hi,
>
> Is there a way to use DynamoDB in a Spark application? I have to persist my
> res
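A minimal sketch along those lines (assuming a spark-shell where sc is defined and the AWS SDK v1 DynamoDB client on the classpath; the table and attribute names are made up). It uses foreachPartition rather than map() so that one client is built per partition instead of per element:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClientBuilder
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
import scala.collection.JavaConverters._

val results = sc.parallelize(Seq(("user-1", "42"), ("user-2", "17")))   // stand-in data

results.foreachPartition { partition =>
  // Clients are not serializable, so construct one on the executor, once per partition.
  val dynamo = AmazonDynamoDBAsyncClientBuilder.defaultClient()
  partition.foreach { case (id, score) =>
    val item = Map(
      "id"    -> new AttributeValue().withS(id),
      "score" -> new AttributeValue().withS(score)
    ).asJava
    dynamo.putItemAsync(new PutItemRequest("results", item))   // asynchronous put, as suggested above
  }
}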
urantees that you'll have a producer and a consumer, so that you
don't get a starvation scenario.
On Wed, Aug 12, 2015 at 7:31 PM, Mohit Anchlia
wrote:
> Is there a way to run spark streaming methods in standalone eclipse
> environment to test out the functionality?
>
--
jay vyas
It's a very valid idea indeed, but... it's a tricky subject, since the entire
ASF is run on mailing lists; hence there are so many different but equally
sound ways of looking at this idea, which conflict with one another.
> On Jan 21, 2015, at 7:03 AM, btiernay wrote:
>
> I think this is a re
hive dates just for
dealing with time stamps.
What's the simplest and cleanest way to map non-Spark time values into
SparkSQL-friendly time values?
- One option could be a custom SparkSQL type, I guess?
- Any plan to have native Spark SQL support for Joda Time or (yikes)
java.util.Calendar?
--
jay vyas
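One simple route, for reference: convert the Joda values to java.sql.Timestamp, which Spark SQL already understands, before building the DataFrame. A sketch assuming Spark 1.3+ and a spark-shell where sc is defined (the Event fields are invented):

import java.sql.Timestamp
import org.joda.time.DateTime
import org.apache.spark.sql.SQLContext

case class Event(name: String, at: Timestamp)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val raw = Seq(("login",  new DateTime(2015, 1, 1, 12, 0)),
              ("logout", new DateTime(2015, 1, 1, 13, 0)))

val events = sc.parallelize(raw)
  .map { case (name, dt) => Event(name, new Timestamp(dt.getMillis)) }   // Joda -> SQL-friendly
  .toDF()

events.registerTempTable("events")
sqlContext.sql(
  "SELECT name FROM events WHERE at > CAST('2015-01-01 12:30:00' AS TIMESTAMP)").show()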
}).join(ProductMetaData).by(product,meta=>product.id=meta.id). toSchemaRDD ?
I know the above snippet is totally wacky but, you get the idea :)
--
jay vyas
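For comparison, here is roughly what that snippet looks like with the DataFrame API that later replaced SchemaRDD (a sketch assuming Spark 1.3+ and a spark-shell; the Product/ProductMeta fields are invented):

import org.apache.spark.sql.SQLContext

case class Product(id: Int, name: String)
case class ProductMeta(id: Int, category: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val products = sc.parallelize(Seq(Product(1, "widget"), Product(2, "gadget"))).toDF()
val meta     = sc.parallelize(Seq(ProductMeta(1, "tools"), ProductMeta(2, "toys"))).toDF()

// The "join ... by product.id = meta.id" idea, spelled with column expressions:
products.join(meta, products("id") === meta("id")).show()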
Ah, never mind, I just saw
http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language
integrated queries), which looks quite similar to what I was thinking
about. I'll give that a whirl...
On Wed, Feb 11, 2015 at 7:40 PM, jay vyas
wrote:
> Hi spark. is there anything in t
ons/spark-streaming.html
> .
>
> With a kafka receiver that pulls data from a single kafka partition of a
> kafka topic, are individual messages in the microbatch in the same order as
> the kafka partition? Are successive microbatches originating from a kafka
> partition executed in order?
>
- https://wiki.apache.org/incubator/IgniteProposal has, I think, been updated
recently and has a good comparison.
- Although GridGain has been around since the Spark days, Apache Ignite is
quite new and just getting started, I think, so you will probably want to
reach out to the developers for
actory.(PoolingHttpClientConnectionManager.java:494)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.(PoolingHttpClientConnectionManager.java:149)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.(PoolingHttpClientConnectionManager.java:138)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.(PoolingHttpClientConnectionManager.java:114)
>
>
--
jay vyas
Just as Spark disrupted the Hadoop ecosystem by changing the
assumption that "you can't rely on memory in distributed analytics"... now
maybe we are challenging the assumption that "big data analytics need to be
distributed"?
I've been asking the same question lately and seen similarly t
--
jay vyas
>>>> import org.apache.spark.streaming.StreamingContext._
>>>>> import org.apache.spark.streaming.dstream.DStream
>>>>> import org.apache.spark.streaming.Duration
>>>>> import org.apache.spark.streaming.Seconds
>>>>> val ssc = new StreamingContext( sc, Seconds(1))
>>>>> val lines = ssc.socketTextStream("hostname",)
>>>>> lines.print()
>>>>> ssc.start()
>>>>> ssc.awaitTermination()
>>>>>
>>>>> Jobs are getting created (I can see them in the web UI), but nothing gets
>>>>> printed on the console.
>>>>>
>>>>> I have started an nc script on the hostname and port and can see messages
>>>>> typed on this port from another console.
>>>>>
>>>>>
>>>>>
>>>>> Please let me know If I am doing something wrong.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
--
jay vyas
... if so, please just point
me to the right documentation if I'm missing something obvious :)
thanks !
--
jay vyas
the slaves can be ephemeral. Since the
master is fixed, though, a new slave can reconnect at any time.
On Mon, Jul 14, 2014 at 10:01 PM, jay vyas
wrote:
> Hi spark !
>
> What is the purpose of the randomly assigned SPARK_WORKER_PORT
>
> from the documentation it says to "
I think I know what is happening to you. I've looked into this a bit just this
week, so it's fresh in my brain :) hope this helps.
When no workers are known to the master, IIRC, you get this message.
I think this is how it works.
1) You start your master
2) You start a slave, and give it m
ed RDD capture the standard out from the process as its
output (I assume that is the most common implementation)?
Incidentally, I have not been able to use the pipe command to run an
external process yet, so any hints on that would be appreciated.
--
jay vyas
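For reference, a small sketch of pipe() (assuming a spark-shell where sc is defined, and tr on each worker's PATH): each element is written to the child process's stdin, and every line the process prints to stdout becomes an element of the resulting RDD.

val input = sc.parallelize(Seq("alpha", "beta", "gamma"), 2)

// One child process is launched per partition; its stdout is captured as the output RDD.
val upper = input.pipe(Seq("tr", "a-z", "A-Z"))

upper.collect().foreach(println)   // ALPHA, BETA, GAMMA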
essentially an implementation of something analogous to Hadoop's
streaming API.
On Sun, Jul 20, 2014 at 4:09 PM, jay vyas
wrote:
> According to the api docs for the pipe operator,
> def pipe(command: String): RDD
> <http://spark.apache.org/docs/1.0.0/api/scala/org/apache/s
this, possibly I could lend a hand if there are any loose ends needing
to be tied.
--
jay vyas
r JUnit
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-Testing-JUnit-with-Spark-tp10861.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
--
jay vyas
> at java.lang.Class.forName(Class.java:270)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
> at
> org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:89)
> ... 64 more
>
>
> On Sun, Aug 3, 2014 at 6:04 PM, Rahul Bhojwani <
> rahulbhojwani2...@gmail.com> wrote:
>
>> Hi,
>>
>> I used to run Spark scripts on my local machine. Now I am porting my code
>> to EMR and I am facing lots of problems.
>>
>> The main one now is that the spark script which is running properly on my
>> local machine is giving an error when run on an Amazon EMR cluster.
>> Here is the error:
>>
>>
>>
>>
>>
>> What can be the possible reason?
>> Thanks in advance
>> --
>>
>> [image: http://]
>> Rahul K Bhojwani
>> [image: http://]about.me/rahul_bhojwani
>> <http://about.me/rahul_bhojwani>
>>
>>
>
>
>
> --
>
> [image: http://]
> Rahul K Bhojwani
> [image: http://]about.me/rahul_bhojwani
> <http://about.me/rahul_bhojwani>
>
>
>
--
jay vyas
bug and learn.
>
> thanks
>
> sanjay
>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io
>
>
>
--
jay vyas
>
> Andy
>
>
> import sys
> from operator import add
>
> from pyspark import SparkContext
>
> # only stand alone jobs should create a SparkContext
> sc = SparkContext(appName="pyStreamingSparkRDDPipe")
>
> data = [1, 2, 3, 4, 5]
> rdd = sc.parallelize(data)
>
> def echo(data):
> print "python recieved: %s" % (data) # output winds up in the shell
> console in my cluster (ie. The machine I launched pyspark from)
>
> rdd.foreach(echo)
> print "we are done"
>
>
>
--
jay vyas
Hi spark !
I don't quite understand the semantics of RDDs in a streaming context
very well yet.
Are there any examples of how to implement CustomInputDStreams, with
corresponding Receivers, in the docs?
I've hacked together a custom stream, which is being opened and is
consuming data internal
ert a JavaRDD
>> into
>> > an iterator or iterable over the entire data set without using collect
>> or
>> > holding all data in memory.
>> >In many problems where it is desirable to parallelize intermediate
>> steps
>> > but use a single process for handling the final result this could be
>> very
>> > useful.
>>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>
--
jay vyas
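For what it's worth, RDD.toLocalIterator (and JavaRDD.toLocalIterator) is worth a look here: it pulls partitions back to the driver one at a time, so memory use is bounded by the largest partition rather than the whole data set. A minimal sketch, assuming a spark-shell where sc is defined:

val numbers = sc.parallelize(1 to 1000000, numSlices = 100)

// Pulls partitions to the driver lazily, one at a time, instead of collect()-ing everything.
val it: Iterator[Int] = numbers.map(_ * 2).toLocalIterator

it.take(5).foreach(println)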
Hi Spark! I found out why my RDDs weren't coming through in my spark
stream.
It turns out that onStart() needs to return, it seems - i.e. you
need to launch the worker part of your
start process in a thread. For example
def onStartMock():Unit ={
val future = new Thread(new
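A fuller sketch of that pattern (assuming the Spark 1.x receiver API; the MockReceiver name and the fake records are made up):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  // onStart() must return promptly: spawn the producing loop on its own thread.
  override def onStart(): Unit = {
    new Thread("mock-receiver") {
      override def run(): Unit = {
        var i = 0
        while (!isStopped()) {
          store(s"record-$i")   // hand data to Spark Streaming
          i += 1
          Thread.sleep(100)
        }
      }
    }.start()
  }

  // Nothing to clean up here; the loop above exits once isStopped() flips.
  override def onStop(): Unit = {}
}

// Usage: val lines = ssc.receiverStream(new MockReceiver)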
On Tue, Oct 21, 2014 at 11:02 AM, jay vyas
wrote:
> Hi Spark! I found out why my RDDs weren't coming through in my spark
> stream.
>
> It turns out that onStart() needs to return, it seems - i.e. you
> need to launch the worker part of your
> start process
--
jay vyas
>>>>>> On Oct 20, 2014, at 3:07 AM, Gerard Maas
>>>>>> wrote:
>>>>>>
>>>>>> Pinging TD -- I'm sure you know :-)
>>>>>>
>>>>>> -kr, Gerard.
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We have been implementing several Spark Streaming jobs that are
>>>>>>> basically processing data and inserting it into Cassandra, sorting it
>>>>>>> among
>>>>>>> different keyspaces.
>>>>>>>
>>>>>>> We've been following the pattern:
>>>>>>>
>>>>>>> dstream.foreachRDD { rdd =>
>>>>>>>   val records = rdd.map(elem => record(elem))
>>>>>>>   targets.foreach(target => records.filter { record =>
>>>>>>>     isTarget(target, record) }.writeToCassandra(target, table))
>>>>>>> }
>>>>>>>
>>>>>>> I've been wondering whether there would be a performance difference
>>>>>>> in transforming the dstream instead of transforming the RDD within the
>>>>>>> dstream with regards to how the transformations get scheduled.
>>>>>>>
>>>>>>> Instead of the RDD-centric computation, I could transform the
>>>>>>> dstream until the last step, where I need an rdd to store.
>>>>>>> For example, the previous transformation could be written as:
>>>>>>>
>>>>>>> val recordStream = dstream.map(elem => record(elem))
>>>>>>> targets.foreach{target => recordStream.filter(record =>
>>>>>>> isTarget(target,record)).foreachRDD(_.writeToCassandra(target,table))}
>>>>>>>
>>>>>>> Would there be a difference in execution and/or performance? What would
>>>>>>> be the preferred way to do this?
>>>>>>>
>>>>>>> Bonus question: Is there a better (more performant) way to sort the
>>>>>>> data into different "buckets" instead of filtering the data collection
>>>>>>> once per bucket?
>>>>>>>
>>>>>>> thanks, Gerard.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
--
jay vyas
A use case would be helpful.
Batches of RDDs from streams are going to have a temporal ordering in terms of
when they are processed in a typical application..., but maybe you could
shuffle the way batch iterations work.
> On Nov 3, 2014, at 11:59 AM, Josh J wrote:
>
> When I'm outputting the
Hi spark. I have a set of text files that are dependencies of my app.
They are less than 2 MB in total size.
What's the idiom for packaging text file dependencies for a Spark-based jar
file? Class resources in packages? Or distributing them separately?
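One possible route (a sketch only; the resource path is hypothetical): put the files under src/main/resources so they travel inside the application jar, then read them off the classpath wherever they are needed. SparkContext.addFile / SparkFiles.get is the alternative if they are shipped separately.

import scala.io.Source

// Reads a file bundled into the jar, e.g. src/main/resources/lookup/countries.txt
def loadResource(path: String): List[String] = {
  val stream = getClass.getResourceAsStream(path)   // null if the path isn't on the classpath
  require(stream != null, s"resource not found: $path")
  try Source.fromInputStream(stream).getLines().toList
  finally stream.close()
}

val countries = loadResource("/lookup/countries.txt")

// Shipping the file separately instead:
// sc.addFile("/local/path/countries.txt")
// val lines = Source.fromFile(org.apache.spark.SparkFiles.get("countries.txt")).getLines().toList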
Yup, it's very important that n > 1 for Spark Streaming jobs; if local, use
local[2].
The thing to remember is that your Spark receiver will take a thread to itself
to produce data, so you need another thread to consume it.
In a cluster manager like YARN or Mesos, the word thread is not used any
the periodic CPU spike - I had a reduceByKey, so was it
> doing that only after all the batch data was in?
>
> Thanks
>
--
jay vyas
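A minimal sketch of that point (host and port are placeholders): when running locally, the master needs at least two threads, one for the receiver and one for processing the batches, or nothing ever gets printed.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-sketch")
  .setMaster("local[2]")   // local[1] starves the job: the receiver occupies the only thread

val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
lines.print()

ssc.start()
ssc.awaitTermination()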
This seems pretty standard: your IntelliJ classpath isn't matched to the
correct ones that are used in the spark shell.
Are you using the SBT plugin? If not, how are you putting deps into IntelliJ?
> On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian
> wrote:
>
> hey guys
>
> I am at AmpCamp 2014
and when to use spark submit to execute python
> scripts/module
> Bonus points if one can point to an example library and how to run it :)
> Thanks
>
--
jay vyas
Here's an example of a Cassandra ETL that you can follow, which should exit on
its own. I'm using it as a blueprint for evolving Spark Streaming apps on top
of.
For me, I kill the streaming app with System.exit after a sufficient amount of
data is collected.
That seems to work for most any scena
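A sketch of that shutdown trick (the DStream, threshold, and counter are all made up): keep a running count of what each batch delivers and pull the plug once enough has arrived.

import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.streaming.dstream.DStream

val seen = new AtomicLong(0L)
val enough = 10000L

// `stream` stands in for whatever DStream the ETL is consuming (e.g. a socket or Kafka stream).
def stopWhenFull(stream: DStream[String]): Unit =
  stream.foreachRDD { rdd =>
    // This body runs on the driver, so ordinary JVM state works for the running total.
    if (seen.addAndGet(rdd.count()) >= enough) {
      System.exit(0)   // crude, but it ends the demo once enough data has been written
    }
  }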
https://github.com/jayunit100/SparkStreamingCassandraDemo
On this note, I've built a framework which is mostly "pure" so that functional
unit tests can be run composing mock data for Twitter statuses, with just
regular JUnit... That might be relevant also.
I think at some point we should come
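A sketch of what that looks like in practice (all names here are invented): keep the per-status transformation an ordinary function and test it with plain JUnit against hand-built mock statuses, no StreamingContext involved.

import org.junit.Assert.assertEquals
import org.junit.Test

case class MockStatus(user: String, text: String)

object StatusTransform {
  def toRecord(s: MockStatus): String = s"${s.user}\t${s.text.toLowerCase}"
}

class StatusTransformTest {
  @Test
  def lowercasesTheTextAndTabSeparates(): Unit = {
    val record = StatusTransform.toRecord(MockStatus("jay", "Hello SPARK"))
    assertEquals("jay\thello spark", record)
  }
}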
awaiting processing or does it just process them?
>
> Asim
>
--
jay vyas
Hi Enno. Might be worthwhile to cross-post this on dev@hadoop... Obviously a
simple Spark way to test this would be to change the URI to write to hdfs:// or
maybe you could do file://, and confirm that the extra slash goes away.
- if it's indeed a JetS3t issue, we should add a new unit test for
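A quick way to try that suggestion (paths and bucket name are placeholders): write the same RDD through different URI schemes and see which ones grow the extra slash.

val sample = sc.parallelize(Seq("a", "b", "c"))

sample.saveAsTextFile("file:///tmp/slash-test")           // local filesystem
sample.saveAsTextFile("hdfs:///tmp/slash-test")           // HDFS, if a default FS is configured
// sample.saveAsTextFile("s3n://some-bucket/slash-test")  // the jets3t-backed scheme in question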
I find importing a working SBT project into IntelliJ is the way to go.
How did you load the project into IntelliJ?
> On Jan 13, 2015, at 4:45 PM, Enno Shioji wrote:
>
> Had the same issue. I can't remember what the issue was but this works:
>
> libraryDependencies ++= {
> val sparkVers
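Enno's snippet is cut off above; for reference, a sketch of the kind of build.sbt that both SBT and IntelliJ's import handle cleanly (versions are illustrative, not taken from the thread):

name := "spark-app"

scalaVersion := "2.10.4"

val sparkVersion = "1.2.0"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
)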
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Benchmarking-Spark-with-YCSB-tp5813.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
--
Jay Vyas
http://jayunit100.blogspot.com