Re: Spark Streaming on top of Cassandra?

2015-05-21 Thread jay vyas
hi. I have a spark streaming -> cassandra application which you can
probably borrow pretty easily.

You can always rewrite part of it in Java if you need to, or else you can
just use Scala (see the blog post below if you want a Java-style dev
workflow with Scala using IntelliJ).

This application implements a Spark stream with Twitter and ETLs it into
either a file queue or Cassandra (see the commented-out Cassandra snippet).

https://github.com/jayunit100/SparkStreamingApps/blob/master/src/main/scala/sparkapps/ctakes/CTakesTwitterStreamingApp.scala

The Cassandra sink works really well with the Spark context's compile-time
bindings.
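
For reference, a minimal sketch of what the streaming write looks like with the
spark-cassandra-connector (from memory, so treat the import paths, keyspace/table/column
names, and connection config as assumptions to check against the connector docs):

import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._   // adds saveToCassandra to DStreams

// assumes spark.cassandra.connection.host is set on the SparkConf, and that
// tweets: DStream[(String, String)] of (id, body) pairs was built upstream
tweets.saveToCassandra("twitter_ks", "tweets", SomeColumns("id", "body"))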

Maybe just clone this repo down and use it as a blueprint :) There is a
blog post here about how to set up your IDE so that the dev workflow is
very similar to that of standard Java:
http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html

Good luck!

On Thu, May 21, 2015 at 4:24 PM, tshah77  wrote:

> Can some one provide example of Spark Streaming using Java?
>
> I have cassandra running but did not configure spark but would like to
> create Dstream.
>
> Thanks
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-top-of-Cassandra-tp1283p22978.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
jay vyas


Spark jobs without a login

2016-06-16 Thread jay vyas
Hi spark:

Is it possible to avoid reliance on a login user when running a spark job?

I'm running out of a container that doesn't supply a valid user name,
and thus I'm getting the following exception:

org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:675)

I'm not too worried about this - but it seems like it might be nice if
we could specify a user name as part of Spark's context, or as an
external parameter, rather than having to use the Java-based user/group
extractor.
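
For anyone else hitting this, one workaround (my assumption about the container
setup, not an official Spark knob) is the HADOOP_USER_NAME environment variable,
which Hadoop's simple (non-Kerberos) auth will pick up:

# set in the container before launching the JVM that creates the SparkContext
# ("spark" here is just a placeholder user name)
export HADOOP_USER_NAME=spark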



-- 
jay vyas


JavaSparkContext: dependency on ui/

2016-06-27 Thread jay vyas
I notice that there is a dependency from the SparkContext on the
"createLiveUI" functionality.

Is that really required? Or is there a more minimal JavaSparkContext we
can create?

I'm packaging a jar with a Spark client and would rather avoid resource
dependencies, as they might be trickier to maintain than class dependencies
alone.


java.lang.Exception: Could not find resource path for Web UI:
org/apache/spark/ui/static
at org.apache.spark.ui.JettyUtils$.createStaticHandler(JettyUtils.scala:182)
at org.apache.spark.ui.SparkUI.initialize(SparkUI.scala:73)
at org.apache.spark.ui.SparkUI.<init>(SparkUI.scala:81)
at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:215)
at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:157)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:445)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
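
For anyone else hitting this, one workaround (my assumption about a viable
setting, not a statement that the UI dependency is avoidable in general) is to
keep the context from standing up the web UI at all via spark.ui.enabled, which
sidesteps the static-resource lookup:

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

// minimal sketch: disable the web UI so ui/static resources are never looked up
val conf = new SparkConf()
  .setAppName("no-ui-client")
  .setMaster("local[2]")
  .set("spark.ui.enabled", "false")
val jsc = new JavaSparkContext(conf)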

-- 
jay vyas


Re: How to build Spark with my own version of Hadoop?

2015-07-22 Thread jay vyas
As you know, the Hadoop versions and so on are available in the Spark build
files; IIRC the top-level pom.xml has all the Maven variables for versions.

So I think if you just build Hadoop locally (i.e. version it as something like
2.2.1234-SNAPSHOT and mvn install it), you should be able to change the
corresponding variable in the top-level Spark pom.xml.

Of course, this is a Pandora's box where now you also need to deploy your
custom YARN on your cluster, make sure it matches the Spark target, and so
on (if you're running Spark on YARN).  RPMs and DEB packages tend to be
useful for this kind of thing, since you can easily sync the /etc/ config
files and uniformly manage/upgrade versions, etc.  Thus, if you're
really serious about building a custom distribution, mixing & matching
Hadoop components separately, you might want to consider using Apache
BigTop - just bring this up on that mailing list. We curate a Hadoop
distribution "builder" that builds Spark, Hadoop, Hive, Ignite, Kafka,
ZooKeeper, HBase and so on.  Since BigTop has all the tooling necessary
to fully build, test, and deploy your Hadoop bits on VMs/containers, it
might make your life a little easier.
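
As a rough sketch of those two steps (the exact flags and profiles depend on the
Spark/Hadoop versions you're on, so treat these as assumptions to verify against
the build docs):

# 1) install the patched Hadoop into the local Maven repo under a custom version
cd /path/to/hadoop && mvn install -DskipTests

# 2) build Spark against that version by overriding the hadoop.version property
#    (instead of editing the pom by hand)
cd /path/to/spark && mvn -Dhadoop.version=2.2.1234-SNAPSHOT -DskipTests clean package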



On Tue, Jul 21, 2015 at 11:11 PM, Dogtail Ray  wrote:

> Hi,
>
> I have modified some Hadoop code, and want to build Spark with the
> modified version of Hadoop. Do I need to change the compilation dependency
> files? How to then? Great thanks!
>



-- 
jay vyas


Re: Amazon DynamoDB & Spark

2015-08-07 Thread Jay Vyas
In general, the simplest way is to use the DynamoDB Java API as-is and call it
inside a map(), using the asynchronous put() DynamoDB API call.
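
A minimal sketch of what that might look like (assuming the AWS SDK v1 DynamoDB
classes; the table name, attribute names, and per-partition client setup are my
assumptions, so check the signatures against the SDK you actually have). I've used
foreachPartition rather than a plain map() so the client is created once per partition:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBAsyncClient
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
import scala.collection.JavaConverters._

// results: RDD[(String, String)] of (id, payload) pairs computed earlier in the job
results.foreachPartition { partition =>
  // one client per partition, using the default credentials provider chain
  val dynamo = new AmazonDynamoDBAsyncClient()
  val futures = partition.map { case (id, payload) =>
    val item = Map(
      "id"      -> new AttributeValue().withS(id),
      "payload" -> new AttributeValue().withS(payload)
    ).asJava
    dynamo.putItemAsync(new PutItemRequest().withTableName("my_table").withItem(item))
  }.toList
  futures.foreach(_.get())   // wait for the async puts to land before tearing the client down
  dynamo.shutdown()
}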


> On Aug 7, 2015, at 9:08 AM, Yasemin Kaya  wrote:
> 
> Hi,
> 
> Is there a way using DynamoDB in spark application? I have to persist my 
> results to DynamoDB.
> 
> Thanx,
> yasemin
> 
> -- 
> hiç ender hiç

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Unit Testing

2015-08-13 Thread jay vyas
Yes, there certainly is, so long as Eclipse has the right plugins and so on
to run Scala programs.  You're really asking two questions: (1) can I use a
modern IDE to develop Spark apps, and (2) can we easily unit test Spark
streaming apps?

The answer is yes to both.

Regarding your IDE:

I like to use IntelliJ with the Scala plugins set up.  It
allows you to run everything from inside the IDE.  I've written up setup
instructions here:
http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html

Now, regarding local unit testing:

As an example, here is a unit test for confirming that spark can write to
cassandra.

https://github.com/jayunit100/SparkStreamingApps/blob/master/src/test/scala/TestTwitterCassandraETL.scala

The key here is to just set a local master in the unit test (on the SparkConf), like so:

conf.setMaster("local[2]")

local[2] guarantees that you'll have a producer and a consumer, so that you
don't get a starvation scenario.
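
For completeness, a bare-bones skeleton of that kind of test (a rough sketch
assuming ScalaTest on the test classpath; the suite name and the queueStream
input are made up for illustration - the real thing is in the repo linked above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.scalatest.FunSuite

import scala.collection.mutable

class LocalStreamingSuite extends FunSuite {
  test("streaming logic runs against a local master") {
    // local[2]: one core for receiving, one for processing, so the test can't starve
    val conf = new SparkConf().setMaster("local[2]").setAppName("unit-test")
    val ssc = new StreamingContext(conf, Seconds(1))

    val counts = mutable.ArrayBuffer[Long]()
    val input = mutable.Queue(ssc.sparkContext.parallelize(1 to 10))
    ssc.queueStream(input).foreachRDD(rdd => counts += rdd.count())

    ssc.start()
    Thread.sleep(3000)                  // let a few 1-second batches run
    ssc.stop(stopSparkContext = true)

    assert(counts.sum >= 10)
  }
}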


On Wed, Aug 12, 2015 at 7:31 PM, Mohit Anchlia 
wrote:

> Is there a way to run spark streaming methods in standalone eclipse
> environment to test out the functionality?
>



-- 
jay vyas


Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Jay Vyas
It's a very valid idea indeed, but... it's a tricky subject, since the entire
ASF is run on mailing lists; hence there are so many different but equally
sound ways of looking at this idea, which conflict with one another.

> On Jan 21, 2015, at 7:03 AM, btiernay  wrote:
> 
> I think this is a really great idea for really opening up the discussions
> that happen here. Also, it would be nice to know why there doesn't seem to
> be much interest. Maybe I'm misunderstanding some nuance of Apache projects.
> 
> Cheers
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Discourse-A-proposed-alternative-to-the-Spark-User-list-tp20851p21288.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkSQL DateTime

2015-02-09 Thread jay vyas
Hi spark ! We are working on the bigpetstore-spark implementation in apache
bigtop, and want to implement idiomatic date/time usage for SparkSQL.

It appears that org.joda.time.DateTime isn't in SparkSQL's rolodex of
reflection types.

I'd rather not force an artificial dependency on Hive dates just for
dealing with timestamps.

What's the simplest and cleanest way to map non-Spark time values into
SparkSQL-friendly time values? (A rough sketch of one mapping follows the list below.)

- One option could be a custom SparkSQL type, I guess?
- Any plan to have native Spark SQL support for Joda Time or (yikes)
java.util.Calendar?
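
For concreteness, the kind of mapping I mean (a sketch, assuming java.sql.Timestamp
is among the types SparkSQL's reflection understands; the case class and table name
are made up):

import java.sql.Timestamp
import org.joda.time.DateTime

// keep Joda in the domain model, but expose java.sql.Timestamp to SparkSQL
case class Transaction(productId: Long, time: Timestamp)

def toRow(productId: Long, dt: DateTime): Transaction =
  Transaction(productId, new Timestamp(dt.getMillis))

// e.g. (Spark 1.2-era API, so treat the exact calls as assumptions):
//   import sqlContext.createSchemaRDD
//   sc.parallelize(raw).map { case (id, dt) => toRow(id, dt) }.registerTempTable("transactions")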

-- 
jay vyas


Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
Hi Spark.  Is there anything in the works for a typesafe, HQL-like API for
building Spark queries from case classes?  I.e., where, given a domain
object "Product" with a "cost" associated with it, we can do something
like:

query.select(Product).filter({ _.cost > 50.00f
}).join(ProductMetaData).by(product,meta=>product.id=meta.id). toSchemaRDD ?

I know the above snippet is totally wacky but, you get the idea :)


-- 
jay vyas


Re: Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
Ah, never mind - I just saw
http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language
integrated queries), which looks quite similar to what I was thinking
about.  I'll give that a whirl...
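
For the archives, the kind of query I was after looks roughly like this with the
language-integrated DSL (from memory, so verify the exact operators against the
guide above):

// assumes import sqlContext._ and that "products" is a SchemaRDD built
// from the Product case class above
val expensive = products.where('cost > 50.0f).select('id, 'cost)
expensive.collect().foreach(println)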

On Wed, Feb 11, 2015 at 7:40 PM, jay vyas 
wrote:

> Hi spark.  is there anything in the works for a  typesafe HQL like API for
> building spark queries from case classes ? i.e. where, given a domain
> object "Product" with a "cost" associated with it  , we can do something
> like:
>
> query.select(Product).filter({ _.cost > 50.00f
> }).join(ProductMetaData).by(product,meta=>product.id=meta.id).
> toSchemaRDD ?
>
> I know the above snippet is totally wacky but, you get the idea :)
>
>
> --
> jay vyas
>



-- 
jay vyas


Re: Spark Streaming and message ordering

2015-02-18 Thread jay vyas
This is a *fantastic* question.  The idea of how we identify individual
things in multiple DStreams is worth looking at.

The reason being that you can then fine-tune your streaming job based on
the RDD identifiers (i.e., are the timestamps from the producer correlating
closely to the order in which RDD elements are being produced?).  If *NO*,
then you need to (1) dial up throughput on the producer sources, or else (2)
increase cluster size so that Spark is capable of evenly handling the load.

You can't decide to do (1) or (2) unless you can track when the streaming
elements are being converted to RDDs by Spark itself.



On Wed, Feb 18, 2015 at 6:54 PM, Neelesh  wrote:

> There does not seem to be a definitive answer on this. Every time I google
> for message ordering,the only relevant thing that comes up is this  -
> http://samza.apache.org/learn/documentation/0.8/comparisons/spark-streaming.html
> .
>
> With a kafka receiver that pulls data from a single kafka partition of a
> kafka topic, are individual messages in the microbatch in same the order as
> kafka partition? Are successive microbatches originating from a kafka
> partition executed in order?
>
>
> Thanks!
>
>



-- 
jay vyas


Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Jay Vyas
- https://wiki.apache.org/incubator/IgniteProposal has, I think, been updated
recently and has a good comparison.

- Although GridGain has been around since the Spark days, Apache Ignite is
quite new and just getting started, I think.

- You will probably want to reach out to the developers for details on Ignite's
roadmap, because there might be interesting details not yet codified.

> On Feb 26, 2015, at 1:08 PM, Sean Owen  wrote:
> 
> Ignite is the renaming of GridGain, if that helps. It's like Oracle
> Coherence, if that helps. These do share some similarities -- fault
> tolerant, in-memory, distributed processing. The pieces they're built
> on differ, the architecture differs, the APIs differ. So fairly
> different in particulars. I never used the above, so can't be much
> more useful.
> 
> On Thu, Feb 26, 2015 at 5:46 PM, Ognen Duzlevski
>  wrote:
>> Can someone with experience briefly share or summarize the differences
>> between Ignite and Spark? Are they complementary? Totally unrelated?
>> Overlapping? Seems like ignite has reached version 1.0, I have never heard
>> of it until a few days ago and given what is advertised, it sounds pretty
>> interesting but I am unsure how this relates to or differs from Spark.
>> 
>> Thanks!
>> Ognen
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Untangling dependency issues in spark streaming

2015-03-29 Thread jay vyas
Thanks for posting this! I've run into similar issues before, and generally
it's a bad idea to swap the libraries out and "pray for the best", so the
shade functionality is probably the best option.

Unfortunately, I'm not sure how well SBT and Gradle support shading... how
do folks using next-gen build tools solve this problem?
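
For what it's worth, I believe sbt-assembly has (or grew) shade-rule support along
these lines - treat the setting below as an assumption about that plugin rather
than something I've verified:

// build.sbt - a sketch of the same relocation expressed as sbt-assembly shade rules
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.apache.http.**" -> "org.shaded.apache.http.@1").inAll
)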



On Sun, Mar 29, 2015 at 3:10 AM, Neelesh  wrote:

> Hi,
>   My streaming app uses org.apache.httpcomponent:httpclient:4.3.6, but
> spark uses 4.2.6 , and I believe thats what's causing the following error.
> I've tried setting
> spark.executor.userClassPathFirst & spark.driver.userClassPathFirst to
> true in the config, but that does not solve it either. Finally I had to
> resort to relocating classes using maven shade plugin while building my
> apps uber jar, using
>
> <relocations>
>   <relocation>
>     <pattern>org.apache.http</pattern>
>     <shadedPattern>org.shaded.apache.http</shadedPattern>
>   </relocation>
> </relocations>
>
>
> Hope this is useful to others in the same situation. It would be really great 
> to deal with this the right way (like tomcat or any other servlet container - 
> classloader hierarchy etc).
>
>
> Caused by: java.lang.NoSuchFieldError: INSTANCE
> at
> org.apache.http.impl.io.DefaultHttpRequestWriterFactory.(DefaultHttpRequestWriterFactory.java:52)
> at
> org.apache.http.impl.io.DefaultHttpRequestWriterFactory.(DefaultHttpRequestWriterFactory.java:56)
> at
> org.apache.http.impl.io.DefaultHttpRequestWriterFactory.(DefaultHttpRequestWriterFactory.java:46)
> at
> org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.(ManagedHttpClientConnectionFactory.java:72)
> at
> org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.(ManagedHttpClientConnectionFactory.java:84)
> at
> org.apache.http.impl.conn.ManagedHttpClientConnectionFactory.(ManagedHttpClientConnectionFactory.java:59)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.(PoolingHttpClientConnectionManager.java:494)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.(PoolingHttpClientConnectionManager.java:149)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.(PoolingHttpClientConnectionManager.java:138)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.(PoolingHttpClientConnectionManager.java:114)
>
> and ...
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.http.impl.conn.ManagedHttpClientConnectionFactory
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager$InternalConnectionFactory.(PoolingHttpClientConnectionManager.java:494)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.(PoolingHttpClientConnectionManager.java:149)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.(PoolingHttpClientConnectionManager.java:138)
> at
> org.apache.http.impl.conn.PoolingHttpClientConnectionManager.(PoolingHttpClientConnectionManager.java:114)
>
>


-- 
jay vyas


Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread jay vyas
Just the same as Spark was disrupting the Hadoop ecosystem by changing the
assumption that "you can't rely on memory in distributed analytics"... now
maybe we are challenging the assumption that "big data analytics need to be
distributed"?

I've been asking the same question lately, and have seen similarly that Spark
performs quite reliably and well on a local single-node system, even for a
streaming app which I ran for ten days in a row...  I almost felt guilty that
I never put it on a cluster!
On Mar 30, 2015 5:51 AM, "Steve Loughran"  wrote:

>
>  Note that even the Facebook "four degrees of separation" paper went down
> to a single machine running WebGraph (http://webgraph.di.unimi.it/) for
> the final steps, after running jobs in there Hadoop cluster to build the
> dataset for that final operation.
>
>  "The computations were performed on a 24-core machine with 72 GiB of
> memory and 1 TiB of disk space.6 The first task was to import the Facebook
> graph(s) into a compressed form for WebGraph [4], so that the multiple
> scans required by HyperANF’s diffusive process could be carried out
> relatively quickly."
>
>  Some toolkits/libraries are optimised for that single dedicated use —yet
> are downstream of the raw data; where memory reads $L1-$L3 cache locality
> becomes the main performance problem, and where synchronisation techniques
> like BSP aren't necessarily needed.
>
>
>
>
>  On 29 Mar 2015, at 23:18, Eran Medan  wrote:
>
>  Hi Sean,
> I think your point about the ETL costs are the wining argument here. but I
> would like to see more research on the topic.
>
> What I would like to see researched - is ability to run a specialized set
> of common algorithms in "fast-local-mode" just like a compiler optimizer
> can decide to inline some methods, or rewrite a recursive function as a for
> loop if it's in tail position, I would say that the future of GraphX can be
> that if a certain algorithm is a well known one (e.g. shortest paths) and
> can be run locally faster than on a distributed set (taking into account
> bringing all the data locally) then it will do so.
>
>  Thanks!
>
> On Sat, Mar 28, 2015 at 1:34 AM, Sean Owen  wrote:
>
>> (I bet the Spark implementation could be improved. I bet GraphX could
>> be optimized.)
>>
>> Not sure about this one, but "in core" benchmarks often start by
>> assuming that the data is local. In the real world, data is unlikely
>> to be. The benchmark has to include the cost of bringing all the data
>> to the local computation too, since the point of distributed
>> computation is bringing work to the data.
>>
>> Specialist implementations for a special problem should always win
>> over generalist, and Spark is a generalist. Likewise you can factor
>> matrices way faster in a GPU than in Spark. These aren't entirely
>> either/or propositions; you can use Rust or GPU in a larger
>> distributed program.
>>
>> Typically a real-world problem involves more than core computation:
>> ETL, security, monitoring. Generalists are more likely to have an
>> answer to hand for these.
>>
>> Specialist implementations do just one thing, and they typically have
>> to be custom built. Compare the cost of highly skilled developer time
>> to generalist computing resources; $1m buys several dev years but also
>> rents a small data center.
>>
>> Speed is an important issue but by no means everything in the real
>> world, and these are rarely mutually exclusive options in the OSS
>> world. This is a great piece of work, but I don't think it's some kind
>> of argument against distributed computing.
>>
>>
>> On Fri, Mar 27, 2015 at 6:32 PM, Eran Medan 
>> wrote:
>> > Remember that article that went viral on HN? (Where a guy showed how
>> GraphX
>> > / Giraph / GraphLab / Spark have worse performance on a 128 cluster
>> than on
>> > a 1 thread machine? if not here is the article
>> > -
>> http://www.frankmcsherry.org/graph/scalability/cost/2015/01/15/COST.html)
>> >
>> >
>> > Well as you may recall, this stirred up a lot of commotion in the big
>> data
>> > community (and Spark/GraphX in particular)
>> >
>> > People (justly I guess) blamed him for not really having “big data”, as
>> all
>> > of his data set fits in memory, so it doesn't really count.
>> >
>> >
>> > So he took the challenge and came with a pretty hard to argue counter
>> > benchmark, now with a huge data set (1TB of data, encoded using Hilbert
>> > curves to 154GB, but still large).
>> > see at -
>> >
>> http://www.frankmcsherry.org/graph/scalability/cost/2015/02/04/COST2.html
>> >
>> > He provided the source here https://github.com/frankmcsherry/COST as an
>> > example
>> >
>> > His benchmark shows how on a 128 billion edges graph, he got X2 to X10
>> > faster results on a single threaded Rust based implementation
>> >
>> > So, what is the counter argument? it pretty much seems like a blow in
>> the
>> > face of Spark / GraphX etc, (which I like and use on a daily basis)
>> >
>> > Before I dive into re-validat

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread jay vyas
Yup, a related JIRA is here:
https://issues.apache.org/jira/browse/SPARK-5113, which you might want to
leave a comment in.  This can be quite tricky, we found! But there are a
host of env-variable hacks you can use when launching Spark masters/slaves.
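
For the record, the kind of env-variable hacks I mean are along these lines
(names from memory, so verify against the docs for your Spark version; the
addresses are placeholders):

# on the driver side, before spark-submit / spark-shell
export SPARK_LOCAL_IP=10.8.0.2      # the VPN-facing address you want Spark to bind to
export SPARK_PUBLIC_DNS=10.8.0.2    # the address the cluster should use to reach back

# or equivalently as a conf property on the driver:
#   --conf spark.driver.host=10.8.0.2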

On Thu, Apr 2, 2015 at 5:18 PM, Michael Quinlan  wrote:

> I was able to hack this on my similar setup issue by running (on the
> driver)
>
> $ sudo hostname ip
>
> Where ip is the same value set in the "spark.driver.host" property. This
> isn't a solution I would use universally and hope the someone can fix this
> bug in the distribution.
>
> Regards,
>
> Mike
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-to-a-cluster-behind-a-VPN-configuring-different-IP-address-tp9360p22363.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
jay vyas


Re: Re: spark streaming printing no output

2015-04-16 Thread jay vyas
Empty folders generally mean that you just need to increase the window
intervals; i.e., Spark Streaming's saveAsTextFiles will save a folder for
each interval regardless of whether any data arrived.

On Wed, Apr 15, 2015 at 5:03 AM, Shushant Arora 
wrote:

> Its printing on console but on HDFS all folders are still empty .
>
> On Wed, Apr 15, 2015 at 2:29 PM, Shushant Arora  > wrote:
>
>> Thanks !! Yes message types on this console is seen on another console.
>>
>> When I closed another console, spark streaming job is printing messages
>> on console .
>>
>>  Isn't the message written on a port using netcat be avaible for multiple
>> consumers?
>>
>> On Wed, Apr 15, 2015 at 2:22 PM, bit1...@163.com  wrote:
>>
>>> Looks  the message is consumed by the another console?( can see
>>> messages typed on this port from another console.)
>>>
>>> --
>>> bit1...@163.com
>>>
>>>
>>> *From:* Shushant Arora 
>>> *Date:* 2015-04-15 17:11
>>> *To:* Akhil Das 
>>> *CC:* user@spark.apache.org
>>> *Subject:* Re: spark streaming printing no output
>>> When I launched spark-shell using, spark-shell ---master local[2].
>>> Same behaviour, no output on console but only timestamps.
>>>
>>> When I did, lines.saveAsTextFiles("hdfslocation",suffix);
>>> I get empty files of 0 bytes on hdfs
>>>
>>> On Wed, Apr 15, 2015 at 12:46 PM, Akhil Das 
>>> wrote:
>>>
>>>> Just make sure you have atleast 2 cores available for processing. You
>>>> can try launching it in local[2] and make sure its working fine.
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Tue, Apr 14, 2015 at 11:41 PM, Shushant Arora <
>>>> shushantaror...@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am running a spark streaming application but on console nothing is
>>>>> getting printed.
>>>>>
>>>>> I am doing
>>>>> 1.bin/spark-shell --master clusterMgrUrl
>>>>> 2.import org.apache.spark.streaming.StreamingContext
>>>>> import org.apache.spark.streaming.StreamingContext._
>>>>> import org.apache.spark.streaming.dstream.DStream
>>>>> import org.apache.spark.streaming.Duration
>>>>> import org.apache.spark.streaming.Seconds
>>>>> val ssc = new StreamingContext( sc, Seconds(1))
>>>>> val lines = ssc.socketTextStream("hostname",)
>>>>> lines.print()
>>>>> ssc.start()
>>>>> ssc.awaitTermination()
>>>>>
>>>>> Jobs are getting created when I see webUI but nothing gets printed on
>>>>> console.
>>>>>
>>>>> I have started a nc script on hostname  port  and can see messages
>>>>> typed on this port from another console.
>>>>>
>>>>>
>>>>>
>>>>> Please let me know If I am doing something wrong.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
jay vyas


SPARK_WORKER_PORT (standalone cluster)

2014-07-14 Thread jay vyas
Hi spark !

What is the purpose of the randomly assigned SPARK_WORKER_PORT?

From the documentation, it says it is used to "join a cluster", but it's not
clear to me how a random port could be used to communicate with other members
of a Spark pool.

This question might be grounded in my ignorance... if so, please just point
me to the right documentation if I'm missing something obvious :)

thanks !
-- 
jay vyas


Re: SPARK_WORKER_PORT (standalone cluster)

2014-07-16 Thread jay vyas
Now I see the answer to this.

Spark slaves start on random ports and tell the master where they are.
Then the master acknowledges them.

(worker logs)
Starting Spark worker :43282

(master logs)
Registering worker on :43282 with 8 cores, 16.5 GB RAM

Thus, the port is random because the slaves can be ephemeral.  Since the
master is fixed, though, a new slave can reconnect at any time.



On Mon, Jul 14, 2014 at 10:01 PM, jay vyas 
wrote:

> Hi spark !
>
> What is the purpose of the randomly assigned SPARK_WORKER_PORT
>
> from the documentation it sais to "join a cluster", but its not clear to
> me how a random port could be used to communicate with other members of a
> spark  pool.
>
> This question might be grounded in my ignorance ... if so please just
> point me to the right documentation if im mising something obvious :)
>
> thanks !
> --
> jay vyas
>



-- 
jay vyas


Re: Error with spark-submit (formatting corrected)

2014-07-17 Thread Jay Vyas
I think I know what is happening to you.  I've looked into this some just this
week, so it's fresh in my brain :) Hope this helps.


When no workers are known to the master, iirc, you get this message.

I think  this is how it works.

1) You start your master.
2) You start a slave, and give it the master URL as an argument.
3) The slave then binds to a random port.
4) The slave then does a handshake with the master, which you can see in the slave
logs (it says something like "successfully connected to master at ...").
  Actually, I think the master also logs that it is now aware of a slave running
on ip:port...

So in your case, I suspect, none of the slaves have connected to the master, so
the job sits idle.

This is similar to the YARN scenario of submitting a job to a resource manager
with no node managers running.
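
For reference, the handshake in (1)-(4) maps onto something like the following
standalone-mode commands (host names are placeholders; check the standalone docs
for your version for the exact scripts):

# on the master
./sbin/start-master.sh    # the master log / web UI will show the spark://host:7077 URL

# on each slave, pointing at that URL (this is where the handshake shows up in the logs)
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://masterhost:7077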



On Jul 17, 2014, at 6:57 PM, ranjanp  wrote:

> Hi, 
> I am new to Spark and trying out with a stand-alone, 3-node (1 master, 2
> workers) cluster. 
> 
> From the Web UI at the master, I see that the workers are registered. But
> when I try running the SparkPi example from the master node, I get the
> following message and then an exception. 
> 
> 14/07/17 01:20:36 INFO AppClient$ClientActor: Connecting to master
> spark://10.1.3.7:7077... 
> 14/07/17 01:20:46 WARN TaskSchedulerImpl: Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient memory 
> 
> I searched a bit for the above warning, and found and found that others have
> encountered this problem before, but did not see a clear resolution except
> for this link:
> http://apache-spark-user-list.1001560.n3.nabble.com/TaskSchedulerImpl-Initial-job-has-not-accepted-any-resources-check-your-cluster-UI-to-ensure-that-woy-tt8247.html#a8444
> 
> Based on the suggestion there I tried supplying --executor-memory option to
> spark-submit but that did not help. 
> 
> Any suggestions. Here are the details of my set up. 
> - 3 nodes (each with 4 CPU cores and 7 GB memory) 
> - 1 node configured as Master, and the other two configured as workers 
> - Firewall is disabled on all nodes, and network communication between the
> nodes is not a problem 
> - Edited the conf/spark-env.sh on all nodes to set the following: 
>  SPARK_WORKER_CORES=3 
>  SPARK_WORKER_MEMORY=5G 
> - The Web UI as well as logs on master show that Workers were able to
> register correctly. Also the Web UI correctly shows the aggregate available
> memory and CPU cores on the workers: 
> 
> URL: spark://vmsparkwin1:7077
> Workers: 2
> Cores: 6 Total, 0 Used
> Memory: 10.0 GB Total, 0.0 B Used
> Applications: 0 Running, 0 Completed
> Drivers: 0 Running, 0 Completed
> Status: ALIVE
> 
> I try running the SparkPi example first using the run-example (which was
> failing) and later directly using the spark-submit as shown below: 
> 
> $ export MASTER=spark://vmsparkwin1:7077
> 
> $ echo $MASTER
> spark://vmsparkwin1:7077
> 
> azureuser@vmsparkwin1 /cygdrive/c/opt/spark-1.0.0
> $ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> spark://10.1.3.7:7077 --executor-memory 1G --total-executor-cores 2
> ./lib/spark-examples-1.0.0-hadoop2.2.0.jar 10
> 
> 
> The following is the full screen output:
> 
> 14/07/17 01:20:13 INFO SecurityManager: Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 14/07/17 01:20:13 INFO SecurityManager: Changing view acls to: azureuser
> 14/07/17 01:20:13 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(azureuser)
> 14/07/17 01:20:14 INFO Slf4jLogger: Slf4jLogger started
> 14/07/17 01:20:14 INFO Remoting: Starting remoting
> 14/07/17 01:20:14 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sp...@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
> 14/07/17 01:20:14 INFO Remoting: Remoting now listens on addresses:
> [akka.tcp://sp...@vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49839]
> 14/07/17 01:20:14 INFO SparkEnv: Registering MapOutputTracker
> 14/07/17 01:20:14 INFO SparkEnv: Registering BlockManagerMaster
> 14/07/17 01:20:14 INFO DiskBlockManager: Created local directory at
> C:\cygwin\tmp\spark-local-20140717012014-b606
> 14/07/17 01:20:14 INFO MemoryStore: MemoryStore started with capacity 294.9
> MB.
> 14/07/17 01:20:14 INFO ConnectionManager: Bound socket to port 49842 with id
> = ConnectionManagerId(vmsparkwin1.cssparkwin.b1.internal.cloudapp.net,49842)
> 14/07/17 01:20:14 INFO BlockManagerMaster: Trying to register BlockManager
> 14/07/17 01:20:14 INFO BlockManagerInfo: Registering block manager
> vmsparkwin1.cssparkwin.b1.internal.cloudapp.net:49842 with 294.9 MB RAM
> 14/07/17 01:20:14 INFO BlockManagerMaster: Registered BlockManager
> 14/07/17 01:20:14 INFO HttpServer: Starting HTTP Server
> 14/07/17 01:20:14 INFO HttpBroadcast: Broadcast server started at
> http://10.1.3.7:49843
> 14/07/17 01:20:14 INFO HttpFileServer: HTTP File server d

RDD.pipe(...)

2014-07-20 Thread jay vyas
According to the API docs for the pipe operator
(http://spark.apache.org/docs/1.0.0/api/scala/org/apache/spark/rdd/RDD.html):

def pipe(command: String): RDD[String] - Return an RDD created by piping
elements to a forked external process.

However, it's not clear to me:

Will the outputted RDD capture the standard out from the process as its
output (I assume that is the most common implementation)?

Incidentally, I have not been able to use the pipe command to run an
external process yet, so any hints on that would be appreciated.

-- 
jay vyas


Re: RDD.pipe(...)

2014-07-20 Thread jay vyas
Never mind :) I found my answer in the docs for PipedRDD:

/**
 * An RDD that pipes the contents of each parent partition through an
external command
 * (printing them one per line) and returns the output as a collection of
strings.
 */
private[spark] class PipedRDD[T: ClassTag](

So, this is essentially an implementation of something analogous to Hadoop's
streaming API.
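
So a minimal usage looks something like this (a sketch - the piped command
obviously has to exist on every worker node):

// each element is written to the process's stdin, one per line;
// the resulting RDD[String] is the process's stdout, one element per line
val piped = sc.parallelize(Seq("apple", "banana", "cherry"))
              .pipe("grep an")
piped.collect().foreach(println)   // prints "banana"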




On Sun, Jul 20, 2014 at 4:09 PM, jay vyas 
wrote:

> According to the api docs for the pipe operator,
> def  pipe(command: String): RDD
> <http://spark.apache.org/docs/1.0.0/api/scala/org/apache/spark/rdd/RDD.html>
> [String]: Return an RDD created by piping elements to a forked external
> process.
> However, its not clear to me:
>
> Will the outputted RDD capture the standard out from the process as its
> output (i assume that is the most common implementation)?
>
> Incidentally, I have not been able to use the pipe command to run an
> external process yet, so any hints on that would be appreciated.
>
> --
> jay vyas
>



-- 
jay vyas


Spark over graphviz (SPARK-1015, SPARK-975)

2014-07-22 Thread jay vyas
Hi spark.

I see there has been some work around graphviz visualization for spark jobs.

1) I'm wondering if anyone is actively maintaining this stuff, and if so, what
the best docs are for it - or else, whether there is interest in an upstream
JIRA for updating the graphviz APIs.

2) Also, I am curious about utilities for visualizing/optimizing the flow of
data through an RDD at runtime, and where those are in the existing codebase.

Any thoughts around pipeline visualization for Spark would be appreciated.
I see some conversations about it in JIRAs, but I'm not sure what the future is
for this; possibly I could lend a hand if there are any loose ends needing
to be tied.

-- 
jay vyas


Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread jay vyas
I've been working some on building Spark blueprints, and recently tried to
generalize one to make it easy to blueprint Spark apps.

https://github.com/jayunit100/SparkBlueprint.git

It runs the spark app's main method in a unit test, and builds in SBT.

You can easily try it out and improve on it.

Obviously, calling a main method is the wrong kind of coupling for a unit
test, but it works pretty well in a simple CI environment.

I'll improve it eventually by injecting the SparkContext and validating the
RDD directly, in a future iteration.
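
Roughly the shape I have in mind for that iteration (a sketch, not what's in the
repo yet; assumes ScalaTest on the test classpath):

import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class WordCountSuite extends FunSuite with BeforeAndAfterAll {
  private var sc: SparkContext = _

  override def beforeAll() {
    sc = new SparkContext(new SparkConf().setMaster("local").setAppName("blueprint-test"))
  }
  override def afterAll() { sc.stop() }

  test("word count produces the expected pairs") {
    // call the app's logic with the injected context and validate the RDD directly
    val counts = sc.parallelize(Seq("a b", "b c"))
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    assert(counts.collectAsMap() === Map("a" -> 1, "b" -> 2, "c" -> 1))
  }
}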

Pull requests welcome :)





On Tue, Jul 29, 2014 at 11:29 AM, soumick86 
wrote:

> Is there any example out there for unit testing a Spark application in
> Java?
> Even a trivial application like word count will be very helpful. I am very
> new to this and I am struggling to understand how I can use JavaSpark
> Context for JUnit
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Unit-Testing-JUnit-with-Spark-tp10861.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>



-- 
jay vyas


Re: pyspark script fails on EMR with an ERROR in configuring object.

2014-08-03 Thread jay vyas
NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
> ... 57 more
> Caused by: java.lang.IllegalArgumentException: Compression codec
> com.hadoop.compression.lzo.LzoCodec not found.
> at
> org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:96)
> at
> org.apache.hadoop.io.compress.CompressionCodecFactory.(CompressionCodecFactory.java:134)
> at
> org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:38)
> ... 62 more
> Caused by: java.lang.ClassNotFoundException:
> com.hadoop.compression.lzo.LzoCodec
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
> at
> org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:89)
> ... 64 more
>
>
> On Sun, Aug 3, 2014 at 6:04 PM, Rahul Bhojwani <
> rahulbhojwani2...@gmail.com> wrote:
>
>> Hi,
>>
>> I used to run spark scripts on local machine. Now i am porting my codes
>> to EMR and i am facing lots of problem.
>>
>> The main one now is that the spark script which is running properly on my
>> local machine is giving error when run on Amazon EMR Cluster.
>> Here is the error:
>>
>> 
>>
>>
>>
>> What can be the possible reason?
>> Thanks in advance
>> --
>>
>>  [image: http://]
>> Rahul K Bhojwani
>> [image: http://]about.me/rahul_bhojwani
>>  <http://about.me/rahul_bhojwani>
>>
>>
>
>
>
> --
>
>  [image: http://]
> Rahul K Bhojwani
> [image: http://]about.me/rahul_bhojwani
>  <http://about.me/rahul_bhojwani>
>
>
>


-- 
jay vyas


Re: Spark inside Eclipse

2014-10-03 Thread jay vyas
For IntelliJ + SBT, you can also follow the directions at
http://jayunit100.blogspot.com/2014/07/set-up-spark-application-devleopment.html
.  It's really easy to run Spark in an IDE.  The process for Eclipse is
virtually identical.

On Fri, Oct 3, 2014 at 10:03 AM, Sanjay Subramanian <
sanjaysubraman...@yahoo.com.invalid> wrote:

> cool thanks will set this up and report back how things went
> regards
> sanjay
>   --
>  *From:* Daniel Siegmann 
> *To:* Ashish Jain 
> *Cc:* Sanjay Subramanian ; "
> user@spark.apache.org" 
> *Sent:* Thursday, October 2, 2014 6:52 AM
> *Subject:* Re: Spark inside Eclipse
>
> You don't need to do anything special to run in local mode from within
> Eclipse. Just create a simple SparkConf and create a SparkContext from
> that. I have unit tests which execute on a local SparkContext, and they
> work from inside Eclipse as well as SBT.
>
> val conf = new SparkConf().setMaster("local").setAppName(s"Whatever")
> val sc = new SparkContext(sparkConf)
>
> Keep in mind you can only have one local SparkContext at a time,
> otherwise you will get some weird errors. If you have tests running
> sequentially, make sure to close the SparkContext in your tear down
> method. If tests run in parallel you'll need to share the SparkContext
> between tests.
>
> For unit testing, you can make use of SparkContext.parallelize to set up
> your test inputs and RDD.collect to retrieve the outputs.
>
>
>
>
> On Wed, Oct 1, 2014 at 7:43 PM, Ashish Jain  wrote:
>
> Hello Sanjay,
> This can be done, and is a very effective way to debug.
> 1) Compile and package your project to get a fat jar
> 2) In your SparkConf use setJars and give location of this jar. Also set
> your master here as local in SparkConf
> 3) Use this SparkConf when creating JavaSparkContext
> 4) Debug your program like you would any normal program.
> Hope this helps.
> Thanks
> Ashish
> On Oct 1, 2014 4:35 PM, "Sanjay Subramanian"
>  wrote:
>
> hey guys
>
> Is there a way to run Spark in local mode from within Eclipse.
> I am running Eclipse Kepler on a Macbook Pro with Mavericks
> Like one can run hadoop map/reduce applications from within Eclipse and
> debug and learn.
>
> thanks
>
> sanjay
>
>
>
>
> --
> Daniel Siegmann, Software Developer
> Velos
> Accelerating Machine Learning
>
> 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
> E: daniel.siegm...@velos.io W: www.velos.io
>
>
>


-- 
jay vyas


Re: Does Ipython notebook work with spark? trivial example does not work. Re: bug with IPython notebook?

2014-10-10 Thread jay vyas
PySpark definitely works for me in the IPython notebook.  A good way to debug is
to do setMaster("local") on your Python sc object and see if that works.  Then,
from there, modify it to point to the real Spark server.

Also, I added a hack where I did a sys.path.insert of the path to pyspark in my
Python notebook to get it working properly.

You can try out the instructions below if you want, which I recently put
together based on some other stuff online, plus a few minor modifications:

http://jayunit100.blogspot.com/2014/07/ipython-on-spark.html


On Thu, Oct 9, 2014 at 2:50 PM, Andy Davidson  wrote:

> I wonder if I am starting iPython notebook incorrectly. The example in my
> original email does not work. It looks like stdout is not configured
> correctly If I submit it as a python.py file It works correctly
>
> Any idea how I what the problem is?
>
>
> Thanks
>
> Andy
>
>
> From: Andrew Davidson 
> Date: Tuesday, October 7, 2014 at 4:23 PM
> To: "user@spark.apache.org" 
> Subject: bug with IPython notebook?
>
> Hi
>
> I think I found a bug in the iPython notebook integration. I am not sure
> how to report it
>
> I am running spark-1.1.0-bin-hadoop2.4 on an AWS ec2 cluster. I start the
> cluster using the launch script provided by spark
>
> I start iPython notebook on my cluster master as follows and use an ssh
> tunnel to open the notebook in a browser running on my local computer
>
> ec2-user@ip-172-31-20-107 ~]$ IPYTHON_OPTS="notebook --pylab inline
> --no-browser --port=7000" /root/spark/bin/pyspark
>
> Bellow is the code my notebook executes
>
>
> Bug list:
>
>1. Why do I need to create a SparkContext? If I run pyspark
>interactively The context is created automatically for me
>2. The print statement causes the output to be displayed in the
>terminal I started pyspark, not in the notebooks output
>
> Any comments or suggestions would be greatly appreciated
>
> Thanks
>
> Andy
>
>
> import sys
> from operator import add
>
> from pyspark import SparkContext
>
> # only stand alone jobs should create a SparkContext
> sc = SparkContext(appName="pyStreamingSparkRDDPipe”)
>
> data = [1, 2, 3, 4, 5]
> rdd = sc.parallelize(data)
>
> def echo(data):
> print "python recieved: %s" % (data) # output winds up in the shell
> console in my cluster (ie. The machine I launched pyspark from)
>
> rdd.foreach(echo)
> print "we are done"
>
>
>


-- 
jay vyas


Streams: How do RDDs get Aggregated?

2014-10-11 Thread jay vyas
Hi spark !

I don't quite understand the semantics of RDDs in a streaming context
very well yet.

Are there any examples of how to implement custom InputDStreams, with
corresponding Receivers, in the docs?

I've hacked together a custom stream, which is being opened and is
consuming data internally; however, it is only producing empty RDDs, even
though I am calling store(...) multiple times.  I'm relying on the default
implementation of store(...), which may be a mistake on my end.

By making my slide duration small, I can make sure that the job indeed
finishes - however, I'm not quite sure how we're supposed to shuttle data
from the ReceiverInputDStream into RDDs?

Thanks!


Re: How do you write a JavaRDD into a single file

2014-10-20 Thread jay vyas
Sounds more like a use case for using "collect"... and writing out the file
in your program?
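
To make that concrete, here's a sketch of the driver-side write using
toLocalIterator (which the replies below also point at, and which only pulls
one partition at a time to the driver); the output path is a placeholder:

import java.io.PrintWriter

// results: RDD[String], already sorted into the order you want the file to have
val out = new PrintWriter("/tmp/report.txt")
try {
  results.toLocalIterator.foreach(out.println)   // streams partitions to the driver one at a time
} finally {
  out.close()
}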

On Mon, Oct 20, 2014 at 6:53 PM, Steve Lewis  wrote:

> Sorry I missed the discussion - although it did not answer the question -
> In my case (and I suspect the askers) the 100 slaves are doing a lot of
> useful work but the generated output is small enough to be handled by a
> single process.
> Many of the large data problems I have worked process a lot of data but
> end up with a single report file - frequently in a format specified by
> preexisting downstream code.
>   I do not want a separate  hadoop merge step for a lot of reasons
> starting with
> better control of the generation of the file.
> However toLocalIterator is exactly what I need.
> Somewhat off topic - I am being overwhelmed by getting a lot of emails
> from the list - is there s way to get a daily summary which might be a lot
> easier to keep up with
>
>
> On Mon, Oct 20, 2014 at 3:23 PM, Sean Owen  wrote:
>
>> This was covered a few days ago:
>>
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-write-a-RDD-into-One-Local-Existing-File-td16720.html
>>
>> The multiple output files is actually essential for parallelism, and
>> certainly not a bad idea. You don't want 100 distributed workers
>> writing to 1 file in 1 place, not if you want it to be fast.
>>
>> RDD and  JavaRDD already expose a method to iterate over the data,
>> called toLocalIterator. It does not require that the RDD fit entirely
>> in memory.
>>
>> On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis 
>> wrote:
>> >   At the end of a set of computation I have a JavaRDD . I want a
>> > single file where each string is printed in order. The data is small
>> enough
>> > that it is acceptable to handle the printout on a single processor. It
>> may
>> > be large enough that using collect to generate a list might be
>> unacceptable.
>> > the saveAsText command creates multiple files with names like part,
>> > part0001  This was bed behavior in Hadoop for final output and is
>> also
>> > bad for Spark.
>> >   A more general issue is whether is it possible to convert a JavaRDD
>> into
>> > an iterator or iterable over then entire data set without using collect
>> or
>> > holding all data in memory.
>> >In many problems where it is desirable to parallelize intermediate
>> steps
>> > but use a single process for handling the final result this could be
>> very
>> > useful.
>>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>


-- 
jay vyas


Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Hi Spark!  I found out why my RDDs weren't coming through in my Spark
stream.

It turns out that onStart() needs to return, it seems - i.e., you
need to launch the worker part of your
start process in a thread.  For example:

def onStartMock(): Unit = {
  val future = new Thread(new Runnable() {
    def run() {
      for (x <- 1 until 10) {
        val newMem = Runtime.getRuntime.freeMemory() / 12188091
        if (newMem != lastMem) {
          System.out.println("in thread : " + newMem)
        }
        lastMem = newMem
        // store(...) hands the data off to Spark; mockStatus and lastMem
        // come from the enclosing receiver class
        store(mockStatus)
      }
    }
  })
  // onStart() must be non-blocking, so kick the work off and return
  future.start()
}

Hope that helps somebody in the same situation.  FYI, it's in the docs :)

 * {{{
 *  class MyReceiver(storageLevel: StorageLevel) extends
NetworkReceiver[String](storageLevel) {
 *  def onStart() {
 *  // Setup stuff (start threads, open sockets, etc.) to start
receiving data.
 *  // Must start new thread to receive data, as onStart() must be
non-blocking.
 *
 *  // Call store(...) in those threads to store received data into
Spark's memory.
 *
 *  // Call stop(...), restart(...) or reportError(...) on any
thread based on how
 *  // different errors needs to be handled.
 *
 *  // See corresponding method documentation for more details
 *  }
 *
 *  def onStop() {
 *  // Cleanup stuff (stop threads, close sockets, etc.) to stop
receiving data.
 *  }
 *  }
 * }}}


Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Oh - and one other note on this, which appears to be the case.

If, in your stream's foreachRDD implementation, you do something stupid
(like call rdd.count()):

tweetStream.foreachRDD((rdd,lent)=> {
  tweetStream.repartition(1)
  numTweetsCollected+=1;
  //val count = rdd.count() DONT DO THIS !

You can also get stuck in a situation where your RDD processor blocks
infinitely.

And for Twitter-specific stuff, make sure to look at modifying the
TwitterInputDStream class
so that it implements the stuff from SPARK-2464; missing that can lead to
infinite stream reopening as well.



On Tue, Oct 21, 2014 at 11:02 AM, jay vyas 
wrote:

> Hi Spark !  I found out why my RDD's werent coming through in my spark
> stream.
>
> It turns out you need the onStart()  needs to return , it seems - i.e. you
> need to launch the worker part of your
> start process in a thread.  For example
>
> def onStartMock():Unit ={
>   val future = new Thread(new Runnable() {
> def run() {
>   for(x <- 1 until 10) {
> val newMem = Runtime.getRuntime.freeMemory()/12188091;
> if(newMem != lastMem){
>   System.out.println("in thread : " + newMem);
> }
> lastMem=newMem;
> store(mockStatus);
>   }
> }});
>
> Hope that helps somebody in the same situation.  FYI Its in the docs :)
>
>  * {{{
>  *  class MyReceiver(storageLevel: StorageLevel) extends
> NetworkReceiver[String](storageLevel) {
>  *  def onStart() {
>  *  // Setup stuff (start threads, open sockets, etc.) to start
> receiving data.
>  *  // Must start new thread to receive data, as onStart() must be
> non-blocking.
>  *
>  *  // Call store(...) in those threads to store received data
> into Spark's memory.
>  *
>  *  // Call stop(...), restart(...) or reportError(...) on any
> thread based on how
>  *  // different errors needs to be handled.
>  *
>  *  // See corresponding method documentation for more details
>  *  }
>  *
>  *      def onStop() {
>  *  // Cleanup stuff (stop threads, close sockets, etc.) to stop
> receiving data.
>  *  }
>  *  }
>  * }}}
>
>


-- 
jay vyas


Re: real-time streaming

2014-10-28 Thread jay vyas
A real-time stream, by definition, delivers data every X seconds.  You can
easily do this with Spark. Roughly, here is the way to create a stream
gobbler and attach a Spark app that reads its data every X seconds (a rough
end-to-end sketch follows the list below):

- Write a Runnable thread which reads data from a source.  Test that it
works independently.

- Add that thread into a DStream handler: implement onStart() so that
the thread above is launched in onStart(), and add logic to onStop() to
safely destroy the above thread.

- Set the window time (i.e. to 5 seconds)

- Start your spark streaming context, and run a forEachRDD (...) in your
spark app.

- Make sure that you launch with 2 or more workers.
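
Putting those steps together, here's a rough end-to-end sketch (the host/port,
window length, and app name are placeholders):

import java.net.Socket
import scala.io.Source
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.Receiver

// steps 1 + 2: a receiver whose onStart() launches the gobbler thread and returns
class SocketGobbler(host: String, port: Int)
    extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart() {
    new Thread("socket-gobbler") {
      override def run() {
        val socket = new Socket(host, port)
        // each line read from the source is handed to Spark via store(...)
        Source.fromInputStream(socket.getInputStream).getLines().foreach(store)
        socket.close()
      }
    }.start()
  }
  def onStop() { /* close sockets / signal the thread to exit */ }
}

object RealTimeApp {
  def main(args: Array[String]) {
    // step 5: local[2] (or more) so the receiver and the batch jobs both get a slot
    val conf = new SparkConf().setAppName("real-time-sketch").setMaster("local[2]")
    // step 3: the batch/window interval, e.g. 5 seconds
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.receiverStream(new SocketGobbler("localhost", 9999))
    // step 4: process each batch of RDDs as it arrives
    lines.foreachRDD(rdd => println("got " + rdd.count() + " records in this window"))
    ssc.start()
    ssc.awaitTermination()
  }
}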



On Tue, Oct 28, 2014 at 1:44 PM, ll  wrote:

> the spark tutorial shows that we can create a stream that reads "new files"
> from a directory.
>
> that seems to have some lag time, as we have to write the data to file
> first
> and then wait until spark stream picks it up.
>
> what is the best way to implement REAL 'REAL-TIME' streaming for analysis
> in
> real time?  for example, like streaming videos, sounds, images, etc
> continuously?
>
> thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/real-time-streaming-tp17526.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
jay vyas


Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread jay vyas
)
>>>> targets.foreach(target => records.filter{record =>
>>>> isTarget(target,record)}.writeToCassandra(target,table))
>>>> )
>>>>
>>>> - As most dstream transformations as possible before foreachRDD:
>>>>
>>>> val recordStream = dstream.map(elem => record(elem))
>>>> targets.foreach{target => recordStream.filter(record =>
>>>> isTarget(target,record)).foreachRDD(_.writeToCassandra(target,table))}
>>>>
>>>> ?
>>>>
>>>> kr, Gerard.
>>>>
>>>>
>>>>
>>>> On Wed, Oct 22, 2014 at 2:12 PM, Gerard Maas 
>>>> wrote:
>>>>
>>>>> Thanks Matt,
>>>>>
>>>>> Unlike the feared RDD operations on the driver, it's my understanding
>>>>> that these Dstream ops on the driver are merely creating an execution plan
>>>>> for each RDD.
>>>>> My question still remains: Is it better to foreachRDD early in the
>>>>> process or do as much Dstream transformations before going into the
>>>>> foreachRDD call?
>>>>>
>>>>> Maybe this will require some empirical testing specific to each
>>>>> implementation?
>>>>>
>>>>> -kr, Gerard.
>>>>>
>>>>>
>>>>> On Mon, Oct 20, 2014 at 5:07 PM, Matt Narrell 
>>>>> wrote:
>>>>>
>>>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>>>>>>
>>>>>> foreachRDD is executed on the driver….
>>>>>>
>>>>>> mn
>>>>>>
>>>>>> On Oct 20, 2014, at 3:07 AM, Gerard Maas 
>>>>>> wrote:
>>>>>>
>>>>>> Pinging TD  -- I'm sure you know :-)
>>>>>>
>>>>>> -kr, Gerard.
>>>>>>
>>>>>> On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We have been implementing several Spark Streaming jobs that are
>>>>>>> basically processing data and inserting it into Cassandra, sorting it 
>>>>>>> among
>>>>>>> different keyspaces.
>>>>>>>
>>>>>>> We've been following the pattern:
>>>>>>>
>>>>>>> dstream.foreachRDD(rdd =>
>>>>>>> val records = rdd.map(elem => record(elem))
>>>>>>> targets.foreach(target => records.filter{record =>
>>>>>>> isTarget(target,record)}.writeToCassandra(target,table))
>>>>>>> )
>>>>>>>
>>>>>>> I've been wondering whether there would be a performance difference
>>>>>>> in transforming the dstream instead of transforming the RDD within the
>>>>>>> dstream with regards to how the transformations get scheduled.
>>>>>>>
>>>>>>> Instead of the RDD-centric computation, I could transform the
>>>>>>> dstream until the last step, where I need an rdd to store.
>>>>>>> For example, the  previous  transformation could be written as:
>>>>>>>
>>>>>>> val recordStream = dstream.map(elem => record(elem))
>>>>>>> targets.foreach{target => recordStream.filter(record =>
>>>>>>> isTarget(target,record)).foreachRDD(_.writeToCassandra(target,table))}
>>>>>>>
>>>>>>> Would  be a difference in execution and/or performance?  What would
>>>>>>> be the preferred way to do this?
>>>>>>>
>>>>>>> Bonus question: Is there a better (more performant) way to sort the
>>>>>>> data in different "buckets" instead of filtering the data collection 
>>>>>>> times
>>>>>>> the #buckets?
>>>>>>>
>>>>>>> thanks,  Gerard.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>


-- 
jay vyas


Re: random shuffle streaming RDDs?

2014-11-03 Thread Jay Vyas
A use case would be helpful.

Batches of RDDs from streams are going to have a temporal ordering, in terms of
when they are processed in a typical application..., but maybe you could
shuffle the way batch iterations work.

> On Nov 3, 2014, at 11:59 AM, Josh J  wrote:
> 
> When I'm outputting the RDDs to an external source, I would like the RDDs to 
> be outputted in a random shuffle so that even the order is random. So far 
> what I understood is that the RDDs do have a type of order, in that the order 
> for spark streaming RDDs would be the order in which spark streaming read the 
> tuples from source (e.g. ordered by roughly when the producer sent the tuple 
> in addition to any latency)
> 
>> On Mon, Nov 3, 2014 at 8:48 AM, Sean Owen  wrote:
>> I think the answer will be the same in streaming as in the core. You
>> want a random permutation of an RDD? in general RDDs don't have
>> ordering at all -- excepting when you sort for example -- so a
>> permutation doesn't make sense. Do you just want a well-defined but
>> random ordering of the data? Do you just want to (re-)assign elements
>> randomly to partitions?
>> 
>> On Mon, Nov 3, 2014 at 4:33 PM, Josh J  wrote:
>> > Hi,
>> >
>> > Is there a nice or optimal method to randomly shuffle spark streaming RDDs?
>> >
>> > Thanks,
>> > Josh
> 


Embedding static files in a spark app

2014-11-08 Thread Jay Vyas
Hi Spark. I have a set of text files that are dependencies of my app.

They are less than 2 MB in total size.

What's the idiom for packaging text-file dependencies for a Spark-based jar
file? Class resources in packages? Or distributing them separately?
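
To make the two options concrete, a rough sketch of each (file names and paths
are placeholders):

// option 1: package the files under src/main/resources and read them off the classpath
val lookupTable = scala.io.Source
  .fromInputStream(getClass.getResourceAsStream("/lookup/cities.txt"))
  .getLines().toSeq

// option 2: distribute them separately at submit time and resolve them on the executors
// (spark-submit --files cities.txt, or programmatically:)
sc.addFile("cities.txt")
val path = org.apache.spark.SparkFiles.get("cities.txt")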
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Jay Vyas
Yup, very important that n > 1 for Spark streaming jobs; if local, use
local[2].

The thing to remember is that your Spark receiver will take a thread to itself
to produce data, so you need another thread to consume it.

In a cluster manager like YARN or Mesos, the word "thread" is not used anymore (I
guess it has a different meaning) - you need 2 or more free compute slots, and
that should be guaranteed by looking to see how many free node managers are
running, etc.
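
In code, the local case looks roughly like this (ZooKeeper/topic names are
placeholders, and the createStream signature is from the receiver-based Kafka
integration of that era, so double-check it against your version):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// local[2]: one slot for the Kafka receiver, at least one for processing
val conf = new SparkConf().setAppName("kafka-wordcount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))
val messages = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
messages.map(_._2).print()   // print the message bodies for each batch
ssc.start()
ssc.awaitTermination()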

> On Nov 12, 2014, at 7:53 PM, "Shao, Saisai"  wrote:
> 
> Did you configure Spark master as local, it should be local[n], n > 1 for 
> local mode. Beside there’s a Kafka wordcount example in Spark Streaming 
> example, you can try that. I’ve tested with latest master, it’s OK.
>  
> Thanks
> Jerry
>  
> From: Tobias Pfeiffer [mailto:t...@preferred.jp] 
> Sent: Thursday, November 13, 2014 8:45 AM
> To: Bill Jay
> Cc: u...@spark.incubator.apache.org
> Subject: Re: Spark streaming cannot receive any message from Kafka
>  
> Bill,
>  
> However, when I am currently using Spark 1.1.0. the Spark streaming job 
> cannot receive any messages from Kafka. I have not made any change to the 
> code.
>  
> Do you see any suspicious messages in the log output?
>  
> Tobias
>  


Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread jay vyas
1) You have a receiver thread. That thread might use a lot of CPU, or not,
depending on how you implement the thread in onStart.

2) Every 5 minutes, Spark will submit a job which processes
every RDD which was created (i.e. using the store() call) in the
receiver. That job runs asynchronously to the receiver, which
is still working to produce new RDDs for the next batch.


So, maybe you're monitoring the CPU only on the
Spark workers which are running the batch jobs, and not
on the Spark worker which is doing the RDD ingestion?
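
To make (1) concrete, here's a rough sketch of a custom receiver (the source is
made up): onStart hands off to its own thread, which keeps calling store()
while the batch jobs run asynchronously.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class DummyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  def onStart(): Unit = {
    // the receiver owns this thread; how busy it keeps the CPU is up to you
    new Thread("dummy-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          store("record-" + System.currentTimeMillis())  // feeds the current batch
          Thread.sleep(100)
        }
      }
    }.start()
  }
  def onStop(): Unit = {}  // the loop above exits once isStopped() flips
}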





On Thu, Nov 13, 2014 at 10:35 AM, Michael Campbell <
michael.campb...@gmail.com> wrote:

> I was running a proof of concept for my company with spark streaming, and
> the conclusion I came to is that spark collects data for the
> batch-duration, THEN starts the data-pipeline calculations.
>
> My batch size was 5 minutes, and the CPU was all but dead for 5, then when
> the 5 minutes were up the CPU's would spike for a while presumably doing
> the calculations.
>
> Is this presumption true, or is it running the data through the
> calculation pipeline before the batch is up?
>
> What could lead to the periodic CPU spike - I had a reduceByKey, so was it
> doing that only after all the batch data was in?
>
> Thanks
>



-- 
jay vyas


Re: Streaming: getting total count over all windows

2014-11-13 Thread jay vyas
I would think this should be done at the application level.
After all, the core functionality of Spark Streaming is to capture RDDs over
some real-time interval and process them -
not to aggregate their results.

But maybe there is a better way...
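
For instance (a sketch, assuming score_count_by_window is the (score, count)
DStream from your snippet and that a checkpoint directory is set):
updateStateByKey can carry a running total per score across batches.

import org.apache.spark.streaming.StreamingContext._  // pair-DStream implicits (pre-1.3)

ssc.checkpoint("/tmp/streaming-checkpoint")  // required for stateful ops; path is illustrative

val runningTotals = score_count_by_window.updateStateByKey[Long] {
  (newCounts: Seq[Long], total: Option[Long]) =>
    Some(total.getOrElse(0L) + newCounts.sum)
}
runningTotals.print()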

On Thu, Nov 13, 2014 at 8:28 PM, SK  wrote:

> Hi,
>
> I am using the following code to generate the (score, count) for each
> window:
>
> val score_count_by_window  = topic.map(r =>  r._2)   // r._2 is the integer
> score
>  .countByValue()
>
> score_count_by_window.print()
>
> E.g. output for a window is as follows, which means that within the Dstream
> for that window, there are 2 rdds with score 0; 3 with score 1, and 1 with
> score -1.
> (0, 2)
> (1, 3)
> (-1, 1)
>
> I would like to get the aggregate count for each score over all windows
> until program terminates. I tried countByValueAndWindow() but the result is
> same as countByValue() (i.e. it is producing only per window counts).
> reduceByWindow also does not produce the result I am expecting. What is the
> correct way to sum up the counts over multiple windows?
>
> thanks
>
>
>
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-getting-total-count-over-all-windows-tp1.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
jay vyas


Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Jay Vyas
This seems pretty standard: your IntelliJ classpath isn't matched to the
one that is actually used in spark-shell.

Are you using the SBT plugin? If not, how are you putting deps into IntelliJ?
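
If it is SBT, one thing worth double-checking (a sketch of a build.sbt
fragment; the version is only an example) is that every Spark module is pinned
to the exact same version you ran in spark-shell; mixing versions is a classic
source of IncompatibleClassChangeError:

// build.sbt (fragment)
val sparkVersion = "1.1.0"  // match whatever your spark-shell / cluster runs

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql"  % sparkVersion
)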

> On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian 
>  wrote:
> 
> hey guys
> 
> I am at AmpCamp 2014 at UCB right now :-) 
> 
> Funny Issue...
> 
> This code works in Spark-Shell but throws a funny exception in IntelliJ
> 
> CODE
> 
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
> val wikiData = 
> sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
> wikiData.registerTempTable("wikiData")
> sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username 
> <> '' GROUP BY username ORDER BY cnt DESC LIMIT 
> 10").collect().foreach(println)
> 
> RESULTS
> 
> [Waacstats,2003]
> [Cydebot,949]
> [BattyBot,939]
> [Yobot,890]
> [Addbot,853]
> [Monkbot,668]
> [ChrisGualtieri,438]
> [RjwilmsiBot,387]
> [OccultZone,377]
> [ClueBot NG,353]
> 
> 
> INTELLIJ CODE
> =
> object ParquetSql {
>   def main(args: Array[String]) {
> 
> val sconf = new 
> SparkConf().setMaster("local").setAppName("MedicalSideFx-NamesFoodSql")
> val sc = new SparkContext(sconf)
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
> val wikiData = 
> sqlContext.parquetFile("/Users/sansub01/mycode/knowledge/spark_ampcamp_2014/data/wiki_parquet")
> wikiData.registerTempTable("wikiData")
> val results = sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM 
> wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10")
> results.collect().foreach(println)
>   }
> 
> }
> 
> INTELLIJ ERROR
> ==
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found 
> interface org.apache.spark.serializer.Serializer, but class was expected
>   at 
> org.apache.spark.sql.parquet.ParquetFilters$.serializeFilterExpressions(ParquetFilters.scala:244)
>   at 
> org.apache.spark.sql.parquet.ParquetTableScan.execute(ParquetTableOperations.scala:109)
>   at 
> org.apache.spark.sql.execution.Filter.execute(basicOperators.scala:57)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
>   at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:48)
>   at 
> org.apache.spark.sql.execution.Exchange$$anonfun$execute$1.apply(Exchange.scala:45)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
>   at org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:151)
>   at 
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1.apply(Aggregate.scala:127)
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
>   at org.apache.spark.sql.execution.Aggregate.execute(Aggregate.scala:126)
>   at 
> org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:171)
>   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
>   at org.medicalsidefx.common.utils.ParquetSql$.main(ParquetSql.scala:18)
>   at org.medicalsidefx.common.utils.ParquetSql.main(ParquetSql.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
> 
> 
> 


Re: How to execute a custom python library on spark

2014-11-25 Thread jay vyas
A quick thought on this: I think this is distro-dependent also, right?
We ran into a similar issue in
https://issues.apache.org/jira/browse/BIGTOP-1546, where it looked like the
Python libraries might be overwritten on launch.

On Tue, Nov 25, 2014 at 3:09 PM, Chengi Liu  wrote:

> Hi,
>   I have written few datastructures as classes like following..
>
> So, here is my code structure:
>
> project/foo/foo.py , __init__.py
>   /bar/bar.py, __init__.py  bar.py  imports foo as from foo.foo
> import *
>  /execute/execute.py  imports bar as from bar.bar import *
>
> Ultimately I am executing execute.py as
>
> pyspark execute.py
>
> And this works fine locally.. but as soon I submit it on cluster... I see
> modules missing error..
> I tried to send each and every file using --py-files flag (foo.py bar.py )
> and other helper files..
>
> But even then it complaints that module is not found So, the question
> is.. When one is building a library which is suppose to execute on top of
> spark, how should the imports and library be structured so that it works
> fine on spark.
> When to use pyspark and when to use spark submit to execute python
> scripts/module
> Bonus points if one can point an example library and how to run it :)
> Thanks
>



-- 
jay vyas


Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Jay Vyas
Here's an example of a Cassandra ETL that you can follow, which should exit on
its own. I'm using it as a blueprint for building Spark Streaming apps on top
of Cassandra.

For me, I kill the streaming app with System.exit after a sufficient amount of
data is collected.

That seems to work for almost any scenario...

But I guess you could also stop it on the stream handler side if you are
writing a custom DStream.

https://github.com/jayunit100/SparkBlueprint/blob/master/src/main/scala/sparkapps/tweetstream/Processor.scala
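
If you'd rather not Ctrl+C or System.exit, a gentler sketch (Scala API shown;
the Java API is analogous) is to bound the run and then stop gracefully, so the
batches already received get processed and written to Cassandra first:

ssc.start()
ssc.awaitTermination(10 * 60 * 1000L)  // run for roughly ten minutes
// graceful = finish the in-flight batches (and their Cassandra writes) first
ssc.stop(stopSparkContext = true, stopGracefully = true)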

> On Dec 5, 2014, at 1:50 AM, Akhil Das  wrote:
> 
> Batch is the batch duration that you are specifying while creating the 
> StreamingContext, so at the end of every batch's computation the data will 
> get flushed to Cassandra, and why are you stopping your program with Ctrl + 
> C? You can always specify the time with the sc.awaitTermination(Duration) 
> 
> Thanks
> Best Regards
> 
>> On Fri, Dec 5, 2014 at 11:53 AM,  wrote:
>> Hi Gerard/Akhil,
>> 
>> By "how do I specify a batch" I was trying to ask that when does the data in 
>> the JavaDStream gets flushed into Cassandra table?. 
>> I read somewhere that the streaming data in batches gets written in 
>> Cassandra. This batch can be of some particular time, or one particular run.
>> That was what I was trying to understand, how to set that "Batch" in my 
>> program. Because if a batch means one cycle run of my  streaming app, then 
>> in my app, I'm hitting a Ctrl+C to kill the program. So the program is 
>> terminating, and would the data get inserted successfully into my Cassandra 
>> table?
>> For example, 
>> 
>> in Terminal-A I'm running Kafka-producer to stream-in messages. 
>> 
>> Terminal-B I'm running my Streaming App. In my App there is a line 
>> jssc.awaitTermination();​ which will keep running my App till I kill it.
>> Eventually I am hitting Ctrl+C in my App terminal, i.e. Terminal-B and 
>> killing it. So its a kind of ungraceful termination. So in this case will 
>> the data in my App DStream get written into Cassandra?
>> 
>> 
>> 
>> Thanks and Regards,
>> 
>> Md. Aiman Sarosh.
>> Accenture Services Pvt. Ltd.
>> Mob #:  (+91) - 9836112841.
>>  
>> From: Gerard Maas 
>> Sent: Thursday, December 4, 2014 10:22 PM
>> To: Akhil Das
>> Cc: Sarosh, M.; user@spark.apache.org
>> Subject: Re: Spark-Streaming: output to cassandra
>>  
>> I guess he's already doing so, given the 'saveToCassandra' usage.  
>> What I don't understand is the question "how do I specify a batch". That 
>> doesn't make much sense to me. Could you explain further?
>> 
>> -kr, Gerard.
>> 
>>> On Thu, Dec 4, 2014 at 5:36 PM, Akhil Das  
>>> wrote:
>>> You can use the datastax's  Cassandra connector.
>>> 
>>> Thanks
>>> Best Regards
>>> 
 On Thu, Dec 4, 2014 at 8:21 PM,  wrote:
 Hi,
 
 I have written the code below which is streaming data from kafka, and 
 printing to the console.
 I want to extend this, and want my data to go into Cassandra table instead.
 
 JavaStreamingContext jssc = new JavaStreamingContext("local[4]", 
 "SparkStream", new Duration(1000));
 JavaPairReceiverInputDStream messages = 
 KafkaUtils.createStream(jssc, args[0], args[1], topicMap );
 
 System.out.println("Connection done!");
 JavaDStream data = messages.map(new Function>>> String>, String>() 
 {
 public String call(Tuple2 message)
 {
 return message._2();
 }
 }
 );
 //data.print();   --> output to console
 data.foreachRDD(saveToCassandra("mykeyspace","mytable"));
 jssc.start();
 jssc.awaitTermination();
 
 
 How should I implement the line:
 data.foreachRDD(saveToCassandra("mykeyspace","mytable"));​
 so that data goes into Cassandra, in each batch.  And how do I specify a 
 batch, because if i do Ctrl+C on the console of streaming-job-jar, nothing 
 will be entered into cassandra for sure since it is getting killed.
 
 Please help.
 
 Thanks and Regards,
 
 Md. Aiman Sarosh.
 Accenture Services Pvt. Ltd.
 Mob #:  (+91) - 9836112841.
 
 
 This message is for the designated recipient only and may contain 
 privileged, proprietary, or otherwise confidential information. If you 
 have received it in error, please notify the sender immediately and delete 
 the original. Any other use of the e-mail by you is prohibited. Where 
 allowed by local law, electronic communications with Accenture and its 
 affiliates, including e-mail and instant messaging (including content), 
 may be scanned by our systems for the purposes of information security and 
 assessment of internal compliance with Accenture policy. 
 __
 
 www.accenture.com
> 


Re: Unit testing and Spark Streaming

2014-12-12 Thread Jay Vyas
https://github.com/jayunit100/SparkStreamingCassandraDemo
 
On this note, I've built a framework which is mostly "pure", so that functional
unit tests can be run by composing mock data for Twitter statuses with just
regular JUnit... That might be relevant also.

I think at some point we should come up with a robust test-driven framework
for building stream apps... And the idea of ScalaTest with the injection and
comparison you did might be a good start.
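
To make the "pure" idea concrete, a tiny sketch (names made up): keep the
transformation a plain function over ordinary collections, so a vanilla
JUnit/ScalaTest test can feed it mock statuses without a StreamingContext.

object HashtagExtractorTest {
  // plain function, no streaming machinery involved
  def hashtags(statuses: Seq[String]): Seq[String] =
    statuses.flatMap(_.split("\\s+")).filter(_.startsWith("#"))

  def main(args: Array[String]): Unit = {
    // an ordinary assertion over mock "tweets"
    assert(hashtags(Seq("loving #spark", "#cassandra + #spark streaming")) ==
      Seq("#spark", "#cassandra", "#spark"))
  }
}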

Thanks for starting this dialogue!

> On Dec 12, 2014, at 9:18 AM, Emre Sevinc  wrote:
> 
> On Fri, Dec 12, 2014 at 2:17 PM, Eric Loots  wrote:
> > How can the log level in test mode be reduced (or extended when needed) ?
> 
> Hello Eric,
> 
> The following might be helpful for reducing the log messages during unit 
> testing:
> 
>  http://stackoverflow.com/a/2736/236007
> 
> --
> Emre Sevinç
> https://be.linkedin.com/in/emresevinc
> 


Re: Spark Streaming Threading Model

2014-12-19 Thread jay vyas
So, at any point does a stream stop producing RDDs? If not, is there a
possibility, if the batching isn't working or is broken, that your disk /
RAM will fill up to the brim with unprocessed RDD backlog?
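
(One related knob, as a sketch with an arbitrary value: for receiver-based
streams you can cap the ingest rate, so a slow batch can't get buried under
new data.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("bounded-ingest")
  // at most 1000 records/sec per receiver
  .set("spark.streaming.receiver.maxRate", "1000")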

On Fri, Dec 19, 2014 at 1:29 PM, Silvio Fiorito <
silvio.fior...@granturing.com> wrote:
>
>   Batches will wait for the previous batch to finish. The monitoring
> console will show you the backlog of waiting batches.
>
>   From: Asim Jalis 
> Date: Friday, December 19, 2014 at 1:16 PM
> To: user 
> Subject: Spark Streaming Threading Model
>
>   Q: In Spark Streaming if your DStream transformation and output action
> take longer than the batch duration will the system process the next batch
> in another thread? Or will it just wait until the first batch’s RDD is
> processed? In other words does it build up a queue of buffered RDDs
> awaiting processing or does it just process them?
>
>  Asim
>


-- 
jay vyas


Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jay Vyas
Hi Enno.  Might be worthwhile to cross-post this on dev@hadoop... Obviously a
simple Spark-side way to test this would be to change the URI to write to
hdfs:// or maybe file://, and confirm that the extra slash goes away.
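
Something along these lines (a sketch; "stream" stands in for your
ReceiverInputDStream[String], and the paths are made up):

// same prefix, three schemes; if only the s3:// output grows the extra "//",
// that points squarely at the jets3t path-to-key translation
stream.saveAsTextFiles("s3://fake-test/1234")
stream.saveAsTextFiles("hdfs:///tmp/fake-test/1234")
stream.saveAsTextFiles("file:///tmp/fake-test/1234")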

- If it's indeed a JetS3t issue, we should add a new unit test for this, given
that the HCFS tests are passing for the jets3t filesystem yet this error still
exists.

- To learn how to run HCFS tests against any FileSystem, see the wiki page:
https://wiki.apache.org/hadoop/HCFS/Progress (see the July 14th entry on that
page).

- Is there another S3FileSystem implementation for AbstractFileSystem, or is
jets3t the only one?  That would be an easy way to test this, and also a good
workaround.

I'm also wondering why the jets3t filesystem is the AbstractFileSystem
implementation used by so many - is that the standard impl behind the
AbstractFileSystem interface?

> On Dec 23, 2014, at 6:06 AM, Enno Shioji  wrote:
> 
> Is anybody experiencing this? It looks like a bug in JetS3t to me, but 
> thought I'd sanity check before filing an issue.
> 
> 
> 
> I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with a S3 URL 
> ("s3://fake-test/1234").
> 
> The code does write to S3, but with double forward slashes (e.g. 
> "s3://fake-test//1234/-141933428/".
> 
> I did a debug and it seem like the culprit is 
> Jets3tFileSystemStore#pathToKey(path), which returns "/fake-test/1234/..." 
> for the input "s3://fake-test/1234/...". when it should hack off the first 
> forward slash. However, I couldn't find any bug report for JetS3t for this.
> 
> Am I missing something, or is this likely a JetS3t bug?
> 
> 
> 
> ᐧ


Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Jay Vyas
I find that importing a working SBT project into IntelliJ is the way to go.

How did you load the project into IntelliJ?

> On Jan 13, 2015, at 4:45 PM, Enno Shioji  wrote:
> 
> Had the same issue. I can't remember what the issue was but this works:
> 
> libraryDependencies ++= {
>   val sparkVersion = "1.2.0"
>   Seq(
> "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
> "org.apache.spark" %% "spark-streaming-twitter" % sparkVersion % 
> "provided",
> "org.apache.spark" %% "spark-streaming-kafka" % sparkVersion % "provided",
> "javax.servlet" % "javax.servlet-api" % "3.0.1" % "provided"
>   )
> }
> 
> In order to run classes in "main" source in Intellij, you must invoke it from 
> a source under "test" as Intellij won't provide the "provided" scope 
> libraries when running code in "main" source (but it will for sources under 
> "test").
> 
> With this config you can "sbt assembly" in order to get the fat jar without 
> Spark jars.  
> 
> 
> ᐧ
> 
>> On Tue, Jan 13, 2015 at 12:16 PM, Night Wolf  wrote:
>> Hi,
>> 
>> I'm trying to load up an SBT project in IntelliJ 14 (windows) running 1.7 
>> JDK, SBT 0.13.5 -I seem to be getting errors with the project.
>> 
>> The build.sbt file is super simple;
>> 
>> name := "scala-spark-test1"
>> version := "1.0"
>> 
>> scalaVersion := "2.10.4"
>> 
>> libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0"
>> 
>> Then I have a super simple test class;
>> 
>> package test
>> 
>> import org.apache.spark.{SparkContext, SparkConf}
>> 
>> case class Blah(s: Int, d: String)
>> 
>> object Test1  {
>>   def main(args: Array[String]): Unit = {
>> val sparkconf = new 
>> SparkConf().setMaster("local[4]").setAppName("test-spark")
>> val sc = new SparkContext(sparkconf)
>> 
>> val rdd = sc.parallelize(Seq(
>>   Blah(1,"dsdsd"),
>>   Blah(2,"daaa"),
>>   Blah(3,"dhghghgh")
>> ))
>> 
>> rdd.collect().foreach(println)
>> 
>>   }
>> }
>> 
>> When I try to run the Test1 object in IntelliJ I get the following error;
>> 
>> Exception in thread "main" java.lang.NoClassDefFoundError: 
>> javax/servlet/http/HttpServletResponse
>>  at 
>> org.apache.spark.HttpServer.org$apache$spark$HttpServer$$doStart(HttpServer.scala:73)
>>  at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:60)
>>  at org.apache.spark.HttpServer$$anonfun$1.apply(HttpServer.scala:60)
>>  at 
>> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1676)
>>  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>  at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1667)
>>  at org.apache.spark.HttpServer.start(HttpServer.scala:60)
>>  at org.apache.spark.HttpFileServer.initialize(HttpFileServer.scala:45)
>>  at org.apache.spark.SparkEnv$.create(SparkEnv.scala:304)
>>  at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:159)
>>  at org.apache.spark.SparkContext.(SparkContext.scala:232)
>>  at test.Test1$.main(Test1.scala:10)
>>  at test.Test1.main(Test1.scala)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  at java.lang.reflect.Method.invoke(Method.java:606)
>>  at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
>> Caused by: java.lang.ClassNotFoundException: 
>> javax.servlet.http.HttpServletResponse
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>  at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>>  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>  ... 18 more
>> 
>> 
>> For whatever reason it seems that IntelliJ isnt pulling in these deps. Doing 
>> an sbt run works fine. Looking at the project structure it seems that 7 libs 
>> dont get marked as a dependency for my module... But they are on the dep 
>> tree http://pastebin.com/REkQh5ux 
>> 
>> 
>> 
>> Is this something to do with the libs and scoping or shading in Spark and 
>> its associated libs? Has anyone else seen this issue?
>> 
>> Cheers,
>> NW
> 


Re: Benchmarking Spark with YCSB

2014-05-16 Thread Jay Vyas
I'm not sure what you mean... YCSB is for transactional systems.  Spark
isn't really in that category - it's an analytics platform.  RDDs by their
very nature are not transactional.


On Fri, May 16, 2014 at 6:37 AM, bhusted  wrote:

> Can anyone comment on what it would take to run Spark with YCSB and HBase
> for
> benchmarking?  Has this been done before or been reviewed?  If not, our
> company is looking to make an investment to perform the development
> necessary but we are looking for any technical insight on what it would
> take
> to make this happen.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Benchmarking-Spark-with-YCSB-tp5813.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>



-- 
Jay Vyas
http://jayunit100.blogspot.com