Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Jay Han
+1. It sounds awesome! Kiran Kumar Dusi wrote on Thu, 21 Mar 2024 at 14:16: > +1 > > On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri < > farsheed.asho...@gmail.com> wrote: > >> +1 >> >> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, >> wrote: >> >>> Some of you may be aware that Databricks community Home | Dat

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
wrote: >>> Did you check if mapreduce.fileoutputcommitter.algorithm.version 2 is supported on GCS? IIRC it wasn't, but you could check with GCP support >>> On Mon, Jul 17, 2023 at 3:54 PM Dipayan Dev wrote: >

Re: Spark File Output Committer algorithm for GCS

2023-07-17 Thread Jay
You can try increasing fs.gs.batch.threads and fs.gs.max.requests.per.batch. The definitions for these flags are available here - https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md On Mon, 17 Jul 2023 at 14:59, Dipayan Dev wrote: > No, I am using Spark 2.4
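
A minimal sketch of setting those connector flags from a Spark session (the values are illustrative only; tune them against the limits described in the linked CONFIGURATION.md):

  // GCS connector options are Hadoop configuration keys, so set them on the Hadoop conf
  spark.sparkContext.hadoopConfiguration.set("fs.gs.batch.threads", "20")
  spark.sparkContext.hadoopConfiguration.set("fs.gs.max.requests.per.batch", "20")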

Re: Spark 3 + Delta 0.7.0 Hive Metastore Integration Question

2020-12-20 Thread Jay
I think I found the issue, Hive metastore 2.3.6 doesn't have the necessary support. After upgrading to Hive 3.1.2 I was able to run the select query. On Sun, 20 Dec 2020 at 12:00, Jay wrote: > Thanks Matt. > > I have set the two configs in my sparkConfig as bel
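
For reference, a hedged sketch of the session settings Delta 0.7.0 documents for metastore-backed tables; whether they work still depends on a compatible Hive metastore version, as noted above:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("delta-hive-metastore")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .enableHiveSupport()
    .getOrCreate()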

Re: Spark 3 + Delta 0.7.0 Hive Metastore Integration Question

2020-12-19 Thread Jay
"SELECT * FROM $tableName") res2: org.apache.spark.sql.DataFrame = [col1: int] But when I try to do .show() it returns an error scala> spark.sql(s"SELECT * FROM $tableName").show() org.apache.spark.sql.AnalysisException: Table does not support reads: default.tblname_3;

Spark 3 + Delta 0.7.0 Hive Metastore Integration Question

2020-12-19 Thread Jay
Hi All - I have currently setup a Spark 3.0.1 cluster with delta version 0.7.0 which is connected to an external hive metastore. I run the below set of commands :- val tableName = tblname_2 spark.sql(s"CREATE TABLE $tableName(col1 INTEGER) USING delta options(path='GCS_PATH')") *20/12/19 17:30:

Spark-hive integration on HDInsight

2019-02-20 Thread Jay Singh
I am trying to integrate spark with hive on HDInsight spark cluster . I copied hive-site.xml in spark/conf directory. In addition I added hive metastore properties like jdbc connection info on Ambari as well. But still the database and tables created using spark-sql are not visible in hive. Cha

Re: Reg:- Py4JError in Windows 10 with Spark

2018-06-06 Thread Jay
Are you running this in local mode or cluster mode ? If you are running in cluster mode have you ensured that numpy is present on all nodes ? On Tue 5 Jun, 2018, 2:43 AM @Nandan@, wrote: > Hi , > I am getting error :- > > --

Re: Dataframe from 1.5G json (non JSONL)

2018-06-06 Thread Jay
I might have missed it but can you tell if the OOM is happening in driver or executor ? Also it would be good if you can post the actual exception. On Tue 5 Jun, 2018, 1:55 PM Nicolas Paris, wrote: > IMO your json cannot be read in parallell at all then spark only offers > you > to play again w

Re: spark partitionBy with partitioned column in json output

2018-06-04 Thread Jay
The partitionBy clause is used to create hive folders so that you can point a hive partitioned table on the data . What are you using the partitionBy for ? What is the use case ? On Mon 4 Jun, 2018, 4:59 PM purna pradeep, wrote: > im reading below json in spark > > {"bucket": "B01", "action
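
As a sketch of what partitionBy produces (the output path is an assumption; the "bucket" column comes from the quoted JSON), each distinct value of the column becomes a Hive-style sub-directory:

  // writes .../bucket=B01/part-*.json, .../bucket=B02/part-*.json, ...
  df.write.partitionBy("bucket").json("s3a://some-output/path")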

Re: [PySpark] Releasing memory after a spark job is finished

2018-06-04 Thread Jay
Can you tell us what version of Spark you are using and if Dynamic Allocation is enabled ? Also, how are the files being read ? Is it a single read of all files using a file matching regex or are you running different threads in the same pyspark job? On Mon 4 Jun, 2018, 1:27 PM Shuporno Choudhu

Re: Append In-Place to S3

2018-06-01 Thread Jay
Benjamin, The append will append the "new" data to the existing data without removing the duplicates. You would need to overwrite the file every time if you need unique values. Thanks, Jayadeep On Fri, Jun 1, 2018 at 9:31 PM Benjamin Kim wrote: > I have a situation where I trying to add only new r
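
A hedged sketch of the difference (paths, format and the newData DataFrame are assumptions): append keeps whatever is already there, so deduplication means rewriting the data set, for example into a staging location that is swapped in afterwards:

  val existing = spark.read.parquet("s3a://bucket/table")
  val merged = existing.union(newData).dropDuplicates()
  // overwrite a staging path rather than the path being read within the same job
  merged.write.mode("overwrite").parquet("s3a://bucket/table_staging")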

Re: How does partitioning happen for binary files in spark ?

2017-04-06 Thread Jay
The code that you see in github is for version 2.1. For versions below that the default partitions for binary files is set to 2 which you can change by using the minPartitions value. I am not sure starting 2.1 how the minPartitions column will work because as you said the field is completely ignore
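
A small sketch of the minPartitions hint mentioned above (path and value are illustrative):

  // each whole file becomes one (path, PortableDataStream) record; ask for at least 16 splits
  val bins = sc.binaryFiles("hdfs:///data/blobs", minPartitions = 16)
  println(bins.getNumPartitions)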

Re: spark 2.0 in intellij

2016-08-09 Thread Michael Jay
Hi, The problem has been solved simply by updating the scala sdk version from the incompatible 2.10.x to the correct version 2.11.x From: Michael Jay Sent: Tuesday, August 9, 2016 10:11:12 PM To: user@spark.apache.org Subject: spark 2.0 in intellij Dear all, I am

spark 2.0 in intellij

2016-08-09 Thread Michael Jay
Dear all, I am Newbie to Spark. Currently I am trying to import the source code of Spark 2.0 as a Module to an existing client project. I have imported Spark-core, Spark-sql and Spark-catalyst as maven dependencies in this client project. During compilation errors as missing SqlBaseParser.java

JavaSparkContext: dependency on ui/

2016-06-27 Thread jay vyas
JavaSparkContext.<init>(JavaSparkContext.scala:58) -- jay vyas

Spark jobs without a login

2016-06-16 Thread jay vyas
:675) I'm not too worried about this - but it seems like it might be nice if we could specify a user name as part of the Spark context or as an external parameter rather than having to use the Java-based user/group extractor. -- jay vyas

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Jay Luan
Thank you, that helps a lot. On Mon, Feb 22, 2016 at 6:01 PM, Takeshi Yamamuro wrote: > You're correct, reduceByKey is just an example. > > On Tue, Feb 23, 2016 at 10:57 AM, Jay Luan wrote: > >> Could you elaborate on how this would work? >> >> So from wh

Re: Force Partitioner to use entire entry of PairRDD as key

2016-02-22 Thread Jay Luan
Could you elaborate on how this would work? So from what I can tell, this maps a key to a tuple which always has a 0 as the second element. From there the hash widely changes because we now hash something like ((1,4), 0) and ((1,3), 0). Thus mapping this would create more even partitions. Why redu
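
A sketch of the trick being described, assuming a pair RDD named pairRdd and a target of 8 partitions: hashing the whole (key, value) pair spreads records that share a key across partitions, and the original shape is then restored:

  import org.apache.spark.HashPartitioner

  val spread = pairRdd
    .map { case (k, v) => ((k, v), 0) }      // the whole entry becomes the hash key
    .partitionBy(new HashPartitioner(8))
    .map { case ((k, v), _) => (k, v) }      // drop the dummy value again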

RE: [MLLIB] Best way to extract RandomForest decision splits

2016-02-10 Thread Jay Luan
Thanks for the reply, I'd like to export the decision splits for each tree out to an external file which is read elsewhere not using spark. As far as I know, saving a model to a path will save a bunch of binary files which can be loaded back into spark. Is this correct? On Feb 10, 2016 7:21 PM, "Mo
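
One way to get a plain-text dump of the splits without the binary model files, assuming an MLlib RandomForestModel named model (the output path is an assumption):

  import java.io.PrintWriter

  val pw = new PrintWriter("/tmp/forest-splits.txt")
  pw.write(model.toDebugString)   // human-readable listing of every tree and its split conditions
  pw.close()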

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-08 Thread Shipper, Jay [USA]
determine the root cause, as I cannot replicate this issue. Thanks for your help. From: Ted Yu <yuzhih...@gmail.com> Date: Friday, February 5, 2016 at 5:40 PM To: Jay Shipper <shipper_...@bah.com> Cc: "user@spark.apache.org"

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
could confirm they’re having this issue with Spark 1.6.0. Ideally, we should also have some simple proof of concept that can be posted with the bug. From: Ted Yu <yuzhih...@gmail.com> Date: Wednesday, February 3, 2016 at 3:57 PM To: Jay Shipper <shipper_...@bah.com>

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
One quick update on this: The NPE is not happening with Spark 1.5.2, so this problem seems specific to Spark 1.6.0. From: Jay Shipper <shipper_...@bah.com> Date: Wednesday, February 3, 2016 at 12:06 PM To: "user@spark.apache.org"

Re: [External] Re: Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
Right, I could already tell that from the stack trace and looking at Spark’s code. What I’m trying to determine is why that’s coming back as null now, just from upgrading Spark to 1.6.0. From: Ted Yu <yuzhih...@gmail.com> Date: Wednesday, February 3, 2016 at 12:04 PM To: Jay S

Spark 1.6.0 HiveContext NPE

2016-02-03 Thread Shipper, Jay [USA]
match what Spark 1.6.0 uses (2.6.0-cdh5.7.0-SNAPSHOT). Thanks, Jay

Re: How to run two operations on the same RDD simultaneously

2015-11-25 Thread Jay Luan
Ah, thank you so much, this is perfect On Fri, Nov 20, 2015 at 3:48 PM, Ali Tajeldin EDU wrote: > You can try to use an Accumulator ( > http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.Accumulator) > to keep count in map1. Note that the final count may be higher than th
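
A minimal sketch of the accumulator suggestion with the 1.x-era API (rdd and transform are stand-ins for the actual data and for whatever map1 does):

  val seen = sc.accumulator(0L, "records seen in map1")
  val mapped = rdd.map { x =>
    seen += 1L        // incremented as tasks run; retries can make the total overshoot
    transform(x)
  }
  mapped.saveAsTextFile("hdfs:///tmp/out")   // the value is only meaningful after an action
  println(seen.value)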

Re: "insert overwrite table phonesall" in spark-sql resulted in java.io.StreamCorruptedException

2015-08-20 Thread John Jay
The answer is that my table was not serialized by Kryo, but I started the spark-sql shell with Kryo, so the data could not be deserialized. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/insert-overwrite-table-phonesall-in-spark-sql-resulted-in-java-io-StreamCorr

Re: Unit Testing

2015-08-13 Thread jay vyas
urantees that you'll have a producer and a consumer, so that you don't get a starvation scenario. On Wed, Aug 12, 2015 at 7:31 PM, Mohit Anchlia wrote: > Is there a way to run spark streaming methods in standalone eclipse > environment to test out the functionality? > -- jay vyas

Re: Amazon DynamoDB & Spark

2015-08-07 Thread Jay Vyas
In general the simplest way is that you can use the Dynamo Java API as is and call it inside a map(), and use the asynchronous put() Dynamo api call . > On Aug 7, 2015, at 9:08 AM, Yasemin Kaya wrote: > > Hi, > > Is there a way using DynamoDB in spark application? I have to persist my > res
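
A hedged sketch of that pattern (AWS SDK v1, synchronous client for brevity rather than the async put mentioned above; the table name, attributes, and the rdd of (id, payload) string pairs are all assumptions):

  import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder
  import com.amazonaws.services.dynamodbv2.model.{AttributeValue, PutItemRequest}
  import scala.collection.JavaConverters._

  rdd.foreachPartition { part =>
    val client = AmazonDynamoDBClientBuilder.defaultClient()   // one client per partition, not per record
    part.foreach { case (id, payload) =>
      val item = Map("id" -> new AttributeValue(id), "payload" -> new AttributeValue(payload)).asJava
      client.putItem(new PutItemRequest().withTableName("results").withItem(item))
    }
  }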

Re: How to build Spark with my own version of Hadoop?

2015-07-22 Thread jay vyas
On Tue, Jul 21, 2015 at 11:11 PM, Dogtail Ray wrote: > Hi, > > I have modified some Hadoop code, and want to build Spark with the > modified version of Hadoop. Do I need to change the compilation dependency > files? How to then? Great thanks! > -- jay vyas

"insert overwrite table phonesall" in spark-sql resulted in java.io.StreamCorruptedException

2015-07-02 Thread John Jay
My spark-sql command: spark-sql --driver-memory 2g --master spark://hadoop04.xx.xx.com:8241 --conf spark.driver.cores=20 --conf spark.cores.max=20 --conf spark.executor.memory=2g --conf spark.driver.memory=2g --conf spark.akka.frameSize=500 --conf spark.eventLog.enabled=true --conf spark.eventLog.

Re: Spark Streaming on top of Cassandra?

2015-05-21 Thread jay vyas
unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- jay vyas

Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
/latest/api/scala/org/apache/spark/streaming/StreamingContext.html>. > The information on consumed offset can be recovered from the checkpoint. > > On Tue, May 19, 2015 at 2:38 PM, Bill Jay > wrote: > >> If a Spark streaming job stops at 12:01 and I resume the job at 12:02. >&

Re: Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
ade code. > > On Tue, May 19, 2015 at 12:42 PM, Bill Jay > wrote: > >> Hi all, >> >> I am currently using Spark streaming to consume and save logs every hour >> in our production pipeline. The current setting is to run a crontab job to >> check every minute

Spark Streaming + Kafka failure recovery

2015-05-19 Thread Bill Jay
Hi all, I am currently using Spark streaming to consume and save logs every hour in our production pipeline. The current setting is to run a crontab job to check every minute whether the job is still there and if not resubmit a Spark streaming job. I am currently using the direct approach for Kafk

Partition number of Spark Streaming Kafka receiver-based approach

2015-05-18 Thread Bill Jay
Hi all, I am reading the docs of receiver-based Kafka consumer. The last parameters of KafkaUtils.createStream is per topic number of Kafka partitions to consume. My question is, does the number of partitions for topic in this parameter need to match the number of partitions in Kafka. For example
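
For reference, the parameter in question is the last argument of the receiver-based API (the ZooKeeper quorum, group id and topic name below are placeholders); it controls how many consumer threads the receiver uses for that topic rather than being required to match the topic's partition count:

  import org.apache.spark.streaming.kafka.KafkaUtils

  val stream = KafkaUtils.createStream(
    ssc,
    "zk1:2181",            // ZooKeeper quorum
    "my-consumer-group",   // consumer group id
    Map("events" -> 2)     // topic -> number of consumer threads
  )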

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
each batch. On Thu, Apr 30, 2015 at 11:15 AM, Cody Koeninger wrote: > Did you use lsof to see what files were opened during the job? > > On Thu, Apr 30, 2015 at 1:05 PM, Bill Jay > wrote: > >> The data ingestion is in outermost portion in foreachRDD block. Although >>

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-30 Thread Bill Jay
t's > impossible, but I'd think we need some evidence before speculating this has > anything to do with it. > > > On Wed, Apr 29, 2015 at 6:50 PM, Bill Jay > wrote: > >> This function is called in foreachRDD. I think it should be running in >> the executors. I

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
> On Wed, Apr 29, 2015 at 2:30 PM, Ted Yu wrote: > >> Maybe add statement.close() in finally block ? >> >> Streaming / Kafka experts may have better insight. >> >> On Wed, Apr 29, 2015 at 2:25 PM, Bill Jay >> wrote: >> >>> Thanks for the suggestion.

Re: Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
rity/limits.conf* > Cheers > > On Wed, Apr 29, 2015 at 2:07 PM, Bill Jay > wrote: > >> Hi all, >> >> I am using the direct approach to receive real-time data from Kafka in >> the following link: >> >> https://spark.apache.org/docs/1.3.0/stream

Too many open files when using Spark to consume messages from Kafka

2015-04-29 Thread Bill Jay
Hi all, I am using the direct approach to receive real-time data from Kafka in the following link: https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html My code follows the word count direct example: https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/

Re: Re: spark streaming printing no output

2015-04-16 Thread jay vyas
> import org.apache.spark.streaming.StreamingContext._
> import org.apache.spark.streaming.dstream.DStream
> import org.apache.spark.streaming.Duration
> import org.apache.spark.streaming.Seconds
> val ssc = new StreamingContext( sc, Seconds(1))
> val lines = ssc.socketTextStream("hostname",)
> lines.print()
> ssc.start()
> ssc.awaitTermination()
>
> Jobs are getting created when I see webUI but nothing gets printed on console.
> I have started a nc script on hostname port and can see messages typed on this port from another console.
> Please let me know If I am doing something wrong.
-- jay vyas

Re: org.apache.spark.ml.recommendation.ALS

2015-04-13 Thread Jay Katukuri
park spark-mllib_2.11 1.3.0 org.apache.spark spark-sql_2.11 1.3.0 I am using scala version 2.11.2. Could it be that "spark-1.3.0-bin-hadoop2.4.tgz requires a different version of scala ? Thanks, Jay On Apr 9, 2015, at 4:38 PM, Xiang

Re: org.apache.spark.ml.recommendation.ALS

2015-04-08 Thread Jay Katukuri
ster local[8] ALSNew.jar /input_path The stack trace is exactly same. Thanks, Jay On Apr 8, 2015, at 10:47 AM, Jay Katukuri wrote: > some additional context: > > Since, I am using features of spark 1.3.0, I have downloaded spark 1.3.0 and > used spark-submit from there. >

Re: org.apache.spark.ml.recommendation.ALS

2015-04-08 Thread Jay Katukuri
-submit from my downloaded spark-1.3.0. On Apr 6, 2015, at 1:37 PM, Jay Katukuri wrote: > Here is the command that I have used : > > spark-submit --class packagename.ALSNew --num-executors 100 --master yarn > ALSNew.jar -jar spark-sql_2.11-1.3.0.jar hdfs://input_path > > Btw

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
s.html > > -Xiangrui > > On Mon, Apr 6, 2015 at 12:27 PM, Jay Katukuri wrote: >> Hi, >> >> Here is the stack trace: >> >> >> Exception in thread "main" java.lang.NoSuchMethodError: >> scala.reflect.api.JavaUniverse.runtimeMirror(L

Re: org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Thanks, Jay On Apr 6, 2015, at 12:24 PM, Xiangrui Meng wrote: > Please attach the full stack trace. -Xiangrui > > On Mon, Apr 6, 2015 at 12:

org.apache.spark.ml.recommendation.ALS

2015-04-06 Thread Jay Katukuri
al ratings = purchase.map ( line => line.split(',') match { case Array(user, item, rate) => (user.toInt, item.toInt, rate.toFloat) }).toDF() Any help is appreciated ! I have tried passing the spark-sql jar using the -jar spark-sql_2.11-1.3.0.jar Thanks, Jay On

Re: Submitting to a cluster behind a VPN, configuring different IP address

2015-04-02 Thread jay vyas
Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- jay vyas

Re: Single threaded laptop implementation beating a 128 node GraphX cluster on a 1TB data set (128 billion nodes) - What is a use case for GraphX then? when is it worth the cost?

2015-03-30 Thread jay vyas
Just the same as Spark was disrupting the Hadoop ecosystem by changing the assumption that "you can't rely on memory in distributed analytics"... now maybe we are challenging the assumption that "big data analytics need to be distributed"? I've been asking the same question lately and seen similarly t

Re: Untangling dependency issues in spark streaming

2015-03-29 Thread jay vyas
actory.<init>(PoolingHttpClientConnectionManager.java:494) > at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:149) > at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:138) > at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.<init>(PoolingHttpClientConnectionManager.java:114) -- jay vyas

Re: Apache Ignite vs Apache Spark

2015-02-26 Thread Jay Vyas
- https://wiki.apache.org/incubator/IgniteProposal has, I think, been updated recently and has a good comparison. - Although GridGain has been around since the Spark days, Apache Ignite is quite new and just getting started, I think, so you will probably want to reach out to the developers for

Re: Spark Streaming and message ordering

2015-02-18 Thread jay vyas
ons/spark-streaming.html > . > > With a kafka receiver that pulls data from a single kafka partition of a > kafka topic, are individual messages in the microbatch in same the order as > kafka partition? Are successive microbatches originating from a kafka > partition executed in order? &

Re: Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
Ah, nevermind, I just saw http://spark.apache.org/docs/1.2.0/sql-programming-guide.html (language integrated queries) which looks quite similar to what i was thinking about. I'll give that a whirl... On Wed, Feb 11, 2015 at 7:40 PM, jay vyas wrote: > Hi spark. is there anything in t

Strongly Typed SQL in Spark

2015-02-11 Thread jay vyas
}).join(ProductMetaData).by(product,meta=>product.id=meta.id). toSchemaRDD ? I know the above snippet is totally wacky but, you get the idea :) -- jay vyas

SparkSQL DateTime

2015-02-09 Thread jay vyas
hive dates just for dealing with time stamps. Whats the simplest and cleanest way to map non-spark time values into SparkSQL friendly time values? - One option could be a custom SparkSQL type, i guess? - Any plan to have native spark sql support for Joda Time or (yikes) java.util.Calendar ? -- jay vyas

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-02-03 Thread Jay Hutfles
efore performing any operations on it. That way, the EdgeRDDImpl class doesn't have to use the default partitioner. Hope this helps! Jay On Tue Feb 03 2015 at 12:35:14 AM NicolasC wrote: > On 01/29/2015 08:31 PM, Ankur Dave wrote: > > Thanks for the reminder. I just crea
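
A sketch of explicitly partitioning the graph first (gridGraph is assumed to be built elsewhere and the landmark vertex id is arbitrary):

  import org.apache.spark.graphx.{Graph, PartitionStrategy}
  import org.apache.spark.graphx.lib.ShortestPaths

  val partitioned = gridGraph.partitionBy(PartitionStrategy.RandomVertexCut)
  val result = ShortestPaths.run(partitioned, Seq(0L))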

Re: GraphX: ShortestPaths does not terminate on a grid graph

2015-01-29 Thread Jay Hutfles
Just curious, is this set to be merged at some point? On Thu Jan 22 2015 at 4:34:46 PM Ankur Dave wrote: > At 2015-01-22 02:06:37 -0800, NicolasC wrote: > > I try to execute a simple program that runs the ShortestPaths algorithm > > (org.apache.spark.graphx.lib.ShortestPaths) on a small grid gr

Re: Discourse: A proposed alternative to the Spark User list

2015-01-21 Thread Jay Vyas
It's a very valid idea indeed, but... It's a tricky subject since the entire ASF is run on mailing lists, hence there are so many different but equally sound ways of looking at this idea, which conflict with one another. > On Jan 21, 2015, at 7:03 AM, btiernay wrote: > > I think this is a re

Re: Problems with Spark Core 1.2.0 SBT project in IntelliJ

2015-01-13 Thread Jay Vyas
I find importing a working SBT project into IntelliJ is the way to go. How did you load the project into intellij? > On Jan 13, 2015, at 4:45 PM, Enno Shioji wrote: > > Had the same issue. I can't remember what the issue was but this works: > > libraryDependencies ++= { > val sparkVers

Re: ReceiverInputDStream#saveAsTextFiles with a S3 URL results in double forward slash key names in S3

2014-12-23 Thread Jay Vyas
Hi Enno. Might be worthwhile to cross-post this on dev@hadoop... Obviously a simple Spark way to test this would be to change the URI to write to hdfs:// or maybe you could do file://, and confirm that the extra slash goes away. - if it's indeed a jets3t issue we should add a new unit test for

spark-shell bug with RDDs and case classes?

2014-12-19 Thread Jay Hutfles
Found a problem in the spark-shell, but can't confirm that it's related to open issues on Spark's JIRA page. I was wondering if anyone could help identify if this is an issue or if it's already being addressed. Test: (in spark-shell) case class Person(name: String, age: Int) val peopleList = Lis

spark-shell bug with RDD distinct?

2014-12-19 Thread Jay Hutfles
this ( https://github.com/apache/spark/pull/1588) was closed. Is this something I just have to live with when using the REPL? Or is this covered by something bigger that's being addressed? Thanks in advance -Jay

Re: Spark Streaming Threading Model

2014-12-19 Thread jay vyas
awaiting processing or does it just process them? > > Asim > -- jay vyas

Re: Unit testing and Spark Streaming

2014-12-12 Thread Jay Vyas
https://github.com/jayunit100/SparkStreamingCassandraDemo On this note, I've built a framework which is mostly "pure" so that functional unit tests can be run composing mock data for Twitter statuses, with just regular junit... That might be relevant also. I think at some point we should come

Re: Spark-Streaming: output to cassandra

2014-12-05 Thread Jay Vyas
Here's an example of a Cassandra ETL that you can follow which should exit on its own. I'm using it as a blueprint for revolving Spark Streaming apps on top of. For me, I kill the streaming app with System.exit after a sufficient amount of data is collected. That seems to work for most any scena

GraphX Pregel halting condition

2014-12-03 Thread Jay Hutfles
I'm trying to implement a graph algorithm that does a form of path searching. Once a certain criteria is met on any path in the graph, I wanted to halt the rest of the iterations. But I can't see how to do that with the Pregel API, since any vertex isn't able to know the state of other arbitrary

Re: Lifecycle of RDD in spark-streaming

2014-11-27 Thread Bill Jay
hout issues. > > Regularly checking 'scheduling delay' and 'total delay' on the Streaming > tab in the UI is a must. (And soon we will have that on the metrics report > as well!! :-) ) > > -kr, Gerard. > > > > On Thu, Nov 27, 2014 at 8:14 AM, Bill Ja

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
your use case, the cleanest way to solve this, is by > asking Spark Streaming "remember" stuff for longer, by using > streamingContext.remember(). This will ensure that Spark > Streaming will keep around all the stuff for at least that duration. > Hope this helps. > > TD &g
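
The call being suggested is a one-liner on the streaming context (the duration is illustrative):

  import org.apache.spark.streaming.Minutes
  ssc.remember(Minutes(10))   // keep generated RDDs around for at least 10 minutes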

Re: Lifecycle of RDD in spark-streaming

2014-11-26 Thread Bill Jay
Just add one more point. If Spark streaming knows when the RDD will not be used any more, I believe Spark will not try to retrieve data it will not use any more. However, in practice, I often encounter the error of "cannot compute split". Based on my understanding, this is because Spark cleared ou

Re: How to execute a custom python library on spark

2014-11-25 Thread jay vyas
and when to use spark submit to execute python > scripts/module > Bonus points if one can point an example library and how to run it :) > Thanks > -- jay vyas

Re: Error when Spark streaming consumes from Kafka

2014-11-23 Thread Bill Jay
> https://github.com/dibbhatt/kafka-spark-consumer > > Regards, > Dibyendu > > On Sun, Nov 23, 2014 at 2:13 AM, Bill Jay > wrote: > >> Hi all, >> >> I am using Spark to consume from Kafka. However, after the job has run >> for several hours, I saw t

Error when Spark streaming consumes from Kafka

2014-11-22 Thread Bill Jay
Hi all, I am using Spark to consume from Kafka. However, after the job has run for several hours, I saw the following failure of an executor: kafka.common.ConsumerRebalanceFailedException: group-1416624735998_ip-172-31-5-242.ec2.internal-1416648124230-547d2c31 can't rebalance after 4 retries

Re: Code works in Spark-Shell but Fails inside IntelliJ

2014-11-20 Thread Jay Vyas
This seems pretty standard: your IntelliJ classpath isn't matched to the correct one used in the spark shell. Are you using the SBT plugin? If not, how are you putting deps into IntelliJ? > On Nov 20, 2014, at 7:35 PM, Sanjay Subramanian > wrote: > > hey guys > > I am at AmpCamp 2014

Spark streaming: java.io.IOException: Version Mismatch (Expected: 28, Received: 18245 )

2014-11-18 Thread Bill Jay
Hi all, I am running a Spark Streaming job. It was able to produce the correct results up to some time. Later on, the job was still running but producing no result. I checked the Spark streaming UI and found that 4 tasks of a stage failed. The error messages showed that "Job aborted due to stage

Re: Spark streaming cannot receive any message from Kafka

2014-11-18 Thread Bill Jay
sure there’s the a parameter in KafkaUtils.createStream you can specify > the spark parallelism, also what is the exception stacks. > > > > Thanks > > Jerry > > > > *From:* Bill Jay [mailto:bill.jaypeter...@gmail.com] > *Sent:* Tuesday, November 18, 2

Re: Spark streaming cannot receive any message from Kafka

2014-11-17 Thread Bill Jay
Thu, Nov 13, 2014 at 5:00 AM, Helena Edelson wrote: > I encounter no issues with streaming from kafka to spark in 1.1.0. Do you > perhaps have a version conflict? > > Helena > On Nov 13, 2014 12:55 AM, "Jay Vyas" wrote: > >> Yup , very important that n>1 f

Re: Streaming: getting total count over all windows

2014-11-13 Thread jay vyas
at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- jay vyas

Re: Does Spark Streaming calculate during a batch?

2014-11-13 Thread jay vyas
the periodic CPU spike - I had a reduceByKey, so was it > doing that only after all the batch data was in? > > Thanks > -- jay vyas

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Jay Vyas
should be local[n], n > 1 for > local mode. Beside there’s a Kafka wordcount example in Spark Streaming > example, you can try that. I’ve tested with latest master, it’s OK. > > Thanks > Jerry > > From: Tobias Pfeiffer [mailto:t...@preferred.jp] > Sent: Thursday,

Re: Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Streaming > example, you can try that. I’ve tested with latest master, it’s OK. > > > > Thanks > > Jerry > > > > *From:* Tobias Pfeiffer [mailto:t...@preferred.jp] > *Sent:* Thursday, November 13, 2014 8:45 AM > *To:* Bill Jay > *Cc:* u...@spark.incubator.apache.

Spark streaming cannot receive any message from Kafka

2014-11-12 Thread Bill Jay
Hi all, I have a Spark streaming job which constantly receives messages from Kafka. I was using Spark 1.0.2 and the job has been running for a month. However, when I am currently using Spark 1.1.0. the Spark streaming job cannot receive any messages from Kafka. I have not made any change to the co

Embedding static files in a spark app

2014-11-08 Thread Jay Vyas
Hi Spark. I have a set of text files that are dependencies of my app. They are less than 2 MB in total size. What's the idiom for packaging text file dependencies for a Spark-based jar file? Class resources in packages? Or distributing them separately?
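
For the first option the question mentions (classpath resources), a minimal sketch, assuming the files are bundled under src/main/resources and the path below is made up:

  import scala.io.Source

  val in = getClass.getResourceAsStream("/lookup/labels.txt")
  val labels = Source.fromInputStream(in).getLines().toVector   // read once on the driver, broadcast if executors need it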

Spark streaming job failed due to "java.util.concurrent.TimeoutException"

2014-11-03 Thread Bill Jay
Hi all, I have a spark streaming job that consumes data from Kafka and produces some simple operations on the data. This job is run in an EMR cluster with 10 nodes. The batch size I use is 1 minute and it takes around 10 seconds to generate the results that are inserted to a MySQL database. Howeve

Re: random shuffle streaming RDDs?

2014-11-03 Thread Jay Vyas
A use case would be helpful? Batches of RDDs from Streams are going to have temporal ordering in terms of when they are processed in a typical application ... , but maybe you could shuffle the way batch iterations work > On Nov 3, 2014, at 11:59 AM, Josh J wrote: > > When I'm outputting the

Re: Transforming the Dstream vs transforming each RDDs in the Dstream.

2014-10-29 Thread jay vyas
> On Oct 20, 2014, at 3:07 AM, Gerard Maas wrote:
> Pinging TD -- I'm sure you know :-) -kr, Gerard.
> On Fri, Oct 17, 2014 at 11:20 PM, Gerard Maas wrote:
>> Hi, We have been implementing several Spark Streaming jobs that are basically processing data and inserting it into Cassandra, sorting it among different keyspaces.
>> We've been following the pattern:
>> dstream.foreachRDD(rdd =>
>>   val records = rdd.map(elem => record(elem))
>>   targets.foreach(target => records.filter{record => isTarget(target,record)}.writeToCassandra(target,table))
>> )
>> I've been wondering whether there would be a performance difference in transforming the dstream instead of transforming the RDD within the dstream with regards to how the transformations get scheduled.
>> Instead of the RDD-centric computation, I could transform the dstream until the last step, where I need an rdd to store. For example, the previous transformation could be written as:
>> val recordStream = dstream.map(elem => record(elem))
>> targets.foreach{target => recordStream.filter(record => isTarget(target,record)).foreachRDD(_.writeToCassandra(target,table))}
>> Would there be a difference in execution and/or performance? What would be the preferred way to do this?
>> Bonus question: Is there a better (more performant) way to sort the data in different "buckets" instead of filtering the data collection times the #buckets?
>> thanks, Gerard.
-- jay vyas

Re: real-time streaming

2014-10-28 Thread jay vyas
t; To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- jay vyas

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
On Tue, Oct 21, 2014 at 11:02 AM, jay vyas wrote: > Hi Spark ! I found out why my RDD's werent coming through in my spark > stream. > > It turns out you need the onStart() needs to return , it seems - i.e. you > need to launch the worker part of your > start process

Re: Streams: How do RDDs get Aggregated?

2014-10-21 Thread jay vyas
Hi Spark ! I found out why my RDD's werent coming through in my spark stream. It turns out you need the onStart() needs to return , it seems - i.e. you need to launch the worker part of your start process in a thread. For example def onStartMock():Unit ={ val future = new Thread(new
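
A minimal sketch of the same idea as a full receiver, where onStart() only launches the worker thread and returns (the produced data is fake):

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class MockReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
    def onStart(): Unit = {
      new Thread("mock-receiver") {
        override def run(): Unit = {
          while (!isStopped()) {
            store("sample event")   // hand records to Spark
            Thread.sleep(100)
          }
        }
      }.start()                     // return immediately; the loop runs on its own thread
    }
    def onStop(): Unit = {}         // the loop exits once isStopped() is true
  }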

Re: How do you write a JavaRDD into a single file

2014-10-20 Thread jay vyas
ert a JavaRDD into an iterator or iterable over the entire data set without using collect or holding all data in memory. > In many problems where it is desirable to parallelize intermediate steps but use a single process for handling the final result this could be very useful. > -- Steven M. Lewis PhD, 4221 105th Ave NE, Kirkland, WA 98033, 206-384-1340 (cell), Skype lordjoe_com -- jay vyas

Streams: How do RDDs get Aggregated?

2014-10-11 Thread jay vyas
Hi spark ! I dont quite yet understand the semantics of RDDs in a streaming context very well yet. Are there any examples of how to implement CustomInputDStreams, with corresponding Receivers in the docs ? Ive hacked together a custom stream, which is being opened and is consuming data internal

Re: Does Ipython notebook work with spark? trivial example does not work. Re: bug with IPython notebook?

2014-10-10 Thread jay vyas
> Andy
>
> import sys
> from operator import add
> from pyspark import SparkContext
>
> # only stand alone jobs should create a SparkContext
> sc = SparkContext(appName="pyStreamingSparkRDDPipe")
>
> data = [1, 2, 3, 4, 5]
> rdd = sc.parallelize(data)
>
> def echo(data):
>     print "python recieved: %s" % (data)  # output winds up in the shell console in my cluster (ie. the machine I launched pyspark from)
>
> rdd.foreach(echo)
> print "we are done"
-- jay vyas

Re: Spark inside Eclipse

2014-10-03 Thread jay vyas
bug and learn. > > thanks > > sanjay > > > > > -- > Daniel Siegmann, Software Developer > Velos > Accelerating Machine Learning > > 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 > E: daniel.siegm...@velos.io W: www.velos.io > > > -- jay vyas

Re: questions about MLLib recommendation models

2014-08-08 Thread Jay Hutfles
is clear in the doc. You should use what Burak > suggested: > > val predictions = model.predict(data.map(x => (x.user, x.product))) > > Best, > Xiangrui > > On Thu, Aug 7, 2014 at 1:20 PM, Burak Yavuz wrote: > > Hi Jay, > > > > I've had the same problem you&#

questions about MLLib recommendation models

2014-08-07 Thread Jay Hutfles
sting for recommendation models?" Leave it nice and general... Thanks in advance. Sorry for the long ramble. Jay

Re: pyspark script fails on EMR with an ERROR in configuring object.

2014-08-03 Thread jay vyas
> at java.lang.Class.forName(Class.java:270)
> at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
> at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:89)
> ... 64 more
> On Sun, Aug 3, 2014 at 6:04 PM, Rahul Bhojwani <rahulbhojwani2...@gmail.com> wrote:
>> Hi, I used to run spark scripts on local machine. Now i am porting my codes to EMR and i am facing lots of problem.
>> The main one now is that the spark script which is running properly on my local machine is giving error when run on Amazon EMR Cluster. Here is the error:
>> What can be the possible reason? Thanks in advance
>> -- Rahul K Bhojwani, about.me/rahul_bhojwani
-- jay vyas

Re: Unit Testing (JUnit) with Spark

2014-07-29 Thread jay vyas
r JUnit > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Unit-Testing-JUnit-with-Spark-tp10861.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > -- jay vyas
