CQs on WindowedStream created on running StreamingContext
Hi,

We intend to run ad hoc windowed continuous queries on Spark Streaming data. The queries could be registered/deregistered dynamically or submitted through the command line. Currently Spark Streaming does not allow adding new inputs, transformations, or output operations after a StreamingContext has started, but the following changes in DStream.scala allow me to create a window on a DStream even after the StreamingContext has started (i.e. is in StreamingContextState.ACTIVE):

1) In DStream.validateAtInit(): allow adding new inputs, transformations, and output operations after the streaming context has started.
2) In DStream.persist(): allow changing the storage level of a DStream after the streaming context has started.

Ultimately the window API just does a slice on the parent and returns allRDDsInWindow. We create DataFrames out of the RDDs of this particular WindowedDStream and evaluate queries on those DataFrames.

1) Do you see any challenges or consequences with this approach?
2) Will these WindowedDStreams created on the fly be accounted for properly in runtime and memory management?
3) Why is creating new windows not allowed in the StreamingContextState.ACTIVE state?
4) Does it make sense to add our own implementation of WindowedDStream in this case?

- Yogesh
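For concreteness, here is a rough sketch of the kind of ad hoc windowed query described above, assuming the validateAtInit()/persist() relaxations are in place. The stream, column names, and helper (eventStream, registerWindowedQuery, window_events) are hypothetical, and SQLContext.getOrCreate requires Spark 1.5+.

    // Hedged sketch only: registers a 60s window (sliding every 10s) on an
    // existing DStream of (String, Long) pairs and runs a SQL query on each
    // window. Works only if output operations may be added after ssc.start().
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.Seconds
    import org.apache.spark.streaming.dstream.DStream

    def registerWindowedQuery(eventStream: DStream[(String, Long)], query: String): Unit = {
      val windowed = eventStream.window(Seconds(60), Seconds(10))
      windowed.foreachRDD { rdd =>
        val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
        import sqlContext.implicits._
        val df = rdd.toDF("key", "value")
        df.registerTempTable("window_events")
        sqlContext.sql(query).show()
      }
    }

    // e.g. registerWindowedQuery(events, "SELECT key, sum(value) FROM window_events GROUP BY key")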
Task Execution
Concerning task execution, does a worker execute its assigned tasks in parallel or sequentially?
Re: failed to run spark sample on windows
thanks a lot, it works now after I set %HADOOP_HOME%

On Tue, Sep 29, 2015 at 1:22 PM, saurfang wrote:
> See
> http://stackoverflow.com/questions/26516865/is-it-possible-to-run-hadoop-jobs-like-the-wordcount-sample-in-the-local-mode ,
> https://issues.apache.org/jira/browse/SPARK-6961 and finally
> https://issues.apache.org/jira/browse/HADOOP-10775. The easy solution is to
> download a Windows Hadoop distribution and point %HADOOP_HOME% to that
> location so winutils.exe can be picked up.
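If setting the environment variable is awkward (for example when launching from an IDE), the same thing can also be done from the driver before the SparkContext is created; Hadoop's Shell utilities honour the "hadoop.home.dir" system property. A minimal sketch, where C:\hadoop is just an example path that must contain bin\winutils.exe:

    import org.apache.spark.{SparkConf, SparkContext}

    object WindowsSample {
      def main(args: Array[String]): Unit = {
        // Equivalent to setting %HADOOP_HOME%; example path only.
        System.setProperty("hadoop.home.dir", "C:\\hadoop")

        val sc = new SparkContext(new SparkConf().setAppName("sample").setMaster("local[*]"))
        println(sc.parallelize(1 to 10).sum())   // trivial job to verify the setup
        sc.stop()
      }
    }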
Re: unsubscribe
Hi Sukesh,

To unsubscribe from the dev list, please send a message to dev-unsubscr...@spark.apache.org. To unsubscribe from the user list, please send a message to user-unsubscr...@spark.apache.org. Please see: http://spark.apache.org/community.html#mailing-lists.

Thanks,
-Rick

sukesh kumar wrote on 09/28/2015 11:39:01 PM:
> From: sukesh kumar
> To: "u...@spark.apache.org" , "dev@spark.apache.org"
> Date: 09/28/2015 11:39 PM
> Subject: unsubscribe
>
> unsubscribe
>
> --
> Thanks & Best Regards
> Sukesh Kumar
Re: Hive permanent functions are not available in Spark SQL
+user list

On Tue, Sep 29, 2015 at 3:43 PM, Pala M Muthaia wrote:
> Hi,
>
> I am trying to use internal UDFs that we have added as permanent functions
> to Hive from within a Spark SQL query (using HiveContext), but I encounter
> NoSuchObjectException, i.e. the function could not be found.
>
> However, if I execute the 'show functions' command in Spark SQL, the permanent
> functions appear in the list.
>
> I am using Spark 1.4.1 with Hive 0.13.1. I tried to debug this by looking
> at the log and code, but it seems the 'show functions' command and the UDF
> query go through essentially the same code path, yet the former can see the
> UDF and the latter can't.
>
> Any ideas on how to debug/fix this?
>
> Thanks,
> pala
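A minimal sketch of the repro, against Spark 1.4.x APIs; my_permanent_udf and some_table are placeholder names standing in for whatever is registered in the Hive metastore:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-udf-repro"))
    val hiveContext = new HiveContext(sc)

    // The permanent function shows up here...
    hiveContext.sql("SHOW FUNCTIONS").collect().foreach(println)

    // ...but using it in a query reportedly fails with NoSuchObjectException.
    hiveContext.sql("SELECT my_permanent_udf(col1) FROM some_table").show()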
GraphX PageRank keeps 3 copies of graph in memory
Dear Spark developers,

I would like to understand GraphX caching behavior with regard to PageRank in Spark, in particular the following implementation of PageRank:
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

On each iteration a new graph is created and cached, and the old graph is unpersisted:

1) Create the new graph and cache it:

    rankGraph = rankGraph.joinVertices(rankUpdates) {
      (id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum
    }.cache()

2) Unpersist the old one:

    prevRankGraph.vertices.unpersist(false)
    prevRankGraph.edges.unpersist(false)

According to the code, at the end of each iteration only one graph should be in memory, i.e. one EdgeRDD and one VertexRDD. During the iteration, exactly between the lines quoted above, there will be two graphs, old and new, i.e. two pairs of Edge and Vertex RDDs.

However, when I run the example provided in the Spark examples folder, I observe different behavior. Run the example (I checked that it executes the code above):

    $SPARK_HOME/bin/spark-submit --class "org.apache.spark.examples.graphx.SynthBenchmark" --master spark://mynode.net:7077 $SPARK_HOME/examples/target/spark-examples.jar

According to "Storage" and the RDD DAG in the Spark UI, 3 VertexRDDs and 3 EdgeRDDs are cached, even after all iterations have finished, although the code above suggests caching at most 2 (and only at a particular stage of the iteration):
https://drive.google.com/file/d/0BzYMzvDiCep5WFpnQjFzNy0zYlU/view?usp=sharing

Edges (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S2JtYnhVTlV1Sms/view?usp=sharing

Vertices (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S1k4N2NFb05RZDA/view?usp=sharing

Could you explain why 3 VertexRDDs and 3 EdgeRDDs are cached? Is it OK that there is double caching in the code, given that joinVertices implicitly caches vertices and then the graph is cached again in the PageRank code?

Best regards, Alexander
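One way to cross-check what the Storage tab shows is to ask the driver directly after the run finishes. This is only a sketch; getPersistentRDDs reports whatever is still registered as persisted with the SparkContext, and the concrete class names (e.g. VertexRDDImpl / EdgeRDDImpl) may differ between versions:

    // Count the RDDs still marked as persisted, grouped by their concrete type.
    sc.getPersistentRDDs.values
      .groupBy(_.getClass.getSimpleName)
      .foreach { case (kind, rdds) => println(s"$kind: ${rdds.size} still cached") }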
Speculatively using spare capacity
Hello,

How feasible would it be to have Spark speculatively increase the number of partitions when there is spare capacity in the system? We want to do this to decrease application runtime. Initially, we will assume that function calls of the same type have the same runtime (e.g. all maps take equal time) and that the runtime scales linearly with the number of workers. If a numPartitions value is specified, we may increase beyond it, but if a Partitioner is specified, we would not change the number of partitions.

Some initial questions we had:
* Does Spark already do this?
* Is there interest in supporting this functionality?
* Are there any potential issues that we should be aware of?
* What changes would need to be made for such a project?

Thanks,
Muhammed
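To illustrate the numPartitions-versus-Partitioner distinction drawn above (this is a sketch of the proposal, not existing Spark behavior; the names and values are made up):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object PartitionChoiceSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-sketch").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

        // Only a partition count is requested; a speculative scheme could in
        // principle substitute a larger value when capacity is spare.
        val byCount = pairs.reduceByKey(_ + _, 8)

        // An explicit Partitioner fixes both the count and the key placement,
        // so it would be left untouched.
        val byPartitioner = pairs.reduceByKey(new HashPartitioner(8), _ + _)

        println(s"${byCount.partitions.length} vs ${byPartitioner.partitions.length}")
        sc.stop()
      }
    }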
Re: Speculatively using spare capacity
Why change the number of partitions of RDDs, especially since you can't generally do that without a shuffle? If you just mean to ramp resource usage up and down, dynamic allocation (of executors) already does that.

On Wed, Sep 30, 2015 at 10:49 PM, Muhammed Uluyol wrote:
> Hello,
>
> How feasible would it be to have Spark speculatively increase the number of
> partitions when there is spare capacity in the system? We want to do this to
> decrease application runtime. Initially, we will assume that function calls
> of the same type have the same runtime (e.g. all maps take equal time) and
> that the runtime scales linearly with the number of workers. If a
> numPartitions value is specified, we may increase beyond it, but if a
> Partitioner is specified, we would not change the number of partitions.
>
> Some initial questions we had:
> * Does Spark already do this?
> * Is there interest in supporting this functionality?
> * Are there any potential issues that we should be aware of?
> * What changes would need to be made for such a project?
>
> Thanks,
> Muhammed
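For reference, a sketch of turning on dynamic allocation; the configuration keys are the standard ones, the min/max values are arbitrary examples, and the external shuffle service must be running on the workers:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("dynamic-allocation-example")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")      // required by dynamic allocation
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "50")
    val sc = new SparkContext(conf)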
Spark 1.6 Release window is not updated in Spark-wiki
Hi,

On https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage the current release window has not been updated since 1.5. Can anybody give an idea of the expected dates for the 1.6 release?

Regards,
Meethu Mathew
Senior Engineer
Flytxt