CQs on WindowedStream created on running StreamingContext

2015-09-30 Thread Yogs
Hi,

We intend to run ad-hoc windowed continuous queries on Spark Streaming data.
The queries could be registered/deregistered dynamically or submitted
through a command line. Currently Spark Streaming doesn't allow adding any
new inputs, transformations, or output operations after a StreamingContext
has been started. But making the following code changes in DStream.scala
allows me to create a window on a DStream even after the StreamingContext
has started (i.e. is in StreamingContextState.ACTIVE):

1) In DStream.validateAtInit() (sketched below)
Allow adding new inputs, transformations, and output operations after
starting a streaming context.
2) In DStream.persist()
Allow changing the storage level of a DStream after the streaming context has
started.
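For reference, this is roughly what the check being relaxed in (1) looks like
(paraphrased from the Spark 1.5 sources, not verbatim):

private def validateAtInit(): Unit = {
  ssc.getState() match {
    case StreamingContextState.INITIALIZED =>
      // context not started yet: adding operations is allowed
    case StreamingContextState.ACTIVE =>
      throw new IllegalStateException(
        "Adding new inputs, transformations, and output operations after " +
        "starting a context is not supported")
    case StreamingContextState.STOPPED =>
      throw new IllegalStateException(
        "Adding new inputs, transformations, and output operations after " +
        "stopping a context is not supported")
  }
}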

Ultimately the window API just does a slice on the parent DStream and returns
allRDDsInWindow. We create DataFrames out of the RDDs of this particular
WindowedDStream and evaluate queries on those DataFrames.
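For concreteness, a rough sketch of the intended usage (simplified; ssc is an
already-started StreamingContext, lines is an existing DStream[String], and
names such as Event and the "events" table are placeholders):

import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Seconds

case class Event(value: String)

// window() and foreachRDD() on an ACTIVE context only work with the
// validateAtInit() change described above.
val windowed = lines.window(Seconds(30), Seconds(10))

windowed.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  // Turn the RDDs of the window into a DataFrame and evaluate a query on it.
  val df = rdd.map(Event(_)).toDF()
  df.registerTempTable("events")
  sqlContext.sql("SELECT count(*) FROM events").show()
}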

1) Do you see any challenges or consequences with this approach?
2) Will these on-the-fly created WindowedDStreams be accounted for properly in
runtime and memory management?
3) What is the reason we do not allow creating new windows while in the
StreamingContextState.ACTIVE state?
4) Does it make sense to add our own implementation of WindowedDStream in
this case?

- Yogesh


Task Execution

2015-09-30 Thread gsvic
Concerning task execution: does a worker execute its assigned tasks in parallel
or sequentially?






Re: failed to run spark sample on windows

2015-09-30 Thread Renyi Xiong
Thanks a lot, it works now after I set %HADOOP_HOME%.

On Tue, Sep 29, 2015 at 1:22 PM, saurfang  wrote:

> See
>
> http://stackoverflow.com/questions/26516865/is-it-possible-to-run-hadoop-jobs-like-the-wordcount-sample-in-the-local-mode
> ,
> https://issues.apache.org/jira/browse/SPARK-6961 and finally
> https://issues.apache.org/jira/browse/HADOOP-10775. The easy solution is to
> download a Windows Hadoop distribution and point %HADOOP_HOME% to that
> location so winutils.exe can be picked up.
>
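For anyone hitting the same issue, a rough code-level equivalent of the
workaround (the C:\hadoop path below is just an example; winutils.exe must sit
under its bin\ directory):

// Alternative to exporting %HADOOP_HOME%: set the hadoop.home.dir system
// property before the SparkContext is created, so Hadoop's shell utilities
// can locate bin\winutils.exe.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val conf = new org.apache.spark.SparkConf().setAppName("sample").setMaster("local[*]")
val sc = new org.apache.spark.SparkContext(conf)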


Re: unsubscribe

2015-09-30 Thread Richard Hillegas

Hi Sukesh,

To unsubscribe from the dev list, please send a message to
dev-unsubscr...@spark.apache.org. To unsubscribe from the user list, please
send a message to user-unsubscr...@spark.apache.org. Please see:
http://spark.apache.org/community.html#mailing-lists.

Thanks,
-Rick

sukesh kumar  wrote on 09/28/2015 11:39:01 PM:

> From: sukesh kumar 
> To: "u...@spark.apache.org" ,
> "dev@spark.apache.org" 
> Date: 09/28/2015 11:39 PM
> Subject: unsubscribe
>
> unsubscribe
>
> --
> Thanks & Best Regards
> Sukesh Kumar

Re: Hive permanent functions are not available in Spark SQL

2015-09-30 Thread Pala M Muthaia
+user list

On Tue, Sep 29, 2015 at 3:43 PM, Pala M Muthaia  wrote:

> Hi,
>
> I am trying to use internal UDFs that we have added as permanent functions
> to Hive from within a Spark SQL query (using HiveContext), but I encounter
> NoSuchObjectException, i.e. the function could not be found.
>
> However, if I execute the 'show functions' command in Spark SQL, the permanent
> functions appear in the list.
>
> I am using Spark 1.4.1 with Hive 0.13.1. I tried to debug this by looking
> at the log and code, and it seems the 'show functions' command and the UDF
> query go through essentially the same code path, yet the former can see the
> UDF while the latter can't.
>
> Any ideas on how to debug/fix this?
>
>
> Thanks,
> pala
>
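For concreteness, a minimal sketch of the two behaviors being compared (the
function and table names here are placeholders, not the actual internal UDFs):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("hive-udf-check"))
val hiveContext = new HiveContext(sc)

// The permanent function shows up here...
hiveContext.sql("SHOW FUNCTIONS").collect().foreach(println)

// ...but invoking it from a query fails with NoSuchObjectException.
hiveContext.sql("SELECT my_permanent_udf(col1) FROM some_table").show()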


GraphX PageRank keeps 3 copies of graph in memory

2015-09-30 Thread Ulanov, Alexander
Dear Spark developers,

I would like to understand GraphX caching behavior with regard to PageRank in
Spark, in particular the following implementation of PageRank:
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

On each iteration a new graph is created and cached, and the old graph is
uncached:
1) Create the new graph and cache it:
rankGraph = rankGraph.joinVertices(rankUpdates) {
  (id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum
}.cache()
2) Unpersist the old one:
prevRankGraph.vertices.unpersist(false)
prevRankGraph.edges.unpersist(false)

According to the code, at the end of each iteration only one graph should be in
memory, i.e. one EdgeRDD and one VertexRDD. During the iteration, exactly
between the mentioned lines of code, there will be two graphs, old and new,
i.e. two pairs of Edge and Vertex RDDs. However, when I run the example provided
in the Spark examples folder, I observe different behavior.

Run the example (I checked that it runs the mentioned code):
$SPARK_HOME/bin/spark-submit --class 
"org.apache.spark.examples.graphx.SynthBenchmark"  --master 
spark://mynode.net:7077 $SPARK_HOME/examples/target/spark-examples.jar

According to "Storage" and RDD DAG in Spark UI, 3 VertexRDDs and 3 EdgeRDDs are 
cached, even when all iterations are finished, given that the mentioned code 
suggests caching at most 2 (and only in particular stage of the iteration):
https://drive.google.com/file/d/0BzYMzvDiCep5WFpnQjFzNy0zYlU/view?usp=sharing
Edges (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S2JtYnhVTlV1Sms/view?usp=sharing
Vertices (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S1k4N2NFb05RZDA/view?usp=sharing
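To cross-check what the UI reports, the cached RDDs can also be listed from the
driver (a quick sketch; sc is the SparkContext used by the benchmark):

// Print every RDD currently marked as persistent, with its storage level.
sc.getPersistentRDDs.values.foreach { rdd =>
  println(s"id=${rdd.id} name=${rdd.name} storage=${rdd.getStorageLevel}")
}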

Could you explain why 3 VertexRDDs and 3 EdgeRDDs are cached?

Is it OK that there is double caching in the code, given that joinVertices
implicitly caches the vertices and then the graph is cached again in the PageRank code?

Best regards, Alexander


Speculatively using spare capacity

2015-09-30 Thread Muhammed Uluyol
Hello,

How feasible would it be to have Spark speculatively increase the number of
partitions when there is spare capacity in the system? We want to do this
to decrease application runtime. Initially, we will assume that function
calls of the same type have the same runtime (e.g. all maps take equal
time) and that the runtime scales linearly with the number of workers. If
a numPartitions value is specified, we may increase beyond it, but if a
Partitioner is specified, we would not change the number of partitions.

Some initial questions we had:
 * Does spark already do this?
 * Is there interest in supporting this functionality?
 * Are there any potential issues that we should be aware of?
 * What changes would need to be made for such a project?

Thanks,
Muhammed


Re: Speculatively using spare capacity

2015-09-30 Thread Sean Owen
Why change the number of partitions of RDDs, especially since you
can't generally do that without a shuffle? If you just mean to ramp
resource usage up and down, dynamic allocation (of executors) already
does that.
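For reference, a minimal sketch of turning on dynamic allocation (property
names as in the Spark 1.5 configuration docs; the values are illustrative, and
the external shuffle service must be running on the workers):

// Typically these would go into spark-defaults.conf or be passed via
// --conf to spark-submit; shown on a SparkConf here for illustration.
val conf = new org.apache.spark.SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")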

On Wed, Sep 30, 2015 at 10:49 PM, Muhammed Uluyol  wrote:
> Hello,
>
> How feasible would it be to have Spark speculatively increase the number of
> partitions when there is spare capacity in the system? We want to do this to
> decrease application runtime. Initially, we will assume that function calls
> of the same type have the same runtime (e.g. all maps take equal time) and
> that the runtime scales linearly with the number of workers. If a
> numPartitions value is specified, we may increase beyond it, but if a
> Partitioner is specified, we would not change the number of partitions.
>
> Some initial questions we had:
>  * Does spark already do this?
>  * Is there interest in supporting this functionality?
>  * Are there any potential issues that we should be aware of?
>  * What changes would need to be made for such a project?
>
> Thanks,
> Muhammed




Spark 1.6 Release window is not updated in Spark-wiki

2015-09-30 Thread Meethu Mathew
Hi,
On the wiki homepage, https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage, the
current release window has not been updated since 1.5. Can anybody give an
idea of the expected dates for the 1.6 release?

Regards,

Meethu Mathew
Senior Engineer
Flytxt