Kafka stream offset management question

2016-11-08 Thread Haopu Wang
I'm using Kafka direct stream (auto.offset.reset = earliest) and enable Spark Streaming's checkpoint. The application starts and consumes messages correctly. Then I stop the application and clean the checkpoint folder. I restart the application and expect it to consume old messages. But it
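A minimal sketch of the setup being described, using the Spark 2.0 / spark-streaming-kafka-0-10 direct stream API (broker address, group id, and topic name are placeholders). One detail worth noting: "auto.offset.reset = earliest" only applies when the consumer group has no committed offsets in Kafka, so cleaning the checkpoint alone does not guarantee a rewind to the old messages.
==
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

val conf = new SparkConf().setAppName("OffsetDemo").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",            // hypothetical broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "demo-group",                // hypothetical group id
  "auto.offset.reset"  -> "earliest",                  // used only if the group has no committed offsets
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Array("mytopic"), kafkaParams))

stream.map(r => (r.key, r.value)).print()
ssc.start()
ssc.awaitTermination()
==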

RE: InvalidClassException when loading KafkaDirectStream from checkpoint (Spark 2.0.0)

2016-11-08 Thread Haopu Wang
It turns out to be a bug in application code. Thank you! From: Haopu Wang Sent: November 4, 2016 17:23 To: user@spark.apache.org; Cody Koeninger Subject: InvalidClassException when loading KafkaDirectStream from checkpoint (Spark 2.0.0) When I load the Spark

RE: expected behavior of Kafka dynamic topic subscription

2016-11-06 Thread Haopu Wang
Cody, thanks for the response. Do you think it's a Spark issue or a Kafka issue? Can you please let me know the JIRA ticket number? -Original Message- From: Cody Koeninger [mailto:c...@koeninger.org] Sent: November 4, 2016 22:35 To: Haopu Wang Cc: user@spark.apache.org Subject: Re: exp

InvalidClassException when loading KafkaDirectStream from checkpoint (Spark 2.0.0)

2016-11-04 Thread Haopu Wang
When I load the Spark checkpoint, I get the error below. Do you have any idea? Much thanks! * 2016-11-04 17:12:19,582 INFO [org.apache.spark.streaming.CheckpointReader] (main;) Checkpoint files found: file:/d:/temp/checkpoint/checkpoint-147825070,file:/d:/temp/checkpoi

expected behavior of Kafka dynamic topic subscription

2016-11-03 Thread Haopu Wang
I'm using the Kafka010 integration API to create a DStream using the SubscribePattern ConsumerStrategy. The specified topic doesn't exist when I start the application. Then I create the topic and publish some test messages. I can see them in the console subscriber. But the Spark application doesn't see
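A sketch of the pattern-based subscription in question (the topic pattern is a placeholder; ssc and kafkaParams are as in the first entry above). With SubscribePattern, topics created after start are only picked up when the underlying consumer refreshes its metadata, which Kafka controls with "metadata.max.age.ms" (5 minutes by default), so a newly created topic is not seen immediately.
==
import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010._

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.SubscribePattern[String, String](
    Pattern.compile("mytopic-.*"),   // hypothetical topic pattern
    kafkaParams))
==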

RE: Kafka integration: get existing Kafka messages?

2016-10-14 Thread Haopu Wang
mytopic2 0 2 after polling for 512 == -Original Message- From: Cody Koeninger [mailto:c...@koeninger.org] Sent: October 13, 2016 9:31 To: Haopu Wang Cc: user@spark.apache.org Subject: Re: Kafka integration: get existing Kafka messages? Look at the presentation and blog post linked from https://github.com

RE: Kafka integration: get existing Kafka messages?

2016-10-12 Thread Haopu Wang
Cody, thanks for the response. So the Kafka direct stream actually has consumers on both the driver and the executors? Can you please provide more details? Thank you very much! From: Cody Koeninger [mailto:c...@koeninger.org] Sent: October 12, 2016 20:10 To: Haopu Wang

Kafka integration: get existing Kafka messages?

2016-10-12 Thread Haopu Wang
Hi, I want to read the existing Kafka messages and then subscribe to new stream messages. But I find the "auto.offset.reset" property is always set to "none" in KafkaUtils. Does that mean I cannot specify the "earliest" property value when creating a direct stream? Thank you!

Can Spark Streaming 2.0 work with Kafka 0.10?

2016-09-26 Thread Haopu Wang
Hi, the official integration guide says "Spark Streaming 2.0.0 is compatible with Kafka 0.8.2.1." However, in the maven repository, I can get "spark-streaming-kafka-0-10_2.11", which depends on Kafka 0.10.0.0. Is this artifact stable enough? Thank you!
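For reference, a minimal sbt dependency for the artifact mentioned (assuming Scala 2.11, matching the _2.11 suffix, and Spark 2.0.0):
==
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
==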

[Spark 1.6.1] Beeline cannot start on Windows 7

2016-06-27 Thread Haopu Wang
I see the stack trace below when trying to run the beeline command. I'm using JDK 7. Anything wrong? Much thanks! == D:\spark\download\spark-1.6.1-bin-hadoop2.4>bin\beeline Beeline version 1.6.1 by Apache Hive Exception in thread "main" java.lang.NoSuchMethodError: org.fusesource.jansi.interna

RE: Can I control the execution of Spark jobs?

2016-06-16 Thread Haopu Wang
ese jobs" be? Pozdrawiam, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark http://bit.ly/mastering-apache-spark Follow me at https://twitter.com/jaceklaskowski On Thu, Jun 16, 2016 at 11:36 AM, Haopu Wang wrote: > Hi, > > > > Suppose I have a spark

Can I control the execution of Spark jobs?

2016-06-16 Thread Haopu Wang
Hi, Suppose I have a Spark application which does several ETL-type operations. I understand Spark can analyze them and generate several jobs to execute. The question is: is it possible to control the dependency between these jobs? Thanks!
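As background to the question, a small sketch (paths hypothetical) of how job ordering falls out of the driver program: each action submits one job, and on a single driver thread jobs run strictly in program order, so later jobs implicitly depend on earlier ones.
==
val cleaned = sc.textFile("hdfs:///input").filter(_.nonEmpty).cache()
cleaned.count()                          // action 1 -> job 1 runs to completion first
cleaned.saveAsTextFile("hdfs:///clean")  // action 2 -> job 2 starts afterwards
// Independent jobs can overlap only when actions are submitted from separate
// threads (optionally together with spark.scheduler.mode=FAIR).
==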

RE: Should I avoid "state" in an Spark application?

2016-06-12 Thread Haopu Wang
Can someone look at my questions? Thanks again! From: Haopu Wang Sent: June 12, 2016 16:40 To: user@spark.apache.org Subject: Should I avoid "state" in a Spark application? I have a Spark application whose structure is below: var ts:

Should I avoid "state" in an Spark application?

2016-06-12 Thread Haopu Wang
I have a Spark application whose structure is below: var ts: Long = 0L dstream1.foreachRDD{ (x, time) => { ts = time x.do_something()... } } .. process_data(dstream2, ts, ..) I assume the foreachRDD function call can up
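A sketch of the structure being asked about (dstream1 is hypothetical). The function passed to foreachRDD executes on the driver once per batch, so assigning to a driver-side var there is visible to later driver-side code; closures shipped to executors only ever see a serialized copy of the var.
==
var ts: Long = 0L
dstream1.foreachRDD { (rdd, time) =>
  ts = time.milliseconds        // runs on the driver, once per batch
  val n = rdd.count()           // the distributed work happens inside the action
  println(s"batch at $ts had $n records")
}
==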

How to get the list of running applications and Cores/Memory in use?

2015-12-06 Thread Haopu Wang
Hi, I have a Spark 1.5.2 standalone cluster running. I want to get all of the running applications and the Cores/Memory in use. Besides the Master UI, are there any other ways to do that? I tried to send an HTTP request using a URL like this: "http://node1:6066/v1/applications" The respo
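One option worth checking (hedged, since ports vary by deployment): the standalone Master's web UI serves its status page, including running applications and cores/memory in use, as JSON at the /json path of the UI port (8080 by default); 6066 is the REST submission gateway rather than a status API. A quick check from Scala:
==
val status = scala.io.Source.fromURL("http://node1:8080/json/").mkString
println(status)
==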

RE: RE: Spark or Storm

2015-06-19 Thread Haopu Wang
My question is not directly related: about the "exactly-once semantics", the document (copied below) says Spark Streaming gives exactly-once semantics, but from my test results, with checkpointing enabled, the application always re-processes the files of the last batch after a graceful restart. =

RE: If the StreamingContext is not stopped gracefully, will the checkpoint data be consistent?

2015-06-18 Thread Haopu Wang
it? Much thanks! From: Akhil Das [mailto:ak...@sigmoidanalytics.com] Sent: Tuesday, June 16, 2015 3:26 PM To: Haopu Wang Cc: user Subject: Re: If the StreamingContext is not stopped gracefully, will the checkpoint data be consistent? Good question, with fileStream

RE: [SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-17 Thread Haopu Wang
Can someone help? Thank you! From: Haopu Wang Sent: Monday, June 15, 2015 3:36 PM To: user; d...@spark.apache.org Subject: [SparkStreaming] NPE in DStreamCheckPointData.scala:125 I use the attached program to test checkpoint. It's quite simple. When

RE: If the StreamingContext is not stopped gracefully, will the checkpoint data be consistent?

2015-06-15 Thread Haopu Wang
15, 2015 3:48 PM To: Haopu Wang Cc: user Subject: Re: If the StreamingContext is not stopped gracefully, will the checkpoint data be consistent? I think it should be fine, that's the whole point of check-pointing (in case of driver failure etc). Thanks Best Regards On Mon, Jun 15, 2015 at 6:

[SparkStreaming] NPE in DStreamCheckPointData.scala:125

2015-06-15 Thread Haopu Wang
I use the attached program to test checkpoint. It's quite simple. When I run the program a second time, it loads the checkpoint data as expected; however, I see an NPE in the driver log. Do you have any idea about the issue? I'm on Spark 1.4.0, thank you very much! == logs == 15/

RE: If the StreamingContext is not stopped gracefully, will the checkpoint data be consistent?

2015-06-14 Thread Haopu Wang
Hi, can someone help confirm the behavior? Thank you! -Original Message- From: Haopu Wang Sent: Friday, June 12, 2015 4:57 PM To: user Subject: If the StreamingContext is not stopped gracefully, will the checkpoint data be consistent? This is a quick question about checkpointing. The question is: if

If the StreamingContext is not stopped gracefully, will the checkpoint data be consistent?

2015-06-12 Thread Haopu Wang
This is a quick question about checkpointing. The question is: if the StreamingContext is not stopped gracefully, will the checkpoint be consistent? Or should I always shut down the application gracefully in order to use the checkpoint? Thank you very much!
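For reference, the graceful form of the shutdown being discussed, which waits for already-received data to be processed before stopping (ssc is the StreamingContext):
==
ssc.stop(stopSparkContext = true, stopGracefully = true)
==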

RE: [SparkStreaming 1.3.0] Broadcast failure after setting "spark.cleaner.ttl"

2015-06-09 Thread Haopu Wang
deleted somehow. -Original Message- From: Shao, Saisai [mailto:saisai.s...@intel.com] Sent: Tuesday, June 09, 2015 4:33 PM To: Haopu Wang; user Subject: RE: [SparkStreaming 1.3.0] Broadcast failure after setting "spark.cleaner.ttl" From the stack I think this problem m

[SparkStreaming 1.3.0] Broadcast failure after setting "spark.cleaner.ttl"

2015-06-09 Thread Haopu Wang
When I ran a Spark Streaming application for longer, I noticed the local directory's size kept increasing. I set "spark.cleaner.ttl" to 1800 seconds in order to clean the metadata. The Spark Streaming batch duration is 10 seconds and the checkpoint duration is 10 minutes. The setting took effect but af

RE: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-09 Thread Haopu Wang
Cheng, yes, it works; I set the property in SparkConf before creating the SparkContext. The property name is "spark.hadoop.dfs.replication". Thanks for the help! -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Monday, June 08, 2015 6:41 PM To: Haopu
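A minimal sketch of the working configuration described above (the app name and the replication value of 2 are illustrative):
==
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ParquetWriter")
  .set("spark.hadoop.dfs.replication", "2")  // spark.hadoop.* is forwarded into the Hadoop Configuration
val sc = new SparkContext(conf)
==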

RE: SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-07 Thread Haopu Wang
al Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Sunday, June 07, 2015 10:17 PM To: Haopu Wang; user Subject: Re: SparkSQL: How to specify replication factor on the persisted parquet files? Were you using HiveContext.setConf()? "dfs.replication" is a Hadoop configuration, b

SparkSQL: How to specify replication factor on the persisted parquet files?

2015-06-01 Thread Haopu Wang
Hi, I'm trying to save a SparkSQL DataFrame to a persistent Hive table using the default parquet data source. I don't know how to change the replication factor of the generated parquet files on HDFS. I tried to set "dfs.replication" on HiveContext but that didn't work. Any suggestions are apprecia

Spark 1.3.0: how to let Spark history load old records?

2015-06-01 Thread Haopu Wang
When I start the Spark master process, the old records are not shown in the monitoring UI. How can I show the old records? Thank you very much!

RE: [SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-17 Thread Haopu Wang
secc.com] Sent: Monday, May 18, 2015 12:39 AM To: 'Akhil Das'; Haopu Wang Cc: 'user' Subject: RE: [SparkStreaming] Is it possible to delay the start of some DStream in the application? You can make ANY standard receiver sleep by implementing a custom Message Deserialize

RE: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-05-14 Thread Haopu Wang
Hi TD, regarding the performance of updateStateByKey, do you have a JIRA for that so we can watch it? Thank you! From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 8:09 AM To: Krzysztof Zarzycki Cc: user Subject: Re: Is it feas

RE: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-14 Thread Haopu Wang
Thank you, should I open a JIRA for this issue? From: Olivier Girardot [mailto:ssab...@gmail.com] Sent: Tuesday, May 12, 2015 5:12 AM To: Reynold Xin Cc: Haopu Wang; user Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable? I'll look in

[SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-14 Thread Haopu Wang
In my application, I want to start a DStream computation only after a special event has happened (for example, I want to start the receiver only after the reference data has been properly initialized). My question is: it looks like the DStream will be started right after the StreamingContext has b

[SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-11 Thread Haopu Wang
I try to get the result schema of aggregate functions using the DataFrame API. However, I find that the result fields of groupBy columns are always nullable, even when the source field is not nullable. I want to know if this is by design, thank you! Below is the simple code to show the issue. == import s
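A sketch of a repro along the lines described (column names hypothetical; sqlContext assumed): a primitive-typed grouping column starts out non-nullable but comes back nullable after the groupBy.
==
val df = sqlContext.createDataFrame(Seq((1, 2.0), (1, 3.0))).toDF("k", "v")
println(df.schema("k").nullable)                        // false: Int is a primitive
println(df.groupBy("k").sum("v").schema("k").nullable)  // observed true, hence the question
==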

RE: [SparkSQL] cannot filter by a DateType column

2015-05-10 Thread Haopu Wang
, 2015 2:41 AM To: Haopu Wang Cc: user; d...@spark.apache.org Subject: Re: [SparkSQL] cannot filter by a DateType column What version of Spark are you using? It appears that at least in master we are doing the conversion correctly, but it's possible older versions of applySchema do not. If you can

[SparkSQL] cannot filter by a DateType column

2015-05-08 Thread Haopu Wang
I want to filter a DataFrame based on a Date column. If the DataFrame object is constructed from a scala case class, it works (comparing either as String or as Date). But if the DataFrame is generated by specifying a Schema to an RDD, it doesn't work. Below is the exception and test code. D

RE: Spark does not delete temporary directories

2015-05-07 Thread Haopu Wang
I think the temporary folders are used to store blocks and shuffles. That doesn't depend on the cluster manager. Ideally they should be removed after the application has been terminated. Can you check if there are contents under those folders? From: Taeyun Ki

RE: Is SQLContext thread-safe?

2015-04-30 Thread Haopu Wang
Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 9:05 PM To: Haopu Wang; user Subject: RE: Is SQLContext thread-safe? Yes it is thread safe, at least it's supposed to be. -Original Message----- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Monday, March 2, 2015 4

RE: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Michael, thanks for the response; looking forward to trying 1.3.1 From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, April 03, 2015 6:52 AM To: Haopu Wang Cc: user Subject: Re: [SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)

[SparkSQL 1.3.0] Cannot resolve column name "SUM('p.q)" among (k, SUM('p.q));

2015-04-02 Thread Haopu Wang
Hi, I want to rename an aggregation field using the DataFrame API. The aggregation is done on a nested field. But I got the exception below. Do you see the same issue, and is there any workaround? Thank you very much! == Exception in thread "main" org.apache.spark.sql.AnalysisException: Cannot resolve colu
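For context, a sketch of the intended rename (df, the grouping column k, and the nested field p.q are hypothetical); per the reply above, the resolution issue was expected to be addressed in 1.3.1.
==
import org.apache.spark.sql.functions.sum

// Aggregate over a nested field and give the result an explicit name:
val result = df.groupBy("k").agg(sum("p.q").as("sum_q"))
==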

RE: Can I call aggregate UDF in DataFrame?

2015-04-01 Thread Haopu Wang
Great! Thank you! From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, April 02, 2015 8:11 AM To: Haopu Wang Cc: user; d...@spark.apache.org Subject: Re: Can I call aggregate UDF in DataFrame? You totally can. https://github.com/apache/spark

Can I call aggregate UDF in DataFrame?

2015-03-26 Thread Haopu Wang
Specifically, there are only 5 aggregate functions in class org.apache.spark.sql.GroupedData: sum/max/min/mean/count. Can I plug in a function to calculate stddev? Thank you!
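Before a built-in stddev existed, one workaround was to derive it from avg (a sketch; df and its "v" column are hypothetical, and this is the population form of the standard deviation):
==
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.avg

val stats = df.agg(avg(df("v")).as("mean"), avg(df("v") * df("v")).as("meanSq"))
val Row(mean: Double, meanSq: Double) = stats.first()
val stddev = math.sqrt(meanSq - mean * mean)   // sqrt(E[v^2] - E[v]^2)
==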

[SparkSQL] How to calculate stddev on a DataFrame?

2015-03-25 Thread Haopu Wang
Hi, I have a DataFrame object and I want to do various types of aggregations like count, sum, variance, stddev, etc. DataFrame has a DSL to do simple aggregations like count and sum. How about variance and stddev? Thank you for any suggestions!

RE: [SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
SELECT key,value FROM src") scala> output.saveAsTable("outputtable") From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Wednesday, March 11, 2015 8:25 AM To: Haopu Wang; user; d...@spark.apache.org Subject: RE: [SparkSQL] Reuse HiveContext to diffe

[SparkSQL] Reuse HiveContext to different Hive warehouse?

2015-03-10 Thread Haopu Wang
I'm using Spark 1.3.0 RC3 built with Hive support. In the Spark Shell, I want to reuse the HiveContext instance with different warehouse locations. Below are the steps for my test (assume I have loaded a file into table "src"). == 15/03/10 18:22:59 INFO SparkILoop: Created sql context (with

RE: Where can I find more information about the R interface for Spark?

2015-03-04 Thread Haopu Wang
Thanks, it's an active project. Will it be released with Spark 1.3.0? From: 鹰 [mailto:980548...@qq.com] Sent: Thursday, March 05, 2015 11:19 AM To: Haopu Wang; user Subject: Re: Where can I find more information about the R interface for Spark? yo

Spark Streaming and SchemaRDD usage

2015-03-04 Thread Haopu Wang
Hi, in the roadmap of Spark in 2015 (link: http://files.meetup.com/3138542/Spark%20in%202015%20Talk%20-%20Wendell.pptx), I saw SchemaRDD is designed to be the basis of BOTH Spark Streaming and Spark SQL. My question is: what's the typical usage of SchemaRDD in a Spark Streaming application? Thank
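A common pattern at the time, sketched with hypothetical names (dstream is a DStream[String] of CSV lines; sqlContext assumed): convert each micro-batch into a SchemaRDD/DataFrame inside foreachRDD, register it as a temp table, and query it with SQL.
==
case class Event(id: String, value: Double)

dstream.foreachRDD { rdd =>
  val events = sqlContext.createDataFrame(rdd.map { line =>
    val parts = line.split(",")
    Event(parts(0), parts(1).toDouble)
  })
  events.registerTempTable("events")
  sqlContext.sql("SELECT id, avg(value) FROM events GROUP BY id").show()
}
==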

RE: Is SQLContext thread-safe?

2015-03-02 Thread Haopu Wang
Hao, thank you so much for the reply! Do you already have some JIRA for the discussion? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Tuesday, March 03, 2015 8:23 AM To: Haopu Wang; user Subject: RE: Is SQLContext thread-safe? Currently, each SQLContext has its

RE: Is SQLContext thread-safe?

2015-03-02 Thread Haopu Wang
Thanks for the response. Then I have another question: when would we want to create multiple SQLContext instances from the same SparkContext? What's the benefit? -Original Message- From: Cheng, Hao [mailto:hao.ch...@intel.com] Sent: Monday, March 02, 2015 9:05 PM To: Haopu Wang;

Is SQLContext thread-safe?

2015-03-02 Thread Haopu Wang
Hi, is it safe to use the same SQLContext to do Select operations in different threads at the same time? Thank you very much!

Spark Streaming and SQL checkpoint error: (java.io.NotSerializableException: org.apache.hadoop.hive.conf.HiveConf)

2015-02-16 Thread Haopu Wang
I have a streaming application which registers a temp table on a HiveContext for each batch duration. The application runs well in Spark 1.1.0. But I get the error below from 1.1.1. Do you have any suggestions to resolve it? Thank you! java.io.NotSerializableException: org.apache.hadoop.hive.conf.

Do you know any Spark modeling tool?

2014-12-25 Thread Haopu Wang
Hi, I think a modeling tool may be helpful because sometimes it's hard/tricky to program Spark. I don't know if there is already such a tool. Thanks!

RE: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1?

2014-12-19 Thread Haopu Wang
(MissingRequirementError.scala:17) …… From: Sean Owen [mailto:so...@cloudera.com] Sent: Saturday, December 20, 2014 8:12 AM To: Haopu Wang Cc: user@spark.apache.org; Raghavendra Pandey Subject: RE: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1? That's exactl

RE: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1?

2014-12-19 Thread Haopu Wang
issue? Thanks for any suggestions. From: Raghavendra Pandey [mailto:raghavendra.pan...@gmail.com] Sent: Saturday, December 20, 2014 12:08 AM To: Sean Owen; Haopu Wang Cc: user@spark.apache.org Subject: Re: Can Spark 1.1.0 save checkpoint to HDFS 2.5.1? It

Can Spark 1.1.0 save checkpoint to HDFS 2.5.1?

2014-12-19 Thread Haopu Wang
I'm using Spark 1.1.0 built for HDFS 2.4. My application enables checkpointing (to HDFS 2.5.1) and it compiles. But when I run it, I get the error below: Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4 at org.apa
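"Server IPC version 9 cannot communicate with client version 4" generally means a Hadoop 1.x client is talking to a Hadoop 2.x NameNode, so the classpath is likely pulling in an old hadoop-client despite the 2.4 build. A hedged sbt sketch of pinning the client version explicitly (2.5.1 matching the cluster):
==
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.5.1"
==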

RE: About "Memory usage" in the Spark UI

2014-10-23 Thread Haopu Wang
gmail.com] Sent: October 24, 2014 8:07 To: Haopu Wang Cc: Patrick Wendell; user Subject: Re: About "Memory usage" in the Spark UI The memory usage of blocks of data received through Spark Streaming is not reflected in the Spark UI. It only shows the memory usage due to cached RDDs. I didn't

RE: About "Memory usage" in the Spark UI

2014-10-23 Thread Haopu Wang
store RDD blocks? Thanks again for the answer! From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: October 23, 2014 14:00 To: Haopu Wang Cc: user Subject: Re: About "Memory usage" in the Spark UI It shows the amount of memory used to store RDD blocks, w

About "Memory usage" in the Spark UI

2014-10-22 Thread Haopu Wang
Hi, please take a look at the attached screenshot. I wonder what the "Memory Used" column means. I give 2GB of memory to the driver process and 12GB of memory to the executor process. Thank you!

Spark's shuffle file size keeps increasing

2014-10-15 Thread Haopu Wang
I have a Spark application which is running Spark Streaming and Spark SQL. I observed that the size of the shuffle files under the "spark.local.dir" folder keeps increasing and never decreases. Eventually it runs into an out-of-disk-space error. The question is: when will Spark delete these shuffle files? In the ap

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-07 Thread Haopu Wang
Liquan, yes, for a full outer join, hash tables on both sides are more efficient. For a left/right outer join, it looks like one hash table should be enough. From: Liquan Pei [mailto:liquan...@gmail.com] Sent: September 30, 2014 18:34 To: Haopu Wang Cc: d

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
Thanks again! From: Liquan Pei [mailto:liquan...@gmail.com] Sent: September 30, 2014 12:31 To: Haopu Wang Cc: d...@spark.apache.org; user Subject: Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin? Hi Haopu, My understanding is that the hashtable on both left

Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-09-29 Thread Haopu Wang
I took a look at HashOuterJoin and it builds a hash table for both sides. This consumes quite a lot of memory when the partition is big. And it doesn't reduce the iteration on the streamed relation, right? Thanks!

Spark SQL question: how to control the storage level of cached SchemaRDD?

2014-09-28 Thread Haopu Wang
From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: September 26, 2014 21:24 To: Haopu Wang; user@spark.apache.org Subject: Re: Spark SQL question: is cached SchemaRDD storage controlled by "spark.storage.memoryFraction"? Yes it is. The in-memory storage used with SchemaRDD also uses RDD.cache() un

Spark SQL question: is cached SchemaRDD storage controlled by "spark.storage.memoryFraction"?

2014-09-26 Thread Haopu Wang
Hi, I'm querying a big table using Spark SQL. I see very long GC times in some stages. I wonder if I can improve this by tuning the storage parameter. The question is: the schemaRDD has been cached with the "cacheTable()" function, so is the cached schemaRDD part of the memory storage controlled by the "spar
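Per the reply above, data cached via cacheTable() goes through the same block-manager storage pool as RDD.cache(), which in the 1.x memory model is sized by spark.storage.memoryFraction (0.6 by default). A hedged example of shrinking it to leave more headroom for execution (the value is illustrative):
==
val conf = new SparkConf().set("spark.storage.memoryFraction", "0.4")  // example value; default is 0.6
==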

Spark SQL 1.1.0: NPE when joining two cached tables

2014-09-22 Thread Haopu Wang
I have two data sets and want to join them on their first fields. Sample data are below: data set 1: id2,name1,2,300.0 data set 2: id1, The code is something like below: val sparkConf = new SparkConf().setAppName("JoinInScala") val sc = new SparkContext(spar

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
Got it, thank you, Denny! From: Denny Lee [mailto:denny.g@gmail.com] Sent: Friday, September 12, 2014 11:04 AM To: user@spark.apache.org; Haopu Wang; d...@spark.apache.org; Patrick Wendell Subject: RE: Announcing Spark 1.1.0! Yes, at least for my query

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
.” Did you try to read a hadoop 2.5.0 file using Spark 1.1 with hadoop 2.4? Thanks! From: Denny Lee [mailto:denny.g@gmail.com] Sent: Friday, September 12, 2014 10:00 AM To: Patrick Wendell; Haopu Wang; d...@spark.apache.org; user@spark.apache.org

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
. From: Denny Lee [mailto:denny.g@gmail.com] Sent: Friday, September 12, 2014 9:35 AM To: user@spark.apache.org; Haopu Wang; d...@spark.apache.org; Patrick Wendell Subject: RE: Announcing Spark 1.1.0! I’m not sure if I’m completely answering your question here but I’m currently working (on

RE: Announcing Spark 1.1.0!

2014-09-11 Thread Haopu Wang
I see the binary packages include hadoop 1, 2.3 and 2.4. Does Spark 1.1.0 support hadoop 2.5.0 (release announcement at the address below)? http://hadoop.apache.org/releases.html#11+August%2C+2014%3A+Release+2.5.0+available -Original Message- From: Patrick Wendell [mailto:pwend...@gmail.com] Sent: Friday, Septembe

RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"

2014-07-23 Thread Haopu Wang
Haopu, Please see the inline comments. Thanks Jerry -----Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Wednesday, July 23, 2014 3:00 PM To: user@spark.apache.org Subject: "spark.streaming.unpersist" and "spark.cleaner.ttl" I have a DStream re

"spark.streaming.unpersist" and "spark.cleaner.ttl"

2014-07-23 Thread Haopu Wang
I have a DStream receiving data from a socket. I'm using local mode. I set "spark.streaming.unpersist" to "false" and leave "spark.cleaner.ttl" as infinite. I can see files for input and shuffle blocks under the "spark.local.dir" folder, and the size of the folder keeps increasing, although the JVM's memory
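For reference, the two settings under discussion, with example values: "spark.streaming.unpersist" set to "true" lets Spark drop processed input and generated RDD blocks, and "spark.cleaner.ttl" is the 1.x-era periodic metadata cleaner, in seconds.
==
val conf = new SparkConf()
  .set("spark.streaming.unpersist", "true")
  .set("spark.cleaner.ttl", "3600")   // example: clean metadata older than one hour
==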

number of "Cached Partitions" v.s. "Total Partitions"

2014-07-22 Thread Haopu Wang
Hi, I'm using local mode and read a text file as an RDD using the JavaSparkContext.textFile() API, and then call the "cache()" method on the resulting RDD. I look at the Storage information and find the RDD has 3 partitions but only 2 of them have been cached. Is this normal behavior? I assume all of partitio
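Likely relevant context: with the MEMORY_ONLY level that cache() defaults to, partitions that don't fit in memory are simply not stored and are recomputed on use, which would explain 2 of 3 partitions cached. A sketch of spilling to disk instead (the path is hypothetical):
==
import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("data.txt").persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()   // materializes the cache; overflow spills to disk rather than being dropped
==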

RE: data locality

2014-07-21 Thread Haopu Wang
node for every job. The total number of executors is specified by the user. -Sandy On Fri, Jul 18, 2014 at 2:00 AM, Haopu Wang wrote: Sandy, Do you mean the “preferred location” is working for standalone cluster also? Because I check the code of SparkContext and see co

concurrent jobs

2014-07-18 Thread Haopu Wang
By looking at the code of JobScheduler, I find the following parameter: private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1) private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs) Does that mean each App can have only one active stage?

RE: data locality

2014-07-18 Thread Haopu Wang
f, locData) -Sandy On Fri, Jul 18, 2014 at 12:35 AM, Haopu Wang wrote: I have a standalone spark cluster and a HDFS cluster which share some of nodes. When reading HDFS file, how does spark assign tasks to nodes? Will it ask HDFS the location for each file block in order to get a right w

data locality

2014-07-18 Thread Haopu Wang
I have a standalone Spark cluster and an HDFS cluster which share some nodes. When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask HDFS for the location of each file block in order to pick the right worker node? How about a Spark cluster on YARN? Thank you very much!
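A sketch of how to inspect the locality information Spark uses (the path is hypothetical): the scheduler asks the RDD for preferred locations per partition, and for HDFS-backed RDDs these come from the block locations.
==
val rdd = sc.textFile("hdfs:///data/file.txt")
rdd.partitions.foreach { p =>
  println(s"partition ${p.index}: " + rdd.preferredLocations(p).mkString(", "))
}
==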

RE: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-11 Thread Haopu Wang
On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang wrote: I'm running an App for hours in a standalone cluster. From the data injector and the "Streaming" tab of the web UI, it's running well. However, I see quite a lot of Active stages in the web UI even though some of them have all of their tasks co

RE: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Haopu Wang
been completed but the Stage is still shown as "Active"? Do you see any errors in the logs of the driver? On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang wrote: I'm running an App for hours in a standalone cluster. From the data injector and the "Streaming" tab of we

How to clear the list of Completed Applications in the Spark web UI?

2014-07-09 Thread Haopu Wang
Besides restarting the Master, is there any other way to clear the Completed Applications in the Master web UI?