Unsubscribe

2021-09-18 Thread Du Li

Unsubscribe

2021-09-06 Thread Du Li

unsubscribe

2021-07-20 Thread Du Li

Re: Finding the number of executors.

2015-08-21 Thread Du Li
Following is a method that retrieves the list of executors registered to a spark context. It worked perfectly with spark-submit in standalone mode for my project. /**   * A simplified method that just returns the current active/registered executors   * excluding the driver.   * @param sc   *    
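
For context, a minimal sketch of such a helper, assuming Spark 1.x where SparkContext exposes getExecutorMemoryStatus and the driver host is available as spark.driver.host:

    import org.apache.spark.SparkContext

    // Sketch only: returns the hosts of registered executors, excluding the driver.
    def currentActiveExecutors(sc: SparkContext): Seq[String] = {
      val allExecutors = sc.getExecutorMemoryStatus.keys          // entries look like "host:port"
      val driverHost = sc.getConf.get("spark.driver.host")
      allExecutors.filter(e => e.split(":")(0) != driverHost).toList
    }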

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Du Li
repartition() means coalesce(shuffle=false) On Thursday, June 18, 2015 4:07 PM, Corey Nolet wrote: Doesn't repartition call coalesce(shuffle=true)? On Jun 18, 2015 6:53 PM, "Du Li" wrote: I got the same problem with rdd.repartition() in my streaming app, which

Re: Shuffle produces one huge partition and many tiny partitions

2015-06-18 Thread Du Li
I got the same problem with rdd.repartition() in my streaming app, which generated a few huge partitions and many tiny partitions. The resulting high data skew makes the processing time of a batch unpredictable and often exceeding the batch interval. I eventually solved the problem by using rdd
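
For illustration, a rough sketch of forcing an even redistribution inside foreachRDD; stream, the partition count, and the process() helper are placeholders, not necessarily the fix the thread settled on:

    stream.foreachRDD { rdd =>
      // repartition() always shuffles, hashing records across the target partitions,
      // which usually evens out skewed input partitions at the cost of a full shuffle.
      val even = rdd.repartition(numPartitions = 32)
      even.foreachPartition(iter => process(iter))   // process() is a hypothetical handler
    }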

spark eventLog and history server

2015-06-08 Thread Du Li
Event log is enabled in my spark streaming app. My code runs in standalone mode and the spark version is 1.3.1. I periodically stop and restart the streaming context by calling ssc.stop(). However, from the web UI, when clicking on a past job, it says the job is still in progress and does not sh
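
For reference, a minimal sketch of the event-log settings involved, with a placeholder log directory; the history server is pointed at the same path via spark.history.fs.logDirectory. One common cause of the symptom described here is that the event log keeps its .inprogress marker until the underlying SparkContext is stopped cleanly, so the UI keeps listing the application as running.

    val conf = new org.apache.spark.SparkConf()
      .setAppName("streaming-app")                              // placeholder name
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///tmp/spark-events")    // placeholder directory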

Re: how to use rdd.countApprox

2015-05-15 Thread Du Li
move really fast between releases. 1.1.1 feels really old to me :P TD On Wed, May 13, 2015 at 1:25 PM, Du Li wrote: I do rdd.countApprox() and rdd.sparkContext.setJobGroup() inside dstream.foreachRDD{...}. After calling cancelJobGroup(), the spark context seems no longer valid, which crashes

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
PM, Tathagata Das wrote: That is not supposed to happen :/ That is probably a bug.If you have the log4j logs, would be good to file a JIRA. This may be worth debugging. On Wed, May 13, 2015 at 12:13 PM, Du Li wrote: Actually I tried that before asking. However, it killed the spark context

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
(jobGroupId)val approxCount = rdd.countApprox().getInitialValue   // job launched with the set job grouprdd.sparkContext.cancelJobGroup(jobGroupId)           // cancel the job On Wed, May 13, 2015 at 11:24 AM, Du Li wrote: Hi TD, Do you know how to cancel the rdd.countApprox(5000) tasks after
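
Pieced together, the pattern under discussion looks roughly like the following sketch; the job group id is an arbitrary label, and interruptOnCancel and the 5000 ms timeout are illustrative:

    val jobGroupId = "approx-count"                              // arbitrary label
    rdd.sparkContext.setJobGroup(jobGroupId, "approximate count", interruptOnCancel = true)
    val partial = rdd.countApprox(timeout = 5000)                // timeout in milliseconds
    val approx = partial.initialValue                            // BoundedDouble with mean/low/high
    rdd.sparkContext.cancelJobGroup(jobGroupId)                  // stop the still-running count job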

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
Hi TD, Do you know how to cancel the rdd.countApprox(5000) tasks after the timeout? Otherwise it keeps running until completion, producing results not used but consuming resources. Thanks,Du On Wednesday, May 13, 2015 10:33 AM, Du Li wrote: Hi TD, Thanks a lot. rdd.countApprox

Re: how to use rdd.countApprox

2015-05-13 Thread Du Li
at you would have received by "rdd.count()" On Tue, May 12, 2015 at 5:03 PM, Du Li wrote: HI, I tested the following in my streaming app and hoped to get an approximate count within 5 seconds. However, rdd.countApprox(5000).getFinalValue() seemed to always return after it finishes complet

Re: force the kafka consumer process to different machines

2015-05-13 Thread Du Li
Alternatively, you may spread your kafka receivers to multiple machines as discussed in this blog post: How to spread receivers over worker hosts in Spark streaming. In Spark Streaming, you can spawn multipl

Re: how to use rdd.countApprox

2015-05-12 Thread Du Li
On Wednesday, May 6, 2015 7:55 AM, Du Li wrote: I have to count RDD's in a spark streaming app. When data goes large, count() becomes expensive. Did anybody have experience using countApprox()? How accurate/reliable is it?  The documentation is pretty modest. Suppose the timeout

how to use rdd.countApprox

2015-05-06 Thread Du Li
I have to count RDD's in a spark streaming app. When data goes large, count() becomes expensive. Did anybody have experience using countApprox()? How accurate/reliable is it?  The documentation is pretty modest. Suppose the timeout parameter is in milliseconds. Can I retrieve the count value by
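
For what it's worth, a sketch of the two ways to read the result: the timeout is in milliseconds, getFinalValue() blocks until the exact count is done, and initialValue returns whatever estimate is available once the timeout fires.

    val partial = rdd.countApprox(timeout = 5000, confidence = 0.95)
    val quick = partial.initialValue      // estimate after ~5 s: a BoundedDouble with mean/low/high
    val exact = partial.getFinalValue()   // blocks until the count job finishes completely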

Re: RDD coalesce or repartition by #records or #bytes?

2015-05-05 Thread Du Li
get very similar number of records. Thanks. Zhan Zhang On Mar 4, 2015, at 3:47 PM, Du Li wrote: Hi, My RDD's are created from kafka stream. After receiving a RDD, I want to do coalesce/repartition it so that the data will be processed in a set of machines in parallel as even as pos

set up spark cluster with heterogeneous hardware

2015-03-12 Thread Du Li
Hi Spark community, I searched for a way to configure a heterogeneous cluster because the need recently emerged in my project. I didn't find any solution out there. Now I have thought out a solution and thought it might be useful to many other people with similar needs. Following is a blog post

Re: How to use more executors

2015-03-11 Thread Du Li
Is it possible to extend this PR further (or create another PR) to allow for per-node configuration of workers?  There are many discussions about heterogeneous spark cluster. Currently configuration on master will override those on the workers. Many spark users have the need for having machines

Re: How to use more executors

2015-03-11 Thread Du Li
Is it being merged in the next release? It's indeed a critical patch! Du On Wednesday, January 21, 2015 3:59 PM, Nan Zhu wrote: …not sure when will it be reviewed… but for now you can work around by allowing multiple worker instances on a single machine  http://spark.apache.org/docs

Re: FW: RE: distribution of receivers in spark streaming

2015-03-10 Thread Du Li
reference to the community? Might be a good idea to post both methods with pros and cons, as different users may have different constraints. :) Thanks :) TD On Fri, Mar 6, 2015 at 4:07 PM, Du Li wrote: Yes but the caveat may not exist if we do this when the streaming app is launched, since we

Re: distribution of receivers in spark streaming

2015-03-04 Thread Du Li
ext to let all the executors registered, then all the receivers can distribute to the nodes more evenly. Also setting locality is another way as you mentioned.   Thanks Jerry     From: Du Li [mailto:l...@yahoo-inc.com.INVALID] Sent: Thursday, March 5, 2015 1:50 PM To: User Subject: Re: distr

Re: distribution of receivers in spark streaming

2015-03-04 Thread Du Li
Figured it out: I need to override method preferredLocation() in MyReceiver class. On Wednesday, March 4, 2015 3:35 PM, Du Li wrote: Hi, I have a set of machines (say 5) and want to evenly launch a number (say 8) of kafka receivers on those machines. In my code I did something
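
A minimal sketch of that override, assuming the 1.x Receiver API; MyKafkaReceiver and the host argument are placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class MyKafkaReceiver(host: String) extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      // Hint to the scheduler about which worker host should run this receiver.
      override def preferredLocation: Option[String] = Some(host)
      def onStart(): Unit = { /* connect to Kafka and call store(...) */ }
      def onStop(): Unit = { }
    }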

RDD coalesce or repartition by #records or #bytes?

2015-03-04 Thread Du Li
Hi, My RDD's are created from kafka stream. After receiving a RDD, I want to do coalesce/repartition it so that the data will be processed in a set of machines in parallel as even as possible. The number of processing nodes is larger than the receiving nodes. My question is how the coalesce/repa

distribution of receivers in spark streaming

2015-03-04 Thread Du Li
Hi, I have a set of machines (say 5) and want to evenly launch a number (say 8) of kafka receivers on those machines. In my code I did something like the following, as suggested in the spark docs:        val streams = (1 to numReceivers).map(_ => ssc.receiverStream(new MyKafkaReceiver()))       
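
Continuing that snippet as a sketch (numReceivers and MyKafkaReceiver come from the quoted code; the repartition count is a placeholder):

    val streams = (1 to numReceivers).map(_ => ssc.receiverStream(new MyKafkaReceiver()))
    val unified = ssc.union(streams)            // one logical DStream over all receivers
    val spread = unified.repartition(32)        // spread processing beyond the receiving nodes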

Re: RDD saveAsObjectFile write to local file and HDFS

2014-11-26 Thread Du Li
Add "file://" in front of your path. On 11/26/14, 10:15 AM, "firemonk9" wrote: >When I am running spark locally, RDD saveAsObjectFile writes the file to >local file system (ex : path /data/temp.txt) >and >when I am running spark on YARN cluster, RDD saveAsObjectFile writes the >file to hd
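
A quick sketch of what that looks like in practice; the paths are placeholders:

    rdd.saveAsObjectFile("file:///data/temp.txt")        // force the local file system
    rdd.saveAsObjectFile("hdfs:///user/me/temp.txt")     // address HDFS explicitly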

Re: SparkSQL performance

2014-10-31 Thread Du Li
We have seen all kinds of results published that often contradict each other. My take is that the authors often know more tricks about how to tune their own/familiar products than the others. So the product on focus is tuned for ideal performance while the competitors are not. The authors are no

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
To clarify, this error was thrown from the thrift server when beeline was started to establish the connection, as follows: $ beeline -u jdbc:hive2://`hostname`:4080 -n username From: Du Li mailto:l...@yahoo-inc.com.INVALID>> Date: Tuesday, October 28, 2014 at 11:35 AM To: Chen

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
(TSaslServerTransport.java:216) ... 4 more From: Cheng Lian mailto:lian.cs@gmail.com>> Date: Tuesday, October 28, 2014 at 2:50 AM To: Du Li mailto:l...@yahoo-inc.com.invalid>> Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" mailto:user@spark.apache.org>> Subje

Re: [SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-28 Thread Du Li
ubject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1637) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:697) ... 74 more From: Cheng Lian mailto:lian.cs@gmail.com>> Date: Tuesday, October 28, 2014 at 2:50 AM To: Du Li mailto:l...@

[SPARK SQL] kerberos error when creating database from beeline/ThriftServer2

2014-10-27 Thread Du Li
Hi, I was trying to set up Spark SQL on a private cluster. I configured a hive-site.xml under spark/conf that uses a local metastore with warehouse and default FS name set to HDFS on one of my corporate clusters. Then I started spark master, worker and thrift server. However, when creating a dat

Re: HiveContext: cache table not supported for partitioned table?

2014-10-03 Thread Du Li
Thanks for your explanation. From: Cheng Lian mailto:lian.cs@gmail.com>> Date: Thursday, October 2, 2014 at 8:01 PM To: Du Li mailto:l...@yahoo-inc.com.INVALID>>, "d...@spark.apache.org<mailto:d...@spark.apache.org>" mailto:d...@spark.apache.org>> Cc:

HiveContext: cache table not supported for partitioned table?

2014-10-02 Thread Du Li
Hi, In Spark 1.1 HiveContext, I ran a create partitioned table command followed by a cache table command and got a java.sql.SQLSyntaxErrorException: Table/View 'PARTITIONS' does not exist. But cache table worked fine if the table is not a partitioned table. Can anybody confirm that cache of pa
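
A rough sketch of the failing scenario, with a made-up table definition:

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.sql("CREATE TABLE events (id INT, value STRING) PARTITIONED BY (dt STRING)")  // hypothetical table
    hc.sql("CACHE TABLE events")   // reportedly throws the PARTITIONS error on 1.1 for partitioned tables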

Re: view not supported in spark thrift server?

2014-09-28 Thread Du Li
Sunday, September 28, 2014 at 12:13 PM To: Du Li mailto:l...@yahoo-inc.com.invalid>> Cc: "d...@spark.apache.org<mailto:d...@spark.apache.org>" mailto:d...@spark.apache.org>>, "user@spark.apache.org<mailto:user@spark.apache.org>" mailto:user@spark.apa

view not supported in spark thrift server?

2014-09-28 Thread Du Li
Can anybody confirm whether or not view is currently supported in spark? I found “create view translate” in the blacklist of HiveCompatibilitySuite.scala and also the following scenario threw NullPointerException on beeline/thriftserver (1.1.0). Any plan to support it soon? > create table src

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-28 Thread Du Li
o provide the DDL of this partitioned table together >with the query you tried? The stacktrace suggests that the query was >trying to cast a map into something else, which is not supported in >Spark SQL. And I doubt whether Hive support casting a complex type to >some other type. > >

Re: SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Du Li
It might be a problem when inserting into a partitioned table. It worked fine when the target table was unpartitioned. Can you confirm this? Thanks, Du On 9/26/14, 4:48 PM, "Du Li" wrote: >Hi, > >I was loading data into a partitioned table on Spark 1.1.0 >beeline-t

SparkSQL: map type MatchError when inserting into Hive table

2014-09-26 Thread Du Li
Hi, I was loading data into a partitioned table on Spark 1.1.0 beeline-thriftserver. The table has complex data types such as map and array. The query is like “insert overwrite table a partition (…) select …” and the select clause worked if run separately. However, when running the insert query,

Re: Spark SQL use of alias in where clause

2014-09-25 Thread Du Li
Thanks, Yanbo and Nicholas. Now it makes more sense — query optimization is the answer. /Du From: Nicholas Chammas mailto:nicholas.cham...@gmail.com>> Date: Thursday, September 25, 2014 at 6:43 AM To: Yanbo Liang mailto:yanboha...@gmail.com>> Cc: Du Li mailto:l...@yahoo-inc.com.in

Spark SQL use of alias in where clause

2014-09-24 Thread Du Li
Hi, The following query does not work in Shark nor in the new Spark SQLContext or HiveContext. SELECT key, value, concat(key, value) as combined from src where combined like ’11%’; The following tweak of syntax works fine although a bit ugly. SELECT key, value, concat(key, value) as combined fr
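
The original tweak is cut off above; one rewrite that generally works, because WHERE is evaluated before SELECT aliases exist, is to push the aliased expression into a subquery (hiveContext here is a placeholder), roughly:

    val fixed = hiveContext.sql("""
      SELECT key, value, combined
      FROM (SELECT key, value, concat(key, value) AS combined FROM src) t
      WHERE combined LIKE '11%'
    """)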

SQL status code to indicate success or failure of query

2014-09-23 Thread Du Li
Hi, After executing sql() in SQLContext or HiveContext, is there a way to tell whether the query/command succeeded or failed? Method sql() returns SchemaRDD which either is empty or contains some Rows of results. However, some queries and commands do not return results by nature; being empty is
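
Absent a status code, one workaround, sketched here rather than a library facility, is to rely on sql() throwing on failure and wrap the call; sqlContext and the statement are placeholders, and for lazily evaluated queries the exception may only surface once the returned SchemaRDD is forced by an action.

    import scala.util.{Failure, Success, Try}

    Try(sqlContext.sql("INSERT OVERWRITE TABLE t SELECT * FROM s")) match {
      case Success(_) => println("command accepted (an empty result can still be legitimate)")
      case Failure(e) => println(s"query failed: ${e.getMessage}")
    }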

Re: problem with HiveContext inside Actor

2014-09-18 Thread Du Li
y, September 18, 2014 at 7:17 AM To: Du Li mailto:l...@yahoo-inc.com.INVALID>> Cc: Michael Armbrust mailto:mich...@databricks.com>>, "Cheng, Hao" mailto:hao.ch...@intel.com>>, "user@spark.apache.org<mailto:user@spark.apache.org>" mailto:user@spar

Re: problem with HiveContext inside Actor

2014-09-17 Thread Du Li
I do "actor ! CreateSomeDB”, which seems to me just the same thing because the actor does nothing but call createDB(). Du From: Michael Armbrust mailto:mich...@databricks.com>> Date: Wednesday, September 17, 2014 at 7:40 PM To: "Cheng, Hao" mailto:hao.ch...@intel.com

problem with HiveContext inside Actor

2014-09-17 Thread Du Li
Hi, Wonder if anybody had similar experience or any suggestions here. I have an akka Actor that processes database requests in high-level messages. Inside this Actor, it creates a HiveContext object that does the actual db work. The main thread creates the needed SparkContext and passes it in to the

Re: NullWritable not serializable

2014-09-16 Thread Du Li
map(x => (NullWritable.get(), new Text(x))) res.saveAsSequenceFile("./test_data") val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text]) assert(rdd.first == rdd2.first._2.toString) } } From: Matei Zaharia mailto:matei.zaha...@gmail.com>> Date: Monday, Se

Re: NullWritable not serializable

2014-09-15 Thread Du Li
() does not need to serialize and ship data while the other three methods do. Do you recall any difference between spark 1.0 and 1.1 that might cause this problem? Thanks, Du From: Matei Zaharia mailto:matei.zaha...@gmail.com>> Date: Friday, September 12, 2014 at 9:10 PM To: Du Li ma

Re: Does Spark always wait for stragglers to finish running?

2014-09-15 Thread Du Li
There is a parameter spark.speculation that is turned off by default. Look at the configuration doc: http://spark.apache.org/docs/latest/configuration.html From: Pramod Biligiri mailto:pramodbilig...@gmail.com>> Date: Monday, September 15, 2014 at 3:30 PM To: "user@spark.apache.org
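
A minimal sketch of turning it on programmatically; the tuning values here are illustrative, not recommendations:

    val conf = new org.apache.spark.SparkConf()
      .set("spark.speculation", "true")             // re-launch suspected straggler tasks
      .set("spark.speculation.quantile", "0.75")    // fraction of tasks that must finish first
      .set("spark.speculation.multiplier", "1.5")   // how much slower than the median counts as slow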

NullWritable not serializable

2014-09-12 Thread Du Li
Hi, I was trying the following on spark-shell (built with apache master and hadoop 2.4.0). Both calling rdd2.collect and calling rdd3.collect threw java.io.NotSerializableException: org.apache.hadoop.io.NullWritable. I got the same problem in similar code of my app which uses the newly released
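
The usual workaround, sketched below with names mirroring the snippet quoted elsewhere in this thread, is to map the Hadoop writables to plain serializable types before any operation that ships data to the driver:

    import org.apache.hadoop.io.{NullWritable, Text}

    val rdd2 = sc.sequenceFile("./test_data", classOf[NullWritable], classOf[Text])
    val strings = rdd2.map { case (_, text) => text.toString }   // Text -> String is serializable
    strings.collect()                                            // collect no longer ships NullWritable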

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-11 Thread Du Li
SchemaRDD has a method insertInto(table). When the table is partitioned, it would be more sensible and convenient to extend it with a list of partition key and values. From: Denny Lee mailto:denny.g@gmail.com>> Date: Thursday, September 11, 2014 at 6:39 PM To: Du Li mailto:l...

Re: SparkSQL HiveContext TypeTag compile error

2014-09-11 Thread Du Li
Just moving it out of test is not enough. Must move the case class definition to the top level. Otherwise it would report a runtime error of task not serializable when executing collect(). From: Du Li mailto:l...@yahoo-inc.com.INVALID>> Date: Thursday, September 11, 2014 at 12:33

Re: spark sql - create new_table as select * from table

2014-09-11 Thread Du Li
The implementation of SparkSQL is currently incomplete. You may try it out with HiveContext instead of SQLContext. On 9/11/14, 1:21 PM, "jamborta" wrote: >Hi, > >I am trying to create a new table from a select query as follows: > >CREATE TABLE IF NOT EXISTS new_table ROW FORMAT DELIMITED F

Re: SparkSQL HiveContext TypeTag compile error

2014-09-11 Thread Du Li
Solved it. The problem occurred because the case class was defined within a test case in FunSuite. Moving the case class definition out of test fixed the problem. From: Du Li mailto:l...@yahoo-inc.com.INVALID>> Date: Thursday, September 11, 2014 at 11:25 AM To: "user@spark

SparkSQL HiveContext TypeTag compile error

2014-09-11 Thread Du Li
Hi, I have the following code snippet. It works fine on spark-shell but in a standalone app it reports "No TypeTag available for MySchema" at compile time when calling hc.createSchemaRDD(rdd). Anybody knows what might be missing? Thanks, Du -- import org.apache.spark.sql.hive.HiveContext
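
A sketch of the layout that avoids the error: the case class has to live at the top level (not inside a method or test body) so the compiler can materialize its TypeTag; the object and field names are made up.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.hive.HiveContext

    // Top-level definition: visible to the TypeTag machinery at compile time.
    case class MySchema(id: Int, name: String)

    object SchemaApp {
      def run(hc: HiveContext, rdd: RDD[MySchema]): Unit = {
        val schemaRdd = hc.createSchemaRDD(rdd)   // 1.x API; compiles once MySchema is top level
        schemaRdd.registerTempTable("my_schema")
      }
    }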

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-10 Thread Du Li
Hi Denny, There is a related question by the way. I have a program that reads in a stream of RDD's, each of which is to be loaded into a hive table as one partition. Currently I do this by first writing the RDD's to HDFS and then loading them to hive, which requires multiple passes of HDFS I/O an

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-09 Thread Du Li
You need to run mvn install so that the package you built is put into the local maven repo. Then when compiling your own app (with the right dependency specified), the package will be retrieved. On 9/9/14, 8:16 PM, "alexandria1101" wrote: >I think the package does not exist because I need to c

Re: Table not found: using jdbc console to query sparksql hive thriftserver

2014-09-09 Thread Du Li
Your tables were registered in the SqlContext, whereas the thrift server works with HiveContext. They seem to be in two different worlds today. On 9/9/14, 5:16 PM, "alexandria1101" wrote: >Hi, > >I want to use the sparksql thrift server in my application and make sure >everything is loading an

Re: SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Du Li
mailto:mich...@databricks.com>> Date: Wednesday, August 27, 2014 at 5:21 PM To: Du Li mailto:l...@yahoo-inc.com>> Cc: "user@spark.apache.org<mailto:user@spark.apache.org>" mailto:user@spark.apache.org>> Subject: Re: SparkSQL returns ArrayBuffer for fields of type

SparkSQL returns ArrayBuffer for fields of type Array

2014-08-27 Thread Du Li
Hi, Michael. I used HiveContext to create a table with a field of type Array. However, in the hql results, this field was returned as type ArrayBuffer which is mutable. Would it make more sense to be an Array? The Spark version of my test is 1.0.2. I haven’t tested it on SQLContext nor newer v

Re: Execute HiveFormSpark ERROR.

2014-08-27 Thread Du Li
As suggested in the error messages, double-check your class path. From: CharlieLin mailto:chury...@gmail.com>> Date: Tuesday, August 26, 2014 at 8:29 PM To: "user@spark.apache.org" mailto:user@spark.apache.org>> Subject: Execute HiveFormSpark ERROR. hi, all :

Re: unable to instantiate HiveMetaStoreClient on LocalHiveContext

2014-08-26 Thread Du Li
IllegalAccessError was thrown, which was eventually translated into failure to instantiate HiveMetaStoreClient. It was discussed by Cheng Lian and Zhun Shen in another thread posted on 8/7/14. Du From: Yin Huai mailto:huaiyin@gmail.com>> Date: Tuesday, August 26, 2014 at 8:58 AM To: Du Li ma

unable to instantiate HiveMetaStoreClient on LocalHiveContext

2014-08-25 Thread Du Li
Hi, I created an instance of LocalHiveContext and attempted to create a database. However, it failed with message "org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException: Unable to i

Re: Hive From Spark

2014-08-25 Thread Du Li
Never mind. I have resolved this issue by moving the local guava dependency forward. Du On 8/22/14, 5:08 PM, "Du Li" wrote: >I thought the fix had been pushed to the apache master ref. commit >"[SPARK-2848] Shade Guava in uber-jars" By Marcelo Vanzin on 8/20. So my

Re: Hive From Spark

2014-08-22 Thread Du Li
t;>> I don't believe the Guava change has made it to the 1.1 branch. The >>> Guava doc says "hashInt" was added in 12.0, so what's probably >>> happening is that you have and old version of Guava in your classpath >>> before the Spark jars. (H

Re: Hive From Spark

2014-08-21 Thread Du Li
Hi, This guava dependency conflict problem should have been fixed as of yesterday according to https://issues.apache.org/jira/browse/SPARK-2420 However, I just got java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; by the following code

Re: Difference between SparkSQL and shark

2014-07-10 Thread Du Li
On the spark 1.0-jdbc branch, there is a thrift server and a beeline CLI that roughly keeps the shark style, corresponding to the shark server and shark CLI, respectively. Check out https://github.com/apache/spark/blob/branch-1.0-jdbc/docs/sql-programming-guide.md for more information. Du Fro

Re: error in creating external table

2014-07-09 Thread Du Li
Hi, I got an error when trying to create an external table with location on a remote HDFS address. I meant to quickly try out the basic features on spark SQL 1.0-JDBC and so started the thrift server on one terminal and beeline CLI on another. Didn’t do any extra configuration on spark sql, hi