java.io.IOException: sendMessageReliably failed without being ACK'd

2014-11-25 Thread xukun
data size: text file, 315G
cmd:
./spark-submit --class com.spark.test.JavaWordCountWithSave --num-executors 7 
--executor-memory 60g --driver-memory 2g --executor-cores 32
--master yarn-client /home/cjs/spark-test.jar hdfs://wordcount/input 
hdfs://wordcount/output

code of JavaWordCountWithSave:
```
package com.spark.test;

import java.util.Arrays;
import java.util.regex.Pattern;

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

public final class JavaWordCountWithSave {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {

    if (args.length < 2) {
      System.err.println("Usage: JavaWordCountWithSave <input> <output>");
      System.exit(1);
    }

    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCountWithSave");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(SPACE.split(s));
      }
    });

    JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

    counts.saveAsTextFile(args[1]);

    ctx.stop();
  }
}
```


log of driver---
14/11/20 14:57:46 WARN TaskSetManager: Lost task 167.0 in stage 1.0 (TID 5207, 
linux-171): ExecutorLostFailure (executor lost)
14/11/20 14:57:46 WARN TaskSetManager: Lost task 41.0 in stage 1.0 (TID 5081, 
linux-171): ExecutorLostFailure (executor lost)
14/11/20 14:57:46 WARN TaskSetManager: Lost task 104.0 in stage 1.0 (TID 5144, 
linux-171): ExecutorLostFailure (executor lost)
14/11/20 14:57:46 WARN TaskSetManager: Lost task 62.0 in stage 1.0 (TID 5102, 
linux-171): ExecutorLostFailure (executor lost)
14/11/20 14:57:46 WARN TaskSetManager: Lost task 20.0 in stage 1.0 (TID 5060, 
linux-171): ExecutorLostFailure (executor lost)
14/11/20 14:57:46 ERROR YarnClientSchedulerBackend: Asked to remove non 
existant executor 5
14/11/20 14:57:46 INFO DAGScheduler: Executor lost: 5 (epoch 1)
14/11/20 14:57:46 ERROR YarnClientSchedulerBackend: Asked to remove non 
existant executor 5
14/11/20 14:57:46 INFO BlockManagerMasterActor: Trying to remove executor 5 
from BlockManagerMaster.
14/11/20 14:57:46 INFO BlockManagerMaster: Removed 5 successfully in 
removeExecutor
14/11/20 14:57:46 ERROR YarnClientSchedulerBackend: Asked to remove non 
existant executor 5
14/11/20 14:57:46 ERROR YarnClientSchedulerBackend: Asked to remove non 
existant executor 5

log of executor---
2014-11-20 14:57:46,879 | INFO  | [connection-manager-thread] | key already 
cancelled ? sun.nio.ch.SelectionKeyImpl@a6b0591 | 
org.apache.spark.Logging$class.logInfo(Logging.scala:80)
java.nio.channels.CancelledKeyException
at 
org.apache.spark.network.nio.ConnectionManager.run(ConnectionManager.scala:379)
at 
org.apache.spark.network.nio.ConnectionManager$$anon$4.run(ConnectionManager.scala:132)
2014-11-20 14:57:46,958 | INFO  | [handle-read-write-executor-3] | Removing 
SendingConnection to ConnectionManagerId(172.168.xxx.16,2267) | 
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-20 14:57:46,963 | INFO  | [handle-read-write-executor-3] | Notifying 
org.apache.spark.network.nio.ConnectionManager$MessageStatus@272b8b5a | 
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-20 14:57:46,963 | INFO  | [handle-read-write-executor-3] | Notifying 
org.apache.spark.network.nio.ConnectionManager$MessageStatus@1bc9d5cd | 
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-20 14:57:47,107 | ERROR | [Connection manager future execution 
context-2] | Failed to get block(s) from 172.168.xxx.16:2267 | 
org.apache.spark.Logging$class.logError(Logging.scala:96)
java.io.IOException: sendMessageReliably failed without being ACK'd
at 
org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:822)
at 
org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:818)
at 
org.apache.spark.network.nio.ConnectionManager$MessageStatus.markDone(ConnectionManager.scala:61)
at 
org.apache.spark.network.nio.ConnectionManager$$anonfun$removeConnection$3.apply(ConnectionManager.scala:451)
at 
org.apache.spark.network.nio.ConnectionManager$$anonfun$removeConnection$3.apply(ConnectionManager.scala:449)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.network.nio.ConnectionManager.removeConnection(ConnectionManager.scala:449)
at 
org.apache.spark.network.nio.ConnectionManager$$anonfun$addListeners$3.apply(ConnectionManager.scala:428)
at 
org.apache.spark.network.nio.ConnectionManager$$anonfun
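
The "sendMessageReliably failed without being ACK'd" error above comes from the old NIO ConnectionManager: a block-fetch request was sent, but the connection to the remote executor was torn down before an ACK came back. With 60g executor heaps this is frequently a symptom of long GC pauses or an overloaded executor rather than a real network fault. Below is a minimal sketch of a knob one might experiment with, assuming a Spark 1.1-era NIO transport; the value is an arbitrary example, not a confirmed fix for this job:

```scala
import org.apache.spark.SparkConf

// Hedged sketch only: "spark.core.connection.ack.wait.timeout" is the number of seconds the
// NIO ConnectionManager waits for an ACK before giving up; a larger value tolerates longer
// GC pauses on the remote executor. 300 is an illustrative value, not a recommendation.
val conf = new SparkConf()
  .setAppName("JavaWordCountWithSave")
  .set("spark.core.connection.ack.wait.timeout", "300")
```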

java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]

2014-11-25 Thread xukun
Submitted 12 Spark applications at the same time. The YARN web page shows that 
two of the applications failed.

the cmd:
./spark-submit --class org.apache.spark.examples.JavaWordCount --master 
yarn-cluster --executor-memory 2g ../lib/spark-examples_2.10-1.1.0.jar 
hdfs://hacluster/bigData

driver log of one failed application:
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Exception in thread "Driver" java.util.concurrent.TimeoutException: Futures 
timed out after [10000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:173)
at 
akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:104)
at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1458)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1448)
at 
org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:56)
at org.apache.spark.SparkEnv$.create(SparkEnv.scala:161)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:213)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:56)
2014-11-23 18:41:19,010 | INFO  | [main] | Registered signal handlers for 
[TERM, HUP, INT] | 
org.apache.spark.util.SignalLogger$.register(SignalLogger.scala:47)
2014-11-23 18:41:54,403 | WARN  | [main] | Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable | 
org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:62)
2014-11-23 18:42:10,319 | INFO  | [main] | ApplicationAttemptId: 
appattempt_1416732306135_0043_01 | 
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:12,213 | INFO  | [main] | Changing view acls to: omm,spark | 
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:12,280 | INFO  | [main] | Changing modify acls to: omm,spark | 
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:12,300 | INFO  | [main] | SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(omm, spark); users 
with modify permissions: Set(omm, spark) | 
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:18,597 | INFO  | [main] | Starting the user JAR in a separate 
Thread | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:18,787 | INFO  | [main] | Waiting for spark context 
initialization | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:18,788 | INFO  | [main] | Waiting for spark context 
initialization ... 0 | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:19,801 | WARN  | [Driver] | In Spark 1.0 and later 
spark.local.dir will be overridden by the value set by the cluster manager (via 
SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN). | 
org.apache.spark.Logging$class.logWarning(Logging.scala:71)
2014-11-23 18:42:22,495 | INFO  | [Driver] | Changing view acls to: omm,spark | 
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:22,521 | INFO  | [Driver] | Changing modify acls to: omm,spark 
| org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:22,521 | INFO  | [Driver] | SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(omm, spark); users 
with modify permissions: Set(omm, spark) | 
org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:28,823 | INFO  | [main] | Waiting for spark context 
initialization ... 1 | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:38,896 | INFO  | [main] | Waiting for spark context 
initialization ... 2 | org.apache.spark.Logging$class.logInfo(Logging.scala:59)
2014-11-23 18:42:47,737 | INFO  | [sparkDriver-akka.actor.default-dispatcher-3] 
| Slf4jLogger started | 
akka.event.slf4j.Slf4jLogger$$anonfun$receive$1.applyOrElse(Sl
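
The stack trace above shows the failure happens inside SparkContext construction itself: AkkaUtils.createActorSystem calls Remoting.start, and its future misses the deadline (Akka's default remoting startup timeout in this era is 10 seconds, matching the 10000 ms in the subject), so the user code never runs. Below is a workaround sketch only, under the assumption that the timeouts are transient contention from launching many drivers on the same hosts at once; the retry count and back-off are arbitrary example values:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch, not a root-cause fix: retry SparkContext construction a few times when
// the driver's ActorSystem misses the remoting startup deadline under heavy host contention.
def createContextWithRetry(conf: SparkConf, attempts: Int = 3): SparkContext = {
  try {
    new SparkContext(conf)
  } catch {
    case e: Exception if attempts > 1 =>
      Thread.sleep(5000) // arbitrary back-off before the next attempt
      createContextWithRetry(conf, attempts - 1)
  }
}
```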

How to resolve Spark site issues?

2014-11-25 Thread York, Brennon
For JIRA tickets like SPARK-4046 
(https://issues.apache.org/jira/browse/SPARK-4046, "Incorrect Java example on 
site"), is there a way to go about fixing those things? It’s a trivial fix, but 
I’m not seeing that code in the codebase anywhere. Is this something the admins 
are going to have to take care of? Just want to clarify before I let it go and 
let the example sit on the site :)




Re: How to resolve Spark site issues?

2014-11-25 Thread Reynold Xin
The website is hosted on some svn server by ASF and unfortunately it
doesn't have a github mirror, so we will have to manually patch it ...


On Tue, Nov 25, 2014 at 11:12 AM, York, Brennon  wrote:

> For JIRA tickets like SPARK-4046<
> https://issues.apache.org/jira/browse/SPARK-4046> (Incorrect Java example
> on site) is there a way to go about fixing those things? Its a trivial fix,
> but I’m not seeing that code in the codebase anywhere. Is this something
> the admins are going to have to take care of? Just want to clarify before I
> let it go and let the example sit on the site :)


Re: How to resolve Spark site issues?

2014-11-25 Thread Sean Owen
For the interested, the SVN repo for the site is viewable at
http://svn.apache.org/viewvc/spark/site/ and to check it out, you can
"svn co https://svn.apache.org/repos/asf/spark/site";

I assume the best process is to make a diff and attach it to the JIRA.
How old school.

On Tue, Nov 25, 2014 at 7:30 PM, Reynold Xin  wrote:
> The website is hosted on some svn server by ASF and unfortunately it
> doesn't have a github mirror, so we will have to manually patch it ...
>
>
> On Tue, Nov 25, 2014 at 11:12 AM, York, Brennon > wrote:
>
>> For JIRA tickets like SPARK-4046<
>> https://issues.apache.org/jira/browse/SPARK-4046> (Incorrect Java example
>> on site) is there a way to go about fixing those things? Its a trivial fix,
>> but I’m not seeing that code in the codebase anywhere. Is this something
>> the admins are going to have to take care of? Just want to clarify before I
>> let it go and let the example sit on the site :)




Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
Hi,

Looks like the latest SparkSQL with Hive 0.12 has a bug in Parquet support.
I got the following exceptions:

org.apache.hadoop.hive.ql.parse.SemanticException: Output Format must
implement HiveOutputFormat, otherwise it should be either
IgnoreKeyTextOutputFormat or SequenceFileOutputFormat
at
org.apache.hadoop.hive.ql.plan.CreateTableDesc.validate(CreateTableDesc.java:431)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:9964)
at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9180)
at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)

Using the same DDL and Analyze script above.

Jianshi


On Sat, Oct 11, 2014 at 2:18 PM, Jianshi Huang 
wrote:

> It works fine, thanks for the help Michael.
>
> Liancheng also told me a trick, using a subquery with LIMIT n. It works in
> latest 1.2.0
>
> BTW, looks like the broadcast optimization won't be recognized if I do a
> left join instead of a inner join. Is that true? How can I make it work for
> left joins?
>
> Cheers,
> Jianshi
>
> On Thu, Oct 9, 2014 at 3:10 AM, Michael Armbrust 
> wrote:
>
>> Thanks for the input.  We purposefully made sure that the config option
>> did not make it into a release as it is not something that we are willing
>> to support long term.  That said we'll try and make this easier in the
>> future either through hints or better support for statistics.
>>
>> In this particular case you can get what you want by registering the
>> tables as external tables and setting an flag.  Here's a helper function to
>> do what you need.
>>
>> /**
>>  * Sugar for creating a Hive external table from a parquet path.
>>  */
>> def createParquetTable(name: String, file: String): Unit = {
>>   import org.apache.spark.sql.hive.HiveMetastoreTypes
>>
>>   val rdd = parquetFile(file)
>>   val schema = rdd.schema.fields.map(f => s"${f.name}
>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>   val ddl = s"""
>> |CREATE EXTERNAL TABLE $name (
>> |  $schema
>> |)
>> |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
>> |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
>> |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
>> |LOCATION '$file'""".stripMargin
>>   sql(ddl)
>>   setConf("spark.sql.hive.convertMetastoreParquet", "true")
>> }
>>
>> You'll also need to run this to populate the statistics:
>>
>> ANALYZE TABLE  tableName COMPUTE STATISTICS noscan;
>>
>>
>> On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang 
>> wrote:
>>
>>> Ok, currently there's cost-based optimization however Parquet statistics
>>> is not implemented...
>>>
>>> What's the good way if I want to join a big fact table with several tiny
>>> dimension tables in Spark SQL (1.1)?
>>>
>>> I wish we can allow user hint for the join.
>>>
>>> Jianshi
>>>
>>> On Wed, Oct 8, 2014 at 2:18 PM, Jianshi Huang 
>>> wrote:
>>>
 Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not
 merged into master?

 I cannot find spark.sql.hints.broadcastTables in latest master, but
 it's in the following patch.


 https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5


 Jianshi


 On Mon, Sep 29, 2014 at 1:24 AM, Jianshi Huang >>> > wrote:

> Yes, looks like it can only be controlled by the
> parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit 
> weird
> to me.
>
> How am I suppose to know the exact bytes of a table? Let me specify
> the join algorithm is preferred I think.
>
> Jianshi
>
> On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu  wrote:
>
>> Have you looked at SPARK-1800 ?
>>
>> e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
>> Cheers
>>
>> On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang <
>> jianshi.hu...@gmail.com> wrote:
>>
>>> I cannot find it in the documentation. And I have a dozen dimension
>>> tables to (left) join...
>>>
>>>
>>> Cheers,
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/

>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
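
For readers following this thread, here is a hedged usage sketch of the createParquetTable helper quoted above. It assumes the helper has been pasted into a spark-shell session where a HiveContext's members (parquetFile, sql, setConf) are in scope; the table names, path, and threshold below are hypothetical, not taken from the thread:

```scala
// Assumed setup in spark-shell:
//   val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
//   import hiveContext._
//   // ...then paste createParquetTable from Michael's mail above...

// Register a small dimension table backed by an existing Parquet file (hypothetical path)
// and compute its statistics so the planner knows how big it is.
createParquetTable("dim_country", "hdfs:///data/dims/dim_country.parquet")
sql("ANALYZE TABLE dim_country COMPUTE STATISTICS noscan")

// Tables whose statistics fall below spark.sql.autoBroadcastJoinThreshold (in bytes) are
// broadcast to every executor; raise the threshold if the dimension table is a bit larger.
setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

// With statistics in place, the planner should pick a broadcast join for this inner join
// (fact_sales is a hypothetical fact table already registered in the metastore).
val joined = sql(
  "SELECT f.amount, d.name FROM fact_sales f JOIN dim_country d ON f.country_id = d.id")
```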


Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
Oh, I found an explanation at
http://cmenguy.github.io/blog/2013/10/30/using-hive-with-parquet-format-in-cdh-4-dot-3/

The error here is a bit misleading, what it really means is that the class
parquet.hive.DeprecatedParquetOutputFormat isn’t in the classpath for Hive.
Sure enough, doing a ls /usr/lib/hive/lib doesn’t show any of the parquet
jars, but ls /usr/lib/impala/lib shows the jar we’re looking for as
parquet-hive-1.0.jar
Has it been removed from the latest Spark?

Jianshi
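
Along the lines of the blog post's explanation, here is a small hedged diagnostic one can run from spark-shell; it only reports whether the Parquet Hive classes named in the error are visible on the driver's classpath, it does not fix anything:

```scala
// Check classpath visibility of the classes the SemanticException complains about.
Seq("parquet.hive.serde.ParquetHiveSerDe",
    "parquet.hive.DeprecatedParquetInputFormat",
    "parquet.hive.DeprecatedParquetOutputFormat").foreach { className =>
  val status =
    try { Class.forName(className); "found" }
    catch { case _: ClassNotFoundException => "missing" }
  println(s"$className -> $status")
}
```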


On Wed, Nov 26, 2014 at 2:13 PM, Jianshi Huang 
wrote:

> Hi,
>
> Looks like the latest SparkSQL with Hive 0.12 has a bug in Parquet
> support. I got the following exceptions:
>
> org.apache.hadoop.hive.ql.parse.SemanticException: Output Format must
> implement HiveOutputFormat, otherwise it should be either
> IgnoreKeyTextOutputFormat or SequenceFileOutputFormat
> at
> org.apache.hadoop.hive.ql.plan.CreateTableDesc.validate(CreateTableDesc.java:431)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeCreateTable(SemanticAnalyzer.java:9964)
> at
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9180)
> at
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
>
> Using the same DDL and Analyze script above.
>
> Jianshi
>
>
> On Sat, Oct 11, 2014 at 2:18 PM, Jianshi Huang 
> wrote:
>
>> It works fine, thanks for the help Michael.
>>
>> Liancheng also told me a trick, using a subquery with LIMIT n. It works
>> in latest 1.2.0
>>
>> BTW, looks like the broadcast optimization won't be recognized if I do a
>> left join instead of a inner join. Is that true? How can I make it work for
>> left joins?
>>
>> Cheers,
>> Jianshi
>>
>> On Thu, Oct 9, 2014 at 3:10 AM, Michael Armbrust 
>> wrote:
>>
>>> Thanks for the input.  We purposefully made sure that the config option
>>> did not make it into a release as it is not something that we are willing
>>> to support long term.  That said we'll try and make this easier in the
>>> future either through hints or better support for statistics.
>>>
>>> In this particular case you can get what you want by registering the
>>> tables as external tables and setting an flag.  Here's a helper function to
>>> do what you need.
>>>
>>> /**
>>>  * Sugar for creating a Hive external table from a parquet path.
>>>  */
>>> def createParquetTable(name: String, file: String): Unit = {
>>>   import org.apache.spark.sql.hive.HiveMetastoreTypes
>>>
>>>   val rdd = parquetFile(file)
>>>   val schema = rdd.schema.fields.map(f => s"${f.name}
>>> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>>>   val ddl = s"""
>>> |CREATE EXTERNAL TABLE $name (
>>> |  $schema
>>> |)
>>> |ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
>>> |STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
>>> |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
>>> |LOCATION '$file'""".stripMargin
>>>   sql(ddl)
>>>   setConf("spark.sql.hive.convertMetastoreParquet", "true")
>>> }
>>>
>>> You'll also need to run this to populate the statistics:
>>>
>>> ANALYZE TABLE  tableName COMPUTE STATISTICS noscan;
>>>
>>>
>>> On Wed, Oct 8, 2014 at 1:44 AM, Jianshi Huang 
>>> wrote:
>>>
 Ok, currently there's cost-based optimization however Parquet
 statistics is not implemented...

 What's the good way if I want to join a big fact table with several
 tiny dimension tables in Spark SQL (1.1)?

 I wish we can allow user hint for the join.

 Jianshi

 On Wed, Oct 8, 2014 at 2:18 PM, Jianshi Huang 
 wrote:

> Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not
> merged into master?
>
> I cannot find spark.sql.hints.broadcastTables in latest master, but
> it's in the following patch.
>
>
> https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5
>
>
> Jianshi
>
>
> On Mon, Sep 29, 2014 at 1:24 AM, Jianshi Huang <
> jianshi.hu...@gmail.com> wrote:
>
>> Yes, looks like it can only be controlled by the
>> parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit 
>> weird
>> to me.
>>
>> How am I suppose to know the exact bytes of a table? Let me specify
>> the join algorithm is preferred I think.
>>
>> Jianshi
>>
>> On Sun, Sep 28, 2014 at 11:57 PM, Ted Yu  wrote:
>>
>>> Have you looked at SPARK-1800 ?
>>>
>>> e.g. see
>>> sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
>>> Cheers
>>>
>>> On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang <
>>> jianshi.hu...@gmail.com> wrote:
>>>
 I cannot find it in the documentation. And I have a dozen dimension
 tables to (left) join...


 Cheers,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
>>