Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
> > But it's really weird to be setting SPARK_HOME in the environment of > your node managers. YARN shouldn't need to know about that. > On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang > wrote: > > > > > https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
so it does not get > expanded by the shell). > > But it's really weird to be setting SPARK_HOME in the environment of > your node managers. YARN shouldn't need to know about that. > On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang > wrote: > > > > > https://github.c

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
Yes, that's right. On Fri, Oct 5, 2018 at 3:35 AM Gourav Sengupta wrote: > Hi Marcelo, > it will be great if you illustrate what you mean, I will be interested to > know. > > Hi Jianshi, > so just to be sure you want to work on SPARK 2.3 while having SPARK 2.1 >

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
-installed 2.2.1 path. I don't want to make any changes to the worker node configuration, so is there any way to override the order? Jianshi On Fri, Oct 5, 2018 at 12:11 AM Marcelo Vanzin wrote: > Normally the version of Spark installed on the cluster does not > matter, since Spark is uploade

Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
'2g') > ,('spark.executor.cores', '4') > ,('spark.executor.instances', '2') > ,('spark.executor.memory', '4g') > ,('spark.network.timeout', '800') > ,('spark.scheduler.mode'

Spark 2.0 Structured Streaming: sc.parallelize in foreach sink cause Task not serializable error

2016-09-25 Thread Jianshi
01:18:00.0,user_test,11) (2016-10-19 01:17:00.0,2016-10-19 01:19:00.0,user_test,11) (2016-10-19 01:18:00.0,2016-10-19 01:20:00.0,user_test,11) (2016-10-19 01:19:00.0,2016-10-19 01:21:00.0,user_test,11) (2016-10-19 01:20:00.0,2016-10-19 01:22:00.0,user_test,11) Does anyone have an idea of the

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
Hmm...all files under the event log folder have permission 770, but strangely my account cannot read other users' files. Permission denied. I'll sort it out with our Hadoop admin. Thanks for the help! Jianshi On Thu, May 28, 2015 at 12:13 AM, Marcelo Vanzin wrote: > Then: >

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
BTW, is there an option to set file permissions for Spark event logs? Jianshi On Thu, May 28, 2015 at 11:25 AM, Jianshi Huang wrote: > Hmm...all files under the event log folder have permission 770, but > strangely my account cannot read other users' files. Permission denied. > >

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
Yes, all written to the same directory on HDFS. Jianshi On Wed, May 27, 2015 at 11:57 PM, Marcelo Vanzin wrote: > You may be the only one not seeing all the logs. Are you sure all the > users are writing to the same log directory? The HS can only read from a > single log directory. &

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
No one using the history server? :) Am I the only one who needs to see all users' logs? Jianshi On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang wrote: > Hi, > > I'm using Spark 1.4.0-rc1 and I'm using default settings for history > server. > > But I can only see my own

View all user's application logs in history server

2015-05-20 Thread Jianshi Huang
Hi, I'm using Spark 1.4.0-rc1 and I'm using default settings for the history server. But I can only see my own logs. Is it possible to view all users' logs? The permission is fine for the user group. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Why so slow

2015-05-12 Thread Jianshi Huang
>= 2014-04-30)) PhysicalRDD [meta#143,nvar#145,date#147], MapPartitionsRDD[6] at explain at :32 Jianshi On Tue, May 12, 2015 at 10:34 PM, Olivier Girardot wrote: > can you post the explain too ? > > Le mar. 12 mai 2015 à 12:11, Jianshi Huang a > écrit : > >> Hi,

Why so slow

2015-05-12 Thread Jianshi Huang
s like https://issues.apache.org/jira/browse/SPARK-5446 is still open; when can we have it fixed? :) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm using the default settings. Jianshi On Wed, May 6, 2015 at 7:05 PM, twinkle sachdeva wrote: > Hi, > > Can you please share your compression etc settings, which you are using. > > Thanks, > Twinkle > > On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang > wrot

FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm facing this error in Spark 1.3.1: https://issues.apache.org/jira/browse/SPARK-4105 Does anyone know the workaround? Change the compression codec for shuffle output? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
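
A minimal sketch of the codec workaround asked about above (the codec choice is illustrative; snappy was the default at the time):

    val conf = new org.apache.spark.SparkConf()
      // alternatives: "lzf", "snappy" (the 1.3.x default)
      .set("spark.io.compression.codec", "lz4")
    val sc = new org.apache.spark.SparkContext(conf)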

Re: Parquet error reading data that contains array of structs

2015-04-27 Thread Jianshi Huang
:BINARY GZIP DO:0 FPO:265489723 SZ:240588/461522/1.92 VC:62173 ENC:PLAIN_DICTIONARY,RLE Jianshi On Mon, Apr 27, 2015 at 12:40 PM, Cheng Lian wrote: > Had an offline discussion with Jianshi, the dataset was generated by Pig. > > Jianshi - Could you please attach the output of "

Re: Parquet error reading data that contains array of structs

2015-04-26 Thread Jianshi Huang
Hi Huai, I'm using Spark 1.3.1. You're right. The dataset is not generated by Spark. It's generated by Pig using Parquet 1.6.0rc7 jars. Let me see if I can send a testing dataset to you... Jianshi On Sat, Apr 25, 2015 at 2:22 AM, Yin Huai wrote: > oh, I missed that. It

Parquet error reading data that contains array of structs

2015-04-24 Thread Jianshi Huang
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: String literal in dataframe.select(...)

2015-04-22 Thread Jianshi Huang
Oh, I found it out. Need to import sql.functions._ Then I can do table.select(lit("2015-04-22").as("date")) Jianshi On Wed, Apr 22, 2015 at 7:27 PM, Jianshi Huang wrote: > Hi, > > I want to do this in Spark SQL DSL: > > select '2015-04-22'
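
Put together, the working snippet from this thread looks like the following (assuming `table` is a DataFrame and a SQLContext is in scope):

    import org.apache.spark.sql.functions._  // brings lit(...) into scope

    val withDate = table.select(lit("2015-04-22").as("date"))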

String literal in dataframe.select(...)

2015-04-22 Thread Jianshi Huang
Hi, I want to do this in Spark SQL DSL: select '2015-04-22' as date from table How to do this? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

How to write Hive's map(key, value, ...) in Spark SQL DSL

2015-04-22 Thread Jianshi Huang
Hi, I want to write this in Spark SQL DSL: select map('c1', c1, 'c2', c2) as m from table Is there a way? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
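
One way to express this, sketched under the assumption that `df` is the DataFrame for `table` and that a HiveContext resolves Hive's map() function (later Spark versions also offer org.apache.spark.sql.functions.map):

    // selectExpr lets SQL expressions pass through the DSL
    val m = df.selectExpr("map('c1', c1, 'c2', c2) as m")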

Re: How to do dispatching in Streaming?

2015-04-17 Thread Jianshi Huang
Thanks everyone for the reply. Looks like foreachRDD + filtering is the way to go. I'll have 4 independent Spark streaming applications so the overhead seems acceptable. Jianshi On Fri, Apr 17, 2015 at 5:17 PM, Evo Eftimov wrote: > Good use of analogies J > > > > Yep fri
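
A sketch of the foreachRDD + filtering approach settled on here; the record type, its topic field, and the topic names are assumptions for illustration:

    case class Event(topic: String, payload: String)  // assumed record shape

    // `stream` is an assumed DStream[Event]
    val routes = Seq("clicks", "orders", "logs", "alerts")
    routes.foreach { t =>
      stream.filter(_.topic == t).foreachRDD { rdd =>
        // per-topic handling goes here
        rdd.foreach(e => println(s"$t: ${e.payload}"))
      }
    }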

How to do dispatching in Streaming?

2015-04-12 Thread Jianshi Huang
ne DStream -> multiple DStreams) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Add partition support in saveAsParquet

2015-03-26 Thread Jianshi Huang
Hi, Does anyone have a similar request? https://issues.apache.org/jira/browse/SPARK-6561 When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: def saveAsParquet(path: String, partitionColumns: Seq[String]) -- Jianshi Huang LinkedIn
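
For reference, the capability proposed in SPARK-6561 later shipped in a different shape; since Spark 1.4 the DataFrameWriter API covers it (column names and path are illustrative):

    // equivalent in spirit to the proposed saveAsParquet(path, partitionColumns)
    df.write.partitionBy("year", "month").parquet("/path/to/output")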

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Oh, by default it's set to 0L. I'll try setting it to 3 immediately. Thanks for the help! Jianshi On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang wrote: > Thanks Shixiong! > > Very strange that our tasks were retried on the same executor again and
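
A sketch of setting the (then-undocumented) property discussed here; the 30000 ms value is illustrative, not the value from the thread:

    val conf = new org.apache.spark.SparkConf()
      // time (ms) during which a failed task won't be rescheduled on the same executor
      .set("spark.scheduler.executorTaskBlacklistTime", "30000")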

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Thanks Shixiong! Very strange that our tasks were retried on the same executor again and again. I'll check spark.scheduler.executorTaskBlacklistTime. Jianshi On Mon, Mar 16, 2015 at 6:02 PM, Shixiong Zhu wrote: > There are 2 cases for "No space left on device": > >

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353 On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang wrote: > Hi, > > We're facing "No space left on device" errors lately from time to time. > The job will fail after retries. Obviously in such a case, retry w

Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
he problematic datanode before retrying it. And maybe dynamically allocate another datanode if dynamic allocation is enabled. I think there needs to be a class of fatal errors that can't be recovered with retries. And it's best if Spark can handle them gracefully. Thanks, -- Jianshi Huang LinkedIn:

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Liancheng also found out that the Spark jars are not included in the classpath of URLClassLoader. Hmm... we're very close to the truth now. Jianshi On Fri, Mar 13, 2015 at 6:03 PM, Jianshi Huang wrote: > I'm almost certain the problem is the ClassLoader. > > So adding

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I'm almost certain the problem is the ClassLoader. So adding fork := true solves the problem for test and run. The question is: how can I fork a JVM for the sbt console? fork in console := true doesn't seem to work... Jianshi On Fri, Mar 13, 2015 at 4:35 PM, Jianshi Huang wrote: > I gues
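
The build.sbt settings that worked here, for reference (sbt runs the console in its own JVM, which is why a forked console is not supported):

    fork in Test := true
    fork in run := true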

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I guess it's a ClassLoader issue. But I have no idea how to debug it. Any hints? Jianshi On Fri, Mar 13, 2015 at 3:00 PM, Eric Charles wrote: > i have the same issue running spark sql code from eclipse workspace. If > you run your code from the command line (with a packaged j

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
Forget about my last message. I was confused. Spark 1.2.1 + Scala 2.10.4 started by SBT console command also failed with this error. However running from a standard spark shell works. Jianshi On Fri, Mar 13, 2015 at 2:46 PM, Jianshi Huang wrote: > Hmm... look like the console command st

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
Hmm... looks like the console command still starts Spark 1.3.0 with Scala 2.11.6 even though I changed them in build.sbt. So the test with 1.2.1 is not valid. Jianshi On Fri, Mar 13, 2015 at 2:34 PM, Jianshi Huang wrote: > I've confirmed it only failed in console started by SBT. > >

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
intln("Created spark context as sc.") [info] [info] def time[T](f: => T): T = { [info] import System.{currentTimeMillis => now} [info] val start = now [info] try { f } finally { println("Elapsed: " + (now - start)/1000.0 + " s") } [info] } [info] [info]

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
BTW, I was running tests from SBT when I got the errors. One test turns a Seq of case classes into a DataFrame. I also tried running similar code in the console, but it failed with the same error. I tested both Spark 1.3.0-rc2 and 1.2.1 with Scala 2.11.6 and 2.10.4. Any idea? Jianshi On Fri, Mar 13, 2015 at 2

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
Same issue here. But the classloader in my exception is somehow different. scala.ScalaReflectionException: class org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with java.net.URLClassLoader@53298398 of type class java.net.URLClassLoader with classpath Jianshi On Sun, Mar 1, 2015 at

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Thanks Sean. I'll ask our Hadoop admin. Actually I didn't find hadoop.tmp.dir in the Hadoop settings... using the user home was suggested by other users. Jianshi On Wed, Mar 11, 2015 at 3:51 PM, Sean Owen wrote: > You shouldn't use /tmp, but it doesn't mean you should use

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Unfortunately the /tmp mount is really small in our environment. I need to provide a per-user setting as the default value. I hacked bin/spark-class for a similar effect. And spark-defaults.conf can override it. :) Jianshi On Wed, Mar 11, 2015 at 3:28 PM, Patrick Wendell wrote: > We do
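
A sketch of injecting the per-user default without patching bin/spark-class, assuming the application builds its own SparkConf (note that on YARN the node manager's local dirs take precedence over spark.local.dir):

    val conf = new org.apache.spark.SparkConf()
      // must be set before the SparkContext starts
      .set("spark.local.dir", s"/x/home/${sys.props("user.name")}/spark/tmp")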

How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Hi, I need to set a per-user spark.local.dir; how can I do that? I tried both /x/home/${user.name}/spark/tmp and /x/home/${USER}/spark/tmp, and neither worked. Looks like it has to be a constant setting in spark-defaults.conf. Right? Any ideas how to do that? Thanks, -- Jianshi Huang

Re: Having lots of FetchFailedException in join

2015-03-05 Thread Jianshi Huang
Thanks. I was about to submit a ticket for this :) Also there's a ticket for sort-merge based groupByKey: https://issues.apache.org/jira/browse/SPARK-3461 BTW, any idea why running with netty didn't output OOM error messages? It's very confusing in troubleshooting. Jianshi On Thu, M

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
There's some skew: 6461640 SUCCESS PROCESS_LOCAL 200 / 2015/03/04 23:45:47 1.1 min 6 s 198.6 MB 21.1 GB 240.8 MB 5961590 SUCCESS PROCESS_LOCAL 30 / 2015/03/04 23:45:47 44 s 5 s 200.7 MB 4.8 GB 154.0 MB But I expect this kind of skewness to be quite common. Jianshi On Thu, Mar 5, 2015 at 3:

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
I see. I'm using core's join. The data might have some skewness (checking). I understand shuffle can spill data to disk, but when consuming it, say in cogroup or groupByKey, it still needs to read all of a group's elements, right? I guess OOM happened there when reading very large groups

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
's still open status. Does that mean groupByKey/cogroup/join (implemented as cogroup + flatMap) will still read each group as a whole when consuming it? How can I deal with the key skewness in joins? Is there a skew-join implementation? Jianshi On Thu, Mar 5, 2015 at 2:44 PM, Shao, Saisai
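
One common manual answer to the skew-join question (a general technique, not from this thread) is key salting: replicate one side across N salts so a hot key becomes N smaller groups. A sketch with assumed pair-RDD types:

    import org.apache.spark.SparkContext._  // pair-RDD implicits, needed on Spark < 1.3
    import scala.util.Random

    val N = 16  // number of salts; tune to the skew
    val saltedLeft  = left.map { case (k, v) => ((k, Random.nextInt(N)), v) }
    val saltedRight = right.flatMap { case (k, v) => (0 until N).map(i => ((k, i), v)) }
    val joined = saltedLeft.join(saltedRight).map { case ((k, _), vs) => (k, vs) }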

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
One really interesting thing is that when I'm using the netty-based spark.shuffle.blockTransferService, there are no OOM error messages (java.lang.OutOfMemoryError: Java heap space). Any idea why they don't show up here? I'm using Spark 1.2.1. Jianshi On Thu, Mar 5, 2015 at 1:56 PM, Jiansh

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
Checkpoint(RDD.scala:280) at org.apache.spark.rdd.RDD.iterator(RDD.scala:247) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at org.apache.spark.scheduler.Task.run(Task.scala:56) Is join/cogroup still memory bound? Jianshi On Wed, Mar 4, 2015

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528) And I checked executor on container host-, everything is good. Jianshi On Wed, Mar 4, 2015 at 12:28 PM, Aaron

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) at java.lang.Thread.run(Thread.java:724) Jianshi On Wed, Mar 4, 2015 at 3:25 AM, Aaron Davidson wrote: > "Failed to connect" implies that the executor at that host died, please > check

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) Jianshi On Wed, Mar 4, 2015 at 2:55 AM, Jianshi Huang wrote: > Hi, > > I got this error message: > &

[no subject]

2015-03-03 Thread Jianshi Huang
SNAPSHOT I built around Dec. 20. Are there any bug fixes related to shuffle block fetching or index files after that? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Loading tables using parquetFile vs. loading tables from Hive metastore with Parquet serde

2015-02-15 Thread Jianshi Huang
serde? Loading tables using parquetFile vs. loading tables from Hive metastore with Parquet serde Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Dynamic partition pattern support

2015-02-15 Thread Jianshi Huang
: https://issues.apache.org/jira/browse/SPARK-5828 Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-13 Thread Jianshi Huang
Get it. Thanks Reynold and Andrew! Jianshi On Thu, Feb 12, 2015 at 12:25 AM, Andrew Or wrote: > Hi Jianshi, > > For YARN, there may be an issue with how a recently patch changes the > accessibility of the shuffle files by the external shuffle service: > https://issues.apache.

Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Jianshi Huang
, 1.3.0) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Pig loader in Spark

2015-02-03 Thread Jianshi Huang
Hi, Anyone has implemented the default Pig Loader in Spark? (loading delimited text files with .pig_schema) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Hive UDAF percentile_approx says "This UDAF does not support the deprecated getEvaluator() method."

2015-01-13 Thread Jianshi Huang
Ah, thx Ted and Yin! I'll build a new version. :) Jianshi On Wed, Jan 14, 2015 at 7:24 AM, Yin Huai wrote: > Yeah, it's a bug. It has been fixed by > https://issues.apache.org/jira/browse/SPARK-3891 in master. > > On Tue, Jan 13, 2015 at 2:41 PM, Ted Yu wrote: > >

Hive UDAF percentile_approx says "This UDAF does not support the deprecated getEvaluator() method."

2015-01-13 Thread Jianshi Huang
org.apache.spark.sql.catalyst.plans.logical.Aggregate$$anonfun$output$6.apply(basicOperators.scala:143) I'm using latest branch-1.2 I found in PR that percentile and percentile_approx are supported. A bug? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-23 Thread Jianshi Huang
FYI, the latest Hive 0.14/Parquet will have column renaming support. Jianshi On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust wrote: > You might also try out the recently added support for views. > > On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang > wrote: > >> Ah... I see. T

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Jianshi Huang
Ah... I see. Thanks for pointing it out. Then it means we cannot mount an external table using customized column names. Hmm... Then the only option left is to use a subquery to add a bunch of column aliases. I'll try it later. Thanks, Jianshi On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-08 Thread Jianshi Huang
Hi Huai, Exactly, I'll probably implement one using the new data source API when I have time... I've found the utility functions in JsonRDD. Jianshi On Tue, Dec 9, 2014 at 3:41 AM, Yin Huai wrote: > Hello Jianshi, > > You meant you want to convert a Map to a Struct, ri

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-07 Thread Jianshi Huang
I checked the source code for inferSchema. Looks like this is exactly what I want: val allKeys = rdd.map(allKeysWithValueTypes).reduce(_ ++ _) Then I can do createSchema(allKeys). Jianshi On Sun, Dec 7, 2014 at 2:50 PM, Jianshi Huang wrote: > Hmm.. > > I've created

Re: Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hmm.. I've created a JIRA: https://issues.apache.org/jira/browse/SPARK-4782 Jianshi On Sun, Dec 7, 2014 at 2:32 PM, Jianshi Huang wrote: > Hi, > > What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? > > I'm currently converting ea

Convert RDD[Map[String, Any]] to SchemaRDD

2014-12-06 Thread Jianshi Huang
Hi, What's the best way to convert RDD[Map[String, Any]] to a SchemaRDD? I'm currently converting each Map to a JSON String and do JsonRDD.inferSchema. How about adding inferSchema support to Map[String, Any] directly? It would be very useful. Thanks, -- Jianshi Huang LinkedI
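
The JSON round-trip workaround described here, sketched with json4s (the library choice is an assumption, and the Map values are assumed to be simple JSON-compatible types):

    import org.json4s.DefaultFormats
    import org.json4s.jackson.Serialization

    implicit val formats = DefaultFormats
    val jsonStrings = rdd.map(m => Serialization.write(m))  // Map[String, Any] -> JSON text
    val schemaRdd = sqlContext.jsonRDD(jsonStrings)         // 1.x API; infers the schema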

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
a> sql("select cre_ts from pmt limit 1").collect res16: Array[org.apache.spark.sql.Row] = Array([null]) I created a JIRA for it: https://issues.apache.org/jira/browse/SPARK-4781 Jianshi On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang wrote: > Hmm... another issue I found

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Hmm... another issue I found doing this approach is that ANALYZE TABLE ... COMPUTE STATISTICS will fail to attach the metadata to the table, and later broadcast join and such will fail... Any idea how to fix this issue? Jianshi On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang wrote: > V

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-06 Thread Jianshi Huang
Very interesting: the line doing drop table throws an exception. After removing it, everything works. Jianshi On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang wrote: > Here's the solution I got after talking with Liancheng: > > 1) using backquote `..` to wrap up all illegal characte

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
o drop and register the table:

    val t = table(name)
    val newSchema = StructType(t.schema.fields.map(s =>
      s.copy(name = s.name.replaceAll(".*?::", ""))))
    sql(s"drop table $name")
    applySchema(t, newSchema).registerTempTable(name)

I'm testing it for now. Tha

Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-05 Thread Jianshi Huang
e external table pmt ( sorted::id bigint ) stored as parquet location '...' Obviously it didn't work; I also tried removing the identifier sorted::, but the resulting rows contain only nulls. Any idea how to create a table in HiveContext from these Parquet files? Thanks,
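
The fix reported in the follow-ups above is to backquote the illegal identifier; a sketch (the location path is a placeholder):

    sql("""
      create external table pmt (`sorted::id` bigint)
      stored as parquet location '/path/to/parquet'
    """)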

Re: drop table if exists throws exception

2014-12-05 Thread Jianshi Huang
xception in the logs, but that exception does not propagate to user code. >> >> On Thu, Dec 4, 2014 at 11:31 PM, Jianshi Huang >> wrote: >> >> > Hi, >> > >> > I got exception saying Hive: NoSuchObjectException(message: table >> > not found)

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
With Liancheng's suggestion, I've tried setting spark.sql.hive.convertMetastoreParquet to false, but ANALYZE with NOSCAN still returns -1 in rawDataSize. Jianshi On Fri, Dec 5, 2014 at 3:33 PM, Jianshi Huang wrote: > If I run ANALYZE without NOSCAN, then Hive can successfully

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
If I run ANALYZE without NOSCAN, then Hive can successfully get the size: parameters:{numFiles=0, EXTERNAL=TRUE, transient_lastDdlTime=1417764589, COLUMN_STATS_ACCURATE=true, totalSize=0, numRows=1156, rawDataSize=76296} Is Hive's PARQUET support broken? Jianshi On Fri, Dec 5, 2014 at 3:

drop table if exists throws exception

2014-12-04 Thread Jianshi Huang
Hi, I got exception saying Hive: NoSuchObjectException(message: table not found) when running "DROP TABLE IF EXISTS " Looks like a new regression in Hive module. Anyone can confirm this? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-12-04 Thread Jianshi Huang
.g. CREATE EXTERNAL TABLE table1 ( code int, desc string ) STORED AS PARQUET LOCATION '/user/jianshuang/data/dim_tables/table1.parquet' Does anyone know what went wrong? Thanks, Jianshi On Fri, Nov 28, 2014 at 1:24 PM, Cheng, Hao wrote: > Hi Jianshi, > > I couldn’t repr

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
I created a ticket for this: https://issues.apache.org/jira/browse/SPARK-4757 Jianshi On Fri, Dec 5, 2014 at 1:31 PM, Jianshi Huang wrote: > Correction: > > According to Liancheng, this hotfix might be the root cause: > > > https://github.com/a

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Correction: According to Liancheng, this hotfix might be the root cause: https://github.com/apache/spark/commit/38cb2c3a36a5c9ead4494cbc3dde008c2f0698ce Jianshi On Fri, Dec 5, 2014 at 12:45 PM, Jianshi Huang wrote: > Looks like the datanucleus*.jar shouldn't appear in the hdfs

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Looks like the datanucleus*.jar shouldn't appear in the hdfs path in Yarn-client mode. Maybe this patch broke yarn-client. https://github.com/apache/spark/commit/a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 Jianshi On Fri, Dec 5, 2014 at 12:02 PM, Jianshi Huang wrote: > Act

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
Actually my HADOOP_CLASSPATH has already been set to include /etc/hadoop/conf/* export HADOOP_CLASSPATH=/etc/hbase/conf/hbase-site.xml:/usr/lib/hbase/lib/hbase-protocol.jar:$(hbase classpath) Jianshi On Fri, Dec 5, 2014 at 11:54 AM, Jianshi Huang wrote: > Looks like somehow Spark failed

Re: Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
SPATH? Jianshi On Fri, Dec 5, 2014 at 11:37 AM, Jianshi Huang wrote: > I got the following error during Spark startup (Yarn-client mode): > > 14/12/04 19:33:58 INFO Client: Uploading resource > file:/x/home/jianshuang/spark/spark-latest/lib/datanucleus-api-jdo-3.2.6.jar > -&g

Exception adding resource files in latest Spark

2014-12-04 Thread Jianshi Huang
ter HEAD yesterday. Is this a bug? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Auto BroadcastJoin optimization failed in latest Spark

2014-11-27 Thread Jianshi Huang
Hi Hao, I'm using an inner join, as broadcast join didn't work for left joins (thanks for the links to the latest improvements). And I'm using HiveContext, and it worked in a previous build (10/12) when joining 15 dimension tables. Jianshi On Thu, Nov 27, 2014 at 8:35 AM, Cheng, Hao

Auto BroadcastJoin optimization failed in latest Spark

2014-11-26 Thread Jianshi Huang
se has met similar situation? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
/usr/lib/hive/lib doesn’t show any of the parquet jars, but ls /usr/lib/impala/lib shows the jar we’re looking for as parquet-hive-1.0.jar Is it removed from latest Spark? Jianshi On Wed, Nov 26, 2014 at 2:13 PM, Jianshi Huang wrote: > Hi, > > Looks like the latest SparkSQL with Hive 0

Re: How to do broadcast join in SparkSQL

2014-11-25 Thread Jianshi Huang
) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) Using the same DDL and Analyze script above. Jianshi On Sat, Oct 11, 2014 at 2:18 PM, Jianshi Huang wrote: > It works fine, thanks for the help Michael. > > Liancheng also told m

Re: How to deal with BigInt in my case class for RDD => SchemaRDD convertion

2014-11-21 Thread Jianshi Huang
Ah yes. I found it too in the manual. Thanks for the help anyway! Since BigDecimal is just a wrapper around BigInt, let's also convert BigInt to Decimal. I created a ticket: https://issues.apache.org/jira/browse/SPARK-4549 Jianshi On Fri, Nov 21, 2014 at 11:30 PM, Yin Huai wrote: >
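
Until SPARK-4549-style support exists, the workaround is converting the BigInt field to BigDecimal (which maps to DecimalType) before registering; a sketch with illustrative case classes, assuming the implicit RDD-to-SchemaRDD conversion from the SQLContext is imported:

    case class Raw(id: BigInt, name: String)
    case class Converted(id: BigDecimal, name: String)

    val converted = rawRdd.map(r => Converted(BigDecimal(r.id), r.name))
    converted.registerTempTable("t")  // 1.x SchemaRDD-era API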

How to deal with BigInt in my case class for RDD => SchemaRDD convertion

2014-11-21 Thread Jianshi Huang
Hi, I got an error during rdd.registerTempTable(...) saying scala.MatchError: scala.BigInt Looks like BigInt cannot be used in SchemaRDD, is that correct? So what would you recommend to deal with it? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog:

NotSerializableException caused by Implicit sparkContext or sparkStreamingContext, why?

2014-11-18 Thread Jianshi Huang
ption saying that SparkContext is not serializable, which is totally irrelevant to txnSentTo. I heard that in Scala 2.11 there will be much better REPL support to solve this issue. Is that true? Could anyone explain why we're having this problem? Thanks, -- Jianshi Huang LinkedIn: jians

Re: Is it safe to use Scala 2.11 for Spark build?

2014-11-18 Thread Jianshi Huang
Ok, I'll wait until -Pscala-2.11 is more stable and used by more people. Thanks for the help! Jianshi On Tue, Nov 18, 2014 at 3:49 PM, Ye Xianjin wrote: > Hi Prashant Sharma, > > It's not even ok to build with scala-2.11 profile on my machine. > >

Is it safe to use Scala 2.11 for Spark build?

2014-11-17 Thread Jianshi Huang
Any notable issues for using Scala 2.11? Is it stable now? Or can I use Scala 2.11 in my spark application and use Spark dist build with 2.10 ? I'm looking forward to migrate to 2.11 for some quasiquote features. Couldn't make it run in 2.10... Cheers, -- Jianshi Huang LinkedI

Re: Is there setup and cleanup function in spark?

2014-11-17 Thread Jianshi Huang
I see. Agreed that lazy eval is not suitable for proper setup and teardown. We also abandoned it due to the inherent incompatibility between implicit and lazy. It was fun to come up with this trick though. Jianshi On Tue, Nov 18, 2014 at 10:28 AM, Tobias Pfeiffer wrote: > Hi, > > On Fri, Nov

Compiling Spark master HEAD failed.

2014-11-14 Thread Jianshi Huang
wrap: scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. -> [Help 1] Does anyone know what the problem is? I'm building it on OSX. I didn't have this problem one month ago. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshu

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
Ok, then we need another trick: let's have an *implicit lazy val connection/context* around our code, and setup() will trigger the eval and initialization. The implicit lazy val trick was actually invented by Kevin. :) Jianshi On Fri, Nov 14, 2014 at 1:41 PM, Cheng Lian wrote:
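
A common shape of the lazy-initialization trick described here (a sketch; createConnection() and send() are assumed user code):

    object Resources {
      // evaluated once per executor JVM, on first use from a task
      lazy val connection = createConnection()
    }

    rdd.map { x => Resources.connection.send(x); x }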

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
So can I write it like this? rdd.mapPartition(i => setup(); i).map(...).mapPartition(i => cleanup(); i) So I don't need to mess up the logic and can still use map, filter and other transformations on the RDD. Jianshi On Fri, Nov 14, 2014 at 12:20 PM, Cheng Lian wrote: > If you’
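
The one-liner above doesn't compile as written, and laziness would break the ordering anyway; here is a compiling sketch that brackets each partition with setup/cleanup, at the cost of materializing the partition:

    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def withSetup[A, B: ClassTag](rdd: RDD[A])(setup: () => Unit, cleanup: () => Unit)
                                 (f: A => B): RDD[B] =
      rdd.mapPartitions { iter =>
        setup()                            // once per partition, on the worker
        try iter.map(f).toVector.iterator  // force evaluation inside the try
        finally cleanup()                  // runs after the partition is processed
      }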

Re: Is there setup and cleanup function in spark?

2014-11-13 Thread Jianshi Huang
Where do you want the setup and cleanup functions to run? Driver or the worker nodes? Jianshi On Fri, Nov 14, 2014 at 10:44 AM, Dai, Kevin wrote: > HI, all > > > > Is there setup and cleanup function as in hadoop mapreduce in spark which > does some initializatio

Re: RDD to DStream

2014-11-12 Thread Jianshi Huang
needs to be collected to the driver; is there a way to avoid doing this? Thanks Jianshi On Mon, Oct 27, 2014 at 4:57 PM, Jianshi Huang wrote: > Sure, let's still focus on the streaming simulation use case. It's a very > useful problem to solve. > > If we're going to use th

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-11-10 Thread Jianshi Huang
Hi Srinivas, Here are the versions I'm using: 1.2.0-SNAPSHOT 1.3.2 1.3.0 org.spark-project.akka 2.3.4-spark I'm using Spark built from master, so it's 1.2.0-SNAPSHOT. Jianshi On Tue, Nov 11, 2014 at 4:06 AM, Srinivas Chamart

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-30 Thread Jianshi Huang
Hi Preshant, Chester, Mohammed, I switched to Spark's Akka and now it works well. Thanks for the help! (Need to exclude Akka from Spray dependencies, or specify it as provided) Jianshi On Thu, Oct 30, 2014 at 3:17 AM, Mohammed Guller wrote: > I am not sure about that. > > &
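
A hedged build.sbt sketch of the exclusion described here (coordinates are the usual Spray/Akka ones from the thread and may need adjusting for your Scala version):

    libraryDependencies ++= Seq(
      "io.spray" %% "spray-client" % "1.3.2"
        excludeAll ExclusionRule(organization = "com.typesafe.akka"),
      "org.spark-project.akka" %% "akka-actor" % "2.3.4-spark" % "provided"
    )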

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
I'm using Spark built from HEAD; I think it uses the modified Akka 2.3.4, right? Jianshi On Wed, Oct 29, 2014 at 5:53 AM, Mohammed Guller wrote: > Try a version built with Akka 2.2.x > > > > Mohammed > > > > *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.co

Re: Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
org.spark-project.akka 2.3.4-spark it should solve the problem. Makes sense? I'll give it a shot when I have time; for now I'll probably just not use the Spray client... Cheers, Jianshi On Tue, Oct 28, 2014 at 6:02 PM, Jianshi Huang wrote: > Hi, > > I got the following exc

Spray client reports Exception: akka.actor.ActorSystem.dispatcher()Lscala/concurrent/ExecutionContext

2014-10-28 Thread Jianshi Huang
e exception. Does anyone have an idea what went wrong? Need help! -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Ephemeral Hive metastore for HiveContext?

2014-10-27 Thread Jianshi Huang
Thanks Ted and Cheng for the in-memory Derby solution. I'll check it out. :) And to me, using in-memory by default makes sense; if a user wants a shared metastore, it should be specified explicitly. An 'embedded' local metastore in the working directory barely has a use case. Jianshi On Mo

Re: Which is better? One spark app listening to 10 topics vs. 10 spark apps each listening to 1 topic

2014-10-27 Thread Jianshi Huang
Any suggestion? :) Jianshi On Thu, Oct 23, 2014 at 3:49 PM, Jianshi Huang wrote: > The Kafka stream has 10 topics and the data rate is quite high (~ 100K/s > per topic). > > Which configuration do you recommend? > - 1 Spark app consuming all Kafka topics > - 10 separ
