View all user's application logs in history server

2015-05-20 Thread Jianshi Huang
Hi, I'm using Spark 1.4.0-rc1 and I'm using default settings for history server. But I can only see my own logs. Is it possible to view all user's logs? The permission is fine for the user group. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
No one using History server? :) Am I the only one need to see all user's logs? Jianshi On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang wrote: > Hi, > > I'm using Spark 1.4.0-rc1 and I'm using default settings for history > server. > > But I can only see my own

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
> On Wed, May 27, 2015 at 5:33 AM, Jianshi Huang > wrote: > >> No one using History server? :) >> >> Am I the only one need to see all user's logs? >> >> Jianshi >> >> On Thu, May 21, 2015 at 1:29 PM, Jianshi Huang >> wrote:

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
BTW, is there an option to set file permission for spark event logs? Jianshi On Thu, May 28, 2015 at 11:25 AM, Jianshi Huang wrote: > Hmm...all files under the event log folder has permission 770 but > strangely my account cannot read other user's files. Permission denied. > >

Re: View all user's application logs in history server

2015-05-27 Thread Jianshi Huang
- Are all files readable by the user running the history server? > - Did all applications call sc.stop() correctly (i.e. files do not have > the ".inprogress" suffix)? > > Other than that, always look at the logs first, looking for any errors > that may be thrown. > > &
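
For reference, the knobs that usually decide whether one user can browse another user's applications in the History Server are the shared event-log location and the view ACLs. This is a minimal sketch using standard Spark configuration keys, not something taken from the thread; the HDFS path and user names are illustrative.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///shared/spark-events")  // assumed shared, group-readable directory
      .set("spark.ui.view.acls", "alice,bob")                    // extra users allowed to view this application
    // On the server side, spark.history.ui.acls.enable controls whether these ACLs are enforced,
    // and the user running the History Server still needs read access to every log file.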

Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
', 'FAIR') > ,('spark.shuffle.service.enabled', 'true') > ,('spark.dynamicAllocation.enabled', 'true') > ]) > py_files = > ['hdfs://emr-header-1.cluster-68492:9000/lib/py4j-0.10.7-src.zip'] > sc = pyspark.SparkContext(appName="Jianshi", master="yarn-client", > conf=sparkConf, pyFiles=py_files) > > Thanks, -- Jianshi Huang

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
d from your gateway machine to YARN by > default. > > You probably have some configuration (in spark-defaults.conf) that > tells YARN to use a cached copy. Get rid of that configuration, and > you can use whatever version you like. > On Thu, Oct 4, 2018 at 2:19 AM Jianshi Huang > wro

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
d to be setting SPARK_HOME in the environment of >> your node managers. YARN shouldn't need to know about that. >> On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang >> wrote: >> > >> > >> https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4bb97a9d

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
so it does not get > expanded by the shell). > > But it's really weird to be setting SPARK_HOME in the environment of > your node managers. YARN shouldn't need to know about that. > On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang > wrote: > > > > > https://github.c

Re: Specifying different version of pyspark.zip and py4j files on worker nodes with Spark pre-installed

2018-10-04 Thread Jianshi Huang
> > But it's really weird to be setting SPARK_HOME in the environment of > your node managers. YARN shouldn't need to know about that. > On Thu, Oct 4, 2018 at 10:22 AM Jianshi Huang > wrote: > > > > > https://github.com/apache/spark/blob/88e7e87bd5c052e10f52d4

Pig loader in Spark

2015-02-03 Thread Jianshi Huang
Hi, Anyone has implemented the default Pig Loader in Spark? (loading delimited text files with .pig_schema) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-10 Thread Jianshi Huang
, 1.3.0) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Which version to use for shuffle service if I'm going to run multiple versions of Spark

2015-02-13 Thread Jianshi Huang
eynold Xin : > > I think we made the binary protocol compatible across all versions, so you >> should be fine with using any one of them. 1.2.1 is probably the best since >> it is the most recent stable release. >> >> On Tue, Feb 10, 2015 at 8:43 PM, Jianshi Huang &

Dynamic partition pattern support

2015-02-15 Thread Jianshi Huang
: https://issues.apache.org/jira/browse/SPARK-5828 Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Loading tables using parquetFile vs. loading tables from Hive metastore with Parquet serde

2015-02-15 Thread Jianshi Huang
serde? Loading tables using parquetFile vs. loading tables from Hive metastore with Parquet serde Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
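
As a hedged sketch of the two code paths being compared (the table path and context names are assumed; parquetFile was the direct-read API of that era):

    // Read the Parquet files directly, bypassing the metastore.
    val direct = sqlContext.parquetFile("hdfs:///user/jianshuang/data/parquet/table")

    // Go through the Hive metastore, asking Spark SQL to substitute its native Parquet
    // reader for the Hive SerDe (this flag also appears later in this archive).
    hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
    val viaMetastore = hiveContext.sql("SELECT * FROM parquet_table")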

[no subject]

2015-03-03 Thread Jianshi Huang
SNAPSHOT I built around Dec. 20. Is there any bug fixes related to shuffle block fetching or index files after that? Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$$anonfun$3.apply(BlockStoreShuffleFetcher.scala:83) Jianshi On Wed, Mar 4, 2015 at 2:55 AM, Jianshi Huang wrote: > Hi, > > I got this error message: > &

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
its logs as well. > > On Tue, Mar 3, 2015 at 11:03 AM, Jianshi Huang > wrote: > >> Sorry that I forgot the subject. >> >> And in the driver, I got many FetchFailedException. The error messages are >> >> 15/03/03 10:34:32 WARN TaskSetManager: Lost task 31.0 in

Re: Having lots of FetchFailedException in join

2015-03-03 Thread Jianshi Huang
Davidson wrote: > Drat! That doesn't help. Could you scan from the top to see if there were > any fatal errors preceding these? Sometimes a OOM will cause this type of > issue further down. > > On Tue, Mar 3, 2015 at 8:16 PM, Jianshi Huang > wrote: > >> The failed

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
at 2:11 PM, Jianshi Huang wrote: > Hmm... ok, previous errors are still block fetch errors. > > 15/03/03 10:22:40 ERROR RetryingBlockFetcher: Exception while beginning > fetch of 11 outstanding blocks > java.io.IOException: Failed to connect to host-xxx

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
One really interesting is that when I'm using the netty-based spark.shuffle.blockTransferService, there's no OOM error messages (java.lang.OutOfMemoryError: Java heap space). Any idea why it's not here? I'm using Spark 1.2.1. Jianshi On Thu, Mar 5, 2015 at 1:56 PM, Jiansh

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
e issues when join key is skewed or key number is > smaller, so you will meet OOM. > > > > Maybe you could monitor each stage or task’s shuffle and GC status also > system status to identify the problem. > > > > Thanks > > Jerry > > > > *From:* Jianshi

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
park core side, all the shuffle related operations can spill the > data into disk and no need to read the whole partition into memory. But if > you uses SparkSQL, it depends on how SparkSQL uses this operators. > > > > CC @hao if he has some thoughts on it. > > > > Than

Re: Having lots of FetchFailedException in join

2015-03-04 Thread Jianshi Huang
48 PM, Jianshi Huang wrote: > I see. I'm using core's join. The data might have some skewness > (checking). > > I understand shuffle can spill data to disk but when consuming it, say in > cogroup or groupByKey, it still needs to read the whole group elements, > right? I gues

Re: Having lots of FetchFailedException in join

2015-03-05 Thread Jianshi Huang
ar 5, 2015 at 4:01 PM, Shao, Saisai wrote: > I think there’s a lot of JIRA trying to solve this problem ( > https://issues.apache.org/jira/browse/SPARK-5763). Basically sort merge > join is a good choice. > > > > Thanks > > Jerry > > > > *From:* Jianshi Hua

How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
Hi, I need to set per-user spark.local.dir, how can I do that? I tried both /x/home/${user.name}/spark/tmp and /x/home/${USER}/spark/tmp And neither worked. Looks like it has to be a constant setting in spark-defaults.conf. Right? Any ideas how to do that? Thanks, -- Jianshi Huang

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
n't support expressions or wildcards in that configuration. For > each application, the local directories need to be constant. If you > have users submitting different Spark applications, those can each set > spark.local.dirs. > > - Patrick > > On Wed, Mar 11, 2015 at 12:14 AM, J
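
A small sketch of what the reply suggests: since the configuration cannot expand ${user.name}, resolve the user in the application itself and pass a concrete path. On YARN the node manager's local directories generally take precedence anyway, so treat this as an assumption-laden illustration.

    import org.apache.spark.SparkConf

    val user = sys.props("user.name")
    val conf = new SparkConf().set("spark.local.dir", s"/x/home/$user/spark/tmp")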

Re: How to set per-user spark.local.dir?

2015-03-11 Thread Jianshi Huang
user home > directories either. Typically, like in YARN, you would a number of > directories (on different disks) mounted and configured for local > storage for jobs. > > On Wed, Mar 11, 2015 at 7:42 AM, Jianshi Huang > wrote: > > Unfortunately /tmp mount is really small in ou

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
th boot classpath [.] not found >>> >>> >>> Here's more info on the versions I am using - >>> >>> 2.11 >>> 1.2.1 >>> 2.11.5 >>> >>> Please let me know how can I resolve this problem. >>> >>> Thanks >>> Ashish >>> >> >> > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
:23 AM, Jianshi Huang wrote: > Same issue here. But the classloader in my exception is somehow different. > > scala.ScalaReflectionException: class > org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with > java.net.URLClassLoader@53298398 of type class java.net.URLCla

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
@transient val sqlc = new org.apache.spark.sql.SQLContext(sc) [info] implicit def sqlContext = sqlc [info] import sqlc._ Jianshi On Fri, Mar 13, 2015 at 3:10 AM, Jianshi Huang wrote: > BTW, I was running tests from SBT when I get the errors. One test turn a > Seq of case class to Data

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
Hmm... look like the console command still starts a Spark 1.3.0 with Scala 2.11.6 even I changed them in build.sbt. So the test with 1.2.1 is not valid. Jianshi On Fri, Mar 13, 2015 at 2:34 PM, Jianshi Huang wrote: > I've confirmed it only failed in console started by SBT. > >

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-12 Thread Jianshi Huang
Forget about my last message. I was confused. Spark 1.2.1 + Scala 2.10.4 started by SBT console command also failed with this error. However running from a standard spark shell works. Jianshi On Fri, Mar 13, 2015 at 2:46 PM, Jianshi Huang wrote: > Hmm... look like the console command st

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
nction is throwing exception >>> >>> Exception in thread "main" scala.ScalaReflectionException: class >>> org.apache.spark.sql.catalyst.ScalaReflection in JavaMirror with primordial >>> classloader with boot classpath [.] not found >>> >>

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
I'm almost certain the problem is the ClassLoader. So adding fork := true solves problems for test and run. The problem is how can I fork a JVM for sbt console? fork in console := true seems not working... Jianshi On Fri, Mar 13, 2015 at 4:35 PM, Jianshi Huang wrote: > I gues
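
A build.sbt sketch of what the message describes (the setting names are standard sbt, not copied from the thread):

    // Run `test` and `run` in a forked JVM so the Spark jars end up on the expected classloader.
    fork := true
    fork in Test := true
    // Per the thread, `fork in console := true` does not help, because the sbt console
    // runs inside the sbt JVM and cannot be forked.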

Re: Unable to find org.apache.spark.sql.catalyst.ScalaReflection class

2015-03-13 Thread Jianshi Huang
Liancheng also found out that the Spark jars are not included in the classpath of URLClassLoader. Hmm... we're very close to the truth now. Jianshi On Fri, Mar 13, 2015 at 6:03 PM, Jianshi Huang wrote: > I'm almost certain the problem is the ClassLoader. > > So adding

Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
he problematic datanode before retrying it. And maybe dynamically allocate another datanode if dynamic allocation is enabled. I think there needs to be a class of fatal errors that can't be recovered with retries. And it's best Spark can handle it nicely. Thanks, -- Jianshi Huang LinkedIn:

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
I created a JIRA: https://issues.apache.org/jira/browse/SPARK-6353 On Mon, Mar 16, 2015 at 5:36 PM, Jianshi Huang wrote: > Hi, > > We're facing "No space left on device" errors lately from time to time. > The job will fail after retries. Obvious in such case, retry w

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
of our cases are the second one, we set > "spark.scheduler.executorTaskBlacklistTime" to 3 to solve such "No > space left on device" errors. So if a task runs unsuccessfully in some > executor, it won't be scheduled to the same executor in 30 seconds. > > > Best Regards, > Shi

Re: Handling fatal errors of executors and decommission datanodes

2015-03-16 Thread Jianshi Huang
Oh, by default it's set to 0L. I'll try setting it to 3 immediately. Thanks for the help! Jianshi On Mon, Mar 16, 2015 at 11:32 PM, Jianshi Huang wrote: > Thanks Shixiong! > > Very strange that our tasks were retried on the same executor again and
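
As a sketch of the suggested setting (the snippet truncates the value; 30000 ms matches the "30 seconds" mentioned in the reply):

    import org.apache.spark.SparkConf

    // Don't reschedule a failed task on the same executor for 30 seconds.
    val conf = new SparkConf().set("spark.scheduler.executorTaskBlacklistTime", "30000")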

Add partition support in saveAsParquet

2015-03-26 Thread Jianshi Huang
Hi, Anyone has similar request? https://issues.apache.org/jira/browse/SPARK-6561 When we save a DataFrame into Parquet files, we also want to have it partitioned. The proposed API looks like this: def saveAsParquet(path: String, partitionColumns: Seq[String]) -- Jianshi Huang LinkedIn
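
For context, later releases ended up exposing this through the DataFrameWriter API rather than a new saveAsParquet signature; a hedged sketch with assumed column names:

    df.write
      .partitionBy("year", "month")
      .parquet("hdfs:///user/jianshuang/data/parquet/table")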

How to do dispatching in Streaming?

2015-04-12 Thread Jianshi Huang
ne DStream -> multiple DStreams) Thanks, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
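
The question is truncated here, but one common reading of "one DStream -> multiple DStreams" is splitting a stream by a predicate. A minimal sketch, assuming `lines` is the input DStream and records carry a routing prefix:

    val orders   = lines.filter(_.startsWith("ORDER"))
    val payments = lines.filter(_.startsWith("PAYMENT"))
    // Each filtered DStream can then be processed independently.
    orders.print()
    payments.print()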

Re: How to do dispatching in Streaming?

2015-04-17 Thread Jianshi Huang
m lime / the big picture – in some models, > friction can be a huge factor in the equations in some other it is just > part of the landscape > > > > *From:* Gerard Maas [mailto:gerard.m...@gmail.com] > *Sent:* Friday, April 17, 2015 10:12 AM > > *To:* Evo Eftimov > *Cc:* Tath

How to write Hive's map(key, value, ...) in Spark SQL DSL

2015-04-22 Thread Jianshi Huang
Hi, I want to write this in Spark SQL DSL: select map('c1', c1, 'c2', c2) as m from table Is there a way? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
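
Two hedged possibilities, neither confirmed by the thread: push the expression through selectExpr (works where the context's SQL parser understands map(...)), or, in Spark 2.x and later, use the dedicated map function:

    val viaExpr = df.selectExpr("map('c1', c1, 'c2', c2) as m")

    import org.apache.spark.sql.functions.{map, lit, col}
    val viaDsl = df.select(map(lit("c1"), col("c1"), lit("c2"), col("c2")).as("m"))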

String literal in dataframe.select(...)

2015-04-22 Thread Jianshi Huang
Hi, I want to do this in Spark SQL DSL: select '2015-04-22' as date from table How to do this? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: String literal in dataframe.select(...)

2015-04-22 Thread Jianshi Huang
Oh, I found it out. Need to import sql.functions._ Then I can do table.select(lit("2015-04-22").as("date")) Jianshi On Wed, Apr 22, 2015 at 7:27 PM, Jianshi Huang wrote: > Hi, > > I want to do this in Spark SQL DSL: > > select '2015-04-22'
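
The resolution from the message above, as a self-contained snippet (`table` is assumed to be a DataFrame):

    import org.apache.spark.sql.functions._

    val withDate = table.select(lit("2015-04-22").as("date"))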

Parquet error reading data that contains array of structs

2015-04-24 Thread Jianshi Huang
at parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:126) at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Parquet error reading data that contains array of structs

2015-04-26 Thread Jianshi Huang
> Fix Version >> >> On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai wrote: >> >>> The exception looks like the one mentioned in >>> https://issues.apache.org/jira/browse/SPARK-4520. What is the version >>> of Spark?

Re: Parquet error reading data that contains array of structs

2015-04-27 Thread Jianshi Huang
>> Fix Version of SPARK-4520 is not set. >> I assume it was fixed in 1.3.0 >> >> Cheers >> Fix Version >> >> On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai wrote: >> >>> The exception looks like the one mentioned in >>> https://is

FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm facing this error in Spark 1.3.1 https://issues.apache.org/jira/browse/SPARK-4105 Anyone knows what's the workaround? Change the compression codec for shuffle output? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-06 Thread Jianshi Huang
I'm using the default settings. Jianshi On Wed, May 6, 2015 at 7:05 PM, twinkle sachdeva wrote: > Hi, > > Can you please share your compression etc settings, which you are using. > > Thanks, > Twinkle > > On Wed, May 6, 2015 at 4:15 PM, Jianshi Huang > wrot
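
A hedged sketch of the workaround floated in the question itself, switching the codec used for shuffle/IO compression away from the default:

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.io.compression.codec", "lz4")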

Why so slow

2015-05-12 Thread Jianshi Huang
s like https://issues.apache.org/jira/browse/SPARK-5446 is still open, when can we have it fixed? :) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Why so slow

2015-05-12 Thread Jianshi Huang
>= 2014-04-30)) PhysicalRDD [meta#143,nvar#145,date#147], MapPartitionsRDD[6] at explain at <console>:32 Jianshi On Tue, May 12, 2015 at 10:34 PM, Olivier Girardot wrote: > can you post the explain too ? > > On Tue, May 12, 2015 at 12:11, Jianshi Huang > wrote: > >> Hi,

Need help. Spark + Accumulo => Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
01005082020.jar:META-INF/ECLIPSEF.RSA [error] /Users/jianshuang/.ivy2/cache/org.eclipse.jetty.orbit/javax.activation/orbits/javax.activation-1.1.0.v201105071233.jar:META-INF/ECLIPSEF.RSA I googled it and looks like I need to exclude some JARs. Anyone has done that? Your help is really appreciated. Cheers,
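
For the META-INF/*.RSA duplicate errors shown above, the usual fix is an assembly merge strategy that discards jar signature files. A hedged build.sbt sketch, assuming the sbt-assembly plugin is on the build (key names vary slightly between plugin versions):

    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) if xs.lastOption.exists(f =>
        f.endsWith(".RSA") || f.endsWith(".SF") || f.endsWith(".DSA")) => MergeStrategy.discard
      case _ => MergeStrategy.first
    }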

Re: Need help. Spark + Accumulo => Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-16 Thread Jianshi Huang
Das wrote: > Hi > > Check in your driver programs Environment, (eg: > http://192.168.1.39:4040/environment/). If you don't see this > commons-codec-1.7.jar jar then that's the issue. > > Thanks > Best Regards > > > On Mon, Jun 16, 2014 at 5:07 PM, Jia

Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
1.jar gson.jar guava.jar joda-convert-1.2.jar joda-time-2.3.jar kryo-2.21.jar libthrift.jar quasiquotes_2.10-2.0.0-M8.jar scala-async_2.10-0.9.1.jar scala-library-2.10.4.jar scala-reflect-2.10.4.jar Anyone has hint what went wrong? Really confused. Cheers, -- Jianshi Huang Linke

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
l.com:7077... 14/06/17 04:15:32 ERROR Worker: Worker registration failed: Attempted to re-register worker at same address: akka.tcp:// sparkwor...@lvshdc5dn0321.lvs.paypal.com:41987 Is that a bug? Jianshi On Tue, Jun 17, 2014 at 5:41 PM, Jianshi Huang wrote: > Hi, > > I've

Re: Yarn-client mode and standalone-client mode hang during job start

2014-06-17 Thread Jianshi Huang
spark-submit from within the cluster, or > outside of it? If the latter, could you try running it from within the > cluster and see if it works? (Does your rtgraph.jar exist on the machine > from which you run spark-submit?) > > > 2014-06-17 2:41 GMT-07:00 Jianshi Huang : > > Hi

Wildcard support in input path

2014-06-17 Thread Jianshi Huang
It would be convenient if Spark's textFile, parquetFile, etc. can support path with wildcard, such as: hdfs://domain/user/jianshuang/data/parquet/table/month=2014* Or is there already a way to do it now? Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & B
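
As later replies in this thread confirm for textFile, Hadoop-style globs already work; a sketch with an illustrative path:

    val parts = sc.textFile("hdfs://domain/user/jianshuang/data/text/month=2014*")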

Re: Wildcard support in input path

2014-06-17 Thread Jianshi Huang
rogram and it worked.. >> My code was like this >> b = sc.textFile("hdfs:///path to file/data_file_2013SEP01*") >> >> Thanks & Regards, >> Meethu M >> >> >> On Wednesday, 18 June 2014 9:29 AM, Jianshi Huang < >> jianshi.hu...@gmail.com

Re: Wildcard support in input path

2014-06-17 Thread Jianshi Huang
Hi all, Thanks for the reply. I'm using parquetFile as input, is that a problem? In hadoop fs -ls, the path (hdfs://domain/user/jianshuang/data/parquet/table/month=2014*) will get list all the files. I'll test it again. Jianshi On Wed, Jun 18, 2014 at 2:23 PM, Jianshi Huang wr

Re: Wildcard support in input path

2014-06-18 Thread Jianshi Huang
string as part of their name? > ​ > > > On Wed, Jun 18, 2014 at 2:25 AM, Jianshi Huang > wrote: > >> Hi all, >> >> Thanks for the reply. I'm using parquetFile as input, is that a problem? >> In hadoop fs -ls, the path (hdfs://domain/user/ >>

Re: Need help. Spark + Accumulo => Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-23 Thread Jianshi Huang
he Spark User List mailing list archive at Nabble.com. > -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Use Spark with HBase' HFileOutputFormat

2014-07-16 Thread Jianshi Huang
eSortReducer or PutSortReducer) But in Spark, it seems I have to do the sorting and partition myself, right? Can anyone show me how to do it properly? Is there a better way to ingest data fast to HBase from Spark? Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-24 Thread Jianshi Huang
$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-25 Thread Jianshi Huang
're transformed to the a KeyValue to be insert in HBase, so I need to do a .reduce(_.union(_)) to combine them into one RDD[(key, value)]. I cannot see what's wrong in my code. Jianshi On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang wrote: > I can successfully run my code in local

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-27 Thread Jianshi Huang
transformed to the a KeyValue to be insert in HBase, so I need to > do a .reduce(_.union(_)) to combine them into one RDD[(key, value)]. > > I cannot see what's wrong in my code. > > Jianshi > > > > On Fri, Jul 25, 2014 at 12:24 PM, Jianshi Huang > wrote: >

Re: Need help, got java.lang.ExceptionInInitializerError in Yarn-Client/Cluster mode

2014-07-28 Thread Jianshi Huang
This would be helpful. I personally like Yarn-Client mode as all the running status can be checked directly from the console. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Jianshi Huang
ems not working. What are the other possible reasons? How to fix it? Jianshi -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

RDD.coalesce got compilation error

2014-07-30 Thread Jianshi Huang
the doc says it always shuffles and recommends using coalesce for reducing partitions. Anyone can help me here? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/
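
The distinction the question is getting at, as a two-line sketch:

    val fewer      = rdd.coalesce(100)       // narrow dependency, no shuffle when reducing partitions
    val rebalanced = rdd.repartition(1000)   // always performs a full shuffle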

Re: spark.shuffle.consolidateFiles seems not working

2014-07-30 Thread Jianshi Huang
your current settings > -n to set open files limit > (and other limits also) > > And I set -n to 10240. > > I see spark.shuffle.consolidateFiles helps by reusing open files. > (so I don't know to what extend does it help) > > Hope it helps. > > Larry > > &

Index calculation will cause integer overflow of numPartitions > 10362 in sortByKey

2014-07-30 Thread Jianshi Huang
I created this JIRA issue, somebody please pick it up? https://issues.apache.org/jira/browse/SPARK-2728 -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Index calculation will cause integer overflow of numPartitions > 10362 in sortByKey

2014-07-30 Thread Jianshi Huang
Looks like I cannot assign it. On Thu, Jul 31, 2014 at 11:56 AM, Larry Xiao wrote: > Hi > > Can you assign it to me? Thanks > > Larry > > > On 7/31/14, 10:47 AM, Jianshi Huang wrote: > >> I created this JIRA issue, somebody please pick it up? >> >>

Re: spark.shuffle.consolidateFiles seems not working

2014-07-31 Thread Jianshi Huang
ion can reduce the total shuffle file numbers, but the > concurrent opened file number is the same as basic hash-based shuffle. > > > > Thanks > > Jerry > > > > *From:* Jianshi Huang [mailto:jianshi.hu...@gmail.com] > *Sent:* Thursday, July 31, 2014 10:34 AM

Re: RDD.coalesce got compilation error

2014-07-31 Thread Jianshi Huang
​You could try something like the following: > ​ > val rdd: (WrapWithComparable[(Array[Byte], Array[Byte], Array[Byte])], > Externalizer[KeyValue]) = ... > val rdd_coalesced = rdd.coalesce(Math.min(1000, rdd.partitions.length), > false, null) > > > > > > Thanks > B

Re: spark.shuffle.consolidateFiles seems not working

2014-08-01 Thread Jianshi Huang
1.1. > > > On Thu, Jul 31, 2014 at 12:40 PM, Jianshi Huang > wrote: > >> I got the number from the Hadoop admin. It's 1M actually. I suspect the >> consolidation didn't work as expected? Any other reason? >> >> >> On Thu, Jul 31, 2014 at

Spark SQL (version 1.1.0-SNAPSHOT) should allow SELECT with duplicated columns

2014-08-06 Thread Jianshi Huang
ns in my select clause. I made the duplication on purpose for my code to parse correctly. I think we should allow users to specify duplicated columns as return value. -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: Out of memory on large RDDs

2014-08-27 Thread Jianshi Huang
ker(MapOutputTracker.scala:81) >>>>> ... 25 more >>>>> >>>>> >>>>> Before the error I can see this kind of logs: >>>>> >>>>> 14/03/11 14:29:40 INFO MapOutputTracker: Don't have map outputs for >&

.sparkrc for Spark shell?

2014-09-03 Thread Jianshi Huang
To make my shell experience merrier, I need to import several packages, and define implicit sparkContext and sqlContext. Is there a startup file (e.g. ~/.sparkrc) that Spark shell will load when it's started? Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & B

How to list all registered tables in a sql context?

2014-09-03 Thread Jianshi Huang
Hi, How can I list all registered tables in a sql context? -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: .sparkrc for Spark shell?

2014-09-04 Thread Jianshi Huang
I see. Thanks Prashant! Jianshi On Wed, Sep 3, 2014 at 7:05 PM, Prashant Sharma wrote: > Hey, > > You can use spark-shell -i sparkrc, to do this. > > Prashant Sharma > > > > > On Wed, Sep 3, 2014 at 2:17 PM, Jianshi Huang > wrote: >> To make my
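
A sketch of what such an init script (passed via `spark-shell -i <file>`, per the reply) might contain; the contents here are illustrative:

    import org.apache.spark.sql.SQLContext

    implicit val sqlContext = new SQLContext(sc)
    import sqlContext._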

Re: How to list all registered tables in a sql context?

2014-09-05 Thread Jianshi Huang
Err... there's no such feature? Jianshi On Wed, Sep 3, 2014 at 7:03 PM, Jianshi Huang wrote: > Hi, > > How can I list all registered tables in a sql context? > > -- > Jianshi Huang > > LinkedIn: jianshi > Twitter: @jshuang > Github & Blog: http://hua

Re: How to list all registered tables in a sql context?

2014-09-07 Thread Jianshi Huang
Thanks Tobias, I also found this: https://issues.apache.org/jira/browse/SPARK-3299 Looks like it's been working on. Jianshi On Mon, Sep 8, 2014 at 9:28 AM, Tobias Pfeiffer wrote: > Hi, > > On Sat, Sep 6, 2014 at 1:40 AM, Jianshi Huang > wrote: > >> Err... there
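
What SPARK-3299 (referenced above) eventually added, available from Spark 1.3 onwards:

    val registered: Array[String] = sqlContext.tableNames()
    sqlContext.tables().show()   // the same information as a DataFrame (tableName, isTemporary)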

[no subject]

2014-09-24 Thread Jianshi Huang
un(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:724) -- Jianshi Huang LinkedIn: jianshi Twit

Executor/Worker stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup and never finishes.

2014-09-24 Thread Jianshi Huang
cheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.la

Re:

2014-09-24 Thread Jianshi Huang
tch$$ > anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179) > > Can you reveal what HbaseRDDBatch.scala does ? > > Cheers > > On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang > wrote: > >> One of my big spark program always get stuck at 99% where a few tasks >&

Re:

2014-09-24 Thread Jianshi Huang
ark: have you checked region server (logs) to see if > region server had trouble keeping up ? > > Cheers > > On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang > wrote: > >> Hi Ted, >> >> It converts RDD[Edge] to HBase rowkey and columns and insert them to >>

Re:

2014-09-24 Thread Jianshi Huang
to be balanced...you might have some skewness in > row keys and one regionserver is under pressure...try finding that key and > replicate it using random salt > > On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang > wrote: > >> Hi Ted, >> >> It converts RDD[Edge]

Re:

2014-09-24 Thread Jianshi Huang
pressure...try finding that key >> and replicate it using random salt >> >> On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang >> wrote: >> >>> Hi Ted, >>> >>> It converts RDD[Edge] to HBase rowkey and columns and insert them to >>

Re:

2014-09-24 Thread Jianshi Huang
Looks like it's a HDFS issue, pretty new. https://issues.apache.org/jira/browse/HDFS-6999 Jianshi On Thu, Sep 25, 2014 at 12:10 AM, Jianshi Huang wrote: > Hi Ted, > > See my previous reply to Debasish, all region servers are idle. I don't > think it's caused by hot

Re:

2014-09-25 Thread Jianshi Huang
op 2.6.0 > > Any chance of deploying 2.6.0-SNAPSHOT to see if the problem goes away ? > > On Wed, Sep 24, 2014 at 10:54 PM, Jianshi Huang > wrote: > >> Looks like it's a HDFS issue, pretty new. >> >> https://issues.apache.org/jira/browse/HDFS-6999 >

How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
I cannot find it in the documentation. And I have a dozen dimension tables to (left) join... Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & Blog: http://huangjs.github.com/

Re: How to do broadcast join in SparkSQL

2014-09-28 Thread Jianshi Huang
wrote: > Have you looked at SPARK-1800 ? > > e.g. see sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala > Cheers > > On Sun, Sep 28, 2014 at 1:55 AM, Jianshi Huang > wrote: > >> I cannot find it in the documentation. And I have a dozen dimension >> tabl

Re: How to do broadcast join in SparkSQL

2014-10-07 Thread Jianshi Huang
ep 29, 2014 at 1:24 AM, Jianshi Huang wrote: > Yes, looks like it can only be controlled by the > parameter spark.sql.autoBroadcastJoinThreshold, which is a little bit weird > to me. > > How am I suppose to know the exact bytes of a table? Let me specify the > join algorit
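
A sketch of the two levers discussed in this thread; the threshold value is illustrative, and the explicit broadcast hint only arrived in later releases (1.5+):

    // Raise the size below which Spark SQL automatically broadcasts a table.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (50 * 1024 * 1024).toString)

    // Later API: force a broadcast of the dimension table explicitly.
    import org.apache.spark.sql.functions.broadcast
    val joined = facts.join(broadcast(dim), "key")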

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Jianshi Huang
at 2:18 PM, Jianshi Huang wrote: > Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged > into master? > > I cannot find spark.sql.hints.broadcastTables in latest master, but it's > in the following patch. > > > https://github.com/apache/spark/commit/7

Re: How to do broadcast join in SparkSQL

2014-10-10 Thread Jianshi Huang
MAT 'parquet.hive.DeprecatedParquetInputFormat' > |OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat' > |LOCATION '$file'""".stripMargin > sql(ddl) > setConf("spark.sql.hive.convertMetastoreParquet", "true"

SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
dozen dim tables (using HiveContext) and then map it to my class object. It failed a couple of times and now I cached the intermediate table and currently it seems working fine... no idea why until I found SPARK-3106 Cheers, -- Jianshi Huang LinkedIn: jianshi Twitter: @jshuang Github & B

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Hmm... it failed again, just lasted a little bit longer. Jianshi On Mon, Oct 13, 2014 at 4:15 PM, Jianshi Huang wrote: > https://issues.apache.org/jira/browse/SPARK-3106 > > I'm having the saming errors described in SPARK-3106 (no other types of > errors confirmed), running a

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
Turned out it was caused by this issue: https://issues.apache.org/jira/browse/SPARK-3923 Set spark.akka.heartbeat.interval to 100 solved it. Jianshi On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang wrote: > Hmm... it failed again, just lasted a little bit longer. > > Jianshi > >

Re: SPARK-3106 fixed?

2014-10-13 Thread Jianshi Huang
On Tue, Oct 14, 2014 at 4:36 AM, Jianshi Huang wrote: > Turned out it was caused by this issue: > https://issues.apache.org/jira/browse/SPARK-3923 > > Set spark.akka.heartbeat.interval to 100 solved it. > > Jianshi > > On Mon, Oct 13, 2014 at 4:24 PM, Jianshi Huang
