collect failed for unknown reason when deployed in standalone mode

2014-08-11 Thread wan...@testbird.com
Hi, I use Spark 0.9 to run a simple computation, but it failed when I use standalone mode. Code: val sc = new SparkContext(args(0), "BayesAnalysis", System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass).toSeq) val dataSet = sc.textFile(args(1)).map(_.split(",")

Re: collect failed for unknown reason when deployed in standalone mode

2014-08-11 Thread jeanlyn92
Hi wangyi: do you have more detailed information? I guess it may be caused by a jar that hasn't been uploaded to the workers, such as the one containing your main class. ./bin/spark-class org.apache.spark.deploy.Client launch [client-options] \ \ [application-options] application-jar-url: Path to a bundled jar inc
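A minimal sketch of passing the application jar explicitly when constructing the SparkContext, so the driver ships it to the workers (the master URL and jar path below are assumptions):

  val sc = new SparkContext(
    "spark://master-host:7077",                    // assumed standalone master URL
    "BayesAnalysis",
    System.getenv("SPARK_HOME"),
    Seq("/path/to/bayes-analysis-assembly.jar"))   // explicit path to the bundled application jar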

Spark RuntimeException due to Unsupported datatype NullType

2014-08-11 Thread rafeeq s
Hi, Spark RuntimeException due to Unsupported datatype NullType, when saving null primitives in a jsonRDD with .saveAsParquetFile(). Code: I am trying to store a jsonRDD into a Parquet file using saveAsParquetFile with the code below. JavaRDD javaRDD = ssc.sparkContext().parallelize(jsonData); Java
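A minimal sketch of how the NullType can arise (assuming a Spark build that has SQLContext.jsonRDD; the sample records are made up): a JSON field that is null in every record gets inferred as NullType, which the Parquet writer cannot map to a Parquet type.

  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  val jsonData = sc.parallelize(Seq(
    """{"id": 1, "name": null}""",
    """{"id": 2, "name": null}"""))
  val schemaRDD = sqlContext.jsonRDD(jsonData)
  schemaRDD.printSchema()                                   // "name" is inferred as NullType here
  schemaRDD.saveAsParquetFile("hdfs:///tmp/out.parquet")    // fails: Unsupported datatype NullType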

Re: error with pyspark

2014-08-11 Thread Ron Gonzalez
If you're running on Ubuntu, do ulimit -n, which gives the max number of allowed open files. You will have to change the value in /etc/security/limits.conf to something like 1, logout and log back in. Thanks, Ron Sent from my iPad > On Aug 10, 2014, at 10:19 PM, Davies Liu wrote: > >> On

spark sql (can it call impala udf)

2014-08-11 Thread marspoc
I want to run, in Spark SQL, the query below that I run in Impala calling a C++ UDF. pnl_flat_pp and pfh_flat are both partitioned Impala tables. Can Spark SQL do that? select a.pnl_type_code,percentile_udf_cloudera(cast(90.0 as double),sum(pnl_vector1),sum(pnl_vector2),sum(pnl_vect

[spark-streaming] kafka source and flow control

2014-08-11 Thread gpasquiers
Hi, I'm new to this mailing list as well as to spark-streaming. I'm using spark-streaming in a Cloudera environment to consume a Kafka source and store all data into HDFS. There is a great volume of data, and our issue is that the Kafka consumer is going too fast for HDFS; it fills up the storage (me

Re: CDH5, HiveContext, Parquet

2014-08-11 Thread chutium
The hive-thriftserver does not work with Parquet tables in the Hive metastore either; will this PR fix that too? No need to change any pom.xml? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CDH5-HiveContext-Parquet-tp11853p11880.html Sent from the Apache Spark U

Re: Low Performance of Shark over Spark.

2014-08-11 Thread vinay . kashyap
Hi Yana, I notice GC happening in every executor, around 400ms on average. Do you think it has a major impact on the overall query time? And regarding the memory for a single worker, I have tried distributing the memory by increasing the number of workers per node and divid

Re: How to directly insert values into SparkSQL tables?

2014-08-11 Thread chutium
No, Spark SQL cannot insert into or update a text file yet; it can only insert into Parquet files. But people.union(new_people).registerAsTable("people") could be an idea. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-direct-insert-vaules-into-SparkSQL-tab
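A minimal sketch of that union workaround (the Person case class and sample data are assumptions): build an RDD with the new rows, union it with the existing data, and re-register the result under the same table name.

  case class Person(name: String, age: Int)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.createSchemaRDD

  val people = sc.textFile("examples/src/main/resources/people.txt")
    .map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
  val newPeople = sc.parallelize(Seq(Person("Justin", 19)))

  // queries against "people" now see the appended rows
  people.union(newPeople).registerAsTable("people")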

RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers
Hi, We intend to apply other operations on the data later in the same Spark context, but our first step is to archive it. Our goal is something like this: Step 1: consume Kafka; Step 2: archive to HDFS AND send to step 3; Step 3: transform data; Step 4: save transformed data to HDFS as input for M/R

how to split RDD by key and save to different path

2014-08-11 Thread 诺铁
hi, I have googled and found a similar question without a good answer: http://stackoverflow.com/questions/24520225/writing-to-hadoop-distributed-file-system-multiple-times-with-spark In short, I would like to separate raw data and divide it by some key, for example create date, and put them in directory
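One common approach is to key each record by its create date and let a MultipleTextOutputFormat route each key to its own sub-directory under the output path; a minimal sketch (rawData and extractCreateDate are hypothetical):

  import org.apache.hadoop.io.NullWritable
  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  class DateKeyedOutput extends MultipleTextOutputFormat[Any, Any] {
    // write each record under <outputPath>/<key>/part-xxxxx
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
      key.toString + "/" + name
    // drop the key from the file contents, keeping only the value
    override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  }

  val keyed = rawData.map(line => (extractCreateDate(line), line))   // hypothetical helpers
  keyed.saveAsHadoopFile("/output/base/path",
    classOf[String], classOf[String], classOf[DateKeyedOutput])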

ERROR UserGroupInformation: Can't find user in Subject:

2014-08-11 Thread Dan Foisy
Hi I've installed Spark on a Windows 7 machine. I can get the SparkShell up and running but when running through the simple example in Getting Started, I get the following error (tried running as administrator as well) - any ideas? scala> val textFile = sc.textFile("README.md") 14/08/11 08:55:52

RE: [spark-streaming] kafka source and flow control

2014-08-11 Thread Gwenhael Pasquiers
I didn't reply to the last part of your message: my source is Kafka, and Kafka already acts as a buffer with a lot of space. So when I start my Spark job, there is a lot of data to catch up on (and it is critical not to lose any), but the Kafka consumer goes as fast as it can (and it's faster than my

looking for a definitive RDD.Pipe() example?

2014-08-11 Thread pjv0580
All, I have been searching the web for a few days looking for a definitive Spark/Spark Streaming RDD.pipe() example and cannot find one. Would it be possible to share with the group an example of both the Java/Scala side as well as the script (e.g. Python) side? Any help or response would be ver
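Not a definitive example, but a minimal sketch of the Scala side ("./wordcount.py" is a hypothetical script that must be available on every worker, e.g. shipped with sc.addFile and resolved with SparkFiles.get): each element of a partition is written to the external process's stdin, one per line, and each line the process writes to stdout becomes an element of the resulting RDD.

  val lines = sc.textFile("hdfs:///data/input")   // input path is an assumption
  val piped = lines.pipe("./wordcount.py")
  piped.take(10).foreach(println)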

Re: error with pyspark

2014-08-11 Thread Baoqiang Cao
Thanks Davies and Ron! It indeed was due to the ulimit issue. Thanks a lot! Best, Baoqiang Cao Blog: http://baoqiang.org Email: bqcaom...@gmail.com On Aug 11, 2014, at 3:08 AM, Ron Gonzalez wrote: > If you're running on Ubuntu, do ulimit -n, which gives the max number of > allowed open files.

Spark app slowing down and I'm unable to kill it

2014-08-11 Thread Grzegorz Białek
Hi, I ran a Spark application in local mode with the command: $SPARK_HOME/bin/spark-submit --driver-memory 1g with master set to local. After around 10 minutes of computing it started to slow down significantly, such that the next stage took around 50 minutes and the one after that was only 80% done after 5 hours, with CPU usage decre

Re: Spark app slowing down and I'm unable to kill it

2014-08-11 Thread Grzegorz Białek
I'm using Spark 1.0.0 On Mon, Aug 11, 2014 at 4:14 PM, Grzegorz Białek < grzegorz.bia...@codilime.com> wrote: > Hi, > > I ran Spark application in local mode with command: > $SPARK_HOME/bin/spark-submit --driver-memory 1g > with set master=local. > > After around 10 minutes of computing it sta

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
I got the same exception after the streaming job had run for a while. The ERROR message was complaining about a temp file not being found in the output folder. 14/08/11 08:05:08 ERROR JobScheduler: Error running job streaming job 140774430 ms.0 java.io.FileNotFoundException: File hdfs://hadoopc/u

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
The exception was thrown in the application master (Spark Streaming driver) and the job shut down after this exception. On Mon, Aug 11, 2014 at 10:29 AM, Chen Song wrote: > I got the same exception after the streaming job runs for a while, The > ERROR message was complaining about a temp file no

share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread chutium
Sharing/reusing RDDs is always useful for many use cases; is this possible via persisting an RDD on Tachyon? Such as off-heap persisting a named RDD into a given path (instead of /tmp_spark_tachyon/spark-xxx-xxx-xxx), or saveAsParquetFile on Tachyon. I tried to save a SchemaRDD on Tachyon: val parquet
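A minimal sketch of the two pieces being asked about (the Tachyon master URL, paths, and Record class are assumptions): OFF_HEAP persistence keeps blocks in Tachyon under the configured spark.tachyonStore location, while a SchemaRDD can also be written as Parquet directly to a tachyon:// path.

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)
  import sqlContext.createSchemaRDD

  case class Record(id: Int, name: String)
  val records = sc.parallelize(Seq(Record(1, "a"), Record(2, "b")))

  records.persist(StorageLevel.OFF_HEAP)   // off-heap (Tachyon) storage for the plain RDD

  // writing the SchemaRDD as a Parquet file onto a chosen Tachyon path
  records.saveAsParquetFile("tachyon://tachyon-master:19998/shared/records.parquet")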

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
Bill Did you get this resolved somehow? Anyone has any insight into this problem? Chen On Mon, Aug 11, 2014 at 10:30 AM, Chen Song wrote: > The exception was thrown out in application master(spark streaming driver) > and the job shut down after this exception. > > > On Mon, Aug 11, 2014 at 10

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
I've also been seeing similar stacktraces on Spark core (not streaming) and have a theory it's related to spark.speculation being turned on. Do you have that enabled by chance? On Mon, Aug 11, 2014 at 8:10 AM, Chen Song wrote: > Bill > > Did you get this resolved somehow? Anyone has any insigh

Parallelizing a task makes it freeze

2014-08-11 Thread sparkuser2345
I have an array 'dataAll' of key-value pairs where each value is an array of arrays. I would like to parallelize a task over the elements of 'dataAll' to the workers. In the dummy example below, the number of elements in 'dataAll' is 3, but in a real application it would be tens to hundreds. Without

Re: Can I share the RDD between multiprocess

2014-08-11 Thread coolfrood
Reviving this discussion again... I'm interested in using Spark as the engine for a web service. The SparkContext and its RDDs only exist in the JVM that started it. While RDDs are resilient, this means the context owner isn't resilient, so I may be able to serve requests out of a single "servic

Re: ClassNotFound for user class in uber-jar

2014-08-11 Thread lbustelo
I've seen this same exact problem too and I've been ignoring it, but I wonder if I'm losing data. Can anyone at least comment on this? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFound-for-user-class-in-uber-jar-tp10613p11902.html Sent from the Apac

ClassNotFound exception on class in uber.jar

2014-08-11 Thread lbustelo
Not sure if this problem reached the Spark guys because it shows in Nabble that "This post has NOT been accepted by the mailing list yet". http://apache-spark-user-list.1001560.n3.nabble.com/ClassNotFound-for-user-class-in-uber-jar-td10613.html#a11902 I'm resubmitting. Greetings, I'm currentl

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Chen Song
Andrew, that is a good finding. Yes, I have speculative execution turned on, because I saw tasks stalled on the HDFS client. If I turn off speculative execution, is there a way to circumvent the hanging task issue? On Mon, Aug 11, 2014 at 11:13 AM, Andrew Ash wrote: > I've also been seeing simil

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Cheng Lian
Since you were using hql(...), it's probably not related to the JDBC driver. But I failed to reproduce this issue locally with a single-node pseudo-distributed YARN cluster. Would you mind elaborating on the steps to reproduce this bug? Thanks. On Sun, Aug 10, 2014 at 9:36 PM, Cheng Lian wrote

Re: Spark SQL JDBC

2014-08-11 Thread Cheng Lian
Hi John, the JDBC Thrift server resides in its own build profile and needs to be enabled explicitly with ./sbt/sbt -Phive-thriftserver assembly. On Tue, Aug 5, 2014 at 4:54 AM, John Omernik wrote: > I am using spark-1.1.0-SNAPSHOT right now and trying to get familiar with > the JDBC thrift serve

Re: increase parallelism of reading from hdfs

2014-08-11 Thread Paul Hamilton
Hi Chen, You need to set the max input split size so that the underlying hadoop libraries will calculate the splits appropriately. I have done the following successfully: val job = new Job() FileInputFormat.setMaxInputSplitSize(job, 12800L) And then use job.getConfiguration when creating a
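Expanding that suggestion into a runnable sketch (the input path is an assumption), the configured Job is passed to newAPIHadoopFile so the split-size cap takes effect:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.Job
  import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

  val job = new Job()
  FileInputFormat.setMaxInputSplitSize(job, 12800L)   // cap each split, creating more partitions

  val rdd = sc.newAPIHadoopFile(
    "hdfs:///data/input",
    classOf[TextInputFormat],
    classOf[LongWritable],
    classOf[Text],
    job.getConfiguration)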

Re: Running a task once on each executor

2014-08-11 Thread RodrigoB
Hi Christopher, I also need a single function call at the node level. Your suggestion makes sense as a solution to the requirement, but it still feels like a workaround; this check will get called on every row... Also, having static members and methods created specially on a multi-t

Random Forest implementation in MLib

2014-08-11 Thread Sameer Tilak
Hi All, I read on the mailing list that a random forest implementation was on the roadmap. I wanted to check on its status. We are currently using Weka and would like to move over to MLlib for performance.

Re: Can I share the RDD between multiprocess

2014-08-11 Thread Ruchir Jha
Look at: https://github.com/ooyala/spark-jobserver On Mon, Aug 11, 2014 at 11:48 AM, coolfrood wrote: > Reviving this discussion again... > > I'm interested in using Spark as the engine for a web service. > > The SparkContext and its RDDs only exist in the JVM that started it. While > RDDs are

Re: [MLLib]:choosing the Loss function

2014-08-11 Thread SK
Hi, Thanks for the reference to the LBFGS optimizer. I tried to use the LBFGS optimizer, but I am not able to pass it as an input to the LogisticRegression model for binary classification. After studying the code in mllib/classification/LogisticRegression.scala, it appears that the only impleme

mllib style

2014-08-11 Thread Koert Kuipers
I was just looking at ALS (mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala). Is there any need for all the variables to be vars and to have all these setters around? It just leads to so much clutter. If you really want them to be vars, it is safe in Scala to make them public (scala

Failed jobs show up as succeeded in YARN?

2014-08-11 Thread Shay Rojansky
Spark 1.0.2, Python, Cloudera 5.1 (Hadoop 2.3.0) It seems that Python jobs I'm sending to YARN show up as succeeded even if they failed... Am I doing something wrong, is this a known issue? Thanks, Shay

spark.files.userClassPathFirst=true Not Working Correctly

2014-08-11 Thread DNoteboom
Currently my code uses commons-pool version 1.6 but Spark uses commons-pool version 1.5.4. This causes an error when I try to access a method that is visible in 1.6 but not in 1.5.4. I tried to fix this by setting userClassPathFirst=true (and I verified that this was set correctly in http://:4040

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-11 Thread Marcelo Vanzin
Could you share what's the cluster manager you're using and exactly where the error shows up (driver or executor)? A quick look reveals that Standalone and Yarn use different options to control this, for example. (Maybe that already should be a bug.) On Mon, Aug 11, 2014 at 12:24 PM, DNoteboom w

Re: Compile spark code with idea succesful but run SparkPi error with "java.lang.SecurityException"

2014-08-11 Thread Ron's Yahoo!
Not sure what your environment is, but this happened to me before because I had a couple of servlet-api jars in the path which did not match. I was building a system that programmatically submitted jobs, so I had my own jars that conflicted with those of Spark. The solution is to run mvn dependency:tree

Re: increase parallelism of reading from hdfs

2014-08-11 Thread Chen Song
Thanks Paul. I will give a try. On Mon, Aug 11, 2014 at 1:11 PM, Paul Hamilton wrote: > Hi Chen, > > You need to set the max input split size so that the underlying hadoop > libraries will calculate the splits appropriately. I have done the > following successfully: > > val job = new Job() > F

RE: Spark on an HPC setup

2014-08-11 Thread Sidharth Kashyap
Hi Jeremy, Thanks for the reply. We got Spark on our setup after a similar script was brought up to work with LSF. Really appreciate your help. Will keep in touch on Twitter Thanks,@sidkashyap :) From: freeman.jer...@gmail.com Subject: Re: Spark on an HPC setup Date: Thu, 29 May 2014 00:37:54 -0

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-11 Thread DNoteboom
I'm currently running on my local machine in standalone mode. The error shows up in my code when I am closing resources using TaskContext.addOnCompleteCallback. However, the cause of this error is a faulty classLoader, which must occur in the Executor in the function createClassLoader.

Gathering Information about Standalone Cluster

2014-08-11 Thread Wonha Ryu
Hey all, Is there any kind of API to access information about resources, executors, and applications in a standalone cluster displayed in the web UI? Currently I'm using 1.0.x, but interested in experimenting with bleeding edge. Thanks, Wonha

Re: Random Forest implementation in MLib

2014-08-11 Thread DB Tsai
We have an open-sourced Random Forest at Alpine Data Labs with the Apache license. We're also trying to have it merged into Spark MLlib now. https://github.com/AlpineNow/alpineml It's been tested a lot, and the accuracy and training time benchmark is great. There could be some bugs here and there

Re: [MLLib]:choosing the Loss function

2014-08-11 Thread Burak Yavuz
Hi, // Initialize the optimizer using logistic regression as the loss function with L2 regularization val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater()) // Set the hyperparameters lbfgs.setMaxNumIterations(numIterations).setRegParam(regParam).setConvergenceTol(tol).setNumCorre
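A minimal sketch completing that snippet (the hyperparameter values and the data/initialWeights inputs are assumptions):

  import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}

  // logistic regression as the loss function with L2 regularization
  val lbfgs = new LBFGS(new LogisticGradient(), new SquaredL2Updater())
  lbfgs.setMaxNumIterations(100)
    .setRegParam(0.1)
    .setConvergenceTol(1e-4)
    .setNumCorrections(10)

  // data is assumed to be an RDD[(Double, Vector)] of (label, features) pairs,
  // and initialWeights a Vector of matching dimension
  val weights = lbfgs.optimize(data, initialWeights)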

Re: Job ACL's on SPark

2014-08-11 Thread Manoj kumar
Hi Friends, Any response on this? I looked into the documentation but could not find any information. --Manoj On Fri, Aug 8, 2014 at 6:56 AM, Manoj kumar wrote: > Hi Team, > > > > Do we have Job ACL's for Spark which is similar to Hadoop Job ACL’s. > > > Where I can restrict who can submit the Job

Re: [MLLib]:choosing the Loss function

2014-08-11 Thread DB Tsai
Hi SK, I'm working on a PR of adding a logistic regression interface with LBFGS. It will be merged in Spark 1.1 release, I hope. https://github.com/apache/spark/pull/1862 Before merging, you can just copy the code into your application to use it. I'm also working on another PR which automatically

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Jenny Zhao
You can reproduce this issue with the following steps (assuming you have a YARN cluster + Hive 12): 1) using the hive shell, create a database, e.g.: create database ttt 2) write a simple Spark SQL program: import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ import org.apache
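A minimal sketch of step 2 as described (only the database name ttt comes from the steps above; the rest is an assumed skeleton):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  object NonDefaultDbRepro {
    def main(args: Array[String]) {
      val sc = new SparkContext(new SparkConf().setAppName("NonDefaultDbRepro"))
      val hiveContext = new HiveContext(sc)
      hiveContext.hql("USE ttt")   // switching to the non-default database is where the failure shows up
      hiveContext.hql("SHOW TABLES").collect().foreach(println)
    }
  }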

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Yin Huai
Hi Jenny, How's your metastore configured for both Hive and Spark SQL? Which metastore mode are you using (based on https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin )? Thanks, Yin On Mon, Aug 11, 2014 at 6:15 PM, Jenny Zhao wrote: > > > you can reproduce this issue

Re: share/reuse off-heap persisted (tachyon) RDD in SparkContext or saveAsParquetFile on tachyon in SQLContext

2014-08-11 Thread Haoyuan Li
Is the speculative execution enabled? Best, Haoyuan On Mon, Aug 11, 2014 at 8:08 AM, chutium wrote: > sharing /reusing RDDs is always useful for many use cases, is this possible > via persisting RDD on tachyon? > > such as off heap persist a named RDD into a given path (instead of > /tmp_spar

Re: Spark sql failed in yarn-cluster mode when connecting to non-default hive database

2014-08-11 Thread Jenny Zhao
Thanks Yin! Here is my hive-site.xml, which I copied from $HIVE_HOME/conf. I didn't experience problems connecting to the metastore through Hive, which uses DB2 as the metastore database. hive.hwi.listen.port hive.querylog.location /var/ibm/biginsights/hive/query/${user.name}

Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Tobias Pfeiffer
Hi, On Mon, Aug 11, 2014 at 9:41 PM, Gwenhael Pasquiers < gwenhael.pasqui...@ericsson.com> wrote: > > We intend to apply other operations on the data later in the same spark > context, but our first step is to archive it. > > > > Our goal is somth like this > > Step 1 : consume kafka > Step 2 : ar

Spark streaming error - Task not serializable

2014-08-11 Thread Xuri Nagarin
Hi, I have some quick/dirty code here running in Spark 1.0.0 (CDH 5.1, Spark in Yarn cluster mode) import org.apache.spark.streaming.StreamingContext import org.apache.spark.streaming.StreamingContext._ import org.apache.spark.streaming.Seconds import org.apache.spark.streaming.kafka._ import kaf

Re: [spark-streaming] kafka source and flow control

2014-08-11 Thread Xuri Nagarin
In general (and I am prototyping), I have a better idea :) - Consume Kafka in Spark from topic-A - Transform the data in Spark (normalize, enrich, etc.) - Feed it back to Kafka (in a different topic-B) - Have Flume->HDFS (for M/R, Impala, Spark batch) or Spark Streaming or any other compute framewor

Re: mllib style

2014-08-11 Thread Matei Zaharia
The public API of MLlib is meant to be Java-friendly, so that's why it has setters and getters with Java-like names. Internal APIs don't have to be. Matei On August 11, 2014 at 12:08:20 PM, Koert Kuipers (ko...@tresata.com) wrote: i was just looking at ALS (mllib/src/main/scala/org/apache/spar

Using very large files for KMeans training -- cluster centers size?

2014-08-11 Thread durin
I'm trying to apply KMeans training to some text data, which consists of lines that each contain something between 3 and 20 words. For that purpose, all unique words are saved in a dictionary. This dictionary can become very large as no hashing etc. is done, but it should spill to disk in case it d

Re: Initial job has not accepted any resources

2014-08-11 Thread ldmtwo
I see this error too. I have never found a fix and I've been working on this for a few months. For me, I have 4 nodes with 46GB and 8 cores each. If I change the executor to use 8GB, it fails. If I use 6GB, it works. I request 2 cores only. On another cluster, I have different limits. My workloa

Re: java.lang.StackOverflowError when calling count()

2014-08-11 Thread randylu
hi, TD. I also fell into the trap of long lineage, and your suggestions do work well. But I don't understand why the long lineage can cause a stack overflow, and where it takes effect. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-StackOverflowError-whe
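The overflow happens because evaluating (and serializing) an RDD walks its whole dependency chain recursively, so a lineage thousands of transformations deep exhausts the stack; checkpointing truncates that chain. A minimal sketch (the checkpoint directory and loop bounds are assumptions):

  sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

  var rdd = sc.parallelize(1 to 1000000)
  for (i <- 1 to 1000) {
    rdd = rdd.map(_ + 1)
    if (i % 100 == 0) {
      rdd.checkpoint()
      rdd.count()   // force materialization so the checkpoint actually truncates the lineage
    }
  }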

Benchmark on physical Spark cluster

2014-08-11 Thread Mozumder, Monir
I am trying to get some workloads or benchmarks for running on a physical spark cluster and find relative speedups on different physical clusters. The instructions at https://databricks.com/blog/2014/02/12/big-data-benchmark.html uses Amazon EC2. I was wondering if anyone got other benchmarks f

Re: Initial job has not accepted any resources

2014-08-11 Thread 诺铁
just as Marcelo Vanzin said, there are two possible reasons for this problem. I solved reason 2 several days ago. My process is: ssh to one of the worker nodes, read its log output, and find a line that says "Remoting started"; after that line there should be some line like "connecting to x". MAKE SURE

KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-11 Thread Ge, Yao (Y.)
I am trying to train a KMeans model with sparse vector with Spark 1.0.1. When I run the training I got the following exception: java.lang.IllegalArgumentException: requirement failed at scala.Predef$.require(Predef.scala:221) at org.apache.spark.mllib.util.MLUtils$.

Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread ZHENG, Xu-dong
Hi all, We are trying to use Spark MLlib to train super large data (100M features and 5B rows). The input data in HDFS has ~26K partitions. By default, MLlib will create a task for every partition at each iteration. But because our dimensions are also very high, such a large number of tasks will inc
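One way the per-iteration task count is typically reduced is to coalesce the input before training; a minimal sketch (the path and target partition count are assumptions):

  val raw = sc.textFile("hdfs:///training/data")            // ~26K partitions by default
  val fewer = raw.coalesce(1000, shuffle = false).cache()   // far fewer tasks per iteration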

Transform RDD[List]

2014-08-11 Thread Kevin Jung
Hi, it may be a simple question, but I cannot figure out the most efficient way. There is an RDD containing lists: RDD ( List(1,2,3,4,5) List(6,7,8,9,10) ) I want to transform this to RDD ( List(1,6) List(2,7) List(3,8) List(4,9) List(5,10) ) and I want to achieve this without using the collect metho
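One collect-free sketch (assuming all lists have equal length): tag every element with its position inside its list and with its list's index, group by position, and rebuild the lists.

  val lists = sc.parallelize(Seq(List(1, 2, 3, 4, 5), List(6, 7, 8, 9, 10)))

  val transposed = lists
    .zipWithIndex()                                          // (list, listIdx)
    .flatMap { case (list, listIdx) =>
      list.zipWithIndex.map { case (v, pos) => (pos, (listIdx, v)) }
    }
    .groupByKey()
    .map { case (_, values) => values.toList.sortBy(_._1).map(_._2) }

  // transposed contains List(1,6), List(2,7), List(3,8), List(4,9), List(5,10)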

Re: Transform RDD[List]

2014-08-11 Thread Soumya Simanta
Try something like this. scala> val a = sc.parallelize(List(1,2,3,4,5)) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at :12 scala> val b = sc.parallelize(List(6,7,8,9,10)) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at :12 scala>

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread Jiusheng Chen
How about increasing the HDFS file extent size? E.g. if the current value is 128M, we make it 512M or bigger. On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong wrote: > Hi all, > > We are trying to use Spark MLlib to train super large data (100M features > and 5B rows). The input data in HDFS has ~26K partit

Support for ORC Table in Shark/Spark

2014-08-11 Thread vinay . kashyap
Hi all, Is it possible to use a table with the ORC format in Shark version 0.9.1 with Spark 0.9.2 and Hive version 0.12.0? I have tried creating the ORC table in Shark using the query below: create table orc_table (x int, y string) stored as orc The create table works, but when I try to insert values fr

How to save mllib model to hdfs and reload it

2014-08-11 Thread XiaoQinyu
Hello: I want to know, if I use historical data to train a model and I want to use this model in another app, what should I do? Should I save the model to disk, and then load it from disk when I use it? But I don't know how to save the MLlib model and reload it. I will be very pleased if anyon
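MLlib at this point has no built-in save/load for models, so one workaround is plain Java serialization to HDFS; a minimal sketch (the paths and the SVMModel type are assumptions):

  import java.io.{ObjectInputStream, ObjectOutputStream}
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.mllib.classification.SVMModel

  def saveModel(model: SVMModel, path: String): Unit = {
    val fs = FileSystem.get(new Configuration())
    val out = new ObjectOutputStream(fs.create(new Path(path)))
    try out.writeObject(model) finally out.close()
  }

  def loadModel(path: String): SVMModel = {
    val fs = FileSystem.get(new Configuration())
    val in = new ObjectInputStream(fs.open(new Path(path)))
    try in.readObject().asInstanceOf[SVMModel] finally in.close()
  }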

Re: Mllib : Save SVM model to disk

2014-08-11 Thread XiaoQinyu
Have you solved this problem? And could you share how to save a model to HDFS and reload it? Thanks XiaoQinyu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mllib-Save-SVM-model-to-disk-tp74p11954.html Sent from the Apache Spark User List mailing list arc

Re: Is there any way to control the parallelism in LogisticRegression

2014-08-11 Thread ZHENG, Xu-dong
I think this has the same effect and issue with #1, right? On Tue, Aug 12, 2014 at 1:08 PM, Jiusheng Chen wrote: > How about increase HDFS file extent size? like current value is 128M, we > make it 512M or bigger. > > > On Tue, Aug 12, 2014 at 11:46 AM, ZHENG, Xu-dong > wrote: > >> Hi all, >>

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
Not sure which stalled HDFS client issue you're referring to, but there was one fixed in Spark 1.0.2 that could help you out -- https://github.com/apache/spark/pull/1409. I've still seen one related to Configuration objects not being threadsafe though, so you'd still need to keep speculation on to

Re: Transform RDD[List]

2014-08-11 Thread Kevin Jung
Hi ssimanta. The first line creates an RDD[Int], not an RDD[List[Int]]. In the case of List, I cannot zip all list elements in the RDD like a.zip(b), and I cannot use only Tuple2 because the real-world RDD has more List elements in the source RDD. So I guess the expected result depends on the count of the original Lists. T

Re: saveAsTextFiles file not found exception

2014-08-11 Thread Andrew Ash
Hi Chen, Please see the bug I filed at https://issues.apache.org/jira/browse/SPARK-2984 with the FileNotFoundException on _temporary directory issue. Andrew On Mon, Aug 11, 2014 at 10:50 PM, Andrew Ash wrote: > Not sure which stalled HDFS client issue your'e referring to, but there > was one

Re: ClassNotFound exception on class in uber.jar

2014-08-11 Thread Akhil Das
This is how I used to do it: // Create a list of jars List jars = Lists.newArrayList("/home/akhld/mobi/localcluster/x/spark-0.9.1-bin-hadoop2/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.2.0.jar", "ADD-All-The-Jars-Here");

Re: KMeans - java.lang.IllegalArgumentException: requirement failed

2014-08-11 Thread Sean Owen
It sounds like your data does not all have the same dimension? that's a decent guess. Have a look at the assertions in this method. On Tue, Aug 12, 2014 at 4:44 AM, Ge, Yao (Y.) wrote: > I am trying to train a KMeans model with sparse vector with Spark 1.0.1. > > When I run the training I got the
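A minimal sketch of that point (the sizes and values are made up): every sparse vector passed to KMeans must declare the same overall dimensionality, otherwise the requirement check fails.

  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  val dim = 1000   // one consistent dimensionality for the whole dataset
  val data = sc.parallelize(Seq(
    Vectors.sparse(dim, Array(0, 5), Array(1.0, 2.0)),
    Vectors.sparse(dim, Array(3, 999), Array(4.0, 1.0))))

  val model = KMeans.train(data, k = 2, maxIterations = 10)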

Re: set SPARK_LOCAL_DIRS issue

2014-08-11 Thread Andrew Ash
// assuming Spark 1.0 Hi Baoqiang, In my experience for the standalone cluster you need to set SPARK_WORKER_DIR not SPARK_LOCAL_DIRS to control where shuffle files are written. I think this is a documentation issue that could be improved, as http://spark.apache.org/docs/latest/spark-standalone.h

Serialization with com.twitter.chill.MeatLocker

2014-08-11 Thread jerryye
Hi, I've been trying to use com.twitter.chill.MeatLocker to serialize a third-party class. So far I'm having no luck and I'm still getting the dreaded Task not Serializable error for org.ahocorasick.trie.Trie. Am I doing something obviously wrong? Below is my test code that is failing: import co
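For reference, a minimal sketch of the usual MeatLocker pattern (the Trie setup and paths are assumptions): wrap the non-serializable object so it can be shipped inside the closure, and unwrap it with .get on the executors.

  import com.twitter.chill.MeatLocker
  import org.ahocorasick.trie.Trie

  val trie = new Trie()
  trie.addKeyword("spark")
  val boxedTrie = MeatLocker(trie)   // MeatLocker handles serializing the wrapped Trie

  val hits = sc.textFile("hdfs:///data/text")
    .map(line => boxedTrie.get.parseText(line).size())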