Re:

2014-09-24 Thread Jianshi Huang
Looks like it's an HDFS issue, pretty new. https://issues.apache.org/jira/browse/HDFS-6999 Jianshi On Thu, Sep 25, 2014 at 12:10 AM, Jianshi Huang wrote: > Hi Ted, > > See my previous reply to Debasish, all region servers are idle. I don't > think it's caused by hotspotting. > > Besides, only 6

RE: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread jishnu.prathap
Hi, sorry for the repeated mails. My post was not accepted by the mailing list due to some problem in postmas...@wipro.com, so I had to send it manually. Still it was not visible for half an hour, so I retried. But later all the posts were visible. I deleted it from the page but it was already delivered

Using one sql query's result inside another sql query

2014-09-24 Thread twinkle sachdeva
Hi, I am using Hive Context to fire SQL queries inside Spark. I have created a schemaRDD (let's call it cachedSchema) inside my code. If I fire a SQL query (Query 1) on top of it, then it works. But if I refer to Query 1's result inside another SQL query, that fails. Note that I have already regi
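A minimal sketch of the usual fix, assuming Spark 1.1's HiveContext and hypothetical table names: register the first query's SchemaRDD as a temporary table so the second query can refer to it by name.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.hive.HiveContext

  val sc = new SparkContext(new SparkConf().setAppName("NestedQuerySketch"))
  val hiveContext = new HiveContext(sc)

  // Query 1: build the intermediate SchemaRDD and register it under a name.
  val cachedSchema = hiveContext.sql("SELECT key, value FROM src WHERE key < 100")
  cachedSchema.registerTempTable("cached_schema")

  // Query 2 refers to Query 1's result through the registered table name.
  val result = hiveContext.sql("SELECT key, COUNT(*) FROM cached_schema GROUP BY key")
  result.collect().foreach(println)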

Re: MLUtils.loadLibSVMFile error

2014-09-24 Thread Liquan Pei
Hi Sameer, I think there are two things that you can do. 1) What is your current driver-memory or executor-memory? You can try to increase driver-memory or executor-memory to see if that solves your problem. 2) How many features are in your data? Too many features may create a large number of temp obje

Re: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread jishnu.prathap
No, I am not passing any argument. I am getting this error while starting the Master. I am able to run the same Spark binary on another machine (Ubuntu). The information contained in this electronic message and any attachments to this message are intended for the excl

RE: MLUtils.loadLibSVMFile error

2014-09-24 Thread Sameer Tilak
Hi All, I was able to solve this formatting issue. However, I have another question. When I do the following, val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "structured/results/data.txt") I get a java.lang.OutOfMemoryError: GC overhead limit exceeded error. Is it possible to specify th

YARN ResourceManager and Hadoop NameNode Web UI not visible in port 8088, port 50070

2014-09-24 Thread Raghuveer Chanda
Hi, I'm running a Spark job in a YARN cluster, but I'm not able to see the web interface of the YARN ResourceManager and the Hadoop NameNode Web UI on port 8088 and port 50070, or the Spark stages. Only the Spark UI on port 18080 is visible. I got the URLs from Cloudera but maybe due to some default option

Re: Spark SQL use of alias in where clause

2014-09-24 Thread Yanbo Liang
Maybe it's the way SQL works. The select part is executed after the where filter is applied, so you cannot use an alias declared in the select part in the where clause. Hive and Oracle behave the same as Spark SQL. 2014-09-25 8:58 GMT+08:00 Du Li : > Hi, > > The following query does not work in Shark n
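A hedged sketch of the two usual workarounds, assuming an existing HiveContext named hiveContext and the src table from the thread: either repeat the expression in the WHERE clause, or compute the alias inside a subquery so the outer WHERE can see it.

  // Workaround 1: repeat the expression in the WHERE clause.
  hiveContext.sql(
    "SELECT key, value, concat(key, value) AS combined FROM src " +
    "WHERE concat(key, value) LIKE '11%'")

  // Workaround 2: alias inside a subquery, filter outside.
  hiveContext.sql(
    "SELECT key, value, combined FROM " +
    "(SELECT key, value, concat(key, value) AS combined FROM src) t " +
    "WHERE combined LIKE '11%'")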

Re: What is a pre built package of Apache Spark

2014-09-24 Thread Denny Lee
This seems similar to a related Windows issue concerning Python where pyspark couldn't find the Python executable because the PYTHONSTARTUP environment variable wasn't set - by any chance could this be related? On Wed, Sep 24, 2014 at 7:51 PM, christy <760948...@qq.com> wrote: > Hi I have installed standalone on win7

Re: How to use FlumeInputDStream in spark cluster?

2014-09-24 Thread julyfire
I have tested the example code FlumeEventCount on a standalone cluster, and this is still a problem in Spark 1.1.0, the latest version up to now. Have you solved this issue in your way? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDSt

Re: What is a pre built package of Apache Spark

2014-09-24 Thread christy
Hi I have installed standalone on win7 (src, not the pre-built version); after the sbt assembly, I can run spark-shell successfully but the Python shell does not work. $ ./bin/pyspark ./bin/pyspark: line 111: exec: python: not found have you solved your problem? Thanks, Christy -- View this message in c

Re: sortByKey trouble

2014-09-24 Thread Zhan Zhang
Try this: import org.apache.spark.SparkContext._ Thanks. Zhan Zhang On Sep 24, 2014, at 6:13 AM, david wrote: > thanks > > I've already tried this solution but it does not compile (in Eclipse) > > I'm surprised to see that in spark-shell, sortByKey works fine on 2 > solutions : > > (String,
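For reference, a minimal self-contained sketch showing why the import matters: in Spark 1.x it pulls in the implicit conversion that gives pair RDDs the sortByKey method.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // implicit conversion to OrderedRDDFunctions

  val sc = new SparkContext(new SparkConf().setAppName("SortByKeySketch").setMaster("local[2]"))
  val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3)))
  pairs.sortByKey().collect().foreach(println)   // compiles thanks to the import above
  sc.stop()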

Re: rsync problem

2014-09-24 Thread Tobias Pfeiffer
Hi, I assume you unintentionally did not reply to the list, so I'm adding it back to CC. How do you submit your job to the cluster? Tobias On Thu, Sep 25, 2014 at 2:21 AM, rapelly kartheek wrote: > How do I find out whether a node in the cluster is a master or slave?? > Till now I was thinki

Spark SQL use of alias in where clause

2014-09-24 Thread Du Li
Hi, The following query does not work in Shark nor in the new Spark SQLContext or HiveContext. SELECT key, value, concat(key, value) as combined from src where combined like '11%'; The following tweak of syntax works fine although a bit ugly. SELECT key, value, concat(key, value) as combined fr

Re: Spark SQL CLI

2014-09-24 Thread Gaurav Tiwari
Thanks, will give it a try, appreciate your help. Regards, Gaurav On Sep 23, 2014 1:52 PM, "Michael Armbrust" wrote: > A workaround for now would be to save the JSON as parquet and then create a > metastore parquet table. Using parquet will be much faster for repeated > querying. This function m
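A minimal sketch of the JSON-to-Parquet workaround with the Spark 1.1 SQLContext API; the paths are hypothetical, and creating the metastore table itself would be a separate Hive DDL step not shown here.

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)   // assuming an existing SparkContext `sc`
  val json = sqlContext.jsonFile("hdfs:///data/events.json")
  json.saveAsParquetFile("hdfs:///data/events.parquet")

  // Repeated queries then hit the (much faster) Parquet copy.
  val parquet = sqlContext.parquetFile("hdfs:///data/events.parquet")
  parquet.registerTempTable("events")
  sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)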

experimental solution to nesting RDDs

2014-09-24 Thread ldmtwo
I want to share and brainstorm on an experiment before I try it all the way. I hope that Spark contributors can comment. To be clear, it is not my intent to use MLLib where I get partial control on the work being done and I'm not seeing it scale well enough yet. I have fundamental questions about S

Has anyone seen java.nio.ByteBuffer.wrap(ByteBuffer.java:392)

2014-09-24 Thread Steve Lewis
java.lang.NullPointerException at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurren

Re: return probability \ confidence instead of actual class

2014-09-24 Thread Sunny Khatri
For multi-class you can use the same SVMWithSGD (for binary classification) with a One-vs-All approach, constructing the respective training corpora with one class i as positive samples and the rest of the classes as negative ones, and then use the same method provided by Aris as a measure of how far Cl
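A rough one-vs-rest sketch along those lines, assuming MLlib 1.1 and an RDD[LabeledPoint] whose labels run from 0.0 to numClasses - 1; the helper names are made up for illustration.

  import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.rdd.RDD

  def trainOneVsRest(data: RDD[LabeledPoint], numClasses: Int): Array[SVMModel] =
    (0 until numClasses).map { c =>
      // Relabel: class c becomes the positive class, everything else negative.
      val binary = data.map(p => LabeledPoint(if (p.label == c.toDouble) 1.0 else 0.0, p.features))
      val model = SVMWithSGD.train(binary, 100)
      model.clearThreshold()   // keep raw margins so scores are comparable across models
      model
    }.toArray

  // Predict by picking the class whose model reports the largest margin.
  def predictOneVsRest(models: Array[SVMModel], features: Vector): Double =
    models.zipWithIndex.maxBy { case (m, _) => m.predict(features) }._2.toDouble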

Processing multiple request in cluster

2014-09-24 Thread Subacini B
Hi All, how can I run multiple requests concurrently on the same cluster? I have a program using *spark streaming context* which reads *streaming data* and writes it to HBase. It works fine; the problem is when multiple requests are submitted to the cluster, only the first request is processed as the entire clu

Re: return probability \ confidence instead of actual class

2014-09-24 Thread Aris
Greetings Adamantios Korais, if that is indeed your name. Just to follow up on Liquan, you might be interested in removing the thresholds, and then treating the predictions as a probability from 0..1 inclusive. SVM with the linear kernel is a straightforward linear classifier -- so you with the m

Re: Anybody built the branch for Adaptive Boosting, extension to MLlib by Manish Amde?

2014-09-24 Thread Aris
Hi Manish! Thanks for the reply and the explication on why the branches won't compile -- that makes perfect sense. You mentioned making the branch compatible with the latest master; could you share some more details? Which branch do you mean -- is it on your GitHub? And would I just be able to do

Re: How to sort rdd filled with existing data structures?

2014-09-24 Thread Liquan Pei
You only need to define an ordering of student, no need to modify the class definition of student. It's like a Comparator class in java. Currently, you have to map the rdd to sort by value. Liquan On Wed, Sep 24, 2014 at 9:52 AM, Sean Owen wrote: > See the scaladoc for how to define an implici

Re: MLUtils.loadLibSVMFile error

2014-09-24 Thread Liquan Pei
Hi Sameer, This seems to be a file format issue. Can you make sure that your data satisfies the format? Each line of libsvm format is as follows: <label> <index1>:<value1> <index2>:<value2> ... Thanks, Liquan On Wed, Sep 24, 2014 at 3:02 PM, Sameer Tilak wrote: > Hi All, > > > When I try to load dataset using MLUtils.loadLibSV
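A minimal loading sketch (the path is hypothetical; `sc` is an existing SparkContext). Example lines in that format would look like "1 1:0.5 3:1.2 7:0.9" and "0 2:1.0 4:0.3", with 1-based, increasing indices.

  import org.apache.spark.mllib.util.MLUtils

  val examples = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm_data.txt")
  println(examples.first())   // LabeledPoint(label, sparse feature vector)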

Re: Question About Submit Application

2014-09-24 Thread Marcelo Vanzin
Sounds like "spark-01" is not resolving correctly on your machine (or is the wrong address). Can you ping "spark-01" and does that reach the VM where you set up the Spark Master? On Wed, Sep 24, 2014 at 1:12 PM, danilopds wrote: > Hello, > I'm learning about Spark Streaming and I'm really excited

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
If you launched the job in yarn-cluster mode, the tracking URL is printed on the output of the launched process. That will lead you to the Spark UI once the job is running. If you're using CM, you can reach the same link by clicking on the "Resource Manager UI" link on your Yarn service, then find

Bug in JettyUtils?

2014-09-24 Thread Jim Donahue
I've been seeing a failure in JettyUtils.createStaticHandler when running Spark 1.1 programs from Eclipse which I hadn't seen before. The problem seems to be line 129 Option(Utils.getSparkClassLoader.getResource(resourceBase)) match { ... getSparkClassLoader returns NULL, and it's all

MLUtils.loadLibSVMFile error

2014-09-24 Thread Sameer Tilak
Hi All, When I try to load dataset using MLUtils.loadLibSVMFile, I have the following problem. Any help will be greatly appreciated! Code snippet: import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.util.MLUtils import org.apache.spark.rdd.RDD import org.ap

Re: java.lang.OutOfMemoryError while running SVD MLLib example

2014-09-24 Thread Shailesh Birari
Note, the data is random numbers (double). Any suggestions/pointers will be highly appreciated. Thanks, Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-while-running-SVD-MLLib-example-tp14972p15083.html Sent from the A

Re: mllib performance on mesos cluster

2014-09-24 Thread Sudha Krishna
Setting spark.mesos.coarse=true helped reduce the time on the mesos cluster from 17 min to around 6 min. The scheduler delay per task reduced from 40 ms to around 10 ms. thanks On Mon, Sep 22, 2014 at 12:36 PM, Xiangrui Meng wrote: > 1) MLlib 1.1 should be faster than 1.0 in general. What's th

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Another update, actually it just hit me my problem is probably right here: https://gist.github.com/maddenpj/74a4c8ce372888ade92d#file-gistfile1-scala-L22 I'm creating a JDBC connection on every record, that's probably what's killing the performance. I assume the fix is just to broadcast the connectio
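JDBC connections generally are not serializable, so broadcasting one won't work; the usual fix is one connection per partition via foreachPartition. A minimal sketch, assuming a hypothetical event_counts table and a DStream[(String, Long)]:

  import java.sql.DriverManager
  import org.apache.spark.streaming.dstream.DStream

  def saveCounts(counts: DStream[(String, Long)], jdbcUrl: String): Unit = {
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // One connection per partition, created on the executor.
        val conn = DriverManager.getConnection(jdbcUrl)
        try {
          val stmt = conn.prepareStatement("INSERT INTO event_counts (key, cnt) VALUES (?, ?)")
          partition.foreach { case (key, cnt) =>
            stmt.setString(1, key)
            stmt.setLong(2, cnt)
            stmt.executeUpdate()
          }
          stmt.close()
        } finally {
          conn.close()
        }
      }
    }
  }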

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Oh I should add I've tried a range of batch durations and reduce by window durations to no effect. I'm not too sure how to choose these? Currently today I've been testing with batch duration of 1 minute - 10 minute and reduce window duration of 10 minute or 20 minutes. -- View this message in

Re: Logging in Spark through YARN.

2014-09-24 Thread Vipul Pandey
Archit, Are you able to get it to work with 1.0.0? I tried the --files suggestion from Marcelo and it just changed logging for my client and the appmaster and executors were still the same. ~Vipul On Jul 30, 2014, at 9:59 PM, Archit Thakur wrote: > Hi Marcelo, > > Thanks for your quick comme

Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
I am attempting to use Spark Streaming to summarize event data streaming in and save it to a MySQL table. The input source data is stored in 4 topics on Kafka with each topic having 12 partitions. My approach to doing this worked both in local development and a simulated load testing environment bu

Re: Question About Submit Application

2014-09-24 Thread danilopds
One more piece of information: when I submit the application from my local PC to my VM, this VM was the master and worker and my local PC wasn't part of the cluster. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-About-Submit-Application-tp15072p

Re: Memory/Network Intensive Workload

2014-09-24 Thread danilopds
Thank you for the suggestion! Bye. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-Network-Intensive-Workload-tp8501p15074.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Spark Streaming Twitter Example Error

2014-09-24 Thread danilopds
I solved this question using the SBT plugin "sbt-assembly". It's very good! Bye. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Twitter-Example-Error-tp12600p15073.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Question About Submit Application

2014-09-24 Thread danilopds
Hello, I'm learning about Spark Streaming and I'm really excited. Today I was testing packaging some apps and submitting them to a standalone cluster on my computer locally. That worked OK. So, I created one virtual machine with a network bridge and I tried to submit the app again to this VM from my local

Re: New sbt plugin to deploy jobs to EC2

2014-09-24 Thread Shafaq
Hi, testing out the Spark Ec2 deployment plugin: I try to compile using $sbt sparkLaunchCluster --- [info] Resolving org.fusesource.jansi#jansi;1.4 ... [warn] :

Hive Avro table fail (pyspark)

2014-09-24 Thread ilyagluk
Dear Spark Users, I'd highly appreciate any comments on whether there is a way to define an Avro table and then run select * without errors. Thank you very much in advance! Ilya. Attached please find a sample data directory and the Avro schema. analytics_events2.zip

persist before or after checkpoint?

2014-09-24 Thread Shay Seng
Hey, I actually have 2 questions. (1) I want to generate unique IDs for each RDD element and I want to assign them in parallel so I do rdd.mapPartitionsWithIndex((index, s) => { var count = 0L s.zipWithIndex.map { case (t, i) => { count += 1 (index * GLOBAL
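A minimal standalone sketch of that per-partition ID pattern; the OFFSET constant is an assumption here, not from the thread, and rdd.zipWithUniqueId() is a built-in alternative that gives unique (though non-consecutive) ids.

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("UniqueIdSketch").setMaster("local[2]"))
  val data = sc.parallelize(Seq("a", "b", "c", "d"), 2)

  // Element i of partition p gets id = p * OFFSET + i, unique as long as
  // no partition holds more than OFFSET elements.
  val OFFSET = 1000000000L
  val withIds = data.mapPartitionsWithIndex { (partitionIndex, iter) =>
    iter.zipWithIndex.map { case (elem, i) => (partitionIndex * OFFSET + i, elem) }
  }
  withIds.collect().foreach(println)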

Re: Spark Code to read RCFiles

2014-09-24 Thread cem
I used the following code as an example to deserialize BytesRefArrayWritable. http://www.massapi.com/source/hive-0.5.0-dev/src/ql/src/test/org/apache/hadoop/hive/ql/io/TestRCFile.java.html Best Regards, Cem. On Wed, Sep 24, 2014 at 1:34 PM, Pramod Biligiri wrote: > I'm afraid SparkSQL isn't

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
Yeah I got the logs and they're reporting about the memory. 14/09/25 00:08:26 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory Now I shifted to a bigger cluster with more memory but here I'm not abl

Re: Null values in pyspark Row

2014-09-24 Thread Davies Liu
Could you create a JIRA and add test cases for it? Thanks! Davies On Wed, Sep 24, 2014 at 11:56 AM, jamborta wrote: > Hi all, > > I have just updated to spark 1.1.0. The new row representation of the data > in spark SQL is very handy. > > I have noticed that it does not set None for NULL values comi

Null values in pyspark Row

2014-09-24 Thread jamborta
Hi all, I have just updated to spark 1.1.0. The new row representation of the data in spark SQL is very handy. I have noticed that it does not set None for NULL values coming from Hive if the column was string type - seems it works with other types. Is that something that will be implemented? T

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
You need to use the command line yarn application that I mentioned ("yarn logs"). You can't look at the logs through the UI after the app stops. On Wed, Sep 24, 2014 at 11:16 AM, Raghuveer Chanda wrote: > > Thanks for the reply .. This is the error in the logs obtained from UI at > http://dml3:80

Re: Spark Code to read RCFiles

2014-09-24 Thread Pramod Biligiri
I'm afraid SparkSQL isn't an option for my use case, so I need to use the Spark API itself. I turned off Kryo, and I'm getting a NullPointerException now: scala> val ref = file.take(1)(0)._2 ref: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable = org.apache.hadoop.hive.serde2.columnar.

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
Thanks for the reply .. This is the error in the logs obtained from UI at http://dml3:8042/node/containerlogs/container_1411578463780_0001_02_01/chanda So now how to set the Log Server url .. Failed while trying to construct the redirect url to the log server. Log Server url may not be confi

Re: RDD save as Seq File

2014-09-24 Thread Sean Owen
It's really Hadoop's support for S3. Hadoop FS semantics need directories, and S3 doesn't have a proper notion of directories. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html You should leave them AFAIK. On Wed, Sep 24, 2014 at 6:41 PM, Shay Seng

Re: parquetFile and wilcards

2014-09-24 Thread Michael Armbrust
We could certainly do this. The comma separated support is something I added. On Wed, Sep 24, 2014 at 10:20 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Does it make sense for us to open a JIRA to track enhancing the Parquet > input format to support wildcards? Or is this somethin

spark.sql.autoBroadcastJoinThreshold

2014-09-24 Thread sridhar1135
Does this work with spark-sql in 1.0.1 too ? I tried like this sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=1;") But it still seems to trigger "shuffleMapTask()" and amount of shuffle is same with / without this parameter... Kindly request some help here Thanks -- Vi

Re: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2014-09-24 Thread Victor Tso-Guillen
Never mind: https://issues.apache.org/jira/browse/SPARK-1476 On Wed, Sep 24, 2014 at 11:10 AM, Victor Tso-Guillen wrote: > Really? What should we make of this? > > 24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor - > Exception in task ID 40599 > > java.lang.IllegalArgumen

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2014-09-24 Thread Victor Tso-Guillen
Really? What should we make of this? 24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor - Exception in task ID 40599 java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:789) at org.apache

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
The screenshot executors.8080.png is of the Executors tab itself, and only the driver is shown, without workers, even though I set the master to yarn-cluster. On Wed, Sep 24, 2014 at 11:18 PM, Matt Narrell wrote: > This just shows the driver. Click the Executors tab in the Spark UI > > mn > > On Sep 24,

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
Thanks for the reply, I have doubt as to which path to set for YARN_CONF_DIR My /etc/hadoop folder has the following sub folders conf conf.cloudera.hdfs conf.cloudera.mapreduce conf.cloudera.yarn and both conf and conf.cloudera.yarn folders have yarn-site.xml. As of now I set the variable as

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
You'll need to look at the driver output to have a better idea of what's going on. You can use "yarn logs --applicationId blah" after your app is finished (e.g. by killing it) to look at it. My guess is that your cluster doesn't have enough resources available to service the container request you'

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-24 Thread jatinpreet
Hi, I was able to get the training running in local mode with default settings, there was a problem with document labels which were quite large(not 20 as suggested earlier). I am currently training 175000 documents on a single node with 2GB of executor memory and 5GB of driver memory successfull

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Hmmm - I have only tested in local mode but I got an java.io.NotSerializableException: org.apache.hadoop.io.Text at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.w

Re: Spark with YARN

2014-09-24 Thread Greg Hill
Do you have YARN_CONF_DIR set in your environment to point Spark to where your yarn configs are? Greg From: Raghuveer Chanda mailto:raghuveer.cha...@gmail.com>> Date: Wednesday, September 24, 2014 12:25 PM To: "u...@spark.incubator.apache.org" mailto:u..

Re: Spark with YARN

2014-09-24 Thread Matt Narrell
This just shows the driver. Click the Executors tab in the Spark UI mn On Sep 24, 2014, at 11:25 AM, Raghuveer Chanda wrote: > Hi, > > I'm new to spark and facing problem with running a job in cluster using YARN. > > Initially i ran jobs using spark master as --master spark://dml2:7077 and

RDD save as Seq File

2014-09-24 Thread Shay Seng
Hi, Why does RDD.saveAsObjectFile() to S3 leave a bunch of *_$folder$ empty files around? Is it possible for the saveas to clean up? tks

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
No, they do not implement Serializable. There are a couple of places where I've had to do a Text->String conversion but generally it hasn't been a problem. -Russ On Wed, Sep 24, 2014 at 10:27 AM, Steve Lewis wrote: > Do your custom Writable classes implement Serializable - I think that is > the
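A minimal sketch of that Text-to-String conversion right after reading (the path and input format are illustrative assumptions), so anything Spark later shuffles, caches or collects is a plain serializable type:

  import org.apache.hadoop.io.Text
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("WritableConversionSketch"))

  val raw = sc.newAPIHadoopFile(
    "hdfs:///data/events.seq",
    classOf[SequenceFileInputFormat[Text, Text]],
    classOf[Text],
    classOf[Text])

  // Convert the Writables to Strings immediately; Text itself is not Serializable.
  val strings = raw.map { case (k, v) => (k.toString, v.toString) }
  strings.take(5).foreach(println)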

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Do your custom Writable classes implement Serializable - I think that is the only real issue - my code uses vanilla Text

Re: Multiple Kafka Receivers and Union

2014-09-24 Thread Tim Smith
Maybe differences between JavaPairDStream and JavaPairReceiverInputDStream? On Wed, Sep 24, 2014 at 7:46 AM, Matt Narrell wrote: > The part that works is the commented out, single receiver stream below the > loop. It seems that when I call KafkaUtils.createStream more than once, I > don’t rece

Re: RDD of Iterable[String]

2014-09-24 Thread Liquan Pei
Hi Deep, The Iterable trait in scala has methods like map and reduce that you can use to iterate elements of Iterable[String]. You can also create an Iterator from the Iterable. For example, suppose you have val rdd: RDD[Iterable[String]] you can do rdd.map { x => //x has type Iterable[String]
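A small runnable sketch of both options (per-element processing with the Iterable's own methods, or flattening with flatMap):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("IterableRDDSketch").setMaster("local[2]"))
  val rdd = sc.parallelize(Seq(Seq("a", "bb"), Seq("ccc")): Seq[Iterable[String]])

  // Each RDD element is a whole Iterable[String]; use its own map/sum/etc.
  val lengths = rdd.map(xs => xs.map(_.length).sum)

  // Or flatten, so every String becomes its own RDD element.
  val flattened = rdd.flatMap(xs => xs)

  lengths.collect().foreach(println)
  flattened.collect().foreach(println)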

Re: parquetFile and wilcards

2014-09-24 Thread Nicholas Chammas
Does it make sense for us to open a JIRA to track enhancing the Parquet input format to support wildcards? Or is this something outside of Spark's control? Nick On Wed, Sep 24, 2014 at 1:01 PM, Michael Armbrust wrote: > This behavior is inherited from the parquet input format that we use. You

Re: parquetFile and wilcards

2014-09-24 Thread Marius Soutier
Thank you, that works! On 24.09.2014, at 19:01, Michael Armbrust wrote: > This behavior is inherited from the parquet input format that we use. You > could list the files manually and pass them as a comma separated list. > > On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier wrote: > Hello, >

Re: parquetFile and wilcards

2014-09-24 Thread Michael Armbrust
This behavior is inherited from the parquet input format that we use. You could list the files manually and pass them as a comma separated list. On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier wrote: > Hello, > > sc.textFile and so on support wildcards in their path, but apparently > sqlc.parqu
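A minimal sketch of that approach with a hypothetical directory layout:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)   // assuming an existing SparkContext `sc`

  val paths = Seq(
    "hdfs:///data/day=2014-09-22/input.parquet",
    "hdfs:///data/day=2014-09-23/input.parquet",
    "hdfs:///data/day=2014-09-24/input.parquet")

  // One comma-separated string in place of the unsupported wildcard.
  val combined = sqlContext.parquetFile(paths.mkString(","))
  combined.registerTempTable("events")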

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using Accumulo's Key and Value classes, both of which extend Writable. Works like a charm. I use the same InputFormat for all my MR jobs. -Russ On Wed, Sep 24, 2014 at 9:33 AM, Steve Lewis wrote: > I tried newAPIHadoopFile an

Re: How to sort rdd filled with existing data structures?

2014-09-24 Thread Sean Owen
See the scaladoc for how to define an implicit ordering to use with sortByKey: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions Off the top of my head, I think this is 90% correct to order by age for example: implicit val studentOrdering: Ordering
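A fuller sketch of that idea (the Student class here is hypothetical): the Ordering lives alongside the job code, so the existing class is untouched and sortByKey picks it up implicitly.

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._   // brings sortByKey into scope in Spark 1.x

  case class Student(name: String, age: Int)

  implicit val studentOrdering: Ordering[Student] = Ordering.by[Student, Int](_.age)

  val sc = new SparkContext(new SparkConf().setAppName("SortStudentsSketch").setMaster("local[2]"))
  val students = sc.parallelize(Seq((Student("alice", 23), 1), (Student("bob", 21), 2)))
  students.sortByKey().collect().foreach(println)   // sorted by age via studentOrdering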

Re: Spark Hbase

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Ted, Thank you for the quick response and details. I have verified the HBaseTest.scala program; it is returning the hbaseRDD but I'm not able to retrieve values from the hbaseRDD. Could you help me retrieve the values from the hbaseRDD? Regards, Rajesh On Wed, Sep 24, 2014 at 10:12 PM, Ted Yu wrote: > Take a l

Re: task getting stuck

2014-09-24 Thread Debasish Das
A first test would be whether you can write parquet files fine to HDFS from your Spark job... If that also gets stuck then there is something wrong with the logic... If parquet files are dumped fine and you can load them into HBase then there is something going on with the Spark-HBase interaction On Wed, Sep 24, 2

Re: Spark Hbase

2014-09-24 Thread Ted Yu
Take a look at the following under examples: examples/src//main/python/hbase_inputformat.py examples/src//main/python/hbase_outputformat.py examples/src//main/scala/org/apache/spark/examples/HBaseTest.scala examples/src//main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala

Re: task getting stuck

2014-09-24 Thread Debasish Das
spark SQL reads parquet file fine...did you follow one of these to read/write parquet from spark ? http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ On Wed, Sep 24, 2014 at 9:29 AM, Ted Yu wrote: > Adding a subject. > > bq. at parquet.hadoop.ParquetFileReader$ > ConsecutiveChunkL

Spark Hbase

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please point me the example program for Spark HBase to read columns and values Regards, Rajesh

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
I tried newAPIHadoopFile and it works except that my original InputFormat extends InputFormat and has a RecordReader. This throws a not Serializable exception on Text - changing the type to InputFormat works with minor code changes. I do not, however, believe that Hadoop could use an InputFormat wi

Re: task getting stuck

2014-09-24 Thread Ted Yu
Adding a subject. bq. at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll( ParquetFileReader.java:599) Looks like there might be some issue reading the Parquet file. Cheers On Wed, Sep 24, 2014 at 9:10 AM, Jianshi Huang wrote: > Hi Ted, > > See my previous reply to Debasish

Re:

2014-09-24 Thread Jianshi Huang
Hi Ted, See my previous reply to Debasish, all region servers are idle. I don't think it's caused by hotspotting. Besides, only 6 out of 3000 tasks were stuck, and their inputs are about only 80MB each. Jianshi On Wed, Sep 24, 2014 at 11:58 PM, Ted Yu wrote: > I was thinking along the same li

WindowedDStreams and hierarchies

2014-09-24 Thread Pablo Medina
Hi everyone, I'm trying to understand the windowed operations functioning. What I want to achieve is the following: val ssc = new StreamingContext(sc, Seconds(1)) val lines = ssc.socketTextStream("localhost", ) val window5 = lines.window(Seconds(5),Seconds(5)).reduce((time1:

Re:

2014-09-24 Thread Jianshi Huang
Hi Debasish, Tables are pre-split and balanced; rows have some skew but it is not that serious. I checked the HBase Master UI, all region servers are idle (0 requests) and the table is not under any compaction. Jianshi On Wed, Sep 24, 2014 at 11:56 PM, Debasish Das wrote: > HBase regionserver needs to b

Re:

2014-09-24 Thread Jianshi Huang
The worker is not writing to HBase, it's just stuck. One task usually finishes in around 20~40 mins, and I waited 6 hours for the buggy ones before killing them. Only 6 out of 3000 tasks got stuck. Jianshi On Wed, Sep 24, 2014 at 11:55 PM, Ted Yu wrote: > Just a shot in the dark: have you ch

Re:

2014-09-24 Thread Ted Yu
I was thinking along the same line. Jianshi: See http://hbase.apache.org/book.html#d0e6369 On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das wrote: > HBase regionserver needs to be balancedyou might have some skewness in > row keys and one regionserver is under pressuretry finding that key

Re:

2014-09-24 Thread Debasish Das
HBase regionserver needs to be balancedyou might have some skewness in row keys and one regionserver is under pressuretry finding that key and replicate it using random salt On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang wrote: > Hi Ted, > > It converts RDD[Edge] to HBase rowkey and colu

Re:

2014-09-24 Thread Ted Yu
Just a shot in the dark: have you checked region server (logs) to see if region server had trouble keeping up ? Cheers On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang wrote: > Hi Ted, > > It converts RDD[Edge] to HBase rowkey and columns and insert them to HBase > (in batch). > > BTW, I found ba

Re:

2014-09-24 Thread Jianshi Huang
Hi Ted, It converts RDD[Edge] to HBase rowkey and columns and insert them to HBase (in batch). BTW, I found batched Put actually faster than generating HFiles... Jianshi On Wed, Sep 24, 2014 at 11:49 PM, Ted Yu wrote: > bq. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$

Executor/Worker stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup and never finishes.

2014-09-24 Thread Jianshi Huang
(Forgot to add the subject, so again...) One of my big spark program always get stuck at 99% where a few tasks never finishes. I debugged it by printing out thread stacktraces, and found there're workers stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup. Anyone had similar problem? I'm

Re:

2014-09-24 Thread Ted Yu
bq. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$ anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179) Can you reveal what HbaseRDDBatch.scala does ? Cheers On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang wrote: > One of my big spark program always get stuck at 99% wh

[no subject]

2014-09-24 Thread Jianshi Huang
One of my big spark program always get stuck at 99% where a few tasks never finishes. I debugged it by printing out thread stacktraces, and found there're workers stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup. Anyone had similar problem? I'm using Spark 1.1.0 built for HDP2.1. The pa

Spark : java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Team, I'm getting the below exception. Could you please help me resolve this issue? Below is my piece of code: val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]) var s = rdd.map(x =
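A minimal sketch of one way around this, with a made-up table name, column family and qualifier: extract plain Strings from each Result inside the map, so the non-serializable HBase objects never need to be shipped by Spark.

  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableInputFormat
  import org.apache.hadoop.hbase.util.Bytes
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("HBaseReadSketch"))
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, "my_table")

  val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])

  // Pull out serializable values (row key and one column) before any shuffle/collect.
  val rows = rdd.map { case (key, result) =>
    val rowKey = Bytes.toString(key.get())
    val value  = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    (rowKey, value)
  }
  rows.take(5).foreach(println)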

Re: Specifying Spark Executor Java options using Spark Submit

2014-09-24 Thread Larry Xiao
Hi Arun! I think you can find info at https://spark.apache.org/docs/latest/configuration.html quote: Spark provides three locations to configure the system: * Spark properties control most application parame

RDD of Iterable[String]

2014-09-24 Thread Deep Pradhan
Can we iterate over RDD of Iterable[String]? How do we do that? Because the entire Iterable[String] seems to be a single element in the RDD. Thank You

Specifying Spark Executor Java options using Spark Submit

2014-09-24 Thread Arun Ahuja
What is the proper way to specify Java options for the Spark executors using spark-submit? We had done this previously using export SPARK_JAVA_OPTS='..', for example to attach a debugger to each executor or add "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" On spark-submit
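For what it's worth, a hedged sketch using the Spark 1.x spark.executor.extraJavaOptions property; the same key can also be set in spark-defaults.conf or passed on the command line with spark-submit --conf.

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("GcLoggingSketch")
    // Executor JVM flags go into this property rather than SPARK_JAVA_OPTS.
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  val sc = new SparkContext(conf)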

Re: Does Spark Driver works with HDFS in HA mode

2014-09-24 Thread Matt Narrell
Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark machines mn On Sep 24, 2014, at 5:35 AM, Petr Novak wrote: > Hello, > if our Hadoop cluster is configured with HA and "fs.defaultFS" points to a > namespace instead of a namenode hostname - hdfs:/// - then > our Spark job

parquetFile and wilcards

2014-09-24 Thread Marius Soutier
Hello, sc.textFile and so on support wildcards in their path, but apparently sqlc.parquetFile() does not. I always receive "File /file/to/path/*/input.parquet does not exist". Is this normal or a bug? Is there a workaround? Thanks - Marius ---

Re: Multiple Kafka Receivers and Union

2014-09-24 Thread Matt Narrell
The part that works is the commented out, single receiver stream below the loop. It seems that when I call KafkaUtils.createStream more than once, I don’t receive any messages. I’ll dig through the logs, but at first glance yesterday I didn’t see anything suspect. I’ll have to look closer. m

Re: Spark Streaming

2014-09-24 Thread Akhil Das
See the inline response. On Wed, Sep 24, 2014 at 4:05 PM, Reddy Raja wrote: > Given this program.. I have the following queries.. > > val sparkConf = new SparkConf().setAppName("NetworkWordCount") > > sparkConf.set("spark.master", "local[2]") > > val ssc = new StreamingContext(sparkConf,

Re: Spark Code to read RCFiles

2014-09-24 Thread cem
I was able to read RC files with the following line: val file: RDD[(LongWritable, BytesRefArrayWritable)] = sc.hadoopFile("hdfs://day=2014-08-10/hour=00/", classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]], classOf[LongWritable], classOf[BytesRefArrayWritable],500) Try with dis

Re: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread Victor Tso-Guillen
Do you have a newline or some other strange character in an argument you pass to spark that includes that 61608? On Wed, Sep 24, 2014 at 4:11 AM, wrote: > Hi , >*I am getting this weird error while starting Worker. * > > -bash-4.1$ spark-class org.apache.spark.deploy.worker.Worker > spa

Optimal Cluster Setup for Spark

2014-09-24 Thread pouryas
Hi there, what is an optimal cluster setup for Spark? Given X amount of resources, would you favour more worker nodes with fewer resources each, or fewer worker nodes with more resources? Is this application dependent? If so, what are the things to consider and what are good practices? Cheers -- View this m

Re: Why recommend 2-3 tasks per CPU core ?

2014-09-24 Thread myasuka
Thanks for your reply! Specifically, in our development environment, I want to know whether there is any solution to let every task do just one sub-matrix multiplication. After 'groupBy', every node with 16 cores should get 16 pairs of matrices, thus I set every node to 16 tasks, hoping every core does one

Spark Cassandra Connector Issue and performance

2014-09-24 Thread pouryas
Hey all, I tried the Spark connector with Cassandra and I ran into a problem that I was blocked on for a couple of weeks. I managed to find a solution to the problem but I am not sure whether it was a bug in the connector/Spark or not. I had three tables in Cassandra (running Cassandra on a 5 node cluste
