[Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2014-09-24 Thread Aniket Bhatnagar
Hi all, Reading through Spark Streaming's custom receiver documentation, it is recommended that the onStart and onStop methods should not block indefinitely. However, looking at the source code of KinesisReceiver, the onStart method calls worker.run, which blocks until the worker is shut down (via a call to o
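For reference, a minimal sketch of the non-blocking pattern the custom receiver guide recommends (the class and data source below are placeholders, not the actual KinesisReceiver code): onStart() spawns a thread and returns immediately, while the blocking loop lives on that thread.

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.receiver.Receiver

  class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
    def onStart(): Unit = {
      // Start the blocking work on its own thread so onStart() returns quickly
      new Thread("My Receiver") {
        override def run(): Unit = receive()
      }.start()
    }
    def onStop(): Unit = { /* signal the receive loop to stop */ }
    private def receive(): Unit = {
      while (!isStopped()) {
        store("some data")  // placeholder for reading from the real source
      }
    }
  }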

Re: HdfsWordCount only counts some of the words

2014-09-24 Thread Sean Owen
If you look at the code for HdfsWordCount, you see it calls print(), which defaults to printing 10 elements from each RDD. If you are just talking about the console output, then it is not expected to print all words to begin with. On Wed, Sep 24, 2014 at 2:29 AM, SK wrote: > > I execute it as follow
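For completeness, a hedged contrast (the DStream name is taken from the example, the output path is a placeholder): print() shows only a sample per batch, while saving the stream writes everything.

  wordCounts.print()                                    // only the first 10 elements of each batch
  wordCounts.saveAsTextFiles("hdfs:///tmp/wordCounts")  // full contents of every batch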

Re: Converting one RDD to another

2014-09-24 Thread Sean Owen
top returns the specified number of "largest" elements in your RDD. They are returned to the driver as an Array. If you want to make an RDD out of them again, call SparkContext.parallelize(...). Make sure this is what you mean though. On Wed, Sep 24, 2014 at 5:33 AM, Deep Pradhan wrote: > Hi, > I
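A small sketch of that round trip (values are made up):

  val rdd = sc.parallelize(1 to 100)
  val top10: Array[Int] = rdd.top(10)    // collected to the driver as an Array
  val top10Rdd = sc.parallelize(top10)   // back to an RDD if you really need one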

sortByKey trouble

2014-09-24 Thread david
Hi, Does anybody know how to use sortByKey in Scala on an RDD like: val rddToSave = file.map(l => l.split("\\|")).map(r => (r(34)+"-"+r(3), r(4), r(10), r(12))) because I received an error "sortByKey is not a member of org.apache.spark.rdd.RDD[(String,String,String,String)]". What i t

RE: sortByKey trouble

2014-09-24 Thread Shao, Saisai
Hi, sortByKey is only for RDD[(K, V)]; each tuple can only have two members, and Spark will sort by the first one. If you want to use sortByKey, you have to change your RDD[(String, String, String, String)] into RDD[(String, (String, String, String))]. Thanks Jerry -Original Message- From

Re: sortByKey trouble

2014-09-24 Thread Liquan Pei
Hi David, Can you try val rddToSave = file.map(l => l.split("\\|")).map(r => (r(34)+"-"+r(3), (r(4), r(10), r(12 ? That should work. Liquan On Wed, Sep 24, 2014 at 1:29 AM, david wrote: > Hi, > > Does anybody know how to use sortbykey in scala on a RDD like : > > val rddToSave = fi
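Putting the two replies together, a hedged sketch of the full pattern (field indices taken from the original post):

  // Key on the concatenated string, keep the remaining fields as the value
  val rddToSave = file.map(_.split("\\|"))
    .map(r => (r(34) + "-" + r(3), (r(4), r(10), r(12))))
  // import org.apache.spark.SparkContext._ is needed for sortByKey outside the shell
  val sorted = rddToSave.sortByKey()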

All-time stream re-processing

2014-09-24 Thread Tobias Pfeiffer
Hi, I have a setup (in mind) where data is written to Kafka and this data is persisted in HDFS (e.g., using camus) so that I have an all-time archive of all stream data ever received. Now I want to process that all-time archive and when I am done with that, continue with the live stream, using Spa

Spark Streaming

2014-09-24 Thread Reddy Raja
Given this program.. I have the following queries.. val sparkConf = new SparkConf().setAppName("NetworkWordCount") sparkConf.set("spark.master", "local[2]") val ssc = new StreamingContext(sparkConf, Seconds(10)) val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.

find subgraph in Graphx

2014-09-24 Thread uuree
Hi, I want to extract a subgraph using two clauses such as (?person worksAt MIT && MIT locatesIn USA), and I have this query. My data is constructed as triplets so I assume GraphX will suit it best. But I found out that the subgraph method only supports checking a single edge and it seems to me that I cann

Does Spark Driver works with HDFS in HA mode

2014-09-24 Thread Petr Novak
Hello, if our Hadoop cluster is configured with HA and "fs.defaultFS" points to a namespace instead of a namenode hostname - hdfs:/// - then our Spark job fails with an exception. Is there anything to configure, or is it not implemented? Exception in thread "main" org.apache.spark.SparkException: Job

Re: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread jishnu.prathap
Hi , I am getting this weird error while starting Worker. -bash-4.1$ spark-class org.apache.spark.deploy.worker.Worker spark://osebi-UServer:59468 Spark assembly has been built with Hive, including Datanucleus jars on classpath 14/09/24 16:22:04 INFO worker.Worker: Registered signal handle

RE: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread jishnu.prathap
Hi , I am getting this weird error while starting Worker. -bash-4.1$ spark-class org.apache.spark.deploy.worker.Worker spark://osebi-UServer:59468 Spark assembly has been built with Hive, including Datanucleus jars on classpath 14/09/24 16:22:04 INFO worker.Worker: Registered signal handl

java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread jishnu.prathap
Hi , I am getting this weird error while starting Worker. -bash-4.1$ spark-class org.apache.spark.deploy.worker.Worker spark://osebi-UServer:59468 Spark assembly has been built with Hive, including Datanucleus jars on classpath 14/09/24 16:22:04 INFO worker.Worker: Registered signal handle

Re: sortByKey trouble

2014-09-24 Thread david
Thanks, I've already tried this solution but it does not compile (in Eclipse). I'm surprised to see that in the Spark shell, sortByKey works fine on both forms: (String,String,String,String) (String,(String,String,String)) -- View this message in context: http://apache-spark-user-list.1

Re: All-time stream re-processing

2014-09-24 Thread Tobias Pfeiffer
Hi, On Wed, Sep 24, 2014 at 7:23 PM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > So you have a single Kafka topic which has very high retention period ( > that decides the storage capacity of a given Kafka topic) and you want to > process all historical data first using Camus

How to sort rdd filled with existing data structures?

2014-09-24 Thread Tao Xiao
Hi , I have the following rdd : val conf = new SparkConf() .setAppName("<< Testing Sorting >>") val sc = new SparkContext(conf) val L = List( (new Student("XiaoTao", 80, 29), "I'm Xiaotao"), (new Student("CCC", 100, 24), "I'm CCC"), (new Student("Jack", 90,

Spark Cassandra Connector Issue and performance

2014-09-24 Thread pouryas
Hey all I tried spark connector with Cassandra and I ran into a problem that I was blocked on for couple of weeks. I managed to find a solution to the problem but I am not sure whether it was a bug of the connector/spark or not. I had three tables in Cassandra (Running Cassandra on 5 node cluste

Re: Why recommend 2-3 tasks per CPU core ?

2014-09-24 Thread myasuka
Thanks for your reply! Specifically, in our development environment, I want to know whether there is any solution to let every task do just one sub-matrix multiplication. After 'groupBy', every node with 16 cores should get 16 pairs of matrices, thus I set every node with 16 tasks to hope every core do once

Optimal Cluster Setup for Spark

2014-09-24 Thread pouryas
Hi there, What is an optimal cluster setup for Spark? Given X amount of resources, would you favour more worker nodes with fewer resources each, or fewer worker nodes with more resources? Is this application dependent? If so, what are the things to consider, and what are good practices? Cheers -- View this m

Re: java.lang.NumberFormatException while starting spark-worker

2014-09-24 Thread Victor Tso-Guillen
Do you have a newline or some other strange character in an argument you pass to spark that includes that 61608? On Wed, Sep 24, 2014 at 4:11 AM, wrote: > Hi , >*I am getting this weird error while starting Worker. * > > -bash-4.1$ spark-class org.apache.spark.deploy.worker.Worker > spa

Re: Spark Code to read RCFiles

2014-09-24 Thread cem
I was able to read RC files with the following line: val file: RDD[(LongWritable, BytesRefArrayWritable)] = sc.hadoopFile("hdfs://day=2014-08-10/hour=00/", classOf[RCFileInputFormat[LongWritable, BytesRefArrayWritable]], classOf[LongWritable], classOf[BytesRefArrayWritable],500) Try with dis

Re: Spark Streaming

2014-09-24 Thread Akhil Das
See the inline response. On Wed, Sep 24, 2014 at 4:05 PM, Reddy Raja wrote: > Given this program.. I have the following queries.. > > val sparkConf = new SparkConf().setAppName("NetworkWordCount") > > sparkConf.set("spark.master", "local[2]") > > val ssc = new StreamingContext(sparkConf,

Re: Multiple Kafka Receivers and Union

2014-09-24 Thread Matt Narrell
The part that works is the commented out, single receiver stream below the loop. It seems that when I call KafkaUtils.createStream more than once, I don’t receive any messages. I’ll dig through the logs, but at first glance yesterday I didn’t see anything suspect. I’ll have to look closer. m

parquetFile and wilcards

2014-09-24 Thread Marius Soutier
Hello, sc.textFile and so on support wildcards in their path, but apparently sqlc.parquetFile() does not. I always receive "File /file/to/path/*/input.parquet does not exist". Is this normal or a bug? Is there a workaround? Thanks - Marius ---

Re: Does Spark Driver works with HDFS in HA mode

2014-09-24 Thread Matt Narrell
Yes, this works. Make sure you have HADOOP_CONF_DIR set on your Spark machines mn On Sep 24, 2014, at 5:35 AM, Petr Novak wrote: > Hello, > if our Hadoop cluster is configured with HA and "fs.defaultFS" points to a > namespace instead of a namenode hostname - hdfs:/// - then > our Spark job

Specifying Spark Executor Java options using Spark Submit

2014-09-24 Thread Arun Ahuja
What is the proper way to specify Java options for the Spark executors using spark-submit? We had done this previously using export SPARK_JAVA_OPTS="..", for example to attach a debugger to each executor or add "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps". On spark-submit

RDD of Iterable[String]

2014-09-24 Thread Deep Pradhan
Can we iterate over RDD of Iterable[String]? How do we do that? Because the entire Iterable[String] seems to be a single element in the RDD. Thank You

Re: Specifying Spark Executor Java options using Spark Submit

2014-09-24 Thread Larry Xiao
Hi Arun! I think you can find info at https://spark.apache.org/docs/latest/configuration.html quote: Spark provides three locations to configure the system: * Spark properties control most application parame
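For Spark 1.x the relevant property is spark.executor.extraJavaOptions; a hedged sketch of setting it programmatically (the app name is a placeholder, and the same property can go on the spark-submit command line via --conf or into spark-defaults.conf):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("MyApp")
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")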

Spark : java.io.NotSerializableException: org.apache.hadoop.hbase.client.Result

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Team, I'm getting the below exception. Could you please help me resolve this issue? Below is my piece of code: val rdd = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result]) var s =rdd.map(x =

[no subject]

2014-09-24 Thread Jianshi Huang
One of my big Spark programs always gets stuck at 99%, where a few tasks never finish. I debugged it by printing out thread stacktraces, and found there are workers stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup. Anyone had a similar problem? I'm using Spark 1.1.0 built for HDP2.1. The pa

Re:

2014-09-24 Thread Ted Yu
bq. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$ anonfun$batchInsertEdges$3.apply(HbaseRDDBatch.scala:179) Can you reveal what HbaseRDDBatch.scala does ? Cheers On Wed, Sep 24, 2014 at 8:46 AM, Jianshi Huang wrote: > One of my big spark program always get stuck at 99% wh

Executor/Worker stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup and never finishes.

2014-09-24 Thread Jianshi Huang
(Forgot to add the subject, so again...) One of my big Spark programs always gets stuck at 99%, where a few tasks never finish. I debugged it by printing out thread stacktraces, and found there are workers stuck at parquet.hadoop.ParquetFileReader.readNextRowGroup. Anyone had a similar problem? I'm

Re:

2014-09-24 Thread Jianshi Huang
Hi Ted, It converts RDD[Edge] to HBase rowkeys and columns and inserts them into HBase (in batch). BTW, I found batched Puts actually faster than generating HFiles... Jianshi On Wed, Sep 24, 2014 at 11:49 PM, Ted Yu wrote: > bq. at com.paypal.risk.rds.dragon.storage.hbase.HbaseRDDBatch$$

Re:

2014-09-24 Thread Ted Yu
Just a shot in the dark: have you checked region server (logs) to see if region server had trouble keeping up ? Cheers On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang wrote: > Hi Ted, > > It converts RDD[Edge] to HBase rowkey and columns and insert them to HBase > (in batch). > > BTW, I found ba

Re:

2014-09-24 Thread Debasish Das
The HBase regionserver needs to be balanced... you might have some skewness in row keys and one regionserver is under pressure... try finding that key and replicate it using a random salt. On Wed, Sep 24, 2014 at 8:51 AM, Jianshi Huang wrote: > Hi Ted, > > It converts RDD[Edge] to HBase rowkey and colu

Re:

2014-09-24 Thread Ted Yu
I was thinking along the same line. Jianshi: See http://hbase.apache.org/book.html#d0e6369 On Wed, Sep 24, 2014 at 8:56 AM, Debasish Das wrote: > HBase regionserver needs to be balancedyou might have some skewness in > row keys and one regionserver is under pressuretry finding that key

Re:

2014-09-24 Thread Jianshi Huang
The worker is not writing to HBase, it's just stuck. One task usually finishes in around 20~40 mins, and I waited 6 hours for the buggy ones before killing them. Only 6 out of 3000 tasks got stuck. Jianshi On Wed, Sep 24, 2014 at 11:55 PM, Ted Yu wrote: > Just a shot in the dark: have you ch

Re:

2014-09-24 Thread Jianshi Huang
Hi Debasish, Tables are presplit and balanced; rows have some skew but it is not that serious. I checked the HBase Master UI, all region servers are idle (0 requests) and the table is not under any compaction. Jianshi On Wed, Sep 24, 2014 at 11:56 PM, Debasish Das wrote: > HBase regionserver needs to b

WindowedDStreams and hierarchies

2014-09-24 Thread Pablo Medina
Hi everyone, I'm trying to understand the windowed operations functioning. What I want to achieve is the following: val ssc = new StreamingContext(sc, Seconds(1)) val lines = ssc.socketTextStream("localhost", ) val window5 = lines.window(Seconds(5),Seconds(5)).reduce((time1:

Re:

2014-09-24 Thread Jianshi Huang
Hi Ted, See my previous reply to Debasish, all region servers are idle. I don't think it's caused by hotspotting. Besides, only 6 out of 3000 tasks were stuck, and their inputs are about only 80MB each. Jianshi On Wed, Sep 24, 2014 at 11:58 PM, Ted Yu wrote: > I was thinking along the same li

Re: task getting stuck

2014-09-24 Thread Ted Yu
Adding a subject. bq. at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll( ParquetFileReader.java:599) Looks like there might be some issue reading the Parquet file. Cheers On Wed, Sep 24, 2014 at 9:10 AM, Jianshi Huang wrote: > Hi Ted, > > See my previous reply to Debasish

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
I tried newAPIHadoopFile and it works, except that my original InputFormat extends InputFormat and has a RecordReader. This throws a NotSerializable exception on Text - changing the type to InputFormat works with minor code changes. I do not, however, believe that Hadoop could use an InputFormat wi

Spark Hbase

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Team, Could you please point me to an example program for Spark HBase to read columns and values. Regards, Rajesh

Re: task getting stuck

2014-09-24 Thread Debasish Das
Spark SQL reads parquet files fine... did you follow one of these to read/write parquet from Spark? http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ On Wed, Sep 24, 2014 at 9:29 AM, Ted Yu wrote: > Adding a subject. > > bq. at parquet.hadoop.ParquetFileReader$ > ConsecutiveChunkL

Re: Spark Hbase

2014-09-24 Thread Ted Yu
Take a look at the following under examples: examples/src//main/python/hbase_inputformat.py examples/src//main/python/hbase_outputformat.py examples/src//main/scala/org/apache/spark/examples/HBaseTest.scala examples/src//main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala

Re: task getting stuck

2014-09-24 Thread Debasish Das
A first test would be whether you can write parquet files to HDFS from your Spark job fine... If that also gets stuck then there is something with the logic... If parquet files are dumped fine and you can load them into HBase then there is something going on with the Spark-HBase interaction. On Wed, Sep 24, 2

Re: Spark Hbase

2014-09-24 Thread Madabhattula Rajesh Kumar
Hi Ted, Thank you for the quick response and details. I have verified the HBaseTest.scala program; it returns an hbaseRDD but I'm not able to retrieve values from the hbaseRDD. Could you help me retrieve the values from the hbaseRDD? Regards, Rajesh On Wed, Sep 24, 2014 at 10:12 PM, Ted Yu wrote: > Take a l

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
I use newAPIHadoopRDD with AccumuloInputFormat. It produces a PairRDD using Accumulo's Key and Value classes, both of which extend Writable. Works like a charm. I use the same InputFormat for all my MR jobs. -Russ On Wed, Sep 24, 2014 at 9:33 AM, Steve Lewis wrote: > I tried newAPIHadoopFile an

Re: How to sort rdd filled with existing data structures?

2014-09-24 Thread Sean Owen
See the scaladoc for how to define an implicit ordering to use with sortByKey: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.OrderedRDDFunctions Off the top of my head, I think this is 90% correct to order by age for example: implicit val studentOrdering: Ordering
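A hedged sketch of that idea (the Student field name is an assumption, since the class definition is not shown in the thread):

  // Sort an RDD[(Student, String)] by the students' age
  implicit val studentOrdering: Ordering[Student] = Ordering.by(_.age)
  val sorted = sc.parallelize(L).sortByKey()  // picks up the implicit ordering for the Student key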

Re: parquetFile and wilcards

2014-09-24 Thread Michael Armbrust
This behavior is inherited from the parquet input format that we use. You could list the files manually and pass them as a comma separated list. On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier wrote: > Hello, > > sc.textFile and so on support wildcards in their path, but apparently > sqlc.parqu
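A hedged sketch of that workaround (paths are placeholders): expand the glob yourself with the Hadoop FileSystem API and hand parquetFile the comma-separated result.

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)
  val paths = fs.globStatus(new Path("/file/to/path/*/input.parquet"))
    .map(_.getPath.toString)
    .mkString(",")
  val parquetData = sqlc.parquetFile(paths)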

Re: parquetFile and wilcards

2014-09-24 Thread Marius Soutier
Thank you, that works! On 24.09.2014, at 19:01, Michael Armbrust wrote: > This behavior is inherited from the parquet input format that we use. You > could list the files manually and pass them as a comma separated list. > > On Wed, Sep 24, 2014 at 7:46 AM, Marius Soutier wrote: > Hello, >

Re: parquetFile and wilcards

2014-09-24 Thread Nicholas Chammas
Does it make sense for us to open a JIRA to track enhancing the Parquet input format to support wildcards? Or is this something outside of Spark's control? Nick On Wed, Sep 24, 2014 at 1:01 PM, Michael Armbrust wrote: > This behavior is inherited from the parquet input format that we use. You

Re: RDD of Iterable[String]

2014-09-24 Thread Liquan Pei
Hi Deep, The Iterable trait in scala has methods like map and reduce that you can use to iterate elements of Iterable[String]. You can also create an Iterator from the Iterable. For example, suppose you have val rdd: RDD[Iterable[String]] you can do rdd.map { x => //x has type Iterable[String]
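A minimal sketch of both options (the sample data is made up):

  import org.apache.spark.rdd.RDD

  val grouped: RDD[Iterable[String]] =
    sc.parallelize(Seq[Iterable[String]](Seq("a", "b"), Seq("c")))
  val sizes: RDD[Int]    = grouped.map(_.size)         // iterate inside each element
  val flat:  RDD[String] = grouped.flatMap(identity)   // or flatten everything into one RDD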

Re: Multiple Kafka Receivers and Union

2014-09-24 Thread Tim Smith
Maybe differences between JavaPairDStream and JavaPairReceiverInputDStream? On Wed, Sep 24, 2014 at 7:46 AM, Matt Narrell wrote: > The part that works is the commented out, single receiver stream below the > loop. It seems that when I call KafkaUtils.createStream more than once, I > don’t rece

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Do your custom Writable classes implement Serializable - I think that is the only real issue - my code uses vanilla Text

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Russ Weeks
No, they do not implement Serializable. There are a couple of places where I've had to do a Text->String conversion but generally it hasn't been a problem. -Russ On Wed, Sep 24, 2014 at 10:27 AM, Steve Lewis wrote: > Do your custom Writable classes implement Serializable - I think that is > the

RDD save as Seq File

2014-09-24 Thread Shay Seng
Hi, Why does RDD.saveAsObjectFile() to S3 leave a bunch of *_$folder$ empty files around? Is it possible for the saveas to clean up? tks

Re: Spark with YARN

2014-09-24 Thread Greg Hill
Do you have YARN_CONF_DIR set in your environment to point Spark to where your yarn configs are? Greg From: Raghuveer Chanda mailto:raghuveer.cha...@gmail.com>> Date: Wednesday, September 24, 2014 12:25 PM To: "u...@spark.incubator.apache.org" mailto:u..

Re: Spark with YARN

2014-09-24 Thread Matt Narrell
This just shows the driver. Click the Executors tab in the Spark UI mn On Sep 24, 2014, at 11:25 AM, Raghuveer Chanda wrote: > Hi, > > I'm new to spark and facing problem with running a job in cluster using YARN. > > Initially i ran jobs using spark master as --master spark://dml2:7077 and

Re: Does anyone have experience with using Hadoop InputFormats?

2014-09-24 Thread Steve Lewis
Hmmm - I have only tested in local mode, but I got a java.io.NotSerializableException: org.apache.hadoop.io.Text at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180) at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528) at java.io.ObjectOutputStream.w

Re: Out of memory exception in MLlib's naive baye's classification training

2014-09-24 Thread jatinpreet
Hi, I was able to get the training running in local mode with default settings; there was a problem with document labels which were quite large (not 20 as suggested earlier). I am currently training 175000 documents on a single node with 2GB of executor memory and 5GB of driver memory successfull

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
You'll need to look at the driver output to have a better idea of what's going on. You can use "yarn logs --applicationId blah" after your app is finished (e.g. by killing it) to look at it. My guess is that your cluster doesn't have enough resources available to service the container request you'

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
Thanks for the reply, I have doubt as to which path to set for YARN_CONF_DIR My /etc/hadoop folder has the following sub folders conf conf.cloudera.hdfs conf.cloudera.mapreduce conf.cloudera.yarn and both conf and conf.cloudera.yarn folders have yarn-site.xml. As of now I set the variable as

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
The screenshot executors.8080.png is of the executors tab itself, and only the driver is added, without workers, even if I kept the master as yarn-cluster. On Wed, Sep 24, 2014 at 11:18 PM, Matt Narrell wrote: > This just shows the driver. Click the Executors tab in the Spark UI > > mn > > On Sep 24,

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2014-09-24 Thread Victor Tso-Guillen
Really? What should we make of this? 24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor - Exception in task ID 40599 java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:789) at org.apache

Re: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

2014-09-24 Thread Victor Tso-Guillen
Never mind: https://issues.apache.org/jira/browse/SPARK-1476 On Wed, Sep 24, 2014 at 11:10 AM, Victor Tso-Guillen wrote: > Really? What should we make of this? > > 24 Sep 2014 10:03:36,772 ERROR [Executor task launch worker-52] Executor - > Exception in task ID 40599 > > java.lang.IllegalArgumen

spark.sql.autoBroadcastJoinThreshold

2014-09-24 Thread sridhar1135
Does this work with spark-sql in 1.0.1 too? I tried it like this: sqlContext.sql("SET spark.sql.autoBroadcastJoinThreshold=1;") But it still seems to trigger "shuffleMapTask()" and the amount of shuffle is the same with/without this parameter... Kindly request some help here. Thanks -- Vi

Re: parquetFile and wilcards

2014-09-24 Thread Michael Armbrust
We could certainly do this. The comma separated support is something I added. On Wed, Sep 24, 2014 at 10:20 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Does it make sense for us to open a JIRA to track enhancing the Parquet > input format to support wildcards? Or is this somethin

Re: RDD save as Seq File

2014-09-24 Thread Sean Owen
It's really Hadoop's support for S3. Hadoop FS semantics need directories, and S3 doesn't have a proper notion of directories. http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html You should leave them AFAIK. On Wed, Sep 24, 2014 at 6:41 PM, Shay Seng

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
Thanks for the reply... This is the error in the logs obtained from the UI at http://dml3:8042/node/containerlogs/container_1411578463780_0001_02_01/chanda So now how do I set the Log Server url? Failed while trying to construct the redirect url to the log server. Log Server url may not be confi

Re: Spark Code to read RCFiles

2014-09-24 Thread Pramod Biligiri
I'm afraid SparkSQL isn't an option for my use case, so I need to use the Spark API itself. I turned off Kryo, and I'm getting a NullPointerException now: scala> val ref = file.take(1)(0)._2 ref: org.apache.hadoop.hive.serde2.columnar.BytesRefArrayWritable = org.apache.hadoop.hive.serde2.columnar.

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
You need to use the command line yarn application that I mentioned ("yarn logs"). You can't look at the logs through the UI after the app stops. On Wed, Sep 24, 2014 at 11:16 AM, Raghuveer Chanda wrote: > > Thanks for the reply .. This is the error in the logs obtained from UI at > http://dml3:80

Null values in pyspark Row

2014-09-24 Thread jamborta
Hi all, I have just updated to Spark 1.1.0. The new row representation of the data in Spark SQL is very handy. I have noticed that it does not set None for NULL values coming from Hive if the column was string type - it seems to work with other types. Is that something that will be implemented? T

Re: Null values in pyspark Row

2014-09-24 Thread Davies Liu
Could you create a JIRA and add test cases for it? Thanks! Davies On Wed, Sep 24, 2014 at 11:56 AM, jamborta wrote: > Hi all, > > I have just updated to spark 1.1.0. The new row representation of the data > in spark SQL is very handy. > > I have noticed that it does not set None for NULL values comi

Re: Spark with YARN

2014-09-24 Thread Raghuveer Chanda
Yeah, I got the logs and it's reporting about the memory. 14/09/25 00:08:26 WARN YarnClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory Now I shifted to a big cluster with more memory but here I'm not abl

Re: Spark Code to read RCFiles

2014-09-24 Thread cem
I used the following code as an example to deserialize BytesRefArrayWritable. http://www.massapi.com/source/hive-0.5.0-dev/src/ql/src/test/org/apache/hadoop/hive/ql/io/TestRCFile.java.html Best Regards, Cem. On Wed, Sep 24, 2014 at 1:34 PM, Pramod Biligiri wrote: > I'm afraid SparkSQL isn't

persist before or after checkpoint?

2014-09-24 Thread Shay Seng
Hey, I actually have 2 questions. (1) I want to generate unique IDs for each RDD element and I want to assign them in parallel, so I do rdd.mapPartitionsWithIndex((index, s) => { var count = 0L s.zipWithIndex.map { case (t, i) => { count += 1 (index * GLOBAL
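For question (1), a hedged aside: Spark already ships two helpers that avoid the hand-rolled per-partition counter, e.g.:

  val withUniqueIds = rdd.zipWithUniqueId()  // RDD[(T, Long)], IDs unique but not consecutive
  val withIndex     = rdd.zipWithIndex()     // consecutive 0..n-1, at the cost of an extra job to count partition sizes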

Hive Avro table fail (pyspark)

2014-09-24 Thread ilyagluk
Dear Spark Users, I'd highly appreciate any comments on whether there is a way to define an Avro table and then run select * without errors. Thank you very much in advance! Ilya. Attached please find a sample data directory and the Avro schema. analytics_events2.zip

Re: New sbt plugin to deploy jobs to EC2

2014-09-24 Thread Shafaq
Hi, testing out the Spark Ec2 deployment plugin: I try to compile using $sbt sparkLaunchCluster --- [info] Resolving org.fusesource.jansi#jansi;1.4 ... [warn] :

Question About Submit Application

2014-09-24 Thread danilopds
Hello, I'm learning about Spark Streaming and I'm really excited. Today I was testing packaging some apps and submitting them to a standalone cluster on my computer locally. It worked OK. So, I created one virtual machine with a network bridge and I tried to submit the app again to this VM from my local

Re: Spark Streaming Twitter Example Error

2014-09-24 Thread danilopds
I solved this question using the SBT plugin "sbt-assembly". It's very good! Bye. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Twitter-Example-Error-tp12600p15073.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: Memory/Network Intensive Workload

2014-09-24 Thread danilopds
Thank you for the suggestion! Bye. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Memory-Network-Intensive-Workload-tp8501p15074.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Question About Submit Application

2014-09-24 Thread danilopds
One more piece of information: when I submit the application from my local PC to my VM, this VM is the master and worker, and my local PC isn't part of the cluster. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-About-Submit-Application-tp15072p

Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
I am attempting to use Spark Streaming to summarize event data streaming in and save it to a MySQL table. The input source data is stored in 4 topics on Kafka with each topic having 12 partitions. My approach to doing this worked both in local development and a simulated load testing environment bu

Re: Logging in Spark through YARN.

2014-09-24 Thread Vipul Pandey
Archit, Are you able to get it to work with 1.0.0? I tried the --files suggestion from Marcelo and it just changed logging for my client and the appmaster and executors were still the same. ~Vipul On Jul 30, 2014, at 9:59 PM, Archit Thakur wrote: > Hi Marcelo, > > Thanks for your quick comme

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Oh, I should add that I've tried a range of batch durations and reduce-by-window durations to no effect. I'm not too sure how to choose these. Currently today I've been testing with a batch duration of 1 minute - 10 minutes and a reduce window duration of 10 minutes or 20 minutes. -- View this message in

Re: Spark Streaming unable to handle production Kafka load

2014-09-24 Thread maddenpj
Another update: actually it just hit me, my problem is probably right here: https://gist.github.com/maddenpj/74a4c8ce372888ade92d#file-gistfile1-scala-L22 I'm creating a JDBC connection on every record; that's probably what's killing the performance. I assume the fix is just broadcast the connectio
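A connection can't really be broadcast (it isn't serializable); the usual pattern is one connection per partition via foreachPartition. A hedged sketch (the DStream name, JDBC URL and the write itself are placeholders, not the code from the gist):

  import java.sql.DriverManager

  summaries.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One connection per partition instead of one per record
      val conn = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass")
      try {
        records.foreach { r =>
          // ... build and execute the insert/update for this record ...
        }
      } finally {
        conn.close()
      }
    }
  }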

Re: mllib performance on mesos cluster

2014-09-24 Thread Sudha Krishna
Setting spark.mesos.coarse=true helped reduce the time on the mesos cluster from 17 min to around 6 min. The scheduler delay per task reduced from 40 ms to around 10 ms. thanks On Mon, Sep 22, 2014 at 12:36 PM, Xiangrui Meng wrote: > 1) MLlib 1.1 should be faster than 1.0 in general. What's th

Re: java.lang.OutOfMemoryError while running SVD MLLib example

2014-09-24 Thread Shailesh Birari
Note, the data is random numbers (double). Any suggestions/pointers will be highly appreciated. Thanks, Shailesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-OutOfMemoryError-while-running-SVD-MLLib-example-tp14972p15083.html Sent from the A

MLUtils.loadLibSVMFile error

2014-09-24 Thread Sameer Tilak
Hi All, When I try to load dataset using MLUtils.loadLibSVMFile, I have the following problem. Any help will be greatly appreciated! Code snippet: import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.util.MLUtils import org.apache.spark.rdd.RDD import org.ap

Bug in JettyUtils?

2014-09-24 Thread Jim Donahue
I've been seeing a failure in JettyUtils.createStaticHandler when running Spark 1.1 programs from Eclipse which I hadn't seen before. The problem seems to be line 129 Option(Utils.getSparkClassLoader.getResource(resourceBase)) match { ... getSparkClassLoader returns NULL, and it's all

Re: Spark with YARN

2014-09-24 Thread Marcelo Vanzin
If you launched the job in yarn-cluster mode, the tracking URL is printed on the output of the launched process. That will lead you to the Spark UI once the job is running. If you're using CM, you can reach the same link by clicking on the "Resource Manager UI" link on your Yarn service, then find

Re: Question About Submit Application

2014-09-24 Thread Marcelo Vanzin
Sounds like "spark-01" is not resolving correctly on your machine (or is the wrong address). Can you ping "spark-01" and does that reach the VM where you set up the Spark Master? On Wed, Sep 24, 2014 at 1:12 PM, danilopds wrote: > Hello, > I'm learning about Spark Streaming and I'm really excited

Re: MLUtils.loadLibSVMFile error

2014-09-24 Thread Liquan Pei
Hi Sameer, This seems to be a file format issue. Can you make sure that your data satisfies the format? Each line of libsvm format is as follows: <label> <index1>:<value1> <index2>:<value2> ... Thanks, Liquan On Wed, Sep 24, 2014 at 3:02 PM, Sameer Tilak wrote: > Hi All, > > > When I try to load dataset using MLUtils.loadLibSV
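For reference, a hedged sketch of loading such a file (the path is a placeholder); indices are one-based and should appear in ascending order on each line:

  import org.apache.spark.mllib.util.MLUtils

  // Example line: "1.0 1:0.5 3:1.2 7:0.3"  (label, then index:value pairs)
  val data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/data.libsvm")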

Re: How to sort rdd filled with existing data structures?

2014-09-24 Thread Liquan Pei
You only need to define an ordering for Student; there is no need to modify the class definition of Student. It's like a Comparator class in Java. Currently, you have to map the rdd to sort by value. Liquan On Wed, Sep 24, 2014 at 9:52 AM, Sean Owen wrote: > See the scaladoc for how to define an implici

Re: Anybody built the branch for Adaptive Boosting, extension to MLlib by Manish Amde?

2014-09-24 Thread Aris
Hi Manish! Thanks for the reply and the explication on why the branches won't compile -- that makes perfect sense. You mentioned making the branch compatible with the latest master; could you share some more details? Which branch do you mean -- is it on your GitHub? And would I just be able to do

Re: return probability \ confidence instead of actual class

2014-09-24 Thread Aris
Greetings, Adamantios Korais... if that is indeed your name. Just to follow up on Liquan, you might be interested in removing the thresholds, and then treating the predictions as a probability from 0..1 inclusive. SVM with the linear kernel is a straightforward linear classifier -- so you with the m
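A hedged sketch of what "removing the threshold" looks like in MLlib (the training/test data names and iteration count are placeholders):

  val model = SVMWithSGD.train(training, numIterations = 100)
  model.clearThreshold()  // predict() now returns the raw margin instead of a 0/1 label
  val scored = test.map(p => (model.predict(p.features), p.label))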

Processing multiple request in cluster

2014-09-24 Thread Subacini B
Hi all, How can I run multiple requests concurrently on the same cluster? I have a program using *spark streaming context* which reads *streaming data* and writes it to HBase. It works fine; the problem is that when multiple requests are submitted to the cluster, only the first request is processed, as the entire clu

Re: return probability \ confidence instead of actual class

2014-09-24 Thread Sunny Khatri
For multi-class you can use the same SVMWithSGD (for binary classification) with a one-vs-all approach, constructing the respective training corpora with class i as positive samples and the rest of the classes as negative ones, and then use the same method provided by Aris as a measure of how far Cl
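A hedged sketch of that one-vs-all construction (class count, data names and iteration count are placeholders):

  import org.apache.spark.mllib.classification.SVMWithSGD
  import org.apache.spark.mllib.linalg.Vector
  import org.apache.spark.mllib.regression.LabeledPoint

  val numClasses = 3
  val models = (0 until numClasses).map { c =>
    // Class c becomes the positive class, everything else negative
    val binary = labeled.map(p => LabeledPoint(if (p.label.toInt == c) 1.0 else 0.0, p.features))
    val m = SVMWithSGD.train(binary, 100)
    m.clearThreshold()  // keep raw margins so the per-class scores are comparable
    m
  }
  // Predict by taking the class whose model produces the largest margin
  def predictClass(v: Vector): Int =
    models.zipWithIndex.maxBy { case (m, _) => m.predict(v) }._2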
