RE: Bulk-load to HBase

2014-09-22 Thread innowireless TaeYun Kim
Thank you all. With your help, I managed to create HFiles for a small RDD, but for a large RDD I gave up (at least for now). The reason is as follows. First, the flow of my code (it's Java): { // Create an RDD which has the records JavaPairRDD rdd = // MyKey has a by

Re: Problem with pyspark command line invocation -- option truncation... (Spark v1.1.0) ...

2014-09-22 Thread Sandy Ryza
Agreed with Andrew - the Spark properties file is usually a much nicer way of specifying a bunch of properties. To be clear, --driver-memory still needs to go on the command line. -Sandy On Mon, Sep 22, 2014 at 1:49 AM, Andrew Or wrote: > Hi Didata, > > An alternative to what Sandy proposed is

Re: Possibly a dumb question: differences between saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset?

2014-09-22 Thread Matei Zaharia
File takes a filename to write to, while Dataset takes only a JobConf. This means that Dataset is more general (it can also save to storage systems that are not file systems, such as key-value stores), but is more annoying to use if you actually have a file. Matei On September 21, 2014 at 11:2
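
For illustration, here is a minimal Scala sketch of the two calls side by side. It assumes rdd is an RDD[(NullWritable, Text)] and the HDFS paths are placeholders.

  import org.apache.spark.SparkContext._
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.{NullWritable, Text}
  import org.apache.hadoop.mapreduce.Job
  import org.apache.hadoop.mapreduce.lib.output.{FileOutputFormat, TextOutputFormat}

  // File variant: give it the path plus key/value/output-format classes directly.
  rdd.saveAsNewAPIHadoopFile(
    "hdfs:///tmp/out",
    classOf[NullWritable], classOf[Text],
    classOf[TextOutputFormat[NullWritable, Text]])

  // Dataset variant: the target is described entirely by the job configuration,
  // which is why it also works for sinks that are not file systems (e.g. key-value stores).
  val job = Job.getInstance(new Configuration())
  job.setOutputFormatClass(classOf[TextOutputFormat[NullWritable, Text]])
  job.setOutputKeyClass(classOf[NullWritable])
  job.setOutputValueClass(classOf[Text])
  FileOutputFormat.setOutputPath(job, new Path("hdfs:///tmp/out2"))
  rdd.saveAsNewAPIHadoopDataset(job.getConfiguration)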

Spark - Apache Blur Connector : Index Kafka Messages into Blur using Spark Streaming

2014-09-22 Thread Dibyendu Bhattacharya
Hi, for the last few days I have been working on a Spark - Apache Blur Connector to index Kafka messages into Apache Blur using Spark Streaming. We have been working to build a distributed search platform for our NRT use cases and have been playing with Spark Streaming and Apache Blur for it. We are

Spark SQL 1.1.0: NPE when join two cached table

2014-09-22 Thread Haopu Wang
I have two data sets and want to join them on their first field. Sample data are below: data set 1: id2,name1,2,300.0 data set 2: id1, The code is something like below: val sparkConf = new SparkConf().setAppName("JoinInScala") val sc = new SparkContext(spar

RE: Possibly a dumb question: differences between saveAsNewAPIHadoopFile and saveAsNewAPIHadoopDataset?

2014-09-22 Thread innowireless TaeYun Kim
Thank you. Now I’ve read part of PairRDDFunctions.scala, and I’ve found that saveAsNewAPIHadoopFile is just a thin (convenience) wrapper around saveAsNewAPIHadoopDataset. From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Monday, September 22, 2014 5:12 PM To: user@spark.apache.o

Re: Bulk-load to HBase

2014-09-22 Thread Sean Owen
I see a number of potential issues: On Mon, Sep 22, 2014 at 8:42 AM, innowireless TaeYun Kim wrote: > JavaPairRDD rdd = > // MyKey has a byte[] member for rowkey Two byte[] with the same contents are not equals(), so won't work as you intend as a key. Is there more to it? I assume so
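
For reference, a hypothetical sketch of such a wrapper (the actual MyKey in the thread may differ): equality, hashing and ordering are all based on the array contents, so Spark's hash- and sort-based shuffles treat equal rowkeys as the same key.

  import java.util.Arrays

  class ByteArrayKey(val bytes: Array[Byte]) extends Ordered[ByteArrayKey] with Serializable {
    override def equals(other: Any): Boolean = other match {
      case that: ByteArrayKey => Arrays.equals(bytes, that.bytes)
      case _                  => false
    }
    override def hashCode: Int = Arrays.hashCode(bytes)
    // unsigned lexicographic comparison, the same order HBase uses for rowkeys
    override def compare(that: ByteArrayKey): Int = {
      val len = math.min(bytes.length, that.bytes.length)
      var i = 0
      while (i < len) {
        val c = (bytes(i) & 0xff) - (that.bytes(i) & 0xff)
        if (c != 0) return c
        i += 1
      }
      bytes.length - that.bytes.length
    }
  }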

RE: Bulk-load to HBase

2014-09-22 Thread innowireless TaeYun Kim
Thank you for your detailed reply. First, the purpose of the MyKey class is to wrap byte[] and provide equals() and the Comparable interface. groupByKey() is for performance: I have to merge the byte[]s that have the same key. If merging is done with reduceByKey(), a lot of intermediate byte[] alloc

Re: Bulk-load to HBase

2014-09-22 Thread Sean Owen
On Mon, Sep 22, 2014 at 10:21 AM, innowireless TaeYun Kim wrote: > I have to merge the byte[]s that have the same key. > If merging is done with reduceByKey(), a lot of intermediate byte[] > allocation and System.arraycopy() is executed, and it is too slow. So I had > to resort to groupByKey(),
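
One middle ground, sketched below in Scala under the assumption that rdd is a pair RDD of (MyKey, Array[Byte]): aggregateByKey (available as of 1.1) folds each value into a growable per-key buffer, which avoids both groupByKey's full materialization per key and reduceByKey's repeated whole-array copies. As with groupByKey, the order in which values are appended per key is not guaranteed.

  import org.apache.spark.SparkContext._
  import scala.collection.mutable.ArrayBuffer

  val merged = rdd.aggregateByKey(ArrayBuffer.empty[Byte])(
    (buf, bytes) => buf ++= bytes,   // append one value; the buffer grows amortized, no full re-copy
    (a, b)       => a ++= b          // merge two partial buffers
  ).mapValues(_.toArray)
  // Note: ArrayBuffer[Byte] boxes its elements, which is acceptable for small per-key totals
  // but worth replacing with a primitive buffer for very large groups.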

RE: Bulk-load to HBase

2014-09-22 Thread innowireless TaeYun Kim
Thank you. Actually I have numbers for the merged byte[] size for a group for my test input data. Max merged byte[] size per group is about 16KB, while the average is about 1KB. It’s not that big. Since the executor memory was set to 4GB and spark.storage.memoryFraction set to 0.3, I think it i

Getting RDD load progress

2014-09-22 Thread cyril.ponomaryov
I have an ImpalaRDD class: public class ImpalaRDD extends RDD { ... @Override public Iterator compute(Partition partition, TaskContext context) { ImpalaPartitionIterator iterator = new ImpalaPartitionIterator<>(mapRow, (JdbcPartition) partition, connectionProvider, partiti

ParquetRecordReader warnings: counter initialization

2014-09-22 Thread Andrew Ash
Hi All, I'm seeing the below WARNINGs in stdout using Spark SQL in Spark 1.1.0 -- is this warning a known issue? I don't see any open Jira tickets for it. Sep 22, 2014 10:13:54 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block Sep 22, 2014 10:13:54 AM INFO: parque

Re: Issues with partitionBy: FetchFailed

2014-09-22 Thread Andrew Ash
Hi David and Saisai, Are the exceptions you two are observing similar to the first one at https://issues.apache.org/jira/browse/SPARK-3633 ? Copied below: 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, c1706.h

Re: Issues with partitionBy: FetchFailed

2014-09-22 Thread David Rowe
Yep, this is what I was seeing. I'll experiment tomorrow with a version prior to the changeset in that ticket. On Mon, Sep 22, 2014 at 8:29 PM, Andrew Ash wrote: > Hi David and Saisai, > > Are the exceptions you two are observing similar to the first one at > https://issues.apache.org/jira/brows

RE: Issues with partitionBy: FetchFailed

2014-09-22 Thread Shao, Saisai
I didn’t hit this issue (too many open files) as you did, because I set a relatively large open-file limit in Linux, like 640K. What I was seeing is that one executor pauses without doing anything, the resources are not fully used, and one CPU core runs at 100%, so I assume this p

Spark flume java.lang.ArrayIndexOutOfBoundsException: 1

2014-09-22 Thread centerqi hu
Hi all, is this a bug in Spark? val Array(host, IntParam(port)) = args val batchInterval = Milliseconds(12) // Create the context and set the batch size val sparkConf = new SparkConf().setAppName("FlumeEventCount") val ssc = new StreamingContext(sparkConf, batchInter

Error while calculating the max temperature

2014-09-22 Thread Praveen Sripati
Hi, I am writing a Spark program in Python to find the maximum temperature for a year, given a weather dataset. The program below throws an error when I try to execute it. TypeError: 'NoneType' object is not iterable org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.sc

Re: Spark streaming twitter exception

2014-09-22 Thread Maisnam Ns
Yes, it works with these dependencies. Thank you. Now I am able to run Spark Twitter streaming. On Mon, Sep 22, 2014 at 12:09 PM, Akhil Das wrote: > Can you try adding these dependencies? > > libraryDependencies += "org.apache.spark" % "spark-streaming-twitter_2.10" > % "1.0.1" > libraryDepe

Re: Spark streaming stops computing while the receiver keeps running without any errors reported

2014-09-22 Thread Aniket Bhatnagar
Hi all I was finally able to figure out why this streaming appeared stuck. The reason was that I was running out of workers in my standalone deployment of Spark. There was no feedback in the logs which is why it took a little time for me to figure out. However, now I am trying to run the same in

Unable to change the Ports

2014-09-22 Thread jishnu.prathap
Hi everyone, I am new to Spark. I am posting some basic doubts I ran into while trying to create a standalone cluster for a small PoC. 1) My corporate firewall blocked port 7077, which is the default port of the master URL, so I used start-master.sh --port 8080 (also tried with several other po

Out of memory exception in MLlib's naive Bayes classification training

2014-09-22 Thread jatinpreet
Hi, I have been facing an unusual issue with naive Bayes training. I run out of heap space even with limited data during the training phase. I am trying to run it on a rudimentary cluster of two development machines in standalone mode. I am reading data from an HBase table, converting them in

Re: Spark streaming twitter exception

2014-09-22 Thread Akhil Das
Awesome! :] Thanks Best Regards On Mon, Sep 22, 2014 at 6:10 PM, Maisnam Ns wrote: > Yes, it works with these dependencies . Thank you. Now , I am able to run > Spark twitter streaming . > > On Mon, Sep 22, 2014 at 12:09 PM, Akhil Das > wrote: > >> Can you try adding these dependencies? >> >>

Re: Issues with partitionBy: FetchFailed

2014-09-22 Thread David Rowe
So, spinning up an identical cluster with spark 1.0.1, my job fails with the error: 14/09/22 12:30:28 WARN scheduler.TaskSetManager: Lost task 217.0 in stage 4.0 (TID 66613, ip-10-251-34-68.ap-southeast-2.compute.internal): org.apache.spark.SparkException: Error communicating with MapOutputTracker

Re: exception in spark 1.1.0

2014-09-22 Thread Chen Song
snappy native lib was installed as part of hadoop (cdh5.1.0) the location is in /usr/lib/hadoop/lib/ -rwxr-xr-x 1 hdfs hadoop 23904 Jul 12 14:11 libsnappy.so.1.1.3 lrwxrwxrwx 1 hdfs hadoop 18 Aug 11 20:56 libsnappy.so.1 -> libsnappy.so.1.1.3 lrwxrwxrwx 1 hdfs hadoop 18 Aug 11 20:56 libsn

Re: GraphX : AssertionError

2014-09-22 Thread Keith Massey
The triangle count also failed for me when I ran it on more than one node. There is this assertion in TriangleCount.scala that causes the failure: // double count should be even (divisible by two) assert((dblCount & 1) == 0) That did not hold true when I ran this on multiple nodes,

Re: Error while calculating the max temperature

2014-09-22 Thread Praveen Sripati
During the map, if some of the rows are ignored based on some conditions (without any transformation), then there is a None record in the output RDD for each ignored record. And reduceByKey is not able to handle this type of None record, hence the exception. I tried filter, but it is also not

Re: Error while calculating the max temperature

2014-09-22 Thread Sean Owen
If your map() sometimes does not emit an element, then you need to call flatMap() instead, and emit Some(value) (or any collection of values) if there is an element to return, or None otherwise. On Mon, Sep 22, 2014 at 4:50 PM, Praveen Sripati wrote: > During the map based on some conditions if s
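
The same pattern in Scala form, with a made-up field layout purely for illustration (PySpark's flatMap behaves the same way with a returned list):

  import org.apache.spark.SparkContext._

  val pairs = lines.flatMap { line =>
    val fields = line.split(",")               // hypothetical layout: year,temp,quality
    if (fields(1) != "+" && fields(2).matches("[01459]"))
      Some((fields(0), fields(1).toInt))       // keep: emit one (year, temp) pair
    else
      None                                     // skip: flatMap emits nothing for this record
  }
  val maxTemps = pairs.reduceByKey((a, b) => math.max(a, b))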

Re: Spark SQL 1.1.0: NPE when join two cached table

2014-09-22 Thread Yin Huai
It is a bug. I have created https://issues.apache.org/jira/browse/SPARK-3641 to track it. Thanks for reporting it. Yin On Mon, Sep 22, 2014 at 4:34 AM, Haopu Wang wrote: > I have two data sets and want to join them on each first field. Sample > data are below: > > > > data set 1: > > id2,na

Setup an huge Unserializable Object in a mapper

2014-09-22 Thread matthes
Hello everybody! I’m a newbie in Spark and I hope my problem is solvable! I need to set up an instance which I want to use in a mapper function. The problem is that it is not Serializable, and the broadcast function is not an option for me. The instance can become very huge (e.g. 1GB-10GB). Is there a way to se

Is there any way (in Java) to make a JavaRDD from an iterable

2014-09-22 Thread Steve Lewis
The only way I can find is to turn it into a list - in effect holding everything in memory (see code below). Surely Spark has a better way. Also, what about unterminated iterables like a Fibonacci series (useful only if limited in some other way)? /** * make an RDD from an iterable * @

spark time out

2014-09-22 Thread Chen Song
I am using Spark 1.1.0 and have seen a lot of Fetch Failures due to the following exception. java.io.IOException: sendMessageReliably failed because ack was not received within 60 sec at org.apache.spark.network.ConnectionManager$$anon$5$$anonfun$run$15.apply(ConnectionManager.scala:854)

Re: Error while calculating the max temperature

2014-09-22 Thread Praveen Sripati
Hi Sean, Thanks for the response. I changed from map to flatMap, and in the function I return a list as below: if (temp != "+" and re.match("[01459]", q)): return [(year,temp)] else: return [] Thanks, Praveen On Mon, Sep 22, 2014 at 9:26 PM, Sean Owen wrote: > If your map() sometimes

Re: Is there any way (in Java) to make a JavaRDD from an iterable

2014-09-22 Thread Victor Tso-Guillen
You can write to disk and have Spark read it as a stream. This is how Hadoop files are iterated in Spark. On Mon, Sep 22, 2014 at 9:22 AM, Steve Lewis wrote: >The only way I find is to turn it into a list - in effect holding > everything in memory (see code below). Surely Spark has a better

SparkSQL: Key not valid while running TPC-H

2014-09-22 Thread Samay
Hi, I am trying to run TPC-H queries with SparkSQL 1.1.0 CLI with 1 r3.4xlarge master + 20 r3.4xlarge slave machines on EC2 (each machine has 16vCPUs, 122GB memory). The TPC-H scale factor I am using is 1000 (i.e. 1000GB of total data). When I try to run TPC-H query 3 i.e. select l_orderkey, sum(

Re: java.lang.NegativeArraySizeException in pyspark

2014-09-22 Thread Davies Liu
The traceback said that the serialized closure cannot be parsed (base64) correctly by py4j. A String in Java cannot be longer than 2G, so the serialized closure cannot be longer than about 1.5G (there is overhead in base64). Is it possible that the data used in your map function is that big? If it is, you

Accumulator Immutability?

2014-09-22 Thread Vikram Kalabi
Consider this snippet from the Spark scaladoc: scala> val accum = sc.accumulator(0) accum: spark.Accumulator[Int] = 0 scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x) ...10/09/29 18:41:08 INFO S

RE: Accumulator Immutability?

2014-09-22 Thread abraham.jacob
I am also not familiar with Scala, but I believe the concept is similar to the concept of String in Java. accum points to an “Accumulator”. You can change what it points to, but not that which it points to. From: Vikram Kalabi [mailto:vikram.apache@gmail.com] Sent: Monday, September 22, 2014

Re: Accumulator Immutability?

2014-09-22 Thread Soumya Simanta
I don't know the exact implementation of accumulator. You can look at the sources. But for Scala look at the following REPL session. scala> val al = new ArrayList[String]() al: java.util.ArrayList[String] = [] scala> al.add("a") res1: Boolean = true scala> al res2: java.util.ArrayList[Strin

Re: Accumulator Immutability?

2014-09-22 Thread Sean Owen
'accum' is a reference that can't point to another object because it's a val. However, the object it points to can certainly change state. 'val' has an effect mostly like 'final' in Java. Although the "accum += ..." syntax might lead you to believe it's executing "accum = accum + ...", as it would in
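
A rough REPL-style sketch of the distinction (output abbreviated):

  scala> val accum = sc.accumulator(0)
  scala> accum += 5      // fine: desugars to accum.+=(5), which mutates the Accumulator's value
  scala> accum.value     // reading the value is only meaningful on the driver
  res1: Int = 5
  scala> // accum = sc.accumulator(1)   // would not compile: a val cannot be reassigned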

Re: Is there any way (in Java) to make a JavaRDD from an iterable

2014-09-22 Thread Steve Lewis
Is there a way to write it as a temporary file? Also, what about a Stream - something like an RSS feed? On Mon, Sep 22, 2014 at 10:21 AM, Victor Tso-Guillen wrote: > You can write to disk and have Spark read it as a stream. This is how > Hadoop files are iterated in Spark. > > On Mon, Sep 22, 2014 at

Re: Is there any way (in Java) to make a JavaRDD from an iterable

2014-09-22 Thread Sean Owen
I imagine it is because parallelize() inherently only makes sense for smallish data to begin with, since it will have to be broadcast from the driver. Large enough data should probably live in distributed storage to begin with. The Scala equivalent wants a Seq, so I assume there is some need or va
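
Sean's point in code form, as a minimal Scala sketch: parallelize materializes a driver-side Seq, so an Iterable has to be pulled into memory first, which is fine for small data but not for anything unbounded.

  val iterable: Iterable[Int] = 1 to 1000          // small, already fits in driver memory
  val rdd = sc.parallelize(iterable.toSeq, 4)      // toSeq materializes it; 4 partitions

  // For data too large for the driver, write it to distributed storage and read it back instead:
  // sc.textFile("hdfs:///path/to/data")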

Re: secondary sort

2014-09-22 Thread Daniil Osipov
Adding an issue in JIRA would help keep track of the feature request: https://issues.apache.org/jira/browse/SPARK On Sat, Sep 20, 2014 at 7:39 AM, Koert Kuipers wrote: > now that spark has a sort based shuffle, can we expect a secondary sort > soon? there are some use cases where getting a sort

Re: ParquetRecordReader warnings: counter initialization

2014-09-22 Thread Michael Armbrust
These are coming from the parquet library and as far as I know can be safely ignored. On Mon, Sep 22, 2014 at 3:27 AM, Andrew Ash wrote: > Hi All, > > I'm seeing the below WARNINGs in stdout using Spark SQL in Spark 1.1.0 -- > is this warning a known issue? I don't see any open Jira tickets for

Running Spark in Local Mode vs. Single Node Cluster

2014-09-22 Thread kriskalish
I'm in a situation where I'm running Spark streaming on a single machine right now. The plan is to ultimately run it on a cluster, but for the next couple months it will probably stay on one machine. I tried to do some digging and I can't find any indication of whether it's better to run spark as

Re: Weird aggregation results when reusing objects inside reduceByKey

2014-09-22 Thread kriskalish
Thanks for the insight, I didn't realize there was internal object reuse going on. Is this a mechanism of Scala/Java or is this a mechanism of Spark? I actually just converted the code to use immutable case classes everywhere, so it will be a little tricky to test foldByKey(). I'll try to get to i

Re: SparkSQL Thriftserver in Mesos

2014-09-22 Thread John Omernik
Any thoughts on this? On Sat, Sep 20, 2014 at 12:16 PM, John Omernik wrote: > I am running the Thrift server in SparkSQL, and running it on the node I > compiled spark on. When I run it, tasks only work if they landed on that > node, other executors started on nodes I didn't compile spark on (a

Re: mllib performance on mesos cluster

2014-09-22 Thread Xiangrui Meng
1) MLlib 1.1 should be faster than 1.0 in general. What's the size of your dataset? Is the RDD evenly distributed across nodes? You can check the storage tab of the Spark WebUI. 2) I don't have much experience with mesos deployment. Someone else may be able to answer your question. -Xiangrui On

Re: SparkSQL Thriftserver in Mesos

2014-09-22 Thread Dean Wampler
The Mesos install guide says this: "To use Mesos from Spark, you need a Spark binary package available in a place accessible by Mesos, and a Spark driver program configured to connect to Mesos." For example, putting it in HDFS or copying it to each node in the same location should do the trick.
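
In practice that usually means uploading the Spark distribution tarball somewhere every slave can reach (HDFS, S3, or an identical local path) and pointing spark.executor.uri at it - a sketch with placeholder locations:

  # conf/spark-defaults.conf
  spark.master        mesos://zk://zk1:2181,zk2:2181/mesos
  spark.executor.uri  hdfs:///spark/spark-1.1.0-bin-hadoop2.4.tgz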

Re: Out of memory exception in MLlib's naive Bayes classification training

2014-09-22 Thread Xiangrui Meng
Does the feature size of 43839 equal the number of terms? Check the output dimension of your feature vectorizer and reduce the number of partitions to match the number of physical cores. I saw you set spark.storage.memoryFraction to 0.0. Maybe it is better to keep the default. Also please confirm the driver
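
A rough Scala sketch of those suggestions, with illustrative numbers (two machines with 4 cores each) and a hypothetical labeledPoints RDD:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.classification.NaiveBayes

  // leave spark.storage.memoryFraction at its default (0.6) rather than forcing it to 0.0
  val conf = new SparkConf().setAppName("NaiveBayesTraining")
  val sc = new SparkContext(conf)

  // keep the partition count close to the number of physical cores (2 machines x 4 cores)
  val training = labeledPoints.coalesce(8).cache()
  val model = NaiveBayes.train(training)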

Spark SQL CLI

2014-09-22 Thread gtinside
Hi, I have been using the spark shell to execute all SQLs. I am connecting to Cassandra, converting the data to JSON and then running queries on it. I am using HiveContext (and not SQLContext) because of the "explode" functionality in it. I want to see how I can use the Spark SQL CLI for directly runnin

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Greg Hill
I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers. It's complaining that I'm not allocating enough memory to the memoryOverhead values. I tracked it down to this code: https://github.com/apache/spark/blob/

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Nishkam Ravi
Greg, if you look carefully, the code is enforcing that the memoryOverhead be lower (and not higher) than spark.driver.memory. Thanks, Nishkam On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill wrote: > I thought I had this all figured out, but I'm getting some weird errors > now that I'm attempting t

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Greg Hill
Gah, ignore me again. I was reading the logic backwards. For some reason it isn't picking up my SPARK_DRIVER_MEMORY environment variable and is using the default of 512m. Probably an environmental issue. Greg From: Greg mailto:greg.h...@rackspace.com>> Date: Monday, September 22, 2014 3:26 P

The wikipedia Extraction (WEX) Dataset

2014-09-22 Thread daidong
I watched several presentations from AMP Camp 2013. Many of the Spark examples are about extracting information from the tsv-format Wikipedia Extraction (WEX) dataset (around 66 GB). It used to be provided as an open data set on Amazon EBS, but now it seems to have disappeared. I really want to use these

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Greg Hill
Ah, I see. It turns out that my problem is that that comparison is ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m. Is that a bug that's since fixed? I'm on 1.0.1 and using 'yarn-cluster' as the master. 'yarn-client' seems to pick up the values and works fine. Greg From:

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Nishkam Ravi
Maybe try --driver-memory if you are using spark-submit? Thanks, Nishkam On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill wrote: > Ah, I see. It turns out that my problem is that that comparison is > ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m. Is that > a bug that's since fi

RDD data checkpoint cleaning

2014-09-22 Thread RodrigoB
Hi all, I've just started to take Spark Streaming recovery more seriously as things get more serious with the project roll-out. We need to ensure full recovery on all Spark levels - driver, receiver and worker. I started doing some tests today and have become concerned by the current findings. I

Re: Setup an huge Unserializable Object in a mapper

2014-09-22 Thread Sean Owen
I think this was covered in this thread last week: https://www.mail-archive.com/user@spark.apache.org/msg10493.html Try a singleton pattern to call this once per JVM. That only makes much sense if this object is immutable. On Mon, Sep 22, 2014 at 5:11 PM, matthes wrote: > Hello everybody! > > I’
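
A minimal Scala sketch of that singleton pattern; HeavyModel, loadModel and score are hypothetical stand-ins for the non-serializable resource:

  // Built lazily, at most once per executor JVM, instead of being shipped inside the closure.
  object HeavyResource {
    lazy val instance: HeavyModel = loadModel()   // hypothetical 1-10 GB object
  }

  val scored = records.mapPartitions { iter =>
    val model = HeavyResource.instance            // the same instance is reused for the whole partition
    iter.map(record => model.score(record))
  }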

Streaming: HdfsWordCount does not print any output

2014-09-22 Thread SK
Hi, I tried running the HdfsWordCount program in the streaming examples in Spark 1.1.0. I provided a directory in the distributed filesystem as input. This directory has one text file. However, the only thing that the program keeps printing is the time - but not the word count. I have not used the

Re: ParquetRecordReader warnings: counter initialization

2014-09-22 Thread Andrew Ash
Thanks for the info Michael. I see this in a few other places in the Impala+Parquet context, but a real quick scan didn't reveal any leads on this warning. I'll ignore it for now. Andrew On Mon, Sep 22, 2014 at 12:16 PM, Michael Armbrust wrote: > These are coming from the parquet library and as far

Re: Spark SQL CLI

2014-09-22 Thread Yin Huai
Hi Gaurav, Can you put hive-site.xml in conf/ and try again? Thanks, Yin On Mon, Sep 22, 2014 at 4:02 PM, gtinside wrote: > Hi , > > I have been using spark shell to execute all SQLs. I am connecting to > Cassandra , converting the data in JSON and then running queries on it, I > am using Hi
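
For reference, a minimal conf/hive-site.xml sketch; the Derby database path and warehouse directory are placeholders and should point at the same locations the tables were originally created with:

  <configuration>
    <property>
      <name>javax.jdo.option.ConnectionURL</name>
      <value>jdbc:derby:;databaseName=/path/to/metastore_db;create=true</value>
    </property>
    <property>
      <name>hive.metastore.warehouse.dir</name>
      <value>/path/to/hive/warehouse</value>
    </property>
  </configuration>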

Re: Streaming: HdfsWordCount does not print any output

2014-09-22 Thread SK
This issue is resolved. The file needs to be created after the program has started to execute. thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Streaming-HdfsWordCount-does-not-print-any-output-tp14849p14852.html Sent from the Apache Spark User List

Re: Spark SQL CLI

2014-09-22 Thread Gaurav Tiwari
Hi, I tried setting the metastore and metastore_db location in conf/hive-site.xml to the directories created in the Spark bin folder (they were created when I ran the spark shell and used LocalHiveContext), but it still doesn't work. Do I need to save my RDD as a table through HiveContext to make thi

Re: The wikipedia Extraction (WEX) Dataset

2014-09-22 Thread daidong
Really sorry to bother everybody. It was my mistake. The data set is still on Amazon and can be downloaded. The reason for my failure is that I started an instance outside the U.S., so I could not attach the EBS volume. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.c

Re: Spark SQL CLI

2014-09-22 Thread Yin Huai
Hi Gaurav, It seems metastore is created by LocalHiveContext and metastore_db by a regular HiveContext. Can you check whether you are still using LocalHiveContext when you try to access your tables? Also, if you created those tables when you launched your sql cli under bin/, you

Re: Solving Systems of Linear Equations Using Spark?

2014-09-22 Thread durin
Hey Deb, sorry for the late answer; I've been travelling and won't have much time until in a few days. To be precise, it's not me who has to solve the problem, but a person I know well whom I'd like to help with a possibly faster method. I'll try to state the facts as well as I know them,

Re: Issues with partitionBy: FetchFailed

2014-09-22 Thread Andrew Ash
Hi Jerry, For the one executor hung with one CPU core running at 100%, that sounds exactly like the symptoms I observed in https://issues.apache.org/jira/browse/SPARK-2546 around a deadlock with the JobConf. The next time that happens can you take a jstack and compare with the one on the ticket?
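
For reference, one way to grab that thread dump on the affected node (the grep target assumes the standard executor process name; the pid is a placeholder):

  # find the executor JVM on the hung node, then dump its threads for comparison with the ticket
  jps -lm | grep CoarseGrainedExecutorBackend
  jstack <executor-pid> > executor-threads.txt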

Cancelled Key exception

2014-09-22 Thread abraham.jacob
Hi Sparklers, I was wondering if someone else has also encountered this... (Actually I am not even sure if this is an issue.) I have a Spark job that reads data from HBase and does a bunch of transformations: sparkContext.newAPIHadoopRDD -> flatMapToPair -> groupByKey -> mapValues After this I do a t

RE: Issues with partitionBy: FetchFailed

2014-09-22 Thread Shao, Saisai
Hi Andrew, I will try again using jstack. My question is: will this deadlock issue also lead to the FetchFailed exception? Thanks Jerry From: Andrew Ash [mailto:and...@andrewash.com] Sent: Tuesday, September 23, 2014 8:29 AM To: Shao, Saisai Cc: David Rowe; Julien Carme; user@spark.apache.org Su

Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Andrew Or
Hi Greg, From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not actually picked up in cluster mode. This is a bug and I have opened a PR to fix it: https://github.com/apache/spark/pull/2500. For now, please use --driver-memory instead, which should work for both client and cluster mo
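
For example (the class and jar names here are placeholders):

  spark-submit \
    --master yarn-cluster \
    --driver-memory 4g \
    --executor-memory 4g \
    --class com.example.MyApp \
    my-app.jar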

Re: secondary sort

2014-09-22 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-3655 On Mon, Sep 22, 2014 at 3:11 PM, Daniil Osipov wrote: > Adding an issue in JIRA would help keep track of the feature request: > > https://issues.apache.org/jira/browse/SPARK > > On Sat, Sep 20, 2014 at 7:39 AM, Koert Kuipers wrote: > >> now that

Java Implementation of StreamingContext.fileStream

2014-09-22 Thread Michael Quinlan
I'm attempting to code a Java-only implementation that accesses the StreamingContext.fileStream method, and am especially interested in setting the boolean "newFilesOnly" to false. Unfortunately my code throws exceptions: Exception in thread "main" java.lang.InstantiationException at sun.reflec

Re: java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2

2014-09-22 Thread Ankur Dave
I diagnosed this problem today and found that it's because the GraphX custom serializers make an assumption that is violated by sort-based shuffle. I filed SPARK-3649 explaining the problem and submitted a PR to fix it [2]. The fix removes the custom serializers, which has a 10% performance pena

RE: Exception with SparkSql and Avro

2014-09-22 Thread Zalzberg, Idan (Agoda)
Hello, I am trying to read a Hive table that is stored in Avro DEFLATE files, with something simple like "SELECT * FROM X LIMIT 10". I get 2 exceptions in the logs: 2014-09-23 09:27:50,157 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 10.0 in stage 1.0 (TID 10, cl.local): org.apache.avro.A

Change number of workers and memory

2014-09-22 Thread Dhimant
I have a Spark cluster where some nodes have high-performance hardware and others have commodity specs (lower configuration). When I configure worker memory and instances in spark-env.sh, it applies to all the nodes. Can I change the SPARK_WORKER_MEMORY and SPARK_WORKER_INSTANCES properties per node/ma

Re: Change number of workers and memory

2014-09-22 Thread Liquan Pei
Hi Dhimant, One thread related to your question is http://apache-spark-user-list.1001560.n3.nabble.com/heterogeneous-cluster-hardware-td11567.html One argument for setting the same SPARK_WORKER_MEMORY on every machine is that all tasks in a stage have to finish in order for the next stage to
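
If per-node values are still wanted despite that caveat, spark-env.sh is read locally by each worker, so the settings can differ per machine - illustrative numbers only:

  # conf/spark-env.sh on a high-performance node
  SPARK_WORKER_MEMORY=48g
  SPARK_WORKER_INSTANCES=2

  # conf/spark-env.sh on a commodity node
  SPARK_WORKER_MEMORY=8g
  SPARK_WORKER_INSTANCES=1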

Why recommend 2-3 tasks per CPU core ?

2014-09-22 Thread myasuka
We are now implementing a matrix multiplication algorithm on Spark, which was previously designed in the traditional MPI style. It assumes every core in the grid computes in parallel. Now, in our development environment, each executor node has 16 cores, and I assign 16 tasks to each executor node to

Re: "sbt/sbt run" command returns a JVM problem

2014-09-22 Thread christy
Thanks very much, it seems to be working... -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sbt-sbt-run-command-returns-a-JVM-problem-tp5157p14870.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: Why recommend 2-3 tasks per CPU core ?

2014-09-22 Thread Nicholas Chammas
On Tue, Sep 23, 2014 at 1:58 AM, myasuka wrote: > Thus I want to know why recommend > 2-3 tasks per CPU core? > You want at least 1 task per core so that you fully utilize the cluster's parallelism. You want 2-3 tasks per core so that tasks are a bit smaller than they would otherwise be, makin
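
As a concrete (assumed) example: with the 16-core nodes described above and, say, 4 of them - 64 cores total - the target would be roughly 128-192 tasks rather than 64. A Scala sketch:

  // set default parallelism to ~2-3x the total core count
  val conf = new org.apache.spark.SparkConf()
    .setAppName("MatrixMultiply")
    .set("spark.default.parallelism", (64 * 3).toString)

  // or repartition the specific RDD that feeds the heavy stage (matrixRdd is hypothetical)
  val blocks = matrixRdd.repartition(64 * 3)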

Re: Java Implementation of StreamingContext.fileStream

2014-09-22 Thread Akhil Das
Here's a working version that we have. > DStream> hadoopDStream = > streamingContext.fileStream("/akhld/lookhere/", new Function Object>(){ > @Override > public Object call(Path path) throws Exception { > // TODO Auto-generated method stub > return !path.getName().startsWith("."); > } }, true, Sp
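
For comparison, the Scala form of the same call, which exposes the filter and newFilesOnly parameters directly (ssc is an existing StreamingContext and the directory is a placeholder):

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val stream = ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///data/incoming",
    (path: Path) => !path.getName.startsWith("."),   // skip hidden/temporary files
    newFilesOnly = false)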