Re: ALS failure with size > Integer.MAX_VALUE

2014-11-30 Thread Sean Owen
(It won't be that, since you see that the error occurs when reading a block from disk. I think this is an instance of the 2GB block size limitation.) On Sun, Nov 30, 2014 at 4:36 AM, Ganelin, Ilya wrote: > Hi Bharath – I’m unsure if this is your problem but the > MatrixFactorizationModel in MLLIB

Loading JSON Dataset fails with com.fasterxml.jackson.databind.JsonMappingException

2014-11-30 Thread Peter Vandenabeele
Hi, On Spark 1.1.0 in Standalone mode, I am following https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets to try to load a simple test JSON file (on my local filesystem, not in HDFS). The file is below and was validated with jsonlint.com: ➜ tmp cat test_4.json {"foo"
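
A common cause of a JsonMappingException at this point is a pretty-printed, multi-line JSON file: sqlContext.jsonFile expects one complete JSON object per line. A minimal sketch against the documented Spark 1.1 API (the file name comes from the post; its contents here are an assumption):

    // Spark 1.1 shell: sc is already provided. Assumes test_4.json holds one
    // JSON object per line, e.g. {"foo": "bar"}
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val rows = sqlContext.jsonFile("test_4.json")   // schema is inferred
    rows.printSchema()
    rows.registerTempTable("rows")
    sqlContext.sql("SELECT foo FROM rows").collect().foreach(println)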

kafka pipeline exactly once semantics

2014-11-30 Thread Josh J
Hi, the Spark docs mention "However, output operations (like foreachRDD) have *at-least once* semantics, that is, the transformed data may get written to an external entity more than once in the

Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread Josh J
Is there a way to do this that preserves exactly once semantics for the write to Kafka? On Tue, Sep 2, 2014 at 12:30 PM, Tim Smith wrote: > I'd be interested in finding the answer too. Right now, I do: > > val kafkaOutMsgs = kafkInMessages.map(x=>myFunc(x._2,someParam)) > kafkaOutMsgs.foreachRDD

Re: Loading JSON Dataset fails with com.fasterxml.jackson.databind.JsonMappingException

2014-11-30 Thread Peter Vandenabeele
On Sun, Nov 30, 2014 at 1:10 PM, Peter Vandenabeele wrote: > On spark 1.1.0 in Standalone mode, I am following > > > https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets > > to try to load a simple test JSON file (on my local filesystem, not in > hdfs). > The file is below

Setting network variables in spark-shell

2014-11-30 Thread Brian Dolan
Howdy Folks, What is the correct syntax in 1.0.0 to set networking variables in the spark shell? Specifically, I'd like to set spark.akka.frameSize. I'm attempting this: spark-shell -Dspark.akka.frameSize=1 --executor-memory 4g Only to get this within the session: System.getProperty("spark

Re: Setting network variables in spark-shell

2014-11-30 Thread Ritesh Kumar Singh
Spark configuration settings can be found here. Hope it helps :) On Sun, Nov 30, 2014 at 9:55 PM, Brian Dolan wrote: > Howdy Folks, > > What is the correct syntax in 1.0.0 to set networking variables in spark > shell? Specifically, I'd li

Re: Setting network variables in spark-shell

2014-11-30 Thread Yanbo
Try using spark-shell --conf spark.akka.frameSize=1 > On Dec 1, 2014, at 12:25 AM, Brian Dolan wrote: > > Howdy Folks, > > What is the correct syntax in 1.0.0 to set networking variables in spark > shell? Specifically, I'd like to set the spark.akka.frameSize > > I'm attempting this: > spark-shel
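
For reference, a sketch of setting and then verifying the value (the --conf flag assumes Spark 1.1+; on 1.0.0 the setting can go in conf/spark-defaults.conf instead; 128 is just an example value, in MB):

    // Sketch: set the frame size (in MB) when launching the shell.
    //
    //   spark-shell --conf spark.akka.frameSize=128
    //
    // On 1.0.0, put "spark.akka.frameSize 128" in conf/spark-defaults.conf.
    // Verify inside the session; the value lives on the SparkConf, not in a
    // Java system property, so System.getProperty will not show it:
    sc.getConf.get("spark.akka.frameSize")   // res0: String = 128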

Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread shahab
Hi, I just wonder if there is any implementation for Item-based Collaborative Filtering in Spark? best, /Shahab

Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Jimmy
The latest version of MLlib has it built in, no? J Sent from my iPhone > On Nov 30, 2014, at 9:36 AM, shahab wrote: > > Hi, > > I just wonder if there is any implementation for Item-based Collaborative > Filtering in Spark? > > best, > /Shahab

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread David Blewett
You might be interested in the new s3a filesystem in Hadoop 2.6.0 [1]. 1. https://issues.apache.org/jira/plugins/servlet/mobile#issue/HADOOP-10400 On Nov 26, 2014 12:24 PM, "Aaron Davidson" wrote: > Spark has a known problem where it will do a pass of metadata on a large > number of small files
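
For anyone trying it, a sketch of what s3a usage looks like (assumes the Hadoop 2.6.0 jars on the classpath; the bucket, path, and credentials are placeholders):

    // Sketch: reading through the new s3a:// scheme (Hadoop 2.6.0+).
    // fs.s3a.access.key / fs.s3a.secret.key are S3A's property names;
    // bucket and path are placeholders.
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")
    val lines = sc.textFile("s3a://my-bucket/path/to/input/*.gz")
    println(lines.count())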

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2014-11-30 Thread Aaron Davidson
Note that it does not appear that s3a solves the original problems in this thread, which are on the Spark side or due to the fact that metadata listing in S3 is slow simply because it goes over the network. On Sun, Nov 30, 2014 at 10:07 AM, David Blewett wrote: > You might be interested in the new

Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Sean Owen
There is an implementation of all-pairs similarity. Have a look at the DIMSUM implementation in RowMatrix. It is an element of what you would need for such a recommender, but not the whole thing. You can also do the model-building part of an ALS-based recommender with ALS in MLlib. So, no not dir
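
To make the model-building half concrete, a sketch against the MLlib 1.x ALS API (the ratings file layout and hyperparameters are assumptions; RowMatrix.columnSimilarities, the DIMSUM piece, lands in Spark 1.2):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Sketch: ALS model building (MLlib 1.x). Assumes a "user,item,rating"
    // CSV; the file name, rank, iteration count and lambda are placeholders.
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(',')
      Rating(user.toInt, item.toInt, rating.toDouble)
    }
    val model = ALS.train(ratings, 10, 10, 0.01)   // rank, iterations, lambda
    println(model.predict(1, 42))   // predicted rating of item 42 for user 1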

Re: Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread Pat Ferrel
Actually, the spark-itemsimilarity job and related code in the Spark module of Mahout creates all-pairs similarity too. It’s designed to be used with a search engine, which provides the query part of the recommender. Integrate the two and you have a near-realtime, scalable item-based/cooccurrence coll

Re: Publishing a transformed DStream to Kafka

2014-11-30 Thread francois . garillot
How about writing to a buffer? Then you would flush the buffer to Kafka if and only if the output operation reports successful completion. In the event of a worker failure, that would not happen. — FG On Sun, Nov 30, 2014 at 2:28 PM, Josh J wrote: > Is there a way to do this that preserves
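
For context, the usual shape of the write side looks like the sketch below (Kafka 0.8 Scala producer API; broker list and topic are placeholders, and the buffering/flush logic discussed above is not shown). Note that on its own this pattern is still at-least-once:

    import java.util.Properties
    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

    // Sketch: one producer per partition, created inside foreachPartition so
    // nothing unserializable is shipped from the driver. kafkaOutMsgs is the
    // DStream from the quoted snippet; broker list and topic are placeholders.
    kafkaOutMsgs.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        val props = new Properties()
        props.put("metadata.broker.list", "broker1:9092,broker2:9092")
        props.put("serializer.class", "kafka.serializer.StringEncoder")
        val producer = new Producer[String, String](new ProducerConfig(props))
        partition.foreach { msg =>
          producer.send(new KeyedMessage[String, String]("topic2", msg))
        }
        producer.close()
      }
    }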

Re: Multiple SparkContexts in same Driver JVM

2014-11-30 Thread Harihar Nahak
Try setting spark.driver.allowMultipleContexts in your SparkConf: conf.set("spark.driver.allowMultipleContexts", "true") On 30 November 2014 at 17:37, lokeshkumar [via Apache Spark User List] < ml-node+s1001560n20037...@n3.nabble.com> wrote: > Hi Forum, > > Is it not possible to run multiple SparkContexts concurrently without > stopping the
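
Spelled out as code, a sketch (spark.driver.allowMultipleContexts exists from Spark 1.2 and is only an escape hatch; multiple active SparkContexts in one JVM remain officially unsupported):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: the flag takes the string "true" and is read when the context
    // is constructed. Spark 1.2+; app name and master are placeholders.
    val conf = new SparkConf()
      .setAppName("second-context")
      .setMaster("local[2]")
      .set("spark.driver.allowMultipleContexts", "true")
    val sc2 = new SparkContext(conf)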

Re: RDDs join problem: incorrect result

2014-11-30 Thread Harihar Nahak
What do you mean by incorrect? Could you please share some examples from both input RDDs and the resultant RDD? Also, if you get any exception, paste that too; it helps to debug where the issue is. On 27 November 2014 at 17:07, liuboya [via Apache Spark User List] < ml-node+s1001560n19928...@n3.nabble.com> w

Re: GraphX:java.lang.NoSuchMethodError:org.apache.spark.graphx.Graph$.apply

2014-11-30 Thread Harihar Nahak
Hi, If you haven't figured it out so far, could you please share some details: how are you running GraphX? Also, before executing the above commands from the shell, import the required GraphX packages. On 27 November 2014 at 20:49, liuboya [via Apache Spark User List] < ml-node+s1001560n19959...@n3.nabble.com> wrote:

How can a function access Executor ID, Function ID and other parameters

2014-11-30 Thread Steve Lewis
I am running on a 15 node cluster and am trying to set partitioning to balance the work across all nodes. I am using an Accumulator to track work by MAC address but would prefer to use data known to the Spark environment - Executor ID and Function ID show up in the Spark UI, and Task ID and Attem
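
A sketch of reading those IDs from inside a task rather than tracking them by hand (SparkEnv is internal API, TaskContext.get() arrives with Spark 1.2, and someRdd is a placeholder):

    import org.apache.spark.{SparkEnv, TaskContext}

    // Sketch: tag each record with where it was processed. SparkEnv is
    // internal/developer API; TaskContext.get() exists from Spark 1.2.
    val tagged = someRdd.mapPartitions { iter =>
      val executorId = SparkEnv.get.executorId   // e.g. "3", or "driver"
      val ctx = TaskContext.get()
      iter.map(x => ((executorId, ctx.partitionId, ctx.attemptId), x))
    }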

Re: reduceByKey and empty output files

2014-11-30 Thread Rishi Yadav
How big is your input dataset? On Thursday, November 27, 2014, Praveen Sripati wrote: > Hi, > > When I run the below program, I see two files in the HDFS because the > number of partitions in 2. But, one of the file is empty. Why is it so? Is > the work not distributed equally to all the tasks?
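
As for the empty file itself: reduceByKey hash-partitions by key, so a part- file is empty whenever no key maps to that partition. A minimal sketch showing the effect (made-up data):

    // Sketch: with a single distinct key and 2 partitions, everything lands
    // in the one partition that key hashes to; the other part- file is empty.
    val counts = sc.parallelize(Seq("only-key" -> 1, "only-key" -> 1))
      .reduceByKey(_ + _, 2)
    counts.saveAsTextFile("out")   // part-00000 and part-00001; one is empty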

Re: Edge List File in GraphX

2014-11-30 Thread Harihar Nahak
GraphLoader.edgeListFile(fileName), where the file must contain edges in the form 1\t2 (two vertex IDs separated by a tab). About the NaN result, there might be some issue with the data. I ran it for various combinations of data sets and it works perfectly fine. On 25 November 2014 at 19:23, pradhandeep [via Apache Spark User List] < ml-node+s1001560n1972
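
In code form (the class is GraphLoader; the file path is a placeholder):

    import org.apache.spark.graphx.GraphLoader

    // Sketch: edges.txt must have one edge per line as two vertex IDs
    // separated by whitespace, e.g. "1\t2"; lines starting with '#' are
    // treated as comments.
    val graph = GraphLoader.edgeListFile(sc, "edges.txt")
    println(graph.numVertices + " vertices, " + graph.numEdges + " edges")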

Re: kafka pipeline exactly once semantics

2014-11-30 Thread Tobias Pfeiffer
Josh, On Sun, Nov 30, 2014 at 10:17 PM, Josh J wrote: > > I would like to setup a Kafka pipeline whereby I write my data to a single > topic 1, then I continue to process using spark streaming and write the > transformed results to topic2, and finally I read the results from topic 2. > Not reall

RE: Unable to compile spark 1.1.0 on windows 8.1

2014-11-30 Thread Judy Nash
I have found the following to work for me on win 8.1: 1) run sbt assembly 2) Use Maven. You can find the Maven commands for your build at: docs\building-spark.md -Original Message- From: Ishwardeep Singh [mailto:ishwardeep.si...@impetus.co.in] Sent: Thursday, November 27, 2014 11:31 PM

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Ke Wang
I'm hitting the same problem. Did you solve it?

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
4096MB (in bytes) is greater than Int.MaxValue, and it will overflow in Spark. Please set it to less than 4096. Best Regards, Shixiong Zhu 2014-12-01 13:14 GMT+08:00 Ke Wang : > I meet the same problem, did you solve it ? > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
Sorry, it should be less than 2048; 2047 is the greatest value. Best Regards, Shixiong Zhu 2014-12-01 13:20 GMT+08:00 Shixiong Zhu : > 4096MB is greater than Int.MaxValue and it will be overflow in Spark. > Please set it less then 4096. > > Best Regards, > Shixiong Zhu > > 2014-12-01 13:14 G
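
The 2047 cap follows from the frame size being configured in MB but converted to bytes as an Int: 2048 MB is 2^31 bytes, one more than Int.MaxValue. A quick check:

    // Sketch of the overflow: the MB value is multiplied out to bytes as an Int.
    println(2048 * 1024 * 1024)   // -2147483648: 2^31 overflows Int.MaxValue
    println(2047 * 1024 * 1024)   // 2146435072: still fits in an Int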

Re: spark.akka.frameSize setting problem

2014-11-30 Thread Shixiong Zhu
Created a JIRA to track it: https://issues.apache.org/jira/browse/SPARK-4664 Best Regards, Shixiong Zhu 2014-12-01 13:22 GMT+08:00 Shixiong Zhu : > Sorry. Should be not greater than 2048. 2047 is the greatest value. > > Best Regards, > Shixiong Zhu > > 2014-12-01 13:20 GMT+08:00 Shixiong Zhu : >

RE: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-30 Thread Judy Nash
Thanks Patrick and Cheng for the suggestions. The issue was that a Hadoop common jar had been added to the classpath. After I removed the Hadoop common jar from both master and slave, I was able to get past the error. This was caused by a local change, so there is no impact on the 1.2 release. -Original Message- Fr

Re: latest Spark 1.2 thrift server fail with NoClassDefFoundError on Guava

2014-11-30 Thread Patrick Wendell
Thanks Judy. While this is not directly caused by a Spark issue, it is likely other users will run into this. It is an unfortunate consequence of the way we've shaded Guava in this release: we rely on bytecode shading of Hadoop itself as well. And if the user has their own Hadoop classes pr