Streaming linear regression example question

2015-03-14 Thread Margus Roo
Hi, I am trying to understand the example provided in https://spark.apache.org/docs/1.2.1/mllib-linear-methods.html - Streaming linear regression. Code: import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.Labe
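
For reference, a minimal sketch of the streaming linear regression pattern from that page, assuming three features and placeholder HDFS directories for the training and test streams:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}

val conf = new SparkConf().setAppName("StreamingLinearRegression")
val ssc = new StreamingContext(conf, Seconds(1))

// Each line is expected in LabeledPoint.parse format, e.g. (label,[f1,f2,f3])
val trainingData = ssc.textFileStream("hdfs:///training/dir").map(LabeledPoint.parse).cache()
val testData = ssc.textFileStream("hdfs:///test/dir").map(LabeledPoint.parse)

// Start from zero weights; the model is updated as each training batch arrives
val model = new StreamingLinearRegressionWithSGD().setInitialWeights(Vectors.zeros(3))
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()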

Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hi, I read this document, http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried to build a TF-IDF model of my documents. I have a list of documents, each word is represented as an Int, and each document is listed on one line. doc_name, int1, int2... doc_name, int3, int4...
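
A rough sketch of one way to build TF-IDF vectors for input shaped like that, assuming an existing SparkContext sc and a placeholder path (the integer word ids are simply treated as terms):

import org.apache.spark.mllib.feature.{HashingTF, IDF}

// Drop the leading doc_name and keep the term ids of each document
val documents = sc.textFile("hdfs:///path/to/docs")
  .map(_.split(",").drop(1).map(_.trim).toSeq)

val hashingTF = new HashingTF()
val tf = hashingTF.transform(documents)   // term-frequency vectors
tf.cache()
val idf = new IDF().fit(tf)               // inverse document frequencies over the corpus
val tfidf = idf.transform(tf)             // TF-IDF vectors, usually sparse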

deploying Spark on standalone cluster

2015-03-14 Thread sara mustafa
Hi, I am trying to deploy Spark on a standalone cluster of two machines, one for the master node and one for the worker node. I have defined the two machines in the conf/slaves file and also in /etc/hosts. When I tried to run the cluster, the worker node is running but the master node failed to run and threw this er

Re: Please help me understand TF-IDF Vector structure

2015-03-14 Thread Xi Shen
Hey, I worked it out myself :) The "Vector" is actually a "SparseVector", so when it is written as a string, the format is (size, [indices...], [values...]). Simple! On Sat, Mar 14, 2015 at 6:05 PM Xi Shen wrote: > Hi, > > I read this document, > http://spark.apache.org/docs/1.2.1/mllib-f
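
A small illustration of that string form (size and values here are arbitrary):

import org.apache.spark.mllib.linalg.Vectors

// A sparse vector of size 5 with non-zeros at indices 1 and 3
val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
println(v)   // prints (5,[1,3],[2.0,4.0]): size, then indices, then values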

Re: spark there is no space on the disk

2015-03-14 Thread Sean Owen
It means pretty much what it says. You ran out of space on an executor (not the driver), because the dir used for serialization temp files is full (not all volumes). Set spark.local.dir to something more appropriate and larger. On Sat, Mar 14, 2015 at 2:10 AM, Peng Xia wrote: > Hi > > > I was runnin
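
A minimal sketch of pointing spark.local.dir at a larger volume (the directories are placeholders; a comma-separated list spreads temp I/O across disks):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.local.dir", "/mnt/bigdisk1/spark-tmp,/mnt/bigdisk2/spark-tmp")
val sc = new SparkContext(conf)

Note that on YARN the node manager's configured local directories typically take precedence over this setting.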

Re: building all modules in spark by mvn

2015-03-14 Thread Sean Owen
I can't reproduce that. 'mvn package' builds everything. You're not showing additional output from Maven that would explain what it skipped and why. On Sat, Mar 14, 2015 at 12:57 AM, sequoiadb wrote: > guys, is there any easier way to build all modules by mvn ? > right now if I run “mvn package”

Re: deploying Spark on standalone cluster

2015-03-14 Thread fightf...@163.com
Hi, You may want to check your Spark environment config in spark-env.sh, specifically SPARK_LOCAL_IP, and check whether you modified that value, which may default to localhost. Thanks, Sun. fightf...@163.com From: sara mustafa Date: 2015-03-14 15:13 To: user Subject: deploying

How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread James
Hello, I have a cluster with Spark on YARN. Currently some of its nodes are running a Spark streaming program, so their local space is not enough to support other applications. I wonder whether it is possible to use a blacklist to avoid using these nodes when running a new Spark program? Alcaid

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread Ted Yu
Which release of hadoop are you using ? Can you utilize node labels feature ? See YARN-2492 and YARN-796 Cheers On Sat, Mar 14, 2015 at 1:49 AM, James wrote: > Hello, > > I am got a cluster with spark on yarn. Currently some nodes of it are > running a spark streamming program, thus their loca

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread James
My hadoop version is 2.2.0, and my spark version is 1.2.0 2015-03-14 17:22 GMT+08:00 Ted Yu : > Which release of hadoop are you using ? > > Can you utilize node labels feature ? > See YARN-2492 and YARN-796 > > Cheers > > On Sat, Mar 14, 2015 at 1:49 AM, James wrote: > >> Hello, >> >> I am got a

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread Simon Elliston Ball
You won’t be able to use YARN labels on 2.2.0. However, you only need the labels if you want to map containers on specific hardware. In your scenario, the capacity scheduler in YARN might be the best bet. You can set up separate queues for the streaming and other jobs to protect a percentage of c
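
Once such a queue exists, a job can be sent to it either with spark-submit's --queue flag or in code; a sketch, assuming a queue named "batch" has been defined in capacity-scheduler.xml:

import org.apache.spark.{SparkConf, SparkContext}

// Submit this application to a dedicated capacity-scheduler queue
val conf = new SparkConf()
  .setAppName("non-streaming-job")
  .set("spark.yarn.queue", "batch")
val sc = new SparkContext(conf)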

Re: serialization stackoverflow error during reduce on nested objects

2015-03-14 Thread alexis GILLAIN
I haven't registered my class in Kryo but I don't think it would have such an impact on the stack size. I'm thinking of using GraphX and I'm wondering how it serializes the graph object, as it can use Kryo as serializer. 2015-03-14 6:22 GMT+01:00 Ted Yu : > Have you registered your class with kryo ?

Re: Using rdd methods with Dstream

2015-03-14 Thread Laeeq Ahmed
Thanks TD, this is what I was looking for. rdd.context.makeRDD worked. Laeeq On Friday, March 13, 2015 11:08 PM, Tathagata Das wrote: Is the number of top K elements you want to keep small? That is, is K small? In which case, you can 1. either do it in the driver on the array DSt
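
A sketch of that pattern, taking the top K of each batch in the driver and turning the resulting array back into a small RDD so the output stays a DStream:

import org.apache.spark.streaming.dstream.DStream

def topKPerBatch(stream: DStream[Double], k: Int): DStream[Double] =
  stream.transform { rdd =>
    // rdd.top(k) returns a driver-side array; makeRDD wraps it back into an RDD
    rdd.context.makeRDD(rdd.top(k))
  }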

Re: How to avoid using some nodes while running a spark program on yarn

2015-03-14 Thread Ted Yu
Out of curiosity, I searched for 'capacity scheduler deadlock' yielded the following: [YARN-3265] CapacityScheduler deadlock when computing absolute max avail capacity (fix for trunk/branch-2) [YARN-3251] Fix CapacityScheduler deadlock when computing absolute max avail capacity (short term fix fo

Re: How does Spark honor data locality when allocating computing resources for an application

2015-03-14 Thread eric wong
You seem not to have noticed the configuration variable "spreadOutApps" and its comment: // As a temporary workaround before better ways of configuring memory, we allow users to set // a flag that will perform round-robin scheduling across the nodes (spreading out each app // among all the node

Spark Release 1.3.0 DataFrame API

2015-03-14 Thread David Mitchell
I am pleased with the release of the DataFrame API. However, I started playing with it, and neither of the two main examples in the documentation works: http://spark.apache.org/docs/1.3.0/sql-programming-guide.html Specifically: - Inferring the Schema Using Reflection - Programmatically Spec

Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Nick Pentreath
I've found people.toDF gives you a data frame (roughly equivalent to the previous Row RDD), And you can then call registerTempTable on that DataFrame. So people.toDF.registerTempTable("people") should work — Sent from Mailbox On Sat, Mar 14, 2015 at 5:33 PM, David Mitchell wrote

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
Any advice on dealing with a large number of separate input files? On Mar 13, 2015, at 4:06 PM, Pat Ferrel wrote: We have many text files that we need to read in parallel. We can create a comma delimited list of files to pass in to sparkContext.textFile(fileList). The list can get very large

Pausing/throttling spark/spark-streaming application

2015-03-14 Thread tulinski
Hi, I created a question on StackOverflow: http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application I would appreciate your help. Best, Tomek -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Pausing-throttling-spark-s

Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Sean Owen
Yes I think this was already just fixed by: https://github.com/apache/spark/pull/4977 a ".toDF()" is missing On Sat, Mar 14, 2015 at 4:16 PM, Nick Pentreath wrote: > I've found people.toDF gives you a data frame (roughly equivalent to the > previous Row RDD), > > And you can then call registerT
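
For reference, the working 1.3 shape of the example looks roughly like this (assuming an existing SparkContext sc and a placeholder people.txt of "name,age" lines):

import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._             // brings .toDF() into scope

val people = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()                                 // RDD[Person] -> DataFrame

people.registerTempTable("people")
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")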

Re: Need Advice about reading lots of text files

2015-03-14 Thread Pat Ferrel
It’s a long story but there are many dirs with smallish part- files in them so we create a list of the individual files as input to sparkContext.textFile(fileList). I suppose we could move them and rename them to be contiguous part- files in one dir. Would that be better than passing in
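
For what it's worth, both variants can be sketched like this (paths and partition counts are placeholders); textFile accepts a comma-separated list of paths, and globs also work:

// Pass many individual files as one comma-separated string
val fileList = Seq("hdfs:///data/dir1/part-00000", "hdfs:///data/dir2/part-00000")
val lines = sc.textFile(fileList.mkString(","))

// Each small file yields at least one partition, so collapse them before heavy work
val compacted = lines.coalesce(64)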

Spark and HBase join issue

2015-03-14 Thread francexo83
Hi all, I have the following cluster configurations: - 5 nodes on a cloud environment. - Hadoop 2.5.0. - HBase 0.98.6. - Spark 1.2.0. - 8 cores and 16 GB of ram on each host. - 1 NFS disk with 300 IOPS mounted on host 1 and 2. - 1 NFS disk with 300 IOPS mounted on host

Re: Spark and HBase join issue

2015-03-14 Thread Ted Yu
The 4.1 GB table has 3 regions. This means that there would be at least 2 nodes which don't carry its region. Can you split this table into 12 (or more) regions ? BTW what's the value for spark.yarn.executor.memoryOverhead ? Cheers On Sat, Mar 14, 2015 at 10:52 AM, francexo83 wrote: > Hi all,
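
For reference, the overhead can be raised when building the conf; a sketch with an arbitrary value in MB:

import org.apache.spark.{SparkConf, SparkContext}

// Reserve extra off-heap room per executor container on YARN
val conf = new SparkConf()
  .setAppName("hbase-join")
  .set("spark.yarn.executor.memoryOverhead", "1024")
val sc = new SparkContext(conf)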

Re: Need Advice about reading lots of text files

2015-03-14 Thread Michael Armbrust
Here is how I have dealt with many small text files (on s3 though this should generalize) in the past: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201411.mbox/%3ccaaswr-58p66-es2haxh4i+bu__0rvxd2okewkly0mee8rue...@mail.gmail.com%3E > FromMichael Armbrust SubjectRe: > S3NativeF

Re: Spark SQL 1.3 max operation giving wrong results

2015-03-14 Thread Michael Armbrust
Do you have an example that reproduces the issue? On Fri, Mar 13, 2015 at 4:12 PM, gtinside wrote: > Hi , > > I am playing around with Spark SQL 1.3 and noticed that the "max" function does > not give the correct result, i.e. it doesn't give the maximum value. The same > query works fine in Spark SQL 1.2

Bug in "Spark SQL and Dataframes" : "Inferring the Schema Using Reflection"?

2015-03-14 Thread Dean Arnold
Running 1.3.0 from binary install. When executing the example under the subject section from within spark-shell, I get the following error: scala> people.registerTempTable("people") :35: error: value registerTempTable is not a member of org.apache.spark.rdd.RDD[Person] people.registe

Re: Bug in "Spark SQL and Dataframes" : "Inferring the Schema Using Reflection"?

2015-03-14 Thread Sean Owen
Yep, already fixed in master: https://github.com/apache/spark/pull/4977/files You need a '.toDF()' at the end. On Sat, Mar 14, 2015 at 6:55 PM, Dean Arnold wrote: > Running 1.3.0 from binary install. When executing the example under the > subject section from within spark-shell, I get the follo

How to create data frame from an avro file in Spark 1.3.0

2015-03-14 Thread Shing Hing Man
In spark-avro 0.1,  the method AvroContext.avroFile  returns a SchemaRDD, which is deprecated in Spark 1.3.0 package com.databricks.spark import org.apache.spark.sql.{SQLContext, SchemaRDD} package object avro {   /**    * Adds a method, `avroFile`, to SQLContext that allows reading data stor

Bug in Streaming files?

2015-03-14 Thread Justin Pihony
All, Looking into this StackOverflow question it appears that there is a bug when utilizing the newFilesOnly parameter in FileInputDStream. Before creating a ticket, I wanted to verify it here. The gist is that this

Re: Bug in Streaming files?

2015-03-14 Thread Sean Owen
No I don't think that much is a bug, since newFilesOnly=false removes a constraint that otherwise exists, and that's what you see. However read the closely related: https://issues.apache.org/jira/browse/SPARK-6061 @tdas open question for you there. On Sat, Mar 14, 2015 at 8:18 PM, Justin Pihony
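
For context, a sketch of the fileStream variant being discussed, which exposes the newFilesOnly flag (the directory is a placeholder; the filter accepts everything):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.streaming.StreamingContext

def existingAndNewFiles(ssc: StreamingContext) =
  ssc.fileStream[LongWritable, Text, TextInputFormat](
    "hdfs:///incoming", (path: Path) => true, newFilesOnly = false)
    .map(_._2.toString)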

Re: How to create data frame from an avro file in Spark 1.3.0

2015-03-14 Thread Michael Armbrust
We will be publishing a new version of the library early next week. Here's the PR for the upgraded version if you would like to build from source: https://github.com/databricks/spark-avro/pull/33 On Sat, Mar 14, 2015 at 1:17 PM, Shing Hing Man wrote: > In spark-avro 0.1, the method AvroContext
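
Until that release is out, one workaround on Spark 1.3 is the generic data-source load; the path and source name here follow the spark-avro package convention and may differ for other versions:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)   // assumes an existing SparkContext sc
val df = sqlContext.load("hdfs:///data/episodes.avro", "com.databricks.spark.avro")
df.printSchema()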

Re: spark there is no space on the disk

2015-03-14 Thread Peng Xia
Hi Sean, Thanks very much for your reply. I tried to config it with the code below: sf = SparkConf().setAppName("test").set("spark.executor.memory", "45g").set("spark.cores.max", 62).set("spark.local.dir", "C:\\tmp") But I still get the error. Do you know how I can configure this? Thanks, Best, Peng O

Re: spark there is no space on the disk

2015-03-14 Thread Peng Xia
And I have 2 TB free space on the C drive. On Sat, Mar 14, 2015 at 8:29 PM, Peng Xia wrote: > Hi Sean, > > Thank very much for your reply. > I tried to config it from below code: > > sf = SparkConf().setAppName("test").set("spark.executor.memory", > "45g").set("spark.cores.max", 62),set("spark.loc

order preservation with RDDs

2015-03-14 Thread kian.ho
Hi, I was taking a look through the mllib examples in the official spark documentation and came across the following: http://spark.apache.org/docs/1.3.0/mllib-feature-extraction.html#tab_python_2 specifically the lines: label = data.map(lambda x: x.label) features = data.map(lambda x: x.features
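
One way to sidestep the ordering question entirely is to keep label and features in the same record while transforming the features; a sketch using StandardScaler (any transformer with a per-vector transform works the same way):

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def scaleKeepingLabels(data: RDD[LabeledPoint]): RDD[LabeledPoint] = {
  val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))
  // Label and features never leave the same record, so no zip/ordering assumptions are needed
  data.map(lp => LabeledPoint(lp.label, scaler.transform(lp.features)))
}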

Re: GraphX Snapshot Partitioning

2015-03-14 Thread Takeshi Yamamuro
Large edge partitions could cause java.lang.OutOfMemoryError, and then spark tasks fail. FWIW, each edge partition can have at most 2^32 edges because 64-bit vertex IDs are mapped into 32-bit ones in each partition. If #edges is over the limit, graphx could throw ArrayIndexOutOfBoundsException,

Re: [GRAPHX] could not process graph with 230M edges

2015-03-14 Thread Takeshi Yamamuro
Hi, If you have heap problems in spark/graphx, it'd be better to split partitions into smaller ones so that each partition fits in memory. On Sat, Mar 14, 2015 at 12:09 AM, Hlib Mykhailenko < hlib.mykhaile...@inria.fr> wrote: > Hello, > > I cannot process graph with 230M edges. > I cloned apache.
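
A sketch of repartitioning the edges into more (hence smaller) partitions, assuming an existing SparkContext sc; the path and partition count are placeholders to tune against the available heap:

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

val graph = GraphLoader.edgeListFile(sc, "hdfs:///graph/edges.txt")
val repartitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D, 400)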

Re: Spark Release 1.3.0 DataFrame API

2015-03-14 Thread Rishi Yadav
Programmatically specifying a schema needs import org.apache.spark.sql.types._ for StructType and StructField to resolve. On Sat, Mar 14, 2015 at 10:07 AM, Sean Owen wrote: > Yes I think this was already just fixed by: > > https://github.com/apache/spark/pull/4977 > > a ".toDF()" is missing > >
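
With that import, the programmatic route looks roughly like this on 1.3 (assuming an existing SparkContext sc and a placeholder people.txt of "name,age" lines):

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

val sqlContext = new SQLContext(sc)

val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)))

val rowRDD = sc.textFile("people.txt")
  .map(_.split(","))
  .map(p => Row(p(0), p(1).trim.toInt))

val peopleDF = sqlContext.createDataFrame(rowRDD, schema)
peopleDF.registerTempTable("people")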

1.3 release

2015-03-14 Thread Eric Friedman
Is there a reason why the prebuilt releases don't include current CDH distros and YARN support? Eric Friedman