Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Akhil Das
Can you see where exactly it is spending time? Since, as you said, it goes to Stage 2, you should be able to see how much time it spent on Stage 1. See if it's GC time; if so, try increasing the level of parallelism or repartition it, e.g. to sc.defaultParallelism * 3. Thanks Best Regards On Thu, Mar 19
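A minimal sketch of that repartition suggestion (the input path and the factor of 3 are placeholders; sc is the SparkContext predefined in spark-shell):

// hedged sketch: spread the data over more partitions before the expensive stage
val input = sc.textFile("hdfs:///data/messages")
val repartitioned = input.repartition(sc.defaultParallelism * 3)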

Re: Spark + Kafka

2015-03-19 Thread James King
Thanks Khanderao. On Wed, Mar 18, 2015 at 7:18 PM, Khanderao Kand Gmail < khanderao.k...@gmail.com> wrote: > I have used various versions of Spark (1.0, 1.2.1) without any issues. > Though I have not significantly used Kafka with 1.3.0, preliminary > testing revealed no issues. > > - khandera

Re: Database operations on executor nodes

2015-03-19 Thread Akhil Das
It totally depends on your database. If it's a NoSQL database like MongoDB/HBase etc., then you can use the native .saveAsNewAPIHadoopFile or .saveAsHadoopDataset etc. For SQL databases, I think people usually put that work on the driver like you did. Thanks Best Regards On Wed, Mar 18, 2015 at 1
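For the NoSQL case, a hedged sketch of the saveAsNewAPIHadoopDataset route for HBase might look like the following (the table name "events", column family "cf" and the pair RDD are all placeholders, and the HBase client/mapreduce jars are assumed to be on the classpath):

import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// assumption: an HBase table named "events" with column family "cf" already exists
val job = Job.getInstance(sc.hadoopConfiguration)
job.getConfiguration.set(TableOutputFormat.OUTPUT_TABLE, "events")
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

val pairs = sc.parallelize(Seq(("row1", "value1"), ("row2", "value2")))   // placeholder data
pairs.map { case (key, value) =>
  val put = new Put(Bytes.toBytes(key))
  put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
  (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
}.saveAsNewAPIHadoopDataset(job.getConfiguration)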

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Su She
Hi Akhil, 1) How could I see how much time it is spending on stage 1? Or what if, like above, it doesn't get past stage 1? 2) How could I check if it's GC time? And where would I increase the parallelism for the model? I have a Spark Master and 2 Workers running on CDH 5.3...what would the defau

Re: Null pointer exception reading Parquet

2015-03-19 Thread Akhil Das
How are you running the application? Can you try running the same inside spark-shell? Thanks Best Regards On Wed, Mar 18, 2015 at 10:51 PM, sprookie wrote: > Hi All, > > I am using Spark version 1.2 running locally. When I try to read a Parquet > file I get the below exception; what might be the iss

RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Hi Reza, Behavior: I tried running the job with different thresholds - 0.1, 0.5, 5, 20 & 100. Every time, the job got stuck at mapPartitionsWithIndex at RowMatrix.scala:522 with

Re: MLlib Spam example gets stuck in Stage X

2015-03-19 Thread Akhil Das
To get these metrics out, you need to open the driver UI running on port 4040. In there you will see the Stages information, and for each stage you can see how much time it is spending on GC etc. In your case, the parallelism seems to be 4; the higher the parallelism, the more tasks you will see. Than

how to specify multiple masters in sbin/start-slaves.sh script?

2015-03-19 Thread sequoiadb
Hey guys, Not sure if I'm the only one who got this. We are building a highly-available standalone Spark env. We are using ZK with 3 masters in the cluster. However, sbin/start-slaves.sh calls start-slave.sh for each member in the conf/slaves file, and specifies the master using $SPARK_MASTER_IP and $SPAR
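For reference, the standalone HA docs describe a comma-separated master URL, and the sketch below shows how an application would point at all three masters (hostnames and ports are placeholders); the same URL form is what the worker launch script would need instead of a single $SPARK_MASTER_IP:$SPARK_MASTER_PORT pair:

import org.apache.spark.SparkConf

// hedged sketch: with ZooKeeper-based HA, list every master in the URL
val conf = new SparkConf()
  .setAppName("ha-example")
  .setMaster("spark://master1:7077,master2:7077,master3:7077")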

OutOfMemoryError during reduce tasks

2015-03-19 Thread Balazs Meszaros
Hi, I am trying to evaluate performance aspects of Spark with respect to various memory settings. What makes it more difficult is that I'm new to Python, but the problem at hand doesn't seem to originate from that. I'm running a wordcount script [1] with different amounts of input data. There

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread Paolo Platter
Yes, I would suggest spark-notebook too. It's very simple to set up and it's growing pretty fast. Paolo Sent from my Windows Phone From: Irfan Ahmad Sent: 19/03/2015 04:05 To: davidh Cc: user@spar

Re: SparkSQL 1.3.0 JDBC data source issues

2015-03-19 Thread Pei-Lun Lee
JIRA and PR for first issue: https://issues.apache.org/jira/browse/SPARK-6408 https://github.com/apache/spark/pull/5087 On Thu, Mar 19, 2015 at 12:20 PM, Pei-Lun Lee wrote: > Hi, > > I am trying jdbc data source in spark sql 1.3.0 and found some issues. > > First, the syntax "where str_col='valu

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Ted Yu
There might be some delay: http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responses&subj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view > On Mar 18, 2015, at 4:47 PM, Dmitry Goldenberg > wrote: > > Thanks, Ted. Well, so far even there I'm only seeing my po

Re: LZO configuration can not affect

2015-03-19 Thread Ted Yu
How did you generate the Hadoop-lzo jar? Thanks > On Mar 17, 2015, at 2:36 AM, 唯我者 <878223...@qq.com> wrote: > > Hi everybody: > I have configured the env for LZO like this: > [two screenshots attached] > > But when I execute code w

calculating TF-IDF for large 100GB dataset problems

2015-03-19 Thread sergunok
Hi, I am trying to vectorize a corpus of texts on a YARN cluster (about 500K texts in 13 files - 100GB in total), located in HDFS. This process has already taken about 20 hours on a 3-node cluster with 6 cores and 20GB RAM on each node. In my opinion that's too long :-) I started the task with the following command:

Re: Does newly-released LDA (Latent Dirichlet Allocation) algorithm supports ngrams?

2015-03-19 Thread Charles Earl
Heszak, I have only glanced at it, but you should be able to incorporate tokens approximating n-grams yourself, say by using the Lucene ShingleAnalyzerWrapper API http://lucene.apache.org/core/4_9_0/analyzers-common/org/apache/lucene/analysis/shingle/ShingleAnalyzerWrapper.html You might also take a
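As a Lucene-free alternative to the shingle approach above, a small sketch that appends bigram tokens with Scala's sliding before building the LDA vocabulary (the input path and the "_" joiner are arbitrary choices):

// hedged sketch: add bigram tokens to each document's unigrams
val docs = sc.textFile("hdfs:///corpus")                       // placeholder input
val withBigrams = docs.map(_.toLowerCase.split("\\s+").toSeq)
  .map(tokens => tokens ++ tokens.sliding(2).map(_.mkString("_")))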

Spark 1.2.0 | Spark job fails with MetadataFetchFailedException

2015-03-19 Thread Aniket Bhatnagar
I have a job that sorts data and runs a combineByKey operation, and it sometimes fails with the following error. The job is running on a Spark 1.2.0 cluster with yarn-client deployment mode. Any clues on how to debug the error? org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output

Reading a text file into RDD[Char] instead of RDD[String]

2015-03-19 Thread Michael Lewis
Hi, I'm struggling to think of the best way to read a text file into an RDD[Char] rather than RDD[String]. I can do sc.textFile(….), which gives me the RDD[String]. Can anyone suggest the most efficient way to create the RDD[Char]? I'm sure I've missed something simple… Regards, Mike

Re: Reading a text file into RDD[Char] instead of RDD[String]

2015-03-19 Thread Sean Owen
val s = sc.parallelize(Array("foo", "bar", "baz")) val c = s.flatMap(_.toIterator) c.collect() res8: Array[Char] = Array(f, o, o, b, a, r, b, a, z) On Thu, Mar 19, 2015 at 8:46 AM, Michael Lewis wrote: > Hi, > > I’m struggling to think of the best way to read a text file into an RDD[Char] > ra

Re: Reading a text file into RDD[Char] instead of RDD[String]

2015-03-19 Thread Manoj Awasthi
sc.textFile().flatMap(_.toIterator) On Thu, Mar 19, 2015 at 6:31 PM, Sean Owen wrote: > val s = sc.parallelize(Array("foo", "bar", "baz")) > > val c = s.flatMap(_.toIterator) > > c.collect() > res8: Array[Char] = Array(f, o, o, b, a, r, b, a, z) > > On Thu, Mar 19, 2015 at 8:46 AM, Michael L

Re: LZO configuration can not affect

2015-03-19 Thread Ted Yu
If I read the screenshot correctly, the Hadoop lzo jar is under /home/hadoop/mylib Cheers > On Mar 19, 2015, at 5:37 AM, jeanlyn92 wrote: > > You should configure it as follows: > export > SPARK_LIBRARY_PATH="$HADOOP_HOME/lib/native:$HADOOP_HOME/share/hadoop/common/lib/hadoop-lzo-0.4.15.jar" > > >> On

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Nabble is a third-party site that tries its best to archive mail sent out over the list. Nothing guarantees it will be in sync with the real mailing list. To get the "truth" on what was sent over this Apache-managed list, you unfortunately need to go to the Apache archives: http://mail-archives.apac

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Ted Yu
I prefer using search-hadoop.com which provides better search capability. Cheers On Thu, Mar 19, 2015 at 6:48 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Nabble is a third-party site that tries its best to archive mail sent out > over the list. Nothing guarantees it will be in sy

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Sure, you can use Nabble or search-hadoop or whatever you prefer. My point is just that the source of truth is the Apache archives, and these other sites may or may not be in sync with that truth. On Thu, Mar 19, 2015 at 10:20 AM Ted Yu wrote: > I prefer using search-hadoop.com which provides

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
It seems that those archives are not necessarily easy to find stuff in. Is there a search engine on top of them, so as to find e.g. your own posts easily? On Thu, Mar 19, 2015 at 10:34 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Sure, you can use Nabble or search-hadoop or whateve

Writing Spark Streaming Programs

2015-03-19 Thread James King
Hello All, I'm using Spark for streaming but I'm unclear on which implementation language to use: Java, Scala or Python. I don't know anything about Python, am familiar with Scala and have been doing Java for a long time. I think the above shouldn't influence my decision on which language to use be

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Nicholas Chammas
Yes, that is mostly why these third-party sites have sprung up around the official archives--to provide better search. Did you try the link Ted posted? On Thu, Mar 19, 2015 at 10:49 AM Dmitry Goldenberg wrote: > It seems that those archives are not necessarily easy to find stuff in. Is > there a

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Dmitry Goldenberg
Interesting points. Yes, I just tried http://search-hadoop.com/m/JW1q5mjZUy/Spark+people%2527s+responses&subj=Apache+Spark+User+List+people+s+responses+not+showing+in+the+browser+view and I see responses there now. I believe Ted was right in that there's a delay before they show up there (probably

Re: Apache Spark User List: people's responses not showing in the browser view

2015-03-19 Thread Ted Yu
Here is the reason why results on the search site may be delayed, especially for Apache JIRAs: if they crawl too often, Apache would flag the bot and blacklist it. Cheers On Thu, Mar 19, 2015 at 7:59 AM, Dmitry Goldenberg wrote: > Interesting points. Yes I just tried > http://search-hadoop.com/m

Re: Writing Spark Streaming Programs

2015-03-19 Thread Gerard Maas
Try writing this Spark Streaming idiom in Java and you'll choose Scala soon enough: dstream.foreachRDD{rdd => rdd.foreachPartition( partition => ) } When deciding between Java and Scala for Spark, IMHO Scala has the upper hand. If you're concerned with readability, have a look at the Scal
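For completeness, a minimal sketch of that idiom in Scala (the DStream, its String element type, and the println sink are placeholders for a real per-partition connection):

import org.apache.spark.streaming.dstream.DStream

def writeOut(dstream: DStream[String]): Unit =
  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      // open one connection per partition here and reuse it for the whole batch
      partition.foreach(record => println(record))   // stand-in for a real sink
    }
  }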

Re: Writing Spark Streaming Programs

2015-03-19 Thread James King
Many thanks Gerard, this is very helpful. Cheers! On Thu, Mar 19, 2015 at 4:02 PM, Gerard Maas wrote: > Try writing this Spark Streaming idiom in Java and you'll choose Scala > soon enough: > > dstream.foreachRDD{rdd => > rdd.foreachPartition( partition => ) > } > > When deciding betwee

Re: Writing Spark Streaming Programs

2015-03-19 Thread Charles Feduke
Scala is the language used to write Spark, so there's never a situation in which features introduced in a newer version of Spark cannot be taken advantage of if you write your code in Scala. (This is mostly true of Java, but it may take a little more legwork if a Java-friendly adapter isn't available

Re: Writing Spark Streaming Programs

2015-03-19 Thread Emre Sevinc
Hello James, I've been working with Spark Streaming for the last 6 months, and I'm coding in Java 7. Even though I haven't encountered any blocking issues with that combination, I'd definitely pick Scala if the decision was up to me. I agree with Gerard and Charles on this one. If you can, go wit

Re: Writing Spark Streaming Programs

2015-03-19 Thread Jeffrey Jedele
I second what has been said already. We just built a streaming app in Java and I would definitely choose Scala this time. Regards, Jeff 2015-03-19 16:34 GMT+01:00 Emre Sevinc : > Hello James, > > I've been working with Spark Streaming for the last 6 months, and I'm > coding in Java 7. Even thou

JAVA_HOME problem with upgrade to 1.3.0

2015-03-19 Thread Williams, Ken
I'm trying to upgrade a Spark project, written in Scala, from Spark 1.2.1 to 1.3.0, so I changed my `build.sbt` like so: -libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.1" % "provided" +libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0" % "provided" the

saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Christian Perez
Hi all, DataFrame.saveAsTable creates a managed table in Hive (v0.13 on CDH5.3.2) in both spark-shell and pyspark, but creates the *wrong* schema _and_ storage format in the Hive metastore, so that the table cannot be read from inside Hive. Spark itself can read the table, but Hive throws a Serial

Re: [SQL] Elasticsearch-hadoop, exception creating temporary table

2015-03-19 Thread Todd Nist
Thanks for the assistance. I found the error; it was something I had done; PEBCAK. I had placed a version of elasticsearch-hadoop.2.1.0.BETA3 in the project/lib directory, causing it to be picked up as a dependency and brought in first, even though the build.sbt had the correct version specified

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
Hi Christian, Your table is stored correctly in Parquet format. For saveAsTable, the table created is *not* a Hive table, but a Spark SQL data source table ( http://spark.apache.org/docs/1.3.0/sql-programming-guide.html#data-sources). We are only using Hive's metastore to store the metadata (to b
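A small sketch of what Yin describes (assuming a DataFrame df and a HiveContext named sqlContext are in scope; the table name is a placeholder): the table is written and read by Spark SQL, while Hive only holds its metadata:

// hedged sketch, Spark 1.3: saveAsTable creates a Spark SQL data source table
df.saveAsTable("events_parquet")              // data stored as Parquet
sqlContext.table("events_parquet").count()    // readable from Spark SQL, not from Hive itself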

Re: JAVA_HOME problem with upgrade to 1.3.0

2015-03-19 Thread Ted Yu
JAVA_HOME, an environment variable, should be defined on the node where appattempt_1420225286501_4699_02 ran. Cheers On Thu, Mar 19, 2015 at 8:59 AM, Williams, Ken wrote: > I’m trying to upgrade a Spark project, written in Scala, from Spark > 1.2.1 to 1.3.0, so I changed my `build.sbt` lik

Re: Spark + Kafka

2015-03-19 Thread James King
Many thanks all for the good responses, appreciated. On Thu, Mar 19, 2015 at 8:36 AM, James King wrote: > Thanks Khanderao. > > On Wed, Mar 18, 2015 at 7:18 PM, Khanderao Kand Gmail < > khanderao.k...@gmail.com> wrote: > >> I have used various version of spark (1.0, 1.2.1) without any issues . >

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Yin Huai
I meant table properties and serde properties are used to store metadata of a Spark SQL data source table. We do not set other fields like SerDe lib. For a user, the output of DESCRIBE EXTENDED/FORMATTED on a data source table should not show unrelated stuff like Serde lib and InputFormat. I have c

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yin Huai
Was the OOM thrown during the execution of the first stage (map) or the second stage (reduce)? If it was the second stage, can you increase the value of spark.sql.shuffle.partitions and see if the OOM disappears? This setting controls the number of reducers Spark SQL will use and the default is 200. Ma
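For reference, a one-line sketch of bumping that setting before running the query (400 is only an example value):

sqlContext.setConf("spark.sql.shuffle.partitions", "400")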

Re: saveAsTable broken in v1.3 DataFrames?

2015-03-19 Thread Christian Perez
Hi Yin, Thanks for the clarification. My first reaction is that if this is the intended behavior, it is a wasted opportunity. Why create a managed table in Hive that cannot be read from inside Hive? I think I understand now that you are essentially piggybacking on Hive's metastore to persist table

Re: JAVA_HOME problem with upgrade to 1.3.0

2015-03-19 Thread Williams, Ken
> From: Ted Yu <yuzhih...@gmail.com> > Date: Thursday, March 19, 2015 at 11:05 AM > > JAVA_HOME, an environment variable, should be defined on the node where > appattempt_1420225286501_4699_02 ran. Has this behavior changed in 1.3.0 since 1.2.1 though? Using 1.2.1 and making no othe

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread David Holiday
hi all - thx for the alacritous replies! so regarding how to get things from notebook to spark and back, am I correct that spark-submit is the way to go? DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com

Re: DataFrame operation on parquet: GC overhead limit exceeded

2015-03-19 Thread Yiannis Gkoufas
Hi Yin, thanks a lot for that! Will give it a shot and let you know. On 19 March 2015 at 16:30, Yin Huai wrote: > Was the OOM thrown during the execution of first stage (map) or the second > stage (reduce)? If it was the second stage, can you increase the value > of spark.sql.shuffle.partitions

RE: Why I didn't see the benefits of using KryoSerializer

2015-03-19 Thread java8964
I read the Spark code a little bit, trying to understand my own question. It looks like the difference is really between org.apache.spark.serializer.JavaSerializer and org.apache.spark.serializer.KryoSerializer, both having a method named writeObject. In my test case, for each line of my text f
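For context, a minimal sketch of enabling Kryo and registering a class (MyRecord is a placeholder for the poster's own payload type); without registration Kryo still has to write class names with each object, which eats into the benefit:

import org.apache.spark.{SparkConf, SparkContext}

case class MyRecord(id: Long, text: String)   // placeholder for the actual payload type

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord]))
val sc = new SparkContext(conf)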

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread Irfan Ahmad
Once you set up spark-notebook, it'll handle the submits for interactive work. Non-interactive is not handled by it; for that, spark-kernel could be used. Give it a shot ... it only takes 5 minutes to get it running in local-mode. *Irfan Ahmad* CTO | Co-Founder | *CloudPhysics*

Problems with spark.akka.frameSize

2015-03-19 Thread Vijayasarathy Kannan
Hi, I am encountering the following error with a Spark application. "Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 11257268 bytes, which exceeds max allowed: spark.akka.frameSize (10485760 bytes) - reserved (204800 bytes). Co
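A sketch of the usual workaround (the value is in MB, and 128 is only an example); the other option is to keep large objects out of the task closure, e.g. by using broadcast variables:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.akka.frameSize", "128")   // default is 10 MB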

Load balancing

2015-03-19 Thread Mohit Anchlia
I am trying to understand how to load balance the incoming data to multiple spark streaming workers. Could somebody help me understand how I can distribute my incoming data from various sources such that incoming data is going to multiple spark streaming nodes? Is it done by spark client with help
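One common pattern, sketched below under the assumption of a socket source (host/port, receiver count and partition count are placeholders): run several receivers in parallel, union them, and repartition so each batch is spread across the cluster before further processing and ssc.start():

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("balanced-ingest")
val ssc = new StreamingContext(conf, Seconds(10))
val streams = (1 to 4).map(_ => ssc.socketTextStream("source-host", 9999))
val unified = ssc.union(streams).repartition(12)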

Re: iPython Notebook + Spark + Accumulo -- best practice?

2015-03-19 Thread David Holiday
kk - I'll put something together and get back to you with more :-) DAVID HOLIDAY Software Engineer 760 607 3300 | Office 312 758 8385 | Mobile dav...@annaisystems.com www.AnnaiSystems.com

Re: Column Similarity using DIMSUM

2015-03-19 Thread Reza Zadeh
Hi Manish, With 56431 columns, the output can be as large as 56431 x 56431 ~= 3bn. When a single row is dense, that can end up overwhelming a machine. You can push that up with more RAM, but note that DIMSUM is meant for tall and skinny matrices: so it scales linearly and across the cluster with rows,
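For reference, a minimal sketch of how the threshold is passed to DIMSUM (the two dense rows are toy placeholders):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(0.0, 2.0, 1.0)))
val similarities = new RowMatrix(rows).columnSimilarities(0.1)   // DIMSUM with threshold 0.1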

Re: calculating TF-IDF for large 100GB dataset problems

2015-03-19 Thread Davies Liu
On Thu, Mar 19, 2015 at 5:16 AM, sergunok wrote: > Hi, > > I am trying to vectorize on a YARN cluster a corpus of texts (about 500K texts in 13 > files - 100GB in total) located in HDFS. > > This process has already taken about 20 hours on a 3 node cluster with 6 cores, > 20GB RAM on each node. > In my opinion i

Re: spark there is no space on the disk

2015-03-19 Thread Davies Liu
Is it possible that `spark.local.dir` is overridden by others? The docs say: NOTE: In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS (Standalone, Mesos) or LOCAL_DIRS (YARN) On Sat, Mar 14, 2015 at 5:29 PM, Peng Xia wrote: > Hi Sean, > > Thanks very much for your reply. > I tri

Spark Streaming custom receiver for local data

2015-03-19 Thread MartijnD
We are building a wrapper that makes it possible to use reactive streams (i.e. Observable, see reactivex.io) as input to Spark Streaming. We therefore tried to create a custom receiver for Spark. However, the Observable lives at the driver program and is generally not serializable. Is it possible

Re: Spark 1.3 createDataframe error with pandas df

2015-03-19 Thread Davies Liu
On Mon, Mar 16, 2015 at 6:23 AM, kevindahl wrote: > kevindahl wrote >> I'm trying to create a spark data frame from a pandas data frame, but for >> even the most trivial of datasets I get an error along the lines of this: >> >> --

Re: Spark-submit and multiple files

2015-03-19 Thread Davies Liu
You could submit additional Python source via --py-files , for example: $ bin/spark-submit --py-files work.py main.py On Tue, Mar 17, 2015 at 3:29 AM, poiuytrez wrote: > Hello guys, > > I am having a hard time to understand how spark-submit behave with multiple > files. I have created two code s

Re: Error when using multiple python files spark-submit

2015-03-19 Thread Davies Liu
the options of spark-submit should come before main.py, or they will become the options of main.py, so it should be: ../hadoop/spark-install/bin/spark-submit --py-files /home/poiuytrez/naive.py,/home/poiuytrez/processing.py,/home/poiuytrez/settings.py --master spark://spark-m:7077 main.py

Spark SQL filter DataFrame by date?

2015-03-19 Thread kamatsuoka
I'm trying to filter a DataFrame by a date column, with no luck so far. Here's what I'm doing: When I run reqs_day.count() I get zero, apparently because my date parameter gets translated to 16509. Is this a bug, or am I doing it wrong?

Issues with SBT and Spark

2015-03-19 Thread Vijayasarathy Kannan
My current simple.sbt is name := "SparkEpiFast" version := "1.0" scalaVersion := "2.11.4" libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.2.1" % "provided" libraryDependencies += "org.apache.spark" % "spark-graphx_2.11" % "1.2.1" % "provided" When I do "sbt package", it co

Re: Issues with SBT and Spark

2015-03-19 Thread Masf
Hi, Spark 1.2.1 uses Scala 2.10. Because of this, your program fails with Scala 2.11. Regards On Thu, Mar 19, 2015 at 8:17 PM, Vijayasarathy Kannan wrote: > My current simple.sbt is > > name := "SparkEpiFast" > > version := "1.0" > > scalaVersion := "2.11.4" > > libraryDependencies += "org.apach

Re: Issues with SBT and Spark

2015-03-19 Thread Sean Owen
No, Spark is cross-built for 2.11 too, and those are the deps being pulled in here. This really does, however, sound like a Scala 2.10 vs 2.11 mismatch. Check, for example, that your cluster is using the same build of Spark and that you did not package Spark with your app
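One way to keep the Scala versions aligned, sketched here for a cluster running the default 2.10 build (versions are examples), is to use %% so sbt resolves the artifact matching scalaVersion:

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.2.1" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.2.1" % "provided"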

Re: Spark SQL filter DataFrame by date?

2015-03-19 Thread Yin Huai
Can you add your code snippet? Seems it's missing in the original email. Thanks, Yin On Thu, Mar 19, 2015 at 3:22 PM, kamatsuoka wrote: > I'm trying to filter a DataFrame by a date column, with no luck so far. > Here's what I'm doing: > > > > When I run reqs_day.count() I get zero, apparently

Re: spark there is no space on the disk

2015-03-19 Thread Marcelo Vanzin
IIRC you have to set that configuration on the Worker processes (for standalone). The app can't override it (only for a client-mode driver). YARN has a similar configuration, but I don't know the name (shouldn't be hard to find, though). On Thu, Mar 19, 2015 at 11:56 AM, Davies Liu wrote: > Is it

Re: spark there is no space on the disk

2015-03-19 Thread Ted Yu
For YARN, possibly this one? yarn.nodemanager.local-dirs (e.g. /hadoop/yarn/local) Cheers On Thu, Mar 19, 2015 at 2:21 PM, Marcelo Vanzin wrote: > IIRC you have to set that configuration on the Worker processes (for > standalone). The app can't override it (only for a client-mo

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-19 Thread Doug Balog
I’m seeing the same problem. I’ve set logging to DEBUG, and I think some hints are in the “Yarn AM launch context” that is printed out before Yarn runs java. My next step is to talk to the admins and get them to set yarn.nodemanager.delete.debug-delay-sec in the config, as recommended in htt

FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded

2015-03-19 Thread roni
I get 2 types of error: org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0, and FetchFailedException: Adjusted frame length exceeds 2147483647: 12716268407 - discarded. Spark keeps re-trying to submit the code and keeps getting this error. My file on wh

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-19 Thread Eason Hu
Hi Akhil, Thank you for your help. I just found that the problem is related to my local Spark application: I ran it in IntelliJ and I didn't reload the project after I recompiled the jar via Maven. If I don't reload, it uses some locally cached data to run the application, which leads to t

Cloudant as Spark SQL External Datastore on Spark 1.3.0

2015-03-19 Thread Yang Lei
Check this out: https://github.com/cloudant/spark-cloudant. It supports both the DataFrame and SQL approaches for reading data from Cloudant and saving it. Looking forward to your feedback on the project. Yang

Catching InvalidClassException in sc.objectFile

2015-03-19 Thread Justin Yip
Hello, I have persisted an RDD[T] to disk through "saveAsObjectFile". Then I changed the implementation of T. When I read the file with sc.objectFile using the new binary, I got a java.io.InvalidClassException, which is expected. I tried to catch this error via SparkException in the d

Timeout Issues from Spark 1.2.0+

2015-03-19 Thread EH
Hi all, I'm trying to run the sample Spark application in version v1.2.0 and above. However, I've encountered a weird issue like the one below. This issue is only seen in v1.2.0 and above; v1.1.0 and v1.1.1 are fine. The sample code: val sc : SparkContext = new SparkContext(conf) val NUM_SAMPLES

Reliable method/tips to solve dependency issues?

2015-03-19 Thread Jim Kleckner
Do people have a reliable/repeatable method for solving dependency issues or tips? The current world of spark-hadoop-hbase-parquet-... is very challenging given the huge footprint of dependent packages and we may be pushing against the limits of how many packages can be combined into one environme

Spark SQL Self join with aggregate

2015-03-19 Thread Shailesh Birari
Hello, I want to use Spark SQL to aggregate some columns of the data. e.g. I have huge data with some columns as: time, src, dst, val1, val2 I want to calculate sum(val1) and sum(val2) for all unique pairs of src and dst. I tried forming the SQL query SELECT a.time, a.src, a.dst, sum(

RE: Spark SQL Self join with aggregate

2015-03-19 Thread Cheng, Hao
Not so sure of your intention, but something like "SELECT sum(val1), sum(val2) FROM table GROUP BY src, dest"? -Original Message- From: Shailesh Birari [mailto:sbirar...@gmail.com] Sent: Friday, March 20, 2015 9:31 AM To: user@spark.apache.org Subject: Spark SQL Self join with aggregate
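Spelled out against the columns from the original post (the table name "events" and a SQLContext named sqlContext are assumptions), that query would be roughly:

val result = sqlContext.sql(
  "SELECT src, dst, SUM(val1) AS total_val1, SUM(val2) AS total_val2 " +
  "FROM events GROUP BY src, dst")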

Re: LZO configuration can not affect

2015-03-19 Thread Ted Yu
jeanlyn92: I was not very clear in my previous reply: I meant to refer to /home/hadoop/mylib/hadoop-lzo-SNAPSHOT.jar But it looks like the distro includes hadoop-lzo-0.4.15.jar Cheers On Thu, Mar 19, 2015 at 6:26 PM, jeanlyn92 wrote: > That's not enough. The config must point to the specific jar instead

Re: Can LBFGS be used on streaming data?

2015-03-19 Thread Jeremy Freeman
Regarding the first question, can you say more about how you are loading your data? And what is the size of the data set? And is that the only error you see, and do you only see it in the streaming version? For the second question, there are a couple reasons the weights might slightly differ, i

Re: HDP 2.2 AM abort : Unable to find ExecutorLauncher class

2015-03-19 Thread Bharath Ravi Kumar
Hi Doug, I did try setting that config parameter to a larger number (several minutes), but still wasn't able to retrieve additional context logs. Let us know if you have any success with it. Thanks, Bharath On Fri, Mar 20, 2015 at 3:21 AM, Doug Balog wrote: > I’m seeing the same problem. > I’v

RE: Column Similarity using DIMSUM

2015-03-19 Thread Manish Gupta 8
Thanks Reza. It makes perfect sense. Regards, Manish From: Reza Zadeh [mailto:r...@databricks.com] Sent: Thursday, March 19, 2015 11:58 PM To: Manish Gupta 8 Cc: user@spark.apache.org Subject: Re: Column Similarity using DIMSUM Hi Manish, With 56431 columns, the output can be as large as 56431 x

Re: KMeans with large clusters Java Heap Space

2015-03-19 Thread mvsundaresan
Thanks Derrick, when I count the unique terms it is very small. So I added this... val tfidf_features = lines.flatMap(x => x._2.split(" ").filter(_.length > 2)).distinct().count().toInt val hashingTF = new HashingTF(tfidf_features)

Spark MLLib KMeans Top Terms

2015-03-19 Thread mvsundaresan
I'm trying to cluster short text messages using KMeans; after training the KMeans model I want to get the top terms (5 - 10). How do I get that using clusterCenters? Full code is here: http://apache-spark-user-list.1001560.n3.nabble.com/KMeans-with-large-clusters-Java-Heap-Space-td21432.html
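A hedged sketch of one way to read the centers (assuming a trained KMeansModel named model): take the largest-weight indices of each center vector. Note that with HashingTF the indices are hashed buckets, so mapping them back to actual terms requires keeping your own term-to-index dictionary.

// top-10 feature indices per cluster, by descending weight in the center vector
val topIndicesPerCluster = model.clusterCenters.map { center =>
  center.toArray.zipWithIndex
    .sortBy { case (weight, _) => -weight }
    .take(10)
    .map { case (_, index) => index }
}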

Launching Spark Cluster Application through IDE

2015-03-19 Thread raggy
I am trying to debug a Spark application on a cluster using a master and several worker nodes. I have been successful at setting up the master node and worker nodes using the Spark standalone cluster manager. I downloaded the Spark folder with binaries and used the following commands to set up worker and

Re: Software stack for Recommendation engine with spark mlib

2015-03-19 Thread Shashidhar Rao
Hi, Just 2 follow-up questions, please suggest: 1. Is there any commercial recommendation engine, apart from the open source tools (Mahout, Spark), that anybody can suggest? 2. In this case only the purchase transaction is captured. There are no ratings and no feedback availabl

Measure Bytes Read and Peak Memory Usage for Query

2015-03-19 Thread anu
Hi All, I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL query. Please clarify: does Bytes Read = aggregate size of all RDDs? All my RDDs are in memory and 0B spilled to disk. And I am clueless about how to measure Peak Memory Usage.

Re: Upgrade from Spark 1.1.0 to 1.1.1+ Issues

2015-03-19 Thread Akhil Das
Are you submitting your application from local to a remote host? If you want to run the spark application from a remote machine, then you have to at least set the following configurations properly. - *spark.driver.host* - points to the ip/host from where you are submitting the job (make sure you

Re: Measure Bytes Read and Peak Memory Usage for Query

2015-03-19 Thread Akhil Das
You could do a cache and see the memory usage under the Storage tab in the driver UI (runs on port 4040). Thanks Best Regards On Fri, Mar 20, 2015 at 12:02 PM, anu wrote: > Hi All > > I would like to measure Bytes Read and Peak Memory Usage for a Spark SQL > Query. > > Please clarify if Bytes Read =
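A tiny sketch of that (the input path is a placeholder): caching plus an action makes the in-memory size show up under Storage:

val df = sqlContext.parquetFile("hdfs:///data/table")   // placeholder input
df.cache()
df.count()   // forces materialization; the size then appears under Storage on port 4040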

Re: Launching Spark Cluster Application through IDE

2015-03-19 Thread Akhil Das
From IntelliJ, you can use the remote debugging feature. For remote debugging, you need to pass the following JVM options: -Xdebug -Xrunjdwp:server=y,transport=dt_socket,address=4000,suspend=n and configure yo