Spark reads partitions in a wrong order

2014-04-25 Thread Mingyu Kim
If the underlying file system returns files in a non-alphabetical order to java.io.File.listFiles(), Spark reads the partitions out of order. Here's an example. var sc = new SparkContext("local[3]", "test"); var rdd1 = sc.parallelize([1,2,3,4,5]); rdd1.saveAsTextFile("file://path/to/file"); var rd
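A version of that reproduction in valid Scala (the preview above mixes pseudo-syntax; paths here are placeholders):

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[3]", "test")
    val rdd1 = sc.parallelize(Seq(1, 2, 3, 4, 5))
    rdd1.saveAsTextFile("file:///tmp/ordering-test")
    // If listFiles() hands back the part-files out of order, the
    // elements can come back reordered too:
    val rdd2 = sc.textFile("file:///tmp/ordering-test")
    println(rdd2.collect().mkString(","))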

Re: Spark reads partitions in a wrong order

2014-04-25 Thread Andrew Ash
Have you run the same test but with a URL in HDFS rather than the local filesystem? I think order may be preserved in that run, which makes the local filesystem losing order look more like a bug. Sent from my mobile phone On Apr 25, 2014 9:11 AM, "Mingyu Kim" wrote: > If the underlying file syst

Re: Deploying a python code on a spark EC2 cluster

2014-04-25 Thread Shubhabrata
Well, we used the script that comes with spark I think v0.9.1. But I am gonna try the newer version (1.0rvc2 script). I shall keep you posted about my findings. Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Deploying-a-python-code-on-a-spark-EC2-cl

MultipleOutputs IdentityReducer

2014-04-25 Thread Andre Kuhnen
Hello, I am trying to write multiple files with Spark, but I cannot find a way to do it. Here is the idea. val rddKeyValue : RDD[(String, String)] = rddlines.map( line => createKeyValue(line)) Now I would like to save each key as a separate file, with all of its values inside the file. I tried to use this after the
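A commonly suggested approach (not necessarily what the truncated thread settles on) is a custom MultipleTextOutputFormat that names each output file after its key; a sketch, assuming rddKeyValue from above and an illustrative output path:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.SparkContext._

    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      // Name each output file after its key.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.asInstanceOf[String]
      // Keep only the values in the file body.
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    rddKeyValue.saveAsHadoopFile("/tmp/by-key", classOf[String], classOf[String],
      classOf[KeyBasedOutput])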

RE: JMX with Spark

2014-04-25 Thread Ravi Hemnani
Can you share your working metrics.properties? I want remote JMX to be enabled, so I need to use the JMXSink and monitor my Spark master and workers. But what are the parameters that need to be defined, like host and port? Your config would help. -- View this message in context: http://ap
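For reference, a minimal metrics.properties sketch that registers the JMX sink (the class name is from Spark's metrics.properties.template; everything else is an assumption to adapt):

    # conf/metrics.properties -- enable the JMX sink for all instances
    *.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

The remote host/port are not set in this file: they come from standard JVM flags on each daemon, e.g. something like -Dcom.sun.management.jmxremote.port=9999 (plus the usual jmxremote authenticate/ssl flags) in SPARK_DAEMON_JAVA_OPTS; the port is illustrative.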

read file from hdfs

2014-04-25 Thread Joe L
I have just two questions. sc.textFile("hdfs://host:port/user/matei/whatever.txt") Is host the master node? What port should we use? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/read-file-from-hdfs-tp4824.html Sent from the Apache Spark User List mailing

Questions about productionizing spark

2014-04-25 Thread Han JU
Hi all, We are actively testing/benchmarking Spark for our production use. Here are some questions about problems we've encountered so far: 1. By default 66% of the executor memory is used for RDD caching, so if there's no explicit caching in the code (e.g. rdd.cache(), rdd.persist(StorageLevel.M
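That 66% cache fraction is tunable; a sketch assuming the 0.9-era property spark.storage.memoryFraction, for jobs that never cache:

    import org.apache.spark.{SparkConf, SparkContext}

    // With no cache()/persist() calls, hand most of the cache region
    // back to task execution (the 0.1 value is illustrative).
    val conf = new SparkConf()
      .setAppName("no-cache-job")
      .set("spark.storage.memoryFraction", "0.1")
    val sc = new SparkContext(conf)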

Re: Pig on Spark

2014-04-25 Thread Mark Baker
I've only had a quick look at Pig, but it seems that a declarative layer on top of Spark couldn't be anything other than a big win, as it allows developers to declare *what* they want, permitting the compiler to determine how best to poke at the RDD API to implement it. In my brief time with Spark, I

FW: reduceByKeyAndWindow - spark internals

2014-04-25 Thread Adrian Mocanu
Any suggestions where I can find this in the documentation or elsewhere? Thanks From: Adrian Mocanu [mailto:amoc...@verticalscope.com] Sent: April-24-14 11:26 AM To: u...@spark.incubator.apache.org Subject: reduceByKeyAndWindow - spark internals If I have this code: val stream1= doublesInputStre
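For context, a sketch of the two reduceByKeyAndWindow variants (stream source, durations, and checkpoint path are illustrative); the inverse-function form adds entering batches and subtracts leaving ones instead of recomputing the whole window:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    val ssc = new StreamingContext(new SparkConf().setAppName("win"), Seconds(10))
    ssc.checkpoint("/tmp/ckpt") // required by the inverse-function variant
    val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1.0))
    // Recomputes the full 30s window on every 10s slide:
    val sums = pairs.reduceByKeyAndWindow((a: Double, b: Double) => a + b,
      Seconds(30), Seconds(10))
    // Incremental: add what enters the window, subtract what leaves it:
    val sumsInv = pairs.reduceByKeyAndWindow((a: Double, b: Double) => a + b,
      (a: Double, b: Double) => a - b, Seconds(30), Seconds(10))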

Re: Deploying a python code on a spark EC2 cluster

2014-04-25 Thread Shubhabrata
This is the error from stderr: Spark Executor Command: "java" "-cp" ":/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar" "-Djava.library.path=/root/ephemeral-hdfs/lib/nati

Re: Pig on Spark

2014-04-25 Thread Eugen Cepoi
It depends; personally, I have the opposite opinion. IMO expressing pipelines in a functional language feels natural, you just have to get used to the language (Scala). Testing Spark jobs is easy, whereas testing a Pig script is much harder and less natural. If you want a more high-level language t

strange error

2014-04-25 Thread Joe L
[error] 14/04/25 23:09:57 INFO slf4j.Slf4jLogger: Slf4jLogger started [error] 14/04/25 23:09:57 INFO Remoting: Starting remoting [error] 14/04/25 23:09:58 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@cm03:5] [error] 14/04/25 23:09:58 INFO Remoting: Remoting now lis

Re: read file from hdfs

2014-04-25 Thread Christophe Préaud
You should use the values defined in the 'fs.defaultFS' property (in the Hadoop core-site.xml file). Christophe. On 25/04/2014 14:38, Joe L wrote: I have just two questions. sc.textFile("hdfs://host:port/user/matei/whatever.txt") Is host the master node? What port should we use? -- View this me
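So with, say, fs.defaultFS set to hdfs://namenode-host:8020 (8020 and 9000 are common defaults; the host name is a placeholder), the call becomes, assuming an existing SparkContext sc:

    val lines = sc.textFile("hdfs://namenode-host:8020/user/matei/whatever.txt")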

Securing Spark's Network

2014-04-25 Thread Jacob Eisinger
Howdy, We tried running Spark 0.9.1 stand-alone inside docker containers distributed over multiple hosts. This is complicated due to Spark opening up ephemeral / dynamic ports for the workers and the CLI.  To ensure our docker solution doesn't break Spark in unexpected ways and maintains a sec

Re: Deploying a python code on a spark EC2 cluster

2014-04-25 Thread Shubhabrata
In order to check if there is any issue with the Python API, I ran a Scala application provided in the examples. Still the same error. ./bin/run-example org.apache.spark.examples.SparkPi spark://[Master-URL]:7077 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/mn

Re: what is the best way to do cartesian

2014-04-25 Thread Alex Boisvert
You might want to try the built-in RDD.cartesian() method. On Thu, Apr 24, 2014 at 9:05 PM, Qin Wei wrote: > Hi All, > > I have a problem with the Item-Based Collaborative Filtering Recommendation > Algorithms in spark. > The basic flow is as below: >
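A quick sketch of the built-in method, assuming a SparkContext sc:

    val a = sc.parallelize(Seq(1, 2, 3))
    val b = sc.parallelize(Seq("x", "y"))
    // Every combination: (1,x), (1,y), (2,x), (2,y), (3,x), (3,y)
    val pairs = a.cartesian(b)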

Re: what is the best way to do cartesian

2014-04-25 Thread Eugen Cepoi
Depending on the size of the RDD, you could also do a collect and broadcast, and then compute the product in a map function over the other RDD. If this is the same RDD, you might also want to cache it. This pattern worked quite well for me. On Apr 25, 2014 18:33, "Alex Boisvert" wrote: > You might wan
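A sketch of that collect-and-broadcast pattern (names are illustrative); unlike cartesian it needs no shuffle, provided the small side fits in memory:

    val small = sc.parallelize(Seq(1, 2, 3))
    val big = sc.parallelize(1 to 1000000)
    // Ship the small side to every executor once...
    val smallBc = sc.broadcast(small.collect())
    // ...then build the product map-side:
    val product = big.flatMap(x => smallBc.value.map(y => (x, y)))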

Re: Pig on Spark

2014-04-25 Thread Michael Armbrust
On Fri, Apr 25, 2014 at 6:30 AM, Mark Baker wrote: > I've only had a quick look at Pig, but it seems that a declarative > layer on top of Spark couldn't be anything other than a big win, as it > allows developers to declare *what* they want, permitting the compiler > to determine how best to poke at

Spark & Shark 0.9.1 on ec2 with Hadoop 2 error

2014-04-25 Thread jesseerdmann
I've run into a problem trying to launch a cluster using the provided ec2 python script with --hadoop-major-version 2. The launch completes correctly with the exception of an Exception getting thrown for Tachyon 7 (I've included it at the end of the message, but that is not the focus and seems unr

Re: Securing Spark's Network

2014-04-25 Thread Akhil Das
Hi Jacob, This post might give you a brief idea about the ports being used https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger wrote: > Howdy, > > We tried running Spark 0.9.1 stand-alone inside docker containers > distributed ove

Strange lookup behavior. Possible bug?

2014-04-25 Thread Yadid Ayzenberg
Hi All, I'm running a lookup on a JavaPairRDD. When running on a local machine, the lookup is successful. However, when running on a standalone cluster with the exact same dataset, one of the tasks never ends (constantly in RUNNING status). When viewing the worker log, it seems that the task has f
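For reference, lookup returns all values for a key; a minimal Scala sketch (the thread uses the Java API, but the semantics match), assuming a SparkContext sc with its implicits imported:

    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    pairs.lookup("a") // Seq(1, 2) -- several values for one key
    pairs.lookup("c") // empty Seq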

Re: Spark & Shark 0.9.1 on ec2 with Hadoop 2 error

2014-04-25 Thread Akhil Das
Hi, I also faced the same problem with Shark 0.9.1, and I fixed it by sbt clean/packaging Shark with the right Hadoop version. You may execute the following commands to get it done. cd shark;export SHARK_HADOOP_VERSION=$(/root/ephemeral-hdfs/bin/hadoop version | head -n1 | cut

help

2014-04-25 Thread Joe L
I need someone's help, please. I am getting the following error. [error] 14/04/26 03:09:47 INFO cluster.SparkDeploySchedulerBackend: Executor app-20140426030946-0004/8 removed: class java.io.IOException: Cannot run program "/home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh" (in directory

Re: JMX with Spark

2014-04-25 Thread Paul Schooss
Hello Folks, Sorry for the delay, these emails got missed due to the volume. Here is my metrics.conf root@jobs-ab-hdn4:~# cat /opt/klout/spark/conf/metrics.conf # syntax: [instance].sink|source.[name].[options]=[value] # This file configures Spark's internal metrics system. The metrics syste

Re: Pig on Spark

2014-04-25 Thread Bharath Mundlapudi
>> I've only had a quick look at Pig, but it seems that a declarative >> layer on top of Spark couldn't be anything other than a big win, as it >> allows developers to declare *what* they want, permitting the compiler >> to determine how best to poke at the RDD API to implement it. The devil is in th

Re: help

2014-04-25 Thread Jey Kottalam
Try taking a look at the stderr logs of the executor "app-20140426030946-0004/8". This should be in the $SPARK_HOME/work directory of the corresponding machine. Hope that helps, -Jey On Fri, Apr 25, 2014 at 11:17 AM, Joe L wrote: > I need someone's help please I am getting the following error. >

Re: Spark and HBase

2014-04-25 Thread Josh Mahonin
Phoenix generally presents itself as an endpoint using JDBC, which in my testing seems to play nicely using JdbcRDD. However, a few days ago a patch was made against Phoenix to implement support via PIG using a custom Hadoop InputFormat, which means now it has Spark support too. Here's a code sni
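The JdbcRDD route looks roughly like this sketch (the Phoenix JDBC URL, table, and bounds are placeholder assumptions; the query must carry two ? placeholders for the partition bounds):

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:phoenix:zk-host"), // hypothetical quorum
      "SELECT id, name FROM my_table WHERE id >= ? AND id <= ?",
      1L, 1000L, 3, // lower bound, upper bound, partitions
      (r: ResultSet) => (r.getLong(1), r.getString(2)))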

Re: help

2014-04-25 Thread Joe L
Hi, thank you for your reply, but I could not find it; it says no such file or directory. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/help-tp4841p4848.html Sent from the A

Build times for Spark

2014-04-25 Thread Williams, Ken
I've cloned the github repo and I'm building Spark on a pretty beefy machine (24 CPUs, 78GB of RAM) and it takes a pretty long time. For instance, today I did a 'git pull' for the first time in a week or two, and then doing 'sbt/sbt assembly' took 43 minutes of wallclock time (88 minutes of CPU

Re: Pig on Spark

2014-04-25 Thread Mayur Rustagi
One core segment that frequently asks for systems like Pig & Hive is analysts who want to deal with data. The key place I see Pig fitting in is letting non-developers deal with data at scale & freeing up developers to deal with code and UDFs rather than managing day-to-day dataflow changes & updates. A bypr

Re: Build times for Spark

2014-04-25 Thread DB Tsai
Are you using an SSD? We found that the bottleneck is not computational, but disk IO. During assembly, sbt moves lots of class files and jars, packaging them into a single flat jar. I can do the assembly on my MacBook in 10 mins, while before upgrading to an SSD it took 30~40 mins. Sincerely, DB Tsai ---

Re: Build times for Spark

2014-04-25 Thread Josh Rosen
Did you configure SBT to use the extra memory? On Fri, Apr 25, 2014 at 12:53 PM, Williams, Ken wrote: > I’ve cloned the github repo and I’m building Spark on a pretty beefy > machine (24 CPUs, 78GB of RAM) and it takes a pretty long time. > > > > For instance, today I did a ‘git pull’ for the

Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

2014-04-25 Thread Darq Moth
I am trying to find some docs / a description of the approach on this subject, please help. I have Hadoop 2.2.0 from Hortonworks installed with some existing Hive tables I need to query. Hive SQL works extremely and unreasonably slowly on a single node and on a cluster as well. I hope Shark will work faster.

Re: Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

2014-04-25 Thread Mayur Rustagi
You have to configure Shark to access the Hortonworks Hive metastore (HCatalog?) & then you will start seeing the tables in the Shark shell & can run queries as normal; Shark will leverage Spark for processing your queries. Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rusta
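Concretely, that usually means pointing Shark at the existing hive-site.xml via shark-env.sh; a sketch with placeholder paths (the exact variables to set should be verified against your Shark version's shark-env.sh.template):

    # shark-env.sh -- paths are illustrative for a Hortonworks install
    export HIVE_CONF_DIR=/etc/hive/conf   # directory holding hive-site.xml
    export HADOOP_HOME=/usr/lib/hadoop    # hypothetical HDP Hadoop location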

RE: Build times for Spark

2014-04-25 Thread Williams, Ken
No, I haven’t done any config for SBT. Is there somewhere you might be able to point me toward for how to do that? -Ken From: Josh Rosen [mailto:rosenvi...@gmail.com] Sent: Friday, April 25, 2014 3:27 PM To: user@spark.apache.org Subject: Re: Build times for Spark Did you configure SBT to use

Re: Build times for Spark

2014-04-25 Thread Akhil Das
You can always increase the sbt memory by setting export JAVA_OPTS="-Xmx10g" Thanks Best Regards On Sat, Apr 26, 2014 at 2:17 AM, Williams, Ken wrote: > No, I haven't done any config for SBT. Is there somewhere you might be > able to point me toward for how to do that? > > > > -Ken > > >

Re: Scala Spark / Shark: How to access existing Hive tables in Hortonworks?

2014-04-25 Thread Darq Moth
Thanks! For now I use JDBC from Scala to get data from Hive. In Hive I have a simple table with 20 rows in the following format: user_id, movie_title, rating, date I do 3 nested select requests: 1) select distinct user_id 2) for each user_id: select distinct movie_title //select a

Re: Securing Spark's Network

2014-04-25 Thread Jacob Eisinger
Howdy Akhil, Thanks - that did help! And, it made me think about how the EC2 scripts work [1] to set up security. From my understanding of EC2 security groups [2], this just sets up external access, right? (This has no effect on internal communication between the instances, right?) I am still

Re: Build times for Spark

2014-04-25 Thread Shivaram Venkataraman
Are you by any chance building this on NFS? As far as I know, the build is severely bottlenecked by filesystem calls during assembly (each class file in each dependency gets an fstat call or something like that). That is partly why building from, say, a local ext4 filesystem or an SSD is much faster i

RE: Build times for Spark

2014-04-25 Thread Williams, Ken
I am indeed, but it's a pretty fast NFS. I don't have any SSD I can use, but I could try to use local disk to see what happens. For me, a large portion of the time seems to be spent on lines like "Resolving org.fusesource.jansi#jansi;1.4 ..." or similar. Is this going out to find Maven resou

Re: Build times for Spark

2014-04-25 Thread Shivaram Venkataraman
AFAIK the resolver does pick up things from your local ~/.m2 -- note that since ~/.m2 is on NFS, that adds to the amount of filesystem traffic. Shivaram On Fri, Apr 25, 2014 at 2:57 PM, Williams, Ken wrote: > I am indeed, but it's a pretty fast NFS. I don't have any SSD I can > use, but I could t

Re: Spark and HBase

2014-04-25 Thread Nicholas Chammas
Josh, is there a specific use pattern you think is served well by Phoenix + Spark? Just curious. On Fri, Apr 25, 2014 at 3:17 PM, Josh Mahonin wrote: > Phoenix generally presents itself as an endpoint using JDBC, which in my > testing seems to play nicely using JdbcRDD. > > However, a few days

Re: Strange lookup behavior. Possible bug?

2014-04-25 Thread Yadid Ayzenberg
Some additional information - maybe this rings a bell with someone: I suspect this happens when the lookup returns more than one value. For 0 and 1 values, the function behaves as you would expect. Anyone? On 4/25/14, 1:55 PM, Yadid Ayzenberg wrote: Hi All, I'm running a lookup on a JavaPai

Re: help

2014-04-25 Thread Jey Kottalam
Sorry, but I don't know where Cloudera puts the executor log files. Maybe their docs give the correct path? On Fri, Apr 25, 2014 at 12:32 PM, Joe L wrote: > hi thank you for your reply but I could not find it. it says that no such > file or directory > > >

Running out of memory Naive Bayes

2014-04-25 Thread John King
I've been trying to use the Naive Bayes classifier. Each example in the dataset has about 2 million features, only about 20-50 of which are non-zero, so the vectors are very sparse. I keep running out of memory though, even for about 1000 examples on 30GB of RAM, while the entire dataset is 4 million ex
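One likely culprit: MLlib's NaiveBayes in 0.9 takes dense Array[Double] features, so each 2M-feature example is materialized densely; sparse vectors arrive with the MLlib 1.0 API. A sketch assuming that newer API (indices and values are illustrative):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // 2M-dimensional points with a handful of non-zeros each:
    val data = Seq(
      LabeledPoint(1.0, Vectors.sparse(2000000, Array(3, 101), Array(1.0, 2.0))),
      LabeledPoint(0.0, Vectors.sparse(2000000, Array(7, 9000), Array(4.0, 1.0))))
    val model = NaiveBayes.train(sc.parallelize(data))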

Question about Transforming huge files from Local to HDFS

2014-04-25 Thread PengWeiPRC
Hi there, I am sorry to bother you, but I encountered a problem with transforming large files (hundreds of gigabytes per file) from the local file system to HDFS as Parquet file format using Spark. The problem can be described as follows. 1) When I tried to read a huge file from local storage and used Avro + Parqu

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
I've tried to set a larger buffer, but reduceByKey seems to have failed. Need help :) 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Shutting down all executors 14/04/26 12:31:12 INFO cluster.CoarseGrainedSchedulerBackend: Asking each executor to shut down 14/04/26 12:31:12 INFO schedule

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
This error came just because I killed my app :( Is there something wrong? The reduceByKey operation is extremely slow (slower than with the default serializer). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4869.html Sen

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
reduceByKey(_+_).countByKey instead of countByKey seems to be fast. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4870.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: parallelize for a large Seq is extreamly slow.

2014-04-25 Thread Earthson
parallelize is still so slow. package com.semi.nlp import org.apache.spark._ import SparkContext._ import scala.io.Source import com.esotericsoftware.kryo.Kryo import org.apache.spark.serializer.KryoRegistrator class MyRegistrator extends KryoRegistrator { override def registerCla
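For readers following along, the registrator pattern the truncated snippet starts is roughly this sketch (the registered classes and app wiring are assumptions):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.serializer.KryoRegistrator

    class MyRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Register whatever classes the job actually ships (illustrative):
        kryo.register(classOf[Array[String]])
      }
    }

    val conf = new SparkConf()
      .setAppName("kryo-demo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.semi.nlp.MyRegistrator")
    val sc = new SparkContext(conf)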