Re: ClassNotDefException when using spark-submit with multiple jars and files located on HDFS

2015-06-10 Thread Akhil Das
Or you can do sc.addJar("/path/to/the/jar"). I haven't tested it with an HDFS path, though it works fine with a local path. Thanks Best Regards On Wed, Jun 10, 2015 at 10:17 AM, Jörn Franke wrote: > I am not sure they work with HDFS paths. You may want to look at the > source code. Alternatively you can
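A minimal sketch of that suggestion in Scala (the jar names and the HDFS URI are placeholders; whether addJar resolves HDFS URIs should be verified on your cluster):

    // assumes an existing SparkContext named sc
    sc.addJar("/path/to/the/myLib.jar")               // local path, reported to work above
    sc.addJar("hdfs://namenode:8020/jars/myLib.jar")  // HDFS URI, untested per the reply above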

Re: How to use Apache spark mllib Model output in C++ component

2015-06-10 Thread Akhil Das
SWIG and JNA might help for accessing C++ libraries from Java. Thanks Best Regards On Wed, Jun 10, 2015 at 11:50 AM, mahesht wrote: > > There is a C++ component which uses some model that we want to replace with a > spark model

Re: Spark's Scala shell killing itself

2015-06-10 Thread Akhil Das
Maybe you should update your Spark version to the latest one. Thanks Best Regards On Wed, Jun 10, 2015 at 11:04 AM, Chandrashekhar Kotekar < shekhar.kote...@gmail.com> wrote: > Hi, > > I have configured Spark to run on YARN. Whenever I start the Spark shell using the > 'spark-shell' command, it automat

Re: Join between DStream and Periodically-Changing-RDD

2015-06-10 Thread Akhil Das
RDDs are immutable, so why not join two DStreams? Not sure, but you can also try something like this: kvDstream.foreachRDD(rdd => { val file = ssc.sparkContext.textFile("/sigmoid/") val kvFile = file.map(x => (x.split(",")(0), x)) rdd.join(kvFile) }) Thanks Best Regards
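A slightly fuller sketch of the same idea, re-reading the side data inside foreachRDD so each batch joins against a fresh copy (the path and the comma-separated field layout are illustrative):

    // kvDstream: DStream[(String, String)]; ssc: StreamingContext
    kvDstream.foreachRDD { rdd =>
      val file   = ssc.sparkContext.textFile("/sigmoid/")        // reloaded every batch
      val kvFile = file.map(line => (line.split(",")(0), line))  // key on the first field
      val joined = rdd.join(kvFile)                              // RDD[(String, (String, String))]
      joined.take(10).foreach(println)                           // or write out / process further
    }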

Re: Determining number of executors within RDD

2015-06-10 Thread Himanshu Mehra
Hi Akshat, I assume what you want is to control the number of partitions in your RDD, which is easily achievable by passing the numSlices and minSplits arguments at the time of RDD creation. Example: val someRDD = sc.parallelize(someCollection, numSlices) / val someRDD = sc.textFile(pathToFile, minS

Re: Running SparkSql against Hive tables

2015-06-10 Thread Cheng Lian
On 6/10/15 1:55 AM, James Pirz wrote: I am trying to use Spark 1.3 (Standalone) against Hive 1.2 running on Hadoop 2.6. I looked at the ThriftServer2 logs and realized that the server was not starting properly because of a failure in creating a server socket. In fact, I had passed the URI to m

how to maintain huge dataset while using spark streaming

2015-06-10 Thread homar
Hi, I'm currently working on the following use case: I have lots of events, each of which has userId, createTime, visitStartDate (initially empty) and many other fields. I would like to use Spark Streaming to tag those events with a visit start date. Two events form a visit if: 1. they have the sam

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-10 Thread Cheng Lian
Would you mind providing the executor output so that we can check why the executors died? You may also run EXPLAIN EXTENDED to find out the physical plan of your query, something like: 0: jdbc:hive2://localhost:1> explain extended select * from foo; +--

Re: Monitoring Spark Jobs

2015-06-10 Thread Himanshu Mehra
Hi Sam, You might want to have a look at the Spark UI, which runs by default at localhost:8080. You can also configure Apache Ganglia to monitor your cluster resources. Thank you Regards Himanshu Mehra -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Mon

Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
Hi Xiaohan, Would you please try setting "spark.sql.thriftServer.incrementalCollect" to "true" and increasing the driver memory size? In this way, HiveThriftServer2 uses RDD.toLocalIterator rather than RDD.collect().iterator to return the result set. The key difference is that RDD.toLocalIterator
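For example, a minimal sketch of those two knobs when starting the Thrift server (the memory size is illustrative):

    ./sbin/start-thriftserver.sh \
      --driver-memory 8g \
      --conf spark.sql.thriftServer.incrementalCollect=true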

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Jeroen Vlek
Hi Josh, Thank you for your effort. Looking at your code, I feel that mine is semantically the same, except written in Java. The dependencies in the pom.xml all have the scope provided. The job is submitted as follows: $ rm spark.log && MASTER=spark://maprdemo:7077 /opt/mapr/spark/spark-1.3.1/

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread Cheng Lian
Hm, this is a common confusion... Although the variable name is `sqlContext` in the Spark shell, it's actually a `HiveContext`, which extends `SQLContext` and has the ability to communicate with the Hive metastore. So your program needs to instantiate an `org.apache.spark.sql.hive.HiveContext` instead.
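A minimal sketch of doing that in a standalone program (the app name and table name are placeholders; in the spark-shell this is not needed because the provided sqlContext is already a HiveContext):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-example"))
    val sqlContext = new HiveContext(sc)   // picks up hive-site.xml from the classpath
    sqlContext.sql("SELECT count(*) FROM some_hive_table").show()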

Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
Would you please also provide the executor stdout and stderr output? Thanks. Cheng On 6/10/15 4:23 PM, 姜超才 wrote: Hi Lian, Thanks for your quick response. I forgot to mention that I have tuned the driver memory from 2G to 4G, which seems to give a minor improvement; the way it dies when fetching 1,400,000 rows chang

Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
Also, if the data isn't confidential, would you mind sending me a compressed copy (don't cc user@spark.apache.org)? Cheng On 6/10/15 4:23 PM, 姜超才 wrote: Hi Lian, Thanks for your quick response. I forgot to mention that I have tuned the driver memory from 2G to 4G, which seems to give a minor improvement; the

Re: Re: Re: Re: How to decrease the time of storing block in memory

2015-06-10 Thread luohui20001
Thanks Ak, thanks for your idea. I had tried using Spark to do what the shell did. However, it is not as fast as I expected and not very easy. Thanks & Best regards! San.Luo - Original Message - From: Akhil Das To: 罗辉 Cc: user Subject: Re: Re: Re: How to d

Re: append file on hdfs

2015-06-10 Thread Pa Rö
Hi, I have an idea to solve my problem: I want to write one file for each Spark partition, but I don't know how to get the actual partition suffix/ID in my call function. points.foreachPartition( new VoidFunction>>() { private static final long serialVersionUID = -72108975
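One way to get the partition ID inside the closure is TaskContext; a minimal sketch in Scala (the output path and write logic are placeholders; note that saveAsTextFile already writes one part-file per partition if that is all you need):

    import org.apache.spark.TaskContext

    // points: an RDD of records; each partition writes to its own file
    points.foreachPartition { iter =>
      val partitionId = TaskContext.get().partitionId()
      val path = s"hdfs:///user/someone/output/part-$partitionId"  // placeholder path
      // open a stream for `path` via the Hadoop FileSystem API and write iter's records to it
      iter.foreach(record => { /* write record */ })
    }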

DataFrame.save with SaveMode.Overwrite produces 3x higher data size

2015-06-10 Thread bkapukaranov
Hi, Kudos on Spark 1.3.x, it's a great release - loving DataFrames! One thing I noticed after upgrading is that if I use the generic DataFrame save function with Overwrite mode and a "parquet" source, it produces a much larger output parquet file. Source json data: ~500GB Originally saved parquet:

Fwd: Re: How to keep a SQLContext instance alive in a spark streaming application's life cycle?

2015-06-10 Thread Sergio Jiménez Barrio
Note: CCing user@spark.apache.org First, you must check if the RDD is empty: messages.foreachRDD { rdd => if (!rdd.isEmpty) { }} Now, you can obtain the instance of a SQLContext: val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
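A minimal sketch of such a singleton, following the lazily-instantiated pattern from the Spark Streaming programming guide:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    // Lazily instantiated singleton so the same SQLContext is reused across batches
    object SQLContextSingleton {
      @transient private var instance: SQLContext = _
      def getInstance(sparkContext: SparkContext): SQLContext = {
        if (instance == null) {
          instance = new SQLContext(sparkContext)
        }
        instance
      }
    }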

cannot access port 4040

2015-06-10 Thread mrm
Hi, I am using Spark 1.3.1 standalone and I have a problem where my cluster is working fine, I can see port 8080 and check that my EC2 instances are fine, but I cannot access port 4040. I have tried sbin/stop-all.sh, sbin/stop-master.sh, and exiting the Spark context and restarting it, to no avail

Re: cannot access port 4040

2015-06-10 Thread Himanshu Mehra
Hi Maria, Have you tried port 8080 as well? Thanks Himanshu -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cannot-access-port-4040-tp23248p23249.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: DataFrame.save with SaveMode.Overwrite produces 3x higher data size

2015-06-10 Thread bkapukaranov
Additionally, if I delete the parquet and recreate it using the same generic save function with 1000 partitions and overwrite, the size is again correct. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-save-with-SaveMode-Overwrite-produces-3x-higher

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
4040 is your driver's port, so you need to have an application running. Log in to your cluster, start a spark-shell, and try accessing 4040. Thanks Best Regards On Wed, Jun 10, 2015 at 3:51 PM, mrm wrote: > Hi, > > I am using Spark 1.3.1 standalone and I have a problem where my cluster is > working fine, I ca

Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil, (Your reply does not appear in the mailing list but I received an email so I will reply here). I have an application running already in the shell using pyspark. I can see the application running on port 8080, but I cannot log into it through port 4040. It says "connection timed out" aft

Re: cannot access port 4040

2015-06-10 Thread mrm
Hi Akhil, Thanks for your reply! I still cannot see port 4040 on my machine when I type "master-ip-address:4040" in my browser. I have tried this command: netstat -nat | grep 4040 and it returns this: tcp 0 0 :::4040 :::* LISTEN Logging int

Re: Re: Re: Re: Re: Met OOM when fetching more than 1,000,000 rows.

2015-06-10 Thread Cheng Lian
Hm, I tried the following with 0.13.1 and 0.13.0 on my laptop (don't have access to a cluster for now) but couldn't reproduce this issue. Your program just executed smoothly... :-/ Command line used to start the Thrift server: ./sbin/start-thriftserver.sh --driver-memory 4g --master local

Re: cannot access port 4040

2015-06-10 Thread Akhil Das
Opening port 4040 manually, or SSH tunneling (ssh -L 4040:127.0.0.1:4040 master-ip, then open localhost:4040 in your browser), will work for you then. Thanks Best Regards On Wed, Jun 10, 2015 at 5:10 PM, mrm wrote: > Hi Akhil, > > Thanks for your reply! I still cannot see port 4040 on my machine

spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
Both the driver (ApplicationMaster running on Hadoop) and the container (CoarseGrainedExecutorBackend) end up exceeding my 25GB allocation. My code is something like sc.binaryFiles(... 1mil xml files).flatMap( ... extract some domain classes, not many though as each xml usually has zero results).red

[Spark 1.3.1 on YARN on EMR] Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient

2015-06-10 Thread Roberto Coluccio
Hi! I'm struggling with an issue with Spark 1.3.1 running on YARN, running on an AWS EMR cluster. The cluster is based on AMI 3.7.0 (hence Amazon Linux 2015.03, Hive 0.13 already installed and configured on the cluster, Hadoop 2.4, etc...). I make use of the AWS emr-bootstrap-action "install-spa

learning rpc about spark core source code

2015-06-10 Thread huangzheng
Hi all, recently I have been studying the Spark 1.3 core source code, but I can't understand the RPC layer. How do the client driver, worker and master communicate? There are some Scala files such as RpcCallContext, RpcEndpointRef, RpcEndpoint, RpcEnv. Are there any blogs on the Spark core RPC module?

Re: BigDecimal problem in parquet file

2015-06-10 Thread Bipin Nag
Hi Cheng, I am using the Spark 1.3.1 binary available for Hadoop 2.6. I am loading an existing parquet file, then repartitioning and saving it. Doing this gives the error. The code for this doesn't look like it's causing the problem. I have a feeling the source, the existing parquet file, is the culprit. I create

Split RDD based on criteria

2015-06-10 Thread dgoldenberg
Hi, I'm gathering that the typical approach for splitting an RDD is to apply several filters to it. rdd1 = rdd.filter(func1); rdd2 = rdd.filter(func2); ... Is there/should there be a way to create 'buckets' like these in one go? List rddList = rdd.filter(func1, func2, ..., funcN) Another angle

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Josh Mahonin
Hi Jeroen, Rather than bundle the Phoenix client JAR with your app, are you able to include it in a static location either in the SPARK_CLASSPATH, or set the conf values below (I use SPARK_CLASSPATH myself, though it's deprecated): spark.driver.extraClassPath spark.executor.extraClassPath Jo
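A minimal sketch of both options (the jar path is a placeholder):

    # Option 1: SPARK_CLASSPATH (deprecated, but still honoured), e.g. in spark-env.sh
    export SPARK_CLASSPATH=/opt/phoenix/phoenix-client.jar

    # Option 2: in spark-defaults.conf (or passed via --conf to spark-submit)
    spark.driver.extraClassPath    /opt/phoenix/phoenix-client.jar
    spark.executor.extraClassPath  /opt/phoenix/phoenix-client.jar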

Re: learning rpc about spark core source code

2015-06-10 Thread Shixiong Zhu
The new RPC interface is an internal module added in 1.4. It should not exist in 1.3. Where did you find it? For the communication between driver, worker and master, Spark still uses Akka. There is a pending PR to update them: https://github.com/apache/spark/pull/5392 Do you mean the communicati

Spark standalone mode and kerberized cluster

2015-06-10 Thread kazeborja
Hello all. I've been reading some old mails and noticed that the use of Kerberos in a standalone cluster was not supported. Is this still the case? Thanks. Borja. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-standalone-mode-and-kerberized-cluster-t

Re: PostgreSQL JDBC Classpath Issue

2015-06-10 Thread shahab
Hi George, I have the same issue; did you manage to find a solution? best, /Shahab On Wed, May 13, 2015 at 9:21 PM, George Adams wrote: > Hey all, I seem to be having an issue with the PostgreSQL JDBC jar on my > classpath. I’ve outlined the issue on Stack Overflow ( > http://stackoverflow.com/questi

Re: PostgreSQL JDBC Classpath Issue

2015-06-10 Thread Cheng Lian
Michael had answered this question in the SO thread http://stackoverflow.com/a/30226336 Cheng On 6/10/15 9:24 PM, shahab wrote: Hi George, I have the same issue; did you manage to find a solution? best, /Shahab On Wed, May 13, 2015 at 9:21 PM, George Adams wrote

Re: append file on hdfs

2015-06-10 Thread Richard Marscher
Hi, if you now want to write 1 file per partition, that's actually built into Spark as saveAsTextFile(path), documented as: Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toSt

Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Is it possible to configure Spark to do all of its shuffling FULLY in memory (given that I have enough memory to store all the data)?

Re: Determining number of executors within RDD

2015-06-10 Thread maxdml
Note that this property is only available for YARN -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Determining-number-of-executors-within-RDD-tp15554p23256.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Fully in-memory shuffles

2015-06-10 Thread Josh Rosen
There's a discussion of this at https://github.com/apache/spark/pull/5403 On Wed, Jun 10, 2015 at 7:08 AM, Corey Nolet wrote: > Is it possible to configure Spark to do all of its shuffling FULLY in > memory (given that I have enough memory to store all the data)? > > > >

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
I am profiling the driver. It currently has 564MB of strings, which might be the 1mil file names. But it also has 2.34 GB of long[]! That's so far; it is still running. What are those long[] used for? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-use

Re: Cassandra Submit

2015-06-10 Thread Yana Kadiyska
Do you build via maven or sbt? How do you submit your application -- do you use local, standalone or mesos/yarn? Your jars as you originally listed them seem right to me. Try this, from your ${SPARK_HOME}: SPARK_CLASSPATH=spark-cassandra-connector_2.10-1.3.0-M1.jar:guava-jdk5-14.0.1.jar:cassandra-

Re: which database for gene alignment data ?

2015-06-10 Thread Frank Austin Nothaft
Hi Roni, These are exposed as public APIs. If you want, you can run them inside of the adam-shell (which is just a wrapper for the spark shell, but with the ADAM libraries on the class path). > Also , I need to save all my intermediate data. Seems like ADAM stores data > in Parquet on HDFS. >

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-10 Thread Sourav Mazumder
Here is the physical plan. Also attaching the executor log from one of the executors. You can see that memory consumption slowly rises until it reaches around 10.5 GB. It stays there for around 5 minutes (06-50-36 to 06-55-00), and then this executor gets killed. ExecutorMemory co

Re: Spark SQL with Thrift Server is very very slow and finally failing

2015-06-10 Thread Cheng Lian
Seems that Spark SQL can't retrieve table size statistics and doesn't enable broadcast join in your case. Would you please try `ANALYZE TABLE ` for both tables to generate table statistics information? Cheng On 6/10/15 10:26 PM, Sourav Mazumder wrote: Here is the physical plan. Also attach
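For example, the statistics needed for broadcast joins can usually be gathered with the NOSCAN variant (the table names are placeholders):

    ANALYZE TABLE small_dim_table COMPUTE STATISTICS noscan;
    ANALYZE TABLE large_fact_table COMPUTE STATISTICS noscan;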

Re: Linear Regression with SGD

2015-06-10 Thread Debasish Das
It's always better to use a quasi-Newton solver if the runtime and problem scale permit, as there are guarantees on optimization... OWLQN and BFGS are both quasi-Newton. Most single-node code bases will run quasi-Newton solves. If you are using SGD, it is better to use AdaDelta/AdaGrad or similar tri

Re: Saving compressed textFiles from a DStream in Scala

2015-06-10 Thread Bob Corsaro
Thanks Akhil. For posterity, I ended up with: https://gist.github.com/dokipen/aa07f351a970fe54fcff I couldn't get rddToFilename() to work, but its impl was pretty simple. I'm a poet but I don't know it. On Tue, Jun 9, 2015 at 3:10 AM Akhil Das wrote: > like this? > > myDStream.foreachRD

Spark not working on windows 7 64 bit

2015-06-10 Thread Eran Medan
I'm at a roadblock trying to understand why Spark doesn't work for a colleague of mine on his Windows 7 laptop. I have pretty much the same setup and everything works fine. I googled the error message and didn't get anything that resolved it. Here is the exception message (after running spark 1

PYTHONPATH on worker nodes

2015-06-10 Thread Bob Corsaro
I'm setting PYTHONPATH before calling pyspark, but the worker nodes aren't inheriting it. I've tried looking through the code and it appears that it should be, but I can't find the bug. Here's an example; what am I doing wrong? https://gist.github.com/dokipen/84c4e4a89fddf702fdf1

Re: Issue running Spark 1.4 on Yarn

2015-06-10 Thread matvey14
Hi nsalian, For some reason the rest of this thread isn't showing up here. The NodeManager isn't busy. I'll copy/paste, the details are in there. I've tried running a Hadoop app pointing to the same queue. Same

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Kostas Kougios
After some time the driver accumulated 6.67GB of long[] . The executor mem usage so far is low. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-uses-too-much-memory-maybe-binaryFiles-with-more-than-1-million-files-in-HDFS-groupBy-or-reduc-tp23253p23259

RE: Join between DStream and Periodically-Changing-RDD

2015-06-10 Thread Evo Eftimov
It depends on how big the Batch RDD requiring reloading is. Reloading it for EVERY single DStream RDD would slow down the stream processing in line with the total time required to reload the Batch RDD ….. But if the Batch RDD is not that big, then that might not be an issue, especially in t

spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-10 Thread Ashish Nigam
Hi, If checkpoint data is already present in HDFS, the driver fails to load as it is performing a lookup on the previous application directory. As that folder already exists, it fails to start the context. The failed job's application id was application_1432284018452_0635 and the job was performing a lookup on application

Re: spark uses too much memory maybe (binaryFiles() with more than 1 million files in HDFS), groupBy or reduceByKey()

2015-06-10 Thread Marcelo Vanzin
So, I don't have an explicit solution to your problem, but... On Wed, Jun 10, 2015 at 7:13 AM, Kostas Kougios < kostas.koug...@googlemail.com> wrote: > I am profiling the driver. It currently has 564MB of strings which might be > the 1mil file names. But also it has 2.34 GB of long[] ! That's so

Re: spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-10 Thread Akhil Das
Delete the checkpoint directory; you might have modified your driver program. Thanks Best Regards On Wed, Jun 10, 2015 at 9:44 PM, Ashish Nigam wrote: > Hi, > If checkpoint data is already present in HDFS, the driver fails to load as it > is performing a lookup on the previous application directory. As t

How to build spark with Hive 1.x ?

2015-06-10 Thread Neal Yin
I am trying to build the Spark 1.3 branch with Hive 1.1.0: mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Phive-0.13.1 -Dhive.version=1.1.0 -Dhive.version.short=1.1.0 -DskipTests clean package I got the following error: Failed to execute goal on project spark-hive_2.10: Coul

Re: Spark not working on windows 7 64 bit

2015-06-10 Thread Jörn Franke
You may compare the c:\windows\system32\drivers\etc\hosts files to see if they are configured similarly. On Wed, Jun 10, 2015 at 17:16, Eran Medan wrote: > I'm at a roadblock trying to understand why Spark doesn't work for a > colleague of mine on his Windows 7 laptop. > I have pretty much the same setup a

Re: How to build spark with Hive 1.x ?

2015-06-10 Thread Ted Yu
Hive version 1.x is currently not supported. Cheers On Wed, Jun 10, 2015 at 9:16 AM, Neal Yin wrote: > I am trying to build spark 1.3 branch with Hive 1.1.0. > > mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive > -Phive-thriftserver -Phive-0.13.1 -Dhive.version=1.1.0 > –Dhive.version.sho

Re: spark streaming - checkpointing - looking at old application directory and failure to start streaming context

2015-06-10 Thread Ashish Nigam
I did not change the driver program. I just shut down the context and started it again. BTW, I see a ticket already open in unassigned state, SPARK-6892, that talks about this issue. Is this a known issue? Also, any workarounds? On Wed, Jun 10,

Re: PYTHONPATH on worker nodes

2015-06-10 Thread Marcelo Vanzin
I don't think it's propagated automatically. Try this: spark-submit --conf "spark.executorEnv.PYTHONPATH=..." ... On Wed, Jun 10, 2015 at 8:15 AM, Bob Corsaro wrote: > I'm setting PYTHONPATH before calling pyspark, but the worker nodes aren't > inheriting it. I've tried looking through the cod

Re: Split RDD based on criteria

2015-06-10 Thread Chad Urso McDaniel
While it does feel like a filter is what you want to do, a common way to handle this is to map to different keys. Using your rddList example it becomes like this (Scala style): --- val rddSplit: RDD[(Int, Any)] = rdd.map(x => (createKey(x), x)) val rddBuckets: RDD[(Int, Iterable[Any])] = rddSpl
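A fuller sketch of that approach (createKey is whatever function maps an element to its bucket; the Int criterion here is hypothetical):

    val rdd = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))            // any RDD
    def createKey(x: Int): Int = x % 3                         // hypothetical bucketing criterion
    val keyed   = rdd.map(x => (createKey(x), x))              // RDD[(Int, Int)]
    val buckets = keyed.groupByKey()                           // RDD[(Int, Iterable[Int])]
    // or, if separate RDDs are really required, fall back to one filter per bucket:
    val bucket0 = keyed.filter(_._1 == 0).values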

Re: Spark Maven Test error

2015-06-10 Thread Rick Moritz
Dear List, I'm trying to reference a lonely message to this list from March 25th (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Maven-Test-error-td22216.html), but I'm unsure this will thread properly. Sorry if it didn't work out. Anyway, using Spark 1.4.0-RC4 I run into the same issu

Re: Split RDD based on criteria

2015-06-10 Thread Sean Owen
No, but you can write a couple lines of code that do this. It's not optimized of course. This is actually a long and interesting side discussion, but I'm not sure how much it could be given that the computation is "pull" rather than "push"; there is no concept of one pass over the data resulting in

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread James Pirz
Thanks for your help! Switching to HiveContext fixed the issue. Just one side comment: in the documentation regarding Hive Tables and HiveContext, we see: // sc is an existing JavaSparkContext. HiveContext sqlContext =

Re: spark-submit does not use hive-site.xml

2015-06-10 Thread Cheng Lian
Thanks for pointing out the documentation error :) Opened https://github.com/apache/spark/pull/6749 to fix this. On 6/11/15 1:18 AM, James Pirz wrote: Thanks for your help ! Switching to HiveContext fixed the issue. Just one side comment: In the documentation regarding Hive Tables and HiveCont

Re: Issue running Spark 1.4 on Yarn

2015-06-10 Thread nsalian
Hi, Thanks for the added information; it helps add more context. Is that specific queue different from the others? FairScheduler.xml should have the information needed, or a separate allocations.xml if you have one. Something of this format: 1 mb,0vcores 9 mb,0vcores 50 0.1

RE: [SPARK-6330] 1.4.0/1.5.0 Bug to access S3 -- AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyI

2015-06-10 Thread Shuai Zheng
I have tried both cases (s3 and s3n, setting all possible parameters), and trust me, the same code works with 1.3.1, but not with 1.3.0, 1.4.0 or 1.5.0. I even used a plain project to test this, and used Maven to include all referenced libraries, but it gives me the error. I think everyone can easily

Re: Determining number of executors within RDD

2015-06-10 Thread maxdml
Actually this is somewhat confusing, for two reasons: - First, the option 'spark.executor.instances', which seems to be dealt with only in the case of YARN in the source code of SparkSubmit.scala, is also present in the conf/spark-env.sh file under the standalone section, which would indicate that i

Re: Determining number of executors within RDD

2015-06-10 Thread Evo Eftimov
Yes, I think it is ONE worker, ONE executor, as an executor is nothing but a JVM instance spawned by the worker. To run more executors, i.e. JVM instances, on the same physical cluster node, you need to run more than one worker on that node and then allocate only part of the sys resources to that worker/ex

Re: Determining number of executors within RDD

2015-06-10 Thread Sandy Ryza
On YARN, there is no concept of a Spark Worker. Multiple executors will be run per node without any effort required by the user, as long as all the executors fit within each node's resource limits. -Sandy On Wed, Jun 10, 2015 at 3:24 PM, Evo Eftimov wrote: > Yes i think it is ONE worker ONE e

Re: Determining number of executors within RDD

2015-06-10 Thread Evo Eftimov
We/I were discussing STANDALONE mode; besides, maxdml had already summarized what is available and possible under YARN. So let me recap: for standalone mode, if you need more than 1 executor per physical host, e.g. to partition its sys resources more finely (especially RAM per JVM instance), you nee
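For completeness, a minimal sketch of that standalone-mode setup in conf/spark-env.sh (the values are illustrative):

    # run two workers per node, each owning part of the node's resources
    export SPARK_WORKER_INSTANCES=2
    export SPARK_WORKER_CORES=8
    export SPARK_WORKER_MEMORY=16g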

Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Ashwin Shankar
All, I was wondering if any of you have solved this problem: I have pyspark (ipython mode) running on Docker talking to a YARN cluster (AM/executors are NOT running on Docker). When I start pyspark in the Docker container, it binds to port 49460. Once the app is submitted to YARN, the app (AM) o

Re: Efficient way to get top K values per key in (key, value) RDD?

2015-06-10 Thread erisa
Hi, I am a Spark newbie trying to solve the same problem, and have implemented the exact solution that sowen is suggesting. I am using priority queues to keep track of the top 25 sub_categories for each category, using the combineByKey function to do that. However I run into the fol
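A minimal sketch of the top-K-per-key idea using aggregateByKey and a bounded priority queue (K = 25 here; the (category, (subCategory, count)) layout is illustrative):

    import scala.collection.mutable

    val k = 25
    val pairs = sc.parallelize(Seq(("cat1", ("sub1", 10L)), ("cat1", ("sub2", 7L)), ("cat2", ("sub3", 3L))))
    // ordering chosen so the queue's head (the element dequeue removes) is the smallest count
    val ord = Ordering.by[(String, Long), Long](p => -p._2)

    val topK = pairs.aggregateByKey(mutable.PriorityQueue.empty[(String, Long)](ord))(
      (queue, value) => { queue.enqueue(value); if (queue.size > k) queue.dequeue(); queue },
      (q1, q2)       => { q1 ++= q2; while (q1.size > k) q1.dequeue(); q1 }
    ).mapValues(_.toList.sortBy(p => -p._2))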

Hive Custom Transform Scripts (read from stdin and print to stdout) in Spark

2015-06-10 Thread nishanthps
What is the best way to reuse Hive custom transform scripts written in Python, awk or C++, which process data from stdin and print to stdout, in Spark? These scripts typically use the Transform syntax in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform -- V
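Outside of Spark SQL, one plain-RDD option is RDD.pipe, which streams each partition's records through an external command over stdin/stdout; a minimal sketch (the paths and script are placeholders, and the script must be available on every worker):

    val input = sc.textFile("hdfs:///data/events")          // placeholder input
    val piped = input.pipe("/path/to/legacy_transform.py")  // one line in per stdin line, one element out per stdout line
    piped.saveAsTextFile("hdfs:///data/events_transformed") // placeholder output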

Re: Determining number of executors within RDD

2015-06-10 Thread Nishkam Ravi
This PR adds support for multiple executors per worker: https://github.com/apache/spark/pull/731 and should be available in 1.4. Thanks, Nishkam On Wed, Jun 10, 2015 at 1:35 PM, Evo Eftimov wrote: > We/i were discussing STANDALONE mode, besides maxdml had already > summarized what is available

Re: How to set KryoRegistrator class in spark-shell

2015-06-10 Thread bhomass
You need to register using spark-defaults.conf as explained here: https://books.google.com/books?id=WE_GBwAAQBAJ&pg=PA239&lpg=PA239&dq=spark+shell+register+kryo+serialization&source=bl&ots=vCxgEfz1-2&sig=dHU8FY81zVoBqYIJbCFuRwyFjAw&hl=en&sa=X&ved=0CEwQ6AEwB2oVChMIn_iujpCGxgIVDZmICh3kYADW#v=onepage&q=
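A minimal sketch of that configuration in spark-defaults.conf (the registrator class name is a placeholder; the same pairs can be set on a SparkConf):

    spark.serializer        org.apache.spark.serializer.KryoSerializer
    spark.kryo.registrator  com.example.MyKryoRegistrator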

Can't access Ganglia on EC2 Spark cluster

2015-06-10 Thread barmaley
Launching using spark-ec2 script results in: Setting up ganglia RSYNC'ing /etc/ganglia to slaves... <...> Shutting down GANGLIA gmond: [FAILED] Starting GANGLIA gmond:[ OK ] Shutting down GANGLIA gmond:

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
So with this... to help my understanding of Spark under the hood: is this statement correct? "When data needs to pass between multiple JVMs, a shuffle will *always* hit disk." On Wed, Jun 10, 2015 at 10:11 AM, Josh Rosen wrote: > There's a discussion of this at https://github.com/apache/spark/pu

Re: RDD of RDDs

2015-06-10 Thread ping yan
Thanks much for the detailed explanations. I suspected there were architectural reasons behind not supporting the notion of an RDD of RDDs, but my understanding of Spark or distributed computing in general is not deep enough to see them, so this really helps! I ended up going with List[RDD]. The collection of

NullPointerException with functions.rand()

2015-06-10 Thread Justin Yip
Hello, I am using 1.4.0 and found the following weird behavior. This case works fine: scala> sc.parallelize(Seq((1,2), (3, 100))).toDF.withColumn("index", rand(30)).show() +--+---+---+ |_1| _2| index| +--+---+---+ | 1| 2| 0.6662967911724369| | 3|100|

Re: Fully in-memory shuffles

2015-06-10 Thread Patrick Wendell
In many cases the shuffle will actually hit the OS buffer cache and not ever touch spinning disk if it is a size that is less than memory on the machine. - Patrick On Wed, Jun 10, 2015 at 5:06 PM, Corey Nolet wrote: > So with this... to help my understanding of Spark under the hood- > > Is this

Re: NullPointerException with functions.rand()

2015-06-10 Thread Ted Yu
Looks like the NPE came from this line: @transient protected lazy val rng = new XORShiftRandom(seed + TaskContext.get().partitionId()) Could TaskContext.get() be null ? On Wed, Jun 10, 2015 at 6:15 PM, Justin Yip wrote: > Hello, > > I am using 1.4.0 and found the following weird behavior. > >

Re: Fully in-memory shuffles

2015-06-10 Thread Corey Nolet
Ok so it is the case that small shuffles can be done without hitting any disk. Is this the same case for the aux shuffle service in yarn? Can that be done without hitting disk? On Wed, Jun 10, 2015 at 9:17 PM, Patrick Wendell wrote: > In many cases the shuffle will actually hit the OS buffer cac

Re: Fully in-memory shuffles

2015-06-10 Thread Davies Liu
If you have enough memory, you can put the temporary work directory on tmpfs (an in-memory file system). On Wed, Jun 10, 2015 at 8:43 PM, Corey Nolet wrote: > Ok so it is the case that small shuffles can be done without hitting any > disk. Is this the same case for the aux shuffle service in yarn?
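A minimal sketch of that setup (the mount point is illustrative; spark.local.dir controls where shuffle and spill files are written):

    # mount a RAM-backed filesystem, then point Spark's scratch space at it
    mount -t tmpfs -o size=64g tmpfs /mnt/ramdisk
    # in spark-defaults.conf (or SPARK_LOCAL_DIRS in spark-env.sh)
    spark.local.dir  /mnt/ramdisk/spark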

Re: Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Ashwin Shankar
Hi Eron, Thanks for your reply, but none of these options works for us. > > >1. use 'spark.driver.host' and 'spark.driver.port' setting to >stabilize the driver-side endpoint. (ref >) > > This unfortunately won't help

Re: Can't access Ganglia on EC2 Spark cluster

2015-06-10 Thread Akhil Das
Looks like the libphp version is 5.6 now; which version of Spark are you using? Thanks Best Regards On Thu, Jun 11, 2015 at 3:46 AM, barmaley wrote: > Launching using the spark-ec2 script results in: > > Setting up ganglia > RSYNC'ing /etc/ganglia to slaves... > <...> > Shutting down GANGLIA gmond:

Re: Spark standalone mode and kerberized cluster

2015-06-10 Thread Akhil Das
This might help http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.2.4/Apache_Spark_Quickstart_v224/content/ch_installing-kerb-spark-quickstart.html Thanks Best Regards On Wed, Jun 10, 2015 at 6:49 PM, kazeborja wrote: > Hello all. > > I've been reading some old mails and notice that the use o

Re: Apache Phoenix (4.3.1 and 4.4.0-HBase-0.98) on Spark 1.3.1 ClassNotFoundException

2015-06-10 Thread Jeroen Vlek
Hi Josh, That worked! Thank you so much! (I can't believe it was something so obvious ;) ) If you care about such a thing you could answer my question here for bounty: http://stackoverflow.com/questions/30639659/apache-phoenix-4-3-1-and-4-4-0-hbase-0-98-on-spark-1-3-1-classnotfoundexceptio Hav

how to deal with continued records

2015-06-10 Thread Zhang Jiaqiang
Hello, I have a large CSV file in which continued records (with the same RecordID) are contextually related; I should treat these continued records as ONE complete record. Also, the RecordID will be reset to 1 at some point when the CSV dumper system thinks it's necessary. I'd like to get some sugg