Re: Spark executor memory information

2015-07-14 Thread Akhil Das
1. Yes open up the webui running on 8080 to see the memory/cores allocated to your workers, and open up the ui running on 4040 and click on the Executor tab to see the memory allocated for the executor. 2. mllib codes can be found over here and s

Re: hive-site.xml spark1.3

2015-07-14 Thread Akhil Das
Try adding it in your SPARK_CLASSPATH inside conf/spark-env.sh file. Thanks Best Regards On Tue, Jul 14, 2015 at 7:05 AM, Jerrick Hoang wrote: > Hi all, > > I'm having conf/hive-site.xml pointing to my Hive metastore but sparksql > CLI doesn't pick it up. (copying the same conf/ files to spark1

RE: Share RDD from SparkR and another application

2015-07-14 Thread Sun, Rui
Hi, hari, I don't think job-server can work with SparkR (also pySpark). It seems it would be technically possible but needs support from job-server and SparkR(also pySpark), which doesn't exist yet. But there may be some in-direct ways of sharing RDDs between SparkR and an application. For exa

Re: Does Spark Streaming support streaming from a database table?

2015-07-14 Thread Akhil Das
Why not add a trigger to your database table and whenever its updated push the changes to kafka etc and use normal sparkstreaming? You can also write a receiver based architecture for this, but that will be a bit time consuming.
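A minimal sketch of the consuming side of that suggestion, assuming the trigger publishes change events to a Kafka topic; the broker address and topic name are placeholders:
```
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object TableChangeStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("table-change-stream")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Hypothetical topic/broker; the DB trigger would publish row changes here.
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val topics = Set("table_changes")

    val changes = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    changes.map(_._2).print()  // each record's value is one change event

    ssc.start()
    ssc.awaitTermination()
  }
}
```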

Re: Spark Intro

2015-07-14 Thread Akhil Das
This is where you can get started https://spark.apache.org/docs/latest/sql-programming-guide.html Thanks Best Regards On Mon, Jul 13, 2015 at 3:54 PM, vinod kumar wrote: > > Hi Everyone, > > I am developing application which handles bulk of data around > millions(This may vary as per user's req

Re: Standalone mode connection failure from worker node to master

2015-07-14 Thread sivarani
I am also facing the same issue, anyone figured it? Please help -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Standalone-mode-connection-failure-from-worker-node-to-master-tp23101p23816.html Sent from the Apache Spark User List mailing list archive at Nabb

Re: Standalone mode connection failure from worker node to master

2015-07-14 Thread Akhil Das
Can you paste your conf/spark-env.sh file? Put SPARK_MASTER_IP as the master machine's host name in spark-env.sh file. Also add your slaves hostnames into conf/slaves file and do a sbin/start-all.sh Thanks Best Regards On Tue, Jul 14, 2015 at 1:26 PM, sivarani wrote: > I am also facing the same

spark submit configuration on yarn

2015-07-14 Thread Pa Rö
hello community, i want to run my spark app on a cluster (cloudera 5.4.4) with 3 nodes (one pc has an i7 8-core with 16GB RAM). now i want to submit my spark job on yarn (20GB RAM). my script to submit the job is currently the following: export HADOOP_CONF_DIR=/etc/hadoop/conf/ ./spark-1.3.0-bin-hadoop2.4/b

Re: Does Spark Streaming support streaming from a database table?

2015-07-14 Thread ayan guha
Hi At this moment we have the same requirement. Unfortunately, database owners will not be able to push to a msg queue but they have enabled Oracle CDC which synchronously update a replica of production DB. Our task will be query the replica and create msg streams to Kinesis. There is already an

Re: Benchmark results between Flink and Spark

2015-07-14 Thread Jerry Lam
FYI, another benchmark: http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html quote: "I have observed a lot of fetch failures while running Spark, which results in many restarted tasks and, therefore, takes the longest time. I suspect that executors are incapable of s

Udf's in spark

2015-07-14 Thread Ravisankar Mani
Hi Everyone, As mentioned in the Spark SQL programming guide, Spark SQL supports Hive UDFs. I have built the UDFs in the Hive metastore. They work perfectly over a Hive connection, but not in Spark ("java.lang.RuntimeException: Couldn't find function DATE_FORMAT"). Could you please help how t
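One possible workaround while the Hive function cannot be resolved is to register an equivalent UDF directly with Spark SQL. A minimal sketch, assuming an existing SparkContext `sc`; the table name, column name, and date formats are hypothetical:
```
import java.text.SimpleDateFormat
import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)

// Hypothetical stand-in for Hive's DATE_FORMAT, registered as a Spark SQL UDF.
sqlContext.udf.register("date_format_udf", (date: String, fmt: String) => {
  val parsed = new SimpleDateFormat("yyyy-MM-dd").parse(date)
  new SimpleDateFormat(fmt).format(parsed)
})

sqlContext.sql("SELECT date_format_udf(order_date, 'MM/dd/yyyy') FROM orders").show()
```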

Re: Including additional scala libraries in sparkR

2015-07-14 Thread Michal Haris
Ok thanks. It seems that --jars is not behaving as expected - getting class not found for even the most simple object from my lib. But anyways, I have to do at least a filter transformation before collecting the HBaseRDD into R so will have to go the route of using scala spark shell to transform an

Re: Spark Intro

2015-07-14 Thread vinod kumar
Hi Akhil, Is my choice to switch to Spark a good one? I don't have enough information regarding the limitations and working environment of Spark. I tried Spark SQL but it seems to return data more slowly than MSSQL. (I have tested with data which has 4 records.) On Tue, Jul 14, 2015 at

Re: Spark Intro

2015-07-14 Thread Akhil Das
It might take some time to understand the echo system. I'm not sure about what kind of environment you are having (like #cores, Memory etc.), To start with, you can basically use a jdbc connector or dump your data as csv and load it into Spark and query it. You get the advantage of caching if you h
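A minimal sketch of the JDBC route in Spark 1.4, assuming an existing SparkContext `sc`; the connection URL, credentials, and table name are placeholders, and the database's JDBC driver jar would need to be on the classpath (e.g. via --jars):
```
import java.util.Properties
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Hypothetical connection details for a SQL Server-style source.
val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")

val df = sqlContext.read.jdbc("jdbc:sqlserver://dbhost:1433;databaseName=mydb", "mytable", props)

df.cache()                        // caching pays off across repeated queries
df.registerTempTable("mytable")
sqlContext.sql("SELECT COUNT(*) FROM mytable").show()
```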

java.lang.IllegalStateException: unread block data

2015-07-14 Thread Arthur Chan
Hi, I use Spark 1.4. When saving the model to HDFS, I got an error. Please help! Regards. My scala command: sc.makeRDD(model.clusterCenters,10).saveAsObjectFile("/tmp/tweets/model") The error log: 15/07/14 18:27:40 INFO SequenceFileRDDFunctions: Saving as sequence file of type (NullWritable,Byt

Re: Problems after upgrading to spark 1.4.0

2015-07-14 Thread Luis Ángel Vicente Sánchez
I have just restarted the job and it doesn't seem that the shutdown hook is executed. I have attached to this email the log from the driver. It seems that the slave are not accepting the tasks... but we haven't change anything on our mesos cluster, we have only upgrade one job to spark 1.4; is ther

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Akhil Das
Look in the worker logs and see whats going on. Thanks Best Regards On Tue, Jul 14, 2015 at 4:02 PM, Arthur Chan wrote: > Hi, > > I use Spark 1.4. When saving the model to HDFS, I got error? > > Please help! > Regards > > > > my scala command: > sc.makeRDD(model.clusterCenters,10).saveAsObject

RE: Including additional scala libraries in sparkR

2015-07-14 Thread Sun, Rui
Could you give more details about the mis-behavior of --jars for SparkR? maybe it's a bug. From: Michal Haris [michal.ha...@visualdna.com] Sent: Tuesday, July 14, 2015 5:31 PM To: Sun, Rui Cc: Michal Haris; user@spark.apache.org Subject: Re: Including additional sc

Re: Research ideas using spark

2015-07-14 Thread Daniel Darabos
Hi Shahid, To be honest I think this question is better suited for Stack Overflow than for a PhD thesis. On Tue, Jul 14, 2015 at 7:42 AM, shahid ashraf wrote: > hi > > I have a 10 node cluster i loaded the data onto hdfs, so the no. of > partitions i get is 9. I am running a spark application ,

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-14 Thread Hafsa Asif
I am still waiting for an answer. I want to know how to properly close everything related to Spark in a Java standalone app. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-solve-ThreadException-in-Apache-Spark-standalone-Java-Application-tp23675p23

Re: Basic Spark SQL question

2015-07-14 Thread Ron Gonzalez
Cool thanks. Will take a look... Sent from my iPhone > On Jul 13, 2015, at 6:40 PM, Michael Armbrust wrote: > > I'd look at the JDBC server (a long running yarn job you can submit queries > too) > > https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-se

About extra memory on yarn mode

2015-07-14 Thread Sea
Hi all: I have a question about why Spark on YARN needs extra memory. I applied for 10 executors with executor memory 6g, and I find that it allocates 1g more per executor, 7g in total per executor. I tried to set spark.yarn.executor.memoryOverhead, but it did not help. 1g per executor is too muc

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Arthur Chan
Hi, Below is the log form the worker. 15/07/14 17:18:56 ERROR FileAppender: Error writing stream to file /spark/app-20150714171703-0004/5/stderr java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) at java.io.BufferedInputStream.read1(Buf

No. of Task vs No. of Executors

2015-07-14 Thread shahid
hi I have a 10 node cluster i loaded the data onto hdfs, so the no. of partitions i get is 9. I am running a spark application , it gets stuck on one of tasks, looking at the UI it seems application is not using all nodes to do calculations. attached is the screen shot of tasks, it seems tasks a

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Akhil Das
Someone else also reported this error with spark 1.4.0 Thanks Best Regards On Tue, Jul 14, 2015 at 6:57 PM, Arthur Chan wrote: > Hi, Below is the log form the worker. > > > 15/07/14 17:18:56 ERROR FileAppender: Error writing stream to file > /spark/app-20150714171703-0004/5/stderr > > java.io.I

Re: Does Spark Streaming support streaming from a database table?

2015-07-14 Thread focus
Hi, In our case, we have some data stored in an Oracle database table, and new records will be added into this table. We need to analyse new records to calculate some values continuously, so we wrote a program to monitor the table every minute. Because every record has an increasing unique ID num
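A minimal sketch of one polling pass using JdbcRDD, under the assumption that the replica table has an increasing numeric id column; the connection details, table/column names, and the downstream push are placeholders, and `sc` is an existing SparkContext:
```
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Bounds for this pass: rows with ids in (lastSeenId, latestId].
val lastSeenId = 1000000L      // highest id processed in the previous pass
val latestId   = 1005000L      // current max id, e.g. from a separate SELECT MAX(id)

val newRows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:oracle:thin:@dbhost:1521:orcl", "user", "pass"),
  "SELECT id, payload FROM replica_table WHERE id >= ? AND id <= ?",
  lastSeenId + 1, latestId, 4,                   // id range and number of partitions
  (rs: ResultSet) => (rs.getLong("id"), rs.getString("payload")))

newRows.foreach { case (id, payload) =>
  // push the record to Kinesis (or any downstream sink) here
}
```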

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-14 Thread Yana Kadiyska
Have you seen this SO thread: http://stackoverflow.com/questions/13471519/running-daemon-with-exec-maven-plugin This seems to be more related to the plugin than Spark, looking at the stack trace On Tue, Jul 14, 2015 at 8:11 AM, Hafsa Asif wrote: > I m still looking forward for the answer. I

Re: Share RDD from SparkR and another application

2015-07-14 Thread harirajaram
I appreciate your reply. Yes, you are right about putting it in Parquet etc. and reading from another app; I would rather use spark-jobserver or the IBM kernel to achieve the same if it is not SparkR, as it gives more flexibility/scalability. Anyway, I have found a way to run R for my poc from my existing app us

Re: Share RDD from SparkR and another application

2015-07-14 Thread harirajaram
A small correction when I typed it is not RDDBackend it is RBackend,sorry. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Share-RDD-from-SparkR-and-another-application-tp23795p23828.html Sent from the Apache Spark User List mailing list archive at Nabble.co

Re: No. of Task vs No. of Executors

2015-07-14 Thread ayan guha
Hi As you can see, Spark has taken data locality into consideration and thus scheduled all tasks as node local. It is because spark could run task on a node where data is present, so spark went ahead and scheduled the tasks. It is actually good for reading. If you really want to fan out processing

Re: Few basic spark questions

2015-07-14 Thread Debasish Das
What do you need in sparkR that mllib / ml don't havemost of the basic analysis that you need on stream can be done through mllib components... On Jul 13, 2015 2:35 PM, "Feynman Liang" wrote: > Sorry; I think I may have used poor wording. SparkR will let you use R to > analyze the data, but

correct Scala Imports for creating DFs from RDDs?

2015-07-14 Thread ashwang168
Hello! I am currently using Spark 1.4.0, scala 2.10.4, and sbt 0.13.8 to try and create a jar file from a scala file (attached above) and run it using spark-submit. I am also using Hive, Hadoop 2.6.0-cdh5.4.0 which has the files that I'm trying to read in. Currently I am very confused about how t

Re: Spark Intro

2015-07-14 Thread Hafsa Asif
Hi, I was also in the same situation as we were using MySQL. Let me give some clarifications: 1. Spark provides a great methodology for big data analysis. So, if you want to make your system more analytical and want deep, prepared analytical methods to analyze your data, then it's a very good option.

Re: About extra memory on yarn mode

2015-07-14 Thread Jong Wook Kim
executor.memory only sets the maximum heap size of executor and the JVM needs non-heap memory to store class metadata, interned strings and other native overheads coming from networking libraries, off-heap storage levels, etc. These are (of course) legitimate usage of resources and you'll have t
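A minimal configuration sketch showing where the heap size and the YARN overhead allowance are set; the values are illustrative only, and YARN typically rounds each container up to a multiple of yarn.scheduler.minimum-allocation-mb on top of this:
```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("overhead-example")
  .set("spark.executor.memory", "6g")                // executor JVM heap
  .set("spark.yarn.executor.memoryOverhead", "512")  // off-heap allowance, in MB
```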

Re: Including additional scala libraries in sparkR

2015-07-14 Thread Shivaram Venkataraman
There was a fix for `--jars` that went into 1.4.1 https://github.com/apache/spark/commit/2579948bf5d89ac2d822ace605a6a4afce5258d6 Shivaram On Tue, Jul 14, 2015 at 4:18 AM, Sun, Rui wrote: > Could you give more details about the mis-behavior of --jars for SparkR? > maybe it's a bug. > __

Re: Create RDD from output of unix command

2015-07-14 Thread Hafsa Asif
Your question is very interesting. What I suggest is that you copy your output into some text file, read the text file in your code, and build an RDD from it. Just consider the wordcount example from Spark. I love this example with the Java client. Well, Spark is an analytical engine and it has a slogan to analyze big big data s
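A minimal sketch of that suggestion (plus an in-memory variant), assuming an existing SparkContext `sc`; the command and file path are placeholders:
```
import scala.sys.process._

// Variant 1: dump the command's output to a file, then read it as an RDD.
// (Works as-is in local mode; on a cluster the file would need to be on a shared filesystem.)
Seq("bash", "-c", "df -h > /tmp/cmd_output.txt").!
val fromFile = sc.textFile("/tmp/cmd_output.txt")

// Variant 2: capture the output on the driver and parallelize it directly.
val output = Seq("bash", "-c", "df -h").!!
val fromMemory = sc.parallelize(output.split("\n").toSeq)

println(fromMemory.count())
```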

Re: Spark application with a RESTful API

2015-07-14 Thread Hafsa Asif
I have almost the same case. I will tell you what I am actually doing, if it is according to your requirement, then I will love to help you. 1. my database is aerospike. I get data from it. 2. written standalone spark app (it does not run in standalone mode, but with simple java command or maven c

Re: Few basic spark questions

2015-07-14 Thread Oded Maimon
Hi, Thanks for all the help. I'm still missing something very basic. If I won't use SparkR, which doesn't support streaming (I will use MLlib instead, as Debasish suggested), and I have my Scala receiver working, how should the receiver save the data in memory? I do see the store method, so if I use it

Re: Create RDD from output of unix command

2015-07-14 Thread Igor Berman
Haven't you thought about Spark Streaming? There is a thread that could help: https://www.mail-archive.com/user%40spark.apache.org/msg30105.html On 14 July 2015 at 18:20, Hafsa Asif wrote: > Your question is very interesting. What I suggest is, that copy your output > in some text file. Read text f

How to maintain multiple JavaRDD created within another method like javaStreamRDD.forEachRDD

2015-07-14 Thread unk1102
I use Spark Streaming where messages read from Kafka topics are stored into JavaDStream this rdd contains actual data. Now after going through documentation and other help I have found we traverse JavaDStream using foreachRDD javaDStreamRdd.foreachRDD(new Function,Void>() { public void call(Ja

Re: correct Scala Imports for creating DFs from RDDs?

2015-07-14 Thread DW @ Gmail
You are mixing the 1.0.0 Spark SQL jar with Spark 1.4.0 jars in your build file Sent from my rotary phone. > On Jul 14, 2015, at 7:57 AM, ashwang168 wrote: > > Hello! > > I am currently using Spark 1.4.0, scala 2.10.4, and sbt 0.13.8 to try and > create a jar file from a scala file (attached
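A minimal build.sbt sketch keeping all Spark modules on the same version (assuming an sbt build; adjust the module list to what the job actually uses):
```
// build.sbt — keep spark-core, spark-sql, etc. pinned to the same release
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-hive" % "1.4.0" % "provided"
)
```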

ProcessBuilder in SparkLauncher is memory inefficient for launching new process

2015-07-14 Thread Elkhan Dadashov
Hi all, If you want to launch a Spark job from Java in a programmatic way, then you need to use SparkLauncher. SparkLauncher uses ProcessBuilder for creating the new process - Java seems to handle process creation in an inefficient way. "When you execute a process, you must first fork() and then exec(). F

Re: Spark application with a RESTful API

2015-07-14 Thread Debasish Das
How do you manage the spark context elastically when your load grows from 1000 users to 1 users ? On Tue, Jul 14, 2015 at 8:31 AM, Hafsa Asif wrote: > I have almost the same case. I will tell you what I am actually doing, if > it > is according to your requirement, then I will love to help y

Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Elkhan Dadashov
More particular example: I run pi.py Spark Python example in *yarn-cluster* mode (--master) through SparkLauncher in Java. While the program is running, these are the stats of how much memory each process takes: SparkSubmit process : 11.266 *gigabyte* Virtual Memory ApplicationMaster process: 2

spark on yarn

2015-07-14 Thread Shushant Arora
I am running spark application on yarn managed cluster. When I specify --executor-cores > 4 it fails to start the application. I am starting the app as spark-submit --class classname --num-executors 10 --executor-cores 5 --master masteradd jarname Exception in thread "main" org.apache.spark.Spar

Re: Finding moving average using Spark and Scala

2015-07-14 Thread Feynman Liang
If your rows may have NAs in them, I would process each column individually by first projecting the column ( map(x => x.nameOfColumn) ), filtering out the NAs, then running a summarizer over each column. Even if you have many rows, after summarizing you will only have a vector of length #columns.

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 9:57 AM, Shushant Arora wrote: > When I specify --executor-cores > 4 it fails to start the application. > When I give --executor-cores as 4 , it works fine. > Do you have any NM that advertises more than 4 available cores? Also, it's always worth it to check if there's a

Re: Few basic spark questions

2015-07-14 Thread Feynman Liang
You could implement the receiver as a Spark Streaming Receiver ; the data received would be available for any streaming applications which operate on DStreams (e.g. Streaming KMeans
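A minimal custom receiver sketch; `fetchNextRecord` is a hypothetical stand-in for however the data actually arrives from the source:
```
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Each record handed to store() is buffered by Spark Streaming into blocks
// that back the DStream's RDDs for the current batch interval.
class MyReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  override def onStart(): Unit = {
    new Thread("my-receiver") {
      override def run(): Unit = {
        while (!isStopped()) {
          val record = fetchNextRecord()   // hypothetical read from the source
          store(record)                    // Spark handles buffering/replication
        }
      }
    }.start()
  }

  override def onStop(): Unit = { /* close connections to the source here */ }

  private def fetchNextRecord(): String = "..."  // placeholder
}

// Usage: val lines = ssc.receiverStream(new MyReceiver())
```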

Re: spark on yarn

2015-07-14 Thread Shushant Arora
got the below exception in logs: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException): Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=5, maxVirtualCores=4 at org.apache.ha

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 10:40 AM, Shushant Arora wrote: > My understanding was --executor-cores(5 here) are maximum concurrent > tasks possible in an executor and --num-executors (10 here)are no of > executors or containers demanded by Application master/Spark driver program > to yarn RM. > --e

Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Is there an example about how to load data from a public S3 bucket in Python? I haven't found any. Thank you,

Re: spark on yarn

2015-07-14 Thread Shushant Arora
Is yarn.scheduler.maximum-allocation-vcores the setting for max vcores per container? Whats the setting for max limit of --num-executors ? On Tue, Jul 14, 2015 at 11:18 PM, Marcelo Vanzin wrote: > On Tue, Jul 14, 2015 at 10:40 AM, Shushant Arora < > shushantaror...@gmail.com> wrote: > >> My und

Re: spark on yarn

2015-07-14 Thread Ted Yu
Shushant : Please also see 'Debugging your Application' section of https://spark.apache.org/docs/latest/running-on-yarn.html On Tue, Jul 14, 2015 at 10:48 AM, Marcelo Vanzin wrote: > On Tue, Jul 14, 2015 at 10:40 AM, Shushant Arora < > shushantaror...@gmail.com> wrote: > >> My understanding wa

Re: Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 9:53 AM, Elkhan Dadashov wrote: > While the program is running, these are the stats of how much memory each > process takes: > > SparkSubmit process : 11.266 *gigabyte* Virtual Memory > > ApplicationMaster process: 2303480 *byte *Virtual Memory > That SparkSubmit number l

Re: spark on yarn

2015-07-14 Thread Ted Yu
Please see YARN-193 where 'yarn.scheduler.maximum-allocation-vcores' was introduced. See also YARN-3823 which changed default value. Cheers On Tue, Jul 14, 2015 at 10:55 AM, Shushant Arora wrote: > Is yarn.scheduler.maximum-allocation-vcores the setting for max vcores per > container? > > What

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 10:55 AM, Shushant Arora wrote: > Is yarn.scheduler.maximum-allocation-vcores the setting for max vcores per > container? > I don't remember YARN config names by heart, but that sounds promising. I'd look at the YARN documentation for details. > Whats the setting for ma

Re: spark on yarn

2015-07-14 Thread Shushant Arora
Ok thanks a lot! few more doubts : What happens in a streaming application say with spark-submit --class classname --num-executors 10 --executor-cores 4 --master masteradd jarname Will it allocate 10 containers throughout the life of streaming application on same nodes until any node failure hap

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 11:13 AM, Shushant Arora wrote: > spark-submit --class classname --num-executors 10 --executor-cores 4 > --master masteradd jarname > > Will it allocate 10 containers throughout the life of streaming > application on same nodes until any node failure happens and > It will

master compile broken for scala 2.11

2015-07-14 Thread Koert Kuipers
it works for scala 2.10, but for 2.11 i get: [ERROR] /home/koert/src/spark/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java:135: error: is not abstract and does not override abstract method minBy(Function1,Ordering) in TraversableOnce [ERROR] return new

RE: Master vs. Slave Nodes Clarification

2015-07-14 Thread Mohammed Guller
The master node does not have to be similar to the worker nodes. It can be a smaller machine. In case of C*, again you don't need to have C* on the master node. You need C* and Spark workers co-located. Master can be on one of the C* node or a non-C* node. Mohammed -Original Message-

Re: spark on yarn

2015-07-14 Thread Jong Wook Kim
it's probably because your YARN cluster has only 40 vCores available. Go to your resource manager and check if "VCores Total" and "Memory Total" exceeds what you have set. (40 cores and 5120 MB) If that looks fine, go to "Scheduler" page and find the queue on which your jobs run, and check the

Re: HDFS performances + unexpected death of executors.

2015-07-14 Thread Max Demoulin
I will try a fresh setup very soon. Actually, I tried to compile spark by myself, against hadoop 2.5.2, but I had the issue that I mentioned in this thread: http://apache-spark-user-list.1001560.n3.nabble.com/Master-doesn-t-start-no-logs-td23651.html I was wondering if maybe serialization/deseria

Re: ProcessBuilder in SparkLauncher is memory inefficient for launching new process

2015-07-14 Thread Jong Wook Kim
The article you've linked, is specific to an embedded system. the JVM built for that architecture (which the author didn't mention) might not be as stable and well-supported as HotSpot. ProcessBuilder is a stable Java API and despite somewhat limited functionality it is the standard method to l

Re: How to maintain multiple JavaRDD created within another method like javaStreamRDD.forEachRDD

2015-07-14 Thread Jong Wook Kim
Your question is not very clear, but from what I understand, you want to deal with a stream of MyTable that has parsed records from your Kafka topics. What you need is JavaDStream, and you can use transform()
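A sketch of that idea (shown in Scala for brevity; JavaDStream exposes the same map/transform methods), where `MyTable` and the comma-separated record format are hypothetical:
```
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type parsed from each Kafka message.
case class MyTable(id: Long, name: String)

def parse(line: String): MyTable = {
  val parts = line.split(",")
  MyTable(parts(0).toLong, parts(1))
}

// Keep the parsed records as a DStream instead of collecting RDDs out of foreachRDD.
def toTableStream(raw: DStream[String]): DStream[MyTable] =
  raw.map(parse)   // or raw.transform(rdd => rdd.map(parse)) for RDD-level access
```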

Re: spark on yarn

2015-07-14 Thread Shushant Arora
Can a container have multiple JVMs running in YARN? I am comparing Hadoop MapReduce running on YARN vs Spark running on YARN here: 1. Is the difference that in a Hadoop MapReduce job - say I specify 20 reducers and my job uses 10 map tasks - it needs a total of 30 containers or 30 vcores? I guess 30 v

Re: Spark on EMR with S3 example (Python)

2015-07-14 Thread Sujit Pal
Hi Roberto, I have written PySpark code that reads from private S3 buckets, it should be similar for public S3 buckets as well. You need to set the AWS access and secret keys into the SparkContext, then you can access the S3 folders and files with their s3n:// paths. Something like this: sc = Spa
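A sketch of the same wiring (Scala shown; the equivalent Hadoop configuration keys can also be set from PySpark). The key values and bucket path are placeholders:
```
// Make the S3 credentials available to Hadoop's s3n filesystem, then read as usual.
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

val lines = sc.textFile("s3n://some-bucket/some/prefix/*")
println(lines.count())
```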

To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Dan Dong
Hi, I'm wondering how to access elements of a linalg.Vector, e.g: sparseVector: Seq[org.apache.spark.mllib.linalg.Vector] = List((3,[1,2],[1.0,2.0]), (3,[0,1,2],[3.0,4.0,5.0])) scala> sparseVector(1) res16: org.apache.spark.mllib.linalg.Vector = (3,[0,1,2],[3.0,4.0,5.0]) How to get the indices

Re: To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Burak Yavuz
Hi Dan, You could zip the indices with the values if you like. ``` val sVec = sparseVector(1).asInstanceOf[ org.apache.spark.mllib.linalg.SparseVector] val map = sVec.indices.zip(sVec.values).toMap ``` Best, Burak On Tue, Jul 14, 2015 at 12:23 PM, Dan Dong wrote: > Hi, > I'm wondering how t
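A self-contained sketch of that suggestion:
```
import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

// The second vector from the question: size 3, indices [0,1,2], values [3.0,4.0,5.0].
val v = Vectors.sparse(3, Array(0, 1, 2), Array(3.0, 4.0, 5.0))

val sVec = v.asInstanceOf[SparseVector]
val indexToValue = sVec.indices.zip(sVec.values).toMap

println(sVec.indices.toSeq)   // indices of the non-zero entries
println(sVec.values.toSeq)    // corresponding values
println(indexToValue(2))      // 5.0
```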

Re: spark on yarn

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 12:03 PM, Shushant Arora wrote: > Can a container have multiple JVMs running in YARN? > Yes and no. A container runs a single command, but that process can start other processes, and those also count towards the resource usage of the container (mostly memory). For example

Re: To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Dan Dong
Yes, it works! Thanks a lot Burak! Cheers, Dan 2015-07-14 14:34 GMT-05:00 Burak Yavuz : > Hi Dan, > > You could zip the indices with the values if you like. > > ``` > val sVec = sparseVector(1).asInstanceOf[ > org.apache.spark.mllib.linalg.SparseVector] > val map = sVec.indices.zip(sVec.values)

Java 8 vs Scala

2015-07-14 Thread spark user
Hi All, To start a new project in Spark, which technology is good: Java 8 or Scala? I am a Java developer. Can I start with Java 8, or do I need to learn Scala? Which one is the better technology for quickly starting any POC project? Thanks - su

DataFrame.withColumn() recomputes columns even after cache()

2015-07-14 Thread pnpritchard
Hi! I am seeing some unexpected behavior with regards to cache() in DataFrames. Here goes: In my Scala application, I have created a DataFrame that I run multiple operations on. It is expensive to recompute the DataFrame, so I have called cache() after it gets created. I notice that the cache()

Re: Java 8 vs Scala

2015-07-14 Thread Ted Yu
See previous thread: http://search-hadoop.com/m/q3RTtaXamv1nFTGR On Tue, Jul 14, 2015 at 1:30 PM, spark user wrote: > Hi All > > To Start new project in Spark , which technology is good .Java8 OR Scala . > > I am Java developer , Can i start with Java 8 or I Need to learn Scala . > > which one

Re: Java 8 vs Scala

2015-07-14 Thread Vineel Yalamarthy
Good question. Like you , many are in the same boat(coming from Java background). Looking forward to response from the community. Regards Vineel On Tue, Jul 14, 2015 at 2:30 PM, spark user wrote: > Hi All > > To Start new project in Spark , which technology is good .Java8 OR Scala . > > I

Re: How to speed up Spark process

2015-07-14 Thread ๏̯͡๏
Any solutions to solve this exception ? org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1 at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:389) at org.a

RE: Spark on EMR with S3 example (Python)

2015-07-14 Thread Pagliari, Roberto
Hi Sujit, I just wanted to access public datasets on Amazon. Do I still need to provide the keys? Thank you, From: Sujit Pal [mailto:sujitatgt...@gmail.com] Sent: Tuesday, July 14, 2015 3:14 PM To: Pagliari, Roberto Cc: user@spark.apache.org Subject: Re: Spark on EMR with S3 example (Python) H

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Just to be clear, you mean the Spark Standalone cluster manager's "master" and not the applications "driver", right. In that case, the earlier responses are correct. TD On Tue, Jul 14, 2015 at 11:26 AM, Mohammed Guller wrote: > The master node does not have to be similar to the worker nodes. It

Misaligned Rows with UDF

2015-07-14 Thread pedro
Hi, I am working at finding the root cause of a bug where rows in dataframes seem to have misaligned data. My dataframes have two types of columns: columns from data and columns from UDFs. I seem to be having trouble where for a given row, the row data doesn't match the data used to compute the UD

Re: master compile broken for scala 2.11

2015-07-14 Thread Josh Rosen
I've opened a PR to fix this; please take a look: https://github.com/apache/spark/pull/7405 On Tue, Jul 14, 2015 at 11:22 AM, Koert Kuipers wrote: > it works for scala 2.10, but for 2.11 i get: > > [ERROR] > /home/koert/src/spark/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeEx

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread algermissen1971
On 14 Jul 2015, at 23:26, Tathagata Das wrote: > Just to be clear, you mean the Spark Standalone cluster manager's "master" > and not the applications "driver", right. Sorry, by now I have understood that I would not necessarily put the driver app on the master node and that not making that

Re: Master vs. Slave Nodes Clarification

2015-07-14 Thread Tathagata Das
Yep :) On Tue, Jul 14, 2015 at 2:44 PM, algermissen1971 wrote: > > On 14 Jul 2015, at 23:26, Tathagata Das wrote: > > > Just to be clear, you mean the Spark Standalone cluster manager's > "master" and not the applications "driver", right. > > Sorry, by now I have understood that I would not nec

Data Frame for nested json

2015-07-14 Thread spark user
Does DataFrame support nested JSON for dumping directly to a database? For simple JSON it is working fine: {"id":2,"name":"Gerald","email":"gbarn...@zimbio.com","city":"Štoky","country":"Czech Republic","ip":"92.158.154.75"}, but for nested JSON it failed to load: root |-- rows: array (nullable = true)
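One way to handle a top-level array column is to flatten it before writing out. A minimal sketch, assuming an existing sqlContext and the root |-- rows: array schema shown above; the field names under rows are hypothetical:
```
import org.apache.spark.sql.functions.explode

val df = sqlContext.read.json("nested.json")

// Turn each element of the "rows" array into its own row, then pull out fields.
val flat = df.select(explode(df("rows")).as("row"))
  .select("row.id", "row.name", "row.email")

flat.show()
// flat can then be written to the database, e.g. via a JDBC writer.
```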

Re: spark streaming with kafka reset offset

2015-07-14 Thread Chen Song
A follow up question. When using createDirectStream approach, the offsets are checkpointed to HDFS and it is understandable by Spark Streaming job. Is there a way to expose the offsets via a REST api to end users. Or alternatively, is there a way to have offsets committed to Kafka Offset Manager s

Re: spark streaming with kafka reset offset

2015-07-14 Thread Cody Koeninger
You have access to the offset ranges for a given rdd in the stream by typecasting to HasOffsetRanges. You can then store the offsets wherever you need to. On Tue, Jul 14, 2015 at 5:00 PM, Chen Song wrote: > A follow up question. > > When using createDirectStream approach, the offsets are checkp

Re: spark streaming with kafka reset offset

2015-07-14 Thread Tathagata Das
Relevant documentation - https://spark.apache.org/docs/latest/streaming-kafka-integration.html, towards the end. directKafkaStream.foreachRDD { rdd => val offsetRanges = rdd.asInstanceOf[HasOffsetRanges] // offsetRanges.length = # of Kafka partitions being consumed ... } On Tue,
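A slightly fuller sketch of the same pattern; `saveOffsets` is a hypothetical hook for whatever store a REST API (or Kafka's offset manager) would read from:
```
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directKafkaStream.foreachRDD { rdd =>
  // One OffsetRange per Kafka partition consumed in this micro-batch.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  offsetRanges.foreach { o =>
    println(s"${o.topic} partition ${o.partition}: ${o.fromOffset} -> ${o.untilOffset}")
  }
  // saveOffsets(offsetRanges)   // persist wherever end users need to read them

  // ... process rdd ...
}
```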

Re: Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Elkhan Dadashov
Thanks, Marcelo. That article confused me; thanks for correcting it and for the helpful tips. I looked into virtual memory usage: jmap+jvisualvm do not show that 11.5 g virtual memory usage - it is much less. I get the 11.5 g virtual memory figure using the top -p pid command for the SparkSubmit process. The virtua

Re: Why does SparkSubmit process takes so much virtual memory in yarn-cluster mode ?

2015-07-14 Thread Marcelo Vanzin
On Tue, Jul 14, 2015 at 3:42 PM, Elkhan Dadashov wrote: > I looked into Virtual memory usage (jmap+jvisualvm) does not show that > 11.5 g Virtual Memory usage - it is much less. I get 11.5 g Virtual memory > usage using top -p pid command for SparkSubmit process. > If you're looking at top you w

Sessionization using updateStateByKey

2015-07-14 Thread swetha
Hi, I have a question regarding sessionization using updateStateByKey. If near real time state needs to be maintained in a Streaming application, what happens when the number of RDDs to maintain the state becomes very large? Does it automatically get saved to HDFS and reload when needed or do I h

Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-14 Thread Kelly, Jonathan
I've set up my cluster with a pre-calculated value for spark.executor.instances in spark-defaults.conf such that I can run a job and have it maximize the utilization of the cluster resources by default. However, if I want to run a job with dynamicAllocation (by passing -c spark.dynamicAllocation

Re: Sessionization using updateStateByKey

2015-07-14 Thread Tathagata Das
[Apologies for repost, for those who have seen this response already in the dev mailing list] 1. When you set ssc.checkpoint(checkpointDir), the spark streaming periodically saves the state RDD (which is a snapshot of all the state data) to HDFS using RDD checkpointing. In fact, a streaming app wi
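A minimal sessionization-style sketch around updateStateByKey, assuming an existing SparkContext `sc`; the socket source, port, checkpoint path, and input format are placeholders:
```
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("hdfs:///tmp/streaming-checkpoints")  // required for updateStateByKey

// Hypothetical input: lines of "sessionId,..." arriving on a socket.
val events = ssc.socketTextStream("localhost", 9999)
  .map(line => (line.split(",")(0), 1))

// State update: add this batch's events to the running count per session.
def updateCount(newEvents: Seq[Int], state: Option[Long]): Option[Long] =
  Some(state.getOrElse(0L) + newEvents.sum)

val sessionCounts = events.updateStateByKey(updateCount)
sessionCounts.print()

ssc.start()
ssc.awaitTermination()
```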

Re: java.lang.IllegalStateException: unread block data

2015-07-14 Thread Arthur Chan
I found the reason, it is about sc. Thanks On Tue, Jul 14, 2015 at 9:45 PM, Akhil Das wrote: > Someone else also reported this error with spark 1.4.0 > > Thanks > Best Regards > > On Tue, Jul 14, 2015 at 6:57 PM, Arthur Chan > wrote: > >> Hi, Below is the log form the worker. >> >> >> 15/07/14

Getting not implemented by the TFS FileSystem implementation

2015-07-14 Thread Jerrick Hoang
Hi all, I'm upgrading from Spark 1.3 to Spark 1.4, and when trying to run the spark-sql CLI it gave a ```java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation``` exception. I did not get this error with 1.3 and I don't use any TFS FileSystem. Full stack trace is

Re: DataFrame.withColumn() recomputes columns even after cache()

2015-07-14 Thread pnpritchard
I was able to workaround this by converting the DataFrame to an RDD and then back to DataFrame. This seems very weird to me, so any insight would be much appreciated! Thanks, Nick P.S. Here's the updated code with the workaround: ``` // Examples udf's that println when called val twice =
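A sketch of that workaround, assuming an existing sqlContext, an expensive DataFrame `df`, a numeric column `value`, and the example UDF from the thread (here `twiceUdf`); the column and UDF names are hypothetical:
```
// Rebuild the DataFrame from its RDD and schema, cache that, and force it once;
// later withColumn calls then reuse the cached data instead of recomputing df.
val materialized = sqlContext.createDataFrame(df.rdd, df.schema).cache()
materialized.count()

val withExtra = materialized.withColumn("twice", twiceUdf(materialized("value")))
```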

SparkSQL 1.4 can't accept registration of UDF?

2015-07-14 Thread ogoh
Hello, I am using SparkSQL along with the ThriftServer so that we can access it using Hive queries. With Spark 1.3.1, I can register a UDF function, but Spark 1.4.0 doesn't work for that. The jar of the UDF is the same. Below are the logs; I appreciate any advice. == With Spark 1.4 Beeline version 1.4.0 by Apache

Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread swetha
Hi, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-IndexedRDD-avail

Sorted Multiple Outputs

2015-07-14 Thread Yiannis Gkoufas
Hi there, I have been using the approach described here: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job In addition to that, I was wondering if there is a way to customize the order of the values contained in each file. Thanks a lot!

Re: creating a distributed index

2015-07-14 Thread swetha
Hi Ankur, Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark Streaming to do lookups/updates/deletes in RDDs using keys by storing them as key/value pairs. Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/cr

How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Brandon White
Hello there, I have a JDBC connection set up to my Spark cluster but I cannot see the tables that I cache in memory. The only tables I can see are those that are in my Hive instance. I use a HiveContext to register a table and cache it in memory. How can I enable my JDBC connection to query this in

RE: How do you access a cached Spark SQL Table from a JDBC connection?

2015-07-14 Thread Cheng, Hao
Can you describe how you cached the tables? In another HiveContext? AFAIK, a cached table is only visible within the same HiveContext; you probably need to execute a SQL query like "cache table mytable as SELECT xxx" in the JDBC connection as well. Cheng Hao From: Brandon White [mailto:bwwinthe
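A sketch of that suggestion, assuming `hiveContext` is the same HiveContext backing the Thrift/JDBC server; the table name and predicate are placeholders:
```
// Run the caching statement in the HiveContext the JDBC server uses (or send the
// same statement through the JDBC session itself, e.g. from beeline).
hiveContext.sql("CACHE TABLE mytable_cached AS SELECT * FROM mytable WHERE year = 2015")

// A JDBC client connected to that server can then query:
//   SELECT COUNT(*) FROM mytable_cached
```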

Re: Is IndexedRDD available in Spark 1.4.0?

2015-07-14 Thread Ted Yu
Please take a look at SPARK-2365 which is in progress. On Tue, Jul 14, 2015 at 5:18 PM, swetha wrote: > Hi, > > Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark > Streaming to do lookups/updates/deletes in RDDs using keys by storing them > as key/value pairs. > > Thanks
