Regarding benefits of using more than one CPU for a task in Spark

2015-04-07 Thread twinkle sachdeva
Hi, In Spark, there are two settings regarding the number of cores. One is at the task level: spark.task.cpus, and there is another one which drives the number of cores per executor: spark.executor.cores. Apart from using more than one core for a task which has to call some other external API etc., is there
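A minimal sketch of how the two settings combine, with illustrative values only (nothing here comes from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Each executor gets 4 cores and every task reserves 2 of them,
// so at most 4 / 2 = 2 tasks run concurrently per executor.
val conf = new SparkConf()
  .setAppName("cpus-per-task-sketch")
  .set("spark.executor.cores", "4") // cores available to each executor
  .set("spark.task.cpus", "2")      // cores reserved by every single task

val sc = new SparkContext(conf)
```

Reserving more than one core per task mainly pays off when the task itself spawns threads or blocks on an external service, which is exactly the case the question mentions.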

RE: Spark 1.2.0 with Play/Activator

2015-04-07 Thread Manish Gupta 8
If I try to build spark-notebook with "spark.version"="1.2.0-cdh5.3.0", sbt throws these warnings before failing to compile: :: org.apache.spark#spark-yarn_2.10;1.2.0-cdh5.3.0: not found :: org.apache.spark#spark-repl_2.10;1.2.0-cdh5.3.0: not found Any suggestions? Thanks From: Manish Gupta 8 [

Re: Spark 1.2.0 with Play/Activator

2015-04-07 Thread andy petrella
Mmmh, you want it running Spark 1.2 with Hadoop 2.5.0-cdh5.3.2, right? If I'm not wrong you might have to launch it like so: ``` sbt -Dspark.version=1.2.0 -Dhadoop.version=2.5.0-cdh5.3.2 ``` Or you can download it from http://spark-notebook.io if you want. HTH andy On Tue, Apr 7, 2015 at 9:0

Re: Spark Application Stages and DAG

2015-04-07 Thread Vijay Innamuri
My Spark Streaming application processes the data received in each interval. In the Spark Stages UI, all the stages point to a single line of code, *windowDStream.foreachRDD*, only (not the actions inside the DStream) - Following is the information from the Spark Stages UI page: Stage IdDescr

Re: task not serializable

2015-04-07 Thread Jeetendra Gangele
Let's say I follow the below approach and I get an RddPair with huge size .. which cannot fit into one machine ... then how to run foreach on this RDD? On 7 April 2015 at 04:25, Jeetendra Gangele wrote: > > > On 7 April 2015 at 04:03, Dean Wampler wrote: > >> >> On Mon, Apr 6, 2015 at 6:20 PM, Jeetendra G

[DAGScheduler][OutputCommitCoordinator] OutputCommitCoordinator.authorizedCommittersByStage Map Out Of Memory

2015-04-07 Thread Tao Li
Hi all: I am using Spark Streaming (1.3.1) as a long-running service and it runs out of memory after running for 7 days. I found that the field *authorizedCommittersByStage* in the *OutputCommitCoordinator* class causes the OOM. authorizedCommittersByStage is a map, key is StageId, value is Map[Partition

Re: Spark streaming with Kafka - couldn't find KafkaUtils

2015-04-07 Thread Felix C
Or you could build an uber jar ( you could google that ) https://eradiating.wordpress.com/2015/02/15/getting-spark-streaming-on-kafka-to-work/ --- Original Message --- From: "Akhil Das" Sent: April 4, 2015 11:52 PM To: "Priya Ch" Cc: user@spark.apache.org, "dev" Subject: Re: Spark streaming w

Re: Processing Large Images in Spark?

2015-04-07 Thread Steve Loughran
On 6 Apr 2015, at 23:05, Patrick Young <patrick.mckendree.yo...@gmail.com> wrote: does anyone have any thoughts on storing a really large raster in HDFS? Seems like if I just dump the image into HDFS as is, it'll get stored in blocks all across the system and when I go to read it, the

Can not get executor's Log from Spark's History Server

2015-04-07 Thread donhoff_h
Hi, Experts I run my Spark cluster on YARN. I used to get executors' logs from Spark's History Server. But after I started my Hadoop jobhistory server and made the configuration to aggregate logs of Hadoop jobs to an HDFS directory, I found that I could not get Spark's executors' logs any more. Is

[GraphX] aggregateMessages with active set

2015-04-07 Thread James
Hello, The old GraphX API "mapReduceTriplets" has an optional parameter "activeSetOpt: Option[(VertexRDD[_], EdgeDirection)]" that limits the input of sendMessage. However, in the new API "aggregateMessages" I could not find this option; why is it not offered any more? Alcaid

Re: Processing Large Images in Spark?

2015-04-07 Thread andy petrella
Heya, You might be interested in looking at GeoTrellis. They use RDDs of Tiles to process big images like Landsat ones can be (especially 8). However, I see you have only 1G per file, so I guess you only care about a single band? Or is it a reboxed pic? Note: I think the GeoTrellis image format is s

'Java heap space' error occurred when querying 4G data file from HDFS

2015-04-07 Thread 李铖
In my dev-test env I have 3 virtual machines; every machine has 12G memory and 8 CPU cores. Here are spark-defaults.conf and spark-env.sh. Maybe some config is not right. I run this command: *spark-submit --master yarn-client --driver-memory 7g --executor-memory 6g /home/hadoop/spark/main.py* exceptio

scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Yamini
Using Spark (1.2) Streaming to read Avro-schema-based topics flowing in Kafka and then using the Spark SQL context to register the data as a temp table. The Avro Maven plugin (version 1.7.7) generates the Java bean class for the Avro file but includes a field named SCHEMA$ of type org.apache.avro.Schema which is n

Issue with pyspark 1.3.0, sql package and rows

2015-04-07 Thread Stefano Parmesan
Hi all, I've already opened a bug on Jira some days ago [1] but I'm starting to think this is not the correct way to go since I haven't got any news about it yet. Let me try to explain it briefly: with pyspark, trying to cogroup two input files with different schemas leads (nondeterministically) t

Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data

2015-04-07 Thread Puneet Kumar Ojha
Hi, Is there any difference between textFile vs hadoopFile (TextInputFormat) when data is present in HDFS? Will there be any performance gain that can be observed? Puneet Kumar Ojha Data Architect | PubMatic

Pipelines for controlling workflow

2015-04-07 Thread Staffan
Hi, I am building a pipeline and I've read most of what I can find on the topic (the spark.ml library and the AMPcamp version of pipelines: http://ampcamp.berkeley.edu/5/exercises/image-classification-with-pipelines.html). I do not have structured data as in the case of the new spark.ml library which uses

Re: Difference between textFile vs hadoopFile (TextInputFormat) on HDFS data

2015-04-07 Thread Nick Pentreath
There is no difference - textFile calls hadoopFile with a TextInputFormat, and maps each value to a String. On Tue, Apr 7, 2015 at 1:46 PM, Puneet Kumar Ojha wrote: > Hi, > Is there any difference between textFile vs hadoopFile > (TextInputFormat) wh
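A sketch of the equivalence Nick describes, assuming an existing SparkContext `sc` and a placeholder path; the second form is roughly what textFile does internally:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Convenience form:
val lines1 = sc.textFile("hdfs:///data/input")

// Expanded form: hadoopFile with TextInputFormat, keeping only the
// value (the line) and converting it to a String.
val lines2 = sc
  .hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
  .map { case (_, line) => line.toString }
```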

Re: task not serializable

2015-04-07 Thread Dean Wampler
Foreach() runs in parallel across the cluster, like map, flatMap, etc. You'll only run into problems if you call collect(), which brings the entire RDD into memory in the driver program. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (
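A small contrast of the two calls, assuming an existing SparkContext `sc`:

```scala
val rdd = sc.parallelize(1 to 1000000)

// foreach runs on the executors, in parallel; elements never need to
// fit in the driver's memory.
rdd.foreach { x => /* e.g. write x to an external store */ }

// collect() pulls every element back into the driver -- fine for small
// results, a likely OutOfMemoryError for an RDD that spans the cluster.
val all: Array[Int] = rdd.collect()
```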

Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
Hello, First of all, thank you to everyone working on Spark. I've only been using it for a few weeks now but so far I'm really enjoying it. You saved me from a big, scary elephant! :-) I was wondering if anyone might be able to offer some advice about working with the Thrift JDBC server? I'm tryi

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread ARose
I am having the same issue with my java application. String url = "jdbc:sqlserver://" + host + ":1433;DatabaseName=" + database + ";integratedSecurity=true"; String driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"; SparkConf conf = new SparkConf().setAppName(appName

ML consumption time based on data volume - same cluster

2015-04-07 Thread Vasyl Harasymiv
Hi Spark Community, Imagine you have a stable computing cluster (e.g. 5 nodes) with Hadoop that does not run anything other than your Spark jobs. Now imagine you run simple machine learning on the data (e.g. 100MB): 1. K-means - 5 min 2. Logistic regression - 5 min Now imagine that the volume

RE: [BUG]Broadcast value return empty after turn to org.apache.spark.serializer.KryoSerializer

2015-04-07 Thread Shuai Zheng
I have found the issue, but I think it is a bug. If I change my class to: public class ModelSessionBuilder implements Serializable { /** * */ private Properties[] propertiesList; private

Does spark utilize the sorted order of hbase keys, when using hbase as data source

2015-04-07 Thread Юра
Hello, guys! I am a newbie to Spark and would appreciate any advice or help. Here is the detailed question: http://stackoverflow.com/questions/29493472/does-spark-utilize-the-sorted-order-of-hbase-keys-when-using-hbase-as-data-sour Regards, Yury

Re: Does spark utilize the sorted order of hbase keys, when using hbase as data source

2015-04-07 Thread Ted Yu
How many distinct users are stored in HBase ? TableInputFormat produces splits where number of splits matches the number of regions in a table. You can write your own InputFormat which splits according to user id. FYI On Tue, Apr 7, 2015 at 7:36 AM, Юра wrote: > Hello, guys! > > I am a newbie

RE: --driver-memory parameter doesn't work for spark-submmit on yarn?

2015-04-07 Thread Shuai Zheng
Sorry for the late reply. I bypassed this by setting _JAVA_OPTIONS. And the ps aux | grep spark hadoop 14442 0.6 0.2 34334552 128560 pts/0 Sl+ 14:37 0:01 /usr/java/latest/bin/java org.apache.spark.deploy.SparkSubmitDriverBootstrapper --driver-memory=5G --executor-memory=10G --master yarn-client -

Re: Microsoft SQL jdbc support from spark sql

2015-04-07 Thread Denny Lee
That's correct, MS SQL Server is not supported through the JDBC data source at this time. In my environment, we've been using Hadoop streaming to extract data from multiple SQL Servers, pushing the data into HDFS, creating the Hive tables and/or converting them into Parquet, and t

Re: Does spark utilize the sorted order of hbase keys, when using hbase as data source

2015-04-07 Thread Юра
There are 500 million distinct users... 2015-04-07 17:45 GMT+03:00 Ted Yu : > How many distinct users are stored in HBase ? > > TableInputFormat produces splits where the number of splits matches the number > of regions in a table. You can write your own InputFormat which splits > according to user

The difference between Spark SQL/DataFrame join and RDD join

2015-04-07 Thread Hao Ren
Hi, We have 2 Hive tables and want to join one with the other. Initially, we ran a SQL request on HiveContext, but it did not work; it was blocked at 30/600 tasks. Then we tried to load the tables into two DataFrames, and we encountered the same problem. Finally, it works with RDD.join. What we have

Re: Does spark utilize the sorted order of hbase keys, when using hbase as data source

2015-04-07 Thread Ted Yu
Then splitting according to user id's is out of the question :-) On Tue, Apr 7, 2015 at 8:12 AM, Юра wrote: > There are 500 million distinct users... > > 2015-04-07 17:45 GMT+03:00 Ted Yu : > >> How many distinct users are stored in HBase ? >> >> TableInputFormat produces splits where number of

Re: task not serializable

2015-04-07 Thread Jeetendra Gangele
I am thinking to follow the below approach (in my class HBase also returns the same object which I will get in the RDD). 1. First run the flatMapToPair: JavaPairRDD> pairvendorData = matchRdd.flatMapToPair( new PairFlatMapFunction(){ @Override public Iterable> call( VendorRecord t) throws Exception { List> p

Re: RDD collect hangs on large input data

2015-04-07 Thread Jon Chase
Zsolt - what version of Java are you running? On Mon, Mar 30, 2015 at 7:12 AM, Zsolt Tóth wrote: > Thanks for your answer! > I don't call .collect because I want to trigger the execution. I call it > because I need the rdd on the driver. This is not a huge RDD and it's not > larger than the one

FlatMapPair run for longer time

2015-04-07 Thread Jeetendra Gangele
Hi All, I am running the below code and it's running for a very long time, where the input to flatMapToPair is 50K records, and I am calling HBase 50K times with just a range scan query, which should not take much time. Can anybody guide me on what is wrong here? JavaPairRDD> pairvendorData = matchRdd.flatMapToPair( n

Incrementally load big RDD file into memory

2015-04-07 Thread mas
val locations = filelines.map(line => line.split("\t")).map(t => (t(5).toLong, (t(2).toDouble, t(3).toDouble))).distinct().collect() val cartesienProduct=locations.cartesian(locations).map(t=> Edge(t._1._1,t._2._1,distanceAmongPoints(t._1._2._1,t._1._2._2,t._2._2._1,t._2._2._2))) Code executes p

Re: A problem with Spark 1.3 artifacts

2015-04-07 Thread Marcelo Vanzin
BTW, just out of curiosity, I checked both the 1.3.0 release assembly and the spark-core_2.10 artifact downloaded from http://mvnrepository.com/, and neither contain any references to anything under "org.eclipse" (all referenced jetty classes are the shaded ones under org.spark-project.jetty). On

Re: FlatMapPair run for longer time

2015-04-07 Thread Dean Wampler
It's hard for us to diagnose your performance problems, because we don't have your environment and fixing one will simply reveal the next one to be fixed. So, I suggest you use the following strategy to figure out what takes the most time and hence what you might try to optimize. Try replacing expr

RE: Incrementally load big RDD file into memory

2015-04-07 Thread java8964
cartesian is an expensive operation. If you have 'M' records in locations, then locations.cartesian(locations) will generate M x M results. If locations is a big RDD, it is hard to do locations.cartesian(locations) efficiently. Yong > Date: Tue, 7 Apr 2015 10:04:12 -0700 > From: mas.ha...@gmail.c
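To put a rough number on that blow-up (the size is hypothetical, not from the thread): with M = 100,000 locations, locations.cartesian(locations) materializes 100,000 × 100,000 = 10^10 pairs, so filtering or bucketing the points before pairing them is usually unavoidable.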

Re: Using DIMSUM with ids

2015-04-07 Thread Debasish Das
I have a version that works well for Netflix data, but now I am validating on internal datasets... this code will work on matrix factors and sparse matrices that have rows = 100 * columns. If columns are much smaller than rows then the column-based flow works well... basically we need both flows... I did not

Re: [GraphX] aggregateMessages with active set

2015-04-07 Thread Ankur Dave
We thought it would be better to simplify the interface, since the active set is a performance optimization but the result is identical to calling subgraph before aggregateMessages. The active set option is still there in the package-private method aggregateMessagesWithActiveSet. You can actually
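A rough sketch of the equivalence being described, restricting the graph before aggregating; the graph, attribute types, and active-vertex set below are made up for illustration:

```scala
import org.apache.spark.graphx._

// Hypothetical setup: vertex and edge attributes are Doubles, and
// `activeIds` marks the vertices still considered "active".
def sumOverActive(graph: Graph[Double, Double],
                  activeIds: VertexRDD[Unit]): VertexRDD[Double] = {
  // Flag every vertex with whether it belongs to the active set...
  val flagged = graph.outerJoinVertices(activeIds) {
    (_, attr, act) => (attr, act.isDefined)
  }

  // ...keep only edges whose source is active (one of the old
  // activeSetOpt directions)...
  val restricted = flagged.subgraph(epred = et => et.srcAttr._2)

  // ...then aggregate messages on the reduced graph as usual.
  restricted.aggregateMessages[Double](
    ctx => ctx.sendToDst(ctx.srcAttr._1), // send the source attribute downstream
    _ + _                                 // sum the messages at each destination
  )
}
```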

Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Michael Armbrust
Have you looked at spark-avro? https://github.com/databricks/spark-avro On Tue, Apr 7, 2015 at 3:57 AM, Yamini wrote: > Using spark(1.2) streaming to read avro schema based topics flowing in > kafka > and then using spark sql context to register data as temp table. Avro maven > plugin(1.7.7 ver

Re: The difference between Spark SQL/DataFrame join and RDD join

2015-04-07 Thread Michael Armbrust
The joins here are totally different implementations, but it is worrisome that you are seeing the SQL join hanging. Can you provide more information about the hang? jstack of the driver and a worker that is processing a task would be very useful. On Tue, Apr 7, 2015 at 8:33 AM, Hao Ren wrote:

Re: Can not get executor's Log from Spark's History Server

2015-04-07 Thread Marcelo Vanzin
The Spark history server does not have the ability to serve executor logs currently. You need to use the "yarn logs" command for that. On Tue, Apr 7, 2015 at 2:51 AM, donhoff_h <165612...@qq.com> wrote: > Hi, Experts > > I run my Spark Cluster on Yarn. I used to get executors' Logs from Spark's >

Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Yamini Maddirala
Hi Michael, Yes, I did try the spark-avro 0.2.0 Databricks project. I am using CDH5.3 which is based on Spark 1.2, hence I'm bound to use spark-avro 0.2.0 instead of the latest. I'm not sure how the spark-avro project can help me in this scenario. 1. I have a JavaDStream of type Avro generic record: JavaD

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread Michael Armbrust
> > 1) What exactly is the relationship between the thrift server and Hive? > I'm guessing Spark is just making use of the Hive metastore to access table > definitions, and maybe some other things, is that the case? > Underneath the covers, the Spark SQL thrift server is executing queries using a

Array[T].distinct doesn't work inside RDD

2015-04-07 Thread anny9699
Hi, I have a question about Array[T].distinct on a customized class T. My data is like RDD[(String, Array[T])] in which T is a class written by myself. There are some duplicates in each Array[T] so I want to remove them. I override the equals() method in T and use val dataNoDuplicates = dataDu

How to generate Java bean class for avro files using spark avro project

2015-04-07 Thread Yamini
Is there a way to generate a Java bean for a given Avro schema file in Spark 1.2 using the spark-avro project 0.2.0 for the following use case? 1. Topics from Kafka are read and stored in the form of Avro generic records: JavaDStream 2. Using the spark-avro project I am able to get the schema in the following way: Jav

Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Yamini Maddirala
For more details on my question http://apache-spark-user-list.1001560.n3.nabble.com/How-to-generate-Java-bean-class-for-avro-files-using-spark-avro-project-tp22413.html Thanks, Yamini On Tue, Apr 7, 2015 at 2:23 PM, Yamini Maddirala wrote: > Hi Michael, > > Yes, I did try spark-avro 0.2.0 datab

Re: From DataFrame to LabeledPoint

2015-04-07 Thread Sergio Jiménez Barrio
Solved! Thanks for your help. I had converted null values to the Double value (0.0) On 06/04/2015 19:25, "Joseph Bradley" wrote: > I'd make sure you're selecting the correct columns. If not that, then > your input data might be corrupt. > > CCing user to keep it on the user list. > > On Mon, Apr 6,
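A sketch of that conversion with the nulls-to-0.0 handling mentioned above; the column layout (label in column 0, numeric features after it) is an assumption:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// `df` stands for the DataFrame being converted (hypothetical).
val points = df.rdd.map { row =>
  val label = if (row.isNullAt(0)) 0.0 else row.getDouble(0)
  val features = (1 until row.length).map { i =>
    if (row.isNullAt(i)) 0.0 else row.getDouble(i) // nulls mapped to 0.0
  }
  LabeledPoint(label, Vectors.dense(features.toArray))
}
```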

How to get SparkSql results on a webpage on real time

2015-04-07 Thread Mukund Ranjan (muranjan)
Hi, I have written a Scala object which can run queries on the messages which I am receiving from Kafka. Now I have to show it on some webpage or dashboard which can auto-refresh with new results. Any pointers on how I can do that? Thanks, Mukund

set spark.storage.memoryFraction to 0 when no cached RDD and memory area for broadcast value?

2015-04-07 Thread Shuai Zheng
Hi All, I am a bit confused about spark.storage.memoryFraction. This is used to set the area for RDD usage; does this RDD mean only cached and persisted RDDs? So if my program has no cached RDD at all (meaning that I have no .cache() or .persist() call on any RDD), then can I set this spark.stor

Re: A problem with Spark 1.3 artifacts

2015-04-07 Thread Jacek Lewandowski
So weird, as I said - I created a new empty project where Spark core was the only dependency... JACEK LEWANDOWSKI DSE Software Engineer | +48.609.810.774 | jacek.lewandow...@datastax.com

Re: A problem with Spark 1.3 artifacts

2015-04-07 Thread Marcelo Vanzin
Maybe you have some sbt-built 1.3 version in your ~/.ivy2/ directory that's masking the maven one? That's the only explanation I can come up with... On Tue, Apr 7, 2015 at 12:22 PM, Jacek Lewandowski < jacek.lewandow...@datastax.com> wrote: > So weird, as I said - I created a new empty project wh

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
Hi Michael, Thanks so much for the reply - that really cleared a lot of things up for me! Let me just check that I've interpreted one of your suggestions for (4) correctly... Would it make sense for me to write a small wrapper app that pulls in hive-thriftserver as a dependency, iterates my Parqu
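A bare-bones sketch of such a wrapper, assuming the hive-thriftserver artifact is available on the classpath; the paths and table names are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftWrapper {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("thrift-wrapper"))
    val hc = new HiveContext(sc)

    // Register each Parquet dataset as a temporary table (placeholder paths).
    hc.parquetFile("hdfs:///data/events").registerTempTable("events")
    hc.parquetFile("hdfs:///data/users").registerTempTable("users")

    // Expose the registered tables over JDBC/ODBC via the Thrift server.
    HiveThriftServer2.startWithContext(hc)
  }
}
```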

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread Michael Armbrust
That should totally work. The other option would be to run a persistent metastore that multiple contexts can talk to and periodically run a job that creates missing tables. The trade-off here would be more complexity, but less downtime due to the server restarting. On Tue, Apr 7, 2015 at 12:34 P

Re: Advice using Spark SQL and Thrift JDBC Server

2015-04-07 Thread James Aley
Excellent, thanks for your help, I appreciate your advice! On 7 Apr 2015 20:43, "Michael Armbrust" wrote: > That should totally work. The other option would be to run a persistent > metastore that multiple contexts can talk to and periodically run a job > that creates missing tables. The trade-

broken link on Spark Programming Guide

2015-04-07 Thread jonathangreenleaf
in the current Programming Guide: https://spark.apache.org/docs/1.3.0/programming-guide.html#actions under Actions, the Python link goes to: https://spark.apache.org/docs/1.3.0/api/python/pyspark.rdd.RDD-class.html which is 404 which I think should be: https://spark.apache.org/docs/1.3.0/api/pyth

Re: broken link on Spark Programming Guide

2015-04-07 Thread Ted Yu
For the last link, you might have meant: https://spark.apache.org/docs/1.3.0/api/python/pyspark.html#pyspark.RDD Cheers On Tue, Apr 7, 2015 at 1:32 PM, jonathangreenleaf < jonathangreenl...@gmail.com> wrote: > in the current Programming Guide: > https://spark.apache.org/docs/1.3.0/programming-gu

Re: 'Java heap space' error occurred when querying 4G data file from HDFS

2015-04-07 Thread 李铖
Any help, please? Help me get the configuration right. 李铖 wrote on Tuesday, 7 April 2015: > In my dev-test env I have 3 virtual machines; every machine has 12G > memory and 8 CPU cores. > > Here are spark-defaults.conf and spark-env.sh. Maybe some config is not > right. > > I run this command: *spark-submit --master yarn-

parquet partition discovery

2015-04-07 Thread Christopher Petro
I was unable to get this feature to work in 1.3.0. I tried building off master and it still wasn't working for me. So I dug into the code, and I'm not sure how the parsePartition() was ever working. The while loop which walks up the parent directories in the path always terminates after a single

RE: 'Java heap space' error occurred when querying 4G data file from HDFS

2015-04-07 Thread java8964
It is hard to guess why OOM happens without knowing your application's logic and the data size. Without knowing that, I can only guess based on some common experiences: 1) increase "spark.default.parallelism" 2) increase your executor-memory, maybe 6g is just not enough 3) your environment is kind

Re: Array[T].distinct doesn't work inside RDD

2015-04-07 Thread Anny Chen
Hi Sean, I didn't override hashCode. But the problem is that Array[T].toSet could work but Array[T].distinct couldn't. If it is because I didn't override hashCode, then toSet shouldn't work either, right? I also tried using this Array[T].distinct outside the RDD, and it is working alright as well, returning

Re: 'Java heap space' error occurred when querying 4G data file from HDFS

2015-04-07 Thread Ted Yu
李铖: w.r.t. #5, you can use --executor-cores when invoking spark-submit Cheers On Tue, Apr 7, 2015 at 2:35 PM, java8964 wrote: > It is hard to guess why OOM happens without knowing your application's > logic and the data size. > > Without knowing that, I can only guess based on some common exper

How to use Joda Time with Spark SQL?

2015-04-07 Thread adamgerst
I've been using Joda Time in all my Spark jobs (by using the nscala-time package) and have not run into any issues until I started trying to use Spark SQL. When I try to convert a case class that has a com.github.nscala_time.time.Imports.DateTime object in it, an exception is thrown with a Mat

Unable to run spark examples on cloudera

2015-04-07 Thread Georgi Knox
Hi There, We've just started to trial Spark at Bitly. We are running Spark 1.2.1 on Cloudera (CDH-5.3.0) with Hadoop 2.5.0 and are running into issues even just trying to run the Python examples. It's just being run in standalone mode, I believe. $ ./bin/spark-submit --driver-memory 2g examples/s

Re: ML consumption time based on data volume - same cluster

2015-04-07 Thread Xiangrui Meng
This could be empirically verified in spark-perf: https://github.com/databricks/spark-perf. Theoretically, it would be < 2x for k-means and logistic regression, because computation is doubled but communication cost remains the same. -Xiangrui On Tue, Apr 7, 2015 at 7:15 AM, Vasyl Harasymiv wrote:
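One way to see that bound (a rough per-iteration model, not from the thread): if an iteration costs roughly T(n) = a·n for the data-dependent computation plus a fixed communication term b, then T(2n)/T(n) = (2a·n + b)/(a·n + b), which stays below 2 whenever b > 0.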

Job submission API

2015-04-07 Thread Prashant Kommireddi
Hello folks, Newbie here! Just had a quick question - is there a job submission API such as the one with Hadoop https://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/Job.html#submit() to submit Spark jobs to a YARN cluster? I see in the examples that bin/spark-submit is what's out there

Re: Job submission API

2015-04-07 Thread michal.klo...@gmail.com
A SparkContext can submit jobs remotely. The spark-submit options in general can be populated into a SparkConf and passed in when you create a SparkContext. We personally have not had too much success with yarn-client remote submission, but standalone cluster mode was easy to get going. M >
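A minimal sketch of that approach, with a placeholder master URL and resource settings; the point is only that the usual spark-submit flags map onto SparkConf keys:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("remote-submission-sketch")
  .setMaster("spark://standalone-master:7077") // standalone cluster, as in the reply
  .set("spark.executor.memory", "4g")
  .set("spark.cores.max", "8")

val sc = new SparkContext(conf)
sc.parallelize(1 to 100).count() // the job now runs on the remote cluster
```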

Re: Job submission API

2015-04-07 Thread Veena Basavaraj
The following might be helpful. http://community.cloudera.com/t5/Advanced-Analytics-Apache-Spark/What-dependencies-to-submit-Spark-jobs-programmatically-not-via/td-p/24721 http://blog.sequenceiq.com/blog/2014/08/22/spark-submit-in-java/ On 7 April 2015 at 16:32, michal.klo...@gmail.com wrote:

Re: Job submission API

2015-04-07 Thread HARIPRIYA AYYALASOMAYAJULA
Hello, If you are looking for the command to submit, the following command works: spark-submit --class "SampleTest" --master yarn-cluster --num-executors 4 --executor-cores 2 /home/priya/Spark/Func1/target/scala-2.10/simple-project_2.10-1.0.jar On Tue, Apr 7, 2015 at 6:36 PM, Veena Basavaraj wro

Re: DataFrame groupBy MapType

2015-04-07 Thread Justin Yip
Thanks Michael. Will submit a ticket. Justin On Mon, Apr 6, 2015 at 1:53 PM, Michael Armbrust wrote: > I'll add that I don't think there is a convenient way to do this in the > Column API ATM, but would welcome a JIRA for adding it :) > > On Mon, Apr 6, 2015 at 1:45 PM, Michael Armbrust > wrot

Error when running Spark on Windows 8.1

2015-04-07 Thread Arun Lists
Hi, We are trying to run a Spark application using spark-submit on Windows 8.1. The application runs successfully to completion on MacOS 10.10 and on Ubuntu Linux. On Windows, we get the following error messages (see below). It appears that Spark is trying to delete some temporary directory that i

Drools in Spark

2015-04-07 Thread Sathish Kumaran Vairavelu
Hello, Just want to check if anyone has tried drools with Spark? Please let me know. Are there any alternate rule engine that works well with Spark? Thanks Sathish

Expected behavior for DataFrame.unionAll

2015-04-07 Thread Justin Yip
Hello, I am experimenting with DataFrame. I tried to construct two DataFrames with: 1. case class A(a: Int, b: String) scala> adf.printSchema() root |-- a: integer (nullable = false) |-- b: string (nullable = true) 2. case class B(a: String, c: Int) scala> bdf.printSchema() root |-- a: string

Specifying Spark property from command line?

2015-04-07 Thread Arun Lists
Hi, Is it possible to specify a Spark property like spark.local.dir from the command line when running an application using spark-submit? Thanks, arun

Re: ML consumption time based on data volume - same cluster

2015-04-07 Thread Vasyl Harasymiv
Thank you Xiangrui. Indeed; however, if the computation involves taking a matrix, even locally, like random forest, then if the data increases 2x, even local computation time should increase >2x. But I will test it with spark-perf and let you know! On Tue, Apr 7, 2015 at 4:50 PM, Xiangrui Meng wrote:

Re: Specifying Spark property from command line?

2015-04-07 Thread Arun Lists
I just figured this out from the documentation: --conf spark.local.dir="C:\Temp" On Tue, Apr 7, 2015 at 5:00 PM, Arun Lists wrote: > Hi, > > Is it possible to specify a Spark property like spark.local.dir from the > command line when running an application using spark-submit? > > Thanks, > aru

Re: Array[T].distinct doesn't work inside RDD

2015-04-07 Thread Sean Owen
I suppose it depends a lot on the implementations. In general, distinct and toSet work when hashCode and equals are defined correctly. When that isn't the case, the result isn't defined; it might happen to work in some cases. This could well explain why you see different results. Why not implement
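A sketch of what Sean is suggesting, for a hypothetical element class; a case class derives consistent equals/hashCode automatically, otherwise override them together:

```scala
// Option 1: let Scala generate equals and hashCode that agree.
case class Point(x: Int, y: Int)

// Option 2: a plain class must override both, keeping the contract
// that equal objects produce equal hash codes -- distinct and toSet
// both rely on it.
class PointLike(val x: Int, val y: Int) {
  override def equals(other: Any): Boolean = other match {
    case p: PointLike => p.x == x && p.y == y
    case _            => false
  }
  override def hashCode: Int = 31 * x + y
}
```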

Re: broken link on Spark Programming Guide

2015-04-07 Thread Sean Owen
I fixed this a while ago in master. It should go out with the next release and next push of the site. On Tue, Apr 7, 2015 at 4:32 PM, jonathangreenleaf wrote: > in the current Programming Guide: > https://spark.apache.org/docs/1.3.0/programming-guide.html#actions > > under Actions, the Python lin

HiveThriftServer2

2015-04-07 Thread Mohammed Guller
Hi - I want to create an instance of HiveThriftServer2 in my Scala application, so I imported the following line: import org.apache.spark.sql.hive.thriftserver._ However, when I compile the code, I get the following error: object thriftserver is not a member of package org.apache.spark.sql.hi

Timeout errors from Akka in Spark 1.2.1

2015-04-07 Thread Nikunj Bansal
I have a standalone and local Spark streaming process where we are reading inputs using FlumeUtils. Our longest window size is 6 hours. After about a day and a half of running without any issues, we start seeing Timeout errors while cleaning up input blocks. This seems to cause reading from Flume t

Re: broken link on Spark Programming Guide

2015-04-07 Thread Jonathan Greenleaf
Awesome. thank you! On Apr 7, 2015 8:55 PM, "Sean Owen" wrote: > I fixed this a while ago in master. It should go out with the next > release and next push of the site. > > On Tue, Apr 7, 2015 at 4:32 PM, jonathangreenleaf > wrote: > > in the current Programming Guide: > > https://spark.apache.

value reduceByKeyAndWindow is not a member of org.apache.spark.streaming.dstream.DStream[(String, Int)]

2015-04-07 Thread Su She
Hello Everyone, I am trying to implement this example (Spark Streaming with Twitter). https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/TwitterPopularTags.scala I am able to do: hashTags.print() to get a live stream of filtered hashtags, but

Re: Spark + Kinesis

2015-04-07 Thread Vadim Bichutskiy
Hey y'all, While I haven't been able to get Spark + Kinesis integration working, I pivoted to plan B: I now push data to S3 where I set up a DStream to monitor an S3 bucket with textFileStream, and that works great. I <3 Spark! Best, Vadim On Mon, Apr 6, 2015 at 12:23 PM, Vadim Bichutskiy <
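The plan-B setup described above, sketched with a placeholder bucket and batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("s3-monitor-sketch")
val ssc = new StreamingContext(conf, Seconds(60)) // placeholder batch interval

// textFileStream picks up new files that appear under the monitored prefix.
val lines = ssc.textFileStream("s3n://my-bucket/incoming/")
lines.count().print()

ssc.start()
ssc.awaitTermination()
```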

Re: Spark TeraSort source request

2015-04-07 Thread Pramod Biligiri
+1. I would love to have the code for this as well. Pramod On Fri, Apr 3, 2015 at 12:47 PM, Tom wrote: > Hi all, > > As we all know, Spark has set the record for sorting data, as published on: > https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html. > > Here at our group, we would lov

DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
Hello, I have a parquet file of around 55M rows (~ 1G on disk). Performing a simple grouping operation is pretty efficient (I get results within 10 seconds). However, after calling DataFrame.cache, I observe a significant performance degradation; the same operation now takes 3+ minutes. My hunch is that

Cannot change the memory of workers

2015-04-07 Thread Jia Yu
Hi guys, Currently I am running a Spark program on Amazon EC2. Each worker has around (less than but near to) 2 GB of memory. By default, I can see each worker is allocated 976 MB of memory as the table shows below on the Spark web UI. I know this value comes from (total memory minus 1 GB). But I want more than

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
Hi Justin, Does the schema of your data have any decimal, array, map, or struct type? Thanks, Yin On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip wrote: > Hello, > > I have a parquet file of around 55M rows (~ 1G on disk). Performing simple > grouping operation is pretty efficient (I get results w

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
The schema has a StructType. Justin On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai wrote: > Hi Justin, > > Does the schema of your data have any decimal, array, map, or struct type? > > Thanks, > > Yin > > On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip > wrote: > >> Hello, >> >> I have a parquet file of

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Yin Huai
I think the slowness is caused by the way that we serialize/deserialize the value of a complex type. I have opened https://issues.apache.org/jira/browse/SPARK-6759 to track the improvement. On Tue, Apr 7, 2015 at 6:59 PM, Justin Yip wrote: > The schema has a StructType. > > Justin > > On Tue, Ap

Re: 'Java heap space' error occurred when querying 4G data file from HDFS

2015-04-07 Thread 李铖
Thanks. Using this command and parameters, it works: *spark-submit --master yarn-client --executor-memory 8g --executor-cores 4 /home/hadoop/spark/main.py* 2015-04-08 5:45 GMT+08:00 Ted Yu : > 李铖: > w.r.t. #5, you can use --executor-cores when invoking spark-submit > > Cheers > > On Tue, Apr 7, 2

Re: streamSQL - is it available or is it in POC ?

2015-04-07 Thread haopu
I'm also interested in this project. Do you have any update on it? Is it still active? Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/streamSQL-is-it-available-or-is-it-in-POC-tp20993p22416.html Sent from the Apache Spark User List mailing list

EC2 spark-submit --executor-memory

2015-04-07 Thread spark_user_2015
Dear Spark team, I'm using the EC2 script to start up a Spark cluster. If I log in and use the executor-memory parameter in the submit script, the UI tells me that no cores are assigned to the job and nothing happens. Without executor-memory everything works fine... until I get "dag-scheduler-event-

Caching and Actions

2015-04-07 Thread spark_user_2015
I understand that RDDs are not created until an action is called. Is it a correct conclusion that it doesn't matter if ".cache" is used anywhere in the program if I only have one action that is called only once? Related to this question, consider this situation: val d1 = data.map((x,y,z) => (x,y)
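A sketch of the distinction being asked about, with made-up RDD names and predicates:

```scala
// With a single action called once, caching buys nothing: the lineage
// is evaluated exactly once either way.
val d1 = data.map { case (x, y, z) => (x, y) }
d1.count()

// cache() starts to matter once the same parent feeds several actions
// (or several downstream RDDs); without it, d1 is recomputed each time.
d1.cache()
val big   = d1.filter(_._2 > 100).count()
val small = d1.filter(_._2 <= 100).count()
```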

Unable to specify multiple directories as input

2015-04-07 Thread ๏̯͡๏
Hello, I have two HDFS directories, each containing multiple Avro files. I want to specify these two directories as input. In the Hadoop world, one can specify a list of comma-separated directories. In Spark that does not work. Logs 15/04/07 21:10:11 INFO storage.BlockManagerMaster: Updated info o

Re: Unable to specify multiple directories as input

2015-04-07 Thread ๏̯͡๏
Spark Version 1.3 Command: ./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-company/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
Thanks for the explanation Yin. Justin On Tue, Apr 7, 2015 at 7:36 PM, Yin Huai wrote: > I think the slowness is caused by the way that we serialize/deserialize > the value of a complex type. I have opened > https://issues.apache.org/jira/browse/SPARK-6759 to track the improvement. > > On Tue,

need info on Spark submit on yarn-cluster mode

2015-04-07 Thread sachin Singh
Hi, I observed that we have installed only one cluster, and when submitting a job as yarn-cluster I am getting the below error; so is the cause that the installation is only one cluster? Please correct me; if this is not the cause, then why am I not able to run in cluster mode? The spark-submit command is - spark-submit