Get application id when using SparkSubmit.main from java

2018-04-20 Thread Ron Gonzalez
Hi, I am trying to get the application id after I use SparkSubmit.main for a YARN submission. I am able to make it asynchronous using the spark.yarn.submit.waitAppCompletion=false configuration option, but I can't seem to figure out how to get the application id for this job. I read both SparkSubmit.s
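One hedged way around this, rather than calling SparkSubmit.main directly, is Spark's launcher API (org.apache.spark.launcher.SparkLauncher, available since 1.6): startApplication returns a SparkAppHandle whose getAppId yields the YARN application id once it is assigned. The jar path and main class below are placeholders.

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object SubmitAndGetAppId {
  def main(args: Array[String]): Unit = {
    // Launch asynchronously; the handle reports state and application id.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar") // placeholder jar
      .setMainClass("com.example.MyApp")     // placeholder main class
      .setMaster("yarn")
      .startApplication()

    // getAppId is null until YARN accepts the application.
    while (handle.getAppId == null && !handle.getState.isFinal)
      Thread.sleep(500)

    println(s"YARN application id: ${handle.getAppId}")
  }
}
```

The same handle can also be used to monitor state transitions or kill the application.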

Re: Get full RDD lineage for a spark job

2017-07-23 Thread Ron Gonzalez
wrote: Hi Ron, You can try using the toDebugString method on the RDD; this will print the RDD lineage. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez wrote: Hi,  Can someone point me to a test case or share sample code that is able to extract the RDD

Get full RDD lineage for a spark job

2017-07-21 Thread Ron Gonzalez
Hi, Can someone point me to a test case or share sample code that can extract the RDD graph from a Spark job at any point during its lifecycle? I understand that Spark has a UI that can show the graph of the execution, so I'm hoping it uses some API somewhere that I could use. I know RDD
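For reference, the API the UI builds on is exposed directly on the RDD: toDebugString prints the lineage, and a local-mode SparkContext is enough to see it. The transformations below are only illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100)
      .map(_ * 2)
      .filter(_ % 3 == 0)
    // Prints the RDD lineage -- the same DAG the web UI renders.
    println(rdd.toDebugString)
    sc.stop()
  }
}
```

Programmatically, rdd.dependencies can also be walked recursively to reconstruct the graph rather than parsing the debug string.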

Losing files in hdfs after creating spark sql table

2015-07-30 Thread Ron Gonzalez
Hi, After I create a table in Spark SQL and load an HDFS file into it, the file is no longer listed if I do hadoop fs -ls. Is this expected? Thanks, Ron

Re: Classifier for Big Data Mining

2015-07-21 Thread Ron Gonzalez
I'd use Random Forest. It will give you better generalizability. There are also a number of things you can do with RF that allow you to train on samples of the massive data set and then just average over the resulting models... Thanks, Ron On 07/21/2015 02:17 PM, Olivier Girardot wrote: depends
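As a hedged sketch of the suggestion, using MLlib's API of that era (the input path, tree count, and other parameters are placeholders to tune for your data):

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object RFDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rf").setMaster("local[2]"))
    // Train on a (possibly sampled) LabeledPoint dataset in LIBSVM format.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample.libsvm")
    val model = RandomForest.trainClassifier(
      data,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 8,
      maxBins = 32)
    println(model.toDebugString)
    sc.stop()
  }
}
```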

Question on Spark SQL for a directory

2015-07-21 Thread Ron Gonzalez
Hi, Question on using Spark SQL. Can someone give an example of creating a table from a directory containing Parquet files in HDFS instead of an actual Parquet file? Thanks, Ron On 07/21/2015 01:59 PM, Brandon White wrote: A few questions about caching a table in Spark SQL. 1) Is there an
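Two routes that work in the 1.3+ API, sketched here with a placeholder HDFS path and an assumed existing sqlContext: read.parquet accepts a directory, and the SQL DDL form registers a table over the same directory.

```scala
// DataFrame over every parquet file in the directory:
val df = sqlContext.read.parquet("hdfs:///data/events/")
df.registerTempTable("events")

// Equivalent pure-SQL route:
sqlContext.sql(
  """CREATE TEMPORARY TABLE events_sql
    |USING org.apache.spark.sql.parquet
    |OPTIONS (path 'hdfs:///data/events/')""".stripMargin)
```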

Re: Basic Spark SQL question

2015-07-14 Thread Ron Gonzalez
ing-the-thrift-jdbcodbc-server > >> On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang >> wrote: >> Well for adhoc queries you can use the CLI >> >>> On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez >>> wrote: >>> Hi, >>> I have a q

Basic Spark SQL question

2015-07-13 Thread Ron Gonzalez
Hi, I have a question about Spark SQL. Is there a way to use Spark SQL on YARN without having to submit a job? The bottom line is that I want to reduce the latency of running queries as a job. I know that the Spark SQL default submission is like a job, but was wondering if i
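The reply above points at the Thrift JDBC/ODBC server: a single long-running context serves all queries, so there is no per-query submission latency. A sketch, assuming default ports and a standard Spark layout:

```shell
# Start the long-running Thrift JDBC/ODBC server once on YARN:
$SPARK_HOME/sbin/start-thriftserver.sh --master yarn-client

# Subsequent queries go to the already-running context:
beeline -u jdbc:hive2://localhost:10000 -e "SELECT count(*) FROM my_table"
```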

Re: error with pyspark

2014-08-11 Thread Ron Gonzalez
If you're running on Ubuntu, do ulimit -n, which gives the max number of allowed open files. You will have to change the value in /etc/security/limits.conf to something like 1, logout and log back in. Thanks, Ron Sent from my iPad > On Aug 10, 2014, at 10:19 PM, Davies Liu wrote: > >> On
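For reference, the relevant limits.conf entries look like the following; the 10000 value is only an example, so pick a limit that suits your workload:

```
# /etc/security/limits.conf -- example values, adjust to your workload
*    soft    nofile    10000
*    hard    nofile    10000
```

The new limit takes effect only for sessions started after logging back in; verify with ulimit -n.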

Re: Save an RDD to a SQL Database

2014-08-06 Thread Ron Gonzalez
Hi Vida, It's possible to save an RDD as a hadoop file using hadoop output formats. It might be worthwhile to investigate using DBOutputFormat and see if this will work for you. I haven't personally written to a db, but I'd imagine this would be one way to do it. Thanks, Ron Sent from my i

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron Gonzalez
One key thing I forgot to mention is that I changed the avro version to 1.7.7 to get AVRO-1476. I took a closer look at the jars, and what I noticed is that the assembly jars that work do not have the org.apache.avro.mapreduce package packaged into the assembly. For spark-1.0.1, org.apache.avro

Re: Computing mean and standard deviation by key

2014-08-04 Thread Ron Gonzalez
Cool, thanks!  On Monday, August 4, 2014 8:58 AM, kriskalish wrote: Hey Ron, It was pretty much exactly as Sean had depicted. I just needed to provide count with an anonymous function to tell it which elements to count. Since I wanted to count them all, the function is simply "true".         va

Re: Computing mean and standard deviation by key

2014-08-01 Thread Ron Gonzalez
Can you share the mapValues approach you did? Thanks, Ron Sent from my iPhone > On Aug 1, 2014, at 3:00 PM, kriskalish wrote: > > Thanks for the help everyone. I got the mapValues approach working. I will > experiment with the reduceByKey approach later. > > <3 > > -Kris > > > > > -- >
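A sketch of the single-pass approach discussed in this thread: map each value to (count, sum, sum-of-squares), combine with reduceByKey, then derive mean and standard deviation in mapValues. The RDD name and key type are illustrative.

```scala
// pairs: RDD[(String, Double)]
val stats = pairs
  .mapValues(x => (1L, x, x * x))                 // (count, sum, sumSq)
  .reduceByKey { case ((c1, s1, q1), (c2, s2, q2)) =>
    (c1 + c2, s1 + s2, q1 + q2)
  }
  .mapValues { case (c, s, q) =>
    val mean = s / c
    val variance = q / c - mean * mean            // population variance
    (mean, math.sqrt(variance))
  }
```

This does one shuffle regardless of key count, unlike a groupByKey-based version that materializes every value per key.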

Re: Is there a way to write spark RDD to Avro files

2014-08-01 Thread Ron Gonzalez
You have to import org.apache.spark.rdd._, which will automatically make available this method. Thanks, Ron Sent from my iPhone > On Aug 1, 2014, at 3:26 PM, touchdown wrote: > > Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small > avro files into one avro file. I re
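A hedged sketch of the write side with the new-API Hadoop output format (assumes Avro 1.7.x with the avro-mapred artifact on the classpath; the path and schema are caller-supplied):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark
import org.apache.spark.rdd.RDD

def saveAvro(records: RDD[GenericRecord], schema: Schema, path: String): Unit = {
  val job = Job.getInstance()
  AvroJob.setOutputKeySchema(job, schema)
  records
    .map(r => (new AvroKey[GenericRecord](r), NullWritable.get()))
    .saveAsNewAPIHadoopFile(
      path,
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable],
      classOf[AvroKeyOutputFormat[GenericRecord]],
      job.getConfiguration)
}
```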

NotSerializableException

2014-07-30 Thread Ron Gonzalez
Hi, I took Avro 1.7.7 and recompiled my distribution to fix the issue when dealing with Avro GenericRecord; that issue was resolved. I'm referring to AVRO-1476. I also enabled Kryo registration in SparkConf. That said, I am still seeing a NotSerializableException for Schema

Re: Issue submitting spark job to yarn

2014-07-25 Thread Ron Gonzalez
le is overwritten in hdfs after it's been registered as a local resource. Node manager logs are your friend! Just sharing in case other folks run into the same problem. Thanks, Ron Sent from my iPhone > On Jul 25, 2014, at 9:36 AM, Ron Gonzalez wrote: > > Folks, > I've

Issue submitting spark job to yarn

2014-07-25 Thread Ron Gonzalez
Folks,   I've been able to submit simple jobs to yarn thus far. However, when I did something more complicated that added 194 dependency jars using --addJars, the job fails in YARN with no logs. What ends up happening is that no container logs get created (app master or executor). If I add just

Re: cache changes precision

2014-07-24 Thread Ron Gonzalez
s just by chance that this ends up changing your > average to be rounded. > > Can you try with cloning the records in the map call? Also look at the > contents and see if they're actually changed, or if the resulting RDD after a > cache is just the last record "smeared"

cache changes precision

2014-07-24 Thread Ron Gonzalez
Hi,   I'm doing the following:   def main(args: Array[String]) = {     val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[2]")     val sc = new SparkContext(sparkConf)     val conf = new Configuration()     val job = new Job(conf)     val path = new Path("/tmp/a.avro");     va
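The likely root cause discussed in the reply above: Hadoop input formats reuse one record object per partition, so caching before copying stores many references to the same mutated record. A hedged sketch of the clone-in-map fix, assuming Avro GenericRecord input (the RDD name is illustrative):

```scala
import org.apache.avro.generic.{GenericData, GenericRecord}

// avroPairs: RDD[(AvroKey[GenericRecord], NullWritable)] from newAPIHadoopFile.
// Deep-copy each record before caching so the cached RDD holds distinct objects.
val safe = avroPairs
  .map { case (k, _) =>
    val rec = k.datum()
    GenericData.get().deepCopy(rec.getSchema, rec)
  }
  .cache()
```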

Possible bug in ClientBase.scala?

2014-07-13 Thread Ron Gonzalez
Hi, I was doing programmatic submission of Spark yarn jobs and I saw code in ClientBase.getDefaultYarnApplicationClasspath(): val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH") MRJobConfig doesn't have this field, so the created launch env is incomplete. Workaround i

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
I am able to use Client.scala or LauncherExecutor.scala as my programmatic entry point for Yarn. Thanks, Ron Sent from my iPad > On Jul 9, 2014, at 7:14 AM, Jerry Lam wrote: > > +1 as well for being able to submit jobs programmatically without using shell > script. > > we also experience is

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
Koert, Yeah I had the same problems trying to do programmatic submission of spark jobs to my Yarn cluster. I was ultimately able to resolve it by reviewing the classpath and debugging through all the different things that the Spark Yarn client (Client.scala) did for submitting to Yarn (like env

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Ron Gonzalez
The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend building your Spark jobs in the main method without specifying how they are deployed. Then you can use spark-submit to tell Spark how you would want to deploy to it using ya

Re: Setting queue for spark job on yarn

2014-05-20 Thread Ron Gonzalez
Btw, I'm on 0.9.1. Will setting a queue programmatically be available in 1.0? Thanks, Ron Sent from my iPad > On May 20, 2014, at 6:27 PM, Ron Gonzalez wrote: > > Hi Sandy, > Is there a programmatic way? We're building a platform as a service and > need to assi

Re: Setting queue for spark job on yarn

2014-05-20 Thread Ron Gonzalez
What version are you using? For 0.9, you need to set it outside your code > with the SPARK_YARN_QUEUE environment variable. > > -Sandy > > >> On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez wrote: >> Hi, >> How does one submit a spark job to yarn and specify a q

Setting queue for spark job on yarn

2014-05-19 Thread Ron Gonzalez
Hi,   How does one submit a spark job to yarn and specify a queue?   The code that successfully submits to yarn is:    val conf = new SparkConf()    val sc = new SparkContext("yarn-client", "Simple App", conf)    Where do I need to specify the queue?   Thanks in advance for any help on this...
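In 1.0+, the queue can be set on SparkConf via the spark.yarn.queue property before the context is created (the queue name below is a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.yarn.queue", "platform-queue") // placeholder queue name
val sc = new SparkContext("yarn-client", "Simple App", conf)
```

With spark-submit the equivalent is the --queue flag.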

Re: Avro serialization

2014-04-04 Thread Ron Gonzalez
> doing an avro one for this you probably want one of : >> https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/*ProtoBuf* >> >> or just whatever you're using at the moment to open them in a MR job probably >> co

Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-04 Thread Ron Gonzalez
Hi, Can you explain a little more what's going on? Which one submits a job to the yarn cluster that creates an application master and spawns containers for the local jobs? I tried yarn-client and submitted to our yarn cluster and it seems to work that way. Shouldn't Client.scala be running wi

Re: Submitting to yarn cluster

2014-04-03 Thread Ron Gonzalez
to make sure it propagates everywhere. There are also places it calls SparkHadoopUtil.get.newConfiguration() so not sure those would handle it properly. You can always file a jira to add support for it and see what people think. Tom On Thursday, April 3, 2014 8:46 AM, Ron Gonzalez wrote: Rig

Avro serialization

2014-04-03 Thread Ron Gonzalez
Hi, I know that sources need to either be Java-serializable or use Kryo serialization. Does anyone have sample code that reads, transforms and writes Avro files in Spark? Thanks, Ron
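A hedged sketch of the read-and-transform side using the Avro 1.7.x new-API input format (the path and field name are placeholders); once read, the records transform like any other RDD:

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("avro-read").setMaster("local[2]"))
val fields = sc.newAPIHadoopFile(
    "hdfs:///data/in.avro",
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
  .map { case (k, _) => k.datum().get("someField") } // field name is illustrative
```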

Re: Submitting to yarn cluster

2014-04-03 Thread Ron Gonzalez
_DIR is getting put into your classpath.   I would also make sure  HADOOP_PREFIX is being set. Tom On Wednesday, April 2, 2014 10:10 PM, Ron Gonzalez wrote: Hi,   I have a small program but I cannot seem to make it connect to the right properties of the cluster.   I have the SPARK_YARN_APP_JAR

Submitting to yarn cluster

2014-04-02 Thread Ron Gonzalez
Hi,   I have a small program but I cannot seem to make it connect to the right properties of the cluster.   I have the SPARK_YARN_APP_JAR, SPARK_JAR and SPARK_HOME set properly.   If I run this scala file, I am seeing that this is never using the yarn.resourcemanager.address property that I set o