How to get the radius of clusters in spark K means

2015-08-19 Thread ashensw
We can get the cluster centers in K-means clustering. Likewise, is there any method in Spark to get the cluster radius? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-the-radius-of-clusters-in-spark-K-means-tp24353.html Sent from the Apache Spark Use
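MLlib's KMeansModel does not expose a radius directly, but one can be derived from the cluster centers. A minimal Scala sketch, assuming the radius is defined as the distance from a center to its farthest assigned point (the helper name clusterRadii and the training call are illustrative, not from the thread):

    import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}
    import org.apache.spark.rdd.RDD

    // data: RDD[Vector]; train as usual, e.g.
    // val model: KMeansModel = KMeans.train(data, 3, 20)

    def clusterRadii(data: RDD[Vector], model: KMeansModel): Map[Int, Double] = {
      data.map { point =>
        val clusterId = model.predict(point)                    // assign the point to a cluster
        val center = model.clusterCenters(clusterId)
        (clusterId, math.sqrt(Vectors.sqdist(point, center)))   // distance to its own center
      }.reduceByKey((a, b) => math.max(a, b))                   // keep the farthest point per cluster
       .collectAsMap()
       .toMap
    }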

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Can you do some more profiling? I am wondering if the driver is busy scanning HDFS / S3; check with jstack. Also, it would be great if you could paste the physical plan for the simple query. From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August 20, 2015 1:46 PM To:

creating data warehouse with Spark and running query with Hive

2015-08-19 Thread Jeetendra Gangele
Hi All, I have data in HDFS partitioned by Year/month/data/event_type, and I am creating Hive tables with this data. The data is in JSON, so I am using the JSON serde to create the Hive tables. Below is the code: val jsonFile = hiveContext.read.json("hdfs://localhost:9000/housing/events_real/cate

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I cloned from TOT after the 1.5.0 cut-off. I noticed there were a couple of CLs trying to speed up Spark SQL with tables with a huge number of partitions; I've made sure that those CLs are included, but it's still very slow. On Wed, Aug 19, 2015 at 10:43 PM, Cheng, Hao wrote: > Yes, you can try set th

RE: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Cheng, Hao
Yes, you can try setting spark.sql.sources.partitionDiscovery.enabled to false. BTW, which version are you using? Hao From: Jerrick Hoang [mailto:jerrickho...@gmail.com] Sent: Thursday, August 20, 2015 12:16 PM To: Philip Weaver Cc: user Subject: Re: Spark Sql behaves strangely with tables with
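For reference, a sketch of how the flag mentioned above can be set (the exact effect on planning is an assumption to be verified against the Spark version in use):

    // in the application, before issuing the query
    sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")

    // or for the whole job via spark-submit:
    //   --conf spark.sql.sources.partitionDiscovery.enabled=false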

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
I guess the question is why does spark have to do partition discovery with all partitions when the query only needs to look at one partition? Is there a conf flag to turn this off? On Wed, Aug 19, 2015 at 9:02 PM, Philip Weaver wrote: > I've had the same problem. It turns out that Spark (specifi

Re: Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Philip Weaver
I've had the same problem. It turns out that Spark (specifically parquet) is very slow at partition discovery. It got better in 1.5 (not yet released), but was still unacceptably slow. Sadly, we ended up reading parquet files manually in Python (via C++) and had to abandon Spark SQL because of this

Spark Sql behaves strangely with tables with a lot of partitions

2015-08-19 Thread Jerrick Hoang
Hi all, I did a simple experiment with Spark SQL. I created a partitioned parquet table with only one partition (date=20140701). A simple `select count(*) from table where date=20140701` would run very fast (0.1 seconds). However, as I added more partitions the query takes longer and longer. When

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread canan chen
Thanks Andrew, makes sense. On Thu, Aug 20, 2015 at 4:40 AM, Andrew Or wrote: > Hi Canan, > > The event log dir is a per-application setting whereas the history server > is an independent service that serves history UIs from many applications. > If you use history server locally then the `spark.h

Re: issue Running Spark Job on Yarn Cluster

2015-08-19 Thread stark_summer
Please look further into the Hadoop logs, e.g. yarn logs -applicationId xxx, and attach more logs to this topic. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/issue-Running-Spark-Job-on-Yarn-Cluster-tp21779p24350.html Sent from the Apache Spark User List mai

Re: SQLContext Create Table Problem

2015-08-19 Thread Yusuf Can Gürkan
Hey, This is my spark-env: # Add Hadoop libraries to Spark classpath SPARK_CLASSPATH="${SPARK_CLASSPATH}:${HADOOP_HOME}/*:${HADOOP_HOME}/../hadoop-hdfs/*:${HADOOP_HOME}/../hadoop-mapreduce/*:${HADOOP_HOME}/../hadoop-yarn/*" LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${HADOOP_HOME}/lib/native" # Add Hiv

How to set the number of executors and tasks in a Spark Streaming job in Mesos

2015-08-19 Thread swetha
Hi, How to set the number of executors and tasks in a Spark Streaming job in Mesos? I have the following settings but my job still shows me 11 active tasks and 11 executors. Any idea as to why this is happening ? sparkConf.set("spark.mesos.coarse", "true") sparkConf.set("spark.cores.max",
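The settings in the message are truncated, but roughly: on coarse-grained Mesos the executor/core count is capped by spark.cores.max, while the number of tasks per stage follows the partition count. A hedged sketch (the values are placeholders, not taken from the thread):

    val sparkConf = new org.apache.spark.SparkConf()
      .set("spark.mesos.coarse", "true")
      .set("spark.cores.max", "24")            // upper bound on total cores across the cluster
      .set("spark.default.parallelism", "48")  // default task count for shuffles

    // For a given stage the partition count wins, so repartitioning changes the task count:
    // rdd.repartition(48)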

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread Andrew Or
Hi Canan, The event log dir is a per-application setting whereas the history server is an independent service that serves history UIs from many applications. If you use history server locally then the `spark.history.fs.logDirectory` will happen to point to `spark.eventLog.dir`, but the use case it

Re: Spark return key value pair

2015-08-19 Thread Robin East
Dawid is right: if you did words.count it would be twice the number of input lines. You can use map like this: words = lines.map(mapper2) for i in words.take(10): msg = i[0] + ":" + i[1] + "\n" --- Robin East
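The same map-versus-flatMap distinction, sketched in Scala for clarity (the "user#password" layout is taken from the original question; the variable names are illustrative):

    val lines = sc.textFile("test.sql")

    // flatMap flattens each split array, so the count doubles and the pairing is lost
    val flattened = lines.flatMap(_.split("#"))        // RDD[String]

    // map keeps one (user, password) tuple per input line
    val pairs = lines.map { line =>
      val Array(user, password) = line.split("#", 2)   // assumes every line contains a '#'
      (user, password)                                 // RDD[(String, String)]
    }

    pairs.take(10).foreach { case (user, password) => println(user + ":" + password) }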

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Andrew Or
Hi Axel, what spark version are you using? Also, what do your configurations look like for the following? - spark.cores.max (also --total-executor-cores) - spark.executor.cores (also --executor-cores) 2015-08-19 9:27 GMT-07:00 Axel Dahl : > hmm maybe I spoke too soon. > > I have an apache zeppe

Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Andrew Or
Yes, in other words, a "bucket" is a single file in hash-based shuffle (no consolidation), but a segment of a partitioned file in sort-based shuffle. 2015-08-19 5:52 GMT-07:00 Muhammad Haseeb Javed <11besemja...@seecs.edu.pk>: > Thanks Andrew for a detailed response, > > So the reason why key value

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-19 Thread Romi Kuntsman
If you create a PairRDD from the DataFrame, using dataFrame.toRDD().mapToPair(), then you can call partitionBy(someCustomPartitioner) which will partition the RDD by the key (of the pair). Then the operations on it (like joining with another RDD) will consider this partitioning. I'm not sure that D
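A minimal Scala sketch of that approach, assuming two DataFrames (leftDf and rightDf, both hypothetical) keyed by their first column; note that dropping to the RDD API gives up the DataFrame optimizations, so this is a workaround rather than the recommended path:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.sql.Row

    val partitioner = new HashPartitioner(200)

    val left = leftDf.rdd.map(row => (row.getString(0), row)).partitionBy(partitioner)
    val right = rightDf.rdd.map(row => (row.getString(0), row)).partitionBy(partitioner)

    // both sides share the partitioner, so the join can avoid a full reshuffle
    val joined = left.join(right)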

Re: Issues with S3 paths that contain colons

2015-08-19 Thread Romi Kuntsman
I had the exact same issue, and overcame it by overriding NativeS3FileSystem with my own class, where I replaced the implementation of globStatus. It's a hack but it works. Then I set the hadoop config fs.myschema.impl to my class name, and accessed the files through myschema:// instead of s3n://
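A sketch of the wiring described above; com.example.MyS3FileSystem stands in for the hypothetical NativeS3FileSystem subclass with globStatus overridden:

    // register the custom scheme with Hadoop
    sc.hadoopConfiguration.set("fs.myschema.impl", "com.example.MyS3FileSystem")

    // then address files through the custom scheme instead of s3n://
    val rdd = sc.textFile("myschema://my-bucket/dir/file:with:colons.log")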

PySpark on Mesos - Scaling

2015-08-19 Thread scox
I'm running a pyspark program on a Mesos cluster and seeing behavior where: * The first three stages run with multiple (10-15) tasks. * The fourth stage runs with only one task. * It is using 10 cpus, which is 5 machines in this configuration * It is very slow I would like it to use more resources

spark.sql.shuffle.partitions=1 seems to be working fine but creates timeout for large skewed data

2015-08-19 Thread unk1102
Hi, I have a Spark job which deals with a large skewed dataset. I have around 1000 Hive partitions to process in four different tables every day. So if I go with the default of 200 spark.sql.shuffle.partitions created by Spark, I end up with 4 * 1000 * 200 = 800,000 small files in HDFS, which wo
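For reference, the two knobs that usually drive the output file count in this situation; a hedged sketch with placeholder values:

    // fewer shuffle partitions for the SQL job
    sqlContext.setConf("spark.sql.shuffle.partitions", "1")

    // or collapse partitions explicitly just before writing
    // df.coalesce(1).write.parquet(outputPath)   // coalesce avoids an extra shuffle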

Re: Does spark sql support column indexing

2015-08-19 Thread Michael Armbrust
Maintaining traditional B-Tree or Hash indexes in a system like Spark SQL would be very difficult, mostly since a very common use case is to process data that we don't own (i.e. a folder in HDFS or a bucket on S3). As such, we don't have the opportunity to update the index, since users can add dat

How to avoid executor time out on yarn spark while dealing with large shuffle skewed data?

2015-08-19 Thread unk1102
I have one Spark job which seems to run fine, but after an hour or so executors start getting lost because of timeouts, with something like the following error: cluster.YarnScheduler : Removing an executor 14 65 timeout exceeds 60 seconds. Because of the above error, a couple of chained errors starts
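A hedged sketch of the timeout settings that typically govern this kind of error (which timeout actually fires depends on the cluster; the values below are assumptions, not taken from the thread):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.network.timeout", "600s")            // default 120s; umbrella for several network timeouts
      .set("spark.executor.heartbeatInterval", "60s")  // keep well below spark.network.timeout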

Re: Issues with S3 paths that contain colons

2015-08-19 Thread Steve Loughran
You might want to think about filing a JIRA on issues.apache.org against HADOOP here, component being fs/s3. That doesn't mean it is fixable, only known. Every FS has its own set of forbidden characters & filenames; unix doesn't allow files named "."; windows doesn't allow files called COM1, ..., so h

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Shushant Arora
Will try scala. The only reason for not using scala is that I've never worked with it. On Wed, Aug 19, 2015 at 7:34 PM, Cody Koeninger wrote: > Is there a reason not to just use scala? It's not a lot of code... and > it'll be even less code in scala ;) > > On Wed, Aug 19, 2015 at 4:14 AM, Shushant Arora >

SQLContext load. Filtering files

2015-08-19 Thread Masf
Hi. I'd like to read Avro files using this library: https://github.com/databricks/spark-avro. I need to load several files from a folder, not all files. Is there some functionality to filter the files to load? And is it possible to know the names of the files loaded from a folder? My problem is
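A sketch under the assumption that a glob pattern (resolved by the underlying Hadoop path handling) is enough to narrow down the files; the path below is made up:

    val df = sqlContext.read
      .format("com.databricks.spark.avro")
      .load("/data/events/2015-08-1[0-9]/*.avro")   // the glob selects a subset of the folder's files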

Re: Spark Streaming failing on YARN Cluster

2015-08-19 Thread Ramkumar V
I'm getting a Spark exception. Please look at this log trace (http://pastebin.com/xL9jaRUa). Thanks, On Wed, Aug 19, 2015 at 10:20 PM, Hari Shreedharan < hshreedha...@cloudera.com> wrote: > It looks like you are havi

Re: broadcast variable of Kafka producer throws ConcurrentModificationException

2015-08-19 Thread Marcin Kuthan
I'm glad that I could help :) On 19 Aug 2015 8:52 AM, "Shenghua(Daniel) Wan" wrote: > +1 > > I wish I had read this blog earlier. I am using Java and have just > implemented a singleton producer per executor/JVM during the day. > Yes, I did see that NonSerializableException when I was debugging

Re: Spark Streaming failing on YARN Cluster

2015-08-19 Thread Hari Shreedharan
It looks like you are having issues with the files getting distributed to the cluster. What is the exception you are getting now? On Wednesday, August 19, 2015, Ramkumar V wrote: > Thanks a lot for your suggestion. I had modified HADOOP_CONF_DIR in > spark-env.sh so that core-site.xml is under H

Re: Too many files/dirs in hdfs

2015-08-19 Thread Mohit Anchlia
My question was how to do this in Hadoop. Could somebody point me to some examples? On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY wrote: > Of course, Java or Scala can do that: > 1) Create a FileWriter with append or roll over option > 2) For each RDD create a StringBuilder after applying yo
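A minimal sketch of the append/roll-over idea using the plain Hadoop FileSystem API (the paths are made up, and append requires dfs.support.append to be enabled on the cluster):

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new URI("hdfs://namenode:8020"), new Configuration())
    val path = new Path("/data/output/2015-08-19.log")

    // append to the existing file if present, otherwise create it
    val out = if (fs.exists(path)) fs.append(path) else fs.create(path)
    try {
      out.write("one record per line\n".getBytes("UTF-8"))
    } finally {
      out.close()
    }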

Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Igor Berman
You don't need to register; search on YouTube for this video... On 19 August 2015 at 18:34, Gourav Sengupta wrote: > Excellent resource: http://www.oreilly.com/pub/e/3330 > > And more amazing is the fact that the presenter actually responds to your > questions. > > Regards, > Gourav Sengupta > >

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Axel Dahl
Hmm, maybe I spoke too soon. I have an Apache Zeppelin instance running and have configured it to use 48 cores (each node only has 16 cores), so I figured setting it to 48 would mean that Spark would grab 3 nodes. What happens instead, though, is that Spark reports that 48 cores are being used,

Re: SQLContext Create Table Problem

2015-08-19 Thread Yusuf Can Gürkan
Hey Yin, thanks for the answer. I thought that this could be the problem, but I cannot create a HiveContext because I cannot import org.apache.spark.sql.hive.HiveContext; it does not see this package. I read that I should build Spark with -Phive, but I'm running on Amazon EMR 1.4.1 and on spark-shell I

Re: SQLContext Create Table Problem

2015-08-19 Thread Yin Huai
Can you try to use HiveContext instead of SQLContext? Your query is trying to create a table and persist the metadata of the table in metastore, which is only supported by HiveContext. On Wed, Aug 19, 2015 at 8:44 AM, Yusuf Can Gürkan wrote: > Hello, > > I’m trying to create a table with sqlCont
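A sketch of the suggested change (the DDL is abbreviated from the original message below); HiveContext extends SQLContext, so the rest of the code can stay the same, provided the Spark build includes Hive support:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext()
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    hiveContext.sql(
      "create table if not exists landing (date string, referrer string) " +
      "partitioned by (partnerid string, dt string)")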

SQLContext Create Table Problem

2015-08-19 Thread Yusuf Can Gürkan
Hello, I’m trying to create a table with sqlContext.sql method as below: val sc = new SparkContext() val sqlContext = new SQLContext(sc) import sqlContext.implicits._ sqlContext.sql(s""" create table if not exists landing ( date string, referrer string ) partitioned by (partnerid string,dt stri

Re: how do I execute a job on a single worker node in standalone mode

2015-08-19 Thread Axel Dahl
That worked great, thanks Andrew. On Tue, Aug 18, 2015 at 1:39 PM, Andrew Or wrote: > Hi Axel, > > You can try setting `spark.deploy.spreadOut` to false (through your > conf/spark-defaults.conf file). What this does is essentially try to > schedule as many cores on one worker as possible before

Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Gourav Sengupta
Excellent resource: http://www.oreilly.com/pub/e/3330 And more amazing is the fact that the presenter actually responds to your questions. Regards, Gourav Sengupta On Wed, Aug 19, 2015 at 4:12 PM, Todd wrote: > Hi, > I would ask if there are some blogs/articles/videos on how to analyse > spark

blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Todd
Hi, I would like to ask if there are some blogs/articles/videos on how to analyse Spark performance at runtime, e.g. tools that can be used or something related.

ValueError: Can only zip with RDD which has the same number of partitions error on one machine but not on another

2015-08-19 Thread Abhinav Mishra
Hi, I have this piece of code which works fine on one machine but when I run this on another machine I get error as - "ValueError: Can only zip with RDD which has the same number of partitions". My code is: rdd2 = sc.parallelize(list1) rdd3 = rdd1.zip(rdd2).map(lambda ((x1,x2,x3,x4), y): (y,x2, x
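One workaround, sketched under the assumption that pairing elements by position is what's wanted: zip requires both RDDs to have the same number of partitions and the same number of elements per partition, which parallelize() does not guarantee across machines, whereas an explicit index join has no such constraint:

    val indexed1 = rdd1.zipWithIndex().map(_.swap)   // (index, element)
    val indexed2 = rdd2.zipWithIndex().map(_.swap)

    val zipped = indexed1.join(indexed2)             // (index, (elem1, elem2))
      .sortByKey()
      .values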

Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-19 Thread jlg
Some background on what we're trying to do: We have four Kinesis receivers with varying amounts of data coming through them. Ultimately we work on a unioned stream that is getting about 11 MB/second of data. We use a batch size of 5 seconds. We create four distinct DStreams from this data that h

How to overwrite partition when writing Parquet?

2015-08-19 Thread Romi Kuntsman
Hello, I have a DataFrame, with a date column which I want to use as a partition. Each day I want to write the data for the same date in Parquet, and then read a dataframe for a date range. I'm using: myDataframe.write().partitionBy("date").mode(SaveMode.Overwrite).parquet(parquetDir); If I use
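A common workaround (an assumption on my part, not an official per-partition overwrite API): write each day's slice directly into its own partition directory, so Overwrite only replaces that directory rather than the whole table root:

    import org.apache.spark.sql.SaveMode

    val date = "2015-08-19"
    myDataframe
      .filter(s"date = '$date'")
      .drop("date")                               // the value is carried in the directory name
      .write
      .mode(SaveMode.Overwrite)
      .parquet(s"$parquetDir/date=$date")

    // reads over a date range still go through the partitioned root:
    // sqlContext.read.parquet(parquetDir).where("date >= '2015-08-01' and date <= '2015-08-19'")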

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Cody Koeninger
Is there a reason not to just use scala? It's not a lot of code... and it'll be even less code in scala ;) On Wed, Aug 19, 2015 at 4:14 AM, Shushant Arora wrote: > To correct myself - KafkaRDD[K, V, U, T, R] is subclass of RDD[R] but > Option is not subclass of Option; > > In scala C[T’] is a

Re: Scala: How to match a java object????

2015-08-19 Thread Ted Yu
Saif: In your example below, the error was due to there being no automatic conversion from Int to BigDecimal. Cheers > On Aug 19, 2015, at 6:40 AM, > wrote: > > Hi, thank you all for the assistance. > > It is odd, it works when creating a new java.math.BigDecimal object, but not > if I wo

RE: Scala: How to match a java object????

2015-08-19 Thread Saif.A.Ellafi
It is okay. This methodology works very well for mapping objects of my Seq[Any]. It is indeed very cool :-) Saif From: Ted Yu [mailto:yuzhih...@gmail.com] Sent: Wednesday, August 19, 2015 10:47 AM To: Ellafi, Saif A. Cc: ; ; Subject: Re: Scala: How to match a java object Saif: In your exam

RE: Scala: How to match a java object????

2015-08-19 Thread Saif.A.Ellafi
Hi, thank you all for the assistance. It is odd: it works when creating a new java.math.BigDecimal object, but not if I work directly with scala> 5 match { case x: java.math.BigDecimal => 2 } :23: error: scrutinee is incompatible with pattern type; found : java.math.BigDecimal required: Int
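For illustration, the pattern compiles once the scrutinee's static type admits java.math.BigDecimal, which is what happens naturally with elements of a Seq[Any]:

    val value: Any = new java.math.BigDecimal(5)

    val result = value match {
      case x: java.math.BigDecimal => 2
      case i: Int                  => 1
      case _                       => 0
    }
    // result == 2; the literal `5 match { ... }` fails because an Int can never be a BigDecimal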

Re: What is the reason for ExecutorLostFailure?

2015-08-19 Thread VIJAYAKUMAR JAWAHARLAL
Hints are good. Thanks Corey. I will try to find out more from the logs. > On Aug 18, 2015, at 7:23 PM, Corey Nolet wrote: > > Usually more information as to the cause of this will be found down in your > logs. I generally see this happen when an out of memory exception has > occurred for one

RE: Failed to fetch block error

2015-08-19 Thread java8964
From the log, it looks like the OS user who is running Spark cannot open any more files. Check the ulimit setting for that user (ulimit -a), e.g.: open files (-n) 65536 > Date: Tue, 18 Aug 2015 22:06:04 -0700 > From: swethakasire...@gmail.com > To: user@spark.apache.org > Subject: F

Re: Difference between Sort based and Hash based shuffle

2015-08-19 Thread Muhammad Haseeb Javed
Thanks Andrew for a detailed response, So the reason why key value pairs with same keys are always found in a single buckets in Hash based shuffle but not in Sort is because in sort-shuffle each mapper writes a single partitioned file, and it is up to the reducer to fetch correct partitions from t

Fwd: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
Oops, forgot to reply-all on this thread. -- Forwarded message -- From: Rick Moritz Date: Wed, Aug 19, 2015 at 2:46 PM Subject: Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell To: Igor Berman Those values are not explicitly set, and attempting to read

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread Zoltán Zvara
I personally build with SBT and run Spark on YARN with IntelliJ. You need to connect to the remote JVMs with a remote debugger. You also need to do something similar if you use Python, because it will launch a JVM on the driver as well. On Wed, Aug 19, 2015 at 2:10 PM canan chen wrote: > Thanks Ted. I notice

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread canan chen
Anyone know about this? Or am I missing something here? On Fri, Aug 7, 2015 at 4:20 PM, canan chen wrote: > Is there any reason that historyserver use another property for the event > log dir ? Thanks >

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread canan chen
Thanks Ted. I noticed another thread about running Spark programmatically (client mode for standalone and yarn). Would it be much easier to debug Spark if that is possible? Hasn't anyone thought about it? On Wed, Aug 19, 2015 at 5:50 PM, Ted Yu wrote: > See this thread: > > > http://search-hadoo

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
I would compare the Spark UI metrics for both cases and see any differences (number of partitions, number of spills, etc.). Why can't you make the REPL consistent with the Zeppelin Spark version? The RC might have issues... On 19 August 2015 at 14:42, Rick Moritz wrote: > No, the setup is one driver wit

Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread William Briggs
When submitting to YARN, you can specify two different operation modes for the driver with the "--master" parameter: yarn-client or yarn-cluster. For more information on submitting to YARN, see this page in the Spark docs: http://spark.apache.org/docs/latest/running-on-yarn.html yarn-cluster mode

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
No, the setup is one driver with 32g of memory, and three executors each with 8g of memory in both cases. No core-number has been specified, thus it should default to single-core (though I've seen the yarn-owned jvms wrapping the executors take up to 3 cores in top). That is, unless, as I suggested

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
any differences in number of cores, memory settings for executors? On 19 August 2015 at 09:49, Rick Moritz wrote: > Dear list, > > I am observing a very strange difference in behaviour between a Spark > 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin > interpreter (comp

Re: Spark return key value pair

2015-08-19 Thread Dawid Wysakowicz
I am not 100% sure, but flatMap probably unwinds the tuples. Try map instead. 2015-08-19 13:10 GMT+02:00 Jerry OELoo : > Hi. > I want to parse a file and return a key-value pair with pySpark, but > the result is strange to me. > test.sql is a big file and each line is a username and password, wit

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Rick Moritz
Creating a franken-jar and replacing the differing .class in my spark-assembly with the one compiled with Java 1.6 appears to make no significant difference with regard to the generated shuffle volume. I will try using FAIR scheduling from the shell after the spark-submit test, to see if that has a

Spark return key value pair

2015-08-19 Thread Jerry OELoo
Hi. I want to parse a file and return a key-value pair with pySpark, but the result is strange to me. test.sql is a big file and each line is a username and password, with # between them. I use the mapper2 below to map the data, and in my understanding, i in words.take(10) should be a tuple, but the result is

Spark UI returning error 500 in yarn-client mode

2015-08-19 Thread mosheeshel
Running Spark on YARN in yarn-client mode, everything appears to be working just fine except I can't access Spark UI via the Application Master link in YARN UI or directly at http://driverip:4040/jobs I get error 500, and the driver log shows the error pasted below: When running the same job using

Re:Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Todd
I think the YARN ResourceManager has a mechanism to relaunch the driver on failure, but I am uncertain. Could someone help with this? Thanks. At 2015-08-19 16:37:32, "Spark Enthusiast" wrote: Thanks for the reply. Are Standalone or Mesos the only options? Is there a way to auto relaunch if

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtdZv0d1btRHl/Spark+build+module&subj=Building+Spark+Building+just+one+module+ > On Aug 19, 2015, at 1:44 AM, canan chen wrote: > > I want to work on one jira, but it is not easy to do unit test, because it > involves different components especi

Re: Spark Streaming failing on YARN Cluster

2015-08-19 Thread Ramkumar V
Thanks a lot for your suggestion. I modified HADOOP_CONF_DIR in spark-env.sh so that core-site.xml is under HADOOP_CONF_DIR. I am able to see the logs like those you had shown above. Now I am able to run for 3 minutes and store results every minute. After some time, there is an except

Re: spark streaming 1.3 doubts(force it to not consume anything)

2015-08-19 Thread Shushant Arora
To correct myself - KafkaRDD[K, V, U, T, R] is a subclass of RDD[R], but Option[KafkaRDD] is not a subclass of Option[RDD] in Java. In scala C[T'] is a subclass of C[T] as per https://twitter.github.io/scala_school/type-basics.html but this is not allowed in java. So is there any workaround to achieve this in java for over

tasks of stage if run same worker

2015-08-19 Thread kale
tasks of stage if run same worker

What's the best practice for developing new features for spark ?

2015-08-19 Thread canan chen
I want to work on one JIRA, but it is not easy to do unit tests because it involves different components, especially the UI. Spark builds are pretty slow, and I don't want to build it each time to test my code change. I am wondering how other people do this? Is there any experience you can share? Thanks

Re: How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Spark Enthusiast
Thanks for the reply. Are Standalone and Mesos the only options? Is there a way to auto-relaunch if the driver runs as a Hadoop YARN application? On Wednesday, 19 August 2015 12:49 PM, Todd wrote: There is an option for the spark-submit (Spark standalone or Mesos with cluster deploy mod

Re: Spark Streaming failing on YARN Cluster

2015-08-19 Thread Ramkumar V
We are using Cloudera 5.3.1. Since it is one of the earlier versions of CDH, it doesn't support the latest version of Spark, so I installed spark-1.4.1 separately on my machine. I couldn't do spark-submit in cluster mode. How do I put core-site.xml under the classpath? It will be very helpful if you c

Re: Does spark sql support column indexing

2015-08-19 Thread prosp4300
The answer is simply NO, but I hope someone could give deeper insight or a meaningful reference. On 2015-08-19 15:21, Todd wrote: I can't find any related discussion on whether Spark SQL supports column indexing. If it does, is there a guide on how to do it? Thanks.

Re: Programmatically create SparkContext on YARN

2015-08-19 Thread Andreas Fritzler
Hi Andrew, thanks a lot for your response. I am aware of the '--master' flag in the spark-submit command. However, I would like to create the SparkContext inside my code. Maybe I should elaborate a little bit further: I would like to reuse e.g. the result of any Spark computation inside my code.
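A sketch of creating the context programmatically in yarn-client mode; it assumes HADOOP_CONF_DIR/YARN_CONF_DIR are visible to the JVM so the YARN client can find the ResourceManager, and yarn-cluster mode cannot be started this way:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("programmatic-yarn-app")
      .setMaster("yarn-client")
      .set("spark.executor.memory", "2g")

    val sc = new SparkContext(conf)
    val total = sc.parallelize(1 to 1000).sum()   // the result stays available to the calling code
    sc.stop()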

Does spark sql support column indexing

2015-08-19 Thread Todd
I can't find any related discussion on whether Spark SQL supports column indexing. If it does, is there a guide on how to do it? Thanks.

Re:How to automatically relaunch a Driver program after crashes?

2015-08-19 Thread Todd
There is an option for spark-submit (Spark standalone or Mesos with cluster deploy mode only): --supervise. If given, it restarts the driver on failure. At 2015-08-19 14:55:39, "Spark Enthusiast" wrote: Folks, As I see it, the Driver program is a single point of failure. N