FW: Pyspark: set Orc Stripe.size on dataframe writer issue

2018-10-17 Thread Somasundara, Ashwin
Hello Group, I am having issues setting the stripe size, index stride and index on an ORC file using PySpark. I am getting approximately 2000 stripes for the 1.2 GB file when I am expecting only 5 stripes with the 256 MB setting. Tried the below options: 1. Set the .options on the data frame writer. The comp
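
For context, a minimal PySpark sketch of the two approaches this thread is trying; the option keys follow the ORC writer settings the poster names, but whether Spark forwards them to the ORC writer depends on the Spark and ORC versions, and the path and values are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-stripe-demo").getOrCreate()
df = spark.range(10000000)  # placeholder data

# Approach 1: pass ORC settings as writer options
# (256 MB stripes, 10k-row index stride, row-group index on)
(df.write
   .option("orc.stripe.size", "268435456")
   .option("orc.row.index.stride", "10000")
   .option("orc.create.index", "true")
   .orc("/tmp/orc_stripe_demo"))

# Approach 2: set the same keys on the Hadoop configuration before writing
# (_jsc is an internal handle, shown here only as a common workaround)
spark.sparkContext._jsc.hadoopConfiguration().set("orc.stripe.size", "268435456")

Note that small stripes can also come from writer memory pressure at write time rather than the configured stripe size, which may explain a stripe count far above what the setting suggests.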

Spark dataset to byte array over grpc

2018-04-23 Thread Ashwin Sai Shankar
Also, is there a better way to send this output to the client? Thanks, Ashwin

Re: Why python cluster mode is not supported in standalone cluster?

2018-02-14 Thread Ashwin Sai Shankar
+dev mailing list (since I didn't get a response from the user DL) On Tue, Feb 13, 2018 at 12:20 PM, Ashwin Sai Shankar wrote: > Hi Spark users! > I noticed that Spark doesn't allow Python apps to run in cluster mode in > a Spark standalone cluster. Does anyone know the reason?

Why python cluster mode is not supported in standalone cluster?

2018-02-13 Thread Ashwin Sai Shankar
Hi Spark users! I noticed that Spark doesn't allow Python apps to run in cluster mode in a Spark standalone cluster. Does anyone know the reason? I checked JIRA but couldn't find anything relevant. Thanks, Ashwin

Recompute Spark outputs intelligently

2017-12-15 Thread Ashwin Raju
out which columns need to be recomputed and which can be left as is. Is there a best practice in the Spark ecosystem for this problem? Perhaps some metadata system/data lineage system we can use? I'm curious if this is a common problem that has already been addressed. Thanks, Ashwin

Re: Spark 2.2 streaming with append mode: empty output

2017-08-15 Thread Ashwin Raju
': {u'description': u'org.apache.spark.sql.execution.streaming.ConsoleSink@7e4050cd'}} On Mon, Aug 14, 2017 at 4:55 PM, Tathagata Das wrote: > In append mode, the aggregation outputs a row only when the watermark has > been crossed and the corresponding aggregate is

Spark 2.2 streaming with append mode: empty output

2017-08-14 Thread Ashwin Raju
the same query with outputMode("append"), however, the output has only the column names, no rows. I was originally trying to output to parquet, which only supports append mode. I was seeing no data in my parquet files, so I switched to console output to debug, then noticed this issue. Am I misunderstanding something about how append mode works? Thanks, Ashwin
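
As the reply above notes, append mode emits a windowed aggregate only after the watermark passes the end of the window, so the output can legitimately stay empty for a while. A minimal sketch of the pattern under discussion (the socket source, column names, and thresholds are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("append-mode-demo").getOrCreate()

# Any streaming source with an event-time column works; socket is just easy to test
events = (spark.readStream
          .format("socket").option("host", "localhost").option("port", 9999)
          .load()
          .selectExpr("value AS word", "current_timestamp() AS event_time"))

counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "5 minutes"), "word")
          .count())

# A window's rows appear only once the watermark (max event time seen
# minus 10 minutes) moves past that window's end
query = (counts.writeStream
         .outputMode("append")
         .format("console")
         .start())
query.awaitTermination()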

Reusing dataframes for streaming (spark 1.6)

2017-08-08 Thread Ashwin Raju
taframe what I would like to do instead: def process(time, rdd): # create dataframe from RDD - input_df # output_df = dataframe_pipeline_fn(input_df) -ashwin
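
A runnable version of the pattern sketched in that snippet, against the Spark 1.6 APIs named in the subject; the socket source, the integer parsing, and the body of dataframe_pipeline_fn are placeholders:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="reuse-df-pipeline")
sqlContext = SQLContext(sc)
ssc = StreamingContext(sc, 10)  # 10-second batches

def dataframe_pipeline_fn(input_df):
    # Shared transformation logic, reusable between batch and streaming jobs
    return input_df.filter(input_df["value"] > 0)

def process(time, rdd):
    if rdd.isEmpty():
        return
    input_df = sqlContext.createDataFrame(rdd.map(lambda v: (v,)), ["value"])
    output_df = dataframe_pipeline_fn(input_df)
    output_df.show()

lines = ssc.socketTextStream("localhost", 9999).map(lambda s: int(s))
lines.foreachRDD(process)
ssc.start()
ssc.awaitTermination()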

Re: Spark shuffle files

2017-03-27 Thread Ashwin Sai Shankar
rg/apache/spark/ContextCleaner.scala > > On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar < > ashan...@netflix.com.invalid> wrote: > >> Hi! >> >> In spark on yarn, when are shuffle files on local disk removed? (Is it >> when the app completes or >> o

Spark shuffle files

2017-03-27 Thread Ashwin Sai Shankar
Hi! In Spark on YARN, when are shuffle files on local disk removed? (Is it when the app completes, once all the shuffle files are fetched, or at the end of the stage?) Thanks, Ashwin

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Thanks. I'll try that. Hopefully that should work. On Mon, Jul 4, 2016 at 9:12 PM, Mathieu Longtin wrote: > I started with a download of 1.6.0. These days, we use a self compiled > 1.6.2. > > On Mon, Jul 4, 2016 at 11:39 AM Ashwin Raaghav > wrote: > >> I am thinki

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Longtin wrote: > 1.6.1. > > I have no idea. SPARK_WORKER_CORES should do the same. > > On Mon, Jul 4, 2016 at 11:24 AM Ashwin Raaghav > wrote: > >> Which version of Spark are you using? 1.6.1? >> >> Any ideas as to why it is not working in ours? >>

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
Which version of Spark are you using? 1.6.1? Any ideas as to why it is not working in ours? On Mon, Jul 4, 2016 at 8:51 PM, Mathieu Longtin wrote: > 16. > > On Mon, Jul 4, 2016 at 11:16 AM Ashwin Raaghav > wrote: > >> Hi, >> >> I tried what you suggeste

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
e per server. However, it seems it will > start as many pyspark as there are cores, but maybe not use them. > > On Mon, Jul 4, 2016 at 10:44 AM Ashwin Raaghav > wrote: > >> Hi Mathieu, >> >> Isn't that the same as setting "spark.executor.cores" to 1? An

Re: Limiting Pyspark.daemons

2016-07-04 Thread Ashwin Raaghav
aemons process is still not coming down. It looks like initially >> there is one Pyspark.daemons process and this in turn spawns as many >> pyspark.daemons processes as the number of cores in the machine. >> >> Any help is appreciated :) >> >> Thanks, >> Ashwin Raagha
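
For reference, a sketch of the knobs suggested later in this thread: pyspark.daemon forks one worker per concurrently running task, so capping the cores an executor (or standalone worker) uses caps the daemons. Values are illustrative:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("limit-pyspark-daemons")
        # Each executor runs at most this many tasks at once,
        # which bounds the pyspark.daemon workers it forks
        .set("spark.executor.cores", "1")
        # Optional cap on the total cores the app takes on a standalone cluster
        .set("spark.cores.max", "8"))
sc = SparkContext(conf=conf)

# Alternatively, as suggested in the thread, SPARK_WORKER_CORES=1 in
# conf/spark-env.sh limits the cores (and hence daemons) per standalone worker.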

Re: Adding h5 files in a zip to use with PySpark

2016-06-15 Thread Ashwin Raaghav
-- Regards, Ashwin Raaghav

Re: Question about MEOMORY_AND_DISK persistence

2016-02-28 Thread Ashwin Giridharan
Hi Vishnu, A partition will either be in memory or on disk. -Ashwin On Feb 28, 2016 15:09, "Vishnu Viswanath" wrote: > Hi All, > > I have a question regarding Persistence (MEMORY_AND_DISK) > > Suppose I am trying to persist an RDD which has 2 partitions and only 1
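
A minimal sketch of the storage level under discussion; each partition is cached in memory if it fits, and a partition that does not fit is written to disk as a whole:

from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-demo")
rdd = sc.parallelize(range(1000000), 2)  # 2 partitions
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()  # materializes the cache; a partition that can't fit in memory goes to disk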

Spark streaming: Consistency of multiple streams in Spark

2015-12-17 Thread Ashwin
could synchronize these multiple streams. What am I missing? Thanks, Ashwin [1] http://www.eecs.berkeley.edu/Pubs/TechRpts/2014/EECS-2014-12.pdf

Re: Hive error after update from 1.4.1 to 1.5.2

2015-12-16 Thread Ashwin Sai Shankar
Hi Bryan, I see the same issue with 1.5.2, can you please let me know what the resolution was? Thanks, Ashwin On Fri, Nov 20, 2015 at 12:07 PM, Bryan Jeffrey wrote: > Nevermind. I had a library dependency that still had the old Spark version. > > On Fri, Nov 20, 2015 at 2:14 PM, Brya

Re: Spark on YARN multitenancy

2015-12-15 Thread Ashwin Sai Shankar
We run large multi-tenant clusters with Spark/Hadoop workloads, and we use YARN's preemption and Spark's dynamic allocation to achieve multitenancy. See the following link on how to enable/configure preemption using the fair scheduler: http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/Fai
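
For reference, a sketch of the Spark side of that setup; the fair-scheduler preemption settings live in the cluster's YARN scheduler configuration, and the executor counts here are illustrative:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("multitenant-app")
        # Grow and shrink the executor count with load
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "1")
        .set("spark.dynamicAllocation.maxExecutors", "50")
        # Dynamic allocation needs the external shuffle service on each NodeManager
        .set("spark.shuffle.service.enabled", "true"))
sc = SparkContext(conf=conf)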

Re: How to display column names in spark-sql output

2015-12-11 Thread Ashwin Sai Shankar
Never mind, it's *set hive.cli.print.header=true* Thanks! On Fri, Dec 11, 2015 at 5:16 PM, Ashwin Shankar wrote: > Hi, > When we run spark-sql, is there a way to get column names/headers with the > result? > > -- > Thanks, > Ashwin > > >

How to display column names in spark-sql output

2015-12-11 Thread Ashwin Shankar
Hi, When we run spark-sql, is there a way to get column names/headers with the result? -- Thanks, Ashwin

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-31 Thread Ashwin Giridharan
creating 500 Dstreams based off 500 textfile > directories, do we need at least 500 executors / nodes to be receivers for > each one of the streams? > > On Tue, Jul 28, 2015 at 6:09 PM, Tathagata Das > wrote: > >> @Ashwin: You could append the topic in the data. >>

Re: What happens when you create more DStreams then nodes in the cluster?

2015-07-31 Thread Ashwin Giridharan
> Thanks, Ashwin On Fri, Jul 31, 2015 at 4:52 PM, Brandon White wrote: > Since one input dstream creates one receiver and one receiver uses one > executor / node. > > What happens if you create more Dstreams than nodes in the cluster? > > Say I have 30 Dstreams on a 15 node clust

Re: How to control Spark Executors from getting Lost when using YARN client mode?

2015-07-30 Thread Ashwin Giridharan
an optimal configuration would be: --num-executors 8 --executor-cores 2 --executor-memory 2G Thanks, Ashwin On Thu, Jul 30, 2015 at 12:08 PM, unk1102 wrote: > Hi I have one Spark job which runs fine locally with less data but when I > schedule it on YARN to execute I keep on getti

Re: Has anybody ever tried running Spark Streaming on 500 text streams?

2015-07-28 Thread Ashwin Giridharan
D { rdd => >> //do something >> } >> } >> >> ssc.start() >> >> Would something like this scale? What would be the limiting factor to >> performance? What is the best way to parallelize this? Any other ideas on >> design? >> > > -- Thanks & Regards, Ashwin Giridharan
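
The usual way to keep 500 file streams from needing 500 separate actions is to union them and attach one action; a sketch with illustrative paths (note that textFileStream polls directories and uses no receivers, so it also avoids the one-core-per-receiver cost a receiver-based source would have):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="many-file-streams")
ssc = StreamingContext(sc, 60)  # 60-second batches

dirs = ["/data/stream_%d" % i for i in range(500)]  # illustrative paths
streams = [ssc.textFileStream(d) for d in dirs]

# One unioned DStream means one set of jobs per batch instead of 500
unioned = ssc.union(*streams)
unioned.foreachRDD(lambda rdd: rdd.foreach(lambda line: None))  # placeholder action

ssc.start()
ssc.awaitTermination()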

Re: Long running streaming application - worker death

2015-07-26 Thread Ashwin Giridharan
owse/SPARK-1340" corresponding to this bug is yet to be resolved. Also have a look at http://apache-spark-user-list.1001560.n3.nabble.com/spark-streaming-and-the-spark-shell-td3347.html Thanks, Ashwin On Sun, Jul 26, 2015 at 9:29 AM, aviemzur wrote: > Hi all, > > I have a question

Re: Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Ashwin Shankar
3. use yarn-cluster mode: the PySpark interactive shell (IPython) doesn't have a cluster mode. SPARK-5162 <https://issues.apache.org/jira/browse/SPARK-5162> is for spark-submit Python in cluster mode. Thanks, Ashwin On Wed, Jun 10, 2015 at 3:55 PM, Eron Wright wrote: > Options i

Problem with pyspark on Docker talking to YARN cluster

2015-06-10 Thread Ashwin Shankar
rt to hostmachine's ip/port. So the AM can then talk to the host machine's ip/port, which would be mapped to the container. Thoughts? -- Thanks, Ashwin

How to pass system properties in spark ?

2015-06-03 Thread Ashwin Shankar
appening? *When I enable log4j debug I see the following:* log4j: Setting property [file] to []. log4j: setFile called: , true log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: (No such file or directory) at java.io.FileOutputStream.open(Native Method) -- Thanks, Ashwin
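
For reference, a hedged sketch of passing -D system properties to Spark JVMs; the log4j path is illustrative. In client mode the driver JVM is already running when SparkConf is read, so driver-side options must instead go through --driver-java-options on spark-submit or spark-defaults.conf:

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("sysprops-demo")
        # System properties for executor JVMs
        .set("spark.executor.extraJavaOptions",
             "-Dlog4j.configuration=file:/etc/spark/log4j.properties")
        # Driver-side equivalent; only effective when the driver JVM is
        # launched after this conf is read (e.g. cluster mode)
        .set("spark.driver.extraJavaOptions",
             "-Dlog4j.configuration=file:/etc/spark/log4j.properties"))
sc = SparkContext(conf=conf)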

Spark on Yarn : Map outputs lifetime ?

2015-05-12 Thread Ashwin Shankar
Hi, In Spark on YARN, when running spark_shuffle as an auxiliary service on the node manager, do map spills of a stage get cleaned up once the next stage completes, OR are they preserved till the app completes (i.e. until all the stages complete)? -- Thanks, Ashwin

Re: Building spark targz

2014-11-12 Thread Ashwin Shankar
e but are you looking for the tar in the assembly/target dir? > > On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar > wrote: > >> Hi, >> I just cloned spark from the github and I'm trying to build to generate a >> tar ball. >> I'm doing: mvn -Pyarn -Pha

Building spark targz

2014-11-12 Thread Ashwin Shankar
d ? -- Thanks, Ashwin

Re: Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
's executors got preempted, say while doing reduceByKey, will the application progress with the remaining resources/fair share? I'm new to Spark, sorry if I'm asking something very obvious :). Thanks, Ashwin On Wed, Oct 22, 2014 at 12:07 PM, Marcelo Vanzin wrote: > Hi Ashwin, > > L

Multitenancy in Spark - within/across spark context

2014-10-22 Thread Ashwin Shankar
e about user/job isolation? I know I'm asking a lot of questions. Thanks in advance :)! -- Thanks, Ashwin Netflix