Creating an RDD from the Iterable returned by groupByKey

2015-06-15 Thread Nirav Patel
I am trying to create a new RDD based on a given PairRDD. I have a PairRDD with few keys, but each key has large (about 100k) values. I want to somehow repartition and make each `Iterable` into RDD[V] so that I can further apply map, reduce, sortBy etc. effectively on those values. I am sensing flatMapVal
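
One way to attack this (a minimal sketch, not from the thread; `pairRdd` is a placeholder): when the number of distinct keys is genuinely small, the grouped Iterables can be avoided entirely by building one values-RDD per key with filter, each of which can then be mapped, reduced or sorted independently.

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Split a pair RDD into one RDD of values per key (cheap only when keys are few).
    def splitByKey[K: ClassTag, V: ClassTag](pairRdd: RDD[(K, V)]): Map[K, RDD[V]] = {
      val keys = pairRdd.keys.distinct().collect()
      keys.map(k => k -> pairRdd.filter(_._1 == k).values).toMap
    }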

Re: Spark executor killed without apparent reason

2016-03-03 Thread Nirav Patel
aised as I manually killed application at some point after too many executors were getting killed. " ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM" Thanks On Wed, Mar 2, 2016 at 8:22 AM, Nirav Patel wrote: > I think that was due to manually killing application. Ex

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Nirav Patel
so why does 'saveAsHadoopDataset' incurs so much memory pressure? Should I try to reduce hbase caching value ? On Wed, Mar 2, 2016 at 7:51 AM, Nirav Patel wrote: > Hi, > > I have a spark jobs that runs on yarn and keeps failing at line where i do : > > > val hConf

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Nirav Patel
; > FYI > > On Thu, Mar 3, 2016 at 6:08 AM, Ted Yu wrote: > >> From the log snippet you posted, it was not clear why connection got >> lost. You can lower the value for caching and see if GC activity gets >> lower. >> >> How wide are the rows in hbase

Re: Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-03 Thread Nirav Patel
> You can use binary search to get to a reasonable value for caching. > > Thanks > > On Thu, Mar 3, 2016 at 7:52 AM, Nirav Patel wrote: > >> Hi Ted, >> >> I'd say about 70th percentile keys have 2 columns each having a string of >> 20k comma separated v

Compress individual RDD

2016-03-15 Thread Nirav Patel
Hi, I see that there's the following spark config to compress an RDD. My guess is it will compress all RDDs of a given SparkContext, right? If so, is there a way to instruct the spark context to only compress some RDDs and leave others uncompressed? Thanks spark.rdd.compress false Whether to compress

Re: Compress individual RDD

2016-03-15 Thread Nirav Patel
nly rdds with serialization enabled in the persistence > mode. So you could skip _SER modes for your other rdds. Not perfect but > something. > On 15-Mar-2016 4:33 pm, "Nirav Patel" wrote: > >> Hi, >> >> I see that there's following spark config to compress an
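
A hedged sketch of what that reply describes (RDD names are placeholders): spark.rdd.compress only applies to partitions stored in a serialized (_SER) level, so compression can effectively be chosen per RDD through the storage level.

    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // With spark.rdd.compress=true set in the SparkConf / spark-defaults:
    def cacheWithChoice[A, B](compressed: RDD[A], uncompressed: RDD[B]): Unit = {
      compressed.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized, therefore compressed
      uncompressed.persist(StorageLevel.MEMORY_ONLY)     // plain objects, not compressed
    }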

spark 1.5.2 - value filterByRange is not a member of org.apache.spark.rdd.RDD[(myKey, myData)]

2016-03-30 Thread Nirav Patel
Hi, I am trying to use the filterByRange feature of spark OrderedRDDFunctions in the hope that it will speed up filtering by scanning only the required partitions. I have created a Paired RDD with a RangePartitioner in one scala class and in another class I am trying to access this RDD and do the following: In fi
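
A likely cause and a hedged sketch (MyKey/MyData stand in for the real classes): filterByRange comes from OrderedRDDFunctions, which is only pulled in implicitly when an Ordering for the key type is in scope, so the class that only sees the RDD may simply be missing that implicit.

    import org.apache.spark.rdd.RDD

    case class MyKey(id: Int)
    case class MyData(payload: String)

    // Without this implicit in scope, filterByRange will not resolve on RDD[(MyKey, MyData)].
    implicit val myKeyOrdering: Ordering[MyKey] = Ordering.by(_.id)

    def scan(rdd: RDD[(MyKey, MyData)]): RDD[(MyKey, MyData)] =
      rdd.filterByRange(MyKey(10), MyKey(20))   // prunes partitions when the RDD is range-partitioned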

Re: How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

2016-04-02 Thread Nirav Patel
>> operation. Filter is an O(m) operation where m is the size of your >> partition, not an O(N) operation. >> >> -Ilya Ganelin >> >> On Sat, Jan 23, 2016 at 5:48 AM, Nirav Patel >> wrote: >> >>> Problem is I have RDD of about 10M rows and it kee

Multiple lookups; consolidate result and run further aggregations

2016-04-02 Thread Nirav Patel
I will start with a question: Is the spark lookup function on a pair RDD a driver action, i.e. is the result returned to the driver? I have a list of keys on the driver side and I want to perform multiple parallel lookups on a pair RDD which returns Seq[V]; consolidate results; and perform further aggregation/transformati
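
A hedged alternative sketch (names are placeholders): rather than many driver-side lookup() calls, broadcast the key set, filter once, and keep the consolidation distributed.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    def lookupMany[K: ClassTag, V: ClassTag](sc: SparkContext,
                                             pairRdd: RDD[(K, V)],
                                             wantedKeys: Seq[K]): RDD[(K, Iterable[V])] = {
      val keySet = sc.broadcast(wantedKeys.toSet)
      pairRdd
        .filter { case (k, _) => keySet.value.contains(k) }  // one pass instead of N lookups
        .groupByKey()                                        // consolidate per key on the executors
    }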

Re: spark 1.5.2 - value filterByRange is not a member of org.apache.spark.rdd.RDD[(myKey, myData)]

2016-04-02 Thread Nirav Patel
sortByKey() > > See core/src/main/scala/org/apache/spark/api/java/JavaPairRDD.scala > > On Wed, Mar 30, 2016 at 5:20 AM, Nirav Patel > wrote: > >> Hi, I am trying to use filterByRange feature of spark OrderedRDDFunctions >> in a hope that it will speed up filtering

SparkDriver throwing java.lang.OutOfMemoryError: Java heap space

2016-04-04 Thread Nirav Patel
Hi, We are using spark 1.5.2 and recently hitting this issue after our dataset grew from 140GB to 160GB. The error is thrown during shuffle fetch on the reduce side, which should all happen on executors, and the executors should report it! However it gets reported only on the driver. SparkContext gets shutdown fr

GroupByKey Iterable[T] - Memory usage

2016-04-25 Thread Nirav Patel
Hi, Is the Iterable coming out of GroupByKey loaded fully into the memory of the reducer task, or can it also be on disk? Also, is there a way to evict it from memory once the reducer is done iterating it and wants to use the memory for something else. Thanks

aggregateByKey - external combine function

2016-04-28 Thread Nirav Patel
Hi, I tried to convert a groupByKey operation to aggregateByKey in the hope of avoiding memory and high GC issues when dealing with 200GB of data. I needed to create a collection of resulting key-value pairs which represent all combinations of a given key. My merge fun definition is as follows: private
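
For reference, a minimal aggregateByKey skeleton of the shape being described (the actual merge functions in the thread are cut off, so this is an assumed example):

    import org.apache.spark.rdd.RDD
    import scala.collection.mutable.ArrayBuffer
    import scala.reflect.ClassTag

    // Build a collection of values per key without groupByKey's single Iterable.
    def collectPerKey[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): RDD[(K, ArrayBuffer[V])] =
      rdd.aggregateByKey(ArrayBuffer.empty[V])(
        (buf, v) => buf += v,    // fold a value into the partition-local buffer
        (b1, b2) => b1 ++= b2    // merge buffers coming from different partitions
      )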

Re: aggregateByKey - external combine function

2016-04-29 Thread Nirav Patel
Any thoughts? I can explain more about the problem, but basically the shuffle data doesn't seem to fit in reducer memory (32GB) and I am looking for ways to process it on disk+memory. Thanks On Thu, Apr 28, 2016 at 10:07 AM, Nirav Patel wrote: > Hi, > > I tried to convert a groupByKe

Spark 1.5.2 Shuffle Blocks - running out of memory

2016-05-03 Thread Nirav Patel
Hi, My spark application is getting killed abruptly during a groupBy operation where shuffle happens. All shuffle happens with PROCESS_LOCAL locality. I see the following in driver logs. Should these logs not be in the executors? Anyhow it looks like ByteBuffer is running out of memory. What will be workaround f

Re: Spark 1.5.2 Shuffle Blocks - running out of memory

2016-05-06 Thread Nirav Patel
Is this a limit of spark shuffle block currently? On Tue, May 3, 2016 at 11:18 AM, Nirav Patel wrote: > Hi, > > My spark application getting killed abruptly during a groupBy operation > where shuffle happens. All shuffle happens with PROCESS_LOCAL locality. I > see following

How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
Hi, I thought I was using the kryo serializer for shuffle. I could verify it from the spark UI - Environment tab that spark.serializer org.apache.spark.serializer.KryoSerializer spark.kryo.registrator com.myapp.spark.jobs.conf.SparkSerializerRegistrator But when I see the following error in Driver logs it

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
hould happen in executor JVM NOT in driver JVM. Thanks On Sat, May 7, 2016 at 11:58 AM, Ted Yu wrote: > bq. at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129) > > It was Akka which uses JavaSerializer > > Cheers > > On Sat, May 7, 2016 at 11:13 AM, Nir
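
A hedged sketch of the configuration being verified (the registrator class name is the one quoted from the UI in this thread); as the reply notes, the akka JavaSerializer frame in the driver log is Akka's own RPC serialization, not the shuffle path:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.myapp.spark.jobs.conf.SparkSerializerRegistrator")
      .set("spark.kryo.registrationRequired", "true")   // fail fast on any class Kryo was not told about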

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
of executors , stages and tasks in your app. > Do you know your driver heap size and application structure ( num of stages > and tasks ) > > Ashish > > On Saturday, May 7, 2016, Nirav Patel wrote: > >> Right but this logs from spark driver and spark driver seem

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Nirav Patel
ocs/latest/running-on-yarn.html > > In cluster mode, spark.yarn.am.memory is not effective. > > For Spark 2.0, akka is moved out of the picture. > FYI > >> On Sat, May 7, 2016 at 8:24 PM, Nirav Patel wrote: >> I have 20 executors, 6 cores each. Total 5 stages. It f

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Nirav Patel
park.driver.memory > > Cheers > > On Sun, May 8, 2016 at 9:14 AM, Nirav Patel wrote: > >> Yes, I am using yarn client mode hence I specified am settings too. >> What you mean akka is moved out of picture? I am using spark 2.5.1 >> >> Sent from my iPhone >&

How to take executor memory dump

2016-05-11 Thread Nirav Patel
Hi, I am hitting OutOfMemoryError issues with spark executors. It happens mainly during shuffle. Executors get killed with OutOfMemoryError. I have tried setting spark.executor.extraJavaOptions to take a memory dump but it's not happening. spark.executor.extraJavaOptions = "-XX:+UseCompressedOops
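
A hedged example of the kind of setting being attempted (the dump path is a placeholder and must exist on every NodeManager host); alternatively, jmap -dump can be run against a live executor PID on the node:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.extraJavaOptions",
           "-XX:+UseCompressedOops -XX:+HeapDumpOnOutOfMemoryError " +
           "-XX:HeapDumpPath=/tmp/spark-executor-dumps")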

API to study key cardinality and distribution and other important statistics about data at certain stage

2016-05-13 Thread Nirav Patel
Hi, The problem is that every time a job fails or performs poorly at certain stages you need to study your data distribution just before THAT stage. An overall look at the input data set doesn't help very much if you have so many transformations going on in the DAG. I always end up writing complicated typed code to run an
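
A minimal sketch of the kind of ad-hoc check being described (`pairRdd` stands for whatever feeds the problem stage): sample the data just before the shuffle and look at the heaviest keys.

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    def heaviestKeys[K: ClassTag, V](pairRdd: RDD[(K, V)], n: Int = 20): Array[(K, Long)] =
      pairRdd
        .sample(withReplacement = false, fraction = 0.01)     // keep the check cheap
        .map { case (k, _) => (k, 1L) }
        .reduceByKey(_ + _)
        .top(n)(Ordering.by[(K, Long), Long](_._2))           // most frequent keys in the sample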

Spark UI metrics - Task execution time and number of records processed

2016-05-18 Thread Nirav Patel
Hi, I have been noticing that for shuffled tasks (groupBy, Join) the reducer tasks are not evenly loaded. Most of them (90%) finish super fast but there are some outliers that take much longer, as you can see from the "Max" value in the following metric. The metric is from a Join operation done on two RDDs. I trie

Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
It's great that the spark scheduler does optimized DAG processing and only does lazy eval when some action is performed or a shuffle dependency is encountered. Sometimes it goes even further past the shuffle dep before executing anything, e.g. if there are map steps after the shuffle then it doesn't stop at shuffle t

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
UI will tell you which Stage in a Job failed, and > the accompanying error log message will generally also give you some idea > of which Tasks failed and why. Tracing the error back further and at a > different level of abstraction to lay blame on a particular transformation > wouldn't be part

Re: Spark UI doesn't give visibility on which stage job actually failed (due to lazy eval nature)

2016-05-25 Thread Nirav Patel
tages, TaskSets and Tasks -- and when you start talking about Datasets and > Spark SQL, you then needing to start talking about tracking and mapping > concepts like Plans, Schemas and Queries. It would introduce significant > new complexity. > > On Wed, May 25, 2016 at 6:59 PM, Nirav

Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Nirav Patel
Hi, I am getting the following Kryo deserialization error when trying to bulk load a cached RDD into Hbase. It works if I don't cache the RDD. I cache it with MEMORY_ONLY_SER. Here's the code snippet: hbaseRdd.values.foreachPartition{ itr => val hConf = HBaseConfiguration.create() hCon

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Nirav Patel
> Thanks > > On Sun, May 29, 2016 at 4:26 PM, Nirav Patel > wrote: > >> Hi, >> >> I am getting following Kryo deserialization error when trying to buklload >> Cached RDD into Hbase. It works if I don't cache the RDD. I cache it >> with

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-29 Thread Nirav Patel
Sure, let me try that. But from the looks of it, it seems kryo.util.MapReferenceResolver.getReadObject is trying to access an incorrect index (100) On Sun, May 29, 2016 at 5:06 PM, Ted Yu wrote: > Can you register Put with Kryo ? > > Thanks > > On May 29, 2016, at 4:58 PM, Nir

Re: Bulk loading Serialized RDD into Hbase throws KryoException - IndexOutOfBoundsException

2016-05-30 Thread Nirav Patel
exception. On Sun, May 29, 2016 at 11:26 PM, sjk wrote: > org.apache.hadoop.hbase.client.{Mutation, Put} > org.apache.hadoop.hbase.io.ImmutableBytesWritable > > if u used mutation, register the above class too > > On May 30, 2016, at 08:11, Nirav Patel wrote: > > Sure let m
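
A hedged sketch of the registration suggested in these replies (the registrator class name is illustrative); point spark.kryo.registrator at it:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.hadoop.hbase.client.{Mutation, Put}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.spark.serializer.KryoRegistrator

    // Register the HBase write types so cached MEMORY_ONLY_SER partitions
    // holding them can be deserialized cleanly.
    class HBaseKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[ImmutableBytesWritable])
        kryo.register(classOf[Put])
        kryo.register(classOf[Mutation])
      }
    }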

Spark dynamic allocation - efficiently request new resource

2016-06-07 Thread Nirav Patel
Hi, Does current or future (2.0) spark dynamic allocation have the capability to request a container with varying resource requirements based on various factors? A few factors I can think of: based on the stage and the data it's processing, it could either ask for more CPUs or more Memory, i.e. a new executor can have

Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-21 Thread Nirav Patel
I have an RDD[String, MyObj] which is a result of a Join + Map operation. It has no partitioner info. I run reduceByKey without passing any Partitioner or partition counts. I observed that the output aggregation result for a given key is sometimes incorrect, like 1 out of 5 times. It looks like the reduce oper

Re: spark job automatically killed without rhyme or reason

2016-06-21 Thread Nirav Patel
Spark is a memory hogger and suicidal if you have a job processing a bigger dataset. However, Databricks claims that spark > 1.6 has optimizations related to memory footprint as well as processing. They are only available if you use DataFrame or Dataset. If you are using RDDs you have to do a lot of te

Re: FullOuterJoin on Spark

2016-06-21 Thread Nirav Patel
Can your domain list fit in the memory of one executor? If so you can use a broadcast join. You can always narrow it down to an inner join and derive the rest from the original set if memory is an issue there. If you are just concerned about shuffle memory, then to reduce the amount of shuffle you can do the following: 1) partit
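
A hedged sketch of the broadcast-join idea in this reply (names and types are placeholders): collect the small ("domain") side, broadcast it, and map over the large side, avoiding the shuffle entirely.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    def broadcastJoin[K: ClassTag, V: ClassTag, W: ClassTag](sc: SparkContext,
                                                             large: RDD[(K, V)],
                                                             small: RDD[(K, W)]): RDD[(K, (V, Option[W]))] = {
      val smallMap = sc.broadcast(small.collectAsMap())       // must fit on the driver and each executor
      large.map { case (k, v) => (k, (v, smallMap.value.get(k))) }
    }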

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Nirav Patel
did you observe any error (on > workers) ? > > Cheers > > On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel > wrote: > >> I have an RDD[String, MyObj] which is a result of Join + Map operation. >> It has no partitioner info. I run reduceByKey without passing any >> Pa

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Nirav Patel
Wed, Jun 22, 2016 at 11:52 AM, Nirav Patel wrote: > Hi, > > I do not see any indication of errors or executor getting killed in spark > UI - jobs, stages, event timelines. No task failures. I also don't see any > errors in executor logs. > > Thanks > > On Wed, Jun 22

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Nirav Patel
Yes, the driver keeps a fair amount of metadata to manage scheduling across all your executors. I assume with 64 nodes you have more executors as well. A simple way to test is to increase driver memory. On Wed, Jun 22, 2016 at 10:10 AM, Raghava Mutharaju < m.vijayaragh...@gmail.com> wrote: > It is an ite

Re: OOM on the driver after increasing partitions

2016-06-22 Thread Nirav Patel
e > maximum available limit. So the other options are > > 1) Separate the driver from master, i.e., run them on two separate nodes > 2) Increase the RAM capacity on the driver/master node. > > Regards, > Raghava. > > > On Wed, Jun 22, 2016 at 7:05 PM, Nirav Patel >

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-23 Thread Nirav Patel
So it was indeed my merge function. I created a new result object for every merge and it's working now. Thanks On Wed, Jun 22, 2016 at 3:46 PM, Nirav Patel wrote: > PS. In my reduceByKey operation I have two mutable object. What I do is > merge mutable2 into mutable1 and return mutable1.
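
For the record, a hedged illustration of the fix being described (MyObj is a stand-in): build and return a fresh result in the reduceByKey function instead of mutating and returning one of its arguments.

    import org.apache.spark.rdd.RDD

    case class MyObj(count: Long, sum: Double)

    // A new object on every merge; nothing upstream is mutated.
    def aggregate(rdd: RDD[(String, MyObj)]): RDD[(String, MyObj)] =
      rdd.reduceByKey((a, b) => MyObj(a.count + b.count, a.sum + b.sum))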

How to efficiently Scan (not filter nor lookup) part of Paired RDD or Ordered RDD

2016-01-23 Thread Nirav Patel
The problem is I have an RDD of about 10M rows and it keeps growing. Every time we want to perform a query and compute on a subset of data we have to use filter and then some aggregation. I know filter goes through each partition and every row of the RDD, which may not be efficient at all. Spark having

Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
Hi, Perhaps I should write a blog about why spark is focusing more on making it easier to write spark jobs and hiding underlying performance optimization details from seasoned spark users. It's one thing to provide such an abstract framework that does optimization for you so you don't have to worry

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-01-25 Thread Nirav Patel
; to write your own optimizations based on your own knowledge of the data > types and semantics that are hiding in your raw RDDs, there's no reason > that you can't do that. > > On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel > wrote: > >> Hi, >> >> P

Re: Regarding Off-heap memory

2016-01-26 Thread Nirav Patel
From my experience with spark 1.3.1, you can also set spark.executor.memoryOverhead to about 7-10% of your spark.executor.memory, the total of which will be requested for a Yarn container. On Tue, Jan 26, 2016 at 4:20 AM, Xiaoyu Ma wrote: > Hi all, > I saw spark 1.6 has new off heap settings: spark.

Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Nirav Patel
Hi, we were using spark 1.3.1 and launching our spark jobs in yarn-client mode programmatically by creating SparkConf and SparkContext objects manually. It was inspired by the spark self-contained application example here: https://spark.apache.org/docs/1.5.2/quick-start.html#self-contained-applica

Spark 1.5.2 - Programmatically launching spark on yarn-client mode

2016-01-28 Thread Nirav Patel
Hi, we were using spark 1.3.1 and launching our spark jobs in yarn-client mode programmatically by creating SparkConf and SparkContext objects manually. It was inspired by the spark self-contained application example here: https://spark.apache.org/docs/1.5.2/quick-start.html#self-contained-applica
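
A hedged sketch of the launch pattern being described (app name and resource values are placeholders; the master string is the Spark 1.x form):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-query-service")
      .setMaster("yarn-client")              // Spark 1.x; in 2.x this becomes master=yarn + deploy-mode=client
      .set("spark.executor.instances", "10")
      .set("spark.executor.memory", "4g")

    val sc = new SparkContext(conf)          // requires HADOOP_CONF_DIR/YARN_CONF_DIR on the classpath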

Re: Programmatically launching spark on yarn-client mode no longer works in spark 1.5.2

2016-01-28 Thread Nirav Patel
ght be due to some race > conditions in exit period. The way you mentioned is still valid, this > problem only occurs when stopping the application. > > Thanks > Saisai > > On Fri, Jan 29, 2016 at 10:22 AM, Nirav Patel > wrote: > >> Hi, we were using spark 1

Re: Spark 1.5.2 - Programmatically launching spark on yarn-client mode

2016-01-30 Thread Nirav Patel
endency leaked into your > app ? > > Cheers > > On Thu, Jan 28, 2016 at 7:36 PM, Nirav Patel > wrote: > >> Hi, we were using spark 1.3.1 and launching our spark jobs on yarn-client >> mode programmatically via creating a sparkConf and sparkContext object >> manua

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
has types too! On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel wrote: > I haven't gone through much details of spark catalyst optimizer and > tungston project but we have been advised by databricks support to use > DataFrame to resolve issues with OOM error that we are getting dur

Spark 1.5.2 - are new Project Tungsten optimizations available on RDD as well?

2016-02-02 Thread Nirav Patel
Hi, I read the release notes and a few slideshares on the latest optimizations done in the spark 1.4 and 1.5 releases, part of which are optimizations from project Tungsten. Docs say it uses sun.misc.Unsafe to convert the physical rdd structure into a byte array before shuffle for optimized GC and memory. My

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
mizations in catalyst. RDD is > simply no longer the focus. > On Feb 2, 2016 7:17 PM, "Nirav Patel" wrote: > >> so latest optimizations done on spark 1.4 and 1.5 releases are mostly >> from project Tungsten. Docs says it usues sun.misc.unsafe to convert >> physical

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
I don't understand why one thinks an RDD of case objects doesn't have types (schema)? If spark can convert an RDD to a DataFrame, that means it understood the schema. So from that point on, why does one have to use SQL features to do further processing? If all spark needs for optimizations is the schema, then what this

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
ot comment on the beauty of it because "beauty is in the eye of the > beholder" LOL > Regarding the comment on error prone, can you say why you think it is the > case? Relative to what other ways? > > Best Regards, > > Jerry > > > On Tue, Feb 2, 201

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
Hi Stefan, Welcome to the OOM - heap space club. I have been struggling with similar errors (OOM and the yarn executor being killed), failing jobs or sending them into retry loops. I bet the same job will run perfectly fine with fewer resources as a Hadoop MapReduce program. I have tested it for my program

Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Nirav Patel
Hi, I have a spark job running in yarn-client mode. At some point during the Join stage, an executor (container) runs out of memory and yarn kills it. Due to this the entire job restarts, and it keeps doing so on every failure. What is the best way to checkpoint? I see there's a checkpoint api and other option
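
A hedged sketch of the checkpointing option mentioned here (the directory is a placeholder): persisting and checkpointing the expensive join truncates the lineage, so a lost executor does not force a recompute from the very beginning.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    def checkpointedJoin[K: ClassTag, V: ClassTag, W: ClassTag](sc: SparkContext,
                                                                left: RDD[(K, V)],
                                                                right: RDD[(K, W)]): RDD[(K, (V, W))] = {
      sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // placeholder path
      val joined = left.join(right).persist()
      joined.checkpoint()                                    // materialized on the next action
      joined.count()                                         // force it before later stages reuse `joined`
      joined
    }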

Re: Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Nirav Patel
rove things is to install the > Spark shuffle service on the YARN nodes, so that even if an executor > crashes, its shuffle output is still available to other executors. > > On Wed, Feb 3, 2016 at 11:46 AM, Nirav Patel > wrote: > >> Hi, >> >> I have a spark job ru

Re: Spark 1.5.2 Yarn Application Master - resiliency

2016-02-03 Thread Nirav Patel
o Vanzin wrote: > > Yes, but you don't necessarily need to use dynamic allocation (just enable > the external shuffle service). > > On Wed, Feb 3, 2016 at 11:53 AM, Nirav Patel > wrote: > >> Do you mean this setup? >> >> https://spark.apache.org/docs/1.5.2/job

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-03 Thread Nirav Patel
like to use joins where one side is >> streaming (and the other cached). this seems to be available for DataFrame >> but not for RDD. >> >> On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel >> wrote: >> >>> Hi Jerry, >>> >>> Yes I read that

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
tp://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/> > > > > *From:* Nirav Patel [mailto:npa...@xactlycorp.com] > *Sent:* Wednesday, February 3, 2016 11:31 AM > *To:* Stefan Panayotov > *Cc:* Jim Green; Ted Yu; Jakob Odersky; user@spark.apac

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
About the OP: how many cores do you assign per executor? Maybe reducing that number will give a bigger portion of executor memory to each task being executed on that executor. Others please comment if that makes sense. On Wed, Feb 3, 2016 at 1:52 PM, Nirav Patel wrote: > I know it;s a strong word

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
> max memory that your executor might have. But the memory that you get is > less then that. I don’t clearly remember but i think its either memory/2 or > memory/4. But I may be wrong as I have been out of spark for months. > > On Feb 3, 2016, at 2:58 PM, Nirav Patel wrote: > > A

Re: Spark 1.5.2 memory error

2016-02-03 Thread Nirav Patel
te: > There is also (deprecated) spark.storage.unrollFraction to consider > > On Wed, Feb 3, 2016 at 2:21 PM, Nirav Patel wrote: > >> What I meant is executor.cores and task.cpus can dictate how many >> parallel tasks will run on given executor. >> >> Let's

Re: Too many open files, why changing ulimit not effecting?

2016-02-05 Thread Nirav Patel
For centos there's also /etc/security/limits.d/90-nproc.conf that may need modification. Services that you expect to use the new limits need to be restarted. The simple thing to do is to reboot the machine. On Fri, Feb 5, 2016 at 3:59 AM, Ted Yu wrote: > bq. and *"session required pam_limits.so"*. >

Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-10 Thread Nirav Patel
In Yarn we have the following settings enabled so that a job can use virtual memory to have capacity beyond physical memory, of course. yarn.nodemanager.vmem-check-enabled false yarn.nodemanager.pmem-check-enabled false The vmem to pmem ratio is 2:1. However spark do

Spark executor Memory profiling

2016-02-10 Thread Nirav Patel
We have been trying to solve a memory issue with a spark job that processes 150GB of data (on disk). It does a groupBy operation; some of the executors will receive somewhere around 2-4M scala case objects to work with. We are using the following spark config: "executorInstances": "15", "executor

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-16 Thread Nirav Patel
t increase the executor memory. Also considering increasing the > parallelism ie the number of partitions. > > Regards > Sab > >> On 11-Feb-2016 5:46 am, "Nirav Patel" wrote: >> In Yarn we have following settings enabled so that job can use virt

Re: Spark executor Memory profiling

2016-02-20 Thread Nirav Patel
or.memoryOverhead","4000") > > conf = conf.set("spark.executor.cores","4").set("spark.executor.memory", > "15G").set("spark.executor.instances","6") > > Is it also possible to use reduceBy in place of group

Re: Spark executor Memory profiling

2016-02-20 Thread Nirav Patel
.run(DFSOutputStream.java:745) > > Kindly help me understand the conf. > > > Thanks in advance. > > Regards > Arun. > > -- > *From:* Kuchekar [kuchekar.nil...@gmail.com] > *Sent:* 11 February 2016 09:42 > *To:* Nirav Patel >

Re: Spark Application Master on Yarn client mode - Virtual memory limit

2016-02-24 Thread Nirav Patel
> > On 17 Feb 2016, at 01:29, Nirav Patel wrote: > > I think you are not getting my question . I know how to tune executor > memory settings and parallelism . That's not an issue. It's a specific > question about what happens when physical memory limit of given exec

Spark executor killed without apparent reason

2016-03-01 Thread Nirav Patel
Hi, We are using spark 1.5.2 on yarn. We have a spark application utilizing about 15GB executor memory and 1500 overhead. However, at a certain stage we notice higher GC time (almost the same as task time) spent. These executors are bound to get killed at some point. However, the nodemanager or resource man

Re: Spark executor killed without apparent reason

2016-03-01 Thread Nirav Patel
ufOfMemoryError: Direct Buffer Memory" or something else. > > On Tue, Mar 1, 2016 at 6:23 AM, Nirav Patel wrote: > >> Hi, >> >> We are using spark 1.5.2 or yarn. We have a spark application utilizing >> about 15GB executor memory and 1500 overhead. However, at ce

Job fails at saveAsHadoopDataset stage due to Lost Executor due to reason unknown so far

2016-03-01 Thread Nirav Patel
Hi, I have a spark job that runs on yarn and keeps failing at the line where I do: val hConf = HBaseConfiguration.create hConf.setInt("hbase.client.scanner.caching", 1) hConf.setBoolean("hbase.cluster.distributed", true) new PairRDDFunctions(hbaseRdd).saveAsHadoopDataset(jobConfig)

Re: Spark executor Memory profiling

2016-03-01 Thread Nirav Patel
ing spark <http://techsuppdiva.github.io/spark1.6.html> > . > > Hope this helps, keep the community posted what resolved your issue if it > does. > > Thanks. > > Kuchekar, Nilesh > > On Sat, Feb 20, 2016 at 11:29 AM, Nirav Patel > wrote: > >> Thanks

Spark executor jvm classloader not able to load nested jars

2015-11-02 Thread Nirav Patel
Hi, I have a maven-based mixed scala/java application that can submit spark jobs. My application jar "myapp.jar" has some nested jars inside a lib folder. It's a fat jar created using the spring-boot-maven plugin, which nests other jars inside the lib folder of the parent jar. I prefer not to create a shaded flat jar

Spark 1.3.1 - Does SparkContext in multi-threaded env require SparkEnv.set(env) anymore

2015-12-10 Thread Nirav Patel
As the subject says, do we still need to use a static env in every thread that accesses the sparkContext? I read some ref here: http://qnalist.com/questions/4956211/is-spark-context-in-local-mode-thread-safe

apache-spark 1.3.0 and yarn integration and spring-boot as a container

2015-07-30 Thread Nirav Patel
Hi, I was running a spark application as a query service (much like spark-shell but within my servlet container provided by spring-boot) with spark 1.0.2 in standalone mode. Now after upgrading to spark 1.3.1 and trying to use Yarn instead of a standalone cluster, things are going south for me. I created u

How PolynomialExpansion works

2016-09-16 Thread Nirav Patel
The doc says: Take a 2-variable feature vector as an example: (x, y); if we want to expand it with degree 2, then we get (x, x * x, y, x * y, y * y). I know the polynomial expansion of (x+y)^2 = x^2 + 2xy + y^2 but can't relate it to the above. Thanks
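
To make the mapping concrete, a small hedged example (the local SparkSession is only for illustration): the transformer emits every monomial of degree <= 2 except the constant term, which is why (x, y) expands to (x, x*x, y, x*y, y*y) rather than the three terms of (x+y)^2.

    import org.apache.spark.ml.feature.PolynomialExpansion
    import org.apache.spark.ml.linalg.Vectors
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("poly-demo").master("local[*]").getOrCreate()

    // (x, y) = (2, 3)
    val df = spark.createDataFrame(Seq(Tuple1(Vectors.dense(2.0, 3.0)))).toDF("features")

    val expanded = new PolynomialExpansion()
      .setInputCol("features").setOutputCol("poly").setDegree(2)
      .transform(df)

    expanded.show(false)   // expected per the documented ordering: [2.0, 4.0, 3.0, 6.0, 9.0] i.e. (x, x*x, y, x*y, y*y)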

Tutorial error - zeppelin 0.6.2 built with spark 2.0 and mapr

2016-09-26 Thread Nirav Patel
Hi, I built the zeppelin 0.6 branch with spark 2.0 using the following mvn: mvn clean package -Pbuild-distr -Pmapr41 -Pyarn -Pspark-2.0 -Pscala-2.11 -DskipTests The build went successfully. I only have the following set in zeppelin-conf.sh export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.5.1/ export HADOOP_CONF_

Re: Tutorial error - zeppelin 0.6.2 built with spark 2.0 and mapr

2016-09-26 Thread Nirav Patel
FYI, it works when I use MapR configured Spark 2.0. ie export SPARK_HOME=/opt/mapr/spark/spark-2.0.0-bin-without-hadoop Thanks Nirav On Mon, Sep 26, 2016 at 3:45 PM, Nirav Patel wrote: > Hi, > > I built zeppeling 0.6 branch using spark 2.0 using following mvn : > > mvn clean

MulticlassClassificationEvaluator how weighted precision and weighted recall calculated

2016-10-03 Thread Nirav Patel
For example, for 3 classes would it be? weightedPrecision = (TP1*w1 + TP2*w2 + TP3*w3) / ((TP1*w1 + TP2*w2 + TP3*w3) + (FP1*w1 + FP2*w2 + FP3*w3)) where TP1..TP3 are the TP for each class, and w1, w2, ... are the weights for each class based on their distribution in the sample data? And similar for recall

ML - MulticlassClassificationEvaluator How to get metrics for each class

2016-10-03 Thread Nirav Patel
I see that in the scikit library if you specify 'None' or nothing for the 'average' parameter it returns metrics for each class. How do I get this in the ML library? http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html The current weighted metrics do help to see the overall picture b

Re: ML - MulticlassClassificationEvaluator How to get metrics for each class

2016-10-03 Thread Nirav Patel
I think it's via using the MulticlassMetrics class. Just found it. Thanks On Mon, Oct 3, 2016 at 3:31 PM, Nirav Patel wrote: > I see that in scikit library if you specify 'Non' or nothing for 'average' > parameter it returns metrics for each classes. How to get this in ML
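
A hedged sketch of that route (column names are the ML defaults; `predictions` is a placeholder DataFrame produced by model.transform):

    import org.apache.spark.mllib.evaluation.MulticlassMetrics

    val predictionAndLabels = predictions
      .select("prediction", "label")
      .rdd
      .map(r => (r.getDouble(0), r.getDouble(1)))

    val metrics = new MulticlassMetrics(predictionAndLabels)
    metrics.labels.foreach { l =>
      println(s"class $l: precision=${metrics.precision(l)}, recall=${metrics.recall(l)}, f1=${metrics.fMeasure(l)}")
    }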

Re: Executor Lost error

2016-10-04 Thread Nirav Patel
A few pointers in addition: 1) Executors can also get lost if they hang on GC and can't respond to the driver within the timeout ms. That should be in the executor logs though. 2) --conf "spark.shuffle.memoryFraction=0.8" is a very high shuffle fraction. You should check the UI for the Event Timeline and exec logs

Is IDF model reusable

2016-11-01 Thread Nirav Patel
I am using the IDF estimator/model (TF-IDF) to convert text features into vectors. Currently, I fit the IDF model on all sample data and then transform it. I read somewhere that I should split my data into training and test before fitting the IDF model; fit IDF only on training data and then use the same transfo
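
A hedged sketch of the split-then-fit approach being asked about (`dataset` is a placeholder DataFrame that already has a term-frequency vector column "tf"): fit IDF on the training split only, then reuse the same fitted model for training, test, and future unlabeled data.

    import org.apache.spark.ml.feature.IDF

    val Array(train, test) = dataset.randomSplit(Array(0.8, 0.2), seed = 42)

    val idfModel = new IDF().setInputCol("tf").setOutputCol("tfidf").fit(train)  // fit on training only
    val trainVec = idfModel.transform(train)
    val testVec  = idfModel.transform(test)   // same fitted model, no refit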

Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
FYI, I do reuse the IDF model while making predictions against new unlabeled data, but not between training and test data while training a model. On Tue, Nov 1, 2016 at 3:10 AM, Nirav Patel wrote: > I am using IDF estimator/model (TF-IDF) to convert text features into > vectors. Currently, I f

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
rfit your model. > > --- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > > On 1 Nov 2016, at 10:15, Nirav Patel wrote: > >

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
el before passing it through the next step. > > > --- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > >

Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-01 Thread Nirav Patel
I am running a classification model. With a normal training-test split I can check model accuracy and F1 score using MulticlassClassificationEvaluator. How can I do this with the CrossValidation approach? Afaik, you fit the entire sample data in CrossValidator as you don't want to leave out any observation fro

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
you > may want to automate model evaluations, but that's a different story. > > Not sure if I could explain properly, please feel free to comment. > On 1 Nov 2016 22:54, "Nirav Patel" wrote: > >> Yes, I do apply NaiveBayes after IDF . >> >> " you can re-

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
uation" work flow > typically in lower frequency than Re-Training process. > > On Wed, Nov 2, 2016 at 5:48 AM, Nirav Patel wrote: > >> Hi Ayan, >> After deployment, we might re-train it every month. That is whole >> different problem I have explored yet. classific

Re: Spark ML - Is IDF model reusable

2016-11-01 Thread Nirav Patel
s a good thing :) > > On Wed, Nov 2, 2016 at 10:04 AM, Nirav Patel > wrote: > >> Hi Ayan, >> >> "classification algorithm will for sure need to Fit against new dataset >> to produce new model" I said this in context of re-training the model. Is >&

Re: Spark ML - CrossValidation - How to get Evaluation metrics of best model

2016-11-02 Thread Nirav Patel
ossValidator, in order to get an unbiased estimate of the best model's > performance. > > On Tue, Nov 1, 2016 at 12:10 PM Nirav Patel wrote: > >> I am running classification model. with normal training-test split I can >> check model accuracy and F1 score using Multiclas
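
A hedged sketch of that suggestion (`dataset` is a placeholder DataFrame with the usual label/features columns; the estimator and grid are illustrative): hold out a test set before cross-validation, then score the selected model on it.

    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val Array(train, test) = dataset.randomSplit(Array(0.8, 0.2), seed = 1)

    val nb = new NaiveBayes()
    val evaluator = new MulticlassClassificationEvaluator().setMetricName("f1")
    val cv = new CrossValidator()
      .setEstimator(nb)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(new ParamGridBuilder().addGrid(nb.smoothing, Array(0.5, 1.0)).build())
      .setNumFolds(3)

    val cvModel = cv.fit(train)                               // selects the best params by CV on train
    val testF1 = evaluator.evaluate(cvModel.transform(test))  // unbiased estimate for the best model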

Spark ML - Is it rule of thumb that all Estimators should only be Fit on Training data

2016-11-02 Thread Nirav Patel
It is very clear that for ML algorithms (classification, regression) the Estimator only fits on training data, but it's not very clear for other estimators like IDF, for example. IDF is a feature transformation model, but having an IDF estimator and transformer makes it a little confusing as to what exactly

Spark ML - Naive Bayes - how to select Threshold values

2016-11-07 Thread Nirav Patel
A few questions about the `thresholds` parameter: This is what the doc says: "Param for Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values >= 0. The class with largest value p/t is predicted, where p i

spark ml - ngram - how to preserve single word (1-gram)

2016-11-08 Thread Nirav Patel
Is it possible to preserve single tokens while using the n-gram feature transformer? e.g. Array("Hi", "I", "heard", "about", "Spark") becomes Array("Hi", "i", "heard", "about", "Spark", "Hi i", "I heard", "heard about", "about Spark") Currently if I want to do it I will have to manually transform c
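
A hedged sketch of the manual combination alluded to here (`tokenized` is a placeholder DataFrame with an array column "tokens"): run NGram with n = 2 and concatenate the original tokens with the generated bigrams in a small UDF.

    import org.apache.spark.ml.feature.NGram
    import org.apache.spark.sql.functions.{col, udf}

    val bigrams = new NGram().setN(2).setInputCol("tokens").setOutputCol("bigrams")

    // Concatenate the original 1-grams with the generated 2-grams.
    val combine = udf((uni: Seq[String], bi: Seq[String]) => uni ++ bi)

    val withAllGrams = bigrams.transform(tokenized)
      .withColumn("uniAndBigrams", combine(col("tokens"), col("bigrams")))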

Spark SQL UDF - passing map as a UDF parameter

2016-11-14 Thread Nirav Patel
I am trying to use the following API from functions to convert a map into a column so I can pass it to a UDF. map(cols: Column*): Column "Crea

Re: Spark SQL UDF - passing map as a UDF parameter

2016-11-15 Thread Nirav Patel
to 3).map(i => (i, 0)) > map(rdd.collect.flatMap(x => x._1 :: x._2 :: Nil).map(lit _): _*) > > // maropu > > On Tue, Nov 15, 2016 at 9:33 AM, Nirav Patel > wrote: > >> I am trying to use following API from Functions to convert a map into >> column so I can pass it to

Dynamic scheduling not respecting spark.executor.cores

2017-01-03 Thread Nirav Patel
When enabling dynamic scheduling I see that all executors are using only 1 core even if I specify "spark.executor.cores" as 6. If dynamic scheduling is disabled then each executor will have 6 cores. I have tested this against spark 1.5. I wonder if this is the same behavior with 2.x as well. Than
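
For reference, a hedged sketch of the settings involved (values are illustrative): with dynamic allocation on, executor size is still taken from spark.executor.cores / spark.executor.memory, and the external shuffle service must be enabled on YARN.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")   // required for dynamic allocation on YARN
      .set("spark.executor.cores", "6")
      .set("spark.executor.memory", "15g")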

Re: Dynamic Allocation not respecting spark.executor.cores

2017-01-04 Thread Nirav Patel
If this is not expected behavior then it should be logged as an issue. On Tue, Jan 3, 2017 at 2:51 PM, Nirav Patel wrote: > When enabling dynamic scheduling I see that all executors are using only 1 > core even if I specify "spark.executor.cores" to 6. If dynamic schedul
