Re: Problem getting program to run on 15TB input

2015-06-09 Thread Arun Luthra
usage of spark. > > > > @Arun, can you kindly confirm if Daniel’s suggestion helped your usecase? > > > > Thanks, > > > > Kapil Malik | kma...@adobe.com | 33430 / 8800836581 > > > > *From:* Daniel Mahler [mailto:dmah...@gmail.com] > *Sent:* 13 April 2

Missing values support in Mllib yet?

2015-06-19 Thread Arun Luthra
Hi, Is there any support for handling missing values in mllib yet, especially for decision trees where this is a natural feature? Arun

Spark launching without all of the requested YARN resources

2015-06-23 Thread Arun Luthra
Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark (via spark-submit) will begin its processing even though it apparently did not get all of the requested resources; it is running very slowly. Is there a way to force Spark/YARN to only begin when it has the full set of resource
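
A sketch of the two scheduler settings usually suggested for this, assuming a 1.x-era release (the property names are standard Spark configuration; exact defaults and whether the wait time accepts a "300s"-style suffix vary by version):

    import org.apache.spark.{SparkConf, SparkContext}

    // Make the scheduler wait for all requested executors (ratio 1.0),
    // for up to 5 minutes, before launching the first stage.
    val conf = new SparkConf()
      .setAppName("WaitForResources")
      .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "300s")
    val sc = new SparkContext(conf)

The same two properties can be passed as --conf flags to spark-submit instead of being set in code.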

Re: Spark launching without all of the requested YARN resources

2015-07-02 Thread Arun Luthra
> start as early as possible should make it complete earlier and increase the >> utilization of resources >> >> On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra >> wrote: >> >>> Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark >>&

How to change hive database?

2015-07-07 Thread Arun Luthra
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext I'm getting org.apache.spark.sql.catalyst.analysis.NoSuchTableException from: val dataframe = hiveContext.table("other_db.mytable") Do I have to change current database to access it? Is it possible to

Re: How to change hive database?

2015-07-08 Thread Arun Luthra
Thanks, it works. On Tue, Jul 7, 2015 at 11:15 AM, Ted Yu wrote: > See this thread http://search-hadoop.com/m/q3RTt0NFls1XATV02 > > Cheers > > On Tue, Jul 7, 2015 at 11:07 AM, Arun Luthra > wrote: > >> >> https://spark.apache.org
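
Presumably the fix from the linked thread is along these lines — a sketch assuming a Spark 1.x HiveContext and the database name from the original post:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    // Either switch the session's current database first...
    hiveContext.sql("USE other_db")
    val df1 = hiveContext.table("mytable")
    // ...or qualify the table inside a SQL statement instead of table()
    val df2 = hiveContext.sql("SELECT * FROM other_db.mytable")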

How to ignore features in mllib

2015-07-09 Thread Arun Luthra
Is it possible to ignore features in mllib? In other words, I would like to have some 'pass-through' data, Strings for example, attached to training examples and test data. A related stackoverflow question: http://stackoverflow.com/questions/30739283/spark-mllib-how-to-ignore-features-when-trainin

groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
Here is the offending line: val some_rdd = pair_rdd.groupByKey().flatMap { case (mk: MyKey, md_iter: Iterable[MyData]) => { ... [error] .scala:249: overloaded method value groupByKey with alternatives: [error] [K](func: org.apache.spark.api.java.function.MapFunction[(aaa.MyKey, aaa.My

Re: groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
r dataset or an unexpected implicit conversion. > Just add rdd() before the groupByKey call to push it into an RDD. That > being said - groupByKey generally is an anti-pattern so please be careful > with it. > > On Wed, Aug 10, 2016 at 8:07 PM, Arun Luthra > wrote: > >> H
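
A sketch of the suggested fix, assuming pair_rdd became a Dataset after the 2.0 upgrade (MyKey and MyData are the types from the original post; the flatMap body is a placeholder):

    // .rdd drops back to an RDD so that PairRDDFunctions.groupByKey (the 1.x
    // signature) is resolved instead of Dataset.groupByKey, which expects a
    // key-extractor function.
    val some_rdd = pair_rdd.rdd.groupByKey().flatMap {
      case (mk: MyKey, mdIter: Iterable[MyData]) =>
        mdIter.map(md => (mk, md)) // placeholder body
    }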

Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-15 Thread Arun Luthra
I got this OOM error in Spark local mode. The error seems to have been at the start of a stage (all of the stages on the UI showed as complete; there were more stages to do but they had not shown up on the UI yet). There appears to be ~100G of free memory at the time of the error. Spark 2.0.0 200G dr

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-18 Thread Arun Luthra
ainst me? What if I manually split them up into numerous Map variables? On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra wrote: > I got this OOM error in Spark local mode. The error seems to have been at > the start of a stage (all of the stages on the UI showed as complete, there > were

groupByKey does not work?

2016-01-04 Thread Arun Luthra
I tried groupByKey and noticed that it did not group all values into the same group. In my test dataset (a Pair rdd) I have 16 records, where there are only 4 distinct keys, so I expected there to be 4 records in the groupByKey object, but instead there were 8. Each of the 4 distinct keys appear 2

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
see that each key is repeated 2 times but each key should only appear once. Arun On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu wrote: > Can you give a bit more information ? > > Release of Spark you're using > Minimal dataset that shows the problem > > Cheers > > On M

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
ues in object > equality. > > On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra wrote: > >> Spark 1.5.0 >> >> data: >> >> p1,lo1,8,0,4,0,5,20150901|5,1,1.0 >> p1,lo2,8,0,4,0,5,20150901|5,1,1.0 >> p1,lo3,8,0,4,0,5,20150901|5,
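
The root cause isn't visible in the truncated reply, but one common way to hit this is keying on a type without value equality, such as Array[String]; a sketch of that particular fix, with hypothetical names:

    // Arrays compare by reference, so two identical-looking Array keys land in
    // different groups. Convert the key to a String (or tuple / case class) first.
    val grouped = pairRdd
      .map { case (key: Array[String], value) => (key.mkString("|"), value) }
      .groupByKey()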

TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
Example warning: 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1, partition: 2168, attempt: 4436 Is there a solution for this? Increase driver memory? I'm using just 1G driver memory but ideally I w

MemoryStore: Not enough space to cache broadcast_N in memory

2016-01-21 Thread Arun Luthra
WARN MemoryStore: Not enough space to cache broadcast_4 in memory! (computed 60.2 MB so far) WARN MemoryStore: Persisting block broadcast_4 to disk instead. Can I increase the memory allocation for broadcast variables? I have a few broadcast variables that I create with sc.broadcast() . Are thes
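
As far as I know there is no broadcast-specific memory setting; broadcast blocks live in the executors' storage memory, so the relevant knobs are executor memory and the memory-fraction settings — a sketch assuming Spark 1.6+ (older releases use the legacy spark.storage.memoryFraction instead), with hypothetical values:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")
      .set("spark.memory.fraction", "0.75")        // execution + storage pool (1.6+ unified memory)
      .set("spark.memory.storageFraction", "0.5")  // share of that pool shielded for storage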

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
you are performing? > > On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra > wrote: > >> Example warning: >> >> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID >> 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
this exception because the coordination does not get triggered in > non save/write operations. > > On Thu, Jan 21, 2016 at 2:46 PM Holden Karau wrote: > >> Before we dig too far into this, the thing which most quickly jumps out >> to me is groupByKey which could be causing some p

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
that the TaskCommitDenied is perhaps a red herring and the > problem is groupByKey - but I've also just seen a lot of people be bitten > by it so that might not be the issue. If you just do a count at the point of > the groupByKey does the pipeline succeed? > > On Thu, Jan 21, 2016

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
ront in my mind. On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra wrote: > Looking into the yarn logs for a similar job where an executor was > associated with the same error, I find: > > ... > 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive > connection to (SERVE

Re: TaskCommitDenied (Driver denied task commit)

2016-01-22 Thread Arun Luthra
, Jan 21, 2016 at 6:19 PM, Arun Luthra wrote: > Two changes I made that appear to be keeping various errors at bay: > > 1) bumped up spark.yarn.executor.memoryOverhead to 2000 in the spirit of > https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3ccacbyxkld8qasymj2ghk_
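
For reference, a sketch of the first change (the overhead is in MB; it can equally be passed as a --conf flag to spark-submit):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "2000") // MB of off-heap headroom per executor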

Want 1-1 map between input files and output files in map-only job

2015-11-19 Thread Arun Luthra
Hello, Is there some technique for guaranteeing that there is a 1-1 correspondence between the input files and the output files? For example if my input directory has files called input001.txt, input002.txt, ... etc. I would like Spark to generate output files named something like part-1, part

types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
What types of RDD can saveAsObjectFile(path) handle? I tried a naive test with an RDD[Array[String]], but when I tried to read back the result with sc.objectFile(path).take(5).foreach(println), I got a non-promising output looking like: [Ljava.lang.String;@46123a [Ljava.lang.String;@76123b [Ljava.

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
Ah, yes, that did the trick. So more generally, can this handle any serializable object? On Thu, Aug 27, 2015 at 2:11 PM, Jonathan Coveney wrote: > Array[String] doesn't pretty print by default. Use .mkString(",") for > example > > > On Thursday, August 27
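
A sketch of reading the output back with readable printing (hypothetical path). As far as I know, saveAsObjectFile uses standard Java serialization, so any java.io.Serializable type should round-trip; the unreadable output above is just Array's default toString:

    sc.objectFile[Array[String]]("/path/to/output")
      .take(5)
      .foreach(arr => println(arr.mkString(",")))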

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-23 Thread Arun Luthra
Splitting up the Maps to separate objects did not help. However, I was able to work around the problem by reimplementing it with RDD joins. On Aug 18, 2016 5:16 PM, "Arun Luthra" wrote: > This might be caused by a few large Map objects that Spark is trying to > serializ
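
A sketch of that kind of rewrite, with hypothetical stand-in data: the large Map becomes a pair RDD and the per-record lookup becomes a join, so nothing huge has to be serialized into every task:

    import org.apache.spark.rdd.RDD

    // bigMap stands in for the previously broadcast Map; dataRdd for the records keyed the same way.
    val bigMap: Map[String, String] = Map("k1" -> "meta1", "k2" -> "meta2")
    val dataRdd: RDD[(String, String)] = sc.parallelize(Seq(("k1", "rec1"), ("k2", "rec2")))

    val lookupRdd: RDD[(String, String)] = sc.parallelize(bigMap.toSeq)
    val joined: RDD[(String, (String, String))] = dataRdd.join(lookupRdd) // (key, (record, meta))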

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-24 Thread Arun Luthra
Also for the record, turning on kryo was not able to help. On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra wrote: > Splitting up the Maps to separate objects did not help. > > However, I was able to work around the problem by reimplementing it with > RDD joins. > > On Aug 18, 2

GC problem doing fuzzy join

2019-06-18 Thread Arun Luthra
I'm trying to do a brute force fuzzy join where I compare N records against N other records, for N^2 total comparisons. The table is medium size and fits in memory, so I collect it and put it into a broadcast variable. The other copy of the table is in an RDD. I am basically calling the RDD map o
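
A minimal sketch of that pattern, with a hypothetical record type and similarity test — the N^2 candidate generation against the broadcast table is where most of the allocation (and hence GC) pressure comes from, so it helps to emit only small values such as ids:

    case class Rec(id: Long, text: String)
    // Hypothetical stand-in for the real similarity test.
    def isFuzzyMatch(a: Rec, b: Rec): Boolean = a.text.take(3) == b.text.take(3)

    val small: Array[Rec] = smallRdd.collect()   // the side that fits in memory
    val bSmall = sc.broadcast(small)

    val pairs = bigRdd.flatMap { rec =>
      bSmall.value.iterator
        .filter(other => isFuzzyMatch(rec, other))
        .map(other => (rec.id, other.id))        // keep only ids to limit object churn
    }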

Spark job ends abruptly during setup without error message

2015-02-05 Thread Arun Luthra
While a spark-submit job is setting up, the yarnAppState goes into Running mode, then I get a flurry of typical looking INFO-level messages such as INFO BlockManagerMasterActor: ... INFO YarnClientSchedulerBackend: Registered executor: ... Then, spark-submit quits without any error message and I

Re: Spark job ends abruptly during setup without error message

2015-02-05 Thread Arun Luthra
the job from your local machine or on the driver > machine.? > > Have you set YARN_CONF_DIR. > > On Thu, Feb 5, 2015 at 10:43 PM, Arun Luthra > wrote: > >> While a spark-submit job is setting up, the yarnAppState goes into >> Running mode, then I get a flurry of ty

Open file limit settings for Spark on Yarn job

2015-02-10 Thread Arun Luthra
Hi, I'm running Spark on Yarn from an edge node, and the tasks run on the Data Nodes. My job fails with the "Too many open files" error once it gets to groupByKey(). Alternatively I can make it fail immediately if I repartition the data when I create the RDD. Where do I need to make sure that uli

Re: Open file limit settings for Spark on Yarn job

2015-02-11 Thread Arun Luthra
5 at 11:41 PM, Felix C wrote: > Alternatively, is there another way to do it? > groupByKey has been called out as expensive and should be avoid (it causes > shuffling of data). > > I've generally found it possible to use reduceByKey instead > > --- Original Message
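
Beyond raising the nofile ulimit on the Data Nodes, the reduceByKey suggestion looks roughly like this when the grouped values only get aggregated anyway (hypothetical pair RDD of counts):

    // groupByKey shuffles every value; reduceByKey combines map-side first, so
    // far less data is shuffled and fewer spill files are opened per task.
    val sums = pairRdd.reduceByKey(_ + _)
    // instead of: pairRdd.groupByKey().mapValues(_.sum)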

Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-02-26 Thread Arun Luthra
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 I tried this as a workaround: import org.apache.spark.scheduler._ import org.roaringbitmap._ ... kryo.register(classOf[org.roaringbitmap.RoaringBitmap]) kryo.register(classOf[org.roaringbitmap.RoaringArray]) kryo.r

Problem getting program to run on 15TB input

2015-02-27 Thread Arun Luthra
My program in pseudocode looks like this: val conf = new SparkConf().setAppName("Test") .set("spark.storage.memoryFraction","0.2") // default 0.6 .set("spark.shuffle.memoryFraction","0.12") // default 0.2 .set("spark.shuffle.manager","SORT") // preferred setting for optimized

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
EAP` (and use Tachyon). > > Burak > > On Fri, Feb 27, 2015 at 11:39 AM, Arun Luthra > wrote: > >> My program in pseudocode looks like this: >> >> val conf = new SparkConf().setAppName("Test") >> .set("spark.storage.memoryFrac

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
helped. > > Pawel Szulc > http://rabbitonweb.com > > Sat., 28 Feb 2015, 6:22 PM, user Arun Luthra > wrote: > > So, actually I am removing the persist for now, because there is >> significant filtering that happens after calling textFile()... but I will >> keep that o

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
A correction to my first post: There is also a repartition right before groupByKey to help avoid too-many-open-files error: rdd2.union(rdd1).map(...).filter(...).repartition(15000).groupByKey().map(...).flatMap(...).saveAsTextFile() On Sat, Feb 28, 2015 at 11:10 AM, Arun Luthra wrote: >

Re: Problem getting program to run on 15TB input

2015-02-28 Thread Arun Luthra
y phase? Are you using a profiler or do you base that assumption >> only on logs? >> >> Sat., 28 Feb 2015, 8:12 PM, user Arun Luthra >> wrote: >> >> A correction to my first post: >>> >>> There is also a repartition right before groupByKey to he

Re: Problem getting program to run on 15TB input

2015-03-01 Thread Arun Luthra
-memory 15g \ --executor-memory 110g \ --executor-cores 32 \ But then, reduceByKey() fails with: java.lang.OutOfMemoryError: Java heap space On Sat, Feb 28, 2015 at 12:09 PM, Arun Luthra wrote: > The Spark UI names the line number and name of the operation (repartition > in this case) t

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-02 Thread Arun Luthra
I think this is a Java vs scala syntax issue. Will check. On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra wrote: > Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949 > > I tried this as a workaround: > > import org.apache.spark.scheduler._ > import

Re: Problem getting program to run on 15TB input

2015-03-02 Thread Arun Luthra
Everything works smoothly if I do the 99%-removal filter in Hive first. So the garbage-collection load from that discarded data was what was breaking it. Is there a way to filter() out 99% of the data without having to garbage collect 99% of the RDD? On Sun, Mar 1, 2015 at 9:56 AM, Arun Luthra wrote: > I trie
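
A sketch of keeping that filter in the read itself, so the discarded 99% never becomes JVM objects that later have to be collected (hypothetical table and column names):

    val kept = hiveContext.sql(
      "SELECT key_col, val_col FROM source_table WHERE keep_flag = 1")
    val rdd = kept.rdd.map(row => (row.getString(0), row.getString(1)))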

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-10 Thread Arun Luthra
[T], not T[], so you want to use > something: > > kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]]) > kryo.register(classOf[Array[Short]]) > > nonetheless, the spark should take care of this itself. I'll look into it > later today. > > > On Mon,

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-12 Thread Arun Luthra
'd need a lot more info to help > > On Tue, Mar 10, 2015 at 6:54 PM, Arun Luthra > wrote: > >> Does anyone know how to get the HighlyCompressedMapStatus to compile? >> >> I will try turning off kryo in 1.2.0 and hope things don't break. I want >> to ben

Re: Workaround for spark 1.2.X roaringbitmap kryo problem?

2015-03-12 Thread Arun Luthra
sedMapStatus") >> res1: Class[_] = class >> org.apache.spark.scheduler.HighlyCompressedMapStatus > > > > hope this helps, > Imran > > > On Thu, Mar 12, 2015 at 12:47 PM, Arun Luthra > wrote: > >> I'm using

Efficient saveAsTextFile by key, directory for each key?

2015-04-21 Thread Arun Luthra
Is there an efficient way to save an RDD with saveAsTextFile in such a way that the data gets shuffled into separated directories according to a key? (My end goal is to wrap the result in a multi-partitioned Hive table) Suppose you have: case class MyData(val0: Int, val1: String, directory_name:

Re: Efficient saveAsTextFile by key, directory for each key?

2015-04-22 Thread Arun Luthra
SPARK-3007 On Tue, Apr 21, 2015 at 5:45 PM, Arun Luthra wrote: > Is there an efficient way to save an RDD with saveAsTextFile in such a way > that the data gets shuffled into separated directories according to a key? > (My end goal is to wrap the result in a multi-partitioned Hive table
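
One approach often suggested for the RDD side of this is a custom MultipleTextOutputFormat used with saveAsHadoopFile — a sketch, with a hypothetical output path and line formatting built from the MyData fields above:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class KeyDirectoryOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      // Route each record into a sub-directory named after its key.
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      // Drop the key itself from the written lines.
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    }

    myDataRdd
      .map(d => (d.directory_name, s"${d.val0},${d.val1}"))
      .saveAsHadoopFile("/output/base", classOf[String], classOf[String],
        classOf[KeyDirectoryOutputFormat])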

PrintWriter error in foreach

2014-09-10 Thread Arun Luthra
I have a spark program that worked in local mode, but throws an error in yarn-client mode on a cluster. On the edge node in my home directory, I have an output directory (called transout) which is ready to receive files. The spark job I'm running is supposed to write a few hundred files into that d

Re: PrintWriter error in foreach

2014-09-10 Thread Arun Luthra
th to the file you want to write, and make sure the > directory exists and is writable by the Spark process. > > On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra > wrote: > >> I have a spark program that worked in local mode, but throws an error in >> yarn-client mode on a clus
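
A sketch of the distinction the reply is pointing at, with hypothetical paths: code inside rdd.foreach runs on the executors' hosts, so a relative path like transout/ resolves there rather than on the edge node; writing from the driver after bringing back a (small) result avoids that.

    import java.io.PrintWriter

    // Driver-side write: only reasonable when the collected data is small.
    val lines: Array[String] = resultRdd.collect()
    val writer = new PrintWriter("/home/me/transout/output.txt") // absolute, driver-local path
    try lines.foreach(writer.println) finally writer.close()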

How to time Spark SQL statement?

2014-09-17 Thread Arun Luthra
I'm doing a Spark SQL benchmark similar to the code in https://spark.apache.org/docs/latest/sql-programming-guide.html (section: Inferring the Schema Using Reflection). What's the simplest way to time the SQL statement itself, so that I'm not timing the .map(_.split(",")).map(p => Person(p(0), p(
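
One simple way to isolate the query, assuming the people RDD is parsed and registered as a temp table as in the guide: force the parsing first, then time only the action that actually runs the SQL.

    people.cache().count()   // materialize the split/Person conversion up front

    val t0 = System.nanoTime()
    val rows = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19").collect()
    val elapsedMs = (System.nanoTime() - t0) / 1e6
    println(f"query + collect took $elapsedMs%.1f ms")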

How to configure build.sbt for Spark 1.2.0

2014-10-08 Thread Arun Luthra
I built Spark 1.2.0 successfully, but was unable to build my Spark program under 1.2.0 with sbt assembly and my build.sbt file. It contains: I tried: "org.apache.spark" %% "spark-sql" % "1.2.0", "org.apache.spark" %% "spark-core" % "1.2.0", and "org.apache.spark" %% "spark-sql" % "1.2.0

Re: How to configure build.sbt for Spark 1.2.0

2014-10-08 Thread Arun Luthra
park (1.2.0-SNAPSHOT), > you'll need to first build spark and "publish-local" so your application > build can find those SNAPSHOTs in your local repo. > > Just append "publish-local" to your sbt command where you build Spark. > > -Pat > > > > On
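
A build.sbt sketch for the released 1.2.0 artifacts (for a 1.2.0-SNAPSHOT you first need sbt publish-local on the Spark build, as the reply says, so the SNAPSHOT jars exist in the local repository); the Scala version is an assumption:

    name := "my-spark-app"

    scalaVersion := "2.10.4"

    // "provided" keeps Spark out of the sbt-assembly fat jar, since spark-submit
    // supplies it at runtime.
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
      "org.apache.spark" %% "spark-sql"  % "1.2.0" % "provided"
    )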

rack-topology.sh no such file or directory

2014-11-19 Thread Arun Luthra
null -name "rack-topology"). Any possibly solution? Arun Luthra

Re: rack-topology.sh no such file or directory

2014-11-25 Thread Arun Luthra
te.xml > and remove the setting for a rack topology script there (or it might be in > core-site.xml). > > Matei > > On Nov 19, 2014, at 12:13 PM, Arun Luthra wrote: > > I'm trying to run Spark on Yarn on a hortonworks 2.1.5 cluster. I'm > get

SQL query in scala API

2014-12-03 Thread Arun Luthra
I'm wondering how to do this kind of SQL query with PairRDDFunctions. SELECT zip, COUNT(user), COUNT(DISTINCT user) FROM users GROUP BY zip In the Spark Scala API, I can make an RDD (called "users") of key-value pairs where the keys are zip (as in ZIP code) and the values are user IDs. Then I ca

Re: SQL query in scala API

2014-12-04 Thread Arun Luthra
Is that Spark SQL? I'm wondering if it's possible without spark SQL. On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian wrote: > You may do this: > > table("users").groupBy('zip)('zip, count('user), countDistinct('user)) > > On 12/4/14 8:47 AM

Re: SQL query in scala API

2014-12-06 Thread Arun Luthra
(count + 1, seen + user) > }, { case ((count0, seen0), (count1, seen1)) => > (count0 + count1, seen0 ++ seen1) > }).mapValues { case (count, seen) => > (count, seen.size) > } > > On 12/5/14 3:47 AM, Arun Luthra wrote: > > Is that Spark SQL? I'm wonderi
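
Pieced together, the non-SQL version quoted above amounts to the following, assuming users is an RDD[(String, String)] of (zip, userId) pairs:

    import org.apache.spark.rdd.RDD

    val users: RDD[(String, String)] =
      sc.parallelize(Seq(("94110", "u1"), ("94110", "u2"), ("94110", "u1"), ("10001", "u3")))

    // COUNT(user) and COUNT(DISTINCT user) per zip, using only PairRDDFunctions.
    val stats = users.aggregateByKey((0L, Set.empty[String]))(
      { case ((count, seen), user) => (count + 1, seen + user) },
      { case ((count0, seen0), (count1, seen1)) => (count0 + count1, seen0 ++ seen1) }
    ).mapValues { case (count, seen) => (count, seen.size) } // (zip, (count, distinctCount))

For very high-cardinality keys the per-key Set can get large; countApproxDistinctByKey trades exactness for bounded memory if that becomes a problem.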