usage of spark.
>
> @Arun, can you kindly confirm if Daniel’s suggestion helped your use case?
>
> Thanks,
> Kapil Malik
Hi,
Is there any support for handling missing values in MLlib yet, especially
for decision trees, where this is a natural feature?
Arun
Sometimes, if my Hortonworks YARN-enabled cluster is fairly busy, Spark (via
spark-submit) will begin its processing even though it apparently did not
get all of the requested resources, and it runs very slowly.
Is there a way to force Spark/YARN to only begin when it has the full set
of resources?
> start as early as possible should make it complete earlier and increase the
>> utilization of resources
>>
>> On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra
>> wrote:
>>
>>> Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark
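For what it's worth, a hedged sketch of the settings that control how long the driver waits for executors to register before scheduling, assuming they are available in your Spark/YARN version (the values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Wait until all requested executor resources have registered, or until the
// timeout expires, before the first tasks are scheduled.
val conf = new SparkConf()
  .setAppName("wait-for-resources-sketch")
  .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
  .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "120s") // older releases expect plain milliseconds, e.g. "120000"
val sc = new SparkContext(conf)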
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
I'm getting org.apache.spark.sql.catalyst.analysis.NoSuchTableException
from:
val dataframe = hiveContext.table("other_db.mytable")
Do I have to change the current database to access it? Is it possible to
Thanks, it works.
On Tue, Jul 7, 2015 at 11:15 AM, Ted Yu wrote:
> See this thread http://search-hadoop.com/m/q3RTt0NFls1XATV02
>
> Cheers
>
> On Tue, Jul 7, 2015 at 11:07 AM, Arun Luthra
> wrote:
>
>>
>> https://spark.apache.org
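For readers hitting the same exception, a minimal sketch of two workarounds, assuming a Spark 1.x HiveContext (the database and table names are the ones from the question):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Option 1: switch the current database, then load the table by its bare name
hiveContext.sql("USE other_db")
val df1 = hiveContext.table("mytable")

// Option 2: keep the current database and qualify the table in a SQL statement
val df2 = hiveContext.sql("SELECT * FROM other_db.mytable")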
Is it possible to ignore features in mllib? In other words, I would like to
have some 'pass-through' data, Strings for example, attached to training
examples and test data.
A related stackoverflow question:
http://stackoverflow.com/questions/30739283/spark-mllib-how-to-ignore-features-when-trainin
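One common workaround, sketched here with hypothetical field names: keep the pass-through data outside the feature vector, hand MLlib only the numeric part, and carry the extra fields alongside the predictions yourself.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// hypothetical record: 'id' is the pass-through String the model never sees
case class Record(id: String, label: Double, features: Array[Double])

val records = sc.parallelize(Seq(
  Record("row-1", 1.0, Array(0.2, 3.1)),
  Record("row-2", 0.0, Array(1.4, 0.7))))

// only the numeric features are used for training
val training = records.map(r => LabeledPoint(r.label, Vectors.dense(r.features)))

// at scoring time, keep the pass-through field next to each prediction
// (model stands for whatever was trained, e.g. a DecisionTreeModel)
// val scored = records.map(r => (r.id, model.predict(Vectors.dense(r.features))))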
Here is the offending line:
val some_rdd = pair_rdd.groupByKey().flatMap {
  case (mk: MyKey, md_iter: Iterable[MyData]) => {
  ...
[error] .scala:249: overloaded method value groupByKey with
alternatives:
[error] [K](func:
org.apache.spark.api.java.function.MapFunction[(aaa.MyKey, aaa.My
r dataset or an unexpected implicit conversion.
> Just add rdd() before the groupByKey call to push it into an RDD. That
> being said - groupByKey generally is an anti-pattern so please be careful
> with it.
>
> On Wed, Aug 10, 2016 at 8:07 PM, Arun Luthra
> wrote:
>
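A sketch of that change, assuming pair_rdd is in fact a Dataset[(MyKey, MyData)] rather than an RDD, which is what the overloaded groupByKey error suggests (the flatMap body is a placeholder):

// .rdd drops to an RDD so the PairRDDFunctions version of groupByKey applies
val some_rdd = pair_rdd.rdd
  .groupByKey()
  .flatMap { case (mk, md_iter) =>
    md_iter.map(md => (mk, md)) // placeholder; keep the original body here
  }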
I got this OOM error in Spark local mode. The error seems to have been at
the start of a stage (all of the stages on the UI showed as complete, there
were more stages to do but had not showed up on the UI yet).
There appears to be ~100G of free memory at the time of the error.
Spark 2.0.0
200G dr
ainst me?
What if I manually split them up into numerous Map variables?
On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra wrote:
> I got this OOM error in Spark local mode. The error seems to have been at
> the start of a stage (all of the stages on the UI showed as complete, there
> were
I tried groupByKey and noticed that it did not group all values into the
same group.
In my test dataset (a Pair rdd) I have 16 records, where there are only 4
distinct keys, so I expected there to be 4 records in the groupByKey
object, but instead there were 8. Each of the 4 distinct keys appear 2
see that each key is repeated 2 times but each key should only
appear once.
Arun
On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu wrote:
> Can you give a bit more information ?
>
> Release of Spark you're using
> Minimal dataset that shows the problem
>
> Cheers
>
> On M
ues in object
> equality.
>
> On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra wrote:
>
>> Spark 1.5.0
>>
>> data:
>>
>> p1,lo1,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo2,8,0,4,0,5,20150901|5,1,1.0
>> p1,lo3,8,0,4,0,5,20150901|5,
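A small sketch of the equality pitfall being hinted at: groupByKey groups by equals/hashCode, so a key type that compares by reference silently produces duplicate groups, while a case class (or String/tuple) key compares by value.

// a plain class without equals/hashCode: identical-looking keys form separate groups
class BadKey(val p: String) extends Serializable
val bad = sc.parallelize(Seq((new BadKey("p1"), 1), (new BadKey("p1"), 2)))
println(bad.groupByKey().count()) // 2, not 1

// a case class key compares by value, so the two records end up in one group
case class GoodKey(p: String)
val good = sc.parallelize(Seq((GoodKey("p1"), 1), (GoodKey("p1"), 2)))
println(good.groupByKey().count()) // 1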
Example warning:
16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,
partition: 2168, attempt: 4436
Is there a solution for this? Increase driver memory? I'm using just 1G
driver memory but ideally I w
WARN MemoryStore: Not enough space to cache broadcast_4 in memory!
(computed 60.2 MB so far)
WARN MemoryStore: Persisting block broadcast_4 to disk instead.
Can I increase the memory allocation for broadcast variables?
I have a few broadcast variables that I create with sc.broadcast(). Are
thes
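A sketch of the knobs that usually matter here, assuming a 1.x build with the legacy memory manager (sizes are illustrative): broadcasts are cached in storage memory, so either grow the heap or grow storage's share of it.

val conf = new SparkConf()
  .set("spark.executor.memory", "8g")         // hypothetical size: more heap per executor
  .set("spark.storage.memoryFraction", "0.6") // storage's share of the heap (the 1.x default)
// driver-side broadcast blocks live in the driver heap, usually raised with
// the --driver-memory flag at submit time rather than in code.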
you are performing?
>
> On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra
> wrote:
>
>> Example warning:
>>
>> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID
>> 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,
this exception because the coordination does not get triggered in
> non save/write operations.
>
> On Thu, Jan 21, 2016 at 2:46 PM Holden Karau wrote:
>
>> Before we dig too far into this, the thing which most quickly jumps out
>> to me is groupByKey which could be causing some p
hat the TaskCommitDenied is perhaps a red herring and the
> problem is groupByKey - but I've also just seen a lot of people be bitten
> by it so that might not be issue. If you just do a count at the point of
> the groupByKey does the pipeline succeed?
>
> On Thu, Jan 21, 2016
ront in my mind.
On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra wrote:
> Looking into the yarn logs for a similar job where an executor was
> associated with the same error, I find:
>
> ...
> 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive
> connection to (SERVE
, Jan 21, 2016 at 6:19 PM, Arun Luthra wrote:
> Two changes I made that appear to be keeping various errors at bay:
>
> 1) bumped up spark.yarn.executor.memoryOverhead to 2000 in the spirit of
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3ccacbyxkld8qasymj2ghk_
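For anyone following along, a short sketch of where that first change goes, assuming YARN mode on Spark 1.x (2000 is the value mentioned above):

val conf = new SparkConf()
  .set("spark.yarn.executor.memoryOverhead", "2000") // MB of off-heap headroom added to each executor's YARN container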
Hello,
Is there some technique for guaranteeing that there is a 1-1 correspondence
between the input files and the output files? For example if my input
directory has files called input001.txt, input002.txt, ... etc. I would
like Spark to generate output files named something like part-1,
part
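One hedged way to guarantee the correspondence, sketched with a hypothetical process function and paths: treat each input file as its own small job, so records from different files can never be shuffled together.

val inputs = Seq("input001.txt", "input002.txt") // hypothetical listing of the input directory
inputs.foreach { name =>
  sc.textFile(s"inputdir/$name")
    .map(line => process(line))                  // hypothetical per-line transform
    .coalesce(1)                                 // exactly one part file per input
    .saveAsTextFile(s"outputdir/$name.out")      // one output directory per input file
}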
What types of RDD can saveAsObjectFile(path) handle? I tried a naive test
with an RDD[Array[String]], but when I tried to read back the result with
sc.objectFile(path).take(5).foreach(println), I got a non-promising output
looking like:
[Ljava.lang.String;@46123a
[Ljava.lang.String;@76123b
[Ljava.
Ah, yes, that did the trick.
So more generally, can this handle any serializable object?
On Thu, Aug 27, 2015 at 2:11 PM, Jonathan Coveney
wrote:
> Array[String] doesn't pretty print by default. Use .mkString(",") for
> example
>
>
> On Thursday, August 27,
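A small sketch tying the two answers together (the path is illustrative): any serializable element type round-trips through saveAsObjectFile; the puzzling output was just Array's default toString.

val rdd = sc.parallelize(Seq(Array("a", "b"), Array("c", "d")))
rdd.saveAsObjectFile("objpath")

sc.objectFile[Array[String]]("objpath")
  .take(5)
  .foreach(arr => println(arr.mkString(","))) // prints "a,b" instead of [Ljava.lang.String;@...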
Splitting up the Maps to separate objects did not help.
However, I was able to work around the problem by reimplementing it with
RDD joins.
On Aug 18, 2016 5:16 PM, "Arun Luthra" wrote:
> This might be caused by a few large Map objects that Spark is trying to
> serializ
Also for the record, turning on kryo was not able to help.
On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra wrote:
> Splitting up the Maps to separate objects did not help.
>
> However, I was able to work around the problem by reimplementing it with
> RDD joins.
>
> On Aug 18, 2
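A rough sketch of that RDD-join reimplementation, with hypothetical names: instead of broadcasting a large Map and probing it inside map(), turn the Map into an RDD and join on the key.

// before: val big = sc.broadcast(bigMap); dataRdd.map(r => (r, big.value(r.key)))
val lookup = sc.parallelize(bigMap.toSeq) // RDD[(K, V)] built from the former Map
val joined = dataRdd
  .map(r => (r.key, r))                   // key the data the same way
  .join(lookup)                           // RDD[(K, (record, lookedUpValue))]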
I'm trying to do a brute force fuzzy join where I compare N records against
N other records, for N^2 total comparisons.
The table is medium size and fits in memory, so I collect it and put it
into a broadcast variable.
The other copy of the table is in an RDD. I am basically calling the RDD
map o
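For context, a sketch of that brute-force pattern with a hypothetical isFuzzyMatch predicate (tableA is the collected side, tableB the RDD side):

val smallSide = sc.broadcast(tableA.collect()) // the medium-sized table, materialized on the driver
val matches = tableB.flatMap { b =>
  smallSide.value.iterator
    .filter(a => isFuzzyMatch(a, b))           // hypothetical comparison; N^2 work overall
    .map(a => (a, b))
}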
While a spark-submit job is setting up, the yarnAppState goes into Running
mode, then I get a flurry of typical looking INFO-level messages such as
INFO BlockManagerMasterActor: ...
INFO YarnClientSchedulerBackend: Registered executor: ...
Then, spark-submit quits without any error message and I
the job from your local machine or on the driver
> machine?
>
> Have you set YARN_CONF_DIR?
>
> On Thu, Feb 5, 2015 at 10:43 PM, Arun Luthra
> wrote:
>
>> While a spark-submit job is setting up, the yarnAppState goes into
>> Running mode, then I get a flurry of ty
Hi,
I'm running Spark on YARN from an edge node, and the tasks run on the Data
Nodes. My job fails with the "Too many open files" error once it gets to
groupByKey(). Alternatively I can make it fail immediately if I repartition
the data when I create the RDD.
Where do I need to make sure that uli
5 at 11:41 PM, Felix C wrote:
> Alternatively, is there another way to do it?
> groupByKey has been called out as expensive and should be avoided (it causes
> shuffling of data).
>
> I've generally found it possible to use reduceByKey instead
>
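A sketch of the swap being suggested, assuming the grouping only fed a per-key aggregation (pairRdd is hypothetical): reduceByKey combines values within each partition before the shuffle, so far fewer records (and spill files) are moved.

// groupByKey ships every value for a key across the network before aggregating...
val viaGroup = pairRdd.groupByKey().mapValues(_.size)

// ...while reduceByKey pre-aggregates map-side and shuffles only partial results
val viaReduce = pairRdd.mapValues(_ => 1).reduceByKey(_ + _)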
Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949
I tried this as a workaround:
import org.apache.spark.scheduler._
import org.roaringbitmap._
...
kryo.register(classOf[org.roaringbitmap.RoaringBitmap])
kryo.register(classOf[org.roaringbitmap.RoaringArray])
kryo.r
My program in pseudocode looks like this:
val conf = new SparkConf().setAppName("Test")
.set("spark.storage.memoryFraction","0.2") // default 0.6
.set("spark.shuffle.memoryFraction","0.12") // default 0.2
.set("spark.shuffle.manager","SORT") // preferred setting for
optimized
`OFF_HEAP` (and use Tachyon).
>
> Burak
>
> On Fri, Feb 27, 2015 at 11:39 AM, Arun Luthra
> wrote:
>
>> My program in pseudocode looks like this:
>>
>> val conf = new SparkConf().setAppName("Test")
>> .set("spark.storage.memoryFrac
elped.
>
> Pawel Szulc
> http://rabbitonweb.com
>
> On Sat, Feb 28, 2015, 6:22 PM, Arun Luthra wrote:
>
> So, actually I am removing the persist for now, because there is
>> significant filtering that happens after calling textFile()... but I will
>> keep that o
A correction to my first post:
There is also a repartition right before groupByKey to help avoid
too-many-open-files error:
rdd2.union(rdd1).map(...).filter(...).repartition(15000).groupByKey().map(...).flatMap(...).saveAsTextFile()
On Sat, Feb 28, 2015 at 11:10 AM, Arun Luthra wrote:
>
y phase? Are you using a profiler or do you base that assumption
>> only on logs?
>>
>> On Sat, Feb 28, 2015, 8:12 PM, Arun Luthra wrote:
>>
>> A correction to my first post:
>>>
>>> There is also a repartition right before groupByKey to he
-memory 15g \
--executor-memory 110g \
--executor-cores 32 \
But then, reduceByKey() fails with:
java.lang.OutOfMemoryError: Java heap space
On Sat, Feb 28, 2015 at 12:09 PM, Arun Luthra wrote:
> The Spark UI names the line number and name of the operation (repartition
> in this case) t
I think this is a Java vs scala syntax issue. Will check.
On Thu, Feb 26, 2015 at 8:17 PM, Arun Luthra wrote:
> Problem is noted here: https://issues.apache.org/jira/browse/SPARK-5949
>
> I tried this as a workaround:
>
> import org.apache.spark.scheduler._
> import
Everything works smoothly if I do the 99%-removal filter in Hive first. So,
all the baggage from garbage collection was breaking it.
Is there a way to filter() out 99% of the data without having to garbage
collect 99% of the RDD?
On Sun, Mar 1, 2015 at 9:56 AM, Arun Luthra wrote:
> I trie
[T], not T[], so you want to use
> something:
>
> kryo.register(classOf[Array[org.roaringbitmap.RoaringArray$Element]])
> kryo.register(classOf[Array[Short]])
>
> nonetheless, the spark should take care of this itself. I'll look into it
> later today.
>
>
> On Mon,
'd need a lot more info to help
>
> On Tue, Mar 10, 2015 at 6:54 PM, Arun Luthra
> wrote:
>
>> Does anyone know how to get the HighlyCompressedMapStatus to compile?
>>
>> I will try turning off kryo in 1.2.0 and hope things don't break. I want
>> to ben
sedMapStatus")
>> res1: Class[_] = class
>> org.apache.spark.scheduler.HighlyCompressedMapStatus
>
>
>
> hope this helps,
> Imran
>
>
> On Thu, Mar 12, 2015 at 12:47 PM, Arun Luthra
> wrote:
>
>> I'm using
Is there an efficient way to save an RDD with saveAsTextFile in such a way
that the data gets shuffled into separated directories according to a key?
(My end goal is to wrap the result in a multi-partitioned Hive table)
Suppose you have:
case class MyData(val0: Int, val1: String, directory_name:
SPARK-3007
On Tue, Apr 21, 2015 at 5:45 PM, Arun Luthra wrote:
> Is there an efficient way to save an RDD with saveAsTextFile in such a way
> that the data gets shuffled into separated directories according to a key?
> (My end goal is to wrap the result in a multi-partitioned Hive table
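One approach that often comes up for this, sketched only (it relies on Hadoop's old-API MultipleTextOutputFormat and assumes the key values are safe directory names; formatLine is hypothetical):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

class KeyedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  // place each record in a sub-directory named after its key
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    s"${key.toString}/$name"
  // keep the key itself out of the written lines
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
}

rdd.map(d => (d.directory_name, formatLine(d)))          // hypothetical line formatting of MyData
   .partitionBy(new HashPartitioner(100))                // illustrative partition count
   .saveAsHadoopFile("outpath", classOf[Any], classOf[Any], classOf[KeyedOutputFormat])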
I have a spark program that worked in local mode, but throws an error in
yarn-client mode on a cluster. On the edge node in my home directory, I
have an output directory (called transout) which is ready to receive files.
The spark job I'm running is supposed to write a few hundred files into
that d
th to the file you want to write, and make sure the
> directory exists and is writable by the Spark process.
>
> On Wed, Sep 10, 2014 at 3:46 PM, Arun Luthra
> wrote:
>
>> I have a spark program that worked in local mode, but throws an error in
>> yarn-client mode on a clus
I'm doing a Spark SQL benchmark similar to the code in
https://spark.apache.org/docs/latest/sql-programming-guide.html
(section: Inferring the Schema Using Reflection). What's the simplest way
to time the SQL statement itself, so that I'm not timing
the .map(_.split(",")).map(p => Person(p(0), p(
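A sketch of one way to isolate the query time, following the guide's people/Person example (the query is illustrative): materialize and cache the table first, then time only the SQL action.

// build and cache the table so parsing and Person construction are outside the timing
people.registerTempTable("people")
sqlContext.cacheTable("people")
sqlContext.sql("SELECT COUNT(*) FROM people").collect() // force the cache to populate

val start = System.nanoTime()
val rows = sqlContext.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19").collect()
val elapsedMs = (System.nanoTime() - start) / 1e6
println(s"${rows.length} rows in $elapsedMs ms")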
I built Spark 1.2.0 successfully, but was unable to build my Spark program
under 1.2.0 with sbt assembly and my build.sbt file. It contains:
I tried:
"org.apache.spark" %% "spark-sql" % "1.2.0",
"org.apache.spark" %% "spark-core" % "1.2.0",
and
"org.apache.spark" %% "spark-sql" % "1.2.0
park (1.2.0-SNAPSHOT),
> you'll need to first build Spark and "publish-local" so your application
> build can find those SNAPSHOTs in your local repo.
>
> Just append "publish-local" to your sbt command where you build Spark.
>
> -Pat
>
>
>
> On
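For the released 1.2.0 artifacts (no SNAPSHOT, so no publish-local needed), a minimal build.sbt sketch, assuming the Scala 2.10 build that 1.2.0 shipped with:

name := "my-spark-app" // hypothetical project name
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // "provided" keeps Spark itself out of the sbt-assembly fat jar
  "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.2.0" % "provided"
)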
null -name
"rack-topology").
Any possible solution?
Arun Luthra
te.xml
> and remove the setting for a rack topology script there (or it might be in
> core-site.xml).
>
> Matei
>
> On Nov 19, 2014, at 12:13 PM, Arun Luthra wrote:
>
> I'm trying to run Spark on YARN on a Hortonworks 2.1.5 cluster. I'm
> get
I'm wondering how to do this kind of SQL query with PairRDDFunctions.
SELECT zip, COUNT(user), COUNT(DISTINCT user)
FROM users
GROUP BY zip
In the Spark scala API, I can make an RDD (called "users") of key-value
pairs where the keys are zip (as in ZIP code) and the values are user IDs.
Then I ca
Is that Spark SQL? I'm wondering if it's possible without spark SQL.
On Wed, Dec 3, 2014 at 8:08 PM, Cheng Lian wrote:
> You may do this:
>
> table("users").groupBy('zip)('zip, count('user), countDistinct('user))
>
> On 12/4/14 8:47 AM
(count + 1, seen + user)
> }, { case ((count0, seen0), (count1, seen1)) =>
> (count0 + count1, seen0 ++ seen1)
> }).mapValues { case (count, seen) =>
> (count, seen.size)
> }
>
> On 12/5/14 3:47 AM, Arun Luthra wrote:
>
> Is that Spark SQL? I'm wonderi
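Putting the quoted fragment back together as a complete sketch with illustrative data (plain PairRDDFunctions, no Spark SQL):

val users = sc.parallelize(Seq(
  ("94105", "alice"), ("94105", "bob"), ("94105", "alice"), ("10001", "carol")))

val perZip = users.aggregateByKey((0L, Set.empty[String]))(
  { case ((count, seen), user) => (count + 1, seen + user) }, // fold one user into a partition's state
  { case ((c0, s0), (c1, s1)) => (c0 + c1, s0 ++ s1) }        // merge states across partitions
).mapValues { case (count, seen) => (count, seen.size) }      // (COUNT(user), COUNT(DISTINCT user))

perZip.collect().foreach(println) // e.g. (94105,(3,2)) and (10001,(1,1))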