Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
Hi Everyone, I had explored IBM's and AWS's S3 shuffle plugins (some time back), and I had also explored AWS FSx Lustre in a few of my production jobs which have ~20TB of shuffle operations with 200-300 executors. What I have observed is that S3 and FSx behaviour was fine during the write phase, however I

Re: custom rdd - do I need a hadoop input format?

2019-09-17 Thread Arun Mahadevan
You can do it with a custom RDD implementation. You will mainly implement "getPartitions" - the logic to split your input into partitions - and "compute" to compute and return the values from the executors. On Tue, 17 Sep 2019 at 08:47, Marcelo Valle wrote: > Just to be more clear about my requireme
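For reference, a minimal sketch of such an RDD in Scala (the class name, the in-memory input and the fixed-size slicing are illustrative assumptions, not from the thread):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical partition type: each partition covers one slice of an in-memory sequence.
    case class SlicePartition(index: Int, start: Int, end: Int) extends Partition

    class SliceRDD(sc: SparkContext, data: Seq[String], numSlices: Int)
        extends RDD[String](sc, Nil) {

      // getPartitions: split the input into up to `numSlices` ranges.
      override protected def getPartitions: Array[Partition] = {
        val size = math.max(1, math.ceil(data.length.toDouble / numSlices).toInt)
        data.indices.grouped(size).zipWithIndex.map { case (range, i) =>
          SlicePartition(i, range.head, range.last + 1): Partition
        }.toArray
      }

      // compute: produce the values of one partition on the executor.
      override def compute(split: Partition, context: TaskContext): Iterator[String] = {
        val p = split.asInstanceOf[SlicePartition]
        data.slice(p.start, p.end).iterator
      }
    }

Usage would be along the lines of new SliceRDD(sc, lines, 4).collect().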

GC problem doing fuzzy join

2019-06-18 Thread Arun Luthra
D map operation, and each record in the RDD takes the broadcasted table and FILTERS it. There appears to be large GC happening, so I suspect that huge repeated data deletion of copies of the broadcast table is causing GC. Is there a way to fix this pattern? Thanks, Arun

Re: how to get spark-sql lineage

2019-05-16 Thread Arun Mahadevan
You can check out https://github.com/hortonworks-spark/spark-atlas-connector/ On Wed, 15 May 2019 at 19:44, lk_spark wrote: > hi,all: > When I use spark , if I run some SQL to do ETL how can I get > lineage info. I found that , CDH spark have some config about lineage : > spark.l

Re: JvmPauseMonitor

2019-04-15 Thread Arun Mahadevan
Spark TaskMetrics[1] has a "jvmGCTime" metric that captures the amount of time spent in GC. This is also available via the listener I guess. Thanks, Arun [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala#L89 On Mon, 15 A
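For reference, a minimal sketch of pulling that metric out of a listener (registered via sparkContext.addSparkListener or spark.extraListeners; the log format is illustrative):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    // Logs the GC time recorded in TaskMetrics for every finished task.
    class GcTimeListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          println(s"stage ${taskEnd.stageId}: task spent ${metrics.jvmGCTime} ms in GC")
        }
      }
    }

    // sc.addSparkListener(new GcTimeListener())  // or set spark.extraListeners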

Creating Hive Persistent view using Spark Sql defaults to Sequence File Format

2019-03-19 Thread arun rajesh
Hi All, I am using spark 2.2 in an EMR cluster. I have a hive table in ORC format and I need to create a persistent view on top of this hive table. I am using spark sql to create the view. By default spark sql creates the view with LazySerde. How can I change the inputformat to use ORC? PFA scre

Re: Structured Streaming & Query Planning

2019-03-18 Thread Arun Mahadevan
. The tiny micro-batch use cases should ideally be solved using continuous mode (once it matures) which would not have this overhead. Thanks, Arun On Mon, 18 Mar 2019 at 00:39, Jungtaek Lim wrote: > Almost everything is coupled with logical plan right now, including > updated range for so

Re: use rocksdb for spark structured streaming (SSS)

2019-03-10 Thread Arun Mahadevan
Read the link carefully. This solution is available (*only*) in Databricks Runtime. You can enable RocksDB-based state management by setting the following configuration in the SparkSession before starting the streaming query. spark.conf.set( "spark.sql.streaming.stateStore.providerClass", "co
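(Side note: later open-source Spark releases, 3.2 and up, ship a RocksDB provider as well; a minimal sketch assuming Spark 3.2+:)

    spark.conf.set(
      "spark.sql.streaming.stateStore.providerClass",
      "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")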

Re: Question about RDD pipe

2019-01-17 Thread Arun Mahadevan
Yes, the script should be present on all the executor nodes. You can pass your script via spark-submit (e.g. --files script.sh) and then you should be able to refer that (e.g. "./script.sh") in rdd.pipe. - Arun On Thu, 17 Jan 2019 at 14:18, Mkal wrote: > Hi, im trying to ru
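A minimal sketch of that flow (the script name is illustrative; assumes `rdd` is the RDD being piped):

    // Ship the script with the application:
    //   spark-submit --files script.sh --class ... myapp.jar
    // Every executor then has ./script.sh in its working directory.
    val piped = rdd.pipe("./script.sh")   // each partition's records are fed to the script's stdin
    piped.collect().foreach(println)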

Re: Equivalent of emptyDataFrame in StructuredStreaming

2018-11-17 Thread Arun Manivannan
am sorry if I haven't done a good job in explaining it well. Cheers, Arun On Tue, Nov 6, 2018 at 7:34 AM Jungtaek Lim wrote: > Could you explain what you're trying to do? It should have no batch for no > data in stream, so it will end up to no-op even it is possible. > >

Equivalent of emptyDataFrame in StructuredStreaming

2018-11-05 Thread Arun Manivannan
ility to mutate it but I am converting it to DS immediately. So, I am leaning towards this at the moment. * val emptyErrorStream = (spark:SparkSession) => { implicit val sqlC = spark.sqlContext MemoryStream[DataError].toDS() } Cheers, Arun

Re: Error - Dropping SparkListenerEvent because no remaining room in event queue

2018-10-24 Thread Arun Mahadevan
Maybe you have spark listeners that are not processing the events fast enough? Do you have spark event logging enabled? You might have to profile the built-in and your custom listeners to see what's going on. - Arun On Wed, 24 Oct 2018 at 16:08, karan alang wrote: > > Pls note - Spark v

Re: Kafka backlog - spark structured streaming

2018-07-30 Thread Arun Mahadevan
Here's a proposal to add it - https://github.com/apache/spark/pull/21819 It's always good to set "maxOffsetsPerTrigger" unless you want spark to process till the end of the stream in each micro batch. Even without "maxOffsetsPerTrigger" the lag can be non-zero by the time the micro batch completes.
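For reference, a minimal sketch of setting it on the Kafka source (servers, topic and the cap are placeholders):

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "mytopic")
      .option("maxOffsetsPerTrigger", "100000")   // cap on records consumed per micro batch
      .load()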

Re: Question of spark streaming

2018-07-27 Thread Arun Mahadevan
. Thanks, Arun From: utkarsh rathor Date: Friday, July 27, 2018 at 5:15 AM To: "user@spark.apache.org" Subject: Question of spark streaming I am following the book Spark the Definitive Guide The following code is executed locally using spark-shell Procedure: Started the s

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread Arun Mahadevan
close(null)" is invoked. You can batch your writes in the process and/or in the close. The guess the writes can still be atomic and decided by if “close” returns successfully or throws an exception. Thanks, Arun From: chandan prakash Date: Thursday, July 12, 2018 at 10:37 AM To: Aru

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-07-12 Thread Arun Mahadevan
Yes ForeachWriter [1] could be an option if you want to write to different sinks. You can put your custom logic to split the data into different sinks. The drawback here is that you cannot plug in existing sinks like Kafka and you need to write the custom logic yourself and you cannot scale the p

Re: Unable to alter partition. The transaction for alter partition did not commit successfully.

2018-07-10 Thread Arun Hive
details o what are you doing  On Wed, May 30, 2018 at 12:58 PM Arun Hive wrote: Hi  While running my spark job component i am getting the following exception. Requesting for your help on this:Spark core version - spark-core_2.10-2.1.1 Spark streaming version -spark-streaming_2.10-2.1.1 Spark hive

Re: Unable to alter partition. The transaction for alter partition did not commit successfully.

2018-05-30 Thread Arun Hive
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:50) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39) Regards,Arun On Tuesday, May 29, 2018, 1:22:17 PM PDT, Arun Hive wrote: Hi  While running my spark job component i am getting the

Closing IPC connection

2018-05-30 Thread Arun Hive
(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522) at org.apache.hadoop.ipc.Client.call(Client.java:1439) ... 77 more Regards,Arun

Re: question on collect_list or say aggregations in general in structured streaming 2.3.0

2018-05-03 Thread Arun Mahadevan
I think you need to group by a window (tumbling) and define watermarks (put a very low watermark or even 0) to discard the state. Here the window duration becomes your logical batch. - Arun From: kant kodali Date: Thursday, May 3, 2018 at 1:52 AM To: "user @spark" Subject: Re
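A minimal sketch of that shape (column names are assumptions; the 0-second watermark drops state as soon as each window closes):

    import org.apache.spark.sql.functions.{collect_list, window}

    // assumes a streaming DataFrame `events` with columns eventTime, key, value
    // and `import spark.implicits._` in scope
    val agg = events
      .withWatermark("eventTime", "0 seconds")
      .groupBy(window($"eventTime", "1 minute"), $"key")   // the tumbling window acts as the logical batch
      .agg(collect_list($"value").as("values"))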

Re: [Structured Streaming] Restarting streaming query on exception/termination

2018-04-24 Thread Arun Mahadevan
: StreamingQueryException => // log it } } Thanks, Arun From: Priyank Shrivastava Date: Monday, April 23, 2018 at 11:27 AM To: formice <51296...@qq.com>, "user@spark.apache.org" Subject: Re: [Structured Streaming] Restarting streaming query on exception/termination Thanks for th
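The overall shape of that retry loop, as a minimal sketch (the query construction is a placeholder):

    import org.apache.spark.sql.streaming.StreamingQueryException

    var keepRunning = true
    while (keepRunning) {
      val query = df.writeStream.format("console").start()   // placeholder sink
      try {
        query.awaitTermination()
        keepRunning = false                 // terminated normally, stop retrying
      } catch {
        case e: StreamingQueryException =>
          // log it, then loop to restart the query
          println(s"query failed, restarting: ${e.getMessage}")
      }
    }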

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
I assume it's going to compare by the first column and, if equal, compare the second column and so on. From: kant kodali Date: Wednesday, April 18, 2018 at 6:26 PM To: Jungtaek Lim Cc: Arun Iyer , Michael Armbrust , Tathagata Das , "user @spark" Subject: Re: can we use mapGroup

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
The below expr might work: df.groupBy($"id").agg(max(struct($"amount", $"my_timestamp")).as("data")).select($"id", $"data.*") Thanks, Arun From: Jungtaek Lim Date: Wednesday, April 18, 2018 at 4:54 PM To: Michael Armbrust

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
erations is not there yet. Thanks, Arun From: kant kodali Date: Tuesday, April 17, 2018 at 11:41 AM To: Tathagata Das Cc: "user @spark" Subject: Re: can we use mapGroupsWithState in raw sql? Hi TD, Thanks for that. The only reason I ask is I don't see any alternative soluti

unsubscribe

2018-02-25 Thread Arun Khetarpal

Re: [Spark-Submit] Where to store data files while running job in cluster mode?

2017-09-29 Thread Arun Rai
Or you can try mounting that drive to all nodes. On Fri, Sep 29, 2017 at 6:14 AM Jörn Franke wrote: > You should use a distributed filesystem such as HDFS. If you want to use > the local filesystem then you have to copy each file to each node. > > > On 29. Sep 2017, at 12:05, Gaurav1809 wrote: >
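If neither a shared mount nor HDFS is an option, shipping the files with the job is another route; a minimal sketch (the file name is illustrative):

    import org.apache.spark.SparkFiles

    // Submit with: spark-submit --files /local/path/lookup.csv ...
    // Every executor can then resolve its local copy of the file:
    val localPath = SparkFiles.get("lookup.csv")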

Re: [SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-17 Thread Arun Khetarpal
Ping. I did some digging around in the code base - I see that this is not present currently. Just looking for an acknowledgement Regards, Arun > On 15-Sep-2017, at 8:43 PM, Arun Khetarpal wrote: > > Hi - > > Wanted to understand if spark sql has GRANT and REVOKE state

[SPARK-SQL] Does spark-sql have Authorization built in?

2017-09-15 Thread Arun Khetarpal
Hi - Wanted to understand if spark sql has GRANT and REVOKE statements available? Is anyone working on making that available? Regards, Arun - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

RowMatrix: tallSkinnyQR

2017-06-09 Thread Arun
hi, def tallSkinnyQR(computeQ: Boolean = false): QRDecomposition[RowMatrix, Matrix]. In the output of this method Q is a distributed matrix and R is a local Matrix. What's the reason R is a local Matrix? -Arun

RMSE recommender system

2017-05-20 Thread Arun
hi all.. I am new to machine learning. I am working on a recommender system. For the training dataset RMSE is 0.08, while on test data it is 2.345. What's the conclusion and what steps can I take to improve? Sent from Samsung tablet

spark ML Recommender program

2017-05-17 Thread Arun
hi, I am writing a spark ML movie recommender program in IntelliJ on Windows 10. The dataset is 2MB with 10 datapoints, and my laptop has 8GB memory. When I set the number of iterations to 10 it works fine; when I set the number of iterations to 20 I get a StackOverflow error. What's the solution?.. thanks Sent from Samsung

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread arun kumar Natva
ully long in spark 2.0. >> >> I am using spark 1.6 & spark 2.0 on HDP 2.5.3 >> >> >> >> >> -- >> View this message in context: http://apache-spark-user-list. >> 1001560.n3.nabble.com/My-spark-job-runs-faster-in-spark-1-6- >> and-much-slower-in-spark-2-0-tp28390.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> > -- Regards, Arun Kumar Natva

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-16 Thread Arun Patel
issue? I tried playing with spark.memory.fraction and spark.memory.storageFraction. But, it did not help. Appreciate your help on this!!! On Tue, Nov 15, 2016 at 8:44 PM, Arun Patel wrote: > Thanks for the quick response. > > Its a single XML file and I am using a top level rowTag

Re: Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
new version and try to use different rowTags and increase executor-memory tomorrow. I will open a new issue as well. On Tue, Nov 15, 2016 at 7:52 PM, Hyukjin Kwon wrote: > Hi Arun, > > > I have few questions. > > Dose your XML file have like few huge documents? In this case o

Spark-xml - OutOfMemoryError: Requested array size exceeds VM limit

2016-11-15 Thread Arun Patel
I am trying to read an XML file which is 1GB is size. I am getting an error 'java.lang.OutOfMemoryError: Requested array size exceeds VM limit' after reading 7 partitions in local mode. In Yarn mode, it throws 'java.lang.OutOfMemoryError: Java heap space' error after reading 3 partitions. Any su

Spark XML ignore namespaces

2016-11-03 Thread Arun Patel
I see that the 'ignoring namespaces' issue is resolved. https://github.com/databricks/spark-xml/pull/75 How do we enable this option and ignore namespace prefixes? - Arun

Re: Best approach for processing all files parallelly

2016-10-10 Thread Arun Patel
) returns schema for FileType > > This for loop DOES NOT process files sequentially. It creates dataframes > on all files which are of same types sequentially. > > On Fri, Oct 7, 2016 at 12:08 AM, Arun Patel > wrote: > >> Thanks Ayan. Couple of questions: >> >&

Re: Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
is case, if you see, t[1] is NOT the file content, as I have added a > "FileType" field. So, this collect is just bringing in the list of file > types, should be fine > > On Thu, Oct 6, 2016 at 11:47 PM, Arun Patel > wrote: > >> Thanks Ayan. I am really concerned ab

Re: Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
ist.append(df) > > > > On Thu, Oct 6, 2016 at 10:26 PM, Arun Patel > wrote: > >> My Pyspark program is currently identifies the list of the files from a >> directory (Using Python Popen command taking hadoop fs -ls arguments). For >> each file, a Dataframe is cr

Best approach for processing all files parallelly

2016-10-06 Thread Arun Patel
Above code does not work. I get an error 'TypeError: 'JavaPackage' object is not callable'. How to make it work? Or is there a better approach? -Arun

Re: Check if a nested column exists in DataFrame

2016-09-13 Thread Arun Patel
at 5:28 PM, Arun Patel wrote: > I'm trying to analyze XML documents using spark-xml package. Since all > XML columns are optional, some columns may or may not exist. When I > register the Dataframe as a table, how do I check if a nested column is > existing or not? My column na

Check if a nested column exists in DataFrame

2016-09-12 Thread Arun Patel
I'm trying to analyze XML documents using the spark-xml package. Since all XML columns are optional, some columns may or may not exist. When I register the Dataframe as a table, how do I check whether a nested column exists? My column name is "emp" which is already exploded and I am trying to c
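One way to answer this from the schema rather than the registered table; a minimal sketch (the helper name is hypothetical):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.types.{DataType, StructType}

    // Returns true if a dotted path such as "emp.address.city" exists in the schema.
    def hasNestedColumn(df: DataFrame, path: String): Boolean = {
      def find(dt: DataType, parts: List[String]): Boolean = (dt, parts) match {
        case (_, Nil) => true
        case (st: StructType, p :: rest) =>
          st.fields.find(_.name == p).exists(f => find(f.dataType, rest))
        case _ => false
      }
      find(df.schema, path.split('.').toList)
    }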

Re: spark-xml to avro - SchemaParseException: Can't redefine

2016-09-09 Thread Arun Patel
ed the title from Save DF with nested records with the same > name to spark-avro fails to save DF with nested records having the same > name Jun 23, 2015 > > > > -- > *From:* Arun Patel > *Sent:* Thursday, September 8, 2016 5:31 PM > *To:* u

spark-xml to avro - SchemaParseException: Can't redefine

2016-09-08 Thread Arun Patel
I'm trying to convert XML to AVRO. But I am getting a SchemaParser exception for 'Rules', which exists in two separate containers. Any thoughts? XML is attached. df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='GGLResponse',attributePrefix='').load('GGL.xml') df.show

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-24 Thread Arun Luthra
Also for the record, turning on Kryo did not help. On Tue, Aug 23, 2016 at 12:58 PM, Arun Luthra wrote: > Splitting up the Maps to separate objects did not help. > > However, I was able to work around the problem by reimplementing it with > RDD joins. > > On Aug 18, 2

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-23 Thread Arun Luthra
Splitting up the Maps to separate objects did not help. However, I was able to work around the problem by reimplementing it with RDD joins. On Aug 18, 2016 5:16 PM, "Arun Luthra" wrote: > This might be caused by a few large Map objects that Spark is trying to > serializ

Re: Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-18 Thread Arun Luthra
ainst me? What if I manually split them up into numerous Map variables? On Mon, Aug 15, 2016 at 2:12 PM, Arun Luthra wrote: > I got this OOM error in Spark local mode. The error seems to have been at > the start of a stage (all of the stages on the UI showed as complete, there > were

Spark 2.0.0 OOM error at beginning of RDD map on AWS

2016-08-15 Thread Arun Luthra
I got this OOM error in Spark local mode. The error seems to have been at the start of a stage (all of the stages on the UI showed as complete, there were more stages to do but had not showed up on the UI yet). There appears to be ~100G of free memory at the time of the error. Spark 2.0.0 200G dr

Re: groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
r dataset or an unexpected implicit conversion. > Just add rdd() before the groupByKey call to push it into an RDD. That > being said - groupByKey generally is an anti-pattern so please be careful > with it. > > On Wed, Aug 10, 2016 at 8:07 PM, Arun Luthra > wrote: > >> H
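A minimal sketch of that workaround (the pair data is illustrative; assumes `import spark.implicits._`):

    // In 2.0, Dataset.groupByKey takes a key function, so the 1.6-style pair call
    // no longer compiles; dropping to the underlying RDD keeps the old API.
    val ds = Seq(("a", 1), ("a", 2), ("b", 3)).toDS()   // Dataset[(String, Int)]
    val grouped = ds.rdd.groupByKey()                   // RDD[(String, Iterable[Int])]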

groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Arun Luthra
bvious API change... what is the problem? Thanks, Arun

Re: Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
hagata Das wrote: > Correction, the two options are. > > - writeStream.format("parquet").option("path", "...").start() > - writestream.parquet("...").start() > > There no start with param. > > On Jul 30, 2016 11:22 AM, "Jacek Laskows
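A slightly fuller sketch with the checkpoint location that a file sink requires (the paths and the input DataFrame are placeholders):

    val query = streamingDF.writeStream
      .format("parquet")
      .option("path", "/data/out")                            // output directory
      .option("checkpointLocation", "/data/checkpoints/out")  // required for file sinks
      .start()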

Structured Streaming Parquet Sink

2016-07-30 Thread Arun Patel
t, I don't see path or parquet in DataStreamWriter. scala> val query = streamingCountsDF.writeStream. foreach format option options outputMode partitionBy queryName start trigger Any idea how to write this to parquet file? - Arun

Re: Graphframe Error

2016-07-07 Thread Arun Patel
I have tried this already. It does not work. What version of Python is needed for this package? On Wed, Jul 6, 2016 at 12:45 AM, Felix Cheung wrote: > This could be the workaround: > > http://stackoverflow.com/a/36419857 > > > > > On Tue, Jul 5, 2016 at 5:

Re: Graphframe Error

2016-07-05 Thread Arun Patel
ke either the extracted Python code is corrupted or there is a > mismatch Python version. Are you using Python 3? > > > stackoverflow.com/questions/514371/whats-the-bad-magic-number-error > > > > > > On Mon, Jul 4, 2016 at 1:37 AM -0700, "Yanbo Liang"

Graphframe Error

2016-07-03 Thread Arun Patel
27; is not defined Also, I am getting below error. >>> from graphframes.examples import Graphs Traceback (most recent call last): File "", line 1, in ImportError: Bad magic number in graphframes/examples.pyc Any help will be highly appreciated. - Arun

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-14 Thread Arun Patel
Can anyone answer these questions, please? On Mon, Jun 13, 2016 at 6:51 PM, Arun Patel wrote: > Thanks Michael. > > I went thru these slides already and could not find answers for these > specific questions. > > I created a Dataset and converted it to DataFrame in 1.6 and 2

Re: Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
k-dataframes-datasets-and-streaming-by-michael-armbrust > > On Mon, Jun 13, 2016 at 4:01 AM, Arun Patel > wrote: > >> In Spark 2.0, DataFrames and Datasets are unified. DataFrame is simply an >> alias for a Dataset of type row. I have few questions. >> >> 1) What

Spark 2.0: Unify DataFrames and Datasets question

2016-06-13 Thread Arun Patel
? 4) Compile time safety will be there for DataFrames too? 5) Python API is supported for Datasets in 2.0? Thanks Arun

Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Thanks Sean and Jacek. Do we have any updated documentation for 2.0 somewhere? On Tue, Jun 7, 2016 at 9:34 AM, Jacek Laskowski wrote: > On Tue, Jun 7, 2016 at 3:25 PM, Sean Owen wrote: > > That's not any kind of authoritative statement, just my opinion and > guess. > > Oh, come on. You're not

Re: Spark 2.0 Release Date

2016-06-07 Thread Arun Patel
Do we have any further updates on release date? Also, Is there a updated documentation for 2.0 somewhere? Thanks Arun On Thu, Apr 28, 2016 at 4:50 PM, Jacek Laskowski wrote: > Hi Arun, > > My bet is...https://spark-summit.org/2016 :) > > Pozdrawiam, > Jacek Laskow

Re: Hive_context

2016-05-23 Thread Arun Natva
Can you try a hive JDBC java client from eclipse and query a hive table successfully? That way we can narrow down where the issue is. Sent from my iPhone > On May 23, 2016, at 5:26 PM, Ajay Chander wrote: > > I downloaded the spark 1.5 utilities and exported SPARK_HOME pointing to it. >

Re: HBase / Spark Kerberos problem

2016-05-19 Thread Arun Natva
Some of the Hadoop services cannot make use of the ticket obtained by loginUserFromKeytab. I was able to get past it using a GSS JAAS configuration where you can pass either the keytab file or ticketCache to the spark executors that access HBase. Sent from my iPhone > On May 19, 2016, at 4:51 AM, Ellis,

Spark 2.0 Release Date

2016-04-28 Thread Arun Patel
A small request. Would you mind providing an approximate date of Spark 2.0 release? Is it early May or Mid May or End of May? Thanks, Arun

Re: transformation - spark vs cassandra

2016-03-31 Thread Arun Sethia
primary key and country as cluster >> key). >> >> SELECT count(*) FROM test WHERE cdate ='2016-06-07' AND country='USA' >> >> I would like to know when should we use Cassandra simple query vs >> dataframe >&g

Re: DataFrame vs RDD

2016-03-22 Thread Arun Sethia
Thanks Vinay. Is it fair to say that creating an RDD and creating a DataFrame from Cassandra uses Spark SQL, with the help of the Spark-Cassandra Connector API? On Tue, Mar 22, 2016 at 9:32 PM, Vinay Kashyap wrote: > DataFrame is when there is a schema associated with your RDD.. > For any of your transformation o

Re: TaskCommitDenied (Driver denied task commit)

2016-01-22 Thread Arun Luthra
Correction. I have to use spark.yarn.am.memoryOverhead because I'm in Yarn client mode. I set it to 13% of the executor memory. Also quite helpful was increasing the total overall executor memory. It will be great when tungsten enhancements make their way into RDDs. Thanks! Arun On Thu

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
ront in my mind. On Thu, Jan 21, 2016 at 5:35 PM, Arun Luthra wrote: > Looking into the yarn logs for a similar job where an executor was > associated with the same error, I find: > > ... > 16/01/22 01:17:18 INFO client.TransportClientFactory: Found inactive > connection to (SERVE

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
hat the TaskCommitDenied is perhaps a red herring and the > problem is groupByKey - but I've also just seen a lot of people be bitten > by it so that might not be the issue. If you just do a count at the point of > the groupByKey does the pipeline succeed? > > On Thu, Jan 21, 2016

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
this exception because the coordination does not get triggered in > non save/write operations. > > On Thu, Jan 21, 2016 at 2:46 PM Holden Karau wrote: > >> Before we dig too far into this, the thing which most quickly jumps out >> to me is groupByKey which could be causing some p

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
you are performing? > > On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra > wrote: > >> Example warning: >> >> 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID >> 4436, XXX): TaskCommitDenied (Driver denied task commit) for job: 1,

MemoryStore: Not enough space to cache broadcast_N in memory

2016-01-21 Thread Arun Luthra
rnal label. Then it would work the same as the sc.accumulator() "name" argument. It would enable more useful warn/error messages. Arun

TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Arun Luthra
lly I won't have to increase it. The RDD being processed has 2262 partitions. Arun

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
ues in object > equality. > > On Mon, Jan 4, 2016 at 4:42 PM Arun Luthra wrote: > >> Spark 1.5.0 >> >> data: >> >> p1,lo1,8,0,4,0,5,20150901|5,1,1.0 >> p1,lo2,8,0,4,0,5,20150901|5,1,1.0 >> p1,lo3,8,0,4,0,5,20150901|5,

Re: groupByKey does not work?

2016-01-04 Thread Arun Luthra
see that each key is repeated 2 times but each key should only appear once. Arun On Mon, Jan 4, 2016 at 4:07 PM, Ted Yu wrote: > Can you give a bit more information ? > > Release of Spark you're using > Minimal dataset that shows the problem > > Cheers > > On M

groupByKey does not work?

2016-01-04 Thread Arun Luthra
2 times. Is this the expected behavior? I need to be able to get ALL values associated with each key grouped into a SINGLE record. Is it possible? Arun p.s. reduceByKey will not be sufficient for me

Re: Spark Streaming - Number of RDDs in Dstream

2015-12-21 Thread Arun Patel
So, does that mean only one RDD is created by all receivers? On Sun, Dec 20, 2015 at 10:23 PM, Saisai Shao wrote: > Normally there will be one RDD in each batch. > > You could refer to the implementation of DStream#getOrCompute. > > > On Mon, Dec 21, 2015 at 11:04 AM, A

Spark Streaming - Number of RDDs in Dstream

2015-12-20 Thread Arun Patel
/ spark.streaming.blockInterval) * number of receivers Is it like one RDD per receiver? or Multiple RDDs per receiver? What is the easiest way to find it? Arun

Re: Content based window operation on Time-series data

2015-12-09 Thread Arun Verma
Thank you for your reply. It is a Scala and Python library. Does a similar library exist for Java? On Wed, Dec 9, 2015 at 10:26 PM, Sean Owen wrote: > CC Sandy as his https://github.com/cloudera/spark-timeseries might be > of use here. > > On Wed, Dec 9, 2015 at 4:54 PM, Arun Ve

Content based window operation on Time-series data

2015-12-09 Thread Arun Verma
ject(); stepResults.put("x", Long.parseLong(row.get(0).toString())); stepResults.put("y", row.get(1)); appendResults.add(stepResults); } start = nextStart; nextStart = start + bucketLengthSec; } -- Thanks and Regards, Arun Verma

Want 1-1 map between input files and output files in map-only job

2015-11-19 Thread Arun Luthra
ile("/data/output_directory") Thanks, Arun

Re: types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
Ah, yes, that did the trick. So more generally, can this handle any serializable object? On Thu, Aug 27, 2015 at 2:11 PM, Jonathan Coveney wrote: > array[String] doesn't pretty print by default. Use .mkString(",") for > example > > > El jueves, 27 de agosto de

types allowed for saveasobjectfile?

2015-08-27 Thread Arun Luthra
[Ljava.lang.String;@13144c [Ljava.lang.String;@75146d [Ljava.lang.String;@79118f Arun

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-24 Thread Arun Ahuja
for all the help everyone! But I'm not sure it's worth pursuing further, not sure what else to try. Thanks, Arun On Tue, Jul 21, 2015 at 11:16 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > FWIW I've run into similar BLAS related problems before and wrote up a > document

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-21 Thread Arun Ahuja
e driver or executor, would you know? Thanks, Arun On Tue, Jul 21, 2015 at 7:52 AM, Sean Owen wrote: > Great, and that file exists on HDFS and is world readable? just > double-checking. > > What classpath is this -- your driver or executor? this is the driver, no? > I assume so just

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-20 Thread Arun Ahuja
Cool, I tried that as well, and it doesn't seem different: spark.yarn.jar seems set [image: Inline image 1] This actually doesn't change the classpath, not sure if it should: [image: Inline image 3] But same netlib warning. Thanks for the help! - Arun On Fri, Jul 17, 2015 at 3:18

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Arun Ahuja
spark-assembly-1.5.0-SNAPSHOT-hadoop2.6.0.jar | grep jniloader META-INF/maven/com.github.fommil/jniloader/ META-INF/maven/com.github.fommil/jniloader/pom.xml META-INF/maven/com.github.fommil/jniloader/pom.properties ​ Thanks, Arun On Fri, Jul 17, 2015 at 1:30 PM, Sean Owen wrote: > Make sure /u

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-17 Thread Arun Ahuja
need to be adjusted in my application POM? Thanks, Arun On Thu, Jul 16, 2015 at 5:26 PM, Sean Owen wrote: > Yes, that's most of the work, just getting the native libs into the > assembly. netlib can find them from there even if you don't have BLAS > libs on your OS, since it i

Re: Is it possible to change the default port number 7077 for spark?

2015-07-13 Thread Arun Verma
PFA sample file On Mon, Jul 13, 2015 at 7:37 PM, Arun Verma wrote: > Hi, > > Yes it is. To do it follow these steps; > 1. cd spark/installation/path/.../conf > 2. cp spark-env.sh.template spark-env.sh > 3. vi spark-env.sh > 4. SPARK_MASTER_PORT=9000(or any other available

Re: Is it possible to change the default port number 7077 for spark?

2015-07-13 Thread Arun Verma
r List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- Thanks and Regards, Arun Verma

How to ignore features in mllib

2015-07-09 Thread Arun Luthra
tures-when-training-a-classifier Arun

Re: How to change hive database?

2015-07-08 Thread Arun Luthra
Thanks, it works. On Tue, Jul 7, 2015 at 11:15 AM, Ted Yu wrote: > See this thread http://search-hadoop.com/m/q3RTt0NFls1XATV02 > > Cheers > > On Tue, Jul 7, 2015 at 11:07 AM, Arun Luthra > wrote: > >> >> https://spark.apache.org

Re: unable to bring up cluster with ec2 script

2015-07-07 Thread Arun Ahuja
n-emr/ - Arun On Tue, Jul 7, 2015 at 4:34 PM, Pagliari, Roberto wrote: > > > > > I'm following the tutorial about Apache Spark on EC2. The output is the > following: > > > > > > $ ./spark-ec2 -i ../spark.pem -k spark --copy launch spark-training > >

What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-07 Thread Arun Ahuja
o load implementation from: com.github.fommil.netlib.NativeRefLAPACK ​ Anything in this process I missed? Thanks, Arun

How to change hive database?

2015-07-07 Thread Arun Luthra
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext I'm getting org.apache.spark.sql.catalyst.analysis.NoSuchTableException from: val dataframe = hiveContext.table("other_db.mytable") Do I have to change the current database to access it? Is it possible to
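One commonly suggested workaround is to switch the current database first; a minimal sketch using the names from the question:

    hiveContext.sql("USE other_db")
    val dataframe = hiveContext.table("mytable")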

Re: Spark launching without all of the requested YARN resources

2015-07-02 Thread Arun Luthra
Thanks Sandy et al, I will try that. I like that I can choose the minRegisteredResourcesRatio. On Wed, Jun 24, 2015 at 11:04 AM, Sandy Ryza wrote: > Hi Arun, > > You can achieve this by > setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really >
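For reference, a sketch of the two settings mentioned in this thread (the values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // wait until this fraction of the requested executors has registered...
      .set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
      // ...but give up waiting and start anyway after this timeout
      .set("spark.scheduler.maxRegisteredResourcesWaitingTime", "300s")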

Spark launching without all of the requested YARN resources

2015-06-23 Thread Arun Luthra
resources that I request? Thanks, Arun

Missing values support in Mllib yet?

2015-06-19 Thread Arun Luthra
Hi, Is there any support for handling missing values in mllib yet, especially for decision trees where this is a natural feature? Arun

Re: Problem getting program to run on 15TB input

2015-06-09 Thread Arun Luthra
usage of spark. > > > > @Arun, can you kindly confirm if Daniel’s suggestion helped your usecase? > > > > Thanks, > > > > Kapil Malik | kma...@adobe.com | 33430 / 8800836581 > > > > *From:* Daniel Mahler [mailto:dmah...@gmail.com] > *Sent:* 13 April 2

Re: Efficient saveAsTextFile by key, directory for each key?

2015-04-22 Thread Arun Luthra
PARK-3007 On Tue, Apr 21, 2015 at 5:45 PM, Arun Luthra wrote: > Is there an efficient way to save an RDD with saveAsTextFile in such a way > that the data gets shuffled into separated directories according to a key? > (My end goal is to wrap the result in a multi-partitioned Hive table
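If DataFrames are an option, partitioned output gives one sub-directory per key directly; a minimal sketch (assumes `rdd` is a pair RDD of (key, value), `import sqlContext.implicits._`, and Spark 1.4+):

    rdd.toDF("key", "value")
      .write
      .partitionBy("key")               // one sub-directory per distinct key value
      .parquet("/data/output_directory")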

Scheduling across applications - Need suggestion

2015-04-22 Thread Arun Patel
applications. Is this correct? Regards, Arun
