Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Nitin Siwach
== Analyzed Logical Plan ==
index: string, 0: string
Relation [index#50,0#51] csv
== Optimized Logical Plan ==
Relation [index#50,0#51] csv
== Physical Plan ==
FileScan csv [index#50,0#51] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/nitin/work/df1.cs
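
A minimal PySpark sketch of the behaviour this thread is probing (path and column name hypothetical): without caching, every action that consumes the DataFrame plans its own FileScan, so the same CSV is read once per action; persisting it removes the second read.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.option("header", "true").csv("file:///home/nitin/work/df1.csv")

    # Without persist(), each action below triggers its own FileScan of the CSV.
    df.count()
    df.agg({"index": "max"}).show()

    # cache() materializes the scan once; later actions reuse the cached rows.
    df.cache()
    df.count()                        # reads the file and populates the cache
    df.agg({"index": "max"}).show()   # served from the cache, no second read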

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
ain in the second run. You can > also confirm it in other metrics from Spark UI. > > That is my personal understanding based on what I have read and seen on my > job runs. If there is any mistake, feel free to correct me. > > Thank You & Best Regards > Winston Lai

What is DataFilters and while joining why is the filter isnotnull[joinKey] applied twice

2023-01-31 Thread Nitin Siwach
//monsoon-credittech.appspot.com/spark_datasets/random_tests/dfr_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct -- Regards, Nitin
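
A small reproduction sketch for the question in this thread (paths hypothetical): on an inner equi-join, Spark infers that the join key cannot be null, and the same constraint is reported at two layers of each FileScan, once in DataFilters (the filter Spark itself applies after the scan) and once in PushedFilters (the advisory hint handed to the data source, which Spark does not trust blindly).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.range(10).withColumnRenamed("id", "a").write.mode("overwrite").parquet("/tmp/join_l")
    spark.range(10).withColumnRenamed("id", "a").write.mode("overwrite").parquet("/tmp/join_r")

    l = spark.read.parquet("/tmp/join_l")
    r = spark.read.parquet("/tmp/join_r")
    # Each scan should list isnotnull(a) in DataFilters and IsNotNull(a) in
    # PushedFilters -- the same inferred not-null constraint reported at two layers.
    l.join(r, "a").explain(True)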

bucketBy in pyspark not retaining partition information

2022-01-31 Thread Nitin Siwach
Exchange hashpartitioning(a#24, 200), ENSURE_REQUIREMENTS, [id=#61]
:  +- Filter isnotnull(a#24)
:     +- FileScan parquet [a#24,b#25,c#26] Batched: true, DataFilters: [isnotnull(a#24)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/nitin/pymonsoon/bucket_test_parquet1],
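
A hedged sketch of the usual gotcha behind this report (table name hypothetical): bucket information lives in the metastore, so it survives only a saveAsTable write; reading the same files back via a bare path loses it, and the Exchange reappears.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100).withColumn("a", F.col("id") % 10)

    # Bucket metadata is kept in the metastore, so use saveAsTable, not .parquet(path).
    df.write.mode("overwrite").bucketBy(8, "a").sortBy("a").saveAsTable("bucket_test")

    # Grouping on the bucket column can now skip the Exchange hashpartitioning step;
    # spark.read.parquet(<warehouse path>) on the same files would not know the buckets.
    spark.table("bucket_test").groupBy("a").count().explain()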

understanding iterator of series to iterator of series pandasUDF

2022-01-04 Thread Nitin Siwach
I understand pandasUDF as follows:
1. There are multiple partitions per worker
2. Multiple arrow batches are converted per partition
3. Sent to python process
4. In the case of Series to Series the pandasUDF is applied to each arrow batch one after the other? (So, is it that (a) - The vectorisat
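
A minimal sketch of the iterator-of-Series variant the subject line names (Spark 3.0+ API; the function body is illustrative): the UDF is invoked once per partition, and the loop iterates over that partition's Arrow batches, which is why per-partition setup cost is paid only once.

    from typing import Iterator

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()

    @pandas_udf("long")
    def plus_state(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        state = 1                      # e.g. load a model once per partition
        for batch in batches:          # one pandas Series per Arrow batch
            yield batch + state

    df = spark.range(10)
    df.select(plus_state(df.id)).show()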

Re: Read hdfs files in spark streaming

2019-06-11 Thread nitin jain
Hi Deepak, Please let us know - how you managed it? Thanks, NJ On Mon, Jun 10, 2019 at 4:42 PM Deepak Sharma wrote: > Thanks All. > I managed to get this working. > Marking this thread as closed. > > On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma > wrote: > >> This is the project requirement,
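
For anyone landing on this thread later, a hedged sketch of the standard way to consume HDFS files in classic Spark Streaming (directory and interval hypothetical): textFileStream watches a directory and processes files that are moved in atomically after the stream starts.

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="hdfs-dir-stream")
    ssc = StreamingContext(sc, batchDuration=10)         # 10-second micro-batches

    lines = ssc.textFileStream("hdfs:///data/incoming")  # picks up newly added files
    lines.count().pprint()

    ssc.start()
    ssc.awaitTermination()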

Re: MLib : Non Linear Optimization

2016-09-09 Thread Nitin Sareen
away from SAS due to the cost, it would be really good to have these algorithms in Spark ML. Let me know if you need any more info, I can share some snippets if required. Thanks, Nitin On Thu, Sep 8, 2016 at 2:08 PM, Robin East wrote: > Do you have any particular algorithms in mind? If you st

Re: Populating tables using hive and spark

2016-08-22 Thread Nitin Kumar
If others can concur we can go ahead and report it as a bug. Regards, Nitin On Mon, Aug 22, 2016 at 4:15 PM, Furcy Pin wrote: > Hi Nitin, > > I confirm that there is something odd here. > > I did the following test: > > create table test_orc (id int, name string, dept

Populating tables using hive and spark

2016-08-22 Thread Nitin Kumar
iple times, the count query in the hive terminal works perfectly fine. This problem occurs for tables stored with different storage formats as well (textFile etc.) Is this because of the different naming conventions used by hive and spark to write records to hdfs? Or maybe it is not a recommended practice to write tables using different services? Your thoughts and comments on this matter would be highly appreciated! Thanks! Nitin
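
A hedged sketch of the interop pattern this thread is testing (Spark 1.x-era API, table name borrowed from Furcy's reply above): write to the Hive table from Spark, then refresh any cached file listing before re-counting, since a stale listing is one benign cause of counts diverging between services.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="hive-spark-interop")
    hc = HiveContext(sc)

    hc.sql("CREATE TABLE IF NOT EXISTS test_orc (id INT, name STRING) STORED AS ORC")
    hc.createDataFrame([(1, "a"), (2, "b")], ["id", "name"]).write.insertInto("test_orc")

    hc.refreshTable("test_orc")                     # drop any cached file listing
    hc.sql("SELECT COUNT(*) FROM test_orc").show()  # compare with the hive CLI count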

Re: Apache Spark : spark.eventLog.dir on Windows Environment

2015-07-21 Thread Nitin Kalra
Hi Akhil, I don't have HADOOP_HOME or HADOOP_CONF_DIR, or even winutils.exe. What's the configuration required for this? From where can I get winutils.exe? Thanks and Regards, Nitin Kalra On Tue, Jul 21, 2015 at 1:30 PM, Akhil Das wrote: > Do you have HADOOP_HOME, HADOOP
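
A hedged sketch of the usual Windows setup (all paths hypothetical): winutils.exe is a community-built native helper that Hadoop's filesystem code expects on Windows; it goes under %HADOOP_HOME%\bin, and spark.eventLog.dir should point at a directory that already exists.

    import os
    from pyspark import SparkConf

    # winutils.exe must live at C:\hadoop\bin\winutils.exe for this layout.
    os.environ["HADOOP_HOME"] = r"C:\hadoop"
    os.environ["PATH"] += os.pathsep + r"C:\hadoop\bin"

    conf = (SparkConf()
            .set("spark.eventLog.enabled", "true")
            .set("spark.eventLog.dir", "file:///C:/tmp/spark-events"))  # create the dir first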

Re: Does HiveContext connect to HiveServer2?

2015-06-24 Thread Nitin kak
Hi Marcelo, The issue does not happen while connecting to the hive metastore, that works fine. It seems that HiveContext only uses Hive CLI to execute the queries while HiveServer2 does not support it. I don't think you can specify any configuration in hive-site.xml which can make it connect to Hive

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-22 Thread Nitin kak
Any response to this guys? On Fri, Jun 19, 2015 at 2:34 PM, Nitin kak wrote: > Any other suggestions guys? > > On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak wrote: > >> With Sentry, only hive user has the permission for read/write/execute on >> the subdirectories of ware

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-19 Thread Nitin kak
Any other suggestions guys? On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak wrote: > With Sentry, only hive user has the permission for read/write/execute on > the subdirectories of warehouse. All the users get translated to "hive" > when interacting with hiveserver2. But i t

Re: Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread Nitin kak
to grant read execute access through sentry. > On 18 Jun 2015 05:47, "Nitin kak" > wrote: > >> I am trying to run a hive query from Spark code using HiveContext object. >> It was running fine earlier but since the Apache Sentry has been set >> in

Hive query execution from Spark(through HiveContext) failing with Apache Sentry

2015-06-17 Thread Nitin kak
I am trying to run a hive query from Spark code using HiveContext object. It was running fine earlier, but since Apache Sentry was installed the process is failing with this exception: org.apache.hadoop.security.AccessControlException: Permission denied: user=kakn, access=READ_EXECUT

Re: HiveContext test, "Spark Context did not initialize after waiting 10000ms"

2015-05-26 Thread Nitin kak
That is a much better solution than how I resolved it. I got around it by placing comma-separated jar paths for all the hive-related jars in the --jars clause. I will try your solution. Thanks for sharing it. On Tue, May 26, 2015 at 4:14 AM, Mohammad Islam wrote: > I got a similar problem. > I'm no

Re: how to clean shuffle write each iteration

2015-03-03 Thread nitin
Shuffle write will be cleaned if it is not referenced by any object directly/indirectly. There is a garbage collector written inside spark which periodically checks for weak references to RDDs/shuffle write/broadcast and deletes them.
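
A hedged illustration of the point above (sizes illustrative): once the driver-side references to an RDD lineage are dropped, Spark's cleaner can asynchronously delete the shuffle files that lineage produced, so iterative jobs should avoid holding on to old RDDs.

    from pyspark import SparkContext

    sc = SparkContext(appName="shuffle-cleanup-demo")
    for day in range(30):
        pairs = sc.parallelize(range(100000)).map(lambda x: (x % 100, 1))
        totals = pairs.reduceByKey(lambda a, b: a + b)   # produces shuffle output
        totals.count()
        # Rebinding `pairs`/`totals` next iteration drops the old references,
        # so the previous iteration's shuffle files become eligible for cleanup.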

Re: SQLContext.applySchema strictness

2015-02-14 Thread nitin
AFAIK, this is the expected behavior. You have to make sure that the schema matches the row. It won't give any error when you apply the schema as it doesn't validate the nature of data.
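
A minimal reproduction against the 1.x API this thread concerns (names hypothetical): the rows hold strings while the schema claims integers, yet applySchema succeeds, and the mismatch only surfaces once an action forces the data to be read.

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import StructType, StructField, IntegerType

    sc = SparkContext(appName="applyschema-demo")
    sqlContext = SQLContext(sc)

    rdd = sc.parallelize([("not", "ints")])              # rows do not match the schema
    schema = StructType([StructField("a", IntegerType(), True),
                         StructField("b", IntegerType(), True)])

    df = sqlContext.applySchema(rdd, schema)   # no error: nothing is validated here
    df.collect()                               # the mismatch only surfaces at an action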

Re: Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-11 Thread nitin
I was able to resolve this use case (Thanks Cheng Lian) where I wanted to launch executors on just the specific partition while also getting the batch pruning optimisations of Spark SQL, by doing the following: val query = sql("SELECT * FROM cachedTable WHERE key = 1") val plannedRDD = query.queryExe

Re: Spark (yarn-client mode) Hangs in final stages of Collect or Reduce

2015-02-09 Thread nitin
If the application has failed/succeeded, the logs get pushed to HDFS and can be accessed by the following command: yarn logs -applicationId <application_id> If it's still running, you can find executors' logs on the corresponding data nodes in the hadoop logs directory. Path should be something like: /data/hadoop_logs

Re: Spark (yarn-client mode) Hangs in final stages of Collect or Reduce

2015-02-09 Thread nitin
Have you checked the corresponding executor logs as well? The information provided here is too little to actually understand your issue.

Re: Spark Driver Host under Yarn

2015-02-09 Thread nitin
Are you running in yarn-cluster or yarn-client mode?

Spark SQL - Point lookup optimisation in SchemaRDD?

2015-02-09 Thread nitin
but in this case, same partition/data is getting cached twice. My question is: can we create a PartitionPruningCachedSchemaRDD-like class which can prune the partitions of InMemoryColumnarTableScan's RDD[CachedBatch] and launch executors on just the selected partition(s)? Thanks -Ni

Re: Where can I find logs set inside RDD processing functions?

2015-02-06 Thread Nitin kak
yarn.nodemanager.remote-app-log-dir is set to /tmp/logs On Fri, Feb 6, 2015 at 4:14 PM, Ted Yu wrote: > To add to What Petar said, when YARN log aggregation is enabled, consider > specifying yarn.nodemanager.remote-app-log-dir which is where aggregated > logs are saved. > > Cheers > > On Fri, F

Re: Where can I find logs set inside RDD processing functions?

2015-02-06 Thread Nitin kak
The yarn log aggregation is enabled, and the logs which I get through "yarn logs -applicationId <application_id>" are no different than what I get through logs in the Yarn Application tracking URL. They still don't have the above logs. On Fri, Feb 6, 2015 at 3:36 PM, Petar Zecevic wrote: > > You can enable YARN log a

Re: Sort based shuffle not working properly?

2015-02-04 Thread Nitin kak
: String, y : String) = x compare y }) On Wed, Feb 4, 2015 at 8:39 AM, Imran Rashid wrote: > I think you are interested in secondary sort, which is still being worked > on: > > https://issues.apache.org/jira/browse/SPARK-3655 > > On Tue, Feb 3, 2015 at 4:41 PM, Nitin kak wro

Re: Sort based shuffle not working properly?

2015-02-03 Thread Nitin kak
bother to also sort them within each partition On Tue, Feb 3, 2015 at 5:41 PM, Nitin kak wrote: > I thought thats what sort based shuffled did, sort the keys going to the > same partition. > > I have tried (c1, c2) as (Int, Int) tuple as well. I don't think that > orderi

Re: Sort based shuffle not working properly?

2015-02-03 Thread Nitin kak
I thought that's what sort-based shuffle did, sort the keys going to the same partition. I have tried (c1, c2) as (Int, Int) tuple as well. I don't think the ordering of c2's type is the problem here. On Tue, Feb 3, 2015 at 5:21 PM, Sean Owen wrote: > Hm, I don't think the sort partitioner is go
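
A hedged sketch of the secondary-sort pattern this thread converges on (data and partition count hypothetical), using repartitionAndSortWithinPartitions: partition on one part of the key while sorting each partition by the full composite key.

    from pyspark import SparkContext

    sc = SparkContext(appName="secondary-sort-demo")
    # Keys are (c1, c2) tuples; partition on c1 alone, sort on the whole key.
    pairs = sc.parallelize([((1, 5), "a"), ((1, 2), "b"), ((2, 9), "c"), ((1, 7), "d")])
    sorted_rdd = pairs.repartitionAndSortWithinPartitions(
        numPartitions=2, partitionFunc=lambda key: key[0])
    print(sorted_rdd.glom().collect())   # records inside each partition are in key order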

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-15 Thread Nitin kak
%) which eventually brings the total memory asked by Spark to approximately 22G. On Thu, Jan 15, 2015 at 12:54 PM, Nitin kak wrote: > Is this "Overhead memory" allocation used for any specific purpose? > > For example, will it be any different if I do "--executor-mem

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-15 Thread Nitin kak
I am sorry for the formatting error, the value for yarn.scheduler.maximum-allocation-mb = 28G On Thu, Jan 15, 2015 at 11:31 AM, Nitin kak wrote: > Thanks for sticking to this thread. > > I am guessing what memory my app requests and what Yarn requests on my > part should be

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-15 Thread Nitin kak
html and > spark.yarn.executor.memoryOverhead. By default it's 7% of your 20G or > about 1.4G. You might set this higher to 2G to give more overhead. > > See the --conf property=value syntax documented in > http://spark.apache.org/docs/latest/submitting-applications.html >
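
A hedged example of the setting under discussion (values illustrative; in this era of Spark the overhead is given in megabytes):

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.executor.memory", "20g")
            .set("spark.yarn.executor.memoryOverhead", "2048"))  # ~10% cushion over the 20G heap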

Re: Running "beyond memory limits" in ConnectedComponents

2015-01-14 Thread Nitin kak
Thanks Sean. I guess Cloudera Manager has the parameters executor_total_max_heapsize and worker_max_heapsize which point to the parameters you mentioned above. How big should the cushion between the JVM heap size and the YARN memory limit be? I tried setting jvm memory to 20g and yarn to 24g, but it ga

Re: Spark 1.2 Release Date

2014-12-18 Thread nitin
Soon enough :) http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-td9815.html

Re: spark-sql with join terribly slow.

2014-12-17 Thread nitin
IN key "id" and could prevent the shuffle by passing the partition information to in-memory caching. See - https://issues.apache.org/jira/browse/SPARK-4849 Thanks -Nitin

Re: SchemaRDD partition on specific column values?

2014-12-15 Thread Nitin Goyal
Hi Michael, I have opened following JIRA for the same :- https://issues.apache.org/jira/browse/SPARK-4849 I am having a look at the code to see what can be done and then we can have a discussion over the approach. Let me know if you have any comments/suggestions. Thanks -Nitin On Sun, Dec 14

Re: SchemaRDD partition on specific column values?

2014-12-11 Thread nitin
Can we take this as a performance improvement task in Spark-1.2.1? I can help contribute for this.

Re: PhysicalRDD problem?

2014-12-09 Thread Nitin Goyal
I see that somebody had already raised a PR for this but it hasn't been merged. https://issues.apache.org/jira/browse/SPARK-4339 Can we merge this in next 1.2 RC? Thanks -Nitin On Wed, Dec 10, 2014 at 11:50 AM, Nitin Goyal wrote: > Hi Michael, > > I think I have found the ex

Re: PhysicalRDD problem?

2014-12-09 Thread Nitin Goyal
think the solution here is to make the FixedPoint constructor argument configurable/parameterized (also written as a TODO). Do we have a plan to do this in the 1.2 release? Or I can take this up as a task for myself if you want (since this is very crucial for our release). Thanks -Nitin On Wed, Dec 10

Re: registerTempTable: Table not found

2014-12-09 Thread nitin
Looks like this issue has been fixed very recently and should be available in the next RC: http://apache-spark-developers-list.1001551.n3.nabble.com/CREATE-TABLE-AS-SELECT-does-not-work-with-temp-tables-in-1-2-0-td9662.html

PhysicalRDD problem?

2014-12-08 Thread nitin
create new RDD by applying schema again and using the existing schema RDD further (in case of simple queries), but then for complex queries, I get TreeNodeException (Unresolved Attributes) as I mentioned. Let me know if you need any more info around my problem. Thanks in Advance -Nitin

Re: SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
With some quick googling, I learnt that we can provide "distribute by <column>" in Hive QL to distribute data based on a column's values. My question now is: if I use "distribute by id", will there be any performance improvements? Will I be able to avoid data movement in the shuffle (Exchange before the JOIN step)

SchemaRDD partition on specific column values?

2014-12-04 Thread nitin
Hi All, I want to hash partition (and then cache) a schema RDD in a way that the partitions are based on the hash of the values of a column ("ID" column in my case). e.g. if my table has "ID" column with values as 1,2,3,4,5,6,7,8,9 and spark.sql.shuffle.partitions is configured as 3, then there should be
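
A hedged sketch of the two routes later discussed in this thread (table/column names hypothetical): HiveQL's DISTRIBUTE BY, or column-based repartition from the later DataFrame API, followed by a cache so that subsequent equi-joins on ID find co-located data.

    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="distribute-by-demo")
    sqlContext = HiveContext(sc)    # DISTRIBUTE BY needs HiveQL support in this era

    # HiveQL route: hash-distribute rows on the ID column, then cache.
    hashed = sqlContext.sql("SELECT * FROM my_table DISTRIBUTE BY id")
    hashed.cache()

    # DataFrame route (column-based repartition, Spark 1.6+):
    hashed2 = sqlContext.table("my_table").repartition(3, "id")
    hashed2.cache()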

Re: Partition sorting by Spark framework

2014-11-05 Thread Nitin kak
Great!! Will try it. Thanks for answering. On Wed, Nov 5, 2014 at 5:19 PM, Vipul Pandey wrote: > One option is that after partitioning you call setKeyOrdering explicitly > on a new ShuffledRDD : > > val rdd = // your rdd > > > val srdd = new > org.apache.spark.rdd.ShuffledRDD(rdd,rdd.partitioner

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Nitin kak
Somehow it worked by placing all the jars (except guava) in hive lib after "--jars". I had initially tried placing the jars under another temporary folder and pointing the executor and driver "extraClassPath" to that directory, but it didn't work. On Mon, Oct 27, 2014 at 2:21 PM, Nitin

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Nitin kak
the plain Apache version of Spark on CDH Yarn. > > On Mon, Oct 27, 2014 at 11:10 AM, Nitin kak wrote: > >> Yes, I added all the Hive jars present in Cloudera distribution of >> Hadoop. I added them because I was getting ClassNotFoundException for many >> required classe

Re: Is Spark 1.1.0 incompatible with Hive?

2014-10-27 Thread Nitin kak
Yes, I added all the Hive jars present in the Cloudera distribution of Hadoop. I added them because I was getting ClassNotFoundException for many required classes (one example stack trace below). So, someone in the community suggested including the hive jars: Exception in thread "main" java.lang.NoCl