== Analyzed Logical Plan ==
index: string, 0: string
Relation [index#50,0#51] csv
== Optimized Logical Plan ==
Relation [index#50,0#51] csv
== Physical Plan ==
FileScan csv [index#50,0#51] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/home/nitin/work/df1.cs
> On Sun, 7 May
ain in the second run. You can
> also confirm it in other metrics from Spark UI.
>
> That is my personal understanding based on what I have read and seen on my
> job runs. If there is any mistake, feel free to correct me.
>
> Thank You & Best Regards
> Winston Lai
> --
//monsoon-credittech.appspot.com/spark_datasets/random_tests/dfr_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct
--
Regards,
Nitin
:     +- Exchange hashpartitioning(a#24, 200), ENSURE_REQUIREMENTS, [id=#61]
:        +- Filter isnotnull(a#24)
:           +- FileScan parquet [a#24,b#25,c#26] Batched: true, DataFilters: [isnotnull(a#24)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/nitin/pymonsoon/bucket_test_parquet1],
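For reference, a hedged PySpark sketch of how plan output like the above can be produced; the paths and column names are illustrative, not taken from the messages in this thread.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# explain(True) prints the analyzed, optimized and physical plans, similar to the output above.
df = spark.read.option("header", "true").csv("/path/to/df1.csv")
df.explain(True)

# A non-null filter appears as DataFilters / PushedFilters on the FileScan node,
# as in the Parquet plan fragment above.
pq = spark.read.parquet("/path/to/bucketed_table")
pq.filter(col("a").isNotNull()).explain(True)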
I understand pandas UDFs as follows:
1. There are multiple partitions per worker
2. Multiple Arrow batches are converted per partition
3. These are sent to the Python process
4. In the case of Series to Series, the pandas UDF is applied to each Arrow
batch one after the other? (So, is it that (a) - The vectorisat
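A minimal Series-to-Series pandas UDF, for concreteness (a sketch only; the column names and values are illustrative and assume a Spark 3.x-style pandas_udf with Arrow enabled). Each call receives one Arrow batch as a pandas Series, so the body is vectorised over that batch:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumn("x", col("id") * 0.5)

@pandas_udf("double")
def plus_one(x: pd.Series) -> pd.Series:
    # Invoked once per Arrow batch: the whole batch arrives as a pandas Series.
    return x + 1.0

df.select(plus_one(col("x")).alias("x_plus_one")).show()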
Hi Deepak,
Please let us know how you managed it.
Thanks,
NJ
On Mon, Jun 10, 2019 at 4:42 PM Deepak Sharma wrote:
> Thanks All.
> I managed to get this working.
> Marking this thread as closed.
>
> On Mon, Jun 10, 2019 at 4:14 PM Deepak Sharma
> wrote:
>
>> This is the project requirement,
away from SAS due to the cost, it would be really good to have these
algorithms in Spark ML.
Let me know if you need any more info; I can share some snippets if
required.
Thanks,
Nitin
On Thu, Sep 8, 2016 at 2:08 PM, Robin East wrote:
> Do you have any particular algorithms in mind? If you st
If others concur, we can go ahead and report it as a bug.
Regards,
Nitin
On Mon, Aug 22, 2016 at 4:15 PM, Furcy Pin wrote:
> Hi Nitin,
>
> I confirm that there is something odd here.
>
> I did the following test :
>
> create table test_orc (id int, name string, dept
iple times, the count query in the Hive terminal works perfectly fine.
This problem occurs for tables stored with different storage formats as
well (textFile etc.).
Is this because of the different naming conventions used by Hive and Spark
to write records to HDFS? Or maybe it is not a recommended practice to
write tables using different services?
Your thoughts and comments on this matter would be highly appreciated!
Thanks!
Nitin
Hi Akhil,
I don't have HADOOP_HOME or HADOOP_CONF_DIR set, and not even winutils.exe. What
configuration is required for this? Where can I get winutils.exe?
Thanks and Regards,
Nitin Kalra
On Tue, Jul 21, 2015 at 1:30 PM, Akhil Das
wrote:
> Do you have HADOOP_HOME, HADOOP
Hi Marcelo,
The issue does not happen while connecting to the Hive metastore; that works
fine. It seems that HiveContext only uses the Hive CLI to execute the queries,
while HiveServer2 does not support it. I don't think you can specify any
configuration in hive-site.xml which can make it connect to Hive
Any response to this guys?
On Fri, Jun 19, 2015 at 2:34 PM, Nitin kak wrote:
> Any other suggestions guys?
>
> On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak wrote:
>
>> With Sentry, only hive user has the permission for read/write/execute on
>> the subdirectories of ware
Any other suggestions guys?
On Wed, Jun 17, 2015 at 7:54 PM, Nitin kak wrote:
> With Sentry, only hive user has the permission for read/write/execute on
> the subdirectories of warehouse. All the users get translated to "hive"
> when interacting with hiveserver2. But i t
to grant read execute access through sentry.
> On 18 Jun 2015 05:47, "Nitin kak" > wrote:
>
>> I am trying to run a hive query from Spark code using HiveContext object.
>> It was running fine earlier but since the Apache Sentry has been set
>> in
I am trying to run a Hive query from Spark code using a HiveContext object.
It was running fine earlier, but since Apache Sentry has been
installed the process is failing with this exception:
org.apache.hadoop.security.AccessControlException: Permission denied:
user=kakn, access=READ_EXECUT
That is a much better solution than how I resolved it. I got around it by
placing comma-separated jar paths for all the Hive-related jars in the --jars
clause.
I will try your solution. Thanks for sharing it.
On Tue, May 26, 2015 at 4:14 AM, Mohammad Islam wrote:
> I got a similar problem.
> I'm no
Shuffle write data will be cleaned up if it is not referenced by any object,
directly or indirectly. Spark has an internal cleaner that periodically checks
for weak references to RDDs, shuffle writes and broadcast variables and deletes
the ones that are no longer reachable.
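As a rough illustration of what "no longer referenced" means in practice (an illustrative sketch, not the cleaner's own code; the RDD and key layout are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize(range(10000)).map(lambda i: (i % 10, i))
grouped = pairs.groupByKey().cache()
grouped.count()      # triggers the shuffle and materialises the cache

grouped.unpersist()  # release the cached blocks explicitly
del grouped          # once the RDD is garbage collected on the driver, the periodic
                     # cleaner can reclaim the shuffle data it produced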
AFAIK, this is the expected behavior. You have to make sure that the schema
matches the rows. Applying the schema won't raise any error because the data is
not validated against the schema at that point.
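A small sketch of that behaviour using createDataFrame, the modern equivalent of applySchema (the values and schema here are made up): applying the schema to an RDD raises no error, and the mismatch only surfaces when an action runs.

from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize([("not-an-int",)])   # value does not match the schema
schema = StructType([StructField("id", IntegerType())])

df = spark.createDataFrame(rdd, schema)   # no error here: the rows are not checked yet
df.count()                                # the mismatch only surfaces at execution time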
I was able to resolve this use case (thanks Cheng Lian), where I wanted to
launch executors on just the specific partition while also getting the batch
pruning optimisations of Spark SQL, by doing the following:
val query = sql("SELECT * FROM cachedTable WHERE key = 1")
val plannedRDD = query.queryExe
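(The rest of that snippet is truncated above.) A simplified PySpark sketch of the same setup, not the exact internal-API approach from the original message, with a hypothetical table path: cache the table, then query with a selective predicate so Spark SQL can skip in-memory batches whose statistics rule out the key.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/table")      # hypothetical input
df.createOrReplaceTempView("cachedTable")
spark.catalog.cacheTable("cachedTable")
spark.table("cachedTable").count()             # materialise the in-memory columnar cache

result = spark.sql("SELECT * FROM cachedTable WHERE key = 1")
result.show()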
If the application has failed or succeeded, the logs get pushed to HDFS and can
be accessed with the following command:
yarn logs -applicationId
If it's still running, you can find the executors' logs on the corresponding data
nodes in the Hadoop logs directory. The path should be something like:
/data/hadoop_logs
Have you checked the corresponding executor logs as well? I think the information
you have provided here is too little to actually understand your issue.
Are you running in yarn-cluster or yarn-client mode?
but in this case, the same partition/data is getting cached twice.
My question is: can we create a PartitionPruningCachedSchemaRDD-like
class which can prune the partitions of InMemoryColumnarTableScan's
RDD[CachedBatch] and launch executors on just the selected partition(s)?
Thanks
-Ni
yarn.nodemanager.remote-app-log-dir is set to /tmp/logs
On Fri, Feb 6, 2015 at 4:14 PM, Ted Yu wrote:
> To add to What Petar said, when YARN log aggregation is enabled, consider
> specifying yarn.nodemanager.remote-app-log-dir which is where aggregated
> logs are saved.
>
> Cheers
>
> On Fri, F
The YARN log aggregation is enabled, and the logs which I get through "yarn
logs -applicationId "
are no different from what I get through the logs in the YARN application tracking
URL. They still don't have the above logs.
On Fri, Feb 6, 2015 at 3:36 PM, Petar Zecevic
wrote:
>
> You can enable YARN log a
: String, y : String) = x compare y
})
On Wed, Feb 4, 2015 at 8:39 AM, Imran Rashid wrote:
> I think you are interested in secondary sort, which is still being worked
> on:
>
> https://issues.apache.org/jira/browse/SPARK-3655
>
> On Tue, Feb 3, 2015 at 4:41 PM, Nitin kak wro
bother to also sort them within each
partition
On Tue, Feb 3, 2015 at 5:41 PM, Nitin kak wrote:
> I thought thats what sort based shuffled did, sort the keys going to the
> same partition.
>
> I have tried (c1, c2) as (Int, Int) tuple as well. I don't think that
> orderi
I thought that's what sort-based shuffle did: sort the keys going to the
same partition.
I have tried (c1, c2) as an (Int, Int) tuple as well. I don't think the
ordering of the c2 type is the problem here.
On Tue, Feb 3, 2015 at 5:21 PM, Sean Owen wrote:
> Hm, I don't think the sort partitioner is go
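For the secondary-sort pattern discussed in this thread, one way to express it in a current API is repartitionAndSortWithinPartitions: partition by c1 only, but sort each partition by the full (c1, c2) key. A hedged PySpark sketch with made-up data:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Keys are (c1, c2) tuples; values are arbitrary payloads.
data = sc.parallelize([((1, 9), "a"), ((1, 2), "b"), ((2, 5), "c"), ((1, 7), "d")])

result = data.repartitionAndSortWithinPartitions(
    numPartitions=4,
    partitionFunc=lambda key: key[0],  # route records by c1 only (mod numPartitions is applied internally)
    keyfunc=lambda key: key,           # sort by the full (c1, c2) key inside each partition
)
print(result.glom().collect())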
%) which
eventually brings the total memory asked by Spark to approximately 22G.
On Thu, Jan 15, 2015 at 12:54 PM, Nitin kak wrote:
> Is this "Overhead memory" allocation used for any specific purpose.
>
> For example, will it be any different if I do *"--executor-mem
I am sorry for the formatting error; the value for
yarn.scheduler.maximum-allocation-mb = 28G
On Thu, Jan 15, 2015 at 11:31 AM, Nitin kak wrote:
> Thanks for sticking to this thread.
>
> I am guessing what memory my app requests and what Yarn requests on my
> part should be
html and
> spark.yarn.executor.memoryOverhead By default it's 7% of your 20G or
> about 1.4G. You might set this higher to 2G to give more overhead.
>
> See the --config property=value syntax documented in
> http://spark.apache.org/docs/latest/submitting-applications.html
>
Thanks Sean.
I guess Cloudera Manager has the parameters executor_total_max_heapsize
and worker_max_heapsize,
which point to the parameters you mentioned above.
How big should that cushion between the JVM heap size and the YARN memory limit
be?
I tried setting the JVM memory to 20g and YARN to 24g, but it ga
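A hedged sketch of setting the two values explicitly (the numbers are illustrative, not a recommendation): 20g of executor heap plus 2g of overhead keeps the 22g total container request under a 24g YARN limit.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "20g")
    # Overhead in MB; property name from the Spark 1.x era, newer releases
    # call it spark.executor.memoryOverhead.
    .config("spark.yarn.executor.memoryOverhead", "2048")
    .getOrCreate()
)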
Soon enough :)
http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-td9815.html
IN key "id" and could prevent the shuffle by passing
the partition information to in-memory caching.
See - https://issues.apache.org/jira/browse/SPARK-4849
Thanks
-Nitin
Hi Michael,
I have opened following JIRA for the same :-
https://issues.apache.org/jira/browse/SPARK-4849
I am having a look at the code to see what can be done and then we can have
a discussion over the approach.
Let me know if you have any comments/suggestions.
Thanks
-Nitin
On Sun, Dec 14
Can we take this up as a performance improvement task in Spark 1.2.1? I can help
contribute to this.
I see that somebody had already raised a PR for this but it hasn't been
merged.
https://issues.apache.org/jira/browse/SPARK-4339
Can we merge this in next 1.2 RC?
Thanks
-Nitin
On Wed, Dec 10, 2014 at 11:50 AM, Nitin Goyal wrote:
> Hi Michael,
>
> I think I have found the ex
think the solution here is to make the FixedPoint constructor argument
configurable/parameterized (this is also written as a TODO). Do we have a plan to do
this in the 1.2 release? Or I can take this up as a task myself if you want
(since this is very crucial for our release).
Thanks
-Nitin
On Wed, Dec 10
Looks like this issue has been fixed very recently and should be available in
the next RC:
http://apache-spark-developers-list.1001551.n3.nabble.com/CREATE-TABLE-AS-SELECT-does-not-work-with-temp-tables-in-1-2-0-td9662.html
create a new RDD by
applying the schema again and using the existing schema RDD further (in the case of
simple queries), but then for complex queries I get a TreeNodeException
(Unresolved Attributes) as I mentioned.
Let me know if you need any more info around my problem.
Thanks in Advance
-Nitin
With some quick googling, I learnt that we can provide "distribute by
<column>" in Hive QL to distribute data based on a column's values. My
question now is: if I use "distribute by id", will there be any performance
improvement? Will I be able to avoid data movement in the shuffle (the Exchange
before the JOIN step)?
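Both routes can be sketched in PySpark (the table path is hypothetical, and the sketch only illustrates the idea rather than making any performance claim): DISTRIBUTE BY in SQL, and the DataFrame-level equivalent, a hash repartition on the join column.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/table")      # hypothetical input
df.createOrReplaceTempView("t")

# SQL route: hash-distribute rows by id before further processing.
distributed = spark.sql("SELECT * FROM t DISTRIBUTE BY id")

# DataFrame route: hash-partition on the id column, then cache, so a later
# join on id can reuse the partitioning.
by_id = df.repartition("id").cache()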
Hi All,
I want to hash-partition (and then cache) a schema RDD in such a way that the
partitions are based on the hash of the values of a column (the "ID" column in my
case).
e.g. if my table has an "ID" column with values 1,2,3,4,5,6,7,8,9 and
spark.sql.shuffle.partitions is configured as 3, then there should be
Great!! Will try it. Thanks for answering.
On Wed, Nov 5, 2014 at 5:19 PM, Vipul Pandey wrote:
> One option is that after partitioning you call setKeyOrdering explicitly
> on a new ShuffledRDD :
>
> val rdd = // your rdd
>
>
> val srdd = new
> org.apache.spark.rdd.ShuffledRDD(rdd,rdd.partitioner
It somehow worked by placing all the jars (except Guava) from the Hive lib after
"--jars". I had initially tried placing the jars under another temporary
folder and pointing the executor and driver "extraClassPath" to that
directory, but it didn't work.
On Mon, Oct 27, 2014 at 2:21 PM, Nitin
the plain Apache version of Spark on CDH Yarn.
>
> On Mon, Oct 27, 2014 at 11:10 AM, Nitin kak wrote:
>
>> Yes, I added all the Hive jars present in Cloudera distribution of
>> Hadoop. I added them because I was getting ClassNotFoundException for many
>> required classe
Yes, I added all the Hive jars present in the Cloudera distribution of Hadoop.
I added them because I was getting ClassNotFoundException for many required
classes (one example stack trace below). So someone in the community
suggested including the Hive jars:
Exception in thread "main" java.lang.NoCl