Re: AnalysisException - Infer schema for the Parquet path

2020-05-09 Thread Nilesh Kuchekar
Hi Chetan, You can have a static Parquet file created, and when you create a DataFrame you can pass the locations of both files, with the option mergeSchema set to true. This will always return a DataFrame even if the original file is not present. Kuchekar, Nilesh On Sat, May 9
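The workaround described above can be sketched roughly as follows. This is an untested sketch: the paths are placeholders, and it assumes a `SparkSession` named `spark` and the standard `DataFrameReader` API.

```scala
// Sketch: read a static "seed" Parquet file alongside the real output path,
// with schema merging enabled, so that a DataFrame with the expected schema
// comes back even when the real output is empty. Both paths are hypothetical.
val df = spark.read
  .option("mergeSchema", "true") // merge the schemas of all files read
  .parquet(
    "/data/static-schema-seed.parquet", // always-present file with the schema
    "/data/actual-output.parquet"       // the path that may be missing data
  )
```

Note that `mergeSchema` unions column sets across files; columns absent from one file come back as nulls for that file's rows.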

Custom positioning/partitioning Dataframes

2016-06-03 Thread Nilesh Chakraborty
tables that are most frequently joined together are located locally together. Any thoughts on how I can do this with Spark? Any internal hack ideas are welcome too. :) Cheers, Nilesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Custom-positioning

Predicting Class Probability with Gradient Boosting/Random Forest

2015-02-12 Thread nilesh
probability. Can you provide any pointers to documentation that I can reference for implementing this? Thanks! -Nilesh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Predicting-Class-Probability-with-Gradient-Boosting-Random-Forest-tp21633.html Sent from the
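In MLlib of that era, `GradientBoostedTreesModel.predict` returns only a hard 0/1 label. A common workaround, sketched below under the assumption of a log-loss-trained binary model, is to recompute the raw margin from the individual trees (exposed as `trees` and `treeWeights`) and squash it with a logistic function; the function name `predictProbability` is a hypothetical helper, not an MLlib API.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

// Approximate P(label = 1) for a binary GBT model trained with log loss.
// Untested sketch: the 2x factor in the sigmoid matches the convention used
// by MLlib's log-loss formulation, and may need adjusting for other losses.
def predictProbability(model: GradientBoostedTreesModel, features: Vector): Double = {
  // Raw margin = weighted sum of the per-tree predictions
  val margin = model.trees.zip(model.treeWeights)
    .map { case (tree, weight) => tree.predict(features) * weight }
    .sum
  1.0 / (1.0 + math.exp(-2.0 * margin))
}
```

For a random forest, the analogous trick is to average the per-tree votes for class 1 instead of thresholding the ensemble prediction.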

Re: New API for TFIDF generation in Spark 1.1.0

2014-10-09 Thread nilesh
Did some digging in the documentation. Looks like IDFModel.transform only accepts an RDD as input, and not individual elements. Is this a bug? I am saying this because HashingTF.transform accepts both an RDD and individual vector elements as its input. From your post replying to Jatin, looks like yo

Re: New API for TFIDF generation in Spark 1.1.0

2014-10-09 Thread nilesh
.spark.mllib.linalg.Vector] cannot be applied to (org.apache.spark.mllib.linalg.Vector) val transformedValues = idfModel.transform(values) It seems to be getting confused with the multiple (Java and Scala) transform methods. Any insights? Thanks, Nilesh -- View this message in context: http://apache-spark-
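The error above is consistent with Spark 1.1, where `IDFModel.transform` only accepts an `RDD[Vector]` (a `transform(Vector)` overload appeared in later releases). A sketch of the usual workaround, wrapping the single vector in a one-element RDD, assuming a `SparkContext` named `sc` and an input corpus `documents: RDD[Seq[String]]`:

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val tf = new HashingTF()
// documents: RDD[Seq[String]] is assumed to exist (the tokenized corpus)
val termFreqs: RDD[Vector] = tf.transform(documents)
val idfModel = new IDF().fit(termFreqs)

// Workaround for a single new document: HashingTF.transform does accept a
// single element, but IDFModel.transform (in 1.1) needs an RDD, so wrap it.
val single: Vector = tf.transform(Seq("spark", "tfidf"))
val transformed: Vector = idfModel.transform(sc.parallelize(Seq(single))).first()
```

This round-trips through the cluster for a single vector, so it is only suitable for occasional lookups, not per-record scoring.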

Alternative to checkpointing and materialization for truncating lineage in high iteration jobs

2014-06-28 Thread Nilesh Chakraborty
CC'ing Tathagata too. Cheers, Nilesh [1]: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201405.mbox/%3ccamwrk0kiqxhktfuaamhborov5lv+d8y+c5nycmsxtqasze4...@mail.gmail.com%3E -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Alternative-to-ch

Accumulable with huge accumulated value?

2014-06-14 Thread Nilesh Chakraborty
next job/iteration directly, and (b) I wouldn't even be able to retrieve the dense vector iteratively and my vector would become driver-node-memory bound. Any ideas how I can make this work for me? Cheers, Nilesh [1]: http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/l
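For reference, accumulating a large dense vector in Spark's legacy accumulator API can be sketched as below. This is a hypothetical illustration using `AccumulatorParam`; as the message above notes, the merged value is materialized on the driver, so it remains bounded by driver memory.

```scala
import org.apache.spark.AccumulatorParam

// Element-wise-sum accumulator over a large dense Array[Double].
// Untested sketch; `size` is an assumed, caller-supplied dimension.
class DenseVectorAccumulatorParam(size: Int) extends AccumulatorParam[Array[Double]] {
  def zero(initial: Array[Double]): Array[Double] = new Array[Double](size)
  def addInPlace(a: Array[Double], b: Array[Double]): Array[Double] = {
    var i = 0
    while (i < a.length) { a(i) += b(i); i += 1 } // mutate in place to avoid copies
    a
  }
}

// Usage (driver side, with a SparkContext named sc):
// val acc = sc.accumulator(new Array[Double](size))(new DenseVectorAccumulatorParam(size))
// rdd.foreach { row => acc += contributionFor(row) } // contributionFor is hypothetical
```

Because `value` is only readable on the driver after the job finishes, this does not help feed the vector into the next iteration without a driver round-trip, which is exactly the limitation the thread is asking about.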

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
Hey Michael, Thanks for the great reply! That clears things up a lot. The idea about Apache Kafka sounds very interesting; I'll look into it. The multiple consumers and fault tolerance sound awesome. That's probably what I need. Cheers, Nilesh -- View this message in context: htt

Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Nilesh Chakraborty
since Spark workers local to the worker actors should get the data fast, and some optimization like this is definitely done I assume? I suppose the only benefit with HDFS would be better fault tolerance, and the ability to checkpoint and recover even if master fails. Cheers, Nilesh -- View this me