Hi Chetan,
You can create a static Parquet file, and when you create a DataFrame you
can pass the locations of both files with the option mergeSchema set to
true. This will always give you a DataFrame even if the original file is
not present.
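Something along these lines should do it (rough sketch; the paths are
placeholders, and I'm assuming the Spark 1.3+ DataFrame reader):

// mergeSchema merges the schemas of all Parquet files found under the given paths
val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/data/static-schema.parquet", "/data/actual-data.parquet")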
Kuchekar, Nilesh
On Sat, May 9
tables that
are most frequently joined together are co-located on the same nodes.
Any thoughts on how I can do this with Spark? Any internal hack ideas are
welcome too. :)
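The closest I've found so far is co-partitioning both RDDs with the same
partitioner so that the join itself avoids a shuffle (rough sketch; usersRdd
and ordersRdd are hypothetical pair RDDs keyed on the join key, and this only
aligns partitions rather than pinning them to specific nodes):

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(64)
val users = usersRdd.partitionBy(partitioner).cache()
val orders = ordersRdd.partitionBy(partitioner).cache()

// Both sides share the same partitioner, so the join is a narrow
// dependency and neither side is shuffled again.
val joined = users.join(orders)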
Cheers,
Nilesh
probability. Can you provide any pointers to documentation that I can
reference for implementing this? Thanks!
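The rough idea I have so far is to average the votes of the individual trees
exposed by the model (untested sketch, assuming binary classification with
the RDD-based MLlib API):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// RandomForestModel.predict only returns the majority-vote label, but the
// underlying trees are exposed, so the class-1 probability can be
// approximated as the fraction of trees voting for class 1.
def classOneProbability(model: RandomForestModel, features: Vector): Double = {
  val votes = model.trees.map(_.predict(features))
  votes.count(_ == 1.0).toDouble / votes.length
}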
-Nilesh
Did some digging in the documentation. It looks like IDFModel.transform only
accepts an RDD as input, not individual elements. Is this a bug? I am asking
because HashingTF.transform accepts both RDDs and individual elements as
input.
From your post replying to Jatin, looks like yo
.spark.mllib.linalg.Vector]
cannot be applied to (org.apache.spark.mllib.linalg.Vector)
val transformedValues = idfModel.transform(values)
It seems to be getting confused by the multiple (Java and Scala) transform
overloads.
Any insights?
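The workaround I'm leaning towards in the meantime is wrapping the single
vector in a one-element RDD so that only the RDD-based overload applies
(untested sketch, assuming an existing SparkContext sc):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Wrap the single vector in a one-element RDD so the RDD overload of
// transform is chosen unambiguously, then pull the result back out.
val singleVectorRdd: RDD[Vector] = sc.parallelize(Seq(values))
val transformedValue: Vector = idfModel.transform(singleVectorRdd).first()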
Thanks,
Nilesh
CC'ing Tathagata too.
Cheers,
Nilesh
[1]:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201405.mbox/%3ccamwrk0kiqxhktfuaamhborov5lv+d8y+c5nycmsxtqasze4...@mail.gmail.com%3E
next job/iteration directly, and (b) I
wouldn't even be able to retrieve the dense vector iteratively, and my vector
would become bound by the memory of the driver node.
Any ideas how I can make this work for me?
Cheers,
Nilesh
[1]:
http://hadoop.apache.org/docs/r2.3.0/api/org/apache/hadoop/mapreduce/l
Hey Michael,
Thanks for the great reply! That clears things up a lot. The idea about
Apache Kafka sounds very interesting; I'll look into it. The multiple
consumers and fault tolerance sound awesome. That's probably what I need.
Cheers,
Nilesh
since Spark workers local to the worker actors should get the data fast,
and some optimization like this is definitely done, I assume?
I suppose the only benefit of HDFS would be better fault tolerance, and the
ability to checkpoint and recover even if the master fails.
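For the checkpoint-and-recover part, I'm picturing the standard Spark
Streaming recipe (rough sketch; the checkpoint directory and batch interval
are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-app" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-example")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define the streaming computation here ...
  ssc
}

// On a fresh start this builds the context; after a driver failure it
// recovers the context and pending state from the checkpoint directory.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()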
Cheers,
Nilesh