You could call map on an RDD which has “many” partitions, then call
repartition/coalesce to drastically reduce the number of partitions so that
your second map job has fewer tasks running.
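A minimal sketch of that suggestion, assuming a 200-partition text input;
the path and the two transformations are placeholders:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("coalesce-sketch"))
// Assume an input that arrives with many partitions (placeholder path).
val raw = sc.textFile("hdfs:///some/input", minPartitions = 200)
val parsed = raw.map(_.split(","))   // first map: one task per input partition
// coalesce shrinks the partition count without a full shuffle
// (use repartition if you want a shuffle that rebalances the data).
val small = parsed.coalesce(10)
val result = small.map(_.length)     // second map: only 10 tasks
result.count()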
—
Pedro Rodriguez
PhD Student in Large-Scale Machine Learning | CU Boulder
Systems Oriented Data Scientist
Hi,
My understanding is that these two map functions will end up as a job
with one stage (as if you wrote the two maps as a single map), so you
really need as many vcores and as much memory as possible for map1 and map2.
I initially thought about dynamic allocation of executors, which may or
may not help you.
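To illustrate the stage-fusion point, a small sketch (names and numbers are
placeholders): two consecutive maps with no shuffle between them are
pipelined into the same stage and run in the same tasks.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("fused-maps"))
val rdd = sc.parallelize(1 to 1000, numSlices = 8)
val twoMaps = rdd.map(_ * 2).map(_ + 1)   // one stage, 8 tasks
val oneMap  = rdd.map(x => x * 2 + 1)     // same stage boundary, same task count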
Hi Yanbo,
Appreciate the response. I might not have phrased this correctly, but I
really wanted to know how to convert the pipeline RDD into a DataFrame. I
have seen the example you posted. However, I need to transform all of my
data, not just one line. So I did successfully use map to apply the ChiSqSelector
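A hedged sketch of one way to do that conversion (Spark 2.0 API assumed;
the small sample RDD below stands in for the real ChiSqSelector output):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
// Placeholder for the RDD[LabeledPoint] produced by the ChiSqSelector step.
val transformed = spark.sparkContext.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 2.0)),
  LabeledPoint(1.0, Vectors.dense(3.0, 4.0))))
// LabeledPoint is a case class, so createDataFrame yields "label" and "features" columns.
val df = spark.createDataFrame(transformed)
df.show()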
Hi Spark experts,
Spark version: 2.0.0-preview
Hadoop version: 2.4 and 2.7 (both tried, neither works)
The data is in parquet format and stored in hdfs:
/root/file/partition1/file-xxx.parquet
/root/file/partition2/file-xxx.parquet
Then I did:
sqlContext.read.format("parquet").load("hdf
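The command above is cut off, but if the goal is to read all the partition
subdirectories at once, one commonly used workaround is a glob path, since
directories that are not named key=value are not picked up by partition
discovery. A hedged sketch, assuming the spark-shell sqlContext and the
example paths above:

val df = sqlContext.read.format("parquet").load("hdfs:///root/file/*/")
df.printSchema()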
Hi Mark, thanks. We just want to keep our system as simple as possible.
Using YARN means we need to maintain a full-size Hadoop cluster; we are
using S3 as the storage layer, so HDFS is not needed, and a Hadoop cluster
is a little bit overkill. Mesos is an option, but it still brings extra
operational costs.
Hi,
Could you please share your thoughts if anyone has ideas on the topics
below.
- How do I achieve high availability with a Spark cluster? I have referred
to this link:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/exercises/spark-exercise-standalone-master-ha.html
Just to add:
I want to read the MAX_OFFSET of a topic, then read from MAX_OFFSET-200,
every time.
Also, I want to know: if I want to fetch a specific offset range for batch
processing, is there any option for doing that?
On Sat, Jul 16, 2016 at 9:08 PM, Rabin Banerjee <
dev.rabin.baner...@gm
Hi All,
I have 1000 Kafka topics, each storing messages for different devices. I
want to use the direct approach for connecting to Kafka from Spark, in
which I am only interested in the latest 200 messages in Kafka.
How do I do that?
Thanks.
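For the specific-offset-range question, a hedged sketch using the
spark-streaming-kafka 0.8 direct API; the broker address, topic, partition,
and the way MAX_OFFSET is obtained are all placeholders:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkContext
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

// Fetch messages [maxOffset - 200, maxOffset) of one topic partition as a batch RDD.
def lastMessages(sc: SparkContext, topic: String, partition: Int, maxOffset: Long) = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val ranges = Array(OffsetRange(topic, partition, math.max(0L, maxOffset - 200), maxOffset))
  KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](sc, kafkaParams, ranges)
}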
Spark 1.5 only supports getting feature importances for
RandomForestClassificationModel and RandomForestRegressionModel from Scala.
This feature is not supported in PySpark until 2.0.0.
It's very straightforward with a few lines of code.
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexe
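For reference, a hedged sketch of the Scala equivalent mentioned above
(column names and the `training` DataFrame are placeholders):

import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setMaxDepth(2)
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
val model = rf.fit(training)        // `training` is a placeholder DataFrame
println(model.featureImportances)   // per-feature importances as a Vector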
Currently we do not expose the APIs to get the Bisecting KMeans tree
structure; they are private to the ml.clustering package scope.
But I think we should make a plan to expose these APIs, like what we did for
Decision Tree.
Thanks
Yanbo
2016-07-12 11:45 GMT-07:00 roni :
> Hi Spark/MLlib experts,
Since you use two steps (StringIndexer and OneHotEncoder) to encode
categories to a Vector, I guess you want to decode the eventual vector into
the original categories.
Suppose you have a DataFrame with only one column named "name", and there are
three categories: "b", "a", "c" (ranked by frequency). Y
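A hedged sketch of that encode/decode round trip (column names other than
"name" are placeholders; IndexToString is used here for the decode step):

import org.apache.spark.ml.feature.{IndexToString, OneHotEncoder, StringIndexer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("decode-categories").getOrCreate()
val df = spark.createDataFrame(Seq("b", "a", "b", "c", "b", "a").map(Tuple1.apply)).toDF("name")

// StringIndexer assigns indices by frequency, so indexerModel.labels is e.g. Array("b", "a", "c").
val indexerModel = new StringIndexer().setInputCol("name").setOutputCol("nameIndex").fit(df)
val indexed = indexerModel.transform(df)
val encoded = new OneHotEncoder().setInputCol("nameIndex").setOutputCol("nameVec").transform(indexed)

// To get back to the original strings, map the index column through the learned labels.
val decoded = new IndexToString()
  .setInputCol("nameIndex").setOutputCol("originalName")
  .setLabels(indexerModel.labels)
  .transform(encoded)
decoded.show()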
Hi Tobi,
The MLlib RDD-based API does support applying the transformation to both a
single Vector and an RDD, but you did not use it in the appropriate way.
Suppose you have an RDD with a LabeledPoint in each line; you can refer to
the following code snippet to train a ChiSqSelectorModel and do the
transformation:
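The snippet itself was cut off in the digest; a hedged sketch of that
approach using the MLlib RDD-based ChiSqSelector (the small sample RDD is a
placeholder for the real data):

import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// `sc` is the SparkContext (e.g. from the spark-shell); placeholder input RDD[LabeledPoint].
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(1.0, 0.0, 3.0, 4.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 5.0, 1.0, 0.0))))

val selector = new ChiSqSelector(numTopFeatures = 2)
val model = selector.fit(data)   // ChiSqSelectorModel

// Transform the whole RDD, keeping the labels...
val transformedRdd = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
// ...or transform a single Vector.
val transformedVec = model.transform(Vectors.dense(1.0, 0.0, 3.0, 4.0))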
Could you tell us the Spark version you used?
We have fixed this bug in Spark 1.6.2 and Spark 2.0; please upgrade to one
of these versions and retry.
If the issue still exists, please let us know.
Thanks
Yanbo
2016-07-12 11:03 GMT-07:00 Pasquinell Urbani <
pasquinell.urb...@exalitica.com>:
> In the fo
Hi,
I would just do a repartition on the initial direct DStream, since otherwise
each RDD in the stream has exactly as many partitions as you have
partitions in the Kafka topic (in your case 1). That way receiving is
still done in only one thread, but at least the processing further down is
done in parallel.
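A hedged sketch of that suggestion (spark-streaming-kafka 0.8 direct API
assumed; broker, topic name, batch interval, and partition count are
placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// `sc` is the existing SparkContext.
val ssc = new StreamingContext(sc, Seconds(10))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
// The direct stream gets one RDD partition per Kafka partition (here 1)...
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))
// ...so widen it before the heavy processing.
val widened = stream.repartition(8)
widened.foreachRDD(rdd => println(rdd.map(_._2).count()))  // processing now uses 8 tasks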
For Spark 1.6.x, a DataFrame can't be directly converted to a Dataset[Row], but
it can be done indirectly as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
// assume df is a DataFrame
implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
val ds = df.as[Row]
However,
Or use mapWithState
Thank you.
Daniel
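A hedged sketch of the mapWithState suggestion above (Spark 1.6+ API;
`events` is a placeholder DStream[(String, Int)] and the running count is
just an example of state to keep):

import org.apache.spark.streaming.{State, StateSpec}

// Keep a running count per key across batches instead of recomputing over windows.
val mappingFunc = (key: String, value: Option[Int], state: State[Long]) => {
  val newCount = state.getOption().getOrElse(0L) + value.getOrElse(0)
  state.update(newCount)
  (key, newCount)
}
val counts = events.mapWithState(StateSpec.function(mappingFunc))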
> On 16 Jul 2016, at 03:02, RK Aduri wrote:
>
> You can probably define sliding windows and set larger batch intervals.