Hello,
I'm trying to write some RDD[T], where T is a protobuf message, to Parquet,
in Scala.
I am wondering what the best option is to do this, and I would be
interested in your insights.
So far, I see two possibilities:
- use the PairRDD method *saveAsNewAPIHadoopFile*, and I guess I need to
call *Parque
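For what it's worth, a minimal sketch of the saveAsNewAPIHadoopFile route, assuming the parquet-protobuf artifact is on the classpath and using a hypothetical generated message class MyProto:

import org.apache.hadoop.mapreduce.Job
import org.apache.parquet.proto.ProtoParquetOutputFormat
import org.apache.spark.rdd.RDD

def writeProtoToParquet(rdd: RDD[MyProto], path: String): Unit = {
  val job = Job.getInstance(rdd.sparkContext.hadoopConfiguration)
  // Tell the output format which protobuf class defines the Parquet schema.
  ProtoParquetOutputFormat.setProtobufClass(job, classOf[MyProto])

  rdd
    .map(msg => (null: Void, msg))  // ParquetOutputFormat only writes the value
    .saveAsNewAPIHadoopFile(
      path,
      classOf[Void],
      classOf[MyProto],
      classOf[ProtoParquetOutputFormat[MyProto]],
      job.getConfiguration)
}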
Hello,
I have a few questions related to bucketing and custom partitioning in
the DataFrame API.
I am considering bucketing to get a shuffle-free join on one side in
incremental jobs, but there is one thing that I'm not happy with.
Data is likely to grow/skew over time. At some point, I would need to
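For context, a minimal sketch of the bucketed-join setup being described; the bucket count, table and column names are made up:

// Write the large, slowly-growing side as a bucketed table once:
largeDf.write
  .bucketBy(256, "user_id")
  .sortBy("user_id")
  .format("parquet")
  .saveAsTable("events_bucketed")

// Each incremental batch is then joined on the bucketing key; the bucketed
// side can avoid its shuffle (check the plan for a missing Exchange):
val joined = spark.table("events_bucketed").join(incrementDf, "user_id")
joined.explain()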
.3, up for release soon. It exists in SQL. You can still use it in
>> SQL with `spark.sql(...)` in Python though, not hard.
>>
>> On Mon, Feb 21, 2022 at 4:01 AM David Diebold
>> wrote:
>>
>>> Hello all,
>>>
>>> I'm trying to use the sp
Hello all,
I'm trying to use the spark.sql min_by aggregation function with PySpark.
I'm relying on this distribution of Spark: spark-3.2.1-bin-hadoop3.2
I have a DataFrame with these columns:
- productId: int
- sellerId: int
- price: double
For each product, I want to get the seller who
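Sketch of the spark.sql workaround mentioned in the reply above (shown in Scala here; the spark.sql call is identical in PySpark, and "sales" is an assumed view name):

df.createOrReplaceTempView("sales")

// min_by is available as a SQL function in Spark 3.2, even where the
// DataFrame API helper is not yet exposed
val cheapest = spark.sql("""
  SELECT productId,
         min_by(sellerId, price) AS cheapest_seller,
         min(price)              AS min_price
  FROM sales
  GROUP BY productId
""")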
Hello,
In the RDD API, you must be looking for reduceByKey.
Cheers
On Fri, Jan 14, 2022 at 11:56, frakass wrote:
> Is there a RDD API which is similar to Scala's groupMapReduce?
> https://blog.genuine.com/2019/11/scalas-groupmap-and-groupmapreduce/
>
> Thank you.
>
> ---
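A small sketch of that correspondence, word-count style with made-up data:

// Scala collections: words.groupMapReduce(identity)(_ => 1)(_ + _)
// RDD equivalent: map to (key, mapped value), then reduceByKey
val words = sc.parallelize(Seq("a", "bb", "a", "ccc"))

val counts = words
  .map(w => (w, 1))
  .reduceByKey(_ + _)   // combines per key without materializing the groups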
Hello Andy,
Are you sure you want to perform lots of join operations, and not simple
unions?
Are you doing inner joins or outer joins?
Can you give us a rough idea of your list size and of each
individual dataset's size?
Having a look at the execution plan would help; maybe the high amount of j
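To illustrate the union-versus-join distinction, a sketch assuming datasets is a Seq[DataFrame] sharing a join key "id":

// Same schema, rows stacked: a cheap union, no shuffle by itself
val stacked = datasets.reduce(_ unionByName _)

// Chained joins combine column-wise on a key and grow the plan (and shuffles)
// with every join; explain() shows how many exchanges you actually get
val joined = datasets.reduce(_.join(_, Seq("id"), "inner"))
joined.explain()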
Hello all,
I was wondering if it is possible to encounter out-of-memory exceptions on
Spark executors when doing some aggregation, when a dataset is skewed.
Let's say we have a dataset with two columns:
- key: int
- value: float
And I want to aggregate values by key.
Let's say that we have a ton o
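For reference, a sketch of the aggregation being described, with the column names from the question; df and keyValueRdd are assumed inputs:

import org.apache.spark.sql.functions.sum

// groupBy + sum benefits from partial (map-side) aggregation, so a hot key
// does not need all of its values in one executor's memory at once
val aggregated = df.groupBy("key").agg(sum("value").as("total"))

// The RDD analogue with the same property is reduceByKey; groupByKey or
// collect_list, which materialize every value of a key, are what typically
// blow up under skew
val totals = keyValueRdd.reduceByKey(_ + _)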
Are you looking for
> https://spark.apache.org/docs/latest/ml-features.html#interaction ?
> That's the closest built in thing I can think of. Otherwise you can make
> custom transformations.
>
> On Fri, Oct 1, 2021, 8:44 AM David Diebold
> wrote:
>
>> Hello everyone,
Hello everyone,
In MLlib, I’m trying to rely essentially on pipelines to create features
out of the Titanic dataset, and showcase the power of feature hashing. I
want to:
- Apply bucketization on some columns (QuantileDiscretizer is fine)
- Then I want to cross all my columns
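A rough sketch of that pipeline shape, on assumed Titanic-style column names; the Interaction transformer pointed to in the reply is the built-in for crossing numeric/vector columns, while FeatureHasher covers the hashing part:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{FeatureHasher, QuantileDiscretizer}

// Bucketize continuous columns
val ageBuckets = new QuantileDiscretizer()
  .setInputCol("Age").setOutputCol("AgeBucket").setNumBuckets(5)
val fareBuckets = new QuantileDiscretizer()
  .setInputCol("Fare").setOutputCol("FareBucket").setNumBuckets(5)

// Hash the bucketized and categorical columns into a fixed-size feature vector
val hasher = new FeatureHasher()
  .setInputCols("AgeBucket", "FareBucket", "Sex", "Pclass")
  .setOutputCol("features")
  .setNumFeatures(1 << 12)

val pipeline = new Pipeline().setStages(Array(ageBuckets, fareBuckets, hasher))
val model = pipeline.fit(train)   // train is an assumed DataFrame of the Titanic data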
Hi Mich,
I don't quite understand why the driver node is using so much CPU, but it
may be unrelated to your executors being underused.
As for the executors being underused, I would first check that your job generates
enough tasks.
Then I would check the spark.executor.cores and spark.task.cpus parameters t
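For reference, a sketch of those parallelism checks; the values and the df DataFrame are placeholders:

import org.apache.spark.sql.SparkSession

// With executor.cores = 4 and task.cpus = 1, each executor runs up to 4 tasks
// at a time; fewer tasks than total slots leaves executors idle
val spark = SparkSession.builder()
  .config("spark.executor.cores", "4")
  .config("spark.task.cpus", "1")
  .getOrCreate()

// Quick ways to see whether a stage produces enough tasks
println(df.rdd.getNumPartitions)                    // tasks per stage ≈ partitions
println(spark.conf.get("spark.sql.shuffle.partitions"))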