Re: [VOTE] New Spark Connect Client Repository for Rust

2025-05-19 Thread Renjie Liu
…to handle authoring and reviewing the code. I'll support them as an existing committer and PMC member, helping with merges and releases. CC'ing @Renjie Liu & @sjrus...@gmail.com to keep me honest here and to add any details I may have missed from other discussions. …

Re: [DISCUSS] New Spark Connect Client repository for Rust language

2025-05-19 Thread Renjie Liu
Happy to see the Rust client proposal and glad to help with the follow-up repo management work. Let's use this thread for voting; posting my +1 here. Jules Damji: "I'll ask them." …

Re: [DISCUSS] New Spark Connect Client repository for Rust language

2025-05-16 Thread Renjie Liu
On May 9, 2025, at 1:53 AM, Renjie Liu wrote: Hi, All: I'd like to propose adding a new Apache Spark repository for `Spark Connect Client for Rust`. https://github.com/apache/spark-connect-…

[DISCUSS] New Spark Connect Client repository for Rust language

2025-05-09 Thread Renjie Liu
Hi, All: I'd like to propose adding a new Apache Spark repository for `Spark Connect Client for Rust`: https://github.com/apache/spark-connect-rust. There are already some efforts toward building a Spark Connect client in Rust: https://github.com/sjrusso…

Re: [VOTE] Move Variant to Parquet

2024-09-04 Thread Renjie Liu
+1 (non-binding)

Earlier in the thread:
On Thu, Sep 5, 2024 at 4:00 AM Tathagata Das wrote: +1
On Wed, Sep 4, 2024, 9:12 AM karuppayya wrote: +1
On Tue, Sep 3, 2024 at 11:15 PM L. C. Hsieh wrote: +1
On Tue, Sep 3, 2024 at 8:58 PM Chao Sun wrote: +1
On Tue, Sep …

Re: [DISCUSS] Spark Columnar Processing

2019-04-02 Thread Renjie Liu
…Batches and generates code to loop through the data in each batch as InternalRows.

Instead, we propose a new set of APIs to work on an RDD[InternalColumnarBatch] instead of abusing type erasure. With this we propose adding a Rule similar to how WholeStageCodeGen currently works. Each part of the physical SparkPlan would expose columnar support through a combination of traits and method calls. The rule would then decide when columnar processing would start and when it would end. Switching between columnar and row-based processing is not free, so the rule would make a decision based on an estimate of the cost of the transformation and the estimated speedup in processing time.

This should allow us to disable columnar support by simply disabling the rule that modifies the physical SparkPlan. It should pose minimal risk to the existing row-based code path, as that code should not be touched, and in many cases could be reused to implement the columnar version. This also allows for small, easily manageable patches -- no huge patches that no one wants to review.

As far as the memory layout is concerned, OnHeapColumnVector and OffHeapColumnVector are already very close to being Apache Arrow compatible, so shifting them over would be a relatively simple change. Alternatively, we could add a new Arrow-compatible implementation if there are reasons to keep the old ones.

Again, this is just to get the discussion started; any feedback is welcome, and we will file a SPIP once we feel the major changes we are proposing are acceptable.

Thanks,

Bobby Evans

-- Renjie Liu, Software Engineer, MVAD
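The cost-based switch the proposal describes can be sketched roughly as follows. This is an illustrative stand-in, not actual Spark internals: the names `ColumnarSupport`, `ColumnarRule`, and the speedup/cost numbers are all hypothetical, chosen only to show how a rule might compare an estimated speedup against a fixed layout-transition cost.

```java
// Hypothetical sketch of a cost-based columnar/row decision rule.
// None of these types exist in Spark; they illustrate the trait-plus-rule idea.
interface PlanNode { String name(); }

// Operators that can run columnar advertise an estimated speedup factor.
interface ColumnarSupport { double columnarSpeedup(); }

class Scan implements PlanNode, ColumnarSupport {
    public String name() { return "scan"; }
    public double columnarSpeedup() { return 3.0; } // assumed estimate
}

class Sort implements PlanNode {
    public String name() { return "sort"; } // row-only operator in this sketch
}

class ColumnarRule {
    // Switching between row and columnar layouts has a fixed estimated cost;
    // only choose columnar execution when the speedup outweighs it.
    static final double TRANSITION_COST = 1.5;

    static boolean chooseColumnar(PlanNode node) {
        return node instanceof ColumnarSupport
            && ((ColumnarSupport) node).columnarSpeedup() > TRANSITION_COST;
    }
}
```

Disabling columnar processing then reduces to disabling this one rule, leaving the row-based path untouched, which is the low-risk property the proposal emphasizes.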

Re: Guaranteed processing orders of each batch in Spark Streaming

2015-11-02 Thread Renjie Liu
Hi, all: I have given a detailed description of my proposal in this JIRA: https://issues.apache.org/jira/browse/SPARK-11308. On Mon, Oct 19, 2015 at 2:58 PM Renjie Liu wrote: Hi, all: I've read the source code and it seems that there is no guarantee that the order of pr…

Guaranteed processing orders of each batch in Spark Streaming

2015-10-18 Thread Renjie Liu
Hi, all: I've read the source code and it seems there is no guarantee of the order in which each batch's RDD is processed, since jobs are simply submitted to a thread pool. I believe this is quite important in streaming, since updates should be applied in order.
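To illustrate the general point (this is a plain JVM sketch, not Spark's actual job scheduler): tasks handed to a multi-threaded pool may complete in any order, whereas a single-threaded executor runs them strictly in submission order. The `BatchOrdering` class and its parameters are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class BatchOrdering {
    // Submit numbered "batch jobs" to a pool and record completion order.
    // With threads > 1 the recorded order is not guaranteed to match the
    // submission order; with a single worker thread it always does.
    static List<Integer> runBatches(int threads, int numBatches)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Queue<Integer> done = new ConcurrentLinkedQueue<>();
        for (int i = 0; i < numBatches; i++) {
            final int id = i;
            pool.submit(() -> done.add(id)); // each task records its batch id
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return new ArrayList<>(done);
    }
}
```

With `threads = 1` the completion order is deterministic, which is one way a scheduler could serialize batches if strict ordering were required, at the cost of all inter-batch parallelism.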