[ https://issues.apache.org/jira/browse/SPARK-50812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruifeng Zheng updated SPARK-50812:
----------------------------------
    Affects Version/s:     (was: 4.1.0)

> Support pyspark.ml on Connect
> -----------------------------
>
>                 Key: SPARK-50812
>                 URL: https://issues.apache.org/jira/browse/SPARK-50812
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Connect, ML, PySpark
>    Affects Versions: 4.0.0
>            Reporter: Ruifeng Zheng
>            Assignee: Bobby Wang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> Starting from Apache Spark 3.4, Spark has supported Spark Connect, which 
> introduced a decoupled client-server architecture that allows remote 
> connectivity to Spark clusters using the DataFrame API and unresolved 
> logical plans as the protocol. The separation between client and server 
> allows Spark and its open ecosystem to be leveraged from everywhere. It can 
> be embedded in modern data applications, in IDEs, notebooks, and programming 
> languages.
> However, Spark Connect currently only supports Spark SQL, which means Spark 
> ML cannot run training or inference via Spark Connect. This will likely 
> result in losing some ML users.
> So I would like to propose a way to support Spark ML on Connect. Users will 
> not need to change their code to run their Spark ML workloads over Connect.
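> As a minimal sketch (assuming a Connect server is already listening at 
> sc://localhost), the only user-visible difference is how the SparkSession is 
> created; the pyspark.ml calls themselves stay the same:
> {code:python}
> from pyspark.sql import SparkSession
>
> # Classic, in-process session:
> spark = SparkSession.builder.master("local[*]").getOrCreate()
>
> # Spark Connect session; existing pyspark.ml code runs unchanged against it
> # (the sc://localhost endpoint is an assumption for this sketch):
> spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
> {code}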
> Here are some links:
> Design doc: [Support spark.ml on Connect|https://docs.google.com/document/d/1EUvSZuI-so83cxb_fTVMoz0vUfAaFmqXt39yoHI-D9I/edit?usp=sharing]
> Draft PR: [https://github.com/wbo4958/spark/pull/5]
> Example code:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
> from pyspark.ml.linalg import Vectors
>
> # Connect to a Spark Connect server instead of creating an in-process session.
> spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
>
> df = spark.createDataFrame([
>     (Vectors.dense([1.0, 2.0]), 1),
>     (Vectors.dense([2.0, -1.0]), 1),
>     (Vectors.dense([-3.0, -2.0]), 0),
>     (Vectors.dense([-1.0, -2.0]), 0),
> ], schema=['features', 'label'])
>
> # The estimator API is unchanged; fit() runs against the remote cluster.
> lr = LogisticRegression()
> lr.setMaxIter(30)
> model: LogisticRegressionModel = lr.fit(df)
> assert model.getMaxIter() == 30
>
> # Model attributes, predictions, and summaries are fetched through Connect.
> x = model.predictRaw(Vectors.dense([1.0, 2.0]))
> print(f"predictRaw {x}")
> model.summary.roc.show()
> print(model.summary.weightedRecall)
> print(model.summary.recallByLabel)
> print(model.coefficients)
> print(model.intercept)
> model.transform(df).show()
> {code}
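> As a further illustration (hypothetical, and not necessarily covered by the 
> draft PR), the same pattern would extend to composite estimators such as a 
> Pipeline, reusing the DataFrame from the example above:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.feature import StandardScaler
>
> # Hypothetical sketch: fitting a Pipeline over Connect, assuming pipelines
> # are in scope for the proposal. 'df' is the DataFrame created above.
> scaler = StandardScaler(inputCol="features", outputCol="scaled")
> lr = LogisticRegression(featuresCol="scaled", maxIter=30)
> pipeline_model = Pipeline(stages=[scaler, lr]).fit(df)
> pipeline_model.transform(df).show()
> {code}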



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
