[ https://issues.apache.org/jira/browse/SPARK-50812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ruifeng Zheng updated SPARK-50812: ---------------------------------- Affects Version/s: (was: 4.1.0) > Support pyspark.ml on Connect > ----------------------------- > > Key: SPARK-50812 > URL: https://issues.apache.org/jira/browse/SPARK-50812 > Project: Spark > Issue Type: Umbrella > Components: Connect, ML, PySpark > Affects Versions: 4.0.0 > Reporter: Ruifeng Zheng > Assignee: Bobby Wang > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Starting from Apache Spark 3.4, Spark has supported Connect which introduced > a decoupled client-server architecture that allows remote connectivity to > Spark clusters using the DataFrame API and unresolved logical plans as the > protocol. The separation between client and server allows Spark and its open > ecosystem to be leveraged from everywhere. It can be embedded in modern data > applications, in IDEs, Notebooks and programming languages. > However, Spark Connect currently only supports Spark SQL, which means Spark > ML could not run the training/inference via Spark Connect. It will probably > result in losing some ML users. > So I would like to propose a way to support Spark ML on the Connect. Users > don't need to change their code to leverage connect to run Spark ML cases. > Here are some links, > Design doc: [Support spark.ml on > Connect|https://docs.google.com/document/d/1EUvSZuI-so83cxb_fTVMoz0vUfAaFmqXt39yoHI-D9I/edit?usp=sharing] > > Draft PR: [https://github.com/wbo4958/spark/pull/5] > Example code, > {code:python} > spark = SparkSession.builder.remote("sc://localhost").getOrCreate() > df = spark.createDataFrame([ > (Vectors.dense([1.0, 2.0]), 1), > (Vectors.dense([2.0, -1.0]), 1), > (Vectors.dense([-3.0, -2.0]), 0), > (Vectors.dense([-1.0, -2.0]), 0), > ], schema=['features', 'label']) > lr = LogisticRegression() > lr.setMaxIter(30) > model: LogisticRegressionModel = lr.fit(df) > z = model.summary > x = model.predictRaw(Vectors.dense([1.0, 2.0])) > print(f"predictRaw {x}") > assert model.getMaxIter() == 30 > model.summary.roc.show() > print(model.summary.weightedRecall) > print(model.summary.recallByLabel) > print(model.coefficients) > print(model.intercept) > model.transform(df).show() > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org