The data transformation works the same way. And linear regression itself is easy: https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression These are components that operate on DataFrames.
You'll want to look at VectorAssembler to prepare the data into a single vector column. There are other transformations you may want, like normalization, also among the Spark ML components. You can chain those steps together into a Pipeline and fit and transform with it as one unit.

On Wed, Jul 20, 2022 at 3:04 AM Edgar H <kaotix...@gmail.com> wrote:
> Morning everyone,
>
> The question may seem too broad, but I'll try to condense it as much as possible:
>
> I'm used to working with Spark SQL, DFs and such on a daily basis, easily
> grouping, getting extra counters and using functions or UDFs. However, I've
> come to a scenario where I need to make some predictions, and linear
> regression is the way to go.
>
> However, lurking through the docs, this belongs to the ML side of Spark,
> and I've never been in there before...
>
> What is working with Spark ML like compared to what I'm used to? Training
> models, building a new one, adding more columns and such... Is there even
> a change, or am I just confused and it's pretty easy?
>
> When deploying ML pipelines, is there anything to take into account
> compared to the usual ones with Spark SQL and such?
>
> And... is it even possible to do linear regression (or any other ML
> method) inside a traditional pipeline without training or any other
> ML-related aspects?
>
> Some guidelines (or articles, references to the docs) would be helpful to
> get started, if possible.
>
> Thanks!