Thanks Alessandro! That did the trick. All of the indices and interactions are in the metadata. I also wanted to confirm that this solution works in PySpark, as the metadata is carried over.
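For reference, a minimal sketch of the same lookup in PySpark (untested, and it assumes the "features" column and the "ml_attr" metadata layout produced by the RFormula code further down this thread):

# Build an {index -> name} lookup from the features column metadata.
# "attrs" maps attribute groups ("numeric", "binary", ...) to lists of
# {"idx": ..., "name": ...} dicts; flatten them into one table.
attrs = rform_regression_input.schema["features"].metadata["ml_attr"]["attrs"]
lookup = {a["idx"]: a["name"] for group in attrs.values() for a in group}

# Pair each fitted coefficient with its original column name.
named_coefs = {lookup[i]: c for i, c in enumerate(lr_model.coefficients)}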
Andrew

On Tue, Oct 29, 2019 at 5:26 AM Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Hello Andrew,
> a few years ago I had the same need and I found this SO answer
> <https://stackoverflow.com/a/36306784/898154> to be the way to go.
>
> Here is an extract of my (Scala) code (which was doing other things on
> top). I have removed the irrelevant parts, but without testing it, so it
> might not work out of the box; nonetheless it should help you get started:
>
>> import org.apache.spark.sql.DataFrame
>>
>> /* Builds a lookup table from vector index to original column name,
>>    read from the ML attribute metadata of the features column. */
>> private def getEncodedVectorLookupTable(df: DataFrame,
>>     featuresColName: String): Map[Long, String] = {
>>   val meta = df.select(featuresColName)
>>     .schema.fields.head.metadata
>>     .getMetadata("ml_attr")
>>     .getMetadata("attrs")
>>
>>   /* REFLECTION START: list the attribute group names ("numeric",
>>      "binary", ...), since Metadata does not expose its keys. */
>>   val field = meta.getClass.getDeclaredField("map")
>>   field.setAccessible(true)
>>   val keys = field.get(meta).asInstanceOf[Map[String, Any]].keySet
>>   field.setAccessible(false)
>>   /* REFLECTION END */
>>
>>   keys.flatMap(
>>     meta.getMetadataArray(_)
>>       .map(m => m.getLong("idx") -> m.getString("name"))
>>   ).toMap
>> }
>
> It looks like there is some support now for achieving this, but I have
> never tried it:
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/r/RWrapperUtils.html
>
> Best regards,
> Alessandro
>
> On Mon, 28 Oct 2019 at 21:01, Andrew Redd <andrewwr...@gmail.com> wrote:
>
>> Hi All!
>>
>> I'm performing an econometric analysis over several billion rows of data
>> and would like to use the PySpark SparkML implementation of linear
>> regression. In the example below I'm trying to interact hour-of-day and
>> month-of-year indicators. The StringIndexer documentation tells you what
>> it does when it one-hot encodes string/factor columns (i.e. dropping the
>> most/least common value, or the first/last when sorted alphabetically),
>> but it doesn't let you recover your coefficient names. This feels like
>> such a general case that I must be missing something. How can I get my
>> column names back after the regression to map them to the coefficient
>> values? Do I need to basically rebuild the RFormula logic if this isn't
>> already implemented? I would be happy to use a different Spark language
>> (Scala/Java etc.) if it is implemented there.
>>
>> Thanks in advance
>>
>> Andrew
>>
>> rform = RFormula(formula="log_outcome ~ log_treatment + hour_of_day + month_of_year + hour_of_day:month_of_year + additional_column",
>>                  featuresCol="features",
>>                  labelCol="label")
>>
>> rform_regression_input = (
>>     rform.fit(regression_input).transform(regression_input))
>>
>> lr = LinearRegression(featuresCol="features",
>>                       labelCol="label",
>>                       solver="normal")
>>
>> lr_model = lr.fit(rform_regression_input)
>> coefs = [*lr_model.coefficients, lr_model.intercept]
>>
>> return pd.DataFrame(
>>     {"pvalues": lr_model.summary.pValues,
>>      "tvalues": lr_model.summary.tValues,
>>      "std_errs": lr_model.summary.coefficientStandardErrors,
>>      "coefs": coefs})