Hi, I am currently using PySpark's `GeneralizedLinearRegression` and have a question about including a categorical:categorical interaction in my model. I have tried searching Stack Overflow and the existing mail archives, but with no luck. Let me know if I have missed anything 😊.
My question boils down to this: is the `Interaction` functionality between categorical variables usable in combination with `GeneralizedLinearRegression` (and, more generally, with functions that use IWLS)? I am unsure whether the design matrix generated from the feature column takes collinearity into account. Please see the code and further comments below.

The following code uses the `RFormula` transformation to visualize the model matrix:

```
from pyspark.sql import SparkSession
from pyspark.ml.feature import RFormula
from pyspark.sql.functions import lit

# Reuse the active session if there is one (e.g. in the pyspark shell)
spark = SparkSession.builder.getOrCreate()

# Create sample data with all combinations
data = [("N", "N"), ("Y", "N"), ("N", "Y"), ("Y", "Y")]
df = spark.createDataFrame(data, ["x1", "x2"])
df = df.withColumn("y", lit(1.0))

formula = RFormula(formula="y ~ x1+x2+x1:x2", featuresCol="features")
transformed = formula.fit(df).transform(df)

# Display
transformed.select("x1", "x2", "features").show(truncate=False)
```

Returning:

+---+---+-------------------------+
|x1 |x2 |features                 |
+---+---+-------------------------+
|N  |N  |[1.0,1.0,1.0,0.0,0.0,0.0]|
|Y  |N  |(6,[1,4],[1.0,1.0])      |
|N  |Y  |(6,[0,3],[1.0,1.0])      |
|Y  |Y  |(6,[5],[1.0])            |
+---+---+-------------------------+

This produces a 4x6 design matrix (4x7 with the intercept), which has two problems:

Perfect multicollinearity - The interaction columns are perfectly determined by the main effects. For example, the (N,N) interaction column is 1 exactly where x1="N" AND x2="N". In addition, the four interaction columns sum to 1 in every row, so together they duplicate the intercept.

Rank deficiency - The design matrix is rank deficient. A two-factor interaction model with two levels per factor should need only 4 parameters (intercept + 3 coefficients), but PySpark creates 6 feature columns (7 parameters including the intercept). This causes convergence problems when using the IWLS algorithm.
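As a quick sanity check on the rank claim above, here is a minimal sketch that rebuilds the matrix by hand from the `features` output, prepends an intercept column, and computes the rank with exact Gaussian elimination. It uses only the Python standard library and does not call Spark; the helper `matrix_rank` and all variable names are my own and are not part of any Spark or patsy API.

```python
from fractions import Fraction

# Dense form of the `features` column printed above, one row per (x1, x2) cell.
# Columns 0-1 are the main effects, columns 2-5 are the interaction terms.
features = [
    [1, 1, 1, 0, 0, 0],  # x1=N, x2=N
    [0, 1, 0, 0, 1, 0],  # x1=Y, x2=N
    [1, 0, 0, 1, 0, 0],  # x1=N, x2=Y
    [0, 0, 0, 0, 0, 1],  # x1=Y, x2=Y
]

# Prepend the intercept column that the GLM adds when fitIntercept is true.
design = [[1] + row for row in features]


def matrix_rank(rows):
    """Rank via Gauss-Jordan elimination on exact fractions (hypothetical helper)."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == len(m):
            break
    return rank


n_columns = len(design[0])
rank = matrix_rank(design)
# Row sums over the four interaction columns only.
interaction_sums = [sum(row[2:]) for row in features]

print("columns:", n_columns, "rank:", rank)
print("interaction columns sum per row:", interaction_sums)
```

The design has 7 columns but rank 4, so three of the seven coefficients cannot be identified from this data, and the interaction columns sum to 1 in every row, which reproduces the intercept column.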
--------------------

If we look at the same example using patsy's `dmatrix` implementation:

```
import pandas as pd
from patsy import dmatrix

data = pd.DataFrame({
    'x1': ['N', 'Y', 'N', 'Y'],
    'x2': ['N', 'N', 'Y', 'Y'],
    'y': [1.0, 1.0, 1.0, 1.0]
})

# Treatment (reference-level) coding: 'N' is the baseline for both factors
formula_matrix = dmatrix("x1 + x2 + x1:x2", data, return_type="dataframe")
print(formula_matrix)
```

Patsy formula matrix:

   Intercept  x1[T.Y]  x2[T.Y]  x1[T.Y]:x2[T.Y]
0        1.0      0.0      0.0              0.0
1        1.0      1.0      0.0              0.0
2        1.0      0.0      1.0              0.0
3        1.0      1.0      1.0              1.0

Here we get the correct model matrix setup: 4 columns for 4 parameters.

Best Regards
Emil Hofman | Actuary
Alm. Brand Group