Hi,

I am currently using PySpark's `GeneralizedLinearRegression`, and I have a 
question about including a categorical:categorical interaction in my 
model.
I have tried searching Stack Overflow and the existing mailing list archives, 
but with no luck. Let me know if I have missed anything 😊.

My question boils down to this: is the `Interaction` functionality between 
categorical variables usable in combination with `GeneralizedLinearRegression` 
(and, more generally, with functions that use IWLS)? I am unsure whether the 
design matrix generated from the feature column accounts for collinearity.

Please see code and further comments below:

The following code utilizes the RFormula transformation to visualize the model 
matrix:
```
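from pyspark.ml.feature import RFormula
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Reuse the active session (or create one when run as a standalone script)
spark = SparkSession.builder.getOrCreate()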

# Create sample data with all combinations
data = [("N", "N"), ("Y", "N"), ("N", "Y"), ("Y", "Y")]
df = spark.createDataFrame(data, ["x1", "x2"])
df = df.withColumn("y", lit(1.0))

formula = RFormula(formula="y ~ x1+x2+x1:x2", featuresCol="features")
transformed = formula.fit(df).transform(df)

# Display
transformed.select("x1", "x2", "features").show(truncate=False)

```
This returns:
+---+---+-------------------------+
|x1 |x2 |features                 |
+---+---+-------------------------+
|N  |N  |[1.0,1.0,1.0,0.0,0.0,0.0]|
|Y  |N  |(6,[1,4],[1.0,1.0])      |
|N  |Y  |(6,[0,3],[1.0,1.0])      |
|Y  |Y  |(6,[5],[1.0])            |
+---+---+-------------------------+

This produces a 4x6 design matrix (4x7 with the intercept), which has two 
problems:
Perfect multicollinearity - The four interaction columns one-hot encode every 
(x1, x2) combination, so the main-effect columns and the intercept are exact 
linear combinations of them. For example, the x1="N" column is the sum of the 
(N,N) and (N,Y) interaction columns.
Rank deficiency - The design matrix is rank deficient. A two-factor 
interaction model with two levels per factor needs only 4 parameters 
(intercept + 3 coefficients), but PySpark creates 6 (7 with the intercept).

This causes convergence problems when using the IWLS algorithm.
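
To make the rank deficiency concrete, here is a minimal NumPy sketch. It 
assumes the sparse rows printed above are expanded to dense form and that the 
intercept column is prepended by hand; the variable names are only 
illustrative and are not part of any Spark API.

```python
import numpy as np

# Dense form of the "features" column, rows in the order shown above:
# (N,N), (Y,N), (N,Y), (Y,Y).
spark_features = np.array([
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
], dtype=float)

# Assumption: the intercept is added as an explicit column of ones.
design = np.hstack([np.ones((4, 1)), spark_features])

n_columns = design.shape[1]
rank = int(np.linalg.matrix_rank(design))

# The first main-effect column equals the sum of two interaction columns.
main_effect_is_redundant = np.array_equal(
    spark_features[:, 0], spark_features[:, 2] + spark_features[:, 3]
)

print(f"columns={n_columns}, rank={rank}, redundant={main_effect_is_redundant}")
```

The rank is lower than the number of columns, so the weighted normal 
equations that IWLS solves in each iteration are singular for this matrix.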

--------------------
If we look at the same example using patsy's `dmatrix` implementation:

```
import pandas as pd
from patsy import dmatrix

data = pd.DataFrame({
    'x1': ['N', 'Y', 'N', 'Y'],
    'x2': ['N', 'N', 'Y', 'Y'],
    'y': [1.0, 1.0, 1.0, 1.0]
})

formula_matrix = dmatrix("x1 + x2 + x1:x2", data, return_type="dataframe")
print(formula_matrix)


```
Patsy formula matrix:
   Intercept  x1[T.Y]  x2[T.Y]  x1[T.Y]:x2[T.Y]
0        1.0      0.0      0.0              0.0
1        1.0      1.0      0.0              0.0
2        1.0      0.0      1.0              0.0
3        1.0      1.0      1.0              1.0


Here we get the correct model matrix setup: 4 columns for 4 parameters.
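
As a quick check of the patsy result, the sketch below copies the matrix 
printed above into NumPy by hand (so it does not need patsy installed) and 
verifies that it has full column rank. The variable names are illustrative 
assumptions.

```python
import numpy as np

# Columns: Intercept, x1[T.Y], x2[T.Y], x1[T.Y]:x2[T.Y]
patsy_design = np.array([
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
], dtype=float)

patsy_rank = int(np.linalg.matrix_rank(patsy_design))
full_column_rank = patsy_rank == patsy_design.shape[1]

# With full column rank, X^T X is invertible, so the least-squares
# coefficients are uniquely determined.
gram_det = float(np.linalg.det(patsy_design.T @ patsy_design))

print(f"rank={patsy_rank}, full_column_rank={full_column_rank}")
```

With treatment (reference-level) coding, every column carries information 
that the others do not, which is what an IWLS solver needs.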

Best regards,

Emil Hofman | Actuary