RE: Ask about Pyspark ML interaction

Du, Yi Mon, 09 Nov 2020 11:51:21 -0800

Do you mean I need to index them, one-hot encode them, and then interact them?

I tried both orders:

Index -> interact -> one-hot encode: this gave me 25 combinations.

Index -> one-hot encode -> interact: this gave me 16 combinations.

Neither gave me the expected 24 combinations. Did I miss something?
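
As a quick check on those counts, the arithmetic can be sketched in plain
Python. This assumes each variable has 5 levels (as stated in the original
question quoted further down) and that one-hot encoding drops one level per
variable, which is the default `dropLast=True` of `OneHotEncoder`. The variable
names are illustrative only.

```python
n_fico, n_ltv = 5, 5  # number of levels per variable (from the question)

# Crossing all levels with nothing dropped: every (fico, ltv) cell gets a column.
full_cross = n_fico * n_ltv

# One-hot encode each variable with one level dropped, then interact:
# the interaction is the cross-product of the two reduced vectors.
interaction_only = (n_fico - 1) * (n_ltv - 1)

# Main-effect dummies for each variable, also with one level dropped.
main_effects = (n_fico - 1) + (n_ltv - 1)

print(full_cross)                       # 25
print(interaction_only)                 # 16
print(main_effects + interaction_only)  # 24
```

One reading, not confirmed in this thread, is that 24 terms would only appear
if the 8 one-hot encoded main-effect columns are assembled into the feature
vector alongside the 16 interaction columns.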

Thanks,

From: Sean Owen [mailto:sro...@gmail.com]
Sent: Monday, November 9, 2020 9:58 AM
To: Du, Yi <y...@archcapservices.com>
Cc: user@spark.apache.org
Subject: Re: Ask about Pyspark ML interaction

I think you have this flipped around - you want to one-hot encode, then compute 
interactions. As it is you are treating the product of {0,1,2,3,4} x 
{0,1,2,3,4} as if it's a categorical index. That doesn't have nearly 25 
possible values and probably is not what you intend.
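
To make the point about the product concrete, here is a small plain-Python
sketch (no Spark involved, names are illustrative) that multiplies every pair
of indices from {0,1,2,3,4} and counts the distinct results. Many pairs
collapse to the same product, so the product cannot act as a categorical index
for the 25 combinations.

```python
from itertools import product

# Indices that a 5-level categorical column would receive: 0..4.
levels = range(5)

# Multiply every (fico index, ltv index) pair, as a numeric interaction would.
products = [a * b for a, b in product(levels, levels)]

distinct = sorted(set(products))
print(len(products))   # 25 pairs
print(distinct)        # [0, 1, 2, 3, 4, 6, 8, 9, 12, 16]
print(len(distinct))   # 10 distinct values
```

For instance, every pair containing index 0 yields 0, and the pairs (1, 4),
(2, 2) and (4, 1) all yield 4.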

On Mon, Nov 9, 2020 at 7:53 AM Du, Yi <y...@archcapservices.com> wrote:
Hi,

How are you doing?

Please allow me to introduce myself. I am Yi Du, and I work at a mortgage 
insurance company called ‘Arch Capital Group’, in its Washington DC office in 
the US. I found your profile under the Spark repo on GitHub and would like to 
ask you about one particular coding issue in Spark ML. I have read the Spark 
documentation and also asked on Stack Overflow, but I still have no clue.

I am using Pyspark and using ML to build models. I have categorical variables 
as predictors and would like to have interactions between two categorical 
variables in the model as well.

I was trying to follow the example here: 
https://spark.apache.org/docs/latest/ml-features#interaction
to create the interaction between two categorical variables.

Here is my snippet of code:

```python
from pyspark.ml.feature import Interaction, OneHotEncoder, StringIndexer

# Map each string category to a numeric index (trs_data is the input DataFrame).
stringIndexer = StringIndexer(
    inputCols=['fico_group', 'ltv_group'],
    outputCols=['fico_groupIndex1', 'ltv_groupIndex1'],
    stringOrderType='frequencyAsc')
trs_data_index = stringIndexer.fit(trs_data).transform(trs_data)

# Interact the two indexed columns.
interaction = Interaction(
    inputCols=['fico_groupIndex1', 'ltv_groupIndex1'],
    outputCol="interactedCol")
trs_data_interacted_temp = interaction.transform(trs_data_index)

# One-hot encode the interacted column.
encoder = OneHotEncoder(
    inputCols=['interactedCol'],
    outputCols=['interactedColVec'])
trs_data_interacted = encoder.fit(
    trs_data_interacted_temp).transform(trs_data_interacted_temp)
```

I first index ‘fico_group’ and ‘ltv_group’, then interact the two indexed 
columns, and finally use OneHotEncoder to create the column ‘interactedColVec’ 
for use in the model.

However, the final results did not come out as expected. My ‘fico_group’ has 5 
levels and so does ‘ltv_group’, so there are 5*5 = 25 combinations. In the 
model estimates, one level should be treated as the base, so I expected to see 
25-1 = 24 interactions in the final estimates. With the above code, however, I 
get 25 interactions in the model estimates.
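
The expected count of 24 corresponds to treating each (fico_group, ltv_group) 
pair as one level of a single combined categorical variable and dropping one 
base level. A minimal plain-Python sketch of that encoding follows. The level 
labels are made up for illustration, and this is neither Spark code nor a 
confirmed fix for the issue.

```python
from itertools import product

# Hypothetical level labels; the real data's labels are not shown in the thread.
fico_levels = [f"fico_{i}" for i in range(5)]
ltv_levels = [f"ltv_{i}" for i in range(5)]

# Index each of the 25 (fico, ltv) combinations as one combined category.
combined_index = {
    pair: i for i, pair in enumerate(product(fico_levels, ltv_levels))
}

def encode(fico, ltv):
    """One-hot encode the combined category, dropping the last level as the base."""
    n = len(combined_index)
    vec = [0] * (n - 1)
    idx = combined_index[(fico, ltv)]
    if idx < n - 1:  # the last combination is the base: all zeros
        vec[idx] = 1
    return vec

print(len(encode("fico_0", "ltv_0")))  # 24 columns
print(sum(encode("fico_4", "ltv_4")))  # 0: the base combination
```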

This is my post on Stack Overflow: 
https://stackoverflow.com/questions/64602060/add-interaction-term-to-ml

I am not sure whether I have articulated my question clearly, but I would 
really appreciate your help if possible, or a pointer to the person who knows 
this.

Again, thank you very much for your help.

Best,
Yi



________________________________

The information contained in this e-mail message may be privileged and 
confidential information and is intended only for the use of the individual 
and/or entity identified in the alias address of this message. If the reader of 
this message is not the intended recipient, or an employee or agent responsible 
to deliver it to the intended recipient, you are hereby requested not to 
distribute or copy this communication. If you have received this communication 
in error, please notify us immediately by telephone or return e-mail and delete 
the original message from your system.
