I think you have this flipped around: you want to one-hot encode first, then compute interactions. As it is, you are multiplying the two indices, each in {0,1,2,3,4}, and treating the product as if it were a categorical index. That product has only 10 distinct values (0, 1, 2, 3, 4, 6, 8, 9, 12, 16), not 25, so it is probably not what you intend.
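
To make the point concrete, here is a small plain-Python sketch (not Spark code). The helpers `one_hot` and `interact` are made up for illustration and only mimic the idea of one-hot encoding and of taking all pairwise products of two vectors; they are not Spark APIs.

```python
from itertools import product

levels = range(5)  # StringIndexer-style indices 0..4 for a 5-level category

# Interacting the raw indices multiplies them, so many pairs collapse
# onto the same number (e.g. 1*4 == 2*2, and anything times 0 is 0).
index_products = {a * b for a, b in product(levels, levels)}


def one_hot(index, size):
    """Return a dense one-hot list with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def interact(u, v):
    """Products of every pair (u_i, v_j), flattened into one list."""
    return [x * y for x in u for y in v]


# One-hot first, then interact: each (a, b) pair lights up its own slot.
encoded = {
    (a, b): tuple(interact(one_hot(a, 5), one_hot(b, 5)))
    for a, b in product(levels, levels)
}

print(sorted(index_products))      # distinct products of the raw indices
print(len(set(encoded.values())))  # distinct interaction vectors
```

The first print shows the 10 distinct index products and the second shows 25 distinct interaction vectors, one per combination. In Spark terms the analogous ordering would be StringIndexer, then OneHotEncoder, then Interaction on the encoded vector columns. Note that OneHotEncoder drops the last category by default (`dropLast=True`), so with that setting the interaction of two 5-level variables would have 4 x 4 = 16 entries rather than 25.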

On Mon, Nov 9, 2020 at 7:53 AM Du, Yi <y...@archcapservices.com> wrote:

> Hi,
>
> How are you doing?
>
> Please let me first introduce myself. I am Yi Du, working at a mortgage
> insurance company called ‘Arch Capital Group’, based in the Washington, DC
> office in the US. I found your profile under the Spark repo on GitHub and
> would like to ask you about one particular coding issue in Spark ML. I
> tried to read the Spark documentation and also asked on Stack Overflow,
> but I still have no clue.
>
> I am using PySpark and using ML to build models. I have categorical
> variables as predictors and would like to have interactions between two
> categorical variables in the model as well.
>
> I was trying to follow the example here:
> https://spark.apache.org/docs/latest/ml-features#interaction
> to create the interaction between two categorical variables.
>
> Here is my snippet of code:
>
> ```python
> stringIndexer = StringIndexer(inputCols=['fico_group','ltv_group'],
>                               outputCols=['fico_groupIndex1','ltv_groupIndex1'],
>                               stringOrderType='frequencyAsc')
> trs_data_index = stringIndexer.fit(trs_data).transform(trs_data)
>
> interaction = Interaction(inputCols=['fico_groupIndex1','ltv_groupIndex1'],
>                           outputCol="interactedCol")
> trs_data_interacted_temp = interaction.transform(trs_data_index)
>
> encoder = OneHotEncoder(inputCols=['interactedCol'],
>                         outputCols=['interactedColVec'])
> trs_data_interacted = encoder.fit(trs_data_interacted_temp).transform(trs_data_interacted_temp)
> ```
>
> I basically index ‘fico_group’ and ‘ltv_group’ first, interact them, and
> then use OneHotEncoder to create the final column ‘interactedColVec’ for
> use.
>
> However, the final results didn’t come out as expected. My ‘fico_group’
> has 5 levels and so does ‘ltv_group’, so there are 5*5 = 25 combinations.
> But in the model estimates, one level should be treated as the base, so I
> expected to see 25-1 = 24 interactions in the final estimates. However,
> by using the above code, I have 25 interactions in the model estimates.
>
> This is my post on Stack Overflow:
> https://stackoverflow.com/questions/64602060/add-interaction-term-to-ml
>
> I don’t know if I articulated my question clearly. But I would really
> appreciate your help if possible, or if you could direct me to the person
> who knows this.
>
> Again, thank you very much for your help.
>
> Best,
> Yi