Hello,

I don't understand an error message I am getting. Basically, all I am doing is:

- dfAgg = df.groupBy("S_ID") (followed by a count, which shows up as infectedStreet in the plan below)
- dfRes = df.join(dfAgg, Seq("S_ID"), "left_outer")
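For concreteness, the two steps above can be sketched as a small self-contained program. The object name JoinSketch, the helper joinWithCounts, the toy data and the local SparkSession are assumptions made for illustration only, and the aggregation is written as a plain count to match the count(1) AS infectedStreet shown in the plan. The real df has many more columns, so this toy version may well run without reproducing the error.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{count, lit}

object JoinSketch {
  // Hypothetical helper: count rows per S_ID, then left-join the counts back onto df.
  def joinWithCounts(df: DataFrame): DataFrame = {
    val dfAgg = df.groupBy("S_ID").agg(count(lit(1)).as("infectedStreet"))
    df.join(dfAgg, Seq("S_ID"), "left_outer")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for df; only the S_ID column matters for the join.
    val df = Seq((1L, 10L), (2L, 10L), (3L, 20L)).toDF("ID", "S_ID")

    val dfRes = joinWithCounts(df)
    dfRes.explain(true) // prints the logical and physical plans for inspection
    dfRes.show()

    spark.stop()
  }
}
```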
However, I get this AnalysisException:

"
Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) S_ID#1903L missing from
Dummy_ID#740,sex#37L,PERSONAL_STATUS#726L,W_DEP_CODE#736,W_SIZE#739L,POSTAL_CODE#735,COUNTRY_CODE#730,ID#724L,Dummy_ID_1#741,DEP_CODE#729,HOUSEHOLD_TYPE#733L,HOUSEHOLD_SIZE#734L,AGE#727L,W_ID#738L,H_ID#732L,AGE_TYPE#728,S_ID#57L,NATIONALITY#731
in operator !Project [ID#724L, sex#37L, PERSONAL_STATUS#726L, AGE#727L, AGE_TYPE#728, DEP_CODE#729, COUNTRY_CODE#730, NATIONALITY#731 AS Nationality#77, H_ID#732L, HOUSEHOLD_TYPE#733L, HOUSEHOLD_SIZE#734L, POSTAL_CODE#735, W_DEP_CODE#736, S_ID#1903L, W_ID#738L, W_SIZE#739L, Dummy_ID#740, Dummy_ID_1#741];;
"

What I don't understand is that it says S_ID#1903L is missing, yet everything seems fine in the logical plan:

+- Join LeftOuter, (S_ID#57L = S_ID#1903L)
   :- Project [W_ID#14L, H_ID#8L, ID#0L, sex#37L, category#97L, AGE#3L, AGE_TYPE#4, DEP_CODE#5, COUNTRY_CODE#6, Nationality#77, HOUSEHOLD_TYPE#9L, familySize#117L, POSTAL_CODE#11, W_DEP_CODE#12, S_ID#57L, workplaceSize#137L, Dummy_ID#16, Dummy_ID_1#17, Inc_period#157, Time_inf#1064, Time_inc#200, Health#1014, Inf_period#1039, infectedFamily#1355L, infectedWorker#1385L]
   +- Aggregate [S_ID#1903L], [S_ID#1903L, count(1) AS infectedStreet#1415L]

Does anyone have a clue about this?

Thanks,