Hi everyone,

I have a use case involving two DataFrames, as shown below:
1) The first DataFrame (DF1) contains:

ANIMALS
Mammals
Birds
Fish
Reptiles
Amphibians

2) The second DataFrame (DF2) contains:

ID, Mammals, Birds, Fish, Reptiles, Amphibians
1, Dogs, Eagle, Goldfish, NULL, Frog
2, Cats, Peacock, Guppy, Turtle, Salamander
3, Dolphins, Eagle, Zander, NULL, Frog
4, Whales, Parrot, Guppy, Snake, Frog
5, Horses, Owl, Guppy, Snake, Frog
6, Dolphins, Kingfisher, Zander, Turtle, Frog
7, Dogs, Sparrow, Goldfish, NULL, Salamander

Now I want to take each row from DF1 and find its distinct count in DF2. For example, pick Mammals from DF1, then compute count(distinct(Mammals)) over DF2, which is 5 for the sample above.

DF1 has 70 distinct rows (animal types).
DF2 has millions of rows.

What is the best way to do this efficiently using parallelism?

Any input is helpful.

Thank you.

Regards,
Ajay
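
P.S. To make the expected result concrete, here is a minimal plain-Python sketch of the computation on the seven sample rows above. It is only an illustration of the output I am after, not a distributed solution. The names `distinct_counts`, `DF2_ROWS`, and `DF1_ANIMALS` are made up for this example, and I am assuming NULL values should be ignored, as SQL's COUNT(DISTINCT ...) does.

```python
# Sample rows from DF2; None stands in for NULL.
COLUMNS = ["ID", "Mammals", "Birds", "Fish", "Reptiles", "Amphibians"]
DF2_ROWS = [
    (1, "Dogs", "Eagle", "Goldfish", None, "Frog"),
    (2, "Cats", "Peacock", "Guppy", "Turtle", "Salamander"),
    (3, "Dolphins", "Eagle", "Zander", None, "Frog"),
    (4, "Whales", "Parrot", "Guppy", "Snake", "Frog"),
    (5, "Horses", "Owl", "Guppy", "Snake", "Frog"),
    (6, "Dolphins", "Kingfisher", "Zander", "Turtle", "Frog"),
    (7, "Dogs", "Sparrow", "Goldfish", None, "Salamander"),
]

# Values of the ANIMALS column in DF1 (hypothetical variable name).
DF1_ANIMALS = ["Mammals", "Birds", "Fish", "Reptiles", "Amphibians"]


def distinct_counts(rows, columns, wanted):
    """Count distinct non-NULL values per wanted column in one pass over rows."""
    index = {name: i for i, name in enumerate(columns)}
    seen = {name: set() for name in wanted}
    for row in rows:
        for name in wanted:
            value = row[index[name]]
            if value is not None:  # assumption: NULLs are not counted
                seen[name].add(value)
    return {name: len(values) for name, values in seen.items()}


result = distinct_counts(DF2_ROWS, COLUMNS, DF1_ANIMALS)
for animal in DF1_ANIMALS:
    print(animal, result[animal])
```

The point of the sketch is that all the columns named in DF1 are counted in a single pass over DF2, rather than scanning DF2 once per animal type. That is the shape I would hope to keep when this runs in parallel.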