Thanks guys. @Filipp Zhinkin Yes, we may have a couple of string columns with 15 million+ unique values that need to be mapped to indices.
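If I follow your hash-code suggestion correctly, it would look roughly like this (just a sketch; the column name "city" and the bucket count are made up, and df is assumed to be an existing DataFrame):

    import org.apache.spark.sql.functions.{col, hash, lit, pmod}

    // Map each string to a bucket via its Murmur3 hash instead of a fitted
    // label-to-index table, so nothing has to be held on the driver.
    // Collisions are possible; a larger bucket count makes them rarer.
    val indexed = df.withColumn(
      "city_idx",
      pmod(hash(col("city")), lit(1 << 20))
    )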
@Nick Pentreath We are on 2.0.2, though I will check it out. Is it better from a hashing-collision perspective, or can it handle large volumes of data as well? (I've put a rough sketch of what I understand it to look like at the bottom of this mail.)

Regards,
Shahab

On Tue, Apr 10, 2018 at 10:05 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Also check out FeatureHasher in Spark 2.3.0, which is designed to handle
> this use case in a more natural way than HashingTF (and handles multiple
> columns at once).
>
> On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin <filipp.zhin...@gmail.com> wrote:
>
>> Hi Shahab,
>>
>> do you actually need a few columns with such a huge number of
>> categories whose values depend on the original values' frequencies?
>>
>> If not, then you may use a value's hash code as its category, or combine
>> all columns into a single vector using HashingTF.
>>
>> Regards,
>> Filipp.
>>
>> On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
>>
>> > Does StringIndexer keep all the mapped label-to-index pairs in the
>> > memory of the driver machine? It seems to, unless I am missing something.
>> >
>> > What if the data that needs to be indexed is huge, the columns to be
>> > indexed have high cardinality (lots of categories), and more than one
>> > such column needs to be indexed? Then it wouldn't fit in memory.
>> >
>> > Thanks.
>> >
>> > Regards,
>> > Shahab
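P.S. Here is roughly what I understand the FeatureHasher suggestion to look like on 2.3.0 (a sketch only; the column names and feature count are made up, and df is again an assumed DataFrame):

    import org.apache.spark.ml.feature.FeatureHasher

    // Hash several high-cardinality string columns straight into one sparse
    // feature vector; no label-to-index mapping is kept on the driver.
    val hasher = new FeatureHasher()
      .setInputCols("city", "device_id")
      .setOutputCol("features")
      .setNumFeatures(1 << 22) // larger => fewer collisions, wider vectors

    val hashed = hasher.transform(df)

Unlike StringIndexer, there is no fit step, so memory on the driver shouldn't be the bottleneck; the trade-off is that distinct values can collide into the same feature index.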