Thanks guys. @Filipp Zhinkin Yes, we may have a couple of string columns with 15 million+ unique values that need to be mapped to indices.
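If I follow your hash-code suggestion correctly, it would look roughly like this (just a sketch; the column name "city" and the bucket count are made up, and df is assumed to be an existing DataFrame):

    import org.apache.spark.sql.functions.{col, hash, lit, pmod}

    // Map each string to a bucket via its Murmur3 hash instead of a fitted
    // label-to-index table, so nothing has to be held on the driver.
    // Collisions are possible; a larger bucket count makes them rarer.
    val indexed = df.withColumn(
      "city_idx",
      pmod(hash(col("city")), lit(1 << 20))
    )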
@Nick Pentreath We are on 2.0.2, though I will check it out. Is it better from a hashing-collision perspective, or can it handle large volumes of data as well? (I've put a rough sketch of what I understand it to look like at the bottom of this mail.)

Regards,
Shahab

On Tue, Apr 10, 2018 at 10:05 AM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Also check out FeatureHasher in Spark 2.3.0, which is designed to handle
> this use case in a more natural way than HashingTF (and handles multiple
> columns at once).
>
> On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin <filipp.zhin...@gmail.com> wrote:
>
>> Hi Shahab,
>>
>> do you actually need a few columns with such a huge number of
>> categories whose values depend on the original values' frequencies?
>>
>> If not, then you may use a value's hash code as its category, or combine
>> all columns into a single vector using HashingTF.
>>
>> Regards,
>> Filipp.
>>
>> On Tue, Apr 10, 2018 at 4:01 PM, Shahab Yunus <shahab.yu...@gmail.com> wrote:
>>
>> > Does StringIndexer keep all the mapped label-to-index pairs in the
>> > memory of the driver machine? It seems to, unless I am missing something.
>> >
>> > What if the data that needs to be indexed is huge, the columns to be
>> > indexed have high cardinality (lots of categories), and more than one
>> > such column needs to be indexed? Then it wouldn't fit in memory.
>> >
>> > Thanks.
>> >
>> > Regards,
>> > Shahab
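P.S. Here is roughly what I understand the FeatureHasher suggestion to look like on 2.3.0 (a sketch only; the column names and feature count are made up, and df is again an assumed DataFrame):

    import org.apache.spark.ml.feature.FeatureHasher

    // Hash several high-cardinality string columns straight into one sparse
    // feature vector; no label-to-index mapping is kept on the driver.
    val hasher = new FeatureHasher()
      .setInputCols("city", "device_id")
      .setOutputCol("features")
      .setNumFeatures(1 << 22) // larger => fewer collisions, wider vectors

    val hashed = hasher.transform(df)

Unlike StringIndexer, there is no fit step, so memory on the driver shouldn't be the bottleneck; the trade-off is that distinct values can collide into the same feature index.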