This is great, thank you Zhang & Amol!! Yes, we can have multiple tags per row, and multiple regexes applied to a single row as well. Would you have any example of working with Spark & search engines like Solr or Elasticsearch? Does Spark ML provide tokenization support as expected (I am yet to try Spark ML, still a beginner)?
Any other reference material you found useful while working on a similar problem? Appreciate all the help!

Thanks,
-Rishi

On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar <amolumbar...@gmail.com> wrote:

> Rishi,
> Just adding to Zhang's questions.
>
> Are you expecting multiple tags per row?
> Do you check multiple regexes for a single tag?
>
> Let's say you had only one tag; then theoretically you should be able to do this:
>
> 1. Remove stop words or any irrelevant stuff.
> 2. Split the text into equal-sized chunk columns (e.g., if the max length is 1000 chars, split into 20 columns of 50 chars).
> 3. Distribute the work for each column, producing a binary (true/false) result for the single tag.
> 4. Merge the 20 resulting columns.
> 5. Repeat for the other tags, or run steps 3 and 4 for them in parallel.
>
> Note on 3: if you expect a single tag per row, you can repeat step 3 column by column and skip rows that got tags in a prior step.
>
> Secondly, if you expect similarity in the text (of some kind), you could just work on the unique text values (might require a shuffle, hence expensive) and then join the end result back to the original data. You could use a hash of some kind to join back. Though I would go for this approach only if the chances of similarity in the text are very high (which could be the case for you, this being transactional data).
>
> Not the full answer to your question, but hope this helps you brainstorm more.
>
> Thanks,
> Amol
>
> On Wed, May 13, 2020 at 10:17 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>
>> Thanks ZHANG! Please find details below:
>>
>> # of rows: ~25B; row size is somewhere around ~3-5MB (it's parquet-formatted data, so we need to worry only about the columns to be tagged)
>>
>> Avg length of the text to be parsed: ~300
>>
>> Unfortunately I don't have sample data or regexes that I can share freely. However, about the data being parsed: assume these are purchases made online and we are trying to parse the transaction details.
>> For example, purchases made on Amazon can be tagged to Amazon as well as to other vendors, etc.
>>
>> Appreciate your response!
>>
>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>>
>>> May I get some requirement details?
>>>
>>> Such as:
>>> 1. The row count and one-row data size
>>> 2. The avg length of the text to be parsed by RegEx
>>> 3. A sample format of the text to be parsed
>>> 4. A sample of the current RegEx
>>>
>>> --
>>> Cheers,
>>> -z
>>>
>>> On Mon, 11 May 2020 18:40:49 -0400 Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>
>>> > Hi All,
>>> >
>>> > I have a tagging problem at hand where we currently use regular expressions
>>> > to tag records. Is there a recommended way to distribute & tag? The data is
>>> > about 10TB.
>>> >
>>> > --
>>> > Regards,
>>> >
>>> > Rishi Shah
>>
>> --
>> Regards,
>>
>> Rishi Shah

--
Regards,

Rishi Shah
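Amol's chunk-and-merge recipe (split a row's text into equal-sized chunk columns, tag each chunk in parallel for a binary result, then merge) could be sketched outside Spark in plain Python, with a thread pool standing in for the executors; in Spark the same per-row function would typically run inside a UDF or `mapPartitions`. The `AMAZON` pattern here is made up for illustration — the real regexes are not shared in the thread.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pattern for a single tag (the real regexes are not shared).
AMAZON = re.compile(r"amazon", re.IGNORECASE)

def split_into_chunks(text, n_chunks):
    """Step 2: split one row's text into equal-sized chunk 'columns'."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]

def tag_chunk(chunk):
    """Step 3: one distributed work unit -> binary result for a single tag."""
    return AMAZON.search(chunk) is not None

def tag_row(text, n_chunks=4):
    """Steps 2-4 for one row: chunk, tag chunks in parallel, merge with OR."""
    chunks = split_into_chunks(text, n_chunks)
    with ThreadPoolExecutor() as pool:  # stand-in for Spark executors
        flags = list(pool.map(tag_chunk, chunks))
    return any(flags)  # step 4: merge the chunk columns

print(tag_row("AMAZON MKTPLACE PMTS order 12345"))  # True
print(tag_row("Local coffee shop downtown"))        # False
```

One caveat with this scheme: a match that straddles a chunk boundary is missed, so chunking fits best when the patterns are short relative to the chunk size (or the chunks overlap slightly).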
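Amol's second suggestion — tag only the unique text values, then join the result back to the original rows by a hash key — could be sketched like this. Plain Python dictionaries stand in for the Spark shuffle/join, and the sample strings and the `"amazon"` check are invented for illustration:

```python
import hashlib

def text_hash(text):
    """Stable join key for the original text (Amol's 'hash of some kind')."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

rows = [
    "AMAZON MKTPLACE PMTS",
    "Local Supermarket",
    "AMAZON MKTPLACE PMTS",  # duplicate text: tagged only once below
]

# Run the (expensive) tagging once per distinct text value...
unique_texts = set(rows)
tags_by_hash = {text_hash(t): ("amazon" in t.lower()) for t in unique_texts}

# ...then join the results back to the original rows via the hash key.
tagged = [(t, tags_by_hash[text_hash(t)]) for t in rows]
```

As Amol notes, this only pays off when duplicate text is common, since computing the distinct values costs a shuffle in Spark; for highly repetitive transaction strings it can cut the regex work substantially.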