For Elasticsearch you can use the official Elasticsearch-Hadoop connector: https://www.elastic.co/what-is/elasticsearch-hadoop
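As a rough illustration of the connector mentioned above, writing tagged rows back to Elasticsearch from PySpark can look like the sketch below. The host, port, index name ("transactions"), and id column are assumptions, not values from this thread, and the connector jar (e.g. `org.elasticsearch:elasticsearch-spark-20_2.11:<version>`) has to be supplied via `--packages` or the classpath:

```python
# Sketch only: append a Spark DataFrame to Elasticsearch through the
# elasticsearch-hadoop data source ("org.elasticsearch.spark.sql").
# Host, port, index name, and id column are illustrative assumptions.
ES_OPTIONS = {
    "es.nodes": "localhost",   # assumed Elasticsearch host
    "es.port": "9200",         # default HTTP port
    "es.mapping.id": "id",     # use our "id" column as the document _id
}

def write_tags_to_es(df, resource="transactions"):
    """Append a tagged DataFrame to the given Elasticsearch index."""
    writer = df.write.format("org.elasticsearch.spark.sql").mode("append")
    for key, value in ES_OPTIONS.items():
        writer = writer.option(key, value)
    writer.save(resource)
```

Reading works the same way via `spark.read.format("org.elasticsearch.spark.sql").load(...)`; the Spark-support page linked below covers the full option list.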
Elastic Spark connector docs: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html

On Thu, May 14, 2020, 21:14 Amol Umbarkar <amolumbar...@gmail.com> wrote:

> Check out SparkNLP for tokenization. I am not sure about Solr or Elasticsearch, though.
>
> On Thu, May 14, 2020 at 9:02 PM Rishi Shah <rishishah.s...@gmail.com> wrote:
>
>> This is great, thank you Zhang & Amol!!
>>
>> Yes, we can have multiple tags per row, and multiple regexes applied to a single row as well. Would you have any example of working with Spark & search engines like Solr or Elasticsearch? Does Spark ML provide tokenization support as expected (I am yet to try Spark ML, still a beginner)?
>>
>> Any other reference material you found useful while working on a similar problem? Appreciate all the help!
>>
>> Thanks,
>> -Rishi
>>
>> On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar <amolumbar...@gmail.com> wrote:
>>
>>> Rishi,
>>> Just adding to Zhang's questions.
>>>
>>> Are you expecting multiple tags per row?
>>> Do you check multiple regexes for a single tag?
>>>
>>> Let's say you had only one tag; then theoretically you should be able to do this:
>>>
>>> 1. Remove stop words or any other irrelevant content.
>>> 2. Split the text into equal-sized chunk columns (e.g., if the max length is 1,000 chars, split into 20 columns of 50 chars).
>>> 3. Distribute the work per column, producing a binary (true/false) result for the single tag.
>>> 4. Merge the 20 resulting columns.
>>> 5. Repeat for the other tags, or run steps 3 and 4 for them in parallel.
>>>
>>> Note on step 3: if you expect a single tag per row, you can repeat step 3 column by column and skip rows that already got tags in a prior step.
>>>
>>> Secondly, if you expect similarity in the text (of some kind), then you could just work on the unique text values (this might require a shuffle, hence expensive) and then join the end result back to the original data. You could use a hash of some kind to join back.
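The dedup-and-join-back idea can be sketched in plain Python as below. In Spark you would typically use `F.sha2` on the text column, `dropDuplicates`, and a `join` instead; the `tag_fn` here is a hypothetical stand-in for the expensive regex work:

```python
import hashlib

def text_hash(text):
    """Stable key for joining results back (F.sha2(col, 256) in Spark)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def tag_unique_then_join(rows, tag_fn):
    """Tag each distinct text once, then map the results back to every row.

    rows   : list of (row_id, text) pairs
    tag_fn : function text -> tag (the expensive regex work)
    """
    # 1. Collect the unique texts (dropDuplicates in Spark).
    unique = {text_hash(text): text for _, text in rows}
    # 2. Run the expensive tagging once per distinct text.
    tags = {h: tag_fn(text) for h, text in unique.items()}
    # 3. Join the results back to the original rows via the hash key.
    return [(row_id, text, tags[text_hash(text)]) for row_id, text in rows]
```

With highly repetitive transaction descriptions this means `tag_fn` runs once per distinct text rather than once per row, which is exactly the trade-off Amol describes: a shuffle up front in exchange for far fewer regex evaluations.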
>>> Though I would go for this approach only if the chances of similarity in the text are very high (which could be true in your case, since it is transactional data).
>>>
>>> Not the full answer to your question, but hope this helps you brainstorm more.
>>>
>>> Thanks,
>>> Amol
>>>
>>> On Wed, May 13, 2020 at 10:17 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>
>>>> Thanks ZHANG! Please find details below:
>>>>
>>>> # of rows: ~25B; row size is somewhere around ~3-5MB (it's parquet-formatted data, so we only need to worry about the columns to be tagged).
>>>>
>>>> Avg length of the text to be parsed: ~300.
>>>>
>>>> Unfortunately I don't have sample data or regexes which I can share freely. However, about the data being parsed: assume these are purchases made online and we are trying to parse the transaction details. For example, purchases made on Amazon can be tagged to Amazon as well as to other vendors, etc.
>>>>
>>>> Appreciate your response!
>>>>
>>>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>>>>
>>>>> May I get some requirement details?
>>>>>
>>>>> Such as:
>>>>> 1. The row count and one row's data size
>>>>> 2. The avg length of the text to be parsed by the RegEx
>>>>> 3. A sample of the format of the text to be parsed
>>>>> 4. A sample of the current RegEx
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> -z
>>>>>
>>>>> On Mon, 11 May 2020 18:40:49 -0400 Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>
>>>>> > Hi All,
>>>>> >
>>>>> > I have a tagging problem at hand where we currently use regular expressions to tag records. Is there a recommended way to distribute & tag? The data is about 10TB large.
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> >
>>>>> > Rishi Shah
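The regex tagging at the heart of the thread distributes naturally, since each row is independent. A minimal pure-Python sketch of a multi-tag matcher follows; the patterns are made up (the real ones were not shared), and in Spark this function would run inside a UDF, or, cheaper still, each tag could become a boolean column via `F.col("text").rlike(...)` with no Python serialization at all:

```python
import re

# Illustrative tag -> pattern table; the thread's actual regexes are not public.
TAG_PATTERNS = {
    "amazon": re.compile(r"\b(AMAZON|AMZN)\b", re.IGNORECASE),
    "uber":   re.compile(r"\bUBER\b", re.IGNORECASE),
}

def tag_text(text):
    """Return every tag whose pattern matches (multiple tags per row allowed)."""
    return sorted(tag for tag, pattern in TAG_PATTERNS.items()
                  if pattern.search(text))
```

Compiling the patterns once at module level matters here: with ~25B rows, per-row `re.compile` calls would dominate the runtime of a Python UDF.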