For Elasticsearch you can use the official Elasticsearch-Hadoop connector: https://www.elastic.co/what-is/elasticsearch-hadoop
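As a rough illustration of the connector mentioned above, writing tagged rows back to Elasticsearch from PySpark can look like the sketch below. The host, port, index name ("transactions"), and id column are assumptions, not values from this thread, and the connector jar (e.g. `org.elasticsearch:elasticsearch-spark-20_2.11:<version>`) has to be supplied via `--packages` or the classpath:

```python
# Sketch only: append a Spark DataFrame to Elasticsearch through the
# elasticsearch-hadoop data source ("org.elasticsearch.spark.sql").
# Host, port, index name, and id column are illustrative assumptions.
ES_OPTIONS = {
    "es.nodes": "localhost",   # assumed Elasticsearch host
    "es.port": "9200",         # default HTTP port
    "es.mapping.id": "id",     # use our "id" column as the document _id
}

def write_tags_to_es(df, resource="transactions"):
    """Append a tagged DataFrame to the given Elasticsearch index."""
    writer = df.write.format("org.elasticsearch.spark.sql").mode("append")
    for key, value in ES_OPTIONS.items():
        writer = writer.option(key, value)
    writer.save(resource)
```

Reading works the same way via `spark.read.format("org.elasticsearch.spark.sql").load(...)`; the Spark-support page linked below covers the full option list.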
Elastic Spark connector docs: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html

On Thu, May 14, 2020, 21:14 Amol Umbarkar <amolumbar...@gmail.com> wrote:

> Check out SparkNLP for tokenization. I am not sure about Solr or Elasticsearch, though.
>
> On Thu, May 14, 2020 at 9:02 PM Rishi Shah <rishishah.s...@gmail.com> wrote:
>
>> This is great, thank you Zhang & Amol!!
>>
>> Yes, we can have multiple tags per row, and multiple regexes applied to a single row as well. Would you have any example of working with Spark & search engines like Solr or Elasticsearch? Does Spark ML provide tokenization support as expected (I am yet to try Spark ML, still a beginner)?
>>
>> Any other reference material you found useful while working on a similar problem? Appreciate all the help!
>>
>> Thanks,
>> -Rishi
>>
>> On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar <amolumbar...@gmail.com> wrote:
>>
>>> Rishi,
>>> Just adding to Zhang's questions.
>>>
>>> Are you expecting multiple tags per row?
>>> Do you check multiple regexes for a single tag?
>>>
>>> Let's say you had only one tag; then theoretically you should be able to do this:
>>>
>>> 1. Remove stop words or any other irrelevant content.
>>> 2. Split the text into equal-sized chunk columns (e.g., if the max length is 1,000 chars, split into 20 columns of 50 chars).
>>> 3. Distribute the work per column, producing a binary (true/false) result for the single tag.
>>> 4. Merge the 20 resulting columns.
>>> 5. Repeat for the other tags, or run steps 3 and 4 for them in parallel.
>>>
>>> Note on step 3: if you expect a single tag per row, you can repeat step 3 column by column and skip rows that already got tags in a prior step.
>>>
>>> Secondly, if you expect similarity in the text (of some kind), then you could just work on the unique text values (this might require a shuffle, hence expensive) and then join the end result back to the original data. You could use a hash of some kind to join back.
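The dedup-and-join-back idea can be sketched in plain Python as below. In Spark you would typically use `F.sha2` on the text column, `dropDuplicates`, and a `join` instead; the `tag_fn` here is a hypothetical stand-in for the expensive regex work:

```python
import hashlib

def text_hash(text):
    """Stable key for joining results back (F.sha2(col, 256) in Spark)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def tag_unique_then_join(rows, tag_fn):
    """Tag each distinct text once, then map the results back to every row.

    rows   : list of (row_id, text) pairs
    tag_fn : function text -> tag (the expensive regex work)
    """
    # 1. Collect the unique texts (dropDuplicates in Spark).
    unique = {text_hash(text): text for _, text in rows}
    # 2. Run the expensive tagging once per distinct text.
    tags = {h: tag_fn(text) for h, text in unique.items()}
    # 3. Join the results back to the original rows via the hash key.
    return [(row_id, text, tags[text_hash(text)]) for row_id, text in rows]
```

With highly repetitive transaction descriptions this means `tag_fn` runs once per distinct text rather than once per row, which is exactly the trade-off Amol describes: a shuffle up front in exchange for far fewer regex evaluations.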
>>> Though I would go for this approach only if the chances of similarity in the text are very high (which could be true in your case, since it is transactional data).
>>>
>>> Not the full answer to your question, but hope this helps you brainstorm more.
>>>
>>> Thanks,
>>> Amol
>>>
>>> On Wed, May 13, 2020 at 10:17 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>
>>>> Thanks ZHANG! Please find details below:
>>>>
>>>> # of rows: ~25B; row size is somewhere around ~3-5MB (it's parquet-formatted data, so we only need to worry about the columns to be tagged).
>>>>
>>>> Avg length of the text to be parsed: ~300.
>>>>
>>>> Unfortunately I don't have sample data or regexes which I can share freely. However, about the data being parsed: assume these are purchases made online and we are trying to parse the transaction details. For example, purchases made on Amazon can be tagged to Amazon as well as to other vendors, etc.
>>>>
>>>> Appreciate your response!
>>>>
>>>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>>>>
>>>>> May I get some requirement details?
>>>>>
>>>>> Such as:
>>>>> 1. The row count and one row's data size
>>>>> 2. The avg length of the text to be parsed by the RegEx
>>>>> 3. A sample of the format of the text to be parsed
>>>>> 4. A sample of the current RegEx
>>>>>
>>>>> --
>>>>> Cheers,
>>>>> -z
>>>>>
>>>>> On Mon, 11 May 2020 18:40:49 -0400 Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>>>
>>>>> > Hi All,
>>>>> >
>>>>> > I have a tagging problem at hand where we currently use regular expressions to tag records. Is there a recommended way to distribute & tag? The data is about 10TB large.
>>>>> >
>>>>> > --
>>>>> > Regards,
>>>>> >
>>>>> > Rishi Shah
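The regex tagging at the heart of the thread distributes naturally, since each row is independent. A minimal pure-Python sketch of a multi-tag matcher follows; the patterns are made up (the real ones were not shared), and in Spark this function would run inside a UDF, or, cheaper still, each tag could become a boolean column via `F.col("text").rlike(...)` with no Python serialization at all:

```python
import re

# Illustrative tag -> pattern table; the thread's actual regexes are not public.
TAG_PATTERNS = {
    "amazon": re.compile(r"\b(AMAZON|AMZN)\b", re.IGNORECASE),
    "uber":   re.compile(r"\bUBER\b", re.IGNORECASE),
}

def tag_text(text):
    """Return every tag whose pattern matches (multiple tags per row allowed)."""
    return sorted(tag for tag, pattern in TAG_PATTERNS.items()
                  if pattern.search(text))
```

Compiling the patterns once at module level matters here: with ~25B rows, per-row `re.compile` calls would dominate the runtime of a Python UDF.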