This is great, thank you Zhang & Amol!! Yes, we can have multiple tags per row, and multiple regexes applied to a single row as well. Would you have any example of working with Spark & search engines like Solr or Elasticsearch? Does Spark ML provide tokenization support as expected (I am yet to try Spark ML, still a beginner)?
Any other reference material you found useful while working on a similar problem? Appreciate all the help!

Thanks,
-Rishi

On Thu, May 14, 2020 at 6:11 AM Amol Umbarkar <amolumbar...@gmail.com> wrote:

> Rishi,
> Just adding to Zhang's questions.
>
> Are you expecting multiple tags per row?
> Do you check multiple regexes for a single tag?
>
> Let's say you had only one tag; then theoretically you should be able to do this:
>
> 1. Remove stop words or any irrelevant stuff.
> 2. Split the text into equal-sized chunk columns (e.g., if the max length is 1000 chars, split into 20 columns of 50 chars).
> 3. Distribute the work for each column, producing a binary (true/false) result for the single tag.
> 4. Merge the 20 resulting columns.
> 5. Repeat for the other tags, or run steps 3 and 4 for them in parallel.
>
> Note on 3: if you expect a single tag per row, you can repeat step 3 column by column and skip rows that got tags in a prior step.
>
> Secondly, if you expect similarity in the text (of some kind), you could just work on the unique text values (might require a shuffle, hence expensive) and then join the end result back to the original data. You could use a hash of some kind to join back. Though I would go for this approach only if the chances of similarity in the text are very high (which could be the case for you, this being transactional data).
>
> Not the full answer to your question, but hope this helps you brainstorm more.
>
> Thanks,
> Amol
>
> On Wed, May 13, 2020 at 10:17 AM Rishi Shah <rishishah.s...@gmail.com> wrote:
>
>> Thanks ZHANG! Please find details below:
>>
>> # of rows: ~25B; row size is somewhere around ~3-5MB (it's parquet-formatted data, so we need to worry only about the columns to be tagged)
>>
>> Avg length of the text to be parsed: ~300
>>
>> Unfortunately I don't have sample data or regexes that I can share freely. However, about the data being parsed: assume these are purchases made online and we are trying to parse the transaction details.
>> For example, purchases made on Amazon can be tagged to Amazon as well as to other vendors, etc.
>>
>> Appreciate your response!
>>
>> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>>
>>> May I get some requirement details?
>>>
>>> Such as:
>>> 1. The row count and one-row data size
>>> 2. The avg length of the text to be parsed by RegEx
>>> 3. A sample format of the text to be parsed
>>> 4. A sample of the current RegEx
>>>
>>> --
>>> Cheers,
>>> -z
>>>
>>> On Mon, 11 May 2020 18:40:49 -0400 Rishi Shah <rishishah.s...@gmail.com> wrote:
>>>
>>> > Hi All,
>>> >
>>> > I have a tagging problem at hand where we currently use regular expressions
>>> > to tag records. Is there a recommended way to distribute & tag? The data is
>>> > about 10TB.
>>> >
>>> > --
>>> > Regards,
>>> >
>>> > Rishi Shah
>>
>> --
>> Regards,
>>
>> Rishi Shah

--
Regards,

Rishi Shah
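Amol's chunk-and-merge recipe (split a row's text into equal-sized chunk columns, tag each chunk in parallel for a binary result, then merge) could be sketched outside Spark in plain Python, with a thread pool standing in for the executors; in Spark the same per-row function would typically run inside a UDF or `mapPartitions`. The `AMAZON` pattern here is made up for illustration — the real regexes are not shared in the thread.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pattern for a single tag (the real regexes are not shared).
AMAZON = re.compile(r"amazon", re.IGNORECASE)

def split_into_chunks(text, n_chunks):
    """Step 2: split one row's text into equal-sized chunk 'columns'."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]

def tag_chunk(chunk):
    """Step 3: one distributed work unit -> binary result for a single tag."""
    return AMAZON.search(chunk) is not None

def tag_row(text, n_chunks=4):
    """Steps 2-4 for one row: chunk, tag chunks in parallel, merge with OR."""
    chunks = split_into_chunks(text, n_chunks)
    with ThreadPoolExecutor() as pool:  # stand-in for Spark executors
        flags = list(pool.map(tag_chunk, chunks))
    return any(flags)  # step 4: merge the chunk columns

print(tag_row("AMAZON MKTPLACE PMTS order 12345"))  # True
print(tag_row("Local coffee shop downtown"))        # False
```

One caveat with this scheme: a match that straddles a chunk boundary is missed, so chunking fits best when the patterns are short relative to the chunk size (or the chunks overlap slightly).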
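Amol's second suggestion — tag only the unique text values, then join the result back to the original rows by a hash key — could be sketched like this. Plain Python dictionaries stand in for the Spark shuffle/join, and the sample strings and the `"amazon"` check are invented for illustration:

```python
import hashlib

def text_hash(text):
    """Stable join key for the original text (Amol's 'hash of some kind')."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

rows = [
    "AMAZON MKTPLACE PMTS",
    "Local Supermarket",
    "AMAZON MKTPLACE PMTS",  # duplicate text: tagged only once below
]

# Run the (expensive) tagging once per distinct text value...
unique_texts = set(rows)
tags_by_hash = {text_hash(t): ("amazon" in t.lower()) for t in unique_texts}

# ...then join the results back to the original rows via the hash key.
tagged = [(t, tags_by_hash[text_hash(t)]) for t in rows]
```

As Amol notes, this only pays off when duplicate text is common, since computing the distinct values costs a shuffle in Spark; for highly repetitive transaction strings it can cut the regex work substantially.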