AFAICT from the data size (~25B rows, ~300-char key cell), this looks
like a common Spark job. But the regex might be complex; I guess there
are lots of alternatives to match, like (apple|banana|cola|...),
against the purchase list. Regex matching is a CPU-intensive task. If
the performance with more partitions still doesn't meet the
requirement, keyword indexing might help -- tokenize the purchase list
first and index the tokens (like a search engine) rather than running
the RegEx directly. There are also several search engines that work
well with Spark, such as Elasticsearch and Solr.
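To illustrate the idea, here is a minimal sketch in plain Python (not
Spark) of the difference between matching one big regex alternation and
tokenizing first, then doing set lookups. The keyword list and the
tokenizer pattern are hypothetical stand-ins for the real vendor list
and transaction text:

```python
import re

# Hypothetical keyword set; the real job might have thousands of vendor names.
KEYWORDS = {"apple", "banana", "cola"}

# Baseline: one big alternation, re-scanned against the whole text.
ALTERNATION = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b")

def tag_by_regex(text: str) -> set:
    return set(ALTERNATION.findall(text.lower()))

def tag_by_tokens(text: str) -> set:
    # Tokenize once, then an O(1) set lookup per token,
    # instead of trying every alternative at every position.
    tokens = re.findall(r"\w+", text.lower())
    return {t for t in tokens if t in KEYWORDS}

sample = "2x Cola and one apple from the corner shop"
assert tag_by_regex(sample) == tag_by_tokens(sample) == {"apple", "cola"}
```

The token-set approach only works for whole-word keyword matches; if
your patterns need real regex features (wildcards, character classes),
an inverted index in a search engine is the closer analogy.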

-- 
Cheers,
-z

On Wed, 13 May 2020 00:47:02 -0400
Rishi Shah <rishishah.s...@gmail.com> wrote:

> Thanks ZHANG! Please find details below:
> 
> # of rows: ~25B, row size would be somewhere around ~3-5MB (it's
> parquet-formatted data, so we only need to worry about the columns to
> be tagged)
> 
> avg length of the text to be parsed : ~300
> 
> Unfortunately don't have sample data or regex which I can share freely.
> However about data being parsed - assume these are purchases made online
> and we are trying to parse the transaction details. Like purchases made on
> amazon can be tagged to amazon as well as other vendors etc.
> 
> Appreciate your response!
> 
> 
> 
> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
> 
> > May I get some requirement details?
> >
> > Such as:
> > 1. The row count and one row data size
> > 2. The avg length of text to be parsed by RegEx
> > 3. The sample format of text to be parsed
> > 4. The sample of current RegEx
> >
> > --
> > Cheers,
> > -z
> >
> > On Mon, 11 May 2020 18:40:49 -0400
> > Rishi Shah <rishishah.s...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I have a tagging problem at hand where we currently use regular
> > expressions
> > > to tag records. Is there a recommended way to distribute & tag? Data is
> > > about 10TB large.
> > >
> > > --
> > > Regards,
> > >
> > > Rishi Shah
> >
> 
> 
> -- 
> Regards,
> 
> Rishi Shah

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
