AFAICT, from the data size (25B rows, ~300-character strings in the key cell), this looks like a common Spark job. But the regex is probably complex -- I would guess there are many items to match, something like (apple|banana|cola|...), against the purchase list. Regex matching is a CPU-intensive task. If performance still doesn't meet the requirement after adding more partitions, keyword indexing might help a little: tokenize the purchase list first and index the tokens (as a search engine does), rather than running the regex over the raw text directly. There are also several search engines that work well with Spark, such as Elasticsearch and Solr.
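A minimal sketch of the tokenize-then-lookup idea in Spark (Scala), assuming hypothetical column names, path, and keyword list -- the point is to replace one huge alternation regex per row with a tokenize step plus a broadcast hash join:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("KeywordTagging").getOrCreate()
import spark.implicits._

// Hypothetical schema: (txn_id, description); adjust to the real columns.
val purchases = spark.read.parquet("/path/to/purchases")

// Tokenize once, instead of scanning each row with (apple|banana|cola|...).
val tokens = purchases
  .withColumn("token", explode(split(lower(col("description")), "\\W+")))

// The keyword table is small, so broadcast it; each token lookup becomes
// a hash probe rather than a regex scan over the whole alternation.
val keywords = Seq("apple", "banana", "cola").toDF("token")

val tagged = tokens
  .join(broadcast(keywords), Seq("token"))
  .groupBy("txn_id")
  .agg(collect_set("token").as("tags"))
```

This parallelizes the same way as the regex job but avoids backtracking over a long alternation; tags that are multi-word phrases rather than single tokens would still need the regex (or a search-engine index) path.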
--
Cheers,
-z

On Wed, 13 May 2020 00:47:02 -0400
Rishi Shah <rishishah.s...@gmail.com> wrote:

> Thanks ZHANG! Please find details below:
>
> # of rows: ~25B, row size would be somewhere around ~3-5MB (it's a parquet
> formatted data so, need to worry about only the columns to be tagged)
>
> avg length of the text to be parsed: ~300
>
> Unfortunately don't have sample data or regex which I can share freely.
> However about data being parsed - assume these are purchases made online
> and we are trying to parse the transaction details. Like purchases made on
> amazon can be tagged to amazon as well as other vendors etc.
>
> Appreciate your response!
>
> On Tue, May 12, 2020 at 6:23 AM ZHANG Wei <wezh...@outlook.com> wrote:
>
> > May I get some requirement details?
> >
> > Such as:
> > 1. The row count and one row data size
> > 2. The avg length of text to be parsed by RegEx
> > 3. The sample format of text to be parsed
> > 4. The sample of current RegEx
> >
> > --
> > Cheers,
> > -z
> >
> > On Mon, 11 May 2020 18:40:49 -0400
> > Rishi Shah <rishishah.s...@gmail.com> wrote:
> >
> > > Hi All,
> > >
> > > I have a tagging problem at hand where we currently use regular
> > > expressions to tag records. Is there a recommended way to distribute
> > > & tag? Data is about 10TB large.
> > >
> > > --
> > > Regards,
> > >
> > > Rishi Shah
>
> --
> Regards,
>
> Rishi Shah