Thanks everyone. While working on Tagging I stumbled upon another setback.
There are about 5000 regexes I am dealing with, out of which a couple of
hundred have variable-length lookbehinds (originally these worked in a
JVM). In order to use these with a Python/PySpark UDF we need to either
modify these patterns or switch to a regex engine that supports them.
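For example (a made-up pattern, not one of the actual 5000): Python's built-in re module rejects variable-length lookbehind, while the third-party regex package accepts the same pattern:

import re
import regex  # third-party package: pip install regex

pattern = r"(?<=id:\s{1,5})\d+"  # lookbehind of variable width (1-5 spaces)

try:
    re.compile(pattern)              # built-in engine
except re.error as err:
    print("re rejects it:", err)     # "look-behind requires fixed-width pattern"

m = regex.search(pattern, "id:   42")  # regex engine allows variable-length lookbehind
print(m.group())                        # prints "42"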
For Elasticsearch you can use the official elasticsearch-hadoop connector.
https://www.elastic.co/what-is/elasticsearch-hadoop
Elastic spark connector docs:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
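A rough write-out from PySpark with that connector might look like the following sketch (the jar coordinates, host, path and index name are placeholders, not tested against your cluster):

# Assumes the connector jar is on the classpath, e.g.
#   spark-submit --packages org.elasticsearch:elasticsearch-spark-20_2.11:7.7.0 ...
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tag-to-es").getOrCreate()

tagged_df = spark.read.parquet("/data/tagged")   # placeholder path

(tagged_df.write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "es-host")               # placeholder host
    .option("es.port", "9200")
    .mode("append")
    .save("tags-index"))                         # placeholder target index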
On Thu, May 14, 2020, 21:14 Amol Umbarkar wrote:
Check out Spark NLP for tokenization. I am not sure about Solr or Elasticsearch
though.
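For what it's worth, a minimal Spark NLP tokenization pipeline would look roughly like this (just a sketch assuming the spark-nlp package is installed; column names are illustrative):

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()  # starts a Spark session with the spark-nlp jar

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

df = spark.createDataFrame([("Bought an apple, a banana and a cola",)], ["text"])
tokens = Pipeline(stages=[document, tokenizer]).fit(df).transform(df)
tokens.select("token.result").show(truncate=False)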
On Thu, May 14, 2020 at 9:02 PM Rishi Shah wrote:
This is great, thank you Zhang & Amol!!
Yes, we can have multiple tags per row and multiple regexes applied to a single
row as well. Would you have any example of working with Spark & search
engines like Solr or Elasticsearch? Does Spark ML provide tokenization
support as expected (I am yet to try Spark NLP)?
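Something along these lines is what I had in mind, assuming Spark ML's RegexTokenizer behaves the way I expect (untested; names are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("Bought: apple, banana and cola",)], ["text"])

# gaps=True means the pattern matches the separators between tokens
tok = RegexTokenizer(inputCol="text", outputCol="tokens",
                     pattern=r"\W+", gaps=True)
tok.transform(df).select("tokens").show(truncate=False)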
Rishi,
Just adding to Zhang's questions.
Are you expecting multiple tags per row?
Do you check multiple regexes for a single tag?
Let's say you had only one tag; then theoretically you should be able to do this -
1 Remove stop words or any irrelevant stuff (see the sketch below)
2 Split text into equal sized chunk column (eg - i
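For step 1, a minimal sketch with Spark ML's StopWordsRemover (this assumes the text is already split into an array column; names are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(["i", "bought", "an", "apple", "and", "a", "cola"],)], ["tokens"])

remover = StopWordsRemover(inputCol="tokens", outputCol="clean_tokens")
remover.transform(df).select("clean_tokens").show(truncate=False)
# default English stop words drop "i", "an", "and", "a"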
AFAICT, from the data size (25B rows, key cell ~300-char string), this looks
like a common Spark job. But the regex might be complex; I guess there
are lots of items to match, e.g. (apple|banana|cola|...) from the purchase
list. Regex matching is a CPU-heavy task. If the current
performance with m
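If the patterns can stay as plain alternations, one option (just a sketch with made-up names, not a drop-in for your regexes) is Spark's built-in regexp functions, which run in the JVM and avoid Python UDF serialization:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/purchases")        # placeholder path

# regexp_extract uses Java's regex engine and runs on the executors,
# so no data crosses into Python.
tag_pattern = r"(apple|banana|cola)"              # made-up alternation
tagged = df.withColumn("tag", F.regexp_extract(F.col("text"), tag_pattern, 1))

tagged.filter(F.col("tag") != "").show(5)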
Thanks ZHANG! Please find details below:
# of rows: ~25B, row size would be somewhere around ~3-5MB (it's parquet
formatted data, so we need to worry only about the columns to be tagged)
avg length of the text to be parsed: ~300
Unfortunately I don't have sample data or regexes which I can share freely
May I get some requirement details?
Such as:
1. The row count and the size of one row
2. The avg length of text to be parsed by RegEx
3. A sample of the text format to be parsed
4. A sample of the current RegEx patterns
--
Cheers,
-z
On Mon, 11 May 2020 18:40:49 -0400
Rishi Shah wrote:
Hi All,
I have a tagging problem at hand where we currently use regular expressions
to tag records. Is there a recommended way to distribute & tag? The data is
about 10TB in size.
--
Regards,
Rishi Shah