GitHub user aborsu985 opened a pull request:

    https://github.com/apache/spark/pull/4504

    RegEx Tokenizer for MLlib

    Added a regex-based tokenizer for MLlib.
    Currently the regex is fixed, but if a regex parameter were added to
    the paramMap, the tokenizer's regex could be tuned as a parameter
    during cross-validation.
    I also wonder what the best way to add a stop-word list would be.
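    The behavior described above (a configurable regex, a lowercasing
    toggle, a minimum token length, and a stop-word list) can be sketched
    in plain Python. This is only an illustration of the intended
    semantics; the function name, parameters, and defaults here are
    hypothetical and do not reflect the actual Spark MLlib API:

    ```python
    import re

    def regex_tokenize(text, pattern=r"\w+", lowercase=True,
                       min_token_length=1, stop_words=frozenset()):
        """Illustrative tokenizer: extract tokens matching `pattern`,
        optionally lowercase the input first, then drop tokens that are
        too short or appear in `stop_words`."""
        if lowercase:
            text = text.lower()
        tokens = re.findall(pattern, text)
        return [t for t in tokens
                if len(t) >= min_token_length and t not in stop_words]
    ```

    With this shape, the regex pattern is just another parameter, so a
    grid search over patterns during cross-validation would amount to
    trying several `pattern` values alongside the other hyperparameters.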


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aborsu985/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4504.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4504
    
----
commit 01cd26f856d7236035faf0c42f1f8f01ebbb2ce7
Author: Augustin Borsu <[email protected]>
Date:   2015-02-10T09:52:47Z

    RegExTokenizer
    
    A more complex tokenizer that extracts tokens based on a regex. It also
    allows lowercasing to be turned on and off, setting a minimum token
    length, and supplying a list of stop words to exclude.

commit 9547e9df7f64c74f33526b26b92f6f1ef841ae3c
Author: Augustin Borsu <[email protected]>
Date:   2015-02-10T10:39:39Z

    RegEx Tokenizer
    
    A more complex tokenizer that extracts tokens based on a regex. It also
    allows lowercasing to be turned on and off, setting a minimum token
    length, and supplying a list of stop words to exclude.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]