I usually use this regex. It's been a while since I last used it, so I
don't remember how well it performs...
#"(?<=[.!?]|[.!?][\\'\"])(?<!e\.g\.|i\.e\.|vs\.|p\.m\.|a\.m\.|Mr\.|Mrs\.|Ms\.|St\.|Fig\.|fig\.|Jr\.|Dr\.|Prof\.|Sr\.|[A-Z]\.)\s+"
and, as Lars said, all you need is clojure.string/split.
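Something along these lines should do it. This is only a rough sketch: the names sentence-boundary and split-sentences are made up for illustration, and I haven't tried it on a file as large as yours.

```clojure
(require '[clojure.string :as str])

;; The regex from above, bound to a name so it can be reused.
;; It matches the whitespace that follows ., ! or ? (optionally followed
;; by a quote or backslash), unless that whitespace follows a listed
;; abbreviation or a single capital letter plus a period.
(def sentence-boundary
  #"(?<=[.!?]|[.!?][\\'\"])(?<!e\.g\.|i\.e\.|vs\.|p\.m\.|a\.m\.|Mr\.|Mrs\.|Ms\.|St\.|Fig\.|fig\.|Jr\.|Dr\.|Prof\.|Sr\.|[A-Z]\.)\s+")

(defn split-sentences
  "Splits text into a vector of sentence strings (hypothetical helper)."
  [text]
  (str/split text sentence-boundary))

(split-sentences "Dr. Smith arrived at 5 p.m. sharp. He was late! Was anyone upset?")
;; => ["Dr. Smith arrived at 5 p.m. sharp." "He was late!" "Was anyone upset?"]
```

For your file that would be (split-sentences (slurp "mytext.txt")). Note that a sentence which really does end in one of the listed abbreviations, or in a capital letter plus a period such as "U.S.", will not be split from the one after it.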
Jim
On 06/07/13 16:56, Denis Papathanasiou wrote:
I have a plain text file containing an English-language essay I want
to split into sentences, based on common punctuation.
I wrote this function, which examines a character and determines if
it's an end of sentence punctuation mark:
(defn ispunc? [c]
  (> (count (filter #(= % c) '("." "!" "?" ";"))) 0))
I know this is not grammatically perfect, and that some text such as
"U.S.", etc. will be mis-parsed, but this is just an experiment and I
don't need that level of precision.
So I loaded my file using slurp and tried using the partition-by
function with ispunc? like this:
(def my-text (slurp "mytext.txt"))
(def my-sentences (partition-by ispunc? my-text))
Unfortunately, this returns a sequence of 1, where the only element is
the entire string.
So I tried splitting the string into a list of characters, and
applying partition-by with ispunc? like this:
(def my-text-chars (partition (count my-text) my-text))
(def my-sentences (partition-by ispunc? (nth my-text-chars 0)))
This worked, because it is logically correct, but I get
a java.lang.OutOfMemoryError when I try to access any of the elements
in my-sentences (the plain text "mytext.txt" file is 1.3 MB in size).
So is there a way to do this more idiomatically, without splitting
into single chars and recombining?
While 1.3 MB is not small, it's also not so large that it can't be
slurped, so there must be a simpler way of splitting on punctuation
into sentences.
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient
with your first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.