On Sat, Jul 6, 2013 at 11:42 AM, Denis Papathanasiou
<denis.papathanas...@gmail.com> wrote:
> (def my-text (slurp "mytext.txt"))
> (def my-sentences (partition-by ispunc? my-text))
>
> Unfortunately, this returns a sequence of 1, whose first and only element
> contains the entire text, since ispunc? depends on looking at a single
> character.
>
> So I tried producing a list of chars from the string and passing it to
> partition-by with ispunc? like this:
>
> (def my-text-chars (partition (count my-text) my-text))
> (def my-sentences (partition-by ispunc? (first my-text-chars)))
>
> That worked, in that it's logically "correct", but when I try to access any
> of the elements in my-sentences I get a java.lang.OutOfMemoryError (the
> source text file, "mytext.txt" is 1.3 mb in size).
>
> So is there a simpler and more idiomatic way of doing this without using up
> all the heap space?

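If that kind of splitting is really all you require, either
(clojure.string/split my-text #"[.!?;]") or (re-seq #"[^.!?;]+"
my-text) should do it. Both work on the string directly, so there is no
need to build a sequence of characters first.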
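
Here is a minimal sketch of both calls, run on a made-up sample string
rather than the real file. The names sample-text, by-split, by-re-seq,
sentences and with-punct are only for illustration, and the trimming
step and the punctuation-keeping variant are additions you may or may
not want:

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample standing in for (slurp "mytext.txt").
(def sample-text "Hello there. How are you? Fine; thanks!")

;; split drops the delimiters; each piece keeps its leading whitespace.
(def by-split (str/split sample-text #"[.!?;]"))

;; re-seq lazily yields each maximal run of non-punctuation characters.
(def by-re-seq (re-seq #"[^.!?;]+" sample-text))

;; Optional cleanup: trim whitespace and drop blank pieces.
(def sentences
  (->> by-re-seq
       (map str/trim)
       (remove str/blank?)))

;; Variation (an assumption about what you might want): keep the
;; terminating punctuation character with each piece.
(def with-punct (re-seq #"[^.!?;]+[.!?;]?" sample-text))
```

With the real file you would pass (slurp "mytext.txt") in place of
sample-text. Since re-seq returns a lazy sequence, the pieces are
produced as you consume them.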

For fancier stuff look into an opennlp wrapper or something like it.

https://github.com/dakrone/clojure-opennlp

Lars Nilsson

-- 
-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en