Re: review the clojure.string code

Stuart Halloway Sun, 30 May 2010 19:51:29 -0700

Thanks! Trying to pass through non-strings was overreaching. Ease of use first: 
the API should return immutable strings. If you really need to optimize more 
than this, roll your own.


Stu

> Type-hinting args as a CharSequence is a GoodThing; type-hinting that
> you're returning a CharSequence when you're actually returning a
> String is not.
> 
> I disagree with Steven that some functions should return the
> StringBuilder instance due to being type-hinted as CharSequence.
> CharSequence is barely a step above being a marker interface, useful
> only insofar as calling .toString() is likely to be useful (which is
> precisely what StringBuilder's constructor does anyway).  Further, any
> purported performance benefit from getting the mutable object back is
> lost if/when it is passed to another clojure.string function since the
> very first usage of a CharSequence is always to call .toString() on
> it.
> 
> I'd go the other way and document functions as returning a String, and
> type-hinting them as such.  This also has practical effects:
> 
> user=> (defn ^CharSequence reverse
>  [^CharSequence s]
>  (.toString (.reverse (StringBuilder. s))))
> #'user/reverse
> user=> (set! *warn-on-reflection* true)
> true
> user=> (.substring (reverse "hello") 3)
> Reflection warning, NO_SOURCE_PATH:7 - call to substring can't be
> resolved.
> "eh"
> user=> (defn ^String reverse
>  [^CharSequence s]
>  (.toString (.reverse (StringBuilder. s))))
> #'user/reverse
> user=> (.substring (reverse "hello") 3)
> "eh"
> 
> As for the concern about data copying, there are reasonable
> optimizations already in place such that the String returned from
> StringBuilder.toString() is likely to be created with the same char
> array instance; no copying needed.
> 
> As for the use of StringBuffer, it appears that they are used solely
> when dealing with regex, since that API requires them.
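
To illustrate the StringBuffer point: on the JVMs of this era, java.util.regex.Matcher's appendReplacement and appendTail accept only a StringBuffer. The following is a minimal sketch of a replace-first-by style function, assuming a hypothetical name (replace-first-by-sketch); it is not the clojure.string source.

```clojure
(import '(java.util.regex Pattern Matcher))

(defn replace-first-by-sketch
  "Hypothetical helper: replaces the first match of re in s with the
  result of calling f on the matched text. Returns a String."
  [^CharSequence s ^Pattern re f]
  (let [m (re-matcher re s)]
    (if (.find m)
      ;; Matcher's append methods take a StringBuffer, hence no StringBuilder here.
      (let [buf (StringBuffer.)]
        ;; quoteReplacement keeps any $ or \ in f's result literal.
        (.appendReplacement m buf (Matcher/quoteReplacement (str (f (.group m)))))
        (.appendTail m buf)
        (.toString buf))
      (.toString s))))
```

For example, (replace-first-by-sketch "hello world" #"\w+" #(.toUpperCase ^String %)) replaces only the first word.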
> 
> 
> On May 30, 3:45 pm, "Steven E. Harris" <s...@panix.com> wrote:
>> Why do some of the functions use StringBuilder (no internal
>> synchronization) and some use StringBuffer (provides internal
>> synchronization)? Using the latter is probably a mistake.
>> 
>> The first function -- reverse -- uses StringBuilder#reverse() to reverse
>> the character sequence in place, and then calls StringBuilder#toString()
>> to yield a String. Why is the final step necessary? Since StringBuilder
>> implements CharSequence, isn't it sufficient to just return the
>> StringBuilder? That's what the function signature promises. Calling
>> toString() makes yet another copy of the data, so it's best avoided.
>> 
>> Some of these functions construct instances of class StringBuilder using
>> its default constructor, which gives it a default capacity. But most of
>> the functions also know at the outset some reasonable size for the
>> buffer. For instance, function `replace-first-by' could reasonably use
>> the length of its argument "s" as the size of the buffer, even though
>> the replacement for some matched substring may increase the overall
>> length.
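
The presizing suggestion can be sketched roughly as follows. This assumes a hypothetical escape-style function (escape-sketch) that maps characters through cmap; the names are made up for illustration. The only point is passing the input length as the builder's initial capacity, so the builder grows only if replacements lengthen the text.

```clojure
(defn escape-sketch
  "Hypothetical helper: replaces each char of s that has an entry in cmap
  with that entry, and copies every other char unchanged."
  [^CharSequence s cmap]
  (let [n  (.length s)
        sb (StringBuilder. n)]          ; initial capacity = input length
    (dotimes [i n]
      (let [ch (.charAt s i)]
        (if-let [r (cmap ch)]
          (.append sb (str r))          ; replacement may be longer than one char
          (.append sb ch))))
    (.toString sb)))
```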
>> 
>> Why does function `replace-first' unconditionally call
>> CharSequence#toString() on its argument "s", when in several cases just
>> using "s" as a CharSequence would be fine? Again, calling
>> CharSequence#toString() might make a copy of the data, depending on
>> whether the CharSequence was actually a String or something like a
>> StringBuilder.
>> 
>> The implementation of a function like `trim' could be more efficient,
>> given that it's allowed to return a CharSequence. You can construct a
>> StringBuilder of length equal to the source CharSequence, then walk the
>> source sequence from the beginning, skipping over all whitespace,
>> copying the non-whitespace characters, repeating until you hit the end
>> of the source string. The "right trim" behavior is a little tricky, as
>> you have to suspend the copying upon encountering whitespace, but then
>> back up and copy that whitespace upon encountering something else before
>> the end of the source string. Alternately, just walk backward from the
>> end of the source string.
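
A rough sketch of the "walk in from both ends" variant, under two assumptions: the name trim-sketch is hypothetical, and whitespace is tested with Character/isWhitespace, which is not identical to what String#trim strips. It finds both boundaries by index and copies the data only once, at the end. It returns a String, in line with where this thread ended up.

```clojure
(defn trim-sketch
  "Hypothetical helper: strips leading and trailing whitespace from any
  CharSequence by locating the boundaries first, then copying once."
  [^CharSequence s]
  (let [n     (long (.length s))
        ;; index of the first non-whitespace char (or n if there is none)
        start (loop [i 0]
                (if (and (< i n) (Character/isWhitespace (.charAt s i)))
                  (recur (inc i))
                  i))
        ;; one past the last non-whitespace char, never below start
        end   (loop [i n]
                (if (and (> i start) (Character/isWhitespace (.charAt s (dec i))))
                  (recur (dec i))
                  i))]
    (.toString (.subSequence s start end))))
```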
>> 
>> The bonus is that you only copy the data once. With the current
>> implementation, calling CharSequence#toString() might copy the data
>> once, and calling String#trim() will copy it again.
>> 
>> Rounding out the review, function `trim-newline' doesn't have to call
>> toString() on the CharSequence yielded by CharSequence#subSequence(), as
>> it already promises to return a CharSequence.
>> 
>> Most Java code poisons its string manipulation efficiency by always
>> promising to return String rather than CharSequence. You've done better
>> in your signatures here, so I'm just encouraging you to avoid String and
>> the extra copies it forces.
>> 
>> --
>> Steven E. Harris
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en

