First, I love this discussion! Great questions, great thinking.

Up top, I wanted to mention a couple of things that have changed in alpha6:
- Eduction is no longer Seqable and thus the return from eduction is not 
seqable  (but it is reducible and iterable). You can use iterator-seq to 
get a chunked seq over the top if you need one. 
- Prior to alpha6, sequence with transformations returned a 
LazyTransformer. That's gone and now transformations are done using 
TransformerIterator, which is subsequently wrapped with a (now chunked) 
iterator-seq.
- There are lots of performance implications from those changes, so I 
would recommend re-running any perf test related to sequence or eduction 
on alpha6 to get a fresh picture; anything pre-alpha6 is not comparable.
- eduction now takes multiple transformations, not just one, and composes 
them. This is designed for mechanical rewriting (hello tool developers!!) 
of ->> chains like this:
(->> s (interpose 5) (partition-all 2))

to this:

(->> s (eduction (interpose 5) (partition-all 2)))

That particular example has timings in CLJ-1669 and shows a reduction from 
501 microseconds to 108 microseconds on a source vector, with comparable 
savings on a source sequence. You do need to be careful not to include 
non-transducer functions in that chain, and to stop the rewrite when you 
hit a materializing function at the end (for example, sort). 

On Wednesday, April 1, 2015 at 2:13:07 AM UTC-5, Tassilo Horn wrote:
>
> Hi all, 
>
> I've switched many nested filter/map/mapcat applications in my code to 
> using transducers.  That brought a moderate speedup in certain cases 
> and the deeper the nesting has been before, the clearer the transducers 
> code is in comparison, so yay! :-) 
>

Depending what you're transducing over, you may see some additional gains 
in alpha6.
 

> However, I'm still quite unsure about the difference between `sequence` 
> and `eduction`.  From the docs and experimentation, I came to the 
> assumptions below and I'd be grateful if someone with more knowledge 
> could verify/falsify/add: 
>
>   - Return types differ: Sequence returns a standard lazy seq, eductions 
>     an instance of Eduction. 
>

Yes, although I think the actual return type of eduction is not important; 
rather, the key thing is the implemented interfaces, Iterable and 
IReduceInit.
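A quick illustration of those two interfaces in action (assuming a recent alpha, where eduction composes its transformations):

```clojure
(def e (eduction (map inc) [1 2 3]))

;; IReduceInit: reduce consumes it directly, no intermediate seq
(reduce + 0 e)
;; => 9

;; Iterable: an iterator (and thus iterator-seq) is available
(iterator-seq (.iterator ^java.lang.Iterable e))
;; => (2 3 4)
```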
 

>   - Eductions are reducible/sequable/iterable, i.e., basically I can use 
>

It's no longer explicitly seqable, although seq will automatically wrap a 
chunked iterator-seq around it if it's passed to a sequence function.
 

>     them wherever a (lazy) seq would also do, so sequence and eduction 
>     are quite interchangeable except when poking at internals, e.g., 
>     (.contains (sequence ...) x) works whereas (.contains (eduction ...) 
>     x) doesn't. 
>

No guarantees made on those internals, only on what is promised in the 
docstrings (reducible/iterable for eduction and sequence for sequence). :) 
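In practice that coercion means most seq-consuming code still works on an eduction (a sketch against the alpha6 behavior, where seq wraps a chunked iterator-seq):

```clojure
(def e (eduction (map inc) (range 5)))

;; Not Seqable itself, but seq functions coerce it:
(seq e)
;; => (1 2 3 4 5)
(take 2 e)
;; => (1 2)
```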
 

>   - Both compute their contents lazily. 
>

eduction is not lazy by itself. If used in a non-seq context, it is eagerly 
recomputed on each use. If you use it in a seq context, then it will be 
computed and cached as needed.
 

>   - Lazy seqs cache their already realized contents, eductions compute 
>     them over and over again on each iteration. 
>

Yes.
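A sketch of that difference, using a side-effect counter just to observe how many times the transformation runs (calls is a hypothetical helper, not part of any API):

```clojure
(def calls (atom 0))
(def xf (map (fn [x] (swap! calls inc) x)))

;; eduction recomputes on every reduction/iteration:
(def e (eduction xf [1 2 3]))
(reduce + 0 e)
(reduce + 0 e)
@calls
;; => 6  (3 + 3: the transformation ran twice)

;; sequence computes once and caches:
(reset! calls 0)
(def s (sequence xf [1 2 3]))
(doall s)
(doall s)
@calls
;; => 3  (the second pass is served from the cache)
```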
 

>
> Because of that, I came to the conclusion that whenever I ask myself if 
> one of my functions should return a lazy seq or an eduction, I should 
> use these rules: 
>
>   1. If the function is likely to be used like 
>
>      (let [xs (seq-producing-fn args)] 
>        (or (do-stuff-with xs) 
>            (do-other-stuff-with xs) 
>            ...)) 
>
>      that is, the resulting seq is likely to be bound to a variable 
>      which is then used multiple times (and thus lazy seq caching is 
>      benefitical), then use sequence. 
>

Yes, it makes sense to take advantage of seq caching this way.
 

>   2. If it is a private function only used internally and never with the 
>      usage pattern of point 1, then definitively use eduction. 
>


>   3. If it's a public function which usually isn't used with a pattern as 
>      in point 1, then I'm unsure.  eduction is probably more efficient 
>      but sequence fits better in the original almost everything returns 
>      a lazy seq design.  Also, the latter has the benefit that users of 
>      the library don't need to know anything about transducers. 
>
> Is that sensible?  Or am I completely wrong with my assumptions about 
> sequence and eduction? 
>

You're definitely on the right track. How the result is going to be used is 
pretty important - eduction is designed to be consumed entirely for a 
result and thus gives up the caching of sequences. 

If you are going to consume the entire thing once, particularly in the 
context of a reduction, then eduction is far more efficient. Eduction does 
not need to allocate or cache intermediate sequence objects, so is much 
more memory and allocation efficient when used in this way (where the 
entire transformation and use can be iterable/reducible). 
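For example, when the whole result feeds a single reduce, eduction is the natural fit (illustrative data; the sequence version below computes the same answer but allocates and caches an intermediate seq along the way):

```clojure
;; Consumed exactly once, in a reduce: a good fit for eduction.
(reduce + 0 (eduction (filter odd?) (map #(* 2 %)) (range 1000)))
;; => 500000

;; Same answer via sequence, but with intermediate seq allocation:
(reduce + 0 (sequence (comp (filter odd?) (map #(* 2 %))) (range 1000)))
;; => 500000
```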
 

> On a related note, could someone please clarify the statement from the 
> transducers docs for `sequence`? 
>
> ,----[ Docs of sequence at http://clojure.org/transducers ] 
> | The resulting sequence elements are incrementally computed. These 
> | sequences will consume input incrementally as needed and fully realize 
> | intermediate operations.  This behavior differs from the equivalent 
> | operations on lazy sequences. 
> `---- 
>
> I'm curious about the "fully realize intermediate operations" part. 
>

Transducers are a pull model, pulled from the output. As you need each 
output element of the sequence returned from sequence, you must fully 
complete a "step", so there is *no* laziness inside transducer functions. 
If an expanding transformation (like mapcat) produces a large number of 
intermediate elements (or in the worst case, an infinite number), then the 
transducer will eagerly realize the entire expansion at the point where 
that first element is needed. That is a case where lazy seqs will probably 
be better. I think this case is actually relatively rare though. The new 
chunked sequence return may actually make this particular case worse too 
(worse in doing more eager bunches of work, better in amortizing that work 
over a large source). 
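A sketch of that eagerness, counting how many inner elements actually get realized (counted-range is a hypothetical helper added purely for observation; the lazy-seq count depends on chunking):

```clojure
(def realized (atom 0))
(defn counted-range [n]
  (map (fn [x] (swap! realized inc) x) (range n)))

;; Transducer mapcat: asking for the first output element forces
;; the entire inner expansion for that input.
(first (sequence (mapcat counted-range) [100]))
@realized
;; => 100

;; Lazy mapcat: only realizes what chunking demands
;; (one 32-element chunk here, not all 100).
(reset! realized 0)
(first (mapcat counted-range [100]))
@realized
```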
 

> Does it mean that in a "traditional" 
>
>     (mapcat #(range %) (range 10000)) 
>
> the inner range is also evaluated lazy but with 
>
>     (sequence (mapcat #(range %)) (range 10000)) 
>
> it is not?  It seems so.  


Yes, exactly.
 

> At least dorun-ning these two expressions 
> shows that the "traditional" version is more than twice as fast than the 
> transducer version.  Also, the same seems to hold for 
>
>     (eduction (mapcat #(range %)) (range 10000)) 
>
> which is exactly as fast (or rather slow) as the sequence version. 
>
>
You might want to re-time these with alpha6 to see the impact of chunking 
(I'm not sure whether I'd expect it to be better or worse actually.) :)

> But wouldn't that mean that transducers with mapcat where the mapcatted 
> function isn't super-cheap is a bad idea in general at least from a 
> performance POV? 
>

It depends on the use case - as a wise man named Rich once told me, "Why 
guess when you can measure?". :) 
 

>
> Bye, 
> Tassilo 
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en