On Fri, Dec 12, 2008 at 6:37 AM, Rich Hickey <richhic...@gmail.com> wrote:
> I appreciate the time you and others have spent on this, and will
> improve filter, but I'm not sure where you are getting your
> presumptions about lazy sequences. They are not a magic bullet that
> makes working with data bigger than memory transparent. They do make
> it possible in some cases, but that is not their intent. Latent
> runtime memory consumption is a fact of life of laziness, and is
> present in Haskell etc. I can make it go away for the specific case of
> filter and the rest of the library, but not generally.

I think this is a great discussion to have, so let me see if I can
articulate my thoughts a little better.

First, let's talk about laziness, because I think we're in agreement
here.  Laziness is undoubtedly a very powerful tool.  In Haskell,
laziness is the default.  Every single function call is automatically
lazy.  In LISP/Scheme, laziness is optional.  It's something you have
to specifically opt into using delay and force.  Which is better?
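
To make the opt-in model concrete, here is what it looks like in
Clojure (note that delay itself caches its result after the first
force):

(def d (delay (do (println "computing...") 42)))
(force d)  ; prints "computing...", returns 42
(force d)  ; returns 42 without recomputing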

Peter Van Roy, the designer of the Oz programming language, has argued
in many places that laziness is not a good default for a
general-purpose programming language.  This is also a point he
discusses in his excellent book "Concepts, Techniques, and Models of
Computer Programming".  He has probably explained this point better
than I can, so I won't belabor it here, but in a nutshell, it all
comes down to the fact that laziness makes the performance (especially
space performance) of your program much harder to analyze.  It is all
too easy to write a Haskell program that blows up or performs poorly,
and you have no idea why.  You can get back some performance by
opting-out of the laziness with strictness annotations, but it can be
difficult to figure out how or where to do this.  To put it simply,
it's easier to write high-performing code when you opt-in to laziness
when you need it, rather than trying to figure out how to opt out when
it's wrecking things for you.

I assume that you agree with this, since you have chosen the explicit
force/delay "opt-in laziness" model for Clojure.

So now let's talk about sequences, specifically the behavior of
lazily-generated sequences.  A similar choice needs to be considered.
It is easy to imagine two possible implementations of lazy-cons.
lazy-cons-cached would work exactly as lazy-cons currently does.
lazy-cons-uncached would be similar, but would not cache its first and
rest values when computed.  Sequences built with either version of
lazy-cons would both respond to the same sequence interface, and would
therefore behave the same, producing the same results as long as the
generating functions don't refer to mutable data, but possibly with
different performance profiles.  lazy-cons-cached sequences might run
faster when traversing the same lazy list more than once, but also
might crash your program by consuming too much memory in places where
lazy-cons-uncached would not.  It's a tradeoff, and both versions of
lazy-cons are potentially useful, but which should be the default, or
more specifically, which should be the default used by all of
Clojure's library functions like map, filter, etc.?  Should these
lazily-generated sequences be automatically cached, or not?
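
To make the distinction concrete, here is a toy model of the two kinds
of cells (just maps holding thunks, not real implementations of the
sequence interface):

(defn cons-cached [x rest-fn]
  (let [r (delay (rest-fn))]        ; delay memoizes the computation
    {:first x, :rest #(force r)}))  ; rest computed at most once

(defn cons-uncached [x rest-fn]
  {:first x, :rest rest-fn})        ; rest recomputed on every call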

The first question is, does it matter?  I would say that yes, this
design decision matters a great deal.  I think I remember you saying
somewhere (in a post here, or possibly in one of your presentations)
that sequences, especially lazy sequences, have become a far more
important part of Clojure programming than you envisioned at the
outset.  And it is easy to see why lazy sequences take on a
significant role in a fairly pure functional programming language:

People coming from imperative languages often ask how you code up
certain kinds of for/while loops that are used to traverse data
structures and build up new ones.  Let's say you want to build a list
of all the even squares of the whole numbers below 100.  One
literal way to formulate this problem in an imperative language
(taking Python 2.6 syntax as a sample) would be:
l = []
for i in xrange(100):
    square = i*i
    if square % 2 == 0:
        l.append(square)

Now in Clojure, things like this can be written using loop/recur.  But
any newcomer to Clojure will certainly be told that there's a more
elegant way to express this concept:

(def l (filter even? (map #(* % %) (range 100))))

or the equivalent comprehension.
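
For reference, that comprehension might read:

(def l (for [i (range 100)
             :let [square (* i i)]
             :when (even? square)]
         square))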

The newcomer's first reaction is to say, "Whoa, but won't that have
poor performance?  You're generating an awful lot of big, intermediate
lists."  And the answer is usually, "Well, it's lazy, so no, in fact,
this elegant form behaves like the iterative version.  You can have
your cake and eat it too."

You said, "I'm not sure where you are getting your presumptions about
lazy sequences. They are not a magic bullet that makes working with
data bigger than memory transparent."  Well, actually, I know that
cached lazy sequences do not always behave like their iterative
counterparts.  But this is where these presumptions are born.
Iterative stuff is ugly in functional languages.  We are told to use
comprehensions, map, filter, and tricks like the one from Stuart
Halloway's blog about writing an indexed function that builds a
temporary sequence of index/value pairs to avoid certain imperative
patterns.  And we
expect the behavior to be similar, because most of the time it is.
When something blows up from some subtle capturing of a lazy list, we
feel frustrated and betrayed.

Next, let's consider whether it is easier to opt-in or opt-out of
caching your sequences.  Opting out of sequence caching is very
difficult.  Right now, all the built-in sequence functions use a
cached lazy-cons, and there is no way to undo that caching.  The first
line of defense is to make sure that you don't give an intermediate
temporary name to any of the lazy lists you are transforming.  In
other words, you should never do something like:

(def squares (map #(* % %) (range 100)))
(def even-squares (filter even? squares))

This kind of thing could crash your program with large lists.  It's
irritating to have to be careful about this, because it seems
intuitive that giving a temporary name to something shouldn't be the
kind of thing that could crash your program, but okay, let's say you
learn to be really careful, and to carefully avoid naming your lists.
But then, sometimes your lists get captured by closures, or other
subtle corner cases, and it just gets frustrating.  Using lazy
sequences, which are pervasive in Clojure, should not be so difficult.
Right now the only solution is to rewrite one's code to use
loop/recur, and it sounds like eventually there will be an alternative
"generators" system which you say you're envisioning more as an I/O
tool with a special interface, rather than as a general-purpose form
of sequences.
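
To illustrate the closure case with a contrived sketch: the returned
function below captures the head of xs, so with cached cells every
element it ever realizes stays reachable for as long as the function
itself is alive.

(defn make-reader []
  (let [xs (map #(* % %) (range 100000000))]
    (fn [n] (nth xs n))))  ; holding this fn holds every realized cell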

On the other hand, opting in to sequence caching is very easy.  You
only gain the benefits of a cached sequence when you use it twice,
which inherently means that you have to name it or bind it to
something in order to refer to it multiple times.  So if I'm going to
be traversing a sequence multiple times, I just opt in by calling a
function to turn it into a cached sequence.  (Let's call it
"cached-seq".  I know there is such a function in the Clojure API
already, but I haven't really verified that it works the way I mean
here.  So if I'm using the term in a different way from the way that
cached-seq actually works in the current API, please bear with me for
the sake of argument.)

So with an opt-in system, where range, map and filter use
lazy-cons-uncached, you could easily write something like:
(def squares (map #(* % %) (range 100)))
(def even-squares (cached-seq (filter even? squares)))

This means that squares behaves exactly like its iterative
counterpart, despite the fact that I named it, and even-squares I'm
setting up with caching because I intend to use it repeatedly later in
my program.

Is the opt-out or opt-in system more intuitive, and which is easier to
analyze from a correctness and performance standpoint?  I'd argue that
the opt-out system has serious problems with respect to being
intuitive, and ease of confirming correctness and performance bounds.
Programmers really want these things (most of the time) to behave like
the iterative loops they are replacing.  It can be very subtle to
detect the kinds of things that can cause an explosion of memory
consumption.  The filter-map explosion issue is a great case in point.
You can test it on a wide variety of inputs, even big inputs, and it
seems to work fine.  But then when you pass it a combination of filter
and map that results in a large "window" between filtered elements in
the mapped list, it blows up.  Even if you patch the specific behavior
of filter, this is indicative of a larger problem.  filter is
currently written in the most natural way using lazy-cons.  Other
programmers are going to write similar functions using lazy-cons, and
these programs will all be brittle and prone to unpredictable failure
on certain kinds of large inputs.
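
For concreteness, the sort of expression that triggers the blow-up
looks completely innocent (a sketch; the precise trigger depends on
how filter is implemented):

(first (filter #(= % 99999999)
               (map inc (range 100000000))))

The scan for the single match can end up retaining tens of millions of
realized cells before it finds anything to yield.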

On the other hand, the opt-in system is fairly straightforward to
understand.  If a sequence is not explicitly cached, you can expect a
traversal to retain no memory afterwards, and the traversal time will
be the same every time you execute it.  Caching a sequence becomes
explicit, and is then clearly identified in the code as such.

>
> Not caching would lead to excessive recalculation in many contexts, and
> then people would be complaining about performance.

Let's talk about performance.  First, I would claim that the vast
majority of lazy sequences (especially the really big ones), are used
as one-off temporary sequences to more elegantly express something
that would be expressed via looping in an imperative language.  So for
these cases (which I believe to be the most common case), you gain
nothing by caching, and in fact, you lose something (a small amount of
increased memory consumption / garbage collection, and increased
brittleness and unpredictability).

Then there are some sequences which are traversed more than once, but
the computation is fairly trivial (e.g., the output of range).  In
these situations, it's probably a wash.  I remember reading a paper
not long ago (can't remember exactly which one, sorry) that pointed
out that most programmers' intuitions about the need to cache
intermediate results are often wrong, and that simple computations should
just be recomputed.  This is because chips have gotten really fast,
and one of the biggest performance hits these days is a cache miss, so
if you store something, and it requires the program to go off and look
in the "slow part of memory" to find it, you're much worse off than if
you had just recomputed.

Sometimes, if you know you are going to be traversing a sequence more
than once, you would be better off converting it to an explicitly
realized list or putting the data in a vector.
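
For example:

(def even-squares (vec (filter even? (map #(* % %) (range 100)))))

Here vec forces the whole computation once, and every later traversal
walks a concrete vector.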

So this leaves what I believe to be a small set of cases where you
genuinely need a cached-but-lazy list.  Yes, make this a possibility;
I don't deny that it is useful.  But I think your fears about people
complaining about performance if you change the default behavior are
unfounded.  It's relatively easy to say to people, "If you need better
performance when traversing a lazy sequence multiple times, you may
benefit from explicitly realizing the result of the intermediate lazy
computations, or using a cached lazy sequence if that's what you
need."  On the other hand, if you keep things as they are, I can
pretty much guarantee that you will be faced with ongoing posts to the
list of, "Help!  I've got this program that is giving me an out of
memory error, and I can't figure out why."

One issue is that you'd want to make sure that cached-seq behaves
intelligently when someone tries to cache something that's essentially
already cached.  That way people could more easily write generic code
that safely calls cached-seq on various collections to guarantee
certain kinds of time-performance bounds with repeated traversals.  For
example, calling cached-seq on a vector shouldn't really do anything.
But there are already plenty of precedents and facilities in Clojure
for handling this sort of thing.  We already expect seq to "do the
right thing" and not add extra layers of "sequifying" to something
that already is a sequence.  One idea is to have a meta tag of :cached
that applies to things like vectors, sets, maps, lists, but not lazy
lists built with the lazy-cons-uncached variant I've proposed in this
post.  cached-seq acts like an ordinary seq on things with the :cached
tag.  cached-seq of a lazy-cons-uncached cell would essentially just
construct a lazy-cons-cached cell with the same first and rest
function, but with the cache fields added.  In the general case, cached-seq
would generate a lazy list built with lazy-cons-cached.  Of course,
lazy-cons-cached cells would be marked with a :cached meta tag, and
would guarantee that the rest, when realized, is also made into a
cached sequence.
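
In rough pseudocode, using the hypothetical names proposed above:

(defn cached-seq [s]
  (if (:cached (meta s))
    (seq s)                            ; already cached: plain seq
    (when (seq s)
      (lazy-cons-cached (first s)      ; memoizing cell, tagged :cached
                        (cached-seq (rest s))))))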

> There are many
> benefits of the thread-safe once-and-only-once evaluation promise of
> lazy-cons that you may not be considering. It is certainly not a bad
> default.

Well, I've tried as best I can to articulate my reasoning that
lazy-cons-cached is in fact a bad default.  It is potentially
unintuitive, makes correctness and performance bounds hard to verify,
and is difficult to opt out of.  The Scala programming language offers both
Streams (uncached lazy list), and LazyList (cached lazy list), and
generally uses Streams as the default.  F# uses Seqs (uncached lazy
list), and LazyList (cached lazy list), and generally uses Seqs as the
default.  Both of these languages are among the closest comparisons to
Clojure, in terms of meshing practical functional programming within
an existing imperative VM, and they have both made this design choice
to favor uncached lazy lists.  And in fact, it turns out that in those
languages, the cached lazy lists end up rarely used.  And of course,
languages like Python get along just fine with "generators" and no
built-in cached variant at all, other than explicitly realizing to a
list.  Haskell, of course, is the one exception.  They use
cached lazy lists by default, but then again, they really have to,
because it's the only sensible thing in a language where everything is
lazy by default.

> There will soon be a streams/generator library, intended for i/o, that
> will do one-pass non-cached iteration. It will offer a different set
> of tradeoffs and benefits, including reduced ease-of use - not being
> persistent removes a lot of things like first/rest/nth, destructuring
> etc.

Right, I'm talking about making something that works just like any
other sequence, supporting first/rest/nth and destructuring.
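
That is, all the usual idioms would still work on an uncached
sequence:

(let [[a b & more] (map #(* % %) (range 100))]
  [a b (nth more 2)])  ; => [0 1 16]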

>
> But the important thing is tradeoffs/benefits. They always come
> together.

Yes, but hopefully I can convince you that in this case, the tradeoffs
fall clearly on the side of defaulting to uncached lazy lists.

--Mark
