On Fri, Dec 12, 2008 at 6:37 AM, Rich Hickey <richhic...@gmail.com> wrote:
> I appreciate the time you and others have spent on this, and will
> improve filter, but I'm not sure where you are getting your
> presumptions about lazy sequences. They are not a magic bullet that
> makes working with data bigger than memory transparent. They do make
> it possible in some cases, but that is not their intent. Latent
> runtime memory consumption is a fact of life of laziness, and is
> present in Haskell etc. I can make it go away for the specific case of
> filter and the rest of the library, but not generally.
I think this is a great discussion to have, so let me see if I can articulate my thoughts a little better.

First, let's talk about laziness, because I think we're in agreement here. Laziness is undoubtedly a very powerful tool. In Haskell, laziness is the default: every function call is automatically lazy. In Lisp/Scheme, laziness is optional: it's something you have to specifically opt into using delay and force. Which is better? Peter Van Roy, the designer of the Oz programming language, has argued in many places that laziness is not a good default for a general-purpose programming language; he also discusses this point in his excellent book "Concepts, Techniques, and Models of Computer Programming". He has probably explained the point better than I can, so I won't belabor it here, but in a nutshell it comes down to the fact that laziness makes the performance (especially the space performance) of your program much harder to analyze. It is all too easy to write a Haskell program that blows up or performs poorly, and you have no idea why. You can win back some performance by opting out of laziness with strictness annotations, but it can be difficult to figure out how or where to do this. To put it simply, it's easier to write high-performing code when you opt in to laziness where you need it than to figure out how to opt out when it's wrecking things for you. I assume you agree with this, since you have chosen the explicit force/delay "opt-in laziness" model for Clojure.

So now let's talk about sequences, specifically the behavior of lazily-generated sequences. A similar choice needs to be considered here. It is easy to imagine two possible implementations of lazy-cons: lazy-cons-cached would work exactly as lazy-cons currently does, and lazy-cons-uncached would be similar, but would not cache its first and rest values when computed. Sequences built with either version would respond to the same sequence interface and would therefore behave the same, producing the same results as long as the generating functions don't refer to mutable data, but possibly with different performance profiles. lazy-cons-cached sequences might run faster when you traverse the same lazy list more than once, but they also might crash your program by consuming too much memory in places where lazy-cons-uncached would not. It's a tradeoff, and both versions of lazy-cons are potentially useful. But which should be the default, or more specifically, which should be the default used by all of Clojure's library functions like map, filter, etc.? Should these lazily-generated sequences be automatically cached, or not?
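To make the distinction concrete, here is a rough sketch of what I mean by cached versus uncached, written with plain functions and delay/force rather than the actual lazy-cons machinery (take it as an illustration, not a proposed implementation):

    ;; Both cells compute the same value.  The cached one evaluates its
    ;; thunk at most once; the uncached one re-evaluates it on every access.
    (defn uncached-cell [thunk]
      (fn [] (thunk)))                  ; recomputed every time

    (defn cached-cell [thunk]
      (let [d (delay (thunk))]
        (fn [] (force d))))             ; computed once, then remembered

    (def u (uncached-cell #(do (println "computing") (* 3 3))))
    (def c (cached-cell   #(do (println "computing") (* 3 3))))

    (u) (u)   ; prints "computing" twice
    (c) (c)   ; prints "computing" once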
The first question is, does it matter? I would say that yes, this design decision matters a great deal. I think I remember you saying somewhere (in a post here, or possibly in one of your presentations) that sequences, especially lazy sequences, have become a far more important part of Clojure programming than you envisioned at the outset. And it is easy to see why lazy sequences take on a significant role in a fairly pure functional programming language: people coming from imperative languages often ask how to code up the kinds of for/while loops that traverse data structures and build up new ones.

Let's say you want to build a list of all the even squares of the whole numbers below 100. One literal way to formulate this problem in an imperative language (taking Python 2.6 syntax as a sample) would be:

    l = []
    for i in xrange(100):
        square = i*i
        if square % 2 == 0:
            l.append(square)

Now in Clojure, things like this can be written using loop/recur. But any newcomer to Clojure will certainly be told that there's a more elegant way to express this concept:

    (def l (filter even? (map #(* % %) (range 100))))

or the equivalent comprehension. The newcomer's first reaction is to say, "Whoa, but won't that have poor performance? You're generating an awful lot of big, intermediate lists." And the answer is usually, "Well, it's lazy, so no, in fact, this elegant form behaves like the iterative version. You can have your cake and eat it too."

You said, "I'm not sure where you are getting your presumptions about lazy sequences. They are not a magic bullet that makes working with data bigger than memory transparent." Well, actually, I know that cached lazy sequences do not always behave like their iterative counterparts. But this is where those presumptions are born. Iterative stuff is ugly in functional languages. We are told to use comprehensions, map, filter, and tricks like the indexed function from Stuart Halloway's blog, which builds a temporary sequence of index/value pairs to avoid certain imperative patterns. And we expect the behavior to be similar, because most of the time it is. When something blows up from some subtle capturing of a lazy list, we feel frustrated and betrayed.

Next, let's consider whether it is easier to opt in or opt out of caching your sequences. Opting out of sequence caching is very difficult. Right now, all the built-in sequence functions use a cached lazy-cons, and there is no way to undo that caching. The first line of defense is to make sure that you don't give an intermediate, temporary name to any of the lazy lists you are transforming. In other words, you should never do something like:

    (def squares (map #(* % %) (range 100)))
    (def even-squares (filter even? squares))

This kind of thing could crash your program with large lists. It's irritating to have to be careful about this, because it seems intuitive that giving a temporary name to something shouldn't be the kind of thing that could crash your program. But okay, let's say you learn to be really careful and to avoid naming your lists. Then sometimes your lists get captured by closures, or other subtle corner cases, and it just gets frustrating. Using lazy sequences, which are pervasive in Clojure, should not be so difficult. Right now the only solution is to rewrite one's code to use loop/recur, and it sounds like eventually there will be an alternative "generators" system, which you say you're envisioning more as an I/O tool with a special interface rather than as a general-purpose form of sequences.
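To spell out the kind of crash I mean when an intermediate sequence gets a name, here is a rough illustration (the sizes are made up):

    ;; squares is a def'd name, so it keeps a reference to the head of the
    ;; cached lazy seq.  Every element realized while computing biggest-even
    ;; is therefore retained rather than garbage collected, and with input
    ;; this large the traversal can exhaust the heap.
    (def squares (map #(* % %) (range 100000000)))
    (def biggest-even (last (filter even? squares)))

With an uncached default, the same two definitions would traverse in constant space.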
On the other hand, opting in to sequence caching is very easy. You only gain the benefits of a cached sequence when you use it more than once, which inherently means you have to name it or bind it to something in order to refer to it multiple times. So if I'm going to be traversing a sequence multiple times, I just opt in by calling a function that turns it into a cached sequence. (Let's call it "cached-seq". I know there is such a function in the Clojure API already, but I haven't verified that it works the way I mean here, so if I'm using the term differently from the way cached-seq actually works in the current API, please bear with me for the sake of argument.)

So with an opt-in system, where range, map and filter use lazy-cons-uncached, you could easily write something like:

    (def squares (map #(* % %) (range 100)))
    (def even-squares (cached-seq (filter even? squares)))

This means that squares behaves exactly like its iterative counterpart, despite the fact that I named it, and even-squares is set up with caching because I intend to use it repeatedly later in my program.

Is the opt-out or the opt-in system more intuitive, and which is easier to analyze from a correctness and performance standpoint? I'd argue that the opt-out system has serious problems with respect to intuitiveness and to confirming correctness and performance bounds. Programmers really want these things (most of the time) to behave like the iterative loops they are replacing, and it can be very subtle to detect the kinds of things that cause an explosion of memory consumption. The filter-map explosion issue is a great case in point. You can test it on a wide variety of inputs, even big inputs, and it seems to work fine. But when you pass it a combination of filter and map that results in a large "window" between filtered elements in the mapped list, it blows up. Even if you patch the specific behavior of filter, this is indicative of a larger problem: filter is currently written in the most natural way using lazy-cons, other programmers are going to write similar functions using lazy-cons, and those programs will all be brittle and prone to unpredictable failure on certain kinds of large inputs.

On the other hand, the opt-in system is fairly straightforward to understand. If a sequence is not explicitly cached, you can expect no memory to be consumed by a traversal, and the traversal time will be the same every time you execute it. Caching a sequence becomes explicit, and is then clearly identified in the code as such.

> Not caching would lead to excessive recalculation in many contexts, and
> then people would be complaining about performance.

Let's talk about performance. First, I would claim that the vast majority of lazy sequences (especially the really big ones) are used as one-off temporary sequences, to express more elegantly something that would be expressed via looping in an imperative language. For these cases (which I believe to be the most common case), you gain nothing by caching, and in fact you lose something: a small amount of extra memory consumption and garbage collection, and increased brittleness and unpredictability. Then there are sequences which are traversed more than once but whose computation is fairly trivial (e.g., the output of range). In these situations, it's probably a wash. I remember reading a paper not long ago (can't remember exactly which one, sorry) pointing out that programmers' intuitions about the need to cache intermediate results are often wrong, and that simple computations should just be recomputed. Chips have gotten really fast, and one of the biggest performance hits these days is a cache miss, so if storing something means the program has to go off and look in the "slow part of memory" to find it, you're worse off than if you had just recomputed it.
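Here is a rough sketch of that tradeoff (the sleep is just a stand-in for genuinely expensive per-element work):

    ;; With today's cached lazy-cons, the second reduce is nearly free.
    ;; With an uncached default it would redo the work, and code that really
    ;; does traverse twice would opt in (via something like cached-seq) or
    ;; realize the data into a vector instead.
    (defn slow-square [x]
      (Thread/sleep 1)       ; pretend this is expensive
      (* x x))

    (def slow-squares (map slow-square (range 1000)))

    (time (reduce + slow-squares))   ; pays for computing every element
    (time (reduce + slow-squares))   ; fast today (cached); recomputed if uncached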
Sometimes, if you know you are going to be traversing a sequence more than once, you are better off converting it to an explicitly realized list or putting the data in a vector. So this leaves what I believe to be a small set of cases where you genuinely need a cached-but-lazy list. Yes, make this a possibility; I don't deny that it is useful. But I think your fear that people will complain about performance if you change the default behavior is unfounded. It's relatively easy to say to people, "If you need better performance when traversing a lazy sequence multiple times, you may benefit from explicitly realizing the result of the intermediate lazy computations, or from using a cached lazy sequence if that's what you need." On the other hand, if you keep things as they are, I can pretty much guarantee you will be faced with ongoing posts to the list saying, "Help! I've got this program that is giving me an out-of-memory error, and I can't figure out why."

One issue is that you'd want to make sure that cached-seq behaves intelligently when someone tries to cache something that is essentially already cached. That way people could more easily write generic code that safely calls cached-seq on various collections to guarantee certain time-performance bounds over repeated traversals. For example, calling cached-seq on a vector shouldn't really do anything. But there are already plenty of precedents and facilities in Clojure for handling this sort of thing; we already expect seq to "do the right thing" and not add extra layers of "sequifying" to something that is already a sequence. One idea is to have a :cached meta tag that applies to things like vectors, sets, maps, and lists, but not to lazy lists built with the lazy-cons-uncached variant I've proposed in this post. cached-seq would act like an ordinary seq on things with the :cached tag. cached-seq of a lazy-cons-uncached cell would essentially construct a lazy-cons-cached cell with the same first and rest functions, but with the cache fields added. In the general case, cached-seq would generate a lazy list built with lazy-cons-cached. Of course, lazy-cons-cached cells would be marked with the :cached meta tag, and would guarantee that the rest, when realized, is also made into a cached sequence.
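As a rough sketch of what that might look like (lazy-cons-cached, lazy-cons-uncached and the :cached tag are the constructs proposed in this post, not existing API, and this assumes lazy-cons-cached delays its rest expression the way lazy-cons does today):

    ;; Hypothetical sketch only.
    (defn cached-seq [coll]
      (if (:cached (meta coll))
        (seq coll)                        ; vectors, maps, already-cached cells: nothing extra to do
        (when-let [s (seq coll)]
          (lazy-cons-cached (first s)                 ; cache the first element,
                            (cached-seq (rest s)))))) ; and, lazily, the rest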
> There are many
> benefits of the thread-safe once-and-only-once evaluation promise of
> lazy-cons that you may not be considering. It is certainly not a bad
> default.

Well, I've tried as best as I can to articulate my reasoning that lazy-cons-cached is in fact a bad default: it is potentially unintuitive, hard to verify for correctness and performance bounds, and difficult to opt out of. The Scala programming language offers both Streams (an uncached lazy list) and LazyList (a cached lazy list), and generally uses Streams as the default. F# offers both Seqs (an uncached lazy list) and LazyList (a cached lazy list), and generally uses Seqs as the default. These two languages are among the closest comparisons to Clojure, in terms of meshing practical functional programming with an existing imperative VM, and they have both made this design choice in favor of uncached lazy lists. And in fact, it turns out that in those languages, cached lazy lists end up rarely used. Languages like Python get along just fine with "generators" and no built-in cached variant at all, other than explicitly realizing to a list. Haskell, of course, is the one exception: it uses cached lazy lists by default, but then again, it really has to, because that's the only sensible thing in a language where everything is lazy by default.

> There will soon be a streams/generator library, intended for i/o, that
> will do one-pass non-cached iteration. It will offer a different set
> of tradeoffs and benefits, including reduced ease-of-use - not being
> persistent removes a lot of things like first/rest/nth, destructuring
> etc.

Right, I'm talking about making something that works just like any other sequence, supporting first/rest/nth and destructuring.

> But the important thing is tradeoffs/benefits. They always come
> together.

Yes, but hopefully I can convince you that in this case, the tradeoffs fall clearly on the side of defaulting to uncached lazy lists.

--Mark