On 6 Jul, 09:07, Ken Wesson <kwess...@gmail.com> wrote:
> On Tue, Jul 5, 2011 at 4:32 PM, Alessio Stalla <alessiosta...@gmail.com> wrote:
> > On 5 Jul, 18:49, Ken Wesson <kwess...@gmail.com> wrote:
> >> 1. A too-large string literal should have a specific error message,
> >> rather than generate a misleading one suggesting a different type of
> >> problem.
>
> > There is no such thing as a too-large string literal in a class file.
>
> That's not what Patrick just said.
Not really; he said "65535 characters [...] is the maximum length of a String literal in Java" (actually, it is 65535 *bytes*, not characters) and "Clojure will generate a corrupt class if that String is too long". The class file is incorrect, but it doesn't contain any string longer than 64KB, because by definition it cannot. There is no way for the JVM to detect a string which is too long: the generated class file contains a legal string followed by garbage, and it's that garbage that makes the class file illegal. The JVM can't know that the string plus the garbage was originally intended to be one longer string. In fact, for certain longer-than-64KB strings, a syntactically valid class file could even be generated.

> >> 2. The limit should not be different from that on String objects in
> >> general, namely 2147483647 characters which nobody is likely to hit
> >> unless they mistakenly call read-string on that 1080p Avatar blu-ray
> >> rip .mkv they aren't legally supposed to possess.
>
> > That's a limitation imposed by the Java class file format.
>
> And therefore a bug in the Java class file format, which should allow
> any size String that the runtime allows. Using 2 bytes instead of 4
> bytes for the length field, as you claim they did, seems to be the
> specific error. One would have thought that Java of all languages
> would have learned from the Y2K debacle and near-miss with
> cyber-armageddon, but limiting a field to 2 of something instead of 4
> out of a misguided perception that space was at a premium was exactly
> what caused that, too!

A bug is a discrepancy between specification and implementation. Here, there's no discrepancy: the JVM is correctly implementing the spec. Now, you might argue that the spec is badly designed, and I might agree, but it's not a "bug in Oracle's Java implementation" -- any conforming Java implementation must have that (mis)feature.

> >> 3. Though both of the above bugs are in Oracle's Java implementation,
>
> > By the above, 1.
> > is a Clojure bug and 2. is not a bug at all.
>
> Oh, 2 is a bug alright. By your definition, Y2K bugs in a piece of
> software would also not be bugs. The users of such software would beg
> to differ.

User perception and bugs are different things.

> >> it would seem to be a bug in Clojure's compiler if it is trying to
> >> make the entire source code of a namespace into a string *literal*
> >> in dynamically-generated bytecode somewhere rather than a string
> >> *object*.
>
> > Actually it seems it's the IDE, rather than Clojure, that is
> > evaluating a form containing such a big literal. Since Clojure has no
> > interpreter, it needs to compile that form.
>
> The same problem has been reported from multiple IDEs, so it seems to
> be a problem with eval and/or load-file.

OK, sorry, I didn't know that.

> The question is not why they might be using String *objects* that
> exceed 64K, since they'll need to use Strings as large as the file
> gets*. It's why they'd *generate bytecode* containing String
> *literals* that large.
>
> And it's not IDEs that generate bytecode; it's
> clojure.lang.Compiler.java that generates bytecode in this scenario.

Yes, ultimately there's a bug in the compiler, regardless of where (form "with-looo....ooong-string") comes from.

> * There is a way to reduce the size requirements; crudely, line-seq
> could be used to implement a lazy seq of top-level forms, built by
> consuming lines until delimiters are balanced and then emitting a new
> form string, then evaluating these forms one by one. This works with
> typical source files that have short individual top-level forms and at
> least one line break between any two such forms, and would allow
> consuming multi-gig source files if anyone ever had need for such a
> thing (I'd hope never to see it unless it was machine-generated for
> some purpose).
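For what it's worth, the balanced-delimiter idea just described can be sketched in a few lines of Java (FormChunker and everything else here are my names, purely illustrative; it deliberately ignores string literals, comments and reader macros, so it is the shape of the idea, not a real reader):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

public class FormChunker {
    // Crude sketch: accumulate lines until parens/brackets/braces balance,
    // then emit the accumulated text as one top-level form string.
    static List<String> topLevelForms(Reader in) throws IOException {
        BufferedReader r = new BufferedReader(in);
        List<String> forms = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int depth = 0;
        String line;
        while ((line = r.readLine()) != null) {
            current.append(line).append('\n');
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (c == '(' || c == '[' || c == '{') depth++;
                else if (c == ')' || c == ']' || c == '}') depth--;
            }
            // Delimiters balanced at a line boundary: one top-level form done.
            if (depth == 0 && current.toString().trim().length() > 0) {
                forms.add(current.toString().trim());
                current.setLength(0);
            }
        }
        return forms;
    }
}
```

Each emitted form could then be read and evaluated individually, so no single string need ever approach 64KB.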
> Less crudely, a reader for files could be implemented that didn't
> just slurp the file and call read-string on it, but instead read from
> an IO stream and emitted a seq of top-level forms converted already
> into reader-output data structures (but unevaluated). In fact,
> read-string could then be implemented in terms of this and a
> StringInputStream whose implementation is left as an exercise for the
> reader but which ought to be nearly trivial.

Yes, that's a good way of circumventing the problem when dealing with reasonable input files. But I don't think this is the problem we're talking about; read on.

> >> Sensible alternatives are a) get the string to whatever
> >> consumes it by some other means than embedding it as a single
> >> monolithic constant in bytecode,
>
> > This is what we currently do in ABCL (by storing literal objects in a
> > thread-local variable and retrieving them later when the compiled code
> > is loaded), but it only works for the runtime compiler, not the file
> > compiler (in Clojure terms, it won't work with AOT compilation).
>
> Yes, this is the same issue raised in connection with allowing
> arbitrary objects in code in eval.
>
> >> b) convert long strings into shorter
> >> chunks and emit a static initializer into the bytecode to reassemble
> >> them with concatenation into a single runtime-computed string
> >> constant stored in another static field,
>
> > This is what I'd like to have :)
>
> Frankly it seems like a bit of a hack to me, though since it would be
> used to work around a Y2K-style bug in Java it might be poetic justice
> of a sort.

It is a sort of hack, yes. An alternative might be to store constants in a resource external to the class file, unencumbered by silly size limits.

> >> and c) restructure whatever consumes
> >> the string to consume a seq, java.util.List, or whatever of strings
> >> instead and feed it digestible chunks (e.g.
> >> a separate string for each
> >> defn or other top-level form, in order of appearance in the input
> >> file -- surely nobody has *individual defns* exceeding 64KB).
>
> > The problem is not in the consumer, but in the form containing the
> > string; to do what you're proposing, the reader, upon encountering a
> > big enough string, would have to produce a seq/List/whatever instead,
> > the compiler would need to be able to dump such an object to a class,
> > and all Clojure code handling strings would have to be prepared to
> > handle such an object, too. I think it's a little impractical.
>
> I don't think so. The problem isn't with normal strings but only with
> strings that get embedded as literals in code; and moreover, the
> problem isn't even with those strings exceeding 64k but with whole
> .clj files exceeding 64k. The implication is that load-file generates
> a class that contains the entire contents of the sourcefile as a
> string constant for some reason; so:
>
> a) What does this class do with this string constant? What code
> consumes it?

Hmm, I don't think it's like you say. Without knowing anything about Clojure's internals, it seems to me that the problem is more likely to be in a form like the one Patrick posted,

(clojure.lang.Compiler/load (java.io.StringReader. "the-whole-file-as-a-string"))

which is compiled in order to be evaluated in order to compile and load the file... it is that form, and not the file to be compiled, that generates the incorrect class file.

> b) Can that particular bit of code be rewritten to digest the same
> information provided in smaller chunks?

If Patrick is right, and I think he is, then the compiler has to compile (java.io.StringReader. "the-whole-file-as-a-string") in such a way that "the-whole-file-as-a-string" does not appear literally in the class file: it either has to somehow split the string, or load it from somewhere else.
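As an illustration of the "split the string" route (a sketch only; ChunkedLiteral and its helpers are hypothetical names, not anything in Clojure or ABCL): chunk on modified-UTF-8 byte length, since that -- not the char count -- is what the 65535 limit on a constant pool string entry measures, and reassemble once at load time, which is roughly what the static initializer of option b) would do:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkedLiteral {
    // Byte length of a char in the class file's "modified UTF-8" encoding;
    // the 65535 cap on a CONSTANT_Utf8 entry counts these bytes, not chars.
    static int mutf8Bytes(char c) {
        if (c >= 0x0001 && c <= 0x007F) return 1; // plain ASCII
        if (c <= 0x07FF) return 2;                // incl. NUL, encoded as 2 bytes
        return 3;                                 // everything else, incl. surrogate halves
    }

    // Split s into pieces each at most maxBytes long in modified UTF-8, so
    // each piece fits in its own constant pool entry. (A real implementation
    // would also avoid splitting a surrogate pair across two chunks.)
    static List<String> chunk(String s, int maxBytes) {
        List<String> parts = new ArrayList<>();
        int start = 0, bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            int b = mutf8Bytes(s.charAt(i));
            if (bytes + b > maxBytes) {
                parts.add(s.substring(start, i));
                start = i;
                bytes = 0;
            }
            bytes += b;
        }
        parts.add(s.substring(start));
        return parts;
    }

    // What the emitted static initializer would effectively do at load time:
    static String reassemble(List<String> parts) {
        StringBuilder sb = new StringBuilder();
        for (String p : parts) sb.append(p);
        return sb.toString();
    }
}
```

Note how a 50000-char string of non-ASCII characters already needs two chunks, even though it is well under 65535 *characters* -- exactly the bytes-vs-characters point above.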
> > Regarding the size of individual defns, that's an orthogonal problem;
> > anyway, the size of the _bytecode_ for methods is limited to 64KB
> > (see <http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html#88659>)
> > and, while pretty big, it's not impossible to reach it, especially
> > when using complex macros to produce a lot of generated code.
>
> Another problem for which we will probably need an eventual fix or
> workaround. If bytecode can contain a JMP-like instruction it should
> be possible to have the compiler split long generated methods and
> chain the pieces together without much loss of runtime efficiency,
> particularly if it does so at "natural" places -- existing conditional
> branches, particularly, and (loop ...) borders. For example,
> (defn foo (if x (lotta-code-1) (lotta-code-2))) can be trivially
> converted to (defn foo (if x (lotta-code-1) (jmp bar))) plus
> (defn bar (lotta-code-2)) -- though if there were such a jump
> instruction I'd have thought implementing real TCO would have been
> fairly easy, and apparently it was not.

In fact, no JMP-like instruction exists that can jump to a different method.

> Failing such a jmp capability you'd have to just use (bar) in that
> last example and suffer an additional method call overhead at the
> break-point. Again, the obvious way to do it would be to recognize
> common branching construct forms such as (if ...) and (cond ...) that
> are larger than the threshold but have individual branches that are
> not, and turn some or all of the branches into their own
> under-the-hood methods and calls to those methods.

The positive thing is that, for frequently called methods, the method call overhead disappears thanks to HotSpot. The negative thing is that splicing bytecode is harder than it seems, because of jumps and exception handlers that might be present.
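At source level, the branch extraction Ken describes would amount to something like the following (illustrative Java with stand-in names; an actual implementation would transform bytecode and, as noted, has to be careful around jumps and exception handlers):

```java
public class MethodSplit {
    // Before: (defn foo (if x (lotta-code-1) (lotta-code-2))) compiled into
    // a single method whose bytecode could exceed the 64KB-per-method limit.
    // After: each arm extracted into its own method, each under the limit;
    // foo pays only one extra method call, which HotSpot will typically
    // inline on hot paths anyway.
    static int foo(boolean x) {
        return x ? lottaCode1() : lottaCode2();
    }

    // Stand-ins for large generated bodies that each stay under the limit.
    static int lottaCode1() { return 1; }
    static int lottaCode2() { return 2; }
}
```

The transformation preserves behavior; only the per-method bytecode size changes.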
> > We used to generate such big methods in ABCL because at one point we
> > tried to spell out in the bytecode all the class names corresponding
> > to functions in a compiled file, in order to avoid reflection when
> > loading the compiled functions. For files with many functions (> 1000
> > iirc) the generated code became too big. It turned out that this
> > optimization had a negligible impact on performance, so we reverted
> > it.
>
> I wonder if Clojure is using a similar optimization and would benefit
> from its reversion.

Might be. One day I really need to look at how the Clojure compiler works, especially the loading of compiled files -- that could be a great source of inspiration.

Regards,
Alessio

--
You received this message because you are subscribed to the Google Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en