On Sun, Dec 20, 2015 at 9:43 PM, Nadav Rotem <nro...@apple.com> wrote: > > On Dec 20, 2015, at 2:17 AM, Dmitri Gribenko <griboz...@gmail.com> wrote: > > + Stephen Canon, because he probably has good ideas in this domain. > > On Fri, Dec 18, 2015 at 3:42 PM, Nadav Rotem via swift-dev > <swift-dev@swift.org> wrote: >> >> >> What’s next? >> >> The small experiment I described above showed that compressing the names >> in the string table has a huge potential for reducing the size of swift >> binaries. I’d like for us (swift-developers) to talk about the implications >> of this change and start working on the two tasks of tightening our existing >> mangling format and on implementing a new compression layer on top. > > > > Hi Dmitri, > > Hi Nadav, > > This is a great start that shows that there is a potential for improvement > in our mangled names! > > To make this effort more visible, I would suggest creating a bug on > https://bugs.swift.org/ . > > I think we survey existing solutions that industry has developed for > compressing short messages. What comes to mind: > > - header compression in HTTP2: > https://http2.github.io/http2-spec/compression.html > > - PPM algorithms are one of the best-performing compression algorithms for > text. > > > It is important to keep in mind that we are not inventing a new > text-compression algorithms here. We are solving a specific problem with > specific characteristics and constraints. Nothing is groundbreaking here.
This is exactly the case for using a specialized algorithm: a well-defined problem with specific constraints, not a general case of arbitrary input. > Lempel-Ziv based algorithms rely on repetitions within the text, and can use > some prior statistical knowledge on the text (pre-computed tables). Our > mangled names are relatively short and don’t have significant repetitions, > so having a reference to some pre-computed table, and not having an internal > reference is obvious. Names are formed according to well-defined mangling rules with their own redundancy and patterns, and identifiers usually come from English. This is why using a high-order statistical model makes sense. Pre-computed tables on the other hand have their own issues: since we have a runtime demangler that our metadata parsing relies upon, the whole precomputed table of known substitutions has to be a part of the runtime, and not just the demangler. This creates interesting constraints for bare-metal environments for example. I'm not saying that it is wrong to use them, I'm just saying that the trade-off is not so clear-cut. > On top of the basic encoding layer we need to apply some kind of variable > length encoding (Cameron suggested Huffman). The restricted character set is > not a problem here. To encode a string of bits using the charset > [a-zA-Z0-9_] just do: mod63, div63, mod63, div63, until you consume the > whole bit stream. > > - Arithmetic coding is also a natural starting point for experimentation. > > Since the input mangled name also comes in a restricted character set, we > could also remove useless bits first, and try an existing compression > algorithm on the resulting binary string. > > > The upper bits never come into play. You can conceptually imagine that bytes > have ~6 bits (63 combinations). You only need to worry about these kind of > things when you use variable length encoding (see my previous comment about > Cameron’s suggestion to use Huffman encoding as a second pass). > > > We should also build a scheme that uses shortest one between the compressed > and non-compressed names. > > For running experiments it would be useful to publish a sample corpus of > mangled names that we will be using for comparing the algorithms and > approaches. > > > nm -a libswiftCore.dylib > strings.txt > > I also have a concern about making mangled names completely unreadable. > Today, I can frequently at least get a gist of what the referenced entity is > without a demangler. What we could do is make the name consist of a > human-readable prefix that encodes just the base name and a compressed > suffix that encodes the rest of the information. > > _T<length><class name><length><method name><compressed suffix> > > > When are you looking at symbols directly? SIL already shows the detangled > names as comments, and for everything else there is a demangler. System tools to inspect object files on Linux don't know about Swift mangling. Yes, I can pass their output through a demangler, sometimes (when they don't have a GUI for example). Dmitri -- main(i,j){for(i=2;;i++){for(j=2;j<i;j++){if(!(i%j)){j=0;break;}}if (j){printf("%d\n",i);}}} /*Dmitri Gribenko <griboz...@gmail.com>*/ _______________________________________________ swift-dev mailing list swift-dev@swift.org https://lists.swift.org/mailman/listinfo/swift-dev