Hi,

> I think I found the article I was referring to:
> https://www.compose.com/articles/elasticsearch-security-update-groovy-
> scripting-dropped/
> (2015-03):
> "After talking with the Groovy developers, Elasticsearch have decided
> that Groovy will never be sufficiently safe in a sandbox and have
> removed it from the list of sandboxed languages."
> 
> ( I have used Solr, but not Elasticsearch. It seems they now have
> "Painless", a Groovy-subset-compatible language as their main scripting
> language:
> https://www.elastic.co/guide/en/elasticsearch/painless/6.2/painless-
> specification.html
> ... )

Yes, that script language was designed from scratch with only invokedynamic and 
MethodHandles in mind. It also uses a pure whitelist-approach (explicit 
whitelisting what you can call, so no blacklists needed), but still allows most 
groovy syntax. As Elasticsearch is very performance critical (e.g. you may call 
a script for every search result which may be up to billion of times per search 
execution), we tried to avoid boxing and dynamic lookups by using extensive 
caching in the indy call sites. On the other hand it still allows the "def" 
type. The compiler tries to guess most types during compilation and only if it 
cannot guess the type it writes an invokedynamic. But nevertheless, by using 
invokedynamic with good caches (we go up to megamorphic if needed!) the 
slowdown by using invokedynamic for most scripts is neglectible!!! (less than 
10% when full hotspot compilation is done). The main reason for creating a new 
script language was not only security (this was just good for selling it to 
users at that time), but also speed. Groovy behaves very bad when types are 
variying and especially in the "def" case it uses way too much reflection (the 
invokedynamic part has no good caches and still uses reflection all the time on 
the dynamic lookup in the call site type handler).

The script language also supports lambdas and method references, but in 
contrast to Groovy, those are lambdas, not closures! So it uses 
LambdaMetaFactory (actually it uses it's own implementation, because 
LambdaMetaFactory is too strict with types and has some bugs in Java 9 that 
make it behave differently than the spec; java code does not see this, but 
dynamic languages break heavily with Java 9 when they use LambdaMetaFactory).

As said before, the compiler tries to guess all types at compile time and 
passes the native types to the invokedynamic, unless it can directly generate a 
real method call (everything is known). For dynamic ("def") stuff it delegates 
everything to DefBootstrap (https://goo.gl/DdNJJG), which has several modes:

- simple method calls: here it uses a polymorphic cache with up to 5 receiver 
items (the usual guardWithTest chain using MethodHandles). Once it sees more 
receivers, it reverts the whole cache and goes megamorphic (it builds a 
methodhandle that calls ClassValue.get on every call). This is slower, but 
better than resolving every time. Keep in mind: This one has no bugs with 
ClassValue, as all seen types are local and can be cleaned up by GC correctly. 
The problems of Groovy with ClassValue are homemade, sorry. ClassValue works 
perfectly, if you do not misuse it!!! Unfortunately, Groovy has a monomorphic 
cache only, which was one reason why groovy was behaving bad in Elasticsearch!
- fields, array loads/stores: same as method calls, it just translates to a 
methodhandle like for simple method calls (including all caches), as mentioned 
above.
- (binary-)operators: As you have mostly 2 variable types here, it's hard to 
build good caches. So it goes for a monomorphic cache here. In most cases this 
is not an issue, as operator constructs don't have varying types (unless both 
sides are DEF). Still, once seen at runtime it stays constant for most cases.
- references (these are method references): Like in Java, lambdas (with/without 
captures) are also compiled to static/instance methods in the script class.

Lambdas are also tried to be resolved at compile time, but you can also call a 
method taking a functional interface on the "def" type! In that case the above 
"references" variant is used (the DefBootstrap then delegates after type 
resolving to the LambdaBootstrap). At the end of an invokedynamic call, field 
access, array lookup, map lookup, list lookup, lambda lookup is just a compiled 
method handle including all MIC, PIC, Megamorphic caching logic (using those 
many methods in Java's "MethodHandles" class for combining MHs)! Except for 
Bootstrapping method references, no byte code is generated at runtime (of 
course java runtime does it, but we do not). We just combine methodhandles!

Most static definitions that are used in created method handles are part of the 
"Def" (https://goo.gl/g7GD5t) and "DefMath" (https://goo.gl/AN4SJ1) classes. If 
you look at Def.java, you also see many cases where it backports some 
MethodHandles stuff only available in Java 9+, but falls back to Java 9 native 
code when Java 9 is used. Actually some methods missing in Java 8's 
MethodHandles implementation were added to Java 9, because of our feedback 
(e.g. array length lookup for implementing array[].length using invokedynamic).

One thing: Because of the strict types and complexity/slowness with caching, 
Painless does not allow method overloading. The above whitelist uses alias 
names for methods that use overloading. One famous example are the group() 
methods on regex matchers (named, unnamed groups). Some users complained, but 
that's under control. The whitelist also adds addon methods (like Groovy) to 
some java classes in the same way like aliases.

> > In my opinion that project is wrong, because the security manager
> > mechanisms provide enough protection. The problem is that rarely
> > anyone can use a security manager properly. Anyway... Groovy won't be
> > able to do any call Java cannot do in principle in this version. That
> > is not because of keeping security in mind, that is more because of
> > the module system, that enforces this

This is partly right, but also not always applicable in that simple form. 
Elasticsearch uses SecurityManager to encapsulate all of its plugins, script 
engines,... and also itsself. It prevents stuff like deep reflection 
(setAccessible is completely disallowed anywhere in Elasticsearch) or 
System.exit/halt(). File system is also restricted to not escape config/index 
folders. I think Elasticsearch's implementation of SecurityManager is one ogf 
those implementation that really work correctly. It opened a lot of bug reports 
in foreign projects to fix their code (e.g, missing doPrivileged or calls to 
setAccessible without proper try/catch). Of course this helps to make 
Elasticsearch safe for the typical security issues (code escapes sandbox). But 
on the other hand it does not help, if code in a script calls some public class 
in Elasticsearchs's core and executes a command to delete an index from inside 
a query script.

Of course you could also guard all public APIs of Elasticsearch with 
securitymanager, but as you know, many stuff is very performance critical, so 
you cannot really guard everything. Of course in the future, all access to 
lower level "Apache Lucene" or internals can be shielded by the module system, 
but that's not yet possible (Elasticsearch has Java 8 as minimum requirement).

Groovy has the functionality of blacklists (so you can exclude certain classes 
from being called by scripts), but Elasticsearch decided to design it's script 
language the other way round. It is solely "whitelist" based. In short, you 
cannot access anything from the Elasticsearch/Java code that is not explicitely 
whitelisted (that's called "Definition" in Painless): https://goo.gl/ehdjvh 
with those whitelists: https://goo.gl/adCU88

> Whatever the reason it will be more secure, my line of thought was, that
> if some (clever) way to work around this exists, to maybe still not go
> along that route, even though I don't like to see this feature go and
> people who use this in tests will surely miss it...
> (Of course that argument is mute if they could have fixed their problem
> through proper security manager use).

I agree with that! That was the main reason to redesign the script language in 
Painless. It was made mostly compatible to Groovy, but the backend and compiler 
was designed to use MethodHandles and Invokedynamic only. There is no 
reflection inside (some parts of the above definition use unreflect, but that's 
more on startup only when building the whitelist graph, because reflection 
allows to better inspect stuff at runtime, while MethodHandles are typed from 
the beginning - you cannot get a methodhandle without exact types). During 
execution of a script there is nowhere a call to any reflective field/method 
nor is there setAccessible anywhere. The Invokedynamic call site does 
everything (includes caching) with MethodHandles: DefBootstrap, LambdaBootstrap.

BTW: There are also some goodies in this script engine: If it figures out at 
runtime, that Java 9 is used, the byte code generated does not use 
StringBuilder when concatting strings, instead it uses invokedynamic with 
StringConcatFactory. The code to do that is quite "generic" and applied with a 
subclass of ASM's GeneratorAdapter: 
https://github.com/elastic/elasticsearch/blob/6dadce47613a3c69d928940bcc1b2043e0a0184a/modules/lang-painless/src/main/java/org/elasticsearch/painless/MethodWriter.java#L238-L292
 (if I have some time, I will extend it to use the better 
makeConcatWithConstants() to improve strings with many constant parts).


Uwe

Reply via email to