[
https://issues.apache.org/jira/browse/METRON-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16484013#comment-16484013
]
ASF GitHub Bot commented on METRON-1568:
----------------------------------------
Github user cestella commented on a diff in the pull request:
https://github.com/apache/metron/pull/1021#discussion_r189916513
--- Diff:
metron-platform/metron-common/src/main/java/org/apache/metron/common/configuration/enrichment/handler/StellarConfig.java
---
@@ -142,8 +143,14 @@ else if(kv.getValue() instanceof List) {
{
--- End diff --
> It feels like '_' conflates the 'messaging' with the language.
I hear you there; that's why support for this variable is VariableResolver-
specific. You could very well have a variable resolver that does NOT support
it (it's just that all of ours happen to). I'd argue that it's not even really
part of the language as it's a feature of the variable resolver rather than the
parsing infrastructure.
One reason this was done as a variable is that the split/join topology
needs to know which fields are used, and it learns that by inspecting the
variables referenced in the Stellar expressions (this way we send only the
required fields to the individual Stellar adapter workers). I had contemplated
adjusting the interface or passing along the VariableResolver in the spark
context, but that didn't feel right either: it was more complex, and it
mandated that VariableResolvers support `_`, which not all can do.
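To make the "feature of the variable resolver" point concrete, here is a minimal Java sketch. `SimpleResolver` and `MessageResolver` are hypothetical names for illustration only and are not Metron's actual `VariableResolver` API; the sketch only assumes a resolver that looks names up in a message map and treats `_` as "the whole message", returned read-only.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a variable resolver; not Metron's real interface.
interface SimpleResolver {
  Object resolve(String variable);
  boolean exists(String variable);
}

// Resolves ordinary names against the message and "_" to the whole message.
class MessageResolver implements SimpleResolver {
  static final String ALL_FIELDS = "_";
  private final Map<String, Object> message;

  MessageResolver(Map<String, Object> message) {
    // Defensive copy so later changes to the caller's map are not visible here.
    this.message = new HashMap<>(message);
  }

  @Override
  public Object resolve(String variable) {
    if (ALL_FIELDS.equals(variable)) {
      // Read-only view: functions can inspect the message but not mutate it.
      return Collections.unmodifiableMap(message);
    }
    return message.get(variable);
  }

  @Override
  public boolean exists(String variable) {
    return ALL_FIELDS.equals(variable) || message.containsKey(variable);
  }
}
```

A resolver that cannot expose the whole message would simply not special-case `_`, and nothing in the parser would need to change.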
> Also, I hope some of these MAAS applications make it into Metron /contrib
;)
You will get your wish as this is the preliminary PR for one of them going
into Metron. It's actually not a MaaS model, but a semantic hash function
backed by Word2Vec that fits into the `HASH()` infrastructure you and JJ
created.
>I'm saying why not just have the _ in the configuration side, and just
have the scripts reference the vars by name and not have to MAP_GET()?
So, the model scripts wouldn't reference `MAP_GET`. I was going to wait
and put out a discuss thread, but perhaps an example of what I'm contributing
next will illuminate the need. The model in question's job is to take the
whole message and generate a hash from it such that messages that are similar
have the same hash. This is similar to the forensic clustering use case that
I wrote up, but it's customized to your data and does not presume that the
user is constructing a string.
The model itself knows about the schema because it's specific to your data.
For instance, if you build the model on netflow data, it'll know about netflow
fields:
* source computer/port
* destination computer/port
* packet count
* byte count
* duration
Now, I need a way to pass the whole message into the `HASH()` function.
One way of doing it would be:
`HASH( { 'ip_src_addr' : ip_src_addr, 'ip_dst_addr' : ip_dst_addr,
'ip_src_port' : ip_src_port, 'ip_dst_port' : ip_dst_port, 'packet_count' :
packet_count, 'duration' : duration, 'byte_count' : byte_count}, 'SEMHASH', {
'model' : OBJECT_GET('/path/to/model.ser') })`
Rather than doing that, I'd rather let the model select the relevant fields
like so:
`HASH( _ , 'SEMHASH', { 'model' : OBJECT_GET('/path/to/model.ser') })`
Similar situations exist with MaaS models as well, where the model knows
which fields it cares about, and spelling out the translation by hand can
become onerous for the user as the number of input fields grows.
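As a rough illustration of "let the model select the relevant fields", the Java sketch below shows a model-side wrapper that is handed the whole message and keeps only the fields it was built on. `FieldSelectingModel` and its `select` method are hypothetical names, not part of Metron or of the upcoming PR; the schema list is supplied by the caller for the sake of the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model wrapper: it knows its own schema, so callers can pass the whole message.
class FieldSelectingModel {
  private final List<String> schema;

  FieldSelectingModel(List<String> schema) {
    this.schema = schema;
  }

  // Keep only the fields the model was built on, in schema order.
  // Fields absent from the message are skipped; extra fields are ignored.
  Map<String, Object> select(Map<String, Object> message) {
    Map<String, Object> selected = new LinkedHashMap<>();
    for (String field : schema) {
      if (message.containsKey(field)) {
        selected.put(field, message.get(field));
      }
    }
    return selected;
  }
}
```

With something like this on the model side, the Stellar expression only needs to pass `_`, and adding a field to the model's schema does not require editing every enrichment config that calls it.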
What do you think? Would you prefer another option that would solve the issue?
PS. You'll get a full PR with a worked use-case on the Los Alamos National
Labs data for the semantic hashing function I teased by end of week.
> Stellar should have a _ special variable which returns the message in map form
> ------------------------------------------------------------------------------
>
> Key: METRON-1568
> URL: https://issues.apache.org/jira/browse/METRON-1568
> Project: Metron
> Issue Type: Improvement
> Reporter: Casey Stella
> Priority: Major
>
> In order to support functions which operate on the whole message, we should
> have a special variable (_, keeping with the vaguely scala theme) which can
> return the entire underlying message. This map should be immutable.