[jira] [Updated] (SOLR-9258) Optimizing, storing and deploying AI models with Streaming Expressions

Joel Bernstein (JIRA) Mon, 27 Jun 2016 16:33:07 -0700

     [ https://issues.apache.org/jira/browse/SOLR-9258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Joel Bernstein updated SOLR-9258:
---------------------------------
    Description: 
This ticket describes a framework for *optimizing*, *storing* and *deploying* 
AI models within the Streaming Expression framework.

*Optimizing*
[~caomanhdat] has contributed SOLR-9252, which provides *Streaming Expressions* 
for both feature selection and optimization of a logistic regression text 
classifier. SOLR-9252 also provides a great working example of *optimization* 
of a machine learning model using an in-place parallel iterative algorithm.
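
As a rough illustration of what iterative optimization of a logistic 
regression classifier involves, here is a minimal single-process sketch in 
Python. It is not the SOLR-9252 implementation: the function names, the 
learning rate and the dense feature rows are assumptions made for this 
example, and it does not attempt the in-place parallel execution that 
SOLR-9252 provides.

```python
import math


def sigmoid(z):
    """Logistic function mapping a raw score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_step(weights, rows, labels, learning_rate=0.1):
    """One batch gradient-descent pass; returns the updated weight list."""
    gradient = [0.0] * len(weights)
    for features, label in zip(rows, labels):
        prediction = sigmoid(sum(w * x for w, x in zip(weights, features)))
        error = prediction - label
        for i, x in enumerate(features):
            gradient[i] += error * x
    return [w - learning_rate * g / len(rows) for w, g in zip(weights, gradient)]


def train(rows, labels, iterations=200):
    """Repeat train_step from all-zero weights (hypothetical defaults)."""
    weights = [0.0] * len(rows[0])
    for _ in range(iterations):
        weights = train_step(weights, rows, labels)
    return weights
```

Each row here is a list of numeric feature values (for example a bias term 
followed by one value per selected term) and each label is 0 or 1.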

*Storing*

Both features and optimized models can be stored in SolrCloud collections using 
the update expression. Using [~caomanhdat]'s example in SOLR-9252, the pseudo 
code for storing features would be:

{code}
update(featuresCollection, 
       featuresSelection(collection1, id="myFeatures", q="*:*", field="tv_text", 
                         outcome="out_i", positiveLabel=1, numTerms=100))
{code}

An id parameter can be added to the featuresSelection expression so that the 
features can later be retrieved from the collection they are stored in.

*Deploying*

With the introduction of the topic() expression, SolrCloud can be treated as a 
distributed message queue. This messaging capability can be used to deploy 
models and process data through the models.

To implement this approach, a classify() function can be created that uses 
topic() functions to return both the model and the data to be classified. 
The pseudo code looks like this:

{code}
classify(topic(models, q="modelID", fl="features, weights"),
         topic(emails, q="*:*", fl="id, body", rows="500", version="3232323"))
{code}


In the example above the classify() function uses the topic() function to 
retrieve the model. Each time there is an update to the model in the index, 
the topic() expression will automatically read the new model.
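
To make the proposed classify() step more concrete, here is a minimal Python 
sketch of scoring documents against a stored model. It assumes the model tuple 
carries parallel "features" (terms) and "weights" lists, matching the fl list 
in the pseudo code, and that a document matches a feature when the term occurs 
in its body. The output field names "probability" and "label" and the 0.5 
threshold are hypothetical choices for this example, not part of the ticket.

```python
import math


def classify(model, docs, threshold=0.5):
    """Yield one scored tuple per document using a stored linear model."""
    weight_by_term = dict(zip(model["features"], model["weights"]))
    for doc in docs:
        # Naive whitespace tokenization stands in for real text analysis.
        terms = set(doc["body"].lower().split())
        z = sum(w for term, w in weight_by_term.items() if term in terms)
        probability = 1.0 / (1.0 + math.exp(-z))
        yield {
            "id": doc["id"],
            "probability": probability,
            "label": int(probability >= threshold),
        }
```

Because the model is passed in as plain data, swapping in an updated model 
only requires handing the function a newer tuple on the next call.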

The topic() function is also used to pull in the data set that is being 
classified. Notice the *version* parameter. This will be added to the topic 
function to support pulling results from a specific version number (JIRA ticket 
to follow).

With this approach both the model and the data to process through the model are 
treated as messages in a message queue.
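
The message-queue behaviour described above can be pictured with a small 
in-memory Python sketch. It is only an assumption about the semantics: the 
Topic class, the use of a "_version_" field as the checkpoint, and the choice 
to return records strictly newer than the supplied version are all 
illustrative and not taken from the topic() implementation.

```python
class Topic:
    """In-memory stand-in for a topic over a list of records."""

    def __init__(self, collection, version=None, rows=500):
        self.collection = collection
        self.rows = rows
        if version is not None:
            # Start from an explicit, possibly older, version number.
            self.checkpoint = version
        else:
            # Without a version, only records added later are delivered.
            self.checkpoint = max((r["_version_"] for r in collection), default=0)

    def read(self):
        """Return up to `rows` unseen records and advance the checkpoint."""
        fresh = sorted(
            (r for r in self.collection if r["_version_"] > self.checkpoint),
            key=lambda r: r["_version_"],
        )
        batch = fresh[: self.rows]
        if batch:
            self.checkpoint = batch[-1]["_version_"]
        return batch
```

Each call to read() behaves like consuming the next batch of messages: records 
already delivered are never returned again.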

The daemon function can be used to send the classify function to Solr where it 
will be run in the background. The pseudo code looks like this:

{code}
daemon(...,
         update(classifiedEmails, 
                 classify(topic(models, q="modelID", fl="features, weights"),
                          topic(emails, q="*:*", fl="id, body", rows="500", 
                                version="3232323"))))
{code}

In this scenario the daemon will run the classify function repeatedly in the 
background. With each run the topic() functions will re-pull the model if the 
model has been updated. It will also pull a new set of emails to be classified. 
The classified emails can be stored in another SolrCloud collection using the 
update() function.

Using this approach emails can be classified in batches. The daemon can 
continue to run even after all the emails have been classified. New emails 
added to the emails collection will then be automatically classified when they 
enter the index.
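
One way to think about a single background run is sketched below in Python. 
The function name daemon_run and its callable arguments are hypothetical: 
model_topic and data_topic stand for anything that returns the newly arrived 
records on each call, and sink is a plain list standing in for the 
classifiedEmails collection.

```python
def daemon_run(model_topic, data_topic, classify, sink, current_model=None):
    """Run one iteration; return the model to carry into the next run."""
    updates = model_topic()  # newly indexed model versions, possibly empty
    if updates:
        current_model = updates[-1]  # keep only the most recent model
    if current_model is None:
        # No model yet: leave the data topic untouched so nothing is skipped.
        return None
    batch = data_topic()  # next batch of unclassified records
    sink.extend(classify(current_model, batch))
    return current_model
```

A scheduler would call daemon_run repeatedly, passing the returned model back 
in, so an empty batch simply results in a run that stores nothing.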

Classification can be done in parallel once SOLR-9240 is completed. This will 
allow topic() results to be partitioned across worker nodes so they can be 
processed in parallel. The pseudo code for this is:

{code}
parallel(workerCollection, workers="20", ...,
         daemon(...,
                update(classifiedEmails, 
                       classify(topic(models, q="modelID", fl="features, weights", 
                                      partitionKeys="none"),
                                topic(emails, q="*:*", fl="id, body", rows="500", 
                                      version="3232323", partitionKeys="id")))))
{code}

The code above sends a daemon to 20 workers, which will each classify a 
partition of records pulled by the topic() function.
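
The partitioning idea can be sketched in Python as follows. The hashing scheme 
is an assumption chosen for the example (a CRC32 of the key modulo the worker 
count), not the scheme used for partitionKeys. With partitionKeys="none" on the 
model topic, every worker would receive the full model, while email records 
are split by id.

```python
import zlib


def worker_for(key, workers):
    """Map a partition key to a worker index, stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % workers


def partition(records, workers, partition_key="id"):
    """Split records into one bucket per worker by hashing the key field."""
    buckets = [[] for _ in range(workers)]
    for record in records:
        buckets[worker_for(record[partition_key], workers)].append(record)
    return buckets
```

Because the assignment depends only on the key, the same email always lands on 
the same worker and no record is classified twice.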

*AI based alerting*

If the *version* parameter is not supplied to the topic stream it will stream 
only new content from the topic, rather than starting from an older version 
number.

In this scenario the topic function behaves like an alert. Pseudo code for 
alerts looks like this:

{code}
daemon(...,
         alert(..., 
             classify(topic(models, q="modelID", fl="features, weights"),
                      topic(emails, q="*:*", fl="id, body", rows="500"))))
{code}

In the example above an alert() function wraps the classify() function and 
takes actions based on the classification of documents. Developers can build 
their own alert functions using the Streaming API and plug them in to provide 
custom actions.
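
A custom alert function of this kind might look like the Python sketch below. 
The "probability" field, the 0.9 threshold and the action callback are 
hypothetical; the sketch only shows the wrapping pattern, where classified 
tuples pass through unchanged and an action fires for the ones that qualify.

```python
def alert(classified, action, threshold=0.9):
    """Pass every tuple through; call action for high-probability tuples."""
    for tup in classified:
        if tup["probability"] >= threshold:
            action(tup)  # e.g. send a notification or write to a queue
        yield tup
```

Since alert() is a generator, the action runs as tuples are consumed, which 
fits a wrapper that is executed repeatedly inside a daemon.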

> Optimizing, storing and deploying AI models with Streaming Expressions
> ----------------------------------------------------------------------
>
>                 Key: SOLR-9258
>                 URL: https://issues.apache.org/jira/browse/SOLR-9258
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Joel Bernstein
>


