This is an automated email from the ASF dual-hosted git repository. michaelsmith pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/impala.git
commit c83e5d97693fd3035b33622512d1584a5e56ce8b Author: m-sanjana19 <[email protected]> AuthorDate: Thu Aug 1 12:38:47 2024 +0530 IMPALA-13030: [DOCS] Documentation of AI built-in function (ai_generate_text) Change-Id: Iae921f6554c7010f9568ee4a42b4abcb3534d4a6 Reviewed-on: http://gerrit.cloudera.org:8080/21629 Tested-by: Impala Public Jenkins <[email protected]> Reviewed-by: Yida Wu <[email protected]> --- docs/impala.ditamap | 1 + docs/topics/impala_ai_functions.xml | 228 ++++++++++++++++++++++++++++++++++++ 2 files changed, 229 insertions(+) diff --git a/docs/impala.ditamap b/docs/impala.ditamap index 682ae6491..40e7a3ae5 100644 --- a/docs/impala.ditamap +++ b/docs/impala.ditamap @@ -285,6 +285,7 @@ under the License. <topicref href="topics/impala_datetime_functions.xml"/> <topicref href="topics/impala_conditional_functions.xml"/> <topicref href="topics/impala_string_functions.xml"/> + <topicref href="topics/impala_ai_functions.xml"/> <topicref href="topics/impala_misc_functions.xml"/> <topicref href="topics/impala_aggregate_functions.xml"> <topicref href="topics/impala_appx_median.xml"/> diff --git a/docs/topics/impala_ai_functions.xml b/docs/topics/impala_ai_functions.xml new file mode 100644 index 000000000..3b093e05f --- /dev/null +++ b/docs/topics/impala_ai_functions.xml @@ -0,0 +1,228 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- +Licensed to the Apache Software Foundation (ASF) under one +or more contributor license agreements. See the NOTICE file +distributed with this work for additional information +regarding copyright ownership. The ASF licenses this file +to you under the Apache License, Version 2.0 (the +"License"); you may not use this file except in compliance +with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, +software distributed under the License is distributed on an +"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +KIND, either express or implied. See the License for the +specific language governing permissions and limitations +under the License. +--> +<!DOCTYPE concept PUBLIC "-//OASIS//DTD DITA Concept//EN" "concept.dtd"> +<concept id="impala_ai_functions"> + <title>Advantages and use cases of Impala AI functions</title> + <titlealts audience="PDF"> + <navtitle>AI Functions</navtitle> + </titlealts> + <prolog> + <metadata> + <data name="Category" value="Impala"/> + <data name="Category" value="Impala Functions"/> + <data name="Category" value="UDF"/> + <data name="Category" value="Data Analysts"/> + <data name="Category" value="Developers"/> + <data name="Category" value="Querying"/> + </metadata> + </prolog> + <conbody> + <p> You can use Impala's ai_generate_text function to access Large Language Models + (LLMs) in SQL queries. This function enables you to input a prompt, retrieve the LLM response, + and include it in results. You can create custom UDFs for complex tasks like sentiment + analysis and translation. + </p> + <section id="section_ufg_4tv_fcc"> + <title>Use LLMs directly in SQL with Impala's <codeph>ai_generate_text</codeph> + function></title> + <p>Impala introduces a built-in AI function called <codeph>ai_generate_text</codeph> that + enables direct access to and utilization of Large Language Models (LLMs) in SQL queries. + With this function, you can input a prompt, which may include data. The function + communicates with a supported LLM endpoint, sends the prompt, retrieves the response, and + includes it in the query result.</p> + <p>Alternatively, seamlessly integrate LLM intelligence into your Impala workflow by creating + custom User Defined Functions (UDFs) on top of <codeph>ai_generate_text</codeph>. This + allows you to use concise SQL statements for sending prompts to an LLM and receiving + responses. You can define UDFs for complex tasks like sentiment analysis, language + translation, and generative contextual analysis.</p> + </section> + <section id="ai_advantages"> + <title>Advantages of using AI functions</title> + <p> + <ul id="ul_cdc_fm4_tbc"> + <li><b>Simplified Workflow</b>: Eliminates the necessity for setting up intricate data + pipelines.</li> + <li><b>No ML Expertise Required</b>: No specialized machine learning skills are + needed.</li> + <li><b>Swift Decision-Making</b>: Enables faster insights on the data, facilitating + critical business decisions by using in-database function calls.</li> + <li><b>Integrated Functionality</b>: Requires no external applications, as it is a + built-in feature in Data Warehouse.</li> + </ul> + </p> + </section> + <section id="ai_usecases"> + <title>List of possible use cases</title> + <p>Here are some practical applications of using AI models with the function:<ul + id="ul_ond_pm4_tbc"> + <li><b>Sentiment Analysis</b>: Use the AI model to examine customer reviews for a product + and identify their sentiment as positive, negative, or neutral.</li> + <li><b>Language Translation</b>: Translate product reviews written in different languages + to understand customer feedback from various regions.</li> + <li><b>Generative Contextual Analysis</b>: Generate detailed reports and insights on + various topics based on provided data.</li> + </ul></p> + </section> + <section id="syntax_ai_function"> + <title>Syntax for AI built-in function arguments</title> + <p>The following example of a built-in AI function demonstrates the use of the OpenAI API as a + large language model. Currently, OpenAI's public endpoint and Azure OpenAI endpoints are + supported.</p> + <dl id="dl_w5l_nn4_tbc"> + <dlentry> + <dt>AI_GENERATE_TEXT_DEFAULT</dt> + <dd>Syntax:<codeblock id="codeblock_ibj_qn4_tbc">ai_generate_text_default(prompt)</codeblock></dd> + </dlentry> + <dlentry> + <dt>AI_GENERATE_TEXT</dt> + <dd>Syntax:<codeblock id="codeblock_qdg_zn4_tbc">ai_generate_text(ai_endpoint, prompt, ai_model, ai_api_key_jceks_secret, additional_params) + </codeblock></dd> + </dlentry> + </dl> + <p>The <codeph>ai_generate_text</codeph> function uses the values you provide as an argument + in the function for <codeph>ai_endpoint</codeph>, <codeph>ai_model</codeph>, and + <codeph>ai_api_key_jceks_secret</codeph>. If any of the arguments are left empty or set to + NULL, the function uses the default values defined at the instance level. These default + values correspond to the flag settings configured in the Impala instance. For example, if + the <codeph>ai_endpoint</codeph> argument is NULL or empty, the function will use the value + specified by the <codeph>ai_endpoint</codeph> flag as the default.</p> + <p dir="ltr">When using the <codeph>ai_generate_text_default</codeph> function, make sure to + set all parameters (<codeph>ai_endpoint</codeph>, <codeph>ai_model</codeph>, and + <codeph>ai_api_key_jceks_secret</codeph>) in the coordinator/executor flagfiles with + appropriate values.</p> + </section> + <section id="key_parameters_ai_function"> + <title>Key parameters for using the AI model</title> + <ul id="ul_pfm_2p4_tbc"> + <li><b>ai_endpoint</b>: The endpoint for the model API that is being interfaced with, + supports services like OpenAI and Azure OpenAI Service, for example, + https://api.openai.com/v1/chat/completions.</li> + <li><b>prompt</b>: The text you submit to the AI model to generate a response.</li> + <li><b>ai_model</b>: The specific model name you want to use within the desired API, for + example, gpt-3.5-turbo.</li> + <li><b>ai_api_key_jceks_secret</b>: The key name for the JCEKS secret that contains your API + key for the AI API you are using. You need a JCEKS keystore containing the specified JCEKS + secret referenced in <codeph>ai_api_key_jceks_secret</codeph>. To do this, set the + <codeph>hadoop.security.credential.provider.path</codeph> property in the + <codeph>core-site</codeph> configuration for both the executor and coordinator.</li> + <li><b>additional_params</b>: Additional parameters that the AI API offers that is provided + to the built-in function as a JSON object.</li> + </ul> + </section> + <section id="impala-built-in-ai-function"> + <title>Examples of using the built-in AI function</title> + <p>The following example lists the steps needed to turn a prompt into a custom SQL function + using just the built-in function + <codeph>ai_generate_text_default</codeph>.<codeblock id="codeblock_ppy_dr4_tbc">> select ai_generate_text_default('hello'); + Response: + Hello! How can I assist you today? + </codeblock></p> + <p>In the below example, a query is sent to the Amazon book reviews database for the book + titled Artificial Superintelligence. The large language model (LLM) is prompted to classify + the sentiment as positive, neutral, or + negative.<codeblock id="codeblock_hyp_4np_tbc">> select customer_id, star_rating, ai_generate_text_default(CONCAT('Classify the following review as positive, neutral, or negative', and only include the uncapitalized category in the response: ', review_body)) AS review_analysis, review_body from amazon_book_reviews where product_title='Artificial Superintelligence' order by customer_id LIMIT 1; + Response: + +--+------------+------------+----------------+------------------+ + | |customer_id |star_rating |review_analysis |review_body | + +--+------------+------------+----------------+------------------+ + |1 |4343565 | 5 |positive |What is this book | + | | | | |all about ………… | + +--+------------+------------+----------------+------------------+ + </codeblock></p> + </section> + <section id="impala-custom-udf"> + <title>Examples of creating and using custom UDFs along with the built-in AI function</title> + <p>Instead of writing the prompts in a SQL query, you can build a UDF with your intended + prompt. Once you build your custom UDF, pass your desired prompt within your custom UDF into + the <codeph>ai_generate_text_default</codeph> built-in Impala function.</p> + <p>Example: Classify input customer reviews</p> + <p>The following UDF uses the Amazon book reviews database as the input and requests the LLM + to classify the sentiment.</p> + <p>Classify input customer reviews:</p> + <codeblock id="codeblock_h1z_yrp_tbc"> + IMPALA_UDF_EXPORT + StringVal ClassifyReviews(FunctionContext* context, const StringVal& input) { + std::string request = + std::string("Classify the following review as positive, neutral, or negative") + + std::string(" and only include the uncapitalized category in the response: ") + + std::string(reinterpret_cast<const char*>(input.ptr), input.len); + StringVal prompt(request.c_str()); + const StringVal endpoint("https://api.openai.com/v1/chat/completions"); + const StringVal model("gpt-3.5-turbo"); + const StringVal api_key_jceks_secret("open-ai-key"); + const StringVal params("{\"temperature\": 0.9, \"model\": \"gpt-4\"}"); + return context->Functions()->ai_generate_text( + context, endpoint, prompt, model, api_key_jceks_secret, params); + } + </codeblock> + <p>Now you can define these prompt building UDFs and build them in Impala. Once you have them + running, you can query your datasets using them.</p> + <p>Creating <codeph>analyze_reviews</codeph> function:</p> + <codeblock id="codeblock_b4l_ctp_tbc">> CREATE FUNCTION analyze_reviews(STRING) + RETURNS STRING + LOCATION ‘s3a://dw-...............’ + SYMBOL=’ClassifyReviews’ + </codeblock> + <p>Using SELECT query for Sentiment analysis to classify Amazon book reviews</p> + <codeblock id="codeblock_xdr_jtp_tbc">> SELECT customer_id, star_rating, analyze_reviews(review_body) AS review_analysis, review_body from amazon_book_reviews where product_title='Artificial Superintelligence' order by customer_id; + Response: + +--+------------+------------+----------------+----------------------+ + | |customer_id |star_rating |review_analysis |review_body | + +--+------------+------------+----------------+----------------------+ + |1 |44254093 | 5 |positive |What is this book all | + | | | | |about? It is all about| + | | | | |a mind-blowing | + | | | | |universal law of | + | | | | |nature. Mind-blow… | + +--+------------+------------+----------------+----------------------+ + |2 |50050072 | 5 |positive |The two tightly- | + | | | | |connected ideas strike| + | | | | |you as amazed. In the | + | | | | |first place, what has | + | | | | |never bef… | + +--+------------+------------+----------------+----------------------+ + |3 |50050072 | 5 |positive |The two tightly- | + | | | | |connected ideas strike| + | | | | |you as amazed. In the | + | | | | |first place, what has | + | | | | |never bef… | + +--+------------+------------+----------------+----------------------+ + |4 |52932308 | 1 |negative |This book is seriously| + | | | | |flawed. I could not | + | | | | |work out if the author| + | | | | |was a mathemetician | + | | | | |dabbi… | + +--+------------+------------+----------------+----------------------+ + |5 |52971961 | 1 |negative |Abdoullaev's | + | | | | |exploration of | + | | | | |Al issues appears to | + | | | | |be very technological | + | | | | |and straightforward… | + +--+------------+------------+----------------+----------------------+ + |6 |53008416 | 4 |positive |As Co Founder of | + | | | | |ArtilectWorld:ultra | + | | | | |intelligent machine, | + | | | | |I recommend reading | + | | | | |this book! | + +--+------------+------------+----------------+----------------------+ +</codeblock> + </section> + </conbody> +</concept>
