Robin <mixent...@aussiebroadband.com.au> wrote:
> These bots have an API that lets them access wiki, or so they claim.
> Try asking for something else that is not on LENR-CANR.

I excluded all data outside of LENR-CANR.org. That is an option in the FastChat Bot configuration.

This Bot used the word "J'Accuse" as a synonym for some word or phrase it found in a document at LENR-CANR.org. I do not know which document. It can be hard to discover that.

The Bots often use synonyms or paraphrasing. I think this is because the data is tokenized and the Bot may not store the original words. Bots store alphanumeric tokens instead of words, so the original words may not be available in the Bot database. (There is a small code sketch after the exchange below showing what tokenization looks like.)

I asked the public version of ChatGPT about this. She is a leading expert in AI technology. Here are her responses:

ME: What is tokenized data in an AI system?

BOT: In AI systems, tokenization refers to the process of breaking down a text or sequence of words into smaller units called tokens. These tokens can be individual words, phrases, or even characters. Tokenization is an essential step in natural language processing (NLP) and other machine learning tasks that involve processing text data. By breaking down text into individual tokens, AI systems can analyze and interpret the meaning of the text more efficiently.

Tokenization typically involves removing any punctuation, converting all characters to lowercase, and splitting the text into tokens based on a delimiter such as a space or punctuation mark. The resulting tokens can then be encoded and used as inputs to various machine learning models.

Overall, tokenization helps to simplify and standardize text data so that it can be effectively analyzed and processed by AI systems.

ME: A token is a numerical value, right? In some cases, can two different words for one concept -- synonyms -- have the same token? Or a token with a close numeric value?

BOT: Tokens are not necessarily numerical values. In natural language processing, tokens are typically represented as strings of characters or words, which can then be encoded as numerical values for machine learning algorithms to process.

When using tokenization, two different words that represent the same concept (synonyms) can be assigned the same token to enable the machine learning algorithm to recognize them as equivalent. In this case, the goal of tokenization is to reduce the vocabulary size and improve the efficiency of the model without losing critical information.

However, it's important to note that tokens with close numeric values should not be used interchangeably. Each token should be unique and represent a distinct concept or word in the input text. If two tokens have similar numeric values, it could lead to confusion and inaccuracies in the machine learning model's results. Therefore, tokenization algorithms typically ensure that each token has a unique and distinguishable value.

ME: I have noted that when an AI describes the content of a document, it sometimes uses synonyms instead of the original words. In a few cases, I think it did this even when I asked for a verbatim quote. Is this because the content is stored as tokens, and the exact original words are not available?

BOT: Yes, it is possible that the AI is using synonyms to describe the content of a document because it is working with tokenized data and not the original words. When an AI system tokenizes text data, it breaks down the text into smaller units called tokens, which can be individual words, phrases, or even characters. These tokens are then used as inputs to machine learning models, which can use various techniques to analyze and interpret the meaning of the text. Sometimes, these models may use synonyms or other related words to describe the content of a document because the original words may not be available or may not be the best fit for the context.

Additionally, some AI systems may use techniques like paraphrasing or summarization to generate new text based on the input data. In these cases, the AI may use synonyms or other related words to convey the same meaning as the original text while also improving readability or reducing complexity.

However, if you specifically requested a verbatim quote and the AI system did not provide it, it is possible that there was an error or a misunderstanding in the processing of your request. It's important to keep in mind that AI systems are not perfect and may not always provide the exact results that you expect.

ME: How should I phrase a request to see the original data? Should I use the word "verbatim"? Or some other word?

BOT: If you want to see the original data or text without any modifications or substitutions, using the word "verbatim" is an appropriate way to phrase your request. Verbatim means word for word, exactly as it was said or written.

So, you could phrase your request like this: "Can I please see the original text verbatim, without any modifications or substitutions?" This clearly communicates that you want to see the exact words as they were written, without any changes or paraphrasing.

It's important to be specific and clear in your request to avoid any confusion or misunderstandings. Using a term like "verbatim" can help ensure that your request is understood and that you get the information you need in the form you want it.

[Notice that she wants me to say "please"!]
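Here is a rough sketch of what tokenization looks like in code. This is only an illustration: the vocabulary and the encode/decode functions below are made up for this example, and are not taken from ChatGPT or from the FastChat software. It shows the basic idea described in the exchange above: text goes in, integer token IDs come out, and anything reconstructed later comes from those IDs rather than from the original document.

    # Toy word-level tokenizer. Illustration only -- the vocabulary and
    # functions here are invented for this sketch, not taken from ChatGPT
    # or FastChat. Real systems use subword vocabularies with tens of
    # thousands of entries, but the principle is the same.
    vocab = {"cold": 0, "fusion": 1, "is": 2, "real": 3, "<unk>": 4}
    id_to_word = {i: w for w, i in vocab.items()}

    def encode(text):
        # Lowercase, strip periods, split on spaces, map each word to its ID.
        # Words that are not in the vocabulary collapse to the "<unk>" token.
        words = text.lower().replace(".", "").split()
        return [vocab.get(w, vocab["<unk>"]) for w in words]

    def decode(ids):
        # Rebuild text from the IDs alone. Capitalization, punctuation, and
        # any out-of-vocabulary wording are already gone at this point.
        return " ".join(id_to_word[i] for i in ids)

    ids = encode("Cold fusion is real.")
    print(ids)          # prints [0, 1, 2, 3]
    print(decode(ids))  # prints: cold fusion is real

Production systems such as the GPT models use byte-pair encoding, which splits text into subword pieces rather than whole words, but the model still works from integer IDs and generates its answers anew from them rather than copying stored passages, which is one reason the wording can drift even when you ask for a quote.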