Robin <mixent...@aussiebroadband.com.au> wrote:

>  These bots have an API that lets them access wiki, or so they claim.
>  Try asking for something else that is not on LENR-CANR.


I excluded all data outside of LENR-CANR.org. That is an option in the
FastChat Bot configuration. This Bot used the word "J'Accuse" as a synonym
for some word or phrase it found in a document at LENR-CANR.org. I do not
know which document, and that can be hard to discover. The Bots often use
synonyms or paraphrase. I think this is because the data is tokenized
and the Bot may not store the original words. Bots store alphanumeric
tokens instead of words, so the original words may not be available in the
Bot database.

I asked the public version of ChatGPT about this. She is a leading expert
in AI technology. Here are her responses:


ME: What is tokenized data in an ai system?
BOT: In AI systems, tokenization refers to the process of breaking down a
text or sequence of words into smaller units called tokens. These tokens
can be individual words, phrases, or even characters.

Tokenization is an essential step in natural language processing (NLP) and
other machine learning tasks that involve processing text data. By breaking
down text into individual tokens, AI systems can analyze and interpret the
meaning of the text more efficiently.

Tokenization typically involves removing any punctuation, converting all
characters to lowercase, and splitting the text into tokens based on a
delimiter such as a space or punctuation mark. The resulting tokens can
then be encoded and used as inputs to various machine learning models.

Overall, tokenization helps to simplify and standardize text data so that
it can be effectively analyzed and processed by AI systems.
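
[Here is a minimal Python sketch of the steps the Bot describes above:
lowercase the text, remove punctuation, split on spaces, and encode each
token as a number. The function names are hypothetical and chosen only for
illustration. This is not taken from ChatGPT or the FastChat Bot, and real
systems may tokenize quite differently.]

```python
import string

def tokenize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocab(tokens):
    """Give each distinct token a unique integer ID, in order of first use."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its integer ID."""
    return [vocab[tok] for tok in tokens]

def decode(ids, vocab):
    """Map integer IDs back to the token strings they stand for."""
    inverse = {i: tok for tok, i in vocab.items()}
    return [inverse[i] for i in ids]

text = "Tokens can be words. Tokens can be characters!"
tokens = tokenize(text)
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(tokens)
print(ids)
print(decode(ids, vocab))
```

[In this sketch, decoding gives back the same lowercased words that were
encoded, but the original capital letters and punctuation are gone.]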


ME: A token is a numerical value, right? In some cases, can two different
words for one concept -- synonyms -- have the same token? Or a token with a
close numeric value?

BOT: Tokens are not necessarily numerical values. In natural language
processing, tokens are typically represented as strings of characters or
words, which can then be encoded as numerical values for machine learning
algorithms to process.

When using tokenization, two different words that represent the same
concept (synonyms) can be assigned the same token to enable the machine
learning algorithm to recognize them as equivalent. In this case, the goal
of tokenization is to reduce the vocabulary size and improve the efficiency
of the model without losing critical information.

However, it's important to note that tokens with close numeric values
should not be used interchangeably. Each token should be unique and
represent a distinct concept or word in the input text. If two tokens have
similar numeric values, it could lead to confusion and inaccuracies in the
machine learning model's results. Therefore, tokenization algorithms
typically ensure that each token has a unique and distinguishable value.
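
[The Bot's claim that synonyms can be assigned the same token can be
sketched as follows. The synonym table is hypothetical and invented purely
for illustration. I do not know whether any real Bot merges synonyms this
way. The sketch only shows what would follow if one did: the original word
could not be recovered from the token.]

```python
# Hypothetical synonym table, invented for illustration only.
CANONICAL = {"automobile": "car", "auto": "car", "big": "large"}

def normalize(tokens):
    """Replace each token with its canonical synonym, if it has one."""
    return [CANONICAL.get(tok, tok) for tok in tokens]

def build_vocab(tokens):
    """Give each distinct token a unique integer ID, in order of first use."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

original = ["the", "automobile", "is", "big"]
normalized = normalize(original)
vocab = build_vocab(normalized)
ids = [vocab[tok] for tok in normalized]

# Decode the IDs back to words using the inverse of the vocabulary.
inverse = {i: tok for tok, i in vocab.items()}
decoded = [inverse[i] for i in ids]

print(original)
print(decoded)
```

[Every ID in the vocabulary is still unique, but "automobile" and "car"
now share one ID, so decoding returns "car" whichever word was in the
document.]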


ME: I have noted that when an AI describes the content of a document, it
sometimes uses synonyms instead of the original words. In a few cases, I
think it did this even when I asked for a verbatim quote. Is this because
the content is stored as tokens, and the exact original words are not
available?

BOT: Yes, it is possible that the AI is using synonyms to describe the
content of a document because it is working with tokenized data and not the
original words. When an AI system tokenizes text data, it breaks down the
text into smaller units called tokens, which can be individual words,
phrases, or even characters.

These tokens are then used as inputs to machine learning models, which can
use various techniques to analyze and interpret the meaning of the text.
Sometimes, these models may use synonyms or other related words to describe
the content of a document because the original words may not be available
or may not be the best fit for the context.

Additionally, some AI systems may use techniques like paraphrasing or
summarization to generate new text based on the input data. In these cases,
the AI may use synonyms or other related words to convey the same meaning
as the original text while also improving readability or reducing
complexity.

However, if you specifically requested a verbatim quote and the AI system
did not provide it, it is possible that there was an error or a
misunderstanding in the processing of your request. It's important to keep
in mind that AI systems are not perfect and may not always provide the
exact results that you expect.


ME: How should I phrase a request to see the original data? Should I use
the word "verbatim"? Or some other word?

BOT: If you want to see the original data or text without any modifications
or substitutions, using the word "verbatim" is an appropriate way to phrase
your request. Verbatim means word for word, exactly as it was said or
written.

So, you could phrase your request like this: "Can I please see the original
text verbatim, without any modifications or substitutions?" This clearly
communicates that you want to see the exact words as they were written,
without any changes or paraphrasing.

It's important to be specific and clear in your request to avoid any
confusion or misunderstandings. Using a term like "verbatim" can help
ensure that your request is understood and that you get the information you
need in the form you want it.

[Notice that she wants me to say "please"!]
