ChristianCasazza opened a new issue, #13501:
URL: https://github.com/apache/datafusion/issues/13501

   ### Is your feature request related to a problem or challenge?
   
   LLMs provide a fantastic way to learn and use a new codebase. Given the documentation, they can create a custom guide for new users, answer specific questions about a library or API, and help address bugs in code that uses the library. One trend emerging from this paradigm is to maintain two versions of the documentation. The first is the traditional version built for humans, which splits the codebase into separate sections that are easy for a person to navigate. The second is optimized for LLMs: a single markdown file that strips the human-oriented formatting and gathers all of the information in one central file, so it can easily be copied and pasted into an LLM chat, which can then teach the user the library.
   
   DuckDB recently released their version of an llm.txt [here](https://duckdb.org/duckdb-docs.md). It is essentially one huge markdown file containing all of their documentation, totaling about 700k tokens. Caleb Fahlgren from Hugging Face extracted the data into an organized version [here](https://huggingface.co/datasets/duckdb-nsql-hub/duckdb-docs).
   
   I would like to propose making a DataFusion version of these LLM docs.
   
   ### Describe the solution you'd like
   
   I would propose making a few versions of the llm.txt for DataFusion. We should make one version that includes all of the docs in one large markdown file. This is a strong start, but it will likely be so large that it does not fit within the context windows of common chat LLMs. Therefore, I would also suggest making smaller versions for the different sections of the docs, such as architecture, the API, etc. The individual sections would make it easier for a user to selectively provide the context for the particular part of DataFusion they are working with, without overloading the LLM's context.
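   As a rough sketch (not part of the original proposal), generating these files could be as simple as walking the documentation source tree and concatenating the markdown files, both into one combined file and into one file per top-level section. The paths below (`docs/source` as the input directory, `llm-docs/` as the output directory) are assumptions for illustration, not necessarily the actual DataFusion layout.

```python
# Sketch: concatenate documentation sources into LLM-friendly files.
# Directory names are illustrative assumptions, not the real repo layout.
from pathlib import Path

DOCS_ROOT = Path("docs/source")   # assumed location of the doc sources
OUT_DIR = Path("llm-docs")        # hypothetical output directory
OUT_DIR.mkdir(exist_ok=True)

def collect(section_dir: Path) -> str:
    """Concatenate every markdown file under one section, in sorted order."""
    parts = []
    for md_file in sorted(section_dir.rglob("*.md")):
        # Keep a comment marker so the LLM (and the user) can see file boundaries.
        parts.append(f"\n\n<!-- {md_file.relative_to(DOCS_ROOT)} -->\n\n")
        parts.append(md_file.read_text(encoding="utf-8"))
    return "".join(parts)

# One file per top-level section, plus one combined file with everything.
combined = []
for section in sorted(p for p in DOCS_ROOT.iterdir() if p.is_dir()):
    text = collect(section)
    if text:
        (OUT_DIR / f"{section.name}.md").write_text(text, encoding="utf-8")
        combined.append(text)

(OUT_DIR / "datafusion-docs-full.md").write_text("".join(combined), encoding="utf-8")
```

   A step like this could run as part of the docs build so the generated LLM files always track the latest version of the documentation.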
   
   After some trial and error, I would also suggest creating fully LLM-optimized versions. These would mix conceptual explanations of DataFusion, example code snippets, and the raw API interface. The goal for these final versions would be simple templates that can be copied and pasted into a chat, priming the LLM with the context of the latest version of DataFusion along with knowledge of working examples.
   
   ### Describe alternatives you've considered
   
   One popular alternative to LLM docs is for companies to build their own chatbot that has the context of their docs. While this is useful, I think it misses the point. I believe that in the future it can be assumed that developers are already paying for their own LLM, whether through chat (ChatGPT, Claude), IDEs (Cursor), or their own setup built on the LLM APIs.
   
   Therefore, I don't think the best model is for each company to host its own chatbot LLM. It becomes difficult for a user to combine context across the different libraries they use together, and they must iterate in a company's chosen interface instead of the LLM interface they are already comfortable with.
   
   Instead, it would be better to provide the raw context and allow users to 
bring it into the LLM interface they are already using.
   
   ### Additional context
   
   I think LLM-paired development is the future of data engineering, and having first-class LLM support is vital for the adoption of DataFusion. As an example, consider pandas and Polars. Even though Polars offers massive improvements over pandas, there is an order of magnitude more public example code for pandas than for Polars. As a result, LLMs will often suggest pandas code first and often produce better working code with pandas than with Polars. Even though the underlying library is better, I think many new developers will simply use whatever works best with LLMs. I believe this is part of why DuckDB has been so popular: LLMs are already much better at writing SQL than dataframe code.
   
   By creating first-class LLM support for DataFusion, I think it can be positioned to gain developer mindshare as modern Arrow-based engines become the common-sense choice.

