I am +1 if we're sure that it only adds one or a few files.

On Thu, 11 Sept 2025 at 06:53, Denny Lee wrote:

> While it is not standard per se, it is quickly becoming a common
> approach. And as you noted, the MCP site has llms-full.txt; they
> also have https://modelcontextprotocol.io/llms.txt
>
> On Wed, Sep 10, 2025 at 14:48 Bjørn Jørgensen wrote:
>
>> The protocol for this llms.txt is not a standard yet.
>>
>> "*To clarify, llms.txt is not meant to be a duplication of the full
>> documentation.*"
>> Some sites, like the Model Context Protocol (MCP)
>> <https://modelcontextprotocol.io/tutorials/building-mcp-with-llms> site,
>> have their full web page in the llms page.
>> https://modelcontextprotocol.io/llms-full.txt
>>
>> https://modelcontextprotocol.io/tutorials/building-mcp-with-llms
>>
>> On Wed, 10 Sep 2025 at 22:27, Allison Wang wrote:
>>
>>> Thanks Dongjoon for raising these concerns. I agree with your point that
>>> it’s worth making the lightweight manifest scope explicit in the SPIP so we
>>> have a systematic guarantee it stays small (under 10MB).
>>>
>>> To clarify, llms.txt is not meant to be a duplication of the full
>>> documentation. Instead, it acts more like an index or table of contents
>>> page: a small, curated manifest that points to existing canonical docs.
>>> The intent is to help AI-assisted tools and LLMs discover the right entry
>>> points, not to repackage the entire documentation set.
>>>
>>> For example, DuckDB's llms.txt
>>> <https://duckdb.org/docs/stable/llms.txt> file is around 30KB in
>>> size. Spark’s manifests will likely be a bit larger given the broader scope
>>> of APIs and documentation, but they should still remain lightweight
>>> link-only markdown files and well under the 10MB limit, even across
>>> multiple versions and language scopes.
>>>
>>> On Wed, Sep 10, 2025 at 8:47 AM Wenchen Fan wrote:
>>>
>>>> This should just be an LLM-facing index page of Spark docs?
>>>> Given the number of APIs Spark provides today, I think this index page
>>>> should be useful to humans as well.
>>>>
>>>> On Wed, Sep 10, 2025 at 10:46 PM Dongjoon Hyun wrote:
>>>>
>>>>> Thank you, Allison and Hyukjin.
>>>>>
>>>>> IIUC, this proposal is not about a single file. The SPIP already exposes
>>>>> multiple files, which may double our documentation and website size
>>>>> (or more in the worst case) because it's simply a duplication of the
>>>>> content. If we start to use AI tools to generate these llms.txt files, it
>>>>> could be much bigger than the original.
>>>>>
>>>>> *** From SPIP ***
>>>>> - [PySpark (Python)](https://spark.apache.org/docs/latest/api/python/llms.txt)
>>>>> - [Scala](https://spark.apache.org/docs/latest/api/scala/llms.txt)
>>>>> - [4.0.0 docs hub](https://archive.apache.org/dist/spark/docs/4.0.0/llms.txt)
>>>>> ***
>>>>>
>>>>> Since the size of the Apache Spark 4.1.0-preview1 documentation is 1.2GB,
>>>>> could you propose a systematic way to always keep the total size of the
>>>>> newly added llms.txt files under 10MB, Allison? If we don't have full
>>>>> control, this duplication will break the ASF Spark website like
>>>>> last year. We already had to archive old Spark documents from the
>>>>> original website location to "https://archive.apache.org/dist/spark/"
>>>>> due to the CI outage.
>>>>>
>>>>> $ du -h 4.1.0-preview1 | tail -n1
>>>>> 1.2G 4.1.0-preview1
>>>>>
>>>>> The bottom line is that we need to have a clear hard limit for this
>>>>> newly proposed duplication for machine-friendly metadata. If we have a
>>>>> systematic way to control the upper bound, which is less than 10MB per
>>>>> Spark version in total (now and forever), it sounds like a good addition.
>>>>>
>>>>> Thanks,
>>>>> Dongjoon.
>>>>>
>>>>> On Tue, Sep 9, 2025 at 7:19 PM Allison Wang wrote:
>>>>>
>>>>>> Yes, that’s right.
>>>>>> It’s essentially just one markdown file to start with, and we can
>>>>>> add more later for language or version specific files if needed.
>>>>>>
>>>>>> On Tue, Sep 9, 2025 at 4:32 PM Hyukjin Kwon wrote:
>>>>>>
>>>>>>> so it's basically adding one text file for llm, right? I think it's
>>>>>>> a good idea.
>>>>>>>
>>>>>>> On Tue, 9 Sept 2025 at 10:22, Allison Wang wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I’d like to propose adding llms.txt files to the Spark
>>>>>>>> documentation.
>>>>>>>>
>>>>>>>> As more users rely on AI-assisted tools and LLMs to learn, write
>>>>>>>> Spark code, and troubleshoot issues, it’s increasingly important that
>>>>>>>> these tools point back to the up-to-date official documentation. This
>>>>>>>> will help improve code generation quality and make new Spark features
>>>>>>>> easier to discover. The emerging llms.txt convention
>>>>>>>> <https://llmstxt.org/> provides a lightweight way to curate
>>>>>>>> LLM-friendly manifests of key documentation links.
>>>>>>>>
>>>>>>>> Would love to hear your feedback!
>>>>>>>> SPIP:
>>>>>>>> https://docs.google.com/document/d/1tRYdNTrIs8-JTgDthQ-7kcxEG7S91mNUVmUOfevW-cE/edit?tab=t.0#heading=h.wq8o4rl94dvr
>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-53528
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Allison
>>>>>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
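
For reference, the "systematic" 10MB upper bound requested in the thread could be enforced with a small check run against a built docs tree. The following is only a minimal sketch under stated assumptions, not something taken from the SPIP: the function names, the `llms*.txt` filename pattern, and the idea of running it per version directory are all hypothetical, and the limit simply reuses the 10MB figure discussed above.

```python
from pathlib import Path

# Hypothetical budget, taken from the 10MB figure discussed in the thread.
LIMIT_BYTES = 10 * 1024 * 1024


def manifest_sizes(docs_root, pattern="llms*.txt"):
    """Map each manifest path (relative to docs_root) to its size in bytes.

    The pattern is an assumption: it matches llms.txt as well as
    variants such as llms-full.txt anywhere under docs_root.
    """
    root = Path(docs_root)
    return {
        p.relative_to(root).as_posix(): p.stat().st_size
        for p in root.rglob(pattern)
        if p.is_file()
    }


def check_budget(docs_root, limit_bytes=LIMIT_BYTES):
    """Return (total_bytes, within_budget) for all manifests under docs_root."""
    total = sum(manifest_sizes(docs_root).values())
    return total, total <= limit_bytes
```

A docs build could call `check_budget("4.1.0-preview1")` on the generated directory for one Spark version and fail the build when the second element of the result is `False`; whether the project would wire this into CI, and under which directory layout, is left open here.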
