I am +1 if we're sure that it's adding one or only a few files.
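
One way to be sure of that would be a small check in the docs build. Below is a minimal sketch, not something from the SPIP: the function name and the file-count cap are hypothetical, and the 10MB budget is the per-version limit suggested further down in this thread. It counts the llms.txt files under a docs directory and sums their sizes.

```python
from pathlib import Path

# Budget discussed in this thread: under 10MB per Spark version in total.
MAX_TOTAL_BYTES = 10 * 1024 * 1024
# Hypothetical cap on the number of manifests; not a number from the SPIP.
MAX_FILES = 5


def check_llms_manifests(docs_root, max_total_bytes=MAX_TOTAL_BYTES, max_files=MAX_FILES):
    """Return (file_count, total_bytes, ok) for all llms.txt files under docs_root."""
    # rglob matches files named exactly "llms.txt" at any depth.
    manifests = sorted(Path(docs_root).rglob("llms.txt"))
    total = sum(p.stat().st_size for p in manifests)
    ok = len(manifests) <= max_files and total < max_total_bytes
    return len(manifests), total, ok
```

For example, `check_llms_manifests("4.1.0-preview1")` could run against a generated docs directory, and the build could fail when the third value is False.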

On Thu, 11 Sept 2025 at 06:53, Denny Lee <[email protected]> wrote:

> While it is not a standard per se, it is quickly becoming a common
> approach.  And as you noted, the MCP site has llms-full.txt; they
> also have
> https://modelcontextprotocol.io/llms.txt
>
>
> On Wed, Sep 10, 2025 at 14:48 Bjørn Jørgensen <[email protected]>
> wrote:
>
>> The protocol for this llms.txt is not a standard yet.
>>
>> "*To clarify, llms.txt is not meant to be a duplication of the full
>> documentation.*"
>> Some sites, like the Model Context Protocol (MCP)
>> <https://modelcontextprotocol.io/tutorials/building-mcp-with-llms> site,
>> put their full web page content in the llms file:
>> https://modelcontextprotocol.io/llms-full.txt
>>
>>
>> https://modelcontextprotocol.io/tutorials/building-mcp-with-llms
>>
>> On Wed, 10 Sep 2025 at 22:27, Allison Wang
>> <[email protected]> wrote:
>>
>>> Thanks Dongjoon for raising these concerns. I agree with your point that
>>> it’s worth making the lightweight manifest scope explicit in the SPIP so we
>>> have a systematic guarantee it stays small (under 10MB).
>>>
>>> To clarify, llms.txt is not meant to be a duplication of the full
>>> documentation. Instead, it acts more like an index or table of contents
>>> page: a small, curated manifest that points to existing canonical docs.
>>> The intent is to help AI-assisted tools and LLMs discover the right entry
>>> points, not to repackage the entire documentation set.
>>>
>>> For example, DuckDB's llms.txt
>>> <https://duckdb.org/docs/stable/llms.txt> file is around 30KB in
>>> size. Spark’s manifests will likely be a bit larger given the broader scope
>>> of APIs and documentation, but they should still remain lightweight
>>> link-only markdown files and well under the 10MB limit, even across
>>> multiple versions and language scopes.
>>>
>>> On Wed, Sep 10, 2025 at 8:47 AM Wenchen Fan <[email protected]> wrote:
>>>
>>>> This should just be an LLM-facing index page of Spark docs? Given the
>>>> number of APIs Spark provides today, I think this index page should be
>>>> useful to humans as well.
>>>>
>>>> On Wed, Sep 10, 2025 at 10:46 PM Dongjoon Hyun <[email protected]>
>>>> wrote:
>>>>
>>>>> Thank you, Allison and Hyukjin.
>>>>>
>>>>> IIUC, this proposal is not about a single file. The SPIP already exposes
>>>>> multiple files, which may double the size of our documentation and website
>>>>> (or more in the worst case) because it's simply a duplication of the
>>>>> content. If we start to use AI tools to generate these LLMS.txt files, they
>>>>> could be much bigger than the original.
>>>>>
>>>>> *** From SPIP ***
>>>>> - [PySpark (Python)](
>>>>> https://spark.apache.org/docs/latest/api/python/llms.txt)
>>>>> - [Scala](https://spark.apache.org/docs/latest/api/scala/llms.txt)
>>>>> - [4.0.0 docs hub](
>>>>> https://archive.apache.org/dist/spark/docs/4.0.0/llms.txt)
>>>>> ***
>>>>>
>>>>> Since the size of Apache Spark 4.1.0-preview1 documentation is 1.2GB,
>>>>> could you propose a systematic way to always keep the total size of the
>>>>> newly added llms.txt files under 10MB, Allison? If we don't have full
>>>>> control, this duplication will break the ASF Spark website like
>>>>> last year. We already had to archive old Spark documents from the
>>>>> original website location to "https://archive.apache.org/dist/spark/"
>>>>> due to the CI outage.
>>>>>
>>>>> $ du -h 4.1.0-preview1 | tail -n1
>>>>> 1.2G 4.1.0-preview1
>>>>>
>>>>> The bottom line is that we need to have a clear hard limit for this
>>>>> newly proposed duplication for machine-friendly metadata. If we have a
>>>>> systematic way to control the upper bound which is less than 10MB per
>>>>> Spark version in total (now and forever), it sounds like a good addition.
>>>>>
>>>>> Thanks,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Tue, Sep 9, 2025 at 7:19 PM Allison Wang <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Yes, that’s right. It’s essentially just one markdown file to start
>>>>>> with, and we can add more later for language- or version-specific files if
>>>>>> needed.
>>>>>>
>>>>>> On Tue, Sep 9, 2025 at 4:32 PM Hyukjin Kwon <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> So it's basically adding one text file for LLMs, right? I think it's
>>>>>>> a good idea.
>>>>>>>
>>>>>>> On Tue, 9 Sept 2025 at 10:22, Allison Wang <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I’d like to propose adding llms.txt files to the Spark
>>>>>>>> documentation.
>>>>>>>>
>>>>>>>> As more users rely on AI-assisted tools and LLMs to learn, write
>>>>>>>> Spark code, and troubleshoot issues, it’s increasingly important that
>>>>>>>> these tools point back to the up-to-date official documentation. This
>>>>>>>> will help improve code generation quality and make new Spark features
>>>>>>>> easier to discover. The emerging llms.txt convention
>>>>>>>> <https://llmstxt.org/> provides a lightweight way to curate
>>>>>>>> LLM-friendly manifests of key documentation links.
>>>>>>>>
>>>>>>>> Would love to hear your feedback!
>>>>>>>> SPIP:
>>>>>>>> https://docs.google.com/document/d/1tRYdNTrIs8-JTgDthQ-7kcxEG7S91mNUVmUOfevW-cE/edit?tab=t.0#heading=h.wq8o4rl94dvr
>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-53528
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Allison
>>>>>>>>
>>>>>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>
