Should this just be an LLM-facing index page of the Spark docs? Given the
number of APIs Spark provides today, I think this index page would be useful
to humans as well.

On Wed, Sep 10, 2025 at 10:46 PM Dongjoon Hyun <[email protected]>
wrote:

> Thank you, Allison and Hyukjin.
>
> IIUC, this proposal is not about a single file. The SPIP already lists
> multiple files, which may double our documentation and website size
> (or more in the worst case) because it is simply a duplication of the
> content. If we start to use AI tools to generate these llms.txt files, they
> could be much bigger than the original.
>
> *** From SPIP ***
> - [PySpark (Python)](
> https://spark.apache.org/docs/latest/api/python/llms.txt)
> - [Scala](https://spark.apache.org/docs/latest/api/scala/llms.txt)
> - [4.0.0 docs hub](
> https://archive.apache.org/dist/spark/docs/4.0.0/llms.txt)
> ***
>
> Since the size of Apache Spark 4.1.0-preview1 documentation is 1.2GB,
> could you propose a systematic way to always keep the total size of newly
> added llms.txt files under 10MB, Allison? If we don't have full
> control, this duplication will break the ASF Spark website like
> last year. We already had to archive old Spark documents from the
> original website location to "https://archive.apache.org/dist/spark/" due
> to the CI outage.
>
> $ du -h 4.1.0-preview1 | tail -n1
> 1.2G 4.1.0-preview1
>
> The bottom line is that we need a clear hard limit for this newly
> proposed duplication for machine-friendly metadata. If we have a systematic
> way to enforce an upper bound of less than 10MB in total per Spark version
> (now and forever), it sounds like a good addition.
>
> Thanks,
> Dongjoon.
>
>
> On Tue, Sep 9, 2025 at 7:19 PM Allison Wang <[email protected]>
> wrote:
>
>> Yes, that’s right. It’s essentially just one markdown file to start with,
>> and we can add more later for language or version specific files if needed.
>>
>> On Tue, Sep 9, 2025 at 4:32 PM Hyukjin Kwon <[email protected]> wrote:
>>
>>> So it's basically adding one text file for LLMs, right? I think it's a
>>> good idea.
>>>
>>> On Tue, 9 Sept 2025 at 10:22, Allison Wang <[email protected]>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I’d like to propose adding llms.txt files to the Spark documentation.
>>>>
>>>> As more users rely on AI-assisted tools and LLMs to learn, write Spark
>>>> code, and troubleshoot issues, it’s increasingly important that these tools
>>>> point back to the up-to-date official documentation. This will help
>>>> improve code generation quality and make new Spark features easier to
>>>> discover. The emerging llms.txt convention <https://llmstxt.org/>
>>>> provides a lightweight way to curate LLM-friendly manifests of key
>>>> documentation links.
>>>>
>>>> Would love to hear your feedback!
>>>> SPIP:
>>>> https://docs.google.com/document/d/1tRYdNTrIs8-JTgDthQ-7kcxEG7S91mNUVmUOfevW-cE/edit?tab=t.0#heading=h.wq8o4rl94dvr
>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-53528
>>>>
>>>> Thanks,
>>>> Allison
>>>>
>>>
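
The hard limit requested in the thread (under 10MB of llms.txt content per Spark version) could be checked mechanically at docs-build time. The following is only a minimal sketch under stated assumptions: the helper names `total_llms_txt_bytes` and `within_limit` are hypothetical, the files are assumed to be named exactly `llms.txt`, and the docs root passed in is whatever directory holds one version's generated docs. It is not part of the SPIP or of Spark's build.

```python
from pathlib import Path

# Assumed cap from the thread: under 10 MB per Spark version, all llms.txt files combined.
LIMIT_BYTES = 10 * 1024 * 1024


def total_llms_txt_bytes(docs_root):
    """Sum the sizes of every file named llms.txt under docs_root (recursively)."""
    return sum(
        p.stat().st_size
        for p in Path(docs_root).rglob("llms.txt")
        if p.is_file()
    )


def within_limit(docs_root, limit_bytes=LIMIT_BYTES):
    """Return True when the combined llms.txt size is strictly under the cap."""
    return total_llms_txt_bytes(docs_root) < limit_bytes
```

A build step could call `within_limit("4.1.0-preview1")` and fail the docs build when it returns False, so the cap holds no matter how the files are generated.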
