Hi Martin,

Yeah, we should be more deliberate about when to use Structured Logging. Let
me start with when people prefer plain text logs:
- Spark engine developers like us. When running tests, the logs are printed
to the console, and plain text logs are more human-readable.
- Spark users who prefer to read the logs manually due to the lack of infra
support.
- Spark users who already have decent log infra built on plain text logs.

In general, I think Structured Logging should be used when users want to
build infra that consumes logs by machine, or when they want to switch their
existing infra to JSON logs. Both need non-trivial work, so turning on
Structured Logging by default won't provide much value, but it does hurt UX
for people who still prefer plain text logs.
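
To make that concrete, here is a rough sketch of the machine-consumption
side, building on the spark.read.schema(SPARK_LOG_SCHEMA) snippet quoted
below (assumptions: a spark-shell session, the schema constant living in
org.apache.spark.util.LogUtils, and `level`/`logger` being fields of the
structured log schema):

  import org.apache.spark.sql.functions.col
  import org.apache.spark.util.LogUtils.SPARK_LOG_SCHEMA

  // Load the collected JSON log files as a DataFrame and summarize
  // error events per logger.
  val logs = spark.read.schema(SPARK_LOG_SCHEMA).json("path/to/logs")
  logs.filter(col("level") === "ERROR")
      .groupBy(col("logger"))
      .count()
      .show(truncate = false)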

On Sat, Nov 23, 2024 at 9:09 AM Mridul Muralidharan <mri...@gmail.com>
wrote:

> +1 to defaulting to text logs !
>
> Regards,
> Mridul
>
> On Fri, Nov 22, 2024 at 6:21 PM Gengliang Wang <ltn...@gmail.com> wrote:
>
>> Hi all,
>>
>> Earlier this year, we introduced JSON logging as the default in Spark
>> with the aim of enhancing log structure and facilitating better analysis.
>> While this change was made with the best intentions, we've collectively
>> observed some practical challenges that impact usability.
>>
>>    *Key Observations:*
>>
>>    1. *Human Readability*
>>       - *Cumbersome Formatting*: The JSON format, with its quotes and
>>       braces, has proven less readable for direct log inspection.
>>       - *Limitations of Pretty-Printing*: As noted in the Log4j
>>       documentation
>>       <https://logging.apache.org/log4j/2.x/manual/json-template-layout.html>,
>>       pretty-printing JSON logs isn't feasible due to performance
>>       concerns.
>>       - *Difficult Interpretation*: Elements like logical plans and
>>       stack traces are rendered as single-line strings with embedded
>>       newline (\n) characters, making quick interpretation challenging.
>>       An example of a side-by-side plan comparison after setting
>>       spark.sql.planChangeLog.level=info:
>>       [image: side-by-side plan comparison in JSON logs]
>>    2. *Lack of Log Centralization Tools*
>>       - Although we can programmatically analyze logs using
>>       spark.read.schema(SPARK_LOG_SCHEMA).json("path/to/logs"), there is
>>       currently a lack of open-source tools to easily centralize and
>>       manage these logs across Drivers, Executors, Masters, and Workers.
>>       This limits the practical benefits we hoped to achieve with JSON
>>       logging.
>>    3. *Consistency and Timing*
>>       - Since Spark 4.0 has yet to be released, we have an opportunity
>>       to maintain consistency with previous versions by reverting to
>>       plain text logs as the default. This doesn't close the door on
>>       structured logging; we can revisit this decision in future
>>       releases as the ecosystem matures and more supportive tools
>>       become available.
>>
>> Given these considerations, I support Wenchen's proposal to switch back
>> to plain text logs by default in Spark 4.0. Our goal is to provide the best
>> possible experience for our users, and adjusting our approach based on
>> real-world feedback is a part of that process.
>>
>> I'm looking forward to hearing your thoughts and discussing how we can
>> continue to improve our logging practices.
>>
>> Best regards,
>>
>> Gengliang Wang
>>
>> On Fri, Nov 22, 2024 at 3:32 PM bo yang <bobyan...@gmail.com> wrote:
>>
>>> +1 for defaulting to plain text logging. It is good for simple usage
>>> scenarios and will also be more friendly to first-time Spark users.
>>>
>>> And different companies may already have built tooling to process Spark
>>> logs. Using plain text by default will let those existing tools continue
>>> to work.
>>>
>>>
>>> On Friday, November 22, 2024, serge rielau.com <se...@rielau.com> wrote:
>>>
>>>> It doesn’t have to be very easy. It just has to be easier than
>>>> maintaining two infrastructures forever.
>>>> If we can’t easily parse the JSON log to emit the existing text
>>>> content, I’d say we have a bigger problem.
>>>>
>>>> On Nov 22, 2024 at 2:17 PM -0800, Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com>, wrote:
>>>>
I'm not sure it is very easy to provide a reader (I mean, a viewer); it
would mostly be not a reader but a post-processor that converts the JSON
formatted log to a plain text log. Only after that would users get the
"same" UI/UX they had when dealing with log files in Spark 3.x. For people
who do not really need to structure the log and just want to read it their
own way (I'm a lover of grep), a JSON formatted log by default is a UI/UX
regression.
>>>>
The JSON formatted log is definitely useful, but also definitely not
human friendly. It is mostly only useful for users who have built an
ecosystem around Spark in which humans never need to read the log as JSON.
I'm not sure we can or want to force users to build such an ecosystem just
to use Spark; for me, it's a lot easier for users to have both options and
turn on the config when they need it.
>>>>
>>>> +1 on Wenchen's proposal.
>>>>
>>>> On Sat, Nov 23, 2024 at 12:36 AM serge rielau.com <se...@rielau.com>
>>>> wrote:
>>>>
>>>>> Shouldn’t we differentiate between the logging and the reading of the
>>>>> log?
>>>>> The problem appears to be in the presentation layer.
>>>>> We could provide a basic log reader, instead of supporting two
>>>>> different ways to log long-term.
>>>>>
>>>>>
>>>>> On Nov 22, 2024, at 6:37 AM, Martin Grund
>>>>> <mar...@databricks.com.INVALID> wrote:
>>>>>
>>>>> I'm generally supportive of this direction. However, I'm wondering if
>>>>> we can be more deliberate about when to use it. For example, for the
>>>>> common scenarios that you mention as "light" usage, we should switch
>>>>> to plain text logging.
>>>>>
>>>>> IMO, this would cover the cases where a user simply runs the pyspark
>>>>> or spark-shell scripts. For these use cases, most users will probably
>>>>> prefer plain text logging. Maybe we should even go one step further
>>>>> and have some default console filters that use color output for these
>>>>> interactive use cases? And make it more readable in general?
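>>>>>
>>>>> As a starting point, something like this fragment of
>>>>> conf/log4j2.properties could color the console output by log level
>>>>> (just a sketch: %highlight is a stock Log4j PatternLayout converter,
>>>>> and the pattern below mirrors Spark's log4j2.properties.template;
>>>>> ANSI color support may need extra setup on some platforms):
>>>>>
>>>>>   appender.console.type = Console
>>>>>   appender.console.name = console
>>>>>   appender.console.target = SYSTEM_ERR
>>>>>   appender.console.layout.type = PatternLayout
>>>>>   appender.console.layout.pattern = %highlight{%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n}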
>>>>>
>>>>> For the regular spark-submit-based job submissions, I would actually
>>>>> say that the benefits outweigh the potential complexity.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> On Fri, Nov 22, 2024 at 3:26 PM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm writing this email to propose switching back to the previous
>>>>>> plain text logs by default, for the following reasons:
>>>>>>
>>>>>>    - The JSON log is not very human-readable. It's more verbose than
>>>>>>    plain text, and newlines become `\n`, making query plan tree
>>>>>>    strings and error stack traces very hard to read.
>>>>>>    - Structured Logging is not available out of the box. Users must
>>>>>>    set up a log pipeline to collect the JSON log files on drivers and
>>>>>>    executors first. Turning it on by default doesn't provide much
>>>>>>    value.
>>>>>>
>>>>>> Some examples of the hard-to-read JSON log:
>>>>>> [image: hard-to-read JSON log examples]
>>>>>>
>>>>>> For the good of Spark engine developers and light Spark users, I
>>>>>> think the previous plain text log is a better choice. We can add a
>>>>>> doc page to introduce how to use Structured Logging: turn on the
>>>>>> config, collect JSON log files, and run queries.
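>>>>>>
>>>>>> For reference, the "turn on the config" step should stay a
>>>>>> one-liner, e.g. in conf/spark-defaults.conf (assuming the flag keeps
>>>>>> its current name in master):
>>>>>>
>>>>>>   spark.log.structuredLogging.enabled  true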
>>>>>>
>>>>>> Please let me know if you share the same feelings or have different
>>>>>> opinions.
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>
>>>>>
