Hi Ryan,

Thanks for the detailed answer, and the confirmation! Created the pull request:
https://github.com/apache/iceberg/pull/1394 (Hive: Add logging for Hive related classes #1394)

One comment on the message below: one place where DEBUG level can be useful is when an issue is only reproducible on the customer side - they can turn on DEBUG logging and share the logs, since connecting with a debugger is not always desirable.

I have seen multiline log messages in Iceberg - these can make the logs hard to read when there are many concurrent users and the log lines start to mix up. I suggest collapsing them to a single line (a short sketch follows after the quoted thread below).

Thanks,
Peter

> On Aug 26, 2020, at 18:33, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Hi Peter,
>
> Thanks for thinking about this! Improving logs is a great contribution.
>
> My philosophy is to stick to the usual definitions of logging levels. Here’s my quick summary:
>
> FATAL: the event that caused the service to stop (not used in Iceberg, since it’s a library)
> ERROR: an event that stops an operation
> WARNING: an unexpected event where the operation can continue but there may be a bigger problem
> INFO: events that help you understand what’s happening in normal operation
> DEBUG: events that help you spot bugs by looking for unexpected values
>
> In general, I think that we can do better with INFO logging. But there is a balance to target because we don’t want Iceberg to be so verbose that people turn the logging off. We have to remember that Iceberg is typically integrated into some other framework that is more important. Iceberg should log the messages that help someone understand what the engine is doing, rather than logging internal concerns. A good example of what not to do comes from Parquet, where INFO messages were printed with stats from each row group — this was too much because Parquet’s contribution to the overall picture of what the engine was doing was just that it was storing data at the file level. Parquet internal concerns should not have leaked.
>
> I also don’t typically like to use DEBUG logging because you almost never keep those logs. For debugging, I prefer to use a debugger and keep the code clean. I think you could argue that DEBUG is appropriate for cases like my Parquet example above, where INFO isn’t a good idea for the final application, but you may still want to log information about the operations in a library. I think that’s fine, just not things like logging the value of arguments to functions all over the codebase. I think the use of DEBUG for files that are being opened sounds reasonable.
>
> I think the additional logs you’re proposing mostly look good. The scan configuration is logged in Iceberg’s BaseTableScan, so you should already get snapshot ID and filters. I’m not sure that we keep the time. We also have a system for emitting events with more detail. Those can be used for additional logging if you choose. We send them through a Kafka pipeline so we can analyze all the scans taking place in our data warehouse.
>
> rb
>
>
> On Wed, Aug 26, 2020 at 6:24 AM Peter Vary <pv...@cloudera.com.invalid> wrote:
> Hi Team,
>
> I was wondering if there is a general best practice for using log levels in Iceberg - what is the usual way we do it?
>
> I have been playing around with Iceberg/Hive integration and was wondering how I would be able to debug a customer case based on the logs.
> To be honest, the answer based on the current code is most probably by using some additional method to generate new log lines, which I think is not a good situation.
>
> I was thinking about adding some more logging to the code, with the proposed log level next to each item:
>
> CatalogLoader/Catalogs:
>   Which catalog is used to fetch the table - INFO
>   Which table is being fetched in the end - INFO
> HiveCatalog:
>   Table is created/dropped/renamed - INFO
>   Catalog is created/dropped - INFO
> HiveTableOperation:
>   Write committed / table created - INFO
> IcebergInputFormat:
>   Scan parameters provided - DEBUG
>     SnapshotId
>     Time
>     Filters
>   Final snapshotId being read - INFO (if I find a way to get this)
>   Opening specific files and their format - DEBUG
>
> Are the logs above compliant with the current practice? Shall I modify them up or down? Any places around Iceberg/Hive where you think it would be good to add logging?
>
> Thanks,
> Peter
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
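
P.S. A minimal sketch of what I mean by collapsing multiline messages into single, parameterized lines. This assumes SLF4J (which Iceberg already uses); the class, method, and field names below are only illustrative, not actual Iceberg code:

    // Illustrative sketch only - not actual Iceberg code.
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ScanLoggingSketch {
      private static final Logger LOG = LoggerFactory.getLogger(ScanLoggingSketch.class);

      void logScan(long snapshotId, String filters, String file, String format) {
        // Instead of a multiline message, which interleaves badly when many tasks log at once:
        //   LOG.debug("Scan parameters:\n  snapshotId={}\n  filters={}", snapshotId, filters);
        // emit one single-line, parameterized message per event:
        LOG.debug("Scan configured: snapshotId={}, filters={}", snapshotId, filters);
        LOG.info("Reading snapshot {}", snapshotId);
        LOG.debug("Opening file {} with format {}", file, format);
      }
    }

With everything on one line, grepping for a snapshot ID or file name still works even when several concurrent readers write to the same log.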