Hi Ryan,

Thanks for the detailed answer, and the confirmation! Created the pull request:
https://github.com/apache/iceberg/pull/1394 (Hive: Add logging for Hive related classes #1394)

One comment on the message below: one place where DEBUG level can be useful is when an issue is only reproducible on the customer side - they can turn on DEBUG logging and share the logs, since connecting with a debugger is not always desirable.

I have seen multiline log messages in Iceberg - these can make the logs hard to read when there are many concurrent users and the log lines start to mix up. I suggest collapsing them to a single line (a short sketch follows after the quoted thread below).

Thanks,
Peter

> On Aug 26, 2020, at 18:33, Ryan Blue <rb...@netflix.com.INVALID> wrote:
>
> Hi Peter,
>
> Thanks for thinking about this! Improving logs is a great contribution.
>
> My philosophy is to stick to the usual definitions of logging levels. Here’s my quick summary:
>
> FATAL: the event that caused the service to stop (not used in Iceberg, since it’s a library)
> ERROR: an event that stops an operation
> WARNING: an unexpected event where the operation can continue but there may be a bigger problem
> INFO: events that help you understand what’s happening in normal operation
> DEBUG: events that help you spot bugs by looking for unexpected values
>
> In general, I think that we can do better with INFO logging. But there is a balance to target because we don’t want Iceberg to be so verbose that people turn the logging off. We have to remember that Iceberg is typically integrated into some other framework that is more important. Iceberg should log the messages that help someone understand what the engine is doing, rather than logging internal concerns. A good example of what not to do comes from Parquet, where INFO messages were printed with stats from each row group — this was too much because Parquet’s contribution to the overall picture of what the engine was doing was just that it was storing data at the file level. Parquet internal concerns should not have leaked.
>
> I also don’t typically like to use DEBUG logging because you almost never keep those logs. For debugging, I prefer to use a debugger and keep the code clean. I think you could argue that DEBUG is appropriate for cases like my Parquet example above, where INFO isn’t a good idea for the final application, but you may still want to log information about the operations in a library. I think that’s fine, just not things like logging the value of arguments to functions all over the codebase. I think the use of DEBUG for files that are being opened sounds reasonable.
>
> I think the additional logs you’re proposing mostly look good. The scan configuration is logged in Iceberg’s BaseTableScan, so you should already get snapshot ID and filters. I’m not sure that we keep the time. We also have a system for emitting events with more detail. Those can be used for additional logging if you choose. We send them through a Kafka pipeline so we can analyze all the scans taking place in our data warehouse.
>
> rb
>
>
> On Wed, Aug 26, 2020 at 6:24 AM Peter Vary <pv...@cloudera.com.invalid> wrote:
> Hi Team,
>
> I was wondering if there is a general best practice for using log levels in Iceberg - what is the usual way we do it?
>
> I have been playing around with Iceberg/Hive integration and was wondering how I would be able to debug a customer case based on the logs.
> To be honest, the answer based on the current code is most probably by using some additional method to generate new log lines, which I think is not a good situation.
>
> I was thinking about adding some more logging to the code, with the proposed log level next to each item:
>
> CatalogLoader/Catalogs:
>   Which catalog is used to fetch the table - INFO
>   Which table is being fetched in the end - INFO
> HiveCatalog:
>   Table is created/dropped/renamed - INFO
>   Catalog is created/dropped - INFO
> HiveTableOperation:
>   Write committed / table created - INFO
> IcebergInputFormat:
>   Scan parameters provided - DEBUG
>     SnapshotId
>     Time
>     Filters
>   Final snapshotId being read - INFO (if I find a way to get this)
>   Opening specific files and their format - DEBUG
>
> Are the logs above compliant with the current practice? Shall I modify them up or down? Any places around Iceberg/Hive where you think it would be good to add logging?
>
> Thanks,
> Peter
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
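
P.S. A minimal sketch of what I mean by collapsing multiline messages into single, parameterized lines. This assumes SLF4J (which Iceberg already uses); the class, method, and field names below are only illustrative, not actual Iceberg code:

    // Illustrative sketch only - not actual Iceberg code.
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ScanLoggingSketch {
      private static final Logger LOG = LoggerFactory.getLogger(ScanLoggingSketch.class);

      void logScan(long snapshotId, String filters, String file, String format) {
        // Instead of a multiline message, which interleaves badly when many tasks log at once:
        //   LOG.debug("Scan parameters:\n  snapshotId={}\n  filters={}", snapshotId, filters);
        // emit one single-line, parameterized message per event:
        LOG.debug("Scan configured: snapshotId={}, filters={}", snapshotId, filters);
        LOG.info("Reading snapshot {}", snapshotId);
        LOG.debug("Opening file {} with format {}", file, format);
      }
    }

With everything on one line, grepping for a snapshot ID or file name still works even when several concurrent readers write to the same log.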