Hi Peter,

Thanks for thinking about this! Improving logs is a great contribution.

My philosophy is to stick to the usual definitions of logging levels.
Here’s my quick summary:

   - FATAL: the event that caused the service to stop (not used in Iceberg,
   since it’s a library)
   - ERROR: an event that stops an operation
   - WARNING: an unexpected event where the operation can continue but
   there may be a bigger problem
   - INFO: events that help you understand what’s happening in normal
   operation
   - DEBUG: events that help you spot bugs by looking for unexpected values
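
To make those levels concrete, here is a rough sketch of how they might map
onto SLF4J calls in a library class. The class and the commit/retry flow are
made up for illustration; only the level choices matter.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical class; the commit/retry flow is invented for illustration.
    public class ExampleCommit {
      private static final Logger LOG = LoggerFactory.getLogger(ExampleCommit.class);

      void commit(String table, int attempt, int maxAttempts) {
        // INFO: a normal event that helps you follow what is happening
        LOG.info("Committing to table {} (attempt {}/{})", table, attempt, maxAttempts);
        try {
          doCommit(table); // stand-in for the real work
        } catch (RuntimeException e) {
          if (attempt < maxAttempts) {
            // WARNING: unexpected, but the operation can continue by retrying,
            // though it may point at a bigger problem
            LOG.warn("Commit attempt {} for {} failed, retrying", attempt, table, e);
          } else {
            // ERROR: this event stops the operation
            LOG.error("Commit failed for {} after {} attempts", table, maxAttempts, e);
            throw e;
          }
        }
      }

      private void doCommit(String table) {
        // placeholder for the actual commit work
      }
    }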

In general, I think that we can do better with INFO logging. But there is a
balance to strike, because we don’t want Iceberg to be so verbose that
people turn the logging off. We have to remember that Iceberg is typically
integrated into some other framework that is more important. Iceberg should
log the messages that help someone understand what the engine is doing,
rather than logging internal concerns. A good example of what not to do
comes from Parquet, where INFO messages were printed with stats from each
row group. That was too much: from the engine’s point of view, Parquet’s
contribution was simply storing data at the file level, so Parquet’s
internal concerns should not have leaked into the logs.
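
As a hypothetical contrast (the class and method names below are invented,
not actual Iceberg or Parquet code), the difference looks roughly like this:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ExampleWriter {
      private static final Logger LOG = LoggerFactory.getLogger(ExampleWriter.class);

      void finishWrite(String table, int fileCount) {
        // Engine-visible summary: one INFO line that says what was done to the table
        LOG.info("Wrote {} data files to table {}", fileCount, table);
      }

      void finishRowGroup(int index, long rows, long bytes) {
        // Library-internal detail: keep it below INFO so it doesn't flood the
        // engine's logs
        LOG.debug("Row group {}: {} rows, {} bytes", index, rows, bytes);
      }
    }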

I also don’t typically like to use DEBUG logging because you almost never
keep those logs. For debugging, I prefer to use a debugger and keep the
code clean. You could argue that DEBUG is appropriate for cases like my
Parquet example above, where INFO isn’t a good idea for the final
application but you may still want to log information about the operations
in a library. I think that’s fine, just not things like logging the values
of arguments to functions all over the codebase. Using DEBUG for files that
are being opened sounds reasonable.
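
For the file-open case, it could be as small as this (again, a hypothetical
reader class, just to show the shape):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ExampleFileReader {
      private static final Logger LOG = LoggerFactory.getLogger(ExampleFileReader.class);

      void open(String path, String format) {
        // DEBUG: helpful when chasing a bad read, but not part of the normal
        // picture of what the engine is doing; parameterized logging keeps the
        // cost negligible when DEBUG is off
        LOG.debug("Opening {} file {}", format, path);
      }
    }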

I think the additional logs you’re proposing mostly look good. The scan
configuration is logged in Iceberg’s BaseTableScan, so you should already
get snapshot ID and filters. I’m not sure that we keep the time. We also
have a system for emitting events with more detail. Those can be used for
additional logging if you choose. We send them through a Kafka pipeline so
we can analyze all the scans taking place in our data warehouse.
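
If it helps, here’s a rough sketch of hooking a logger into that event
system. I’m assuming the Listeners and ScanEvent classes in
org.apache.iceberg.events and their accessors here, so double-check the
exact method names against the current API:

    import org.apache.iceberg.events.Listeners;
    import org.apache.iceberg.events.ScanEvent;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ScanEventLogger {
      private static final Logger LOG = LoggerFactory.getLogger(ScanEventLogger.class);

      // Assumed API: Listeners.register(listener, eventType), with a ScanEvent
      // carrying the table name, snapshot ID, and filter for each planned scan.
      public static void install() {
        Listeners.register(
            (ScanEvent event) ->
                LOG.info("Scanned table {} at snapshot {} with filter {}",
                    event.tableName(), event.snapshotId(), event.filter()),
            ScanEvent.class);
      }
    }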

rb

On Wed, Aug 26, 2020 at 6:24 AM Peter Vary <pv...@cloudera.com.invalid>
wrote:

> Hi Team,
>
> I was wondering if there is a general best practice for using log levels
> in Iceberg, and what the usual way is that we do it.
>
> I have been playing around with Iceberg/Hive integration and was wondering
> how I would be able to debug a customer case based on the logs.
> To be honest, the answer based on the current code is most probably using
> some additional method to generate new log lines, which I think is not a
> good situation.
>
> I was thinking about adding some more logging to the code and added the
> proposed log level next to it:
>
>    - CatalogLoader/Catalogs:
>       - Which catalog is used to fetch the table - INFO
>       - What is the table being fetched in the end - INFO
>    - HiveCatalog:
>       - Table is created/dropped/renamed - INFO
>       - Catalog is created/dropped - INFO
>    - HiveTableOperation:
>       - Write committed / table created - INFO
>    - IcebergInputFormat:
>       - Scan parameters provided - DEBUG
>          - SnapshotId
>          - Time
>          - Filters
>       - Final snapshotId being read - INFO (if I find a way to get this)
>       - Opening specific files and their format - DEBUG
>
>
> Are the logs above compliant with the current practice? Should I move any
> of them up or down? Are there any places around Iceberg/Hive where you
> think it would be good to add logging?
>
> Thanks,
> Peter
>


-- 
Ryan Blue
Software Engineer
Netflix
