Re: [PR] [SPARK-48752][PYTHON][CONNECT][DOCS] Introduce `pyspark.logger` for improved structured logging for PySpark [spark]

via GitHub Tue, 16 Jul 2024 00:06:03 -0700


itholic commented on code in PR #47145:
URL: https://github.com/apache/spark/pull/47145#discussion_r1678866930



##########
python/docs/source/development/logger.rst:
##########
@@ -0,0 +1,151 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+..    http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+==================
+Logging in PySpark
+==================
+
+.. currentmodule:: pyspark.logger
+
+Introduction
+============
+
+The :ref:`pyspark.logger</reference/pyspark.logger.rst>` module facilitates 
structured client-side logging for PySpark users.
+
+This module includes a :class:`PySparkLogger` class that provides several 
methods for logging messages at different levels in a structured JSON format:
+
+- :meth:`PySparkLogger.log_info`
+- :meth:`PySparkLogger.log_warn`
+- :meth:`PySparkLogger.log_error`
+
+The logger can be easily configured to write logs to either the console or a 
specified file.
+
+Customizing Log Format
+======================
+The default log format is JSON, which includes the timestamp, log level, 
logger name, and the log message along with any additional context provided.
+
+Example log entry:
+
+.. code-block:: python
+
+    {
+      "ts": "2024-06-28T10:53:48.528Z",
+      "level": "ERROR",
+      "logger": "DataFrameQueryContextLogger",
+      "msg": "[DIVIDE_BY_ZERO] Division by zero.",
+      "context": {
+        "file": "/path/to/file.py",

Review Comment:
   @gengliangwang I just updated the PR. The log now includes an `exception` 
field, as shown below, when an error occurs:
   
   ```json
   {
     "ts": "2024-06-28 19:53:48,563",
     "level": "ERROR",
     "logger": "DataFrameQueryContextLogger",
     "msg": "[DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate 
divisor being 0 and return NULL instead. If necessary set 
\"spark.sql.ansi.enabled\" to \"false\" to bypass this error. SQLSTATE: 
22012\n== DataFrame ==\n\"__truediv__\" was called 
from\n/.../spark/python/test_error_context.py:17\n", 
     "context": {
       "file": "/.../spark/python/test_error_context.py",
       "line_no": "17",
       "fragment": "__truediv__",
       "error_class": "DIVIDE_BY_ZERO"
     },
     "exception": {
       "class": "Py4JJavaError",
       "msg": "An error occurred while calling o52.showString.\n: 
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. 
Use `try_divide` to tolerate divisor being 0 and return NULL instead. If 
necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. 
SQLSTATE: 22012\n== DataFrame ==\n\"__truediv__\" was called 
from\n/Users/haejoon.lee/Desktop/git_repos/spark/python/test_error_context.py:22\n\n\tat
 
org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)\n\tat
 
org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)\n\tat
 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
 Source)\n\tat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)\n\tat 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)\n\tat 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)\n\tat
 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)\n\tat
 org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)\n\tat 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:896)\n\tat
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)\n\tat 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)\n\tat 
org.apache.spark.rdd.RDD.iterator(RDD.scala:333)\n\tat 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)\n\tat 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)\n\tat 
org.apache.spark.scheduler.Task.run(Task.scala:146)\n\tat 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)\n\tat
 org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)\n\tat 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)\n\tat
 org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)\n\tat 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)\n\tat 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat
 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat
 java.base/java.lang.Thread.run(Thread.java:840)\n\tat 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1007)\n\tat 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)\n\tat 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2479)\n\tat 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)\n\tat 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2523)\n\tat 
org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1052)\n\tat 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)\n\tat 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)\n\tat
 org.apache.spark.rdd.RDD.withScope(RDD.scala:412)\n\tat 
org.apache.spark.rdd.RDD.collect(RDD.scala:1051)\n\tat 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)\n\tat
 org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4449)\n\tat 
org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3393)\n\tat 
org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4439)\n\tat 
org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)\n\tat
 org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4437)\n\tat 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:154)\n\tat
 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:263)\n\tat
 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:118)\n\tat
 org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)\n\tat 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:74)\n\tat
 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:218)\n\tat
 org.apache.spark.sql.Dataset.withAction(Dataset.scala:4437)\n\tat 
org.apache.spark.sql.Dataset.head(Dataset.scala:3393)\n\tat 
org.apache.spark.sql.Dataset.take(Dataset.scala:3626)\n\tat 
org.apache.spark.sql.Dataset.getRows(Dataset.scala:294)\n\tat 
org.apache.spark.sql.Dataset.showString(Dataset.scala:330)\n\tat 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)\n\tat 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)\n\tat
 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat
 java.base/java.lang.reflect.Method.invoke(Method.java:568)\n\tat 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\n\tat 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)\n\tat 
py4j.Gateway.invoke(Gateway.java:282)\n\tat 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\n\tat 
py4j.commands.CallCommand.execute(CallCommand.java:79)\n\tat 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)\n\tat
 py4j.ClientServerConnection.run(ClientServerConnection.java:106)\n\tat 
java.base/java.lang.Thread.run(Thread.java:840)\n",
       "stacktrace": ["Traceback (most recent call last):", "  File 
\"/Users/haejoon.lee/Desktop/git_repos/spark/python/pyspark/errors/exceptions/captured.py\",
 line 272, in deco", "    return f(*a, **kw)", "  File 
\"/Users/haejoon.lee/anaconda3/envs/pyspark-dev-env/lib/python3.9/site-packages/py4j/protocol.py\",
 line 326, in get_return_value", "    raise Py4JJavaError(", 
"py4j.protocol.Py4JJavaError: An error occurred while calling o52.showString.", 
": org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by 
zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If 
necessary set \"spark.sql.ansi.enabled\" to \"false\" to bypass this error. 
SQLSTATE: 22012", "== DataFrame ==", "\"__truediv__\" was called from", 
"/Users/haejoon.lee/Desktop/git_repos/spark/python/test_error_context.py:22", 
"", "\tat 
org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:203)",
 "\tat org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)", "\tat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
 Source)", "\tat 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)", "\tat 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)",
 "\tat 
org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)",
 "\tat 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:388)",
 "\tat 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:896)", 
"\tat 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:896)",
 "\tat 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)", 
"\tat org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:369)", 
"\tat org.apache.spark.rdd.RDD.iterator(RDD.scala:333)", "\tat 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)", "\tat 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:171)", 
"\tat org.apache.spark.scheduler.Task.run(Task.scala:146)", "\tat 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$5(Executor.scala:644)",
 "\tat 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)",
 "\tat 
org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)",
 "\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:99)", "\tat 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:647)", "\tat 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)",
 "\tat 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)",
 "\tat java.base/java.lang.Thread.run(Thread.java:840)", "\tat 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:1007)", "\tat 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2458)", "\tat 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2479)", "\tat 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2498)", "\tat 
org.apache.spark.SparkContext.runJob(SparkContext.scala:2523)", "\tat 
org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1052)", "\tat 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)",
 "\tat 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)",
 "\tat org.apache.spark.rdd.RDD.withScope(RDD.scala:412)", "\tat 
org.apache.spark.rdd.RDD.collect(RDD.scala:1051)", "\tat 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:448)", 
"\tat org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4449)", "\tat 
org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3393)", "\tat 
org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4439)", "\tat 
org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:599)",
 "\tat org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4437)", 
"\tat 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$6(SQLExecution.scala:154)",
 "\tat 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:263)",
 "\tat 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId0$1(SQLExecution.scala:118)",
 "\tat org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:923)", 
"\tat 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId0(SQLExecution.scala:74)",
 "\tat 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:218)",
 "\tat org.apache.spark.sql.Dataset.withAction(Dataset.scala:4437)", "\tat 
org.apache.spark.sql.Dataset.head(Dataset.scala:3393)", "\tat 
org.apache.spark.sql.Dataset.take(Dataset.scala:3626)", "\tat 
org.apache.spark.sql.Dataset.getRows(Dataset.scala:294)", 
"\tat org.apache.spark.sql.Dataset.showString(Dataset.scala:330)", "\tat 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method)", "\tat 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)",
 "\tat 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)",
 "\tat java.base/java.lang.reflect.Method.invoke(Method.java:568)", "\tat 
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)", "\tat 
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)", "\tat 
py4j.Gateway.invoke(Gateway.java:282)", "\tat 
py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)", "\tat 
py4j.commands.CallCommand.execute(CallCommand.java:79)", "\tat 
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)", 
"\tat py4j.ClientServerConnection.run(ClientServerConnection.java:106)", "\tat 
java.base/java.lang.Thread.run(Thread.java:840)"]
     }
   }
   ```
   
   FYI: Python's logger provides a way to get detailed exception information, 
so we can just leverage it :-)
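
For illustration only, here is a minimal sketch of how the standard `logging` module exposes exception details to a formatter. It is not the implementation in this PR: `JSONExceptionFormatter`, `demo`, and the logger name `ExampleLogger` are hypothetical names, and the field layout simply mirrors the sample above.

```python
import io
import json
import logging
import traceback


class JSONExceptionFormatter(logging.Formatter):
    """Hypothetical formatter: renders a record as JSON with an `exception` field."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # When a logging call passes exc_info=True, the logging module attaches
        # the active (type, value, traceback) tuple to the record.
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc_value, exc_tb = record.exc_info
            formatted = traceback.format_exception(exc_type, exc_value, exc_tb)
            entry["exception"] = {
                "class": exc_type.__name__,
                "msg": str(exc_value),
                # One list element per line of the formatted traceback.
                "stacktrace": "".join(formatted).splitlines(),
            }
        return json.dumps(entry)


def demo() -> dict:
    """Log a ZeroDivisionError to an in-memory stream and return the parsed entry."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JSONExceptionFormatter())

    logger = logging.getLogger("ExampleLogger")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        1 / 0
    except ZeroDivisionError:
        logger.error("Division failed", exc_info=True)
    finally:
        logger.removeHandler(handler)
    return json.loads(stream.getvalue())


if __name__ == "__main__":
    print(json.dumps(demo(), indent=2))
```

Because the exception tuple travels on the `LogRecord` itself, a formatter along these lines needs no extra arguments from the caller beyond `exc_info=True`.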



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

