+1 to job and stage info in the SQL visualization. This is one of the most
difficult places for both users and our data platform team to understand.
We've resorted to logging the plan that is compiled in
`WholeStageCodegenExec` so at least we can go from a stage to what the plan
was, but there's…
Hi,
>I agree, it would be great if we could make the errors more clear about where
>the error happened (user code or in Spark code) and what assumption was
>violated. The problem is that this is a really hard thing to do generally,
>like Reynold said. I think we should look for individual cases…
> it would be fantastic if we could make it easier to debug Spark programs
without needing to rely on eager execution.
I agree, it would be great if we could make the errors more clear about
where the error happened (user code or in Spark code) and what assumption
was violated. The problem is that this is a really hard thing to do
generally, like Reynold said. I think we should look for individual cases…
To: Ryan Blue, Koert Kuipers, dev
Subject: Re: eager execution and debuggability
Marco,
There is understanding how Spark works, and there is finding bugs early in
their own program. One can perfectly understand how Spark works and still find
it valuable to get feedback asap, and that's why we built eager analysis in
the first place…
The repr() trick is neat when working on a notebook. When working in a
library, I used to use an evaluate(dataframe) -> DataFrame function that
simply forces the materialization of a dataframe. As Reynold mentions, this
is very convenient when working on a lot of chained UDFs, and it is a
standard…
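For reference, a minimal sketch of that kind of helper (the name `evaluate`, the cache-then-count approach, and the optional peek at a few rows are assumptions about the pattern, not an existing Spark API):

```python
from pyspark.sql import DataFrame

def evaluate(df: DataFrame, n: int = 5) -> DataFrame:
    """Force materialization of a DataFrame and return it unchanged.

    Caching and counting executes the plan built so far, so errors in
    chained UDFs surface at this call site instead of at the final write.
    """
    df = df.cache()
    df.count()                # triggers execution of the whole plan
    for row in df.take(n):    # optional: peek at a few rows while debugging
        print(row)
    return df
```

Dropping a call like this into a chain of transformations narrows down which step an analysis or UDF error actually comes from.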
Yes would be great if possible but it’s non trivial (might be impossible to
do in general; we already have stacktraces that point to line numbers when
an error occurs in UDFs but clearly that’s not sufficient). Also in
environments like REPL it’s still more useful to show the error as soon as it
occurs…
This may be technically impractical, but it would be fantastic if we could
make it easier to debug Spark programs without needing to rely on eager
execution. Sprinkling .count() and .checkpoint() at various points in my
code is still a debugging technique I use, but it always makes me wish
Spark could…
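For illustration, the technique being described looks roughly like this; the toy data, the UDF, and the output path below are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy pipeline, for illustration only.
raw = spark.createDataFrame([("1",), ("2",), (None,)], ["value"])
parse = udf(lambda s: int(s), IntegerType())

cleaned = raw.filter("value IS NOT NULL")
cleaned.count()                     # force evaluation: does the filter step work?

parsed = cleaned.withColumn("n", parse("value"))
parsed.count()                      # force evaluation: do the UDFs blow up here?

# Without the counts above, every error would surface only at this write.
parsed.write.mode("overwrite").parquet("/tmp/debug_out")
```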
I've opened SPARK-24215 to track this.
On Tue, May 8, 2018 at 3:58 PM, Reynold Xin wrote:
> Yup. Sounds great. This is something simple Spark can do and provide huge
> value to the end users.
>
>
> On Tue, May 8, 2018 at 3:53 PM Ryan Blue wrote:
>
>> Would be great if it is something more turn-key…
Yup. Sounds great. This is something simple Spark can do and provide huge
value to the end users.
On Tue, May 8, 2018 at 3:53 PM Ryan Blue wrote:
> Would be great if it is something more turn-key.
>
> We can easily add the __repr__ and _repr_html_ methods and behavior to
> PySpark classes. We could…
Would be great if it is something more turn-key.
We can easily add the __repr__ and _repr_html_ methods and behavior to
PySpark classes. We could also add a configuration property to determine
whether the dataset evaluation is eager or not. That would make it turn-key
for anyone running PySpark in…
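As a rough sketch of what the turn-key behavior could look like if wired up from user code (the configuration key below is hypothetical; a built-in version would define its own property and patch the class inside PySpark itself):

```python
from pyspark.sql import DataFrame, SparkSession

_default_repr = DataFrame.__repr__

def _maybe_eager_repr(self):
    spark = SparkSession.builder.getOrCreate()
    # Hypothetical property gating eager evaluation; not an existing Spark config.
    if spark.conf.get("spark.sql.eagerEval.enabled", "false").lower() == "true":
        # Eager: execute the plan on a small sample and show the rows.
        return "\n".join(str(row) for row in self.take(10))
    return _default_repr(self)      # lazy: keep the schema-only repr

DataFrame.__repr__ = _maybe_eager_repr
```

With the property off, nothing changes; with it on, simply typing a DataFrame's name in a REPL runs the plan on a handful of rows.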
s/underestimated/overestimated/
On Tue, May 8, 2018 at 3:44 PM Reynold Xin wrote:
> Marco,
>
> There is understanding how Spark works, and there is finding bugs early in
> their own program. One can perfectly understand how Spark works and still
> find it valuable to get feedback asap, and that's why we built eager
> analysis in the first place…
Marco,
There is understanding how Spark works, and there is finding bugs early in
their own program. One can perfectly understand how Spark works and still
find it valuable to get feedback asap, and that's why we built eager
analysis in the first place.
Also I'm afraid you've significantly underestimated…
I am not sure how this is useful. For students, it is important to
understand how Spark works. This can be critical in many decisions they have
to take (whether and what to cache, for instance) in order to have a
performant Spark application. Creating an eager execution probably can help
them have some…
At Netflix, we use Jupyter notebooks and consoles for interactive sessions.
For anyone interested, this mode of interaction is really easy to add in
Jupyter and PySpark. You would just define a different _repr_html_ or __repr__
method for Dataset that runs a take(10) or take(100) and formats the result…
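A minimal sketch of that registration, assuming a plain take(10) rendered as a small HTML table (the method body and formatting are illustrative only):

```python
import html
from pyspark.sql import DataFrame

def _dataframe_repr_html_(self, n=10):
    # Run take(n) and render the rows as an HTML table for Jupyter.
    rows = self.take(n)
    header = "".join("<th>%s</th>" % html.escape(c) for c in self.columns)
    body = "".join(
        "<tr>%s</tr>" % "".join("<td>%s</td>" % html.escape(str(v)) for v in row)
        for row in rows
    )
    return "<table><tr>%s</tr>%s</table>" % (header, body)

# Jupyter's display machinery picks this method up automatically.
DataFrame._repr_html_ = _dataframe_repr_html_
```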
yeah we run into this all the time with new hires. they will send emails
explaining there is an error in the .write operation and they are debugging
the writing to disk, focusing on that piece of code :)
unrelated, but another frequent cause for confusion is cascading errors.
like the FetchFailedException…