Hello Dylan,

Thank you for your help. The result does look formatted after making the change.
However, from the following code, I was expecting RDD types like MappedRDD
and FilteredRDD to be present in the lineage. Instead, I can only see
PythonRDD and ParallelCollectionRDD in the lineage [I am running in local
mode].

`sc.parallelize([1,2,3,3]).map(lambda x:x**2).filter(lambda x:x>5).count()`

Note: I also tried setting the logLineage property to true, but it did not
yield any additional details in the log.
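
For reference, a minimal sketch of how the lineage and partition count for
this pipeline can be printed. The helper names `lineage_text` and
`show_lineage` are hypothetical (they are not part of PySpark), and the
sketch assumes `sc` is an already-created SparkContext:

```python
def lineage_text(debug):
    """Return the result of toDebugString() as printable text.

    Hypothetical helper, not a PySpark API. It assumes PySpark hands
    back UTF-8 bytes; a plain str is passed through unchanged and
    None becomes an empty string.
    """
    if debug is None:
        return ""
    if isinstance(debug, bytes):
        return debug.decode("utf-8")
    return debug


def show_lineage(sc):
    """Build the pipeline from this thread, print its lineage, return the count."""
    rdd = (sc.parallelize([1, 2, 3, 3])
             .map(lambda x: x ** 2)
             .filter(lambda x: x > 5))
    # Decoding first is what makes the multi-line lineage print readably.
    print(lineage_text(rdd.toDebugString()))
    # The partition count is read from the RDD itself.
    print("partitions:", rdd.getNumPartitions())
    return rdd.count()
```

With a local context, for example `sc = SparkContext("local[2]", "lineage-demo")`,
calling `show_lineage(sc)` prints the decoded lineage followed by the
partition count.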

Thanks,
Kanchan

On Sun, Apr 21, 2019 at 12:11 AM Dylan Guedes <djmggue...@gmail.com> wrote:

> Kanchan,
> the `toDebugString` output looks unformatted because in some scenarios you
> need to decode it first (can't remember the reason, though). I suggest you
> print the RDD lineage using
> `print(rdd.toDebugString().decode("utf-8"))` instead (note: this only
> occurs in PySpark).
>
> About the other question, you may use `getNumPartitions`.
>
> On Sat, Apr 20, 2019 at 2:40 PM kanchan tewary <kanchan.tew...@gmail.com>
> wrote:
>
>> Dear All,
>>
>> Greetings!
>>
>> I am new to Apache Spark and working on RDDs using PySpark. I am trying
>> to understand the logical plan provided by the toDebugString function, but I
>> find two issues: a) the output is not formatted when I print the result, and
>> b) I do not see the number of partitions shown.
>>
>> Can anyone direct me to any reference documentation to understand the
>> logical plan better? Or do you suggest using the DAG from the Spark UI instead?
>>
>>
>> Thanks & Best Regards,
>> Kanchan
>> Data Engineer, IBM
>>
>

-- 
Thanks & Best Regards,
Kanchan
