[ https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maciej Bryński updated SPARK-16569:
-----------------------------------

Description:

CC: [~davies]

Many of the operations I run look like:
{code}
dataframe.rdd.map(some_function)
{code}
In PySpark this means creating a Row object for every record, and that is slow.

IDEA: use Cython to speed up the PySpark internals.

What do you think?

Sample profile:
{code}
============================================================
Profile of RDD<id=9>
============================================================
         2000373036 function calls (2000312850 primitive calls) in 2045.307 seconds

   Ordered by: internal time, cumulative time

      ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       14948  427.117    0.029 1811.622    0.121 {built-in method loads}
   199920000  402.086    0.000  937.045    0.000 types.py:1162(_create_row)
   199920000  262.708    0.000  262.708    0.000 {built-in method __new__ of type object at 0x9d1c40}
   199920000  190.908    0.000 1219.794    0.000 types.py:558(fromInternal)
   199920000  153.611    0.000  153.611    0.000 types.py:1280(__setattr__)
   199920197  145.022    0.000 2024.126    0.000 rdd.py:1004(<genexpr>)
   199920000  118.640    0.000  381.348    0.000 types.py:1194(__new__)
   199920000  101.272    0.000 1321.067    0.000 types.py:1159(<lambda>)
   200189064   91.928    0.000   91.928    0.000 {built-in method isinstance}
   199920000   61.608    0.000   61.608    0.000 types.py:1158(_create_row_inbound_converter)
{code}
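A profile like the one above can be collected with PySpark's built-in worker profiler, enabled via the {{spark.python.profile}} configuration flag. A minimal sketch, assuming a local Spark install; the row count and the lambda are illustrative only:
{code}
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

# Enable cProfile on the Python workers; results are aggregated per RDD.
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

df = sqlContext.range(0, 1000000)          # one-column DataFrame of longs
rdd = df.rdd.map(lambda row: row.id + 1)   # .rdd materializes a Row per record

rdd.count()         # trigger the job
sc.show_profiles()  # prints "Profile of RDD<id=N>" blocks like the sample above
{code}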
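To make the IDEA concrete, here is a purely hypothetical sketch of what a cythonized row constructor could look like ({{fastrow.pyx}} and {{create_row}} are invented names, not part of PySpark; the real hot path is {{types.py:_create_row}}). The point is that a {{cdef class}} stores its fields in C-level slots, avoiding the {{tuple.__new__}} plus Python-level {{__setattr__}} work that dominates the profile:
{code}
# fastrow.pyx -- hypothetical sketch, not part of PySpark
cdef class FastRow:
    cdef readonly tuple fields
    cdef readonly tuple values

    def __cinit__(self, tuple fields, tuple values):
        self.fields = fields
        self.values = values

    def __getattr__(self, name):
        # Attribute access by field name, mirroring pyspark.sql.Row
        try:
            return self.values[self.fields.index(name)]
        except ValueError:
            raise AttributeError(name)

cpdef FastRow create_row(tuple fields, tuple values):
    # One C-level allocation per record instead of tuple.__new__
    # followed by per-instance attribute writes in pure Python.
    return FastRow(fields, values)
{code}
Compiled with {{cythonize}} and shipped to the workers, the deserializer could then call {{create_row(fields, values)}} in its inner loop.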
> Use Cython in Pyspark internals
> -------------------------------
>
>                 Key: SPARK-16569
>                 URL: https://issues.apache.org/jira/browse/SPARK-16569
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 1.6.2, 2.0.0
>            Reporter: Maciej Bryński
>            Priority: Minor

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)