Maciej Bryński created SPARK-16569: --------------------------------------
Summary: Use Cython in Pyspark internals
Key: SPARK-16569
URL: https://issues.apache.org/jira/browse/SPARK-16569
Project: Spark
Issue Type: Improvement
Components: PySpark, SQL
Affects Versions: 1.6.2, 2.0.0
Reporter: Maciej Bryński
Priority: Minor
CC: [~davies]

Many of the operations I run look like:
{code}
dataframe.rdd.map(some_function)
{code}
In PySpark this means creating a Row object for every record, and this is slow.

IDEA: Use Cython to speed up PySpark internals.

Sample profile:
{code}
============================================================
Profile of RDD<id=9>
============================================================
2000373036 function calls (2000312850 primitive calls) in 2045.307 seconds

Ordered by: internal time, cumulative time

ncalls     tottime  percall  cumtime   percall  filename:lineno(function)
14948      427.117  0.029    1811.622  0.121    {built-in method loads}
199920000  402.086  0.000    937.045   0.000    types.py:1162(_create_row)
199920000  262.708  0.000    262.708   0.000    {built-in method __new__ of type object at 0x9d1c40}
199920000  190.908  0.000    1219.794  0.000    types.py:558(fromInternal)
199920000  153.611  0.000    153.611   0.000    types.py:1280(__setattr__)
199920197  145.022  0.000    2024.126  0.000    rdd.py:1004(<genexpr>)
199920000  118.640  0.000    381.348   0.000    types.py:1194(__new__)
199920000  101.272  0.000    1321.067  0.000    types.py:1159(<lambda>)
200189064  91.928   0.000    91.928    0.000    {built-in method isinstance}
199920000  61.608   0.000    61.608    0.000    types.py:1158(_create_row_inbound_converter)
{code}
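To illustrate why per-record Row construction dominates the profile, here is a minimal, hypothetical stand-in for what {{types.py:_create_row}} does per record (a {{tuple.\_\_new\_\_}} call plus a {{\_\_fields\_\_}} assignment, matching the {{\_\_new\_\_}} and {{\_\_setattr\_\_}} lines in the trace), compared against building a bare tuple. This is a simplified sketch, not the actual PySpark implementation:

{code}
import timeit

# Hypothetical, simplified stand-in for PySpark's Row: a tuple subclass
# that also stores the field names on each instance.
class Row(tuple):
    def __new__(cls, fields, values):
        row = tuple.__new__(cls, values)   # per-record __new__ call
        row.__fields__ = fields            # per-record attribute set
        return row

def _create_row(fields, values):
    # One extra Python-level function call per record, as in the profile.
    return Row(fields, values)

fields = ["a", "b", "c"]
values = (1, 2, 3)

row_time = timeit.timeit(lambda: _create_row(fields, values), number=100_000)
tuple_time = timeit.timeit(lambda: tuple(values), number=100_000)

print(f"Row stand-in: {row_time:.3f}s, plain tuple: {tuple_time:.3f}s")
{code}

On the profiled job this pure-Python path runs ~200 million times; compiling it with Cython (or bypassing Row creation entirely when the user immediately maps over the RDD) is the kind of saving this issue proposes.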