[ https://issues.apache.org/jira/browse/SPARK-16324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-16324:
------------------------------
    Issue Type: Improvement  (was: Bug)
       Summary: regexp_extract should doc that it returns empty string when match fails  (was: regexp_extract returns empty string when match fails)

> regexp_extract should doc that it returns empty string when match fails
> -----------------------------------------------------------------------
>
>                 Key: SPARK-16324
>                 URL: https://issues.apache.org/jira/browse/SPARK-16324
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Max Moroz
>            Priority: Minor
>
> The documentation for regexp_extract isn't clear about how it should behave
> if the regex doesn't match the row. However, the Java documentation it refers
> to for further detail suggests that the return value should be null if the
> group wasn't matched at all, an empty string if the group actually matched an
> empty string, and that an exception should be raised if the entire regex
> didn't match.
> This would be identical to how Python's own re module behaves when
> MatchObject.group() is called.
> However, in practice regexp_extract() returns an empty string when the match
> fails. This seems to be a bug; if it was intended as a feature, it should
> have been documented as such, and it was probably not a good idea, since it
> can result in silent bugs.
> {code}
> import pyspark.sql.functions as F
> # `spark` is the SparkSession provided by the pyspark shell
> df = spark.createDataFrame([['abc']], ['text'])
> assert df.select(F.regexp_extract('text', r'(z)', 1)).first()[0] == ''
> {code}
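
For comparison, the three cases the report distinguishes (the whole regex fails, the group does not participate in the match, the group matches an empty string) can be told apart with Python's re module. This is only a sketch of that distinction, not of Spark's implementation; `group_or_sentinel` and `NO_MATCH` are hypothetical names introduced here for illustration. Note that in Python `re.search` returns None when the regex fails, so it is the subsequent `.group()` call on None that raises (an AttributeError).

```python
import re

# Hypothetical sentinel marking "the whole regex did not match".
NO_MATCH = object()


def group_or_sentinel(pattern, text, idx):
    """Return group `idx` of the first match of `pattern` in `text`.

    Returns NO_MATCH if the regex does not match at all, None if it matches
    but the group did not participate, and '' if the group matched empty.
    """
    m = re.search(pattern, text)
    if m is None:
        # re.search returns None on failure; calling m.group() here
        # would raise AttributeError.
        return NO_MATCH
    return m.group(idx)


# Case 1: the entire regex fails to match.
assert group_or_sentinel(r"(z)", "abc", 1) is NO_MATCH

# Case 2: the regex matches, but the optional group did not participate.
assert group_or_sentinel(r"(z)?abc", "abc", 1) is None

# Case 3: the group participated and matched an empty string.
assert group_or_sentinel(r"(z*)abc", "abc", 1) == ""
```

According to the report, regexp_extract() returns an empty string in the first case, which makes it indistinguishable from the third.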