[jira] [Commented] (SPARK-21118) OOM with 2 handred million vertex when mitrx multply

2017-06-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16051641#comment-16051641
 ] 

Lorenz Bühmann commented on SPARK-21118:


The first point would be to use a subject title without typos. I mean "handred" 
and "mitrx multply"? Come on - how can others search for similar problems?!

Secondly, you're using `collect()` for both matrices. That more or less breaks 
the idea of Spark, since you're collecting everything into driver memory, and 
for large data this will of course lead to an OOM. You should read more about 
the principles of Spark, I guess.
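To illustrate the point about `collect()`, here is a minimal PySpark sketch (not from the ticket; it assumes an existing SparkContext `sc`): a `collect()` materializes the whole distributed dataset in driver memory, while a distributed aggregation only ships the final result back to the driver.

{code}
# Minimal sketch, assuming an existing SparkContext `sc`.
rdd = sc.range(0, 200 * 1000 * 1000)   # data stays distributed across the executors

# all_rows = rdd.collect()             # would pull every element into driver memory -> OOM risk
total = rdd.sum()                      # aggregation runs on the executors; only one number returns
{code}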

> OOM with 2 handred million vertex when mitrx multply
> 
>
> Key: SPARK-21118
> URL: https://issues.apache.org/jira/browse/SPARK-21118
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.0
> Environment: on yarn cluster,19 node.30GB per node
>Reporter: tao
>
> I have 2 matrices, each one 200 million x 200 million.
> I want to multiply them, but the job runs out of memory (OOM).
> Finally I found that the OOM appears at BlockMatrix.simulateMultiply: there is a 
> collect action in this method.
> The collect returns the whole dataset to the driver, which is too large, so the 
> driver goes OOM.
> class BlockMatrix @Since("1.3.0") (
> private[distributed] def simulateMultiply(
>     other: BlockMatrix,
>     partitioner: GridPartitioner): (BlockDestinations, BlockDestinations) = {
>   val leftMatrix = blockInfo.keys.collect() // blockInfo should already be cached
>   val rightMatrix = other.blocks.keys.collect()
>   ..
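For reference, a hedged PySpark sketch (not the reporter's actual code) of the kind of call that reaches this code path: BlockMatrix.multiply delegates to simulateMultiply, which collects the block coordinates of both operands on the driver. Block sizes and contents below are tiny placeholders; an existing SparkContext `sc` is assumed.

{code}
from pyspark.mllib.linalg import Matrices
from pyspark.mllib.linalg.distributed import BlockMatrix

# Tiny placeholder blocks; a real workload would have millions of block coordinates.
blocks = sc.parallelize([
    ((0, 0), Matrices.dense(2, 2, [1.0, 0.0, 0.0, 1.0])),
    ((1, 1), Matrices.dense(2, 2, [1.0, 0.0, 0.0, 1.0])),
])
a = BlockMatrix(blocks, rowsPerBlock=2, colsPerBlock=2)
b = BlockMatrix(blocks, rowsPerBlock=2, colsPerBlock=2)

# multiply() internally calls simulateMultiply, which collect()s the block
# coordinates of both matrices on the driver before shuffling the blocks.
product = a.multiply(b)
{code}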



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-06-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16054162#comment-16054162
 ] 

Michael Schmeißer commented on SPARK-650:
-

[~riteshtijoriwala] - Sorry, but I am not familiar with Spark 2.0.0 yet. What I 
can say is that we have raised a Cloudera support case to address this issue, so 
maybe we can expect some help from that side.

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries
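As an aside (not from this thread): in the absence of such a setup-hook API, a common workaround is lazy, guarded initialization inside the tasks themselves. A minimal PySpark sketch of that pattern, where the setup runs at most once per Python worker process:

{code}
_initialized = False

def ensure_initialized():
    # Runs the one-time setup (e.g. configuring a reporting library) lazily,
    # guarded by a module-level flag, so it happens at most once per worker process.
    global _initialized
    if not _initialized:
        # ... configure reporting/logging here ...
        _initialized = True

def process_partition(rows):
    ensure_initialized()
    for row in rows:
        yield row  # real per-record work goes here

# Usage sketch: result = rdd.mapPartitions(process_partition)
{code}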






[jira] [Updated] (SPARK-21287) Cannot use Int.MIN_VALUE as Spark SQL fetchsize

2017-07-03 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-21287:
---
Summary: Cannot use Int.MIN_VALUE as Spark SQL fetchsize  (was: Cannot use 
Iint.MIN_VALUE as Spark SQL fetchsize)

> Cannot use Int.MIN_VALUE as Spark SQL fetchsize
> ---
>
> Key: SPARK-21287
> URL: https://issues.apache.org/jira/browse/SPARK-21287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maciej Bryński
>
> The MySQL JDBC driver makes it possible not to store the ResultSet in memory.
> We can do this by setting fetchSize to Int.MIN_VALUE.
> Unfortunately, Spark rejects this configuration as invalid.
> {code}
> java.lang.IllegalArgumentException: requirement failed: Invalid value 
> `-2147483648` for parameter `fetchsize`. The minimum value is 0. When the 
> value is 0, the JDBC driver ignores the value and does the estimates.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:206)
>   at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:748)
> {code}






[jira] [Created] (SPARK-21287) Cannot use Iint.MIN_VALUE as Spark SQL fetchsize

2017-07-03 Thread JIRA
Maciej Bryński created SPARK-21287:
--

 Summary: Cannot use Iint.MIN_VALUE as Spark SQL fetchsize
 Key: SPARK-21287
 URL: https://issues.apache.org/jira/browse/SPARK-21287
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1
Reporter: Maciej Bryński


The MySQL JDBC driver makes it possible not to store the ResultSet in memory.
We can do this by setting fetchSize to Int.MIN_VALUE.
Unfortunately, Spark rejects this configuration as invalid.
{code}
java.lang.IllegalArgumentException: requirement failed: Invalid value 
`-2147483648` for parameter `fetchsize`. The minimum value is 0. When the value 
is 0, the JDBC driver ignores the value and does the estimates.
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:105)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:206)
at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
{code}






[jira] [Updated] (SPARK-21287) Cannot use Int.MIN_VALUE as Spark SQL fetchsize

2017-07-03 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-21287:
---
Description: 
The MySQL JDBC driver makes it possible not to store the ResultSet in memory.
We can do this by setting fetchSize to Int.MIN_VALUE.
Unfortunately, Spark rejects this configuration as invalid.
{code}
java.lang.IllegalArgumentException: requirement failed: Invalid value 
`-2147483648` for parameter `fetchsize`. The minimum value is 0. When the value 
is 0, the JDBC driver ignores the value and does the estimates.
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:105)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:206)
at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
{code}

https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html

  was:
The MySQL JDBC driver makes it possible not to store the ResultSet in memory.
We can do this by setting fetchSize to Int.MIN_VALUE.
Unfortunately, Spark rejects this configuration as invalid.
{code}
java.lang.IllegalArgumentException: requirement failed: Invalid value 
`-2147483648` for parameter `fetchsize`. The minimum value is 0. When the value 
is 0, the JDBC driver ignores the value and does the estimates.
at scala.Predef$.require(Predef.scala:224)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:105)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
at 
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:206)
at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
{code}


> Cannot use Int.MIN_VALUE as Spark SQL fetchsize
> ---
>
> Key: SPARK-21287
> URL: https://issues.apache.org/jira/browse/SPARK-21287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maciej Bryński
>
> The MySQL JDBC driver makes it possible not to store the ResultSet in memory.
> We can do this by setting fetchSize to Int.MIN_VALUE.
> Unfortunately, Spark rejects this configuration as invalid.
> {code}
> java.lang.IllegalArgumentException: requirement failed: Invalid value 
> `-2147483648` for parameter `fetchsize`. The minimum value is 0. When the 
> value is 0, the JDBC driver ignores the value and does the est

[jira] [Commented] (SPARK-21287) Cannot use Int.MIN_VALUE as Spark SQL fetchsize

2017-07-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072404#comment-16072404
 ] 

Maciej Bryński commented on SPARK-21287:


No, it's not the same as setting 0 or 1.
Every other value makes the MySQL JDBC driver store the whole ResultSet in memory.

Maybe we can just remove this assertion?
I think this is not a particularly popular configuration option anyway.

> Cannot use Int.MIN_VALUE as Spark SQL fetchsize
> ---
>
> Key: SPARK-21287
> URL: https://issues.apache.org/jira/browse/SPARK-21287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maciej Bryński
>
> The MySQL JDBC driver makes it possible not to store the ResultSet in memory.
> We can do this by setting fetchSize to Int.MIN_VALUE.
> Unfortunately, Spark rejects this configuration as invalid.
> {code}
> java.lang.IllegalArgumentException: requirement failed: Invalid value 
> `-2147483648` for parameter `fetchsize`. The minimum value is 0. When the 
> value is 0, the JDBC driver ignores the value and does the estimates.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:206)
>   at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html






[jira] [Commented] (SPARK-21287) Cannot use Int.MIN_VALUE as Spark SQL fetchsize

2017-07-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072474#comment-16072474
 ] 

Maciej Bryński commented on SPARK-21287:


Quote
{code}
By default, ResultSets are completely retrieved and stored in memory. In most 
cases this is the most efficient way to operate and, due to the design of the 
MySQL network protocol, is easier to implement. If you are working with 
ResultSets that have a large number of rows or large values and cannot allocate 
heap space in your JVM for the memory required, you can tell the driver to 
stream the results back one row at a time.
{code}
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html
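For context, a hedged PySpark sketch of the kind of read that hits the check above (the stack trace goes through py4j, so the call was presumably issued from PySpark; an existing SparkSession `spark` is assumed, and the URL, table, and credentials below are placeholders, not taken from the ticket):

{code}
# Hypothetical connection details; only the fetchsize entry matters here.
df = spark.read.jdbc(
    url="jdbc:mysql://example-host:3306/example_db",
    table="example_table",
    properties={
        "user": "example_user",
        "password": "example_password",
        # MySQL Connector/J streams rows one at a time only when fetchSize == Integer.MIN_VALUE,
        # but Spark's JDBC options currently require fetchsize >= 0, hence the IllegalArgumentException.
        "fetchsize": str(-2147483648),
    },
)
{code}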

> Cannot use Int.MIN_VALUE as Spark SQL fetchsize
> ---
>
> Key: SPARK-21287
> URL: https://issues.apache.org/jira/browse/SPARK-21287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maciej Bryński
>
> The MySQL JDBC driver makes it possible not to store the ResultSet in memory.
> We can do this by setting fetchSize to Int.MIN_VALUE.
> Unfortunately, Spark rejects this configuration as invalid.
> {code}
> java.lang.IllegalArgumentException: requirement failed: Invalid value 
> `-2147483648` for parameter `fetchsize`. The minimum value is 0. When the 
> value is 0, the JDBC driver ignores the value and does the estimates.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:206)
>   at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html






[jira] [Comment Edited] (SPARK-21287) Cannot use Int.MIN_VALUE as Spark SQL fetchsize

2017-07-03 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16072474#comment-16072474
 ] 

Maciej Bryński edited comment on SPARK-21287 at 7/3/17 1:59 PM:


Quote
{quote}
By default, ResultSets are completely retrieved and stored in memory. In most 
cases this is the most efficient way to operate and, due to the design of the 
MySQL network protocol, is easier to implement. If you are working with 
ResultSets that have a large number of rows or large values and cannot allocate 
heap space in your JVM for the memory required, you can tell the driver to 
stream the results back one row at a time.
{quote}
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html


was (Author: maver1ck):
Quote
{code}
By default, ResultSets are completely retrieved and stored in memory. In most 
cases this is the most efficient way to operate and, due to the design of the 
MySQL network protocol, is easier to implement. If you are working with 
ResultSets that have a large number of rows or large values and cannot allocate 
heap space in your JVM for the memory required, you can tell the driver to 
stream the results back one row at a time.
{code}
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html

> Cannot use Int.MIN_VALUE as Spark SQL fetchsize
> ---
>
> Key: SPARK-21287
> URL: https://issues.apache.org/jira/browse/SPARK-21287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Maciej Bryński
>
> The MySQL JDBC driver makes it possible not to store the ResultSet in memory.
> We can do this by setting fetchSize to Int.MIN_VALUE.
> Unfortunately, Spark rejects this configuration as invalid.
> {code}
> java.lang.IllegalArgumentException: requirement failed: Invalid value 
> `-2147483648` for parameter `fetchsize`. The minimum value is 0. When the 
> value is 0, the JDBC driver ignores the value and does the estimates.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:105)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.(JDBCOptions.scala:34)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:32)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:125)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:166)
>   at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:206)
>   at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:214)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-implementation-notes.html






[jira] [Created] (SPARK-21439) Cannot use Spark with Python ABCmeta (exception from cloudpickle)

2017-07-17 Thread JIRA
Maciej Bryński created SPARK-21439:
--

 Summary: Cannot use Spark with Python ABCmeta (exception from 
cloudpickle)
 Key: SPARK-21439
 URL: https://issues.apache.org/jira/browse/SPARK-21439
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: Maciej Bryński


I'm trying to use code with ABCMeta.
This code results in an exception.
{code}
from abc import ABCMeta, abstractmethod
class A(metaclass=ABCMeta):
@abstractmethod
def x(self):
"""Abstract"""

class B(A):
def x(self):
return 10

b = B()

sc.range(10).map(lambda x: b.x()).collect()
{code}

Exception:
{code}
---
AttributeErrorTraceback (most recent call last)
/opt/spark/python/pyspark/cloudpickle.py in dump(self, obj)
146 try:
--> 147 return Pickler.dump(self, obj)
148 except RuntimeError as e:

/usr/lib/python3.4/pickle.py in dump(self, obj)
409 self.framer.start_framing()
--> 410 self.save(obj)
411 self.write(STOP)

/usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
476 if f is not None:
--> 477 f(self, obj) # Call unbound method with explicit self
478 return

/usr/lib/python3.4/pickle.py in save_tuple(self, obj)
741 for element in obj:
--> 742 save(element)
743 

/usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
476 if f is not None:
--> 477 f(self, obj) # Call unbound method with explicit self
478 return

/opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
253 if klass is None or klass is not obj:
--> 254 self.save_function_tuple(obj)
255 return

/opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
290 save(_make_skel_func)
--> 291 save((code, closure, base_globals))
292 write(pickle.REDUCE)

/usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
476 if f is not None:
--> 477 f(self, obj) # Call unbound method with explicit self
478 return

/usr/lib/python3.4/pickle.py in save_tuple(self, obj)
726 for element in obj:
--> 727 save(element)
728 # Subtle.  Same as in the big comment below.

/usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
476 if f is not None:
--> 477 f(self, obj) # Call unbound method with explicit self
478 return

/usr/lib/python3.4/pickle.py in save_list(self, obj)
771 self.memoize(obj)
--> 772 self._batch_appends(obj)
773 

/usr/lib/python3.4/pickle.py in _batch_appends(self, items)
795 for x in tmp:
--> 796 save(x)
797 write(APPENDS)

/usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
476 if f is not None:
--> 477 f(self, obj) # Call unbound method with explicit self
478 return

/opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
253 if klass is None or klass is not obj:
--> 254 self.save_function_tuple(obj)
255 return

/opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
290 save(_make_skel_func)
--> 291 save((code, closure, base_globals))
292 write(pickle.REDUCE)

/usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
476 if f is not None:
--> 477 f(self, obj) # Call unbound method with explicit self
478 return

/usr/lib/python3.4/pickle.py in save_tuple(self, obj)
726 for element in obj:
--> 727 save(element)
728 # Subtle.  Same as in the big comment below.

/usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
476 if f is not None:
--> 477 f(self, obj) # Call unbound method with explicit self
478 return

/usr/lib/python3.4/pickle.py in save_list(self, obj)
771 self.memoize(obj)
--> 772 self._batch_appends(obj)
773 

/usr/lib/python3.4/pickle.py in _batch_appends(self, items)
798 elif n:
--> 799 save(tmp[0])
800 write(APPEND)

/usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
476 if f is not None:
--> 477 f(self, obj) # Call unbound method with explicit self
478 return

/opt/spark/pytho

[jira] [Updated] (SPARK-21439) Cannot use Spark with Python ABCmeta (exception from cloudpickle)

2017-07-17 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-21439:
---
Component/s: PySpark

> Cannot use Spark with Python ABCmeta (exception from cloudpickle)
> -
>
> Key: SPARK-21439
> URL: https://issues.apache.org/jira/browse/SPARK-21439
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.1.1
>Reporter: Maciej Bryński
>
> I'm trying to use code with ABCMeta.
> This code results in an exception.
> {code}
> from abc import ABCMeta, abstractmethod
> class A(metaclass=ABCMeta):
> @abstractmethod
> def x(self):
> """Abstract"""
> 
> class B(A):
> def x(self):
> return 10
> b = B()
> sc.range(10).map(lambda x: b.x()).collect()
> {code}
> Exception:
> {code}
> ---
> AttributeErrorTraceback (most recent call last)
> /opt/spark/python/pyspark/cloudpickle.py in dump(self, obj)
> 146 try:
> --> 147 return Pickler.dump(self, obj)
> 148 except RuntimeError as e:
> /usr/lib/python3.4/pickle.py in dump(self, obj)
> 409 self.framer.start_framing()
> --> 410 self.save(obj)
> 411 self.write(STOP)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 741 for element in obj:
> --> 742 save(element)
> 743 
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
> 253 if klass is None or klass is not obj:
> --> 254 self.save_function_tuple(obj)
> 255 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
> 290 save(_make_skel_func)
> --> 291 save((code, closure, base_globals))
> 292 write(pickle.REDUCE)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 726 for element in obj:
> --> 727 save(element)
> 728 # Subtle.  Same as in the big comment below.
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_list(self, obj)
> 771 self.memoize(obj)
> --> 772 self._batch_appends(obj)
> 773 
> /usr/lib/python3.4/pickle.py in _batch_appends(self, items)
> 795 for x in tmp:
> --> 796 save(x)
> 797 write(APPENDS)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
> 253 if klass is None or klass is not obj:
> --> 254 self.save_function_tuple(obj)
> 255 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
> 290 save(_make_skel_func)
> --> 291 save((code, closure, base_globals))
> 292 write(pickle.REDUCE)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 726 for element in obj:
> --> 727 save(element)
> 728 # Subtle.  Same as in the big comment below.
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persiste

[jira] [Commented] (SPARK-21439) Cannot use Spark with Python ABCmeta (exception from cloudpickle)

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092744#comment-16092744
 ] 

Maciej Bryński commented on SPARK-21439:


I think this is a Spark problem with Python code serialization.

> Cannot use Spark with Python ABCmeta (exception from cloudpickle)
> -
>
> Key: SPARK-21439
> URL: https://issues.apache.org/jira/browse/SPARK-21439
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.1.1
>Reporter: Maciej Bryński
>
> I'm trying to use code with ABCMeta.
> This code results in an exception.
> {code}
> from abc import ABCMeta, abstractmethod
> class A(metaclass=ABCMeta):
> @abstractmethod
> def x(self):
> """Abstract"""
> 
> class B(A):
> def x(self):
> return 10
> b = B()
> sc.range(10).map(lambda x: b.x()).collect()
> {code}
> Exception:
> {code}
> ---
> AttributeErrorTraceback (most recent call last)
> /opt/spark/python/pyspark/cloudpickle.py in dump(self, obj)
> 146 try:
> --> 147 return Pickler.dump(self, obj)
> 148 except RuntimeError as e:
> /usr/lib/python3.4/pickle.py in dump(self, obj)
> 409 self.framer.start_framing()
> --> 410 self.save(obj)
> 411 self.write(STOP)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 741 for element in obj:
> --> 742 save(element)
> 743 
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
> 253 if klass is None or klass is not obj:
> --> 254 self.save_function_tuple(obj)
> 255 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
> 290 save(_make_skel_func)
> --> 291 save((code, closure, base_globals))
> 292 write(pickle.REDUCE)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 726 for element in obj:
> --> 727 save(element)
> 728 # Subtle.  Same as in the big comment below.
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_list(self, obj)
> 771 self.memoize(obj)
> --> 772 self._batch_appends(obj)
> 773 
> /usr/lib/python3.4/pickle.py in _batch_appends(self, items)
> 795 for x in tmp:
> --> 796 save(x)
> 797 write(APPENDS)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
> 253 if klass is None or klass is not obj:
> --> 254 self.save_function_tuple(obj)
> 255 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
> 290 save(_make_skel_func)
> --> 291 save((code, closure, base_globals))
> 292 write(pickle.REDUCE)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 726 for element in obj:
> --> 727 save(element)
> 728 # Subtle.  Same as in t

[jira] [Commented] (SPARK-12717) pyspark broadcast fails when using multiple threads

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092756#comment-16092756
 ] 

Maciej Bryński commented on SPARK-12717:


Any progress on this error?
I have the patch from this thread merged and it's working fine.
Maybe we can merge this to master? (and other branches)

> pyspark broadcast fails when using multiple threads
> ---
>
> Key: SPARK-12717
> URL: https://issues.apache.org/jira/browse/SPARK-12717
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
> Environment: Linux, python 2.6 or python 2.7.
>Reporter: Edward Walker
>Priority: Critical
> Attachments: run.log
>
>
> The following multi-threaded program that uses broadcast variables 
> consistently throws exceptions like:  *Exception("Broadcast variable '18' not 
> loaded!",)* --- even when run with "--master local[10]".
> {code:title=bug_spark.py|borderStyle=solid}
> try:
>     import pyspark
> except:
>     pass
> 
> from optparse import OptionParser
> 
> def my_option_parser():
>     op = OptionParser()
>     op.add_option("--parallelism", dest="parallelism", type="int", default=20)
>     return op
> 
> def do_process(x, w):
>     return x * w.value
> 
> def func(name, rdd, conf):
>     new_rdd = rdd.map(lambda x: do_process(x, conf))
>     total = new_rdd.reduce(lambda x, y: x + y)
>     count = rdd.count()
>     print name, 1.0 * total / count
> 
> if __name__ == "__main__":
>     import threading
>     op = my_option_parser()
>     options, args = op.parse_args()
>     sc = pyspark.SparkContext(appName="Buggy")
>     data_rdd = sc.parallelize(range(0, 1000), 1)
>     confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
>     threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]]) for i in xrange(options.parallelism)]
>     for t in threads:

[jira] [Created] (SPARK-21470) Spark History server doesn't support HDFS HA

2017-07-19 Thread JIRA
Maciej Bryński created SPARK-21470:
--

 Summary: Spark History server doesn't support HDFS HA
 Key: SPARK-21470
 URL: https://issues.apache.org/jira/browse/SPARK-21470
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Maciej Bryński


With Spark versions up to 2.1.1 it was possible to configure the history server 
to read from HDFS without specifying the namenode:
spark.history.fs.logDirectory hdfs:///apps/spark
And this works with HDFS HA.

Unfortunately, there is a regression in Spark 2.2.0 where such a configuration 
gives an error:
{code}
Caused by: java.io.IOException: Incomplete HDFS URI, no host: hdfs:///apps/spark
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:142)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:108)
at 
org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:78)
... 6 more
{code}






[jira] [Created] (SPARK-21471) Read binary file error in Spark Streaming

2017-07-19 Thread JIRA
Lê Văn Thanh created SPARK-21471:


 Summary: Read binary file error in Spark Streaming
 Key: SPARK-21471
 URL: https://issues.apache.org/jira/browse/SPARK-21471
 Project: Spark
  Issue Type: Bug
  Components: Java API, Spark Core
Affects Versions: 2.2.0
 Environment: org.apache.spark
spark-streaming_2.11
2.2.0
Ubuntu - 16.10
Hadoop - 2.7.3
Reporter: Lê Văn Thanh


My client uses GZIPOutputStream to compress the data. When I use the 
binaryRecordsStream method to stream/read the data, I get a message like:

!http://sv1.upsieutoc.com/2017/07/19/error.png!

My code:

{code:java}
SparkConf conf = new SparkConf()
        .setAppName("SparkStream")
        .setMaster("local[*]")
        .set("spark.executor.memory", "1g");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Seconds.apply(10));
JavaDStream<byte[]> javaDStream =
        streamingContext.binaryRecordsStream("hdfs://localhost:9000/user/data/", 10);
javaDStream.foreachRDD(x -> {
    List<byte[]> bytes = x.collect();
    System.out.println(bytes);
});
{code}

Can you tell me how to fix this issue?







[jira] [Updated] (SPARK-21471) Read binary file error in Spark Streaming

2017-07-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lê Văn Thanh updated SPARK-21471:
-
Description: 
My client uses GZIPOutputStream to compress the data and push it to my server. 
When I use the binaryRecordsStream method to stream/read the data, I get a message 
like:

!http://sv1.upsieutoc.com/2017/07/19/error.png!

My code:

{code:java}
SparkConf conf = new SparkConf()
        .setAppName("SparkStream")
        .setMaster("local[*]")
        .set("spark.executor.memory", "1g");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Seconds.apply(10));
JavaDStream<byte[]> javaDStream =
        streamingContext.binaryRecordsStream("hdfs://localhost:9000/user/data/", 10);
javaDStream.foreachRDD(x -> {
    List<byte[]> bytes = x.collect();
    System.out.println(bytes);
});
{code}

Can you tell me how to fix this issue?


  was:
My client uses GZIPOutputStream to compress the data. When I use the 
binaryRecordsStream method to stream/read the data, I get a message like:

!http://sv1.upsieutoc.com/2017/07/19/error.png!

My code:

{code:java}
SparkConf conf = new SparkConf()
        .setAppName("SparkStream")
        .setMaster("local[*]")
        .set("spark.executor.memory", "1g");
JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Seconds.apply(10));
JavaDStream<byte[]> javaDStream =
        streamingContext.binaryRecordsStream("hdfs://localhost:9000/user/data/", 10);
javaDStream.foreachRDD(x -> {
    List<byte[]> bytes = x.collect();
    System.out.println(bytes);
});
{code}

Can you tell me how to fix this issue?



> Read binary file error in Spark Streaming
> -
>
> Key: SPARK-21471
> URL: https://issues.apache.org/jira/browse/SPARK-21471
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Core
>Affects Versions: 2.2.0
> Environment: org.apache.spark
> spark-streaming_2.11
> 2.2.0
> Ubuntu - 16.10
> Hadoop - 2.7.3
>Reporter: Lê Văn Thanh
>
> My client uses GZIPOutputStream to compress the data and push it to my 
> server. When I use the binaryRecordsStream method to stream/read the data, I get 
> a message like:
> !http://sv1.upsieutoc.com/2017/07/19/error.png!
> My code:
> {code:java}
> SparkConf conf = new SparkConf()
>         .setAppName("SparkStream")
>         .setMaster("local[*]")
>         .set("spark.executor.memory", "1g");
> JavaStreamingContext streamingContext = new JavaStreamingContext(conf, Seconds.apply(10));
> JavaDStream<byte[]> javaDStream =
>         streamingContext.binaryRecordsStream("hdfs://localhost:9000/user/data/", 10);
> javaDStream.foreachRDD(x -> {
>     List<byte[]> bytes = x.collect();
>     System.out.println(bytes);
> });
> {code}
> Can you tell me how to fix this issue?






[jira] [Updated] (SPARK-21470) [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA

2017-07-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-21470:
---
Summary: [SPARK 2.2 Regression] Spark History server doesn't support HDFS 
HA  (was: Spark History server doesn't support HDFS HA)

> [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA
> ---
>
> Key: SPARK-21470
> URL: https://issues.apache.org/jira/browse/SPARK-21470
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> With Spark versions up to 2.1.1 it was possible to configure the history server 
> to read from HDFS without specifying the namenode:
> spark.history.fs.logDirectory hdfs:///apps/spark
> And this works with HDFS HA.
> Unfortunately, there is a regression in Spark 2.2.0 where such a configuration 
> gives an error:
> {code}
> Caused by: java.io.IOException: Incomplete HDFS URI, no host: 
> hdfs:///apps/spark
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:142)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:108)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:78)
> ... 6 more
> {code}






[jira] [Commented] (SPARK-21439) Cannot use Spark with Python ABCmeta (exception from cloudpickle)

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092888#comment-16092888
 ] 

Maciej Bryński commented on SPARK-21439:


https://github.com/cloudpipe/cloudpickle/pull/104

> Cannot use Spark with Python ABCmeta (exception from cloudpickle)
> -
>
> Key: SPARK-21439
> URL: https://issues.apache.org/jira/browse/SPARK-21439
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.1.1
>Reporter: Maciej Bryński
>
> I'm trying to use code with ABCMeta.
> This code results in an exception.
> {code}
> from abc import ABCMeta, abstractmethod
> class A(metaclass=ABCMeta):
> @abstractmethod
> def x(self):
> """Abstract"""
> 
> class B(A):
> def x(self):
> return 10
> b = B()
> sc.range(10).map(lambda x: b.x()).collect()
> {code}
> Exception:
> {code}
> ---
> AttributeErrorTraceback (most recent call last)
> /opt/spark/python/pyspark/cloudpickle.py in dump(self, obj)
> 146 try:
> --> 147 return Pickler.dump(self, obj)
> 148 except RuntimeError as e:
> /usr/lib/python3.4/pickle.py in dump(self, obj)
> 409 self.framer.start_framing()
> --> 410 self.save(obj)
> 411 self.write(STOP)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 741 for element in obj:
> --> 742 save(element)
> 743 
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
> 253 if klass is None or klass is not obj:
> --> 254 self.save_function_tuple(obj)
> 255 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
> 290 save(_make_skel_func)
> --> 291 save((code, closure, base_globals))
> 292 write(pickle.REDUCE)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 726 for element in obj:
> --> 727 save(element)
> 728 # Subtle.  Same as in the big comment below.
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_list(self, obj)
> 771 self.memoize(obj)
> --> 772 self._batch_appends(obj)
> 773 
> /usr/lib/python3.4/pickle.py in _batch_appends(self, items)
> 795 for x in tmp:
> --> 796 save(x)
> 797 write(APPENDS)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
> 253 if klass is None or klass is not obj:
> --> 254 self.save_function_tuple(obj)
> 255 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
> 290 save(_make_skel_func)
> --> 291 save((code, closure, base_globals))
> 292 write(pickle.REDUCE)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 726 for element in obj:
> --> 727 save(element)
> 728 # Subtle.  Same as in the b

[jira] [Comment Edited] (SPARK-21439) Cannot use Spark with Python ABCmeta (exception from cloudpickle)

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092888#comment-16092888
 ] 

Maciej Bryński edited comment on SPARK-21439 at 7/19/17 10:15 AM:
--

I think this is a bug in cloudpickle and it will be fixed in the next version:
https://github.com/cloudpipe/cloudpickle/pull/104

So we need to pull in the new cloudpickle to resolve this in Spark.
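Until a cloudpickle release containing that fix is bundled with Spark, one possible workaround (an assumption on my side, not something confirmed in this thread) is to move the ABC hierarchy into an importable module that is also available on the executors (e.g. shipped via --py-files), so the classes are pickled by reference rather than cloudpickle trying to serialize the ABCMeta machinery by value. A minimal sketch with a hypothetical module name abc_model.py:

{code}
# abc_model.py -- hypothetical module; the class definitions are unchanged from the ticket.
from abc import ABCMeta, abstractmethod

class A(metaclass=ABCMeta):
    @abstractmethod
    def x(self):
        """Abstract"""

class B(A):
    def x(self):
        return 10
{code}

{code}
# Driver side: because B comes from an importable module, instances of B are
# pickled by reference to abc_model.B rather than by value.
from abc_model import B

b = B()
sc.range(10).map(lambda x: b.x()).collect()
{code}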



was (Author: maver1ck):
https://github.com/cloudpipe/cloudpickle/pull/104

> Cannot use Spark with Python ABCmeta (exception from cloudpickle)
> -
>
> Key: SPARK-21439
> URL: https://issues.apache.org/jira/browse/SPARK-21439
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.1.1
>Reporter: Maciej Bryński
>
> I'm trying to use code with ABCMeta.
> This code results in an exception.
> {code}
> from abc import ABCMeta, abstractmethod
> class A(metaclass=ABCMeta):
> @abstractmethod
> def x(self):
> """Abstract"""
> 
> class B(A):
> def x(self):
> return 10
> b = B()
> sc.range(10).map(lambda x: b.x()).collect()
> {code}
> Exception:
> {code}
> ---
> AttributeErrorTraceback (most recent call last)
> /opt/spark/python/pyspark/cloudpickle.py in dump(self, obj)
> 146 try:
> --> 147 return Pickler.dump(self, obj)
> 148 except RuntimeError as e:
> /usr/lib/python3.4/pickle.py in dump(self, obj)
> 409 self.framer.start_framing()
> --> 410 self.save(obj)
> 411 self.write(STOP)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 741 for element in obj:
> --> 742 save(element)
> 743 
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
> 253 if klass is None or klass is not obj:
> --> 254 self.save_function_tuple(obj)
> 255 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
> 290 save(_make_skel_func)
> --> 291 save((code, closure, base_globals))
> 292 write(pickle.REDUCE)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_tuple(self, obj)
> 726 for element in obj:
> --> 727 save(element)
> 728 # Subtle.  Same as in the big comment below.
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /usr/lib/python3.4/pickle.py in save_list(self, obj)
> 771 self.memoize(obj)
> --> 772 self._batch_appends(obj)
> 773 
> /usr/lib/python3.4/pickle.py in _batch_appends(self, items)
> 795 for x in tmp:
> --> 796 save(x)
> 797 write(APPENDS)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call unbound method with explicit self
> 478 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function(self, obj, name)
> 253 if klass is None or klass is not obj:
> --> 254 self.save_function_tuple(obj)
> 255 return
> /opt/spark/python/pyspark/cloudpickle.py in save_function_tuple(self, func)
> 290 save(_make_skel_func)
> --> 291 save((code, closure, base_globals))
> 292 write(pickle.REDUCE)
> /usr/lib/python3.4/pickle.py in save(self, obj, save_persistent_id)
> 476 if f is not None:
> --> 477 f(self, obj) # Call u

[jira] [Updated] (SPARK-11248) Spark hivethriftserver is using the wrong user to while getting HDFS permissions

2017-07-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11248:
---
Affects Version/s: 2.1.1

> Spark hivethriftserver is using the wrong user to while getting HDFS 
> permissions
> 
>
> Key: SPARK-11248
> URL: https://issues.apache.org/jira/browse/SPARK-11248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 2.1.1, 2.2.0
>Reporter: Trystan Leftwich
>
> While running Spark as a HiveThriftServer via YARN, Spark will use the user 
> running the Thrift server rather than the user connecting via JDBC to 
> check HDFS permissions.
> i.e.
> In HDFS the permissions are
> rwx--   3 testuser testuser /user/testuser/table/testtable
> And I connect via beeline as user testuser
> beeline -u 'jdbc:hive2://localhost:10511' -n 'testuser' -p ''
> If I try to hit that table
> select count(*) from test_table;
> I get the following error
> Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch 
> table test_table. java.security.AccessControlException: Permission denied: 
> user=hive, access=READ, 
> inode="/user/testuser/table/testtable":testuser:testuser:drwxr-x--x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> (state=,code=0)
> I have the following set in hive-site.xml, so it should be using the 
> correct user.
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>
> </property>
> This works correctly in Hive.






[jira] [Updated] (SPARK-11248) Spark hivethriftserver is using the wrong user to while getting HDFS permissions

2017-07-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11248:
---
Affects Version/s: 2.2.0

> Spark hivethriftserver is using the wrong user to while getting HDFS 
> permissions
> 
>
> Key: SPARK-11248
> URL: https://issues.apache.org/jira/browse/SPARK-11248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 2.1.1, 2.2.0
>Reporter: Trystan Leftwich
>
> While running Spark as a HiveThriftServer via YARN, Spark will use the user 
> running the Thrift server rather than the user connecting via JDBC to 
> check HDFS permissions.
> i.e.
> In HDFS the permissions are
> rwx--   3 testuser testuser /user/testuser/table/testtable
> And I connect via beeline as user testuser
> beeline -u 'jdbc:hive2://localhost:10511' -n 'testuser' -p ''
> If I try to hit that table
> select count(*) from test_table;
> I get the following error
> Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch 
> table test_table. java.security.AccessControlException: Permission denied: 
> user=hive, access=READ, 
> inode="/user/testuser/table/testtable":testuser:testuser:drwxr-x--x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> (state=,code=0)
> I have the following set in hive-site.xml so it should be using the 
> correct user.
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>
> </property>
> 
> This works correctly in hive.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11248) Spark hivethriftserver is using the wrong user to while getting HDFS permissions

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093055#comment-16093055
 ] 

Maciej Bryński commented on SPARK-11248:


I have similar issue in Spark 2.2.0

> Spark hivethriftserver is using the wrong user to while getting HDFS 
> permissions
> 
>
> Key: SPARK-11248
> URL: https://issues.apache.org/jira/browse/SPARK-11248
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0, 1.5.1, 2.1.1, 2.2.0
>Reporter: Trystan Leftwich
>
> While running Spark as a HiveThrift server via YARN, Spark will use the user 
> running the HiveThrift server rather than the user connecting via JDBC to 
> check HDFS perms.
> i.e.
> In HDFS the perms are
> rwx--   3 testuser testuser /user/testuser/table/testtable
> And I connect via beeline as user testuser
> beeline -u 'jdbc:hive2://localhost:10511' -n 'testuser' -p ''
> If I try to hit that table
> select count(*) from test_table;
> I get the following error
> Error: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch 
> table test_table. java.security.AccessControlException: Permission denied: 
> user=hive, access=READ, 
> inode="/user/testuser/table/testtable":testuser:testuser:drwxr-x--x
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkFsPermission(FSPermissionChecker.java:271)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:257)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:185)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6795)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPermission(FSNamesystem.java:6777)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6702)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkAccess(FSNamesystem.java:9529)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.checkAccess(NameNodeRpcServer.java:1516)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.checkAccess(ClientNamenodeProtocolServerSideTranslatorPB.java:1433)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2039)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2035)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2033) 
> (state=,code=0)
> I have the following set in hive-site.xml so it should be using the 
> correct user.
> <property>
>   <name>hive.server2.enable.doAs</name>
>   <value>true</value>
> </property>
> <property>
>   <name>hive.metastore.execute.setugi</name>
>   <value>true</value>
> </property>
> 
> This works correctly in hive.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093105#comment-16093105
 ] 

Maciej Bryński commented on SPARK-5159:
---

Still exists in Spark 2.2.0.
Probably a duplicate of https://issues.apache.org/jira/browse/SPARK-11248

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21470) [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093716#comment-16093716
 ] 

Maciej Bryński commented on SPARK-21470:


[~vanzin]
I tried.
{code}
/etc/hadoop/conf$ grep -A1 fs.defaultFS core-site.xml
  fs.defaultFS
  hdfs://hdfs1
{code}

So I change spark.history.fs.logDirectory to hdfs://hdfs1/apps/spark
Result:
{code}
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:278)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: 
hdfs1
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:108)
at 
org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:78)
... 6 more
Caused by: java.net.UnknownHostException: hdfs1
... 20 more
{code}


> [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA
> ---
>
> Key: SPARK-21470
>     URL: https://issues.apache.org/jira/browse/SPARK-21470
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> With Spark versions up to 2.1.1 it was possible to configure the history server 
> to read from HDFS without specifying the namenode:
> spark.history.fs.logDirectory hdfs:///apps/spark
> And this works with HDFS HA.
> Unfortunately there is a regression in Spark 2.2.0 where such a configuration 
> gives an error:
> {code}
> Caused by: java.io.IOException: Incomplete HDFS URI, no host: 
> hdfs:///apps/spark
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:142)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:108)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:78)
> ... 6 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21470) [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093716#comment-16093716
 ] 

Maciej Bryński edited comment on SPARK-21470 at 7/19/17 8:05 PM:
-

[~vanzin]
I tried.
{code}
/etc/hadoop/conf$ grep -A1 fs.defaultFS core-site.xml
  fs.defaultFS
  hdfs://hdfs1
{code}

So I change spark.history.fs.logDirectory to hdfs://hdfs1/apps/spark
Result:
{code}
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:278)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: 
hdfs1
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:108)
at 
org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:78)
... 6 more
Caused by: java.net.UnknownHostException: hdfs1
... 20 more
{code}




was (Author: maver1ck):
[~vanzin]
I tried.
{code}
/etc/hadoop/conf$ grep -A1 fs.defaultFS core-site.xml
  fs.defaultFS
  hdfs://hdfs1
{code}

So I change spark.history.fs.logDirectory to hdfs://hdfs1/apps/spark
Result:
{code}
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:278)
at 
org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.IllegalArgumentException: java.net.UnknownHostException: 
hdfs1
at 
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
at 
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at 
org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:108)
at 
org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:78)
... 6 more
Caused by: java.net.UnknownHostException: hdfs1
... 20 more
{code}


> [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA
> ---
>
> Key: SPARK-21470
> URL: https://issues.apache.org/jira/browse/SPARK-21470
> Project: Spark
>  Issue Type: Bug
>

[jira] [Closed] (SPARK-21470) [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA

2017-07-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński closed SPARK-21470.
--
Resolution: Invalid

> [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA
> ---
>
> Key: SPARK-21470
> URL: https://issues.apache.org/jira/browse/SPARK-21470
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> With Spark versions up to 2.1.1 it was possible to configure the history server 
> to read from HDFS without specifying the namenode:
> spark.history.fs.logDirectory hdfs:///apps/spark
> And this works with HDFS HA.
> Unfortunately there is a regression in Spark 2.2.0 where such a configuration 
> gives an error:
> {code}
> Caused by: java.io.IOException: Incomplete HDFS URI, no host: 
> hdfs:///apps/spark
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:142)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:108)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:78)
> ... 6 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21470) [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093750#comment-16093750
 ] 

Maciej Bryński commented on SPARK-21470:


OK.
I think I found the reason.
There was no HADOOP_CONF_DIR env variable.
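
For illustration only, a rough sketch (plain Hadoop client calls, not the history server's own code) of why the missing HADOOP_CONF_DIR produces exactly this error: hdfs1 is a logical HA nameservice defined in hdfs-site.xml, so without that file on the classpath the client treats hdfs1 as a hostname and fails with UnknownHostException.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Assumes HADOOP_CONF_DIR (e.g. /etc/hadoop/conf) is on the history server's classpath,
// so core-site.xml and hdfs-site.xml are picked up as default resources.
val hadoopConf = new Configuration()
println(hadoopConf.get("fs.defaultFS"))   // expected: hdfs://hdfs1

// Same code path as the earlier stack trace in this thread
// (Path.getFileSystem -> FileSystem.get): with dfs.nameservices and
// dfs.ha.namenodes.hdfs1 present this resolves the HA pair; without them it
// fails with java.net.UnknownHostException: hdfs1.
val fs = new Path("hdfs:///apps/spark").getFileSystem(hadoopConf)
println(fs.getUri)
{code}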

> [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA
> ---
>
> Key: SPARK-21470
> URL: https://issues.apache.org/jira/browse/SPARK-21470
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> With Spark versions up to 2.1.1 it was possible to configure the history server 
> to read from HDFS without specifying the namenode:
> spark.history.fs.logDirectory hdfs:///apps/spark
> And this works with HDFS HA.
> Unfortunately there is a regression in Spark 2.2.0 where such a configuration 
> gives an error:
> {code}
> Caused by: java.io.IOException: Incomplete HDFS URI, no host: 
> hdfs:///apps/spark
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:142)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:108)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:78)
> ... 6 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21470) [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA

2017-07-19 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093750#comment-16093750
 ] 

Maciej Bryński edited comment on SPARK-21470 at 7/19/17 9:06 PM:
-

OK.
I think I found the reason.
There was no HADOOP_CONF_DIR env variable.

Sorry for the trouble.


was (Author: maver1ck):
OK. 
I think I found the reason. 
There were no HADOOP_CONF_DIR env variable.

> [SPARK 2.2 Regression] Spark History server doesn't support HDFS HA
> ---
>
> Key: SPARK-21470
> URL: https://issues.apache.org/jira/browse/SPARK-21470
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> With Spark versions up to 2.1.1 it was possible to configure the history server 
> to read from HDFS without specifying the namenode:
> spark.history.fs.logDirectory hdfs:///apps/spark
> And this works with HDFS HA.
> Unfortunately there is a regression in Spark 2.2.0 where such a configuration 
> gives an error:
> {code}
> Caused by: java.io.IOException: Incomplete HDFS URI, no host: 
> hdfs:///apps/spark
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:142)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:108)
> at 
> org.apache.spark.deploy.history.FsHistoryProvider.(FsHistoryProvider.scala:78)
> ... 6 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19743) Exception when creating more than one implicit Encoder in REPL

2017-07-19 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-19743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-19743:
---
Affects Version/s: 2.1.1
   2.2.0

> Exception when creating more than one implicit Encoder in REPL
> --
>
> Key: SPARK-19743
> URL: https://issues.apache.org/jira/browse/SPARK-19743
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: Maciej Bryński
>
> During testing I wanted to create 2 different bean classes and encoders for 
> them. 
> The first time it worked.
> {code}
> scala> class Test(@scala.beans.BeanProperty var xxx: Long) {}
> defined class Test
> scala> import org.apache.spark.sql.Encoders
> import org.apache.spark.sql.Encoders
> scala> implicit val testEncoder = Encoders.bean(classOf[Test])
> testEncoder: org.apache.spark.sql.Encoder[Test] = class[xxx[0]: bigint]
> scala> spark.range(10).map(new Test(_)).show()
> +---+
> |xxx|
> +---+
> |  0|
> |  1|
> |  2|
> |  3|
> |  4|
> |  5|
> |  6|
> |  7|
> |  8|
> |  9|
> +---+
> {code}
> The second try gives me an exception.
> {code}
> scala> class Test2(@scala.beans.BeanProperty var xxx: Long) {}
> defined class Test2
> scala> implicit val test2Encoder = Encoders.bean(classOf[Test2])
> test2Encoder: org.apache.spark.sql.Encoder[Test2] = class[xxx[0]: bigint]
> scala> spark.range(10).map(new Test2(_)).show()
> 17/02/26 18:10:15 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2, 
> cdh-data-4.gid): java.lang.ExceptionInInitializerError
> at $line17.$read$$iw.(:9)
> at $line17.$read.(:45)
> at $line17.$read$.(:49)
> at $line17.$read$.()
> at $line19.$read$$iw.(:10)
> at $line19.$read.(:21)
> at $line19.$read$.(:25)
> at $line19.$read$.()
> at 
> $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:32)
> at 
> $line21.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: A master URL must be set in your 
> configuration
> at org.apache.spark.SparkContext.(SparkContext.scala:368)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2258)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:831)
> at 
> org.apache.spark.sql.SparkSession$Builder$$anonfun$8.apply(SparkSession.scala:823)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:823)
> at org.apache.spark.repl.Main$.createSparkSession(Main.scala:95)
> at $line3.$read$$iw$$iw.(:15)
> at $line3.$read$$iw.(:31)
> at $line3.$read.(:33)
> at $line3.$read$.(:37)
> at $line3.$read$.()
> ... 26 more
> 17/02/26 18:10:15 ERROR TaskSetManager: Task 0 in stage 2.0 failed 1 times; 
> aborting job
> org.apache.spark.SparkException: Job aborted due to stage fa

[jira] [Created] (SPARK-21507) Exception when using spark.jars.packages

2017-07-22 Thread JIRA
Maciej Bryński created SPARK-21507:
--

 Summary: Exception when using spark.jars.packages 
 Key: SPARK-21507
 URL: https://issues.apache.org/jira/browse/SPARK-21507
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Maciej Bryński
Priority: Minor


When more than one process is using the packages option it is possible to hit 
the following exception
{code}
[2017-07-21 18:14:18,356] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:18,356] {bash_operator.py:94} INFO - Ivy Default Cache set to: 
/home/bi/.ivy2/cache
[2017-07-21 18:14:18,357] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:18,356] {bash_operator.py:94} INFO - The jars for the packages stored in: 
/home/bi/.ivy2/jars
[2017-07-21 18:14:18,406] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:18,405] {bash_operator.py:94} INFO - :: loading settings :: url = 
jar:file:/opt/spark-2.2.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
[2017-07-21 18:14:18,735] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:18,735] {bash_operator.py:94} INFO - datastax#spark-cassandra-connector 
added as a dependency
[2017-07-21 18:14:18,737] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:18,737] {bash_operator.py:94} INFO - :: resolving dependencies :: 
org.apache.spark#spark-submit-parent;1.0
[2017-07-21 18:14:18,738] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:18,737] {bash_operator.py:94} INFO - confs: [default]
[2017-07-21 18:14:19,284] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,283] {bash_operator.py:94} INFO - found 
datastax#spark-cassandra-connector;2.0.3-s_2.11 in spark-packages
[2017-07-21 18:14:19,379] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,379] {bash_operator.py:94} INFO - found 
commons-beanutils#commons-beanutils;1.9.3 in central
[2017-07-21 18:14:19,415] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,415] {bash_operator.py:94} INFO - found 
commons-collections#commons-collections;3.2.2 in local-m2-cache
[2017-07-21 18:14:19,469] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,469] {bash_operator.py:94} INFO - found joda-time#joda-time;2.3 in 
central
[2017-07-21 18:14:19,518] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,517] {bash_operator.py:94} INFO - found org.joda#joda-convert;1.2 in 
central
[2017-07-21 18:14:19,583] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,582] {bash_operator.py:94} INFO - found com.twitter#jsr166e;1.1.0 in 
central
[2017-07-21 18:14:19,666] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,666] {bash_operator.py:94} INFO - found 
io.netty#netty-all;4.0.33.Final in central
[2017-07-21 18:14:19,732] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,732] {bash_operator.py:94} INFO - found 
org.scala-lang#scala-reflect;2.11.8 in local-m2-cache
[2017-07-21 18:14:19,831] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,830] {bash_operator.py:94} INFO - :: resolution report :: resolve 
1042ms :: artifacts dl 51ms
[2017-07-21 18:14:19,831] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,831] {bash_operator.py:94} INFO - :: modules in use:
[2017-07-21 18:14:19,834] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,832] {bash_operator.py:94} INFO - com.twitter#jsr166e;1.1.0 from 
central in [default]
[2017-07-21 18:14:19,834] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,832] {bash_operator.py:94} INFO - 
commons-beanutils#commons-beanutils;1.9.3 from central in [default]
[2017-07-21 18:14:19,834] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,832] {bash_operator.py:94} INFO - 
commons-collections#commons-collections;3.2.2 from local-m2-cache in [default]
[2017-07-21 18:14:19,834] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,832] {bash_operator.py:94} INFO - 
datastax#spark-cassandra-connector;2.0.3-s_2.11 from spark-packages in [default]
[2017-07-21 18:14:19,834] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,832] {bash_operator.py:94} INFO - io.netty#netty-all;4.0.33.Final from 
central in [default]
[2017-07-21 18:14:19,835] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,833] {bash_operator.py:94} INFO - joda-time#joda-time;2.3 from central 
in [default]
[2017-07-21 18:14:19,835] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,833] {bash_operator.py:94} INFO - org.joda#joda-convert;1.2 from 
central in [default]
[2017-07-21 18:14:19,835] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:19,833] {bash_operator.py:94} INFO - org.scala-lang#scala-reflect;2.11.8 
from local-m2-cache in [default]
[2017-07-21 18:14:19,835] {base_task_runner.py:95} INFO - Subtask: [2017-07-21 
18:14:1

[jira] [Updated] (SPARK-20712) [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has length greater than 4000 bytes

2017-07-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-20712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-20712:
---
Summary: [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when 
column type has length greater than 4000 bytes  (was: [SQL] Spark can't read 
Hive table when column type has length greater than 4000 bytes)

> [SPARK 2.1 REGRESSION][SQL] Spark can't read Hive table when column type has 
> length greater than 4000 bytes
> ---
>
> Key: SPARK-20712
>     URL: https://issues.apache.org/jira/browse/SPARK-20712
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.1.2, 2.3.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> Hi,
> I have the following issue.
> I'm trying to read a table from Hive where one of the columns is nested, so its 
> schema is longer than 4000 bytes.
> Everything worked on Spark 2.0.2. On 2.1.1 I'm getting Exception:
> {code}
> >> spark.read.table("SOME_TABLE")
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/opt/spark-2.1.1/python/pyspark/sql/readwriter.py", line 259, in table
> return self._df(self._jreader.table(tableName))
>   File 
> "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 
> 1133, in __call__
>   File "/opt/spark-2.1.1/python/pyspark/sql/utils.py", line 63, in deco
> return f(*a, **kw)
>   File "/opt/spark-2.1.1/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", 
> line 319, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o71.table.
> : org.apache.spark.SparkException: Cannot recognize hive type string: 
> SOME_VERY_LONG_FIELD_TYPE
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.org$apache$spark$sql$hive$client$HiveClientImpl$$fromHiveColumn(HiveClientImpl.scala:789)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:365)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
> at scala.collection.AbstractTraversable.map(Traversable.scala:104)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:365)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11.apply(HiveClientImpl.scala:361)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:361)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1.apply(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:279)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:226)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:225)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:268)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTableOption(HiveClientImpl.scala:359)
> at 
> org.apache.spark.sql.hive.client.HiveClient$class.getTable(HiveClient.scala:74)
> at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getTable(HiveClientImpl.scala:78)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:118)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$org$apache$spark$sql$hive$HiveExternalCatalog$$getRawTable$1.apply(HiveExternalCatalog.scala:

[jira] [Updated] (SPARK-21507) Exception when using spark.jars.packages

2017-07-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-21507:
---
Description: 
When more than one process is using the packages option it is possible to hit 
the following exception
{code}
INFO - Ivy Default Cache set to: /home/bi/.ivy2/cache
INFO - The jars for the packages stored in: /home/bi/.ivy2/jars
INFO - :: loading settings :: url = 
jar:file:/opt/spark-2.2.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
INFO - datastax#spark-cassandra-connector added as a dependency
INFO - :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
INFO - confs: [default]
INFO - found datastax#spark-cassandra-connector;2.0.3-s_2.11 in spark-packages
INFO - found commons-beanutils#commons-beanutils;1.9.3 in central
INFO - found commons-collections#commons-collections;3.2.2 in local-m2-cache
INFO - found joda-time#joda-time;2.3 in central
INFO - found org.joda#joda-convert;1.2 in central
INFO - found com.twitter#jsr166e;1.1.0 in central
INFO - found io.netty#netty-all;4.0.33.Final in central
INFO - found org.scala-lang#scala-reflect;2.11.8 in local-m2-cache
INFO - :: resolution report :: resolve 1042ms :: artifacts dl 51ms
INFO - :: modules in use:
INFO - com.twitter#jsr166e;1.1.0 from central in [default]
INFO - commons-beanutils#commons-beanutils;1.9.3 from central in [default]
INFO - commons-collections#commons-collections;3.2.2 from local-m2-cache in 
[default]
INFO - datastax#spark-cassandra-connector;2.0.3-s_2.11 from spark-packages in 
[default]
INFO - io.netty#netty-all;4.0.33.Final from central in [default]
INFO - joda-time#joda-time;2.3 from central in [default]
INFO - org.joda#joda-convert;1.2 from central in [default]
INFO - org.scala-lang#scala-reflect;2.11.8 from local-m2-cache in [default]
INFO - -
INFO - |  |modules||   artifacts   |
INFO - |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
INFO - -
INFO - |  default |   8   |   0   |   0   |   0   ||   8   |   0   |
INFO - -
INFO - :: retrieving :: org.apache.spark#spark-submit-parent
INFO - confs: [default]
INFO - Exception in thread "main" java.lang.RuntimeException: problem during 
retrieve of org.apache.spark#spark-submit-parent: java.text.ParseException: 
failed to parse report: 
/home/bi/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml: 
Premature end of file.
INFO - at 
org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
INFO - at 
org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
INFO - at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
INFO - at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1180)
INFO - at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:298)
INFO - at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
INFO - at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
INFO - at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
INFO - Caused by: java.text.ParseException: failed to parse report: 
/home/bi/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml: 
Premature end of file.
INFO - at 
org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293)
INFO - at 
org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329)
INFO - at 
org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)
INFO - ... 7 more
INFO - Caused by: org.xml.sax.SAXParseException; Premature end of file.
INFO - at 
org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
Source)
INFO - at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
INFO - at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
INFO - at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
INFO - at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
INFO - at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown 
Source)
INFO - at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
INFO - at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
INFO - at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
INFO - at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
INFO - at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown 
Source)
INFO - at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
INFO - at javax.xml.parsers.SAXParser.parse(SAXParser.java:328)
INFO - at 
org.apache.ivy.plugins.report.XmlReportParser$SaxXmlRep

[jira] [Updated] (SPARK-21507) Exception when using spark.jars.packages

2017-07-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-21507:
---
Description: 
When more than one process is using the packages option it is possible to hit 
the following exception
{code}
INFO - Ivy Default Cache set to: /home/bi/.ivy2/cache
INFO - The jars for the packages stored in: /home/bi/.ivy2/jars
INFO - :: loading settings :: url = 
jar:file:/opt/spark-2.2.0/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
INFO - datastax#spark-cassandra-connector added as a dependency
INFO - :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
INFO - confs: [default]
INFO - found datastax#spark-cassandra-connector;2.0.3-s_2.11 in spark-packages
INFO - found commons-beanutils#commons-beanutils;1.9.3 in central
INFO - found commons-collections#commons-collections;3.2.2 in local-m2-cache
INFO - found joda-time#joda-time;2.3 in central
INFO - found org.joda#joda-convert;1.2 in central
INFO - found com.twitter#jsr166e;1.1.0 in central
INFO - found io.netty#netty-all;4.0.33.Final in central
INFO - found org.scala-lang#scala-reflect;2.11.8 in local-m2-cache
INFO - :: resolution report :: resolve 1042ms :: artifacts dl 51ms
INFO - :: modules in use:
INFO - com.twitter#jsr166e;1.1.0 from central in [default]
INFO - commons-beanutils#commons-beanutils;1.9.3 from central in [default]
INFO - commons-collections#commons-collections;3.2.2 from local-m2-cache in 
[default]
INFO - datastax#spark-cassandra-connector;2.0.3-s_2.11 from spark-packages in 
[default]
INFO - io.netty#netty-all;4.0.33.Final from central in [default]
INFO - joda-time#joda-time;2.3 from central in [default]
INFO - org.joda#joda-convert;1.2 from central in [default]
INFO - org.scala-lang#scala-reflect;2.11.8 from local-m2-cache in [default]
INFO - -
INFO - |  |modules||   artifacts   |
INFO - |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
INFO - -
INFO - |  default |   8   |   0   |   0   |   0   ||   8   |   0   |
INFO - -
INFO - :: retrieving :: org.apache.spark#spark-submit-parent
INFO - confs: [default]
INFO - Exception in thread "main" java.lang.RuntimeException: problem during 
retrieve of org.apache.spark#spark-submit-parent: java.text.ParseException: 
failed to parse report: 
/home/bi/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml: 
Premature end of file.
INFO - at 
org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:249)
INFO - at 
org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:83)
INFO - at org.apache.ivy.Ivy.retrieve(Ivy.java:551)
INFO - at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1180)
INFO - at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:298)
INFO - at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
INFO - at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
INFO - at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
INFO - Caused by: java.text.ParseException: failed to parse report: 
/home/bi/.ivy2/cache/org.apache.spark-spark-submit-parent-default.xml: 
Premature end of file.
INFO - at 
org.apache.ivy.plugins.report.XmlReportParser.parse(XmlReportParser.java:293)
INFO - at 
org.apache.ivy.core.retrieve.RetrieveEngine.determineArtifactsToCopy(RetrieveEngine.java:329)
INFO - at 
org.apache.ivy.core.retrieve.RetrieveEngine.retrieve(RetrieveEngine.java:118)
INFO - ... 7 more
INFO - Caused by: org.xml.sax.SAXParseException; Premature end of file.
INFO - at 
org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown 
Source)
INFO - at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
INFO - at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
INFO - at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
INFO - at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
INFO - at org.apache.xerces.impl.XMLVersionDetector.determineDocVersion(Unknown 
Source)
INFO - at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
INFO - at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
INFO - at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
INFO - at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
INFO - at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown 
Source)
INFO - at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
INFO - at javax.xml.parsers.SAXParser.parse(SAXParser.java:328)
INFO - at 
org.apache.ivy.plugins.report.XmlReportParser$SaxXmlRep

[jira] [Commented] (SPARK-20392) Slow performance when calling fit on ML pipeline for dataset with many columns but few rows

2017-07-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-20392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16098612#comment-16098612
 ] 

Maciej Bryński commented on SPARK-20392:


Is it safe to merge it into 2.2?
I'm tracing problems with Catalyst performance and this could be a solution.

> Slow performance when calling fit on ML pipeline for dataset with many 
> columns but few rows
> ---
>
> Key: SPARK-20392
> URL: https://issues.apache.org/jira/browse/SPARK-20392
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Barry Becker
>Assignee: Liang-Chi Hsieh
>Priority: Blocker
> Attachments: blockbuster.csv, blockbuster_fewCols.csv, 
> giant_query_plan_for_fitting_pipeline.txt, model_9754.zip, model_9756.zip
>
>
> This started as a [question on stack 
> overflow|http://stackoverflow.com/questions/43484006/why-is-it-slow-to-apply-a-spark-pipeline-to-dataset-with-many-columns-but-few-ro],
>  but it seems like a bug.
> I am testing spark pipelines using a simple dataset (attached) with 312 
> (mostly numeric) columns, but only 421 rows. It is small, but it takes 3 
> minutes to apply my ML pipeline to it on a 24 core server with 60G of memory. 
> This seems much too long for such a tiny dataset. Similar pipelines run 
> quickly on datasets that have fewer columns and more rows. It's something 
> about the number of columns that is causing the slow performance.
> Here are a list of the stages in my pipeline:
> {code}
> 000_strIdx_5708525b2b6c
> 001_strIdx_ec2296082913
> 002_bucketizer_3cbc8811877b
> 003_bucketizer_5a01d5d78436
> 004_bucketizer_bf290d11364d
> 005_bucketizer_c3296dfe94b2
> 006_bucketizer_7071ca50eb85
> 007_bucketizer_27738213c2a1
> 008_bucketizer_bd728fd89ba1
> 009_bucketizer_e1e716f51796
> 010_bucketizer_38be665993ba
> 011_bucketizer_5a0e41e5e94f
> 012_bucketizer_b5a3d5743aaa
> 013_bucketizer_4420f98ff7ff
> 014_bucketizer_777cc4fe6d12
> 015_bucketizer_f0f3a3e5530e
> 016_bucketizer_218ecca3b5c1
> 017_bucketizer_0b083439a192
> 018_bucketizer_4520203aec27
> 019_bucketizer_462c2c346079
> 020_bucketizer_47435822e04c
> 021_bucketizer_eb9dccb5e6e8
> 022_bucketizer_b5f63dd7451d
> 023_bucketizer_e0fd5041c841
> 024_bucketizer_ffb3b9737100
> 025_bucketizer_e06c0d29273c
> 026_bucketizer_36ee535a425f
> 027_bucketizer_ee3a330269f1
> 028_bucketizer_094b58ea01c0
> 029_bucketizer_e93ea86c08e2
> 030_bucketizer_4728a718bc4b
> 031_bucketizer_08f6189c7fcc
> 032_bucketizer_11feb74901e6
> 033_bucketizer_ab4add4966c7
> 034_bucketizer_4474f7f1b8ce
> 035_bucketizer_90cfa5918d71
> 036_bucketizer_1a9ff5e4eccb
> 037_bucketizer_38085415a4f4
> 038_bucketizer_9b5e5a8d12eb
> 039_bucketizer_082bb650ecc3
> 040_bucketizer_57e1e363c483
> 041_bucketizer_337583fbfd65
> 042_bucketizer_73e8f6673262
> 043_bucketizer_0f9394ed30b8
> 044_bucketizer_8530f3570019
> 045_bucketizer_c53614f1e507
> 046_bucketizer_8fd99e6ec27b
> 047_bucketizer_6a8610496d8a
> 048_bucketizer_888b0055c1ad
> 049_bucketizer_974e0a1433a6
> 050_bucketizer_e848c0937cb9
> 051_bucketizer_95611095a4ac
> 052_bucketizer_660a6031acd9
> 053_bucketizer_aaffe5a3140d
> 054_bucketizer_8dc569be285f
> 055_bucketizer_83d1bffa07bc
> 056_bucketizer_0c6180ba75e6
> 057_bucketizer_452f265a000d
> 058_bucketizer_38e02ddfb447
> 059_bucketizer_6fa4ad5d3ebd
> 060_bucketizer_91044ee766ce
> 061_bucketizer_9a9ef04a173d
> 062_bucketizer_3d98eb15f206
> 063_bucketizer_c4915bb4d4ed
> 064_bucketizer_8ca2b6550c38
> 065_bucketizer_417ee9b760bc
> 066_bucketizer_67f3556bebe8
> 067_bucketizer_0556deb652c6
> 068_bucketizer_067b4b3d234c
> 069_bucketizer_30ba55321538
> 070_bucketizer_ad826cc5d746
> 071_bucketizer_77676a898055
> 072_bucketizer_05c37a38ce30
> 073_bucketizer_6d9ae54163ed
> 074_bucketizer_8cd668b2855d
> 075_bucketizer_d50ea1732021
> 076_bucketizer_c68f467c9559
> 077_bucketizer_ee1dfc840db1
> 078_bucketizer_83ec06a32519
> 079_bucketizer_741d08c1b69e
> 080_bucketizer_b7402e4829c7
> 081_bucketizer_8adc590dc447
> 082_bucketizer_673be99bdace
> 083_bucketizer_77693b45f94c
> 084_bucketizer_53529c6b1ac4
> 085_bucketizer_6a3ca776a81e
> 086_bucketizer_6679d9588ac1
> 087_bucketizer_6c73af456f65
> 088_bucketizer_2291b2c5ab51
> 089_bucketizer_cb3d0fe669d8
> 090_bucketizer_e71f913c1512
> 091_bucketizer_156528f65ce7
> 092_bucketizer_f3ec5dae079b
> 093_bucketizer_809fab77eee1
> 094_bucketizer_6925831511e6
> 0

[jira] [Commented] (SPARK-12261) pyspark crash for large dataset

2017-07-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099686#comment-16099686
 ] 

Paul Magnus Sørensen-Clark commented on SPARK-12261:


I have a similar problem, so I have tried the solution suggested here, adding 
"for _ in iterator: pass". Two people suggested doing this, but in different 
places.
Niall McCarrol said to add it in the function process in pyspark/worker.py.
Shea Parkes said to add it in the function takeUpToNumLeft, which I found in 
pyspark/rdd.py.
What is the deal with these two different locations? I don't really know the 
difference, so I just added it both places to be safe. This error only randomly 
happens every once in a while for me. So it is hard to tell if it actually 
helped until several days have passed with no error. I use Python and Spark 
version 2.1.1.

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
>     URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text file (over 100 MB) via textFile in PySpark; when 
> I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12261) pyspark crash for large dataset

2017-07-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099686#comment-16099686
 ] 

Paul Magnus Sørensen-Clark edited comment on SPARK-12261 at 7/25/17 8:10 AM:
-

I have a similar problem, so I have tried the solution suggested here, adding 
"for _ in iterator: pass". Two people suggested doing this, but in different 
places.
Niall McCarrol said to add it in the function process in pyspark/worker.py.
Shea Parkes said to add it in the function takeUpToNumLeft, which I found in 
pyspark/rdd.py.
What is the deal with these two different locations? I don't really know the 
difference, so I just added it both places to be safe. This error only randomly 
happens every once in a while for me, so it is hard to tell if it actually 
helped until several days have passed with no error. I use Python and Spark 
version 2.1.1.


was (Author: paulmag91):
I have a similar problem, so I have tried the solution suggested here, adding 
"for _ in iterator: pass". Two people suggested doing this, but in different 
places.
Niall McCarrol said to add it in the function process in pyspark/worker.py.
Shea Parkes said to add it in the function takeUpToNumLeft, which I found in 
pyspark/rdd.py.
What is the deal with these two different locations? I don't really know the 
difference, so I just added it both places to be safe. This error only randomly 
happens every once in a while for me. So it is hard to tell if it actually 
helped until several days have passed with no error. I use Python and Spark 
version 2.1.1.

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
>     URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text file (over 100 MB) via textFile in PySpark; when 
> I ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then I ran the same code for a small text file, and this time .take() worked fine.
> How can I solve this problem?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21534) Exception when creating dataframe from python row with empty bytearray

2017-07-25 Thread JIRA
Maciej Bryński created SPARK-21534:
--

 Summary: Exception when creating dataframe from python row with 
empty bytearray
 Key: SPARK-21534
 URL: https://issues.apache.org/jira/browse/SPARK-21534
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 2.2.0
Reporter: Maciej Bryński


{code}
spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: 
{"abc": x.xx})).show()
{code}
This code throws an exception. It looks like a corner case.
{code}
net.razorvine.pickle.PickleException: invalid pickle data for bytearray; 
expected 1 or 2 args, got 0
at 
net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at 
org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152)
at 
org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2853)
at 
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2153)
at 
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2153)
at org.apache.spark.sql.Dataset$$anonf

[jira] [Updated] (SPARK-21534) PickleException when creating dataframe from python row with empty bytearray

2017-07-25 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-21534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-21534:
---
Summary: PickleException when creating dataframe from python row with empty 
bytearray  (was: Exception when creating dataframe from python row with empty 
bytearray)

> PickleException when creating dataframe from python row with empty bytearray
> 
>
> Key: SPARK-21534
> URL: https://issues.apache.org/jira/browse/SPARK-21534
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Maciej Bryński
>
> {code}
> spark.createDataFrame(spark.sql("select unhex('') as xx").rdd.map(lambda x: 
> {"abc": x.xx})).show()
> {code}
> This code throws an exception. It looks like a corner case.
> {code}
> net.razorvine.pickle.PickleException: invalid pickle data for bytearray; 
> expected 1 or 2 args, got 0
>   at 
> net.razorvine.pickle.objects.ByteArrayConstructor.construct(ByteArrayConstructor.java:20)
>   at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
>   at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
>   at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
>   at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
>   at 
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:152)
>   at 
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:748)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1499)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1487)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1486)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1486)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1714)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1669)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1658)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
>   at org.apache.spark.SparkConte

[jira] [Created] (SPARK-21558) Kinesis lease failover time should be increased or made configurable

2017-07-28 Thread JIRA
Clément MATHIEU created SPARK-21558:
---

 Summary: Kinesis lease failover time should be increased or made 
configurable
 Key: SPARK-21558
 URL: https://issues.apache.org/jira/browse/SPARK-21558
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.0.2
Reporter: Clément MATHIEU


I have a Spark Streaming application reading from a Kinesis stream which 
exhibits serious shard lease fickleness. The root cause has been identified as 
the KCL default failover time being too low for our typical JVM pause times:

# KinesisClientLibConfiguration#DEFAULT_FAILOVER_TIME_MILLIS is 10 seconds, 
meaning that if a worker does not renew a lease within 10s, other workers will 
steal it
# spark-streaming-kinesis-asl uses the default KCL failover time and does not 
allow configuring it
# Executors' JVM logs show frequent 10+ second pauses

While we could spend some time fine-tuning the GC configuration to reduce pause 
times, I am wondering whether 10 seconds is simply too low. Typical Spark 
executors have very large heaps, and the GCs available in HotSpot are not great 
at ensuring low and deterministic pause times. One might also want to use 
ParallelGC.

What do you think about:

# Increasing the failover time (it might hurt applications with low-latency 
requirements)
# Making it configurable




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-08-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110006#comment-16110006
 ] 

Michael Schmeißer commented on SPARK-650:
-

Please see my comment from 05/Dec/16 12:39 and the following discussion - we 
are kind of going in circles here. I tried to explain, as well as I can, the 
(real) problems we were facing, which solution we applied to them, and why 
other solutions were dismissed. The fact is: there are numerous people here who 
seem to have the same issues and are glad to apply the workaround, because 
"using the singleton" doesn't seem to provide a solution for them either. 
Perhaps we all just don't understand how to do this, but then again something 
seems to be missing - at least documentation, doesn't it? What I can add is 
that we have put experienced developers on this topic, and they have used quite 
a few singletons before.
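
For readers hitting the same problem, the "singleton" approach referred to above is usually a lazily initialized module-level object used from inside mapPartitions. A minimal sketch (all names here are illustrative, not part of any Spark API):

{code}
# Illustrative sketch of the per-executor lazy-initialization ("singleton")
# workaround; the reporting client is a stand-in for whatever needs one-time
# setup per executor / Python worker process.
_reporting_client = None

def _get_reporting_client():
    global _reporting_client
    if _reporting_client is None:
        _reporting_client = {"configured": True}  # stand-in for real setup code
    return _reporting_client

def process_partition(rows):
    client = _get_reporting_client()  # initialized at most once per worker process
    for row in rows:
        yield row  # use `client` here as needed

# usage: rdd.mapPartitions(process_partition)
{code}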

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2016-06-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356633#comment-15356633
 ] 

Maciej Bryński commented on SPARK-9999:
---

[~rxin]
What about the Python API? What's the target version?
https://databricks.com/blog/2016/01/04/introducing-apache-spark-datasets.html

Quote:
"As we look forward to Spark 2.0, we plan some exciting improvements to 
Datasets, specifically:
Python Support."

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-9999
>     URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13233) Python Dataset

2016-06-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356645#comment-15356645
 ] 

Maciej Bryński commented on SPARK-13233:


[~holdenk]
Did you find out what's going on with the Dataset API for Python?

[~davies]
Maybe you can shed some light on the API?



> Python Dataset
> --
>
> Key: SPARK-13233
> URL: https://issues.apache.org/jira/browse/SPARK-13233
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
> Attachments: DesignDocPythonDataset.pdf
>
>
> add Python Dataset w.r.t. the scala version



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2016-06-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356647#comment-15356647
 ] 

Maciej Bryński commented on SPARK-9999:
---

OK.
So what about this patch?
https://issues.apache.org/jira/browse/SPARK-13594

Should we break backward compatibility in the Python DataFrame API?

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5151) Parquet Predicate Pushdown Does Not Work with Nested Structures.

2016-06-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356648#comment-15356648
 ] 

Maciej Bryński commented on SPARK-5151:
---

CC: [~michael]
Found this.

> Parquet Predicate Pushdown Does Not Work with Nested Structures.
> 
>
> Key: SPARK-5151
> URL: https://issues.apache.org/jira/browse/SPARK-5151
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: pyspark, spark-ec2 created cluster
>Reporter: Brad Willard
>  Labels: parquet, pyspark, sql
>
> I have json files of objects created with a nested structure roughly of the 
> form:
> { "id": 123, "event": "login", "meta_data": {"user": "user1"}}
> 
> { "id": 125, "event": "login", "meta_data": {"user": "user2"}}
> I load the data via Spark with
> rdd = sql_context.jsonFile()
> # save it as a parquet file
> rdd.saveAsParquetFile()
> rdd = sql_context.parquetFile()
> rdd.registerTempTable('events')
> So if I run this query, it works without issue when predicate pushdown is 
> disabled:
> select count(1) from events where meta_data.user = "user1"
> If I enable predicate pushdown, I get an error saying meta_data.user is not in 
> the schema:
> Py4JJavaError: An error occurred while calling o218.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 125 
> in stage 12.0 failed 4 times, most recent failure: Lost task 125.3 in stage 
> 12.0 (TID 6164, ): java.lang.IllegalArgumentException: Column [user] was not 
> found in schema!
>   at parquet.Preconditions.checkArgument(Preconditions.java:47)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
>   at 
> parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
>   at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> .
> I expect this is actually related to another bug I filed where the nested 
> structure is not preserved with Spark SQL.
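
For anyone trying to reproduce this end to end, here is a sketch with placeholder paths (the relevant setting should be spark.sql.parquet.filterPushdown, which the sketch turns on explicitly):

{code}
# Hypothetical reproduction sketch; paths are placeholders.
events_json = "/path/to/events.json"
events_parquet = "/path/to/events.parquet"

rdd = sql_context.jsonFile(events_json)
rdd.saveAsParquetFile(events_parquet)

events = sql_context.parquetFile(events_parquet)
events.registerTempTable("events")

# works with pushdown disabled, fails with it enabled (per the report above)
sql_context.sql("SET spark.sql.parquet.filterPushdown=true").collect()
print(sql_context.sql(
    "select count(1) from events where meta_data.user = 'user1'").collect())
{code}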



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16226) change the way of JDBC commit

2016-06-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15356653#comment-15356653
 ] 

Maciej Bryński commented on SPARK-16226:


-1 for this patch.

> change the way of JDBC commit
> -
>
> Key: SPARK-16226
> URL: https://issues.apache.org/jira/browse/SPARK-16226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2
>Reporter: lihongli
>
> In the file 
> spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala
> we have the method savePartition(). We create a JDBC connection for each 
> partition, but when I insert data into my relational DB, it is blocked. I found 
> that we commit only once per partition. Then I changed it: I commit 
> after executeBatch() and my code worked properly. I do not know how it 
> performs with a NoSQL database, but it does cause a problem with my relational DB.
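
To make the proposed change concrete, here is an illustrative, driver-agnostic sketch of commit-per-batch versus commit-per-partition. It uses sqlite3 purely as a stand-in connection and is not Spark's JdbcUtils code:

{code}
# Illustrative sketch only; sqlite3 stands in for a JDBC connection.
import sqlite3

def save_partition(rows, db_path="/tmp/example.db", batch_size=1000):
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
    batch = []
    for row in rows:
        batch.append((row,))
        if len(batch) >= batch_size:
            cur.executemany("INSERT INTO t VALUES (?)", batch)
            conn.commit()  # commit per batch, as proposed in the description
            batch = []
    if batch:
        cur.executemany("INSERT INTO t VALUES (?)", batch)
    conn.commit()  # Spark today commits only once, here, per partition
    conn.close()

# usage sketch: rdd.foreachPartition(save_partition)
{code}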



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-06-30 Thread JIRA
Maciej Bryński created SPARK-16320:
--

 Summary: Spark 2.0 slower than 1.6 when querying nested columns
 Key: SPARK-16320
 URL: https://issues.apache.org/jira/browse/SPARK-16320
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Maciej Bryński


I did some tests on a parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested the following queries:
1) select count(*) where id > some_id
In this query performance is similar. (about 1 sec)

2) select count(*) where nested_column.id > some_id
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-06-30 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) `select count(*) where id > some_id`
In this query performance is similar. (about 1 sec)

2) `select count(*) where nested_column.id > some_id`
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) select count(*) where id > some_id
In this query performance is similar. (about 1 sec)

2) select count(*) where nested_column.id > some_id
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
>     URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) `select count(*) where id > some_id`
> In this query performance is similar. (about 1 sec)
> 2) `select count(*) where nested_column.id > some_id`
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-06-30 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) `select count(*) where id > some_id`
In this query performance is similar. (about 1 sec)

2) `select count(*) where nested_column.id > some_id`
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
>     URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16321) Pyspark 2.0 performance drop vs pyspark 1.6

2016-06-30 Thread JIRA
Maciej Bryński created SPARK-16321:
--

 Summary: Pyspark 2.0 performance drop vs pyspark 1.6
 Key: SPARK-16321
 URL: https://issues.apache.org/jira/browse/SPARK-16321
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.0.0
Reporter: Maciej Bryński


I did some tests on a parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is 2x slower.

{code}
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %
10 else []).collect()
{code}
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)

I used BasicProfiler for this task and cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?
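
For context, the benchmarked selection can also be written without leaving the DataFrame API, which avoids the Python worker round-trip that the flatMap version exercises. A sketch (`some_id` is a placeholder, `df` is the DataFrame from the snippet above):

{code}
# Sketch: same selection expressed purely in the DataFrame API, so rows do not
# round-trip through the Python workers. `some_id` is a placeholder value and
# `df` comes from the snippet above.
some_id = 0
ids = [r.id for r in
       df.where('id > {}'.format(some_id))
         .where('id % 10 = 0')
         .select('id')
         .collect()]
{code}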



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16321) Pyspark 2.0 performance drop vs pyspark 1.6

2016-06-30 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16321:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is 2x slower.

{code}
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 else 
[]).collect()
{code}
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)

I used BasicProfiler for this task and cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is 2x slower.

{code}
df = sqlctx.read.parquet(path)
df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %
10 else []).collect()
{code}
Spark 1.6 -> 2.3 min
Spark 2.0 -> 4.6 min (2x slower)

I used BasicProfiler for this task and cumulative time was:
Spark 1.6 - 4300 sec
Spark 2.0 - 5800 sec

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?


> Pyspark 2.0 performance drop vs pyspark 1.6
> ---
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-06-30 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15357213#comment-15357213
 ] 

Maciej Bryński commented on SPARK-16320:


OK. 
I'll try to confirm this issue on generated data.


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16321) Pyspark 2.0 performance drop vs pyspark 1.6

2016-07-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359702#comment-15359702
 ] 

Maciej Bryński commented on SPARK-16321:


[~zjffdu]
This query executes in about 1 sec.

[~srowen]
Maybe we can start with a Python profile?

> Pyspark 2.0 performance drop vs pyspark 1.6
> ---
>
> Key: SPARK-16321
> URL: https://issues.apache.org/jira/browse/SPARK-16321
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %10 
> else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16206) Defining our own folds using CrossValidator

2016-07-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361737#comment-15361737
 ] 

Rémi Delassus commented on SPARK-16206:
---

>You can implement whatever you want to produce folds though, so not clear what 
>this is about.
I don't think so.
The `fit` method explicitly calls the kFold method, and there is no way to tell 
it to use another one.
How would you do so?
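
For illustration, here is a minimal sketch of what running cross-validation over user-defined folds by hand looks like; `estimator`, `evaluator` and the list of (train_df, validation_df) pairs are assumed to be built by the caller:

{code}
# Minimal sketch: cross-validation over user-defined folds, bypassing
# CrossValidator's built-in random kFold split. `folds` is a user-built list
# of (train_df, validation_df) DataFrame pairs.
def evaluate_over_folds(estimator, evaluator, folds):
    metrics = []
    for train_df, validation_df in folds:
        model = estimator.fit(train_df)
        metrics.append(evaluator.evaluate(model.transform(validation_df)))
    return sum(metrics) / len(metrics)
{code}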

> Defining our own folds using CrossValidator
> ---
>
> Key: SPARK-16206
> URL: https://issues.apache.org/jira/browse/SPARK-16206
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Danilo Bustos
>Priority: Trivial
>
> I have been using cross validation process in order to train a Naive Bayes 
> Model and I realize that it uses kFold method to get the random sampling data 
> in order to create the folds. This method return an Array[(RDD[T], RDD[T])] 
> of tuples, which I think are the set of different combination of the folds 
> for training and testing.
> My question is whether there is any specific reason because the API does not 
> allow you to define your own array of folds. I think would be a good idea if 
> this capability is supported, it would help a lot. 
> Please refer to: 
> http://stackoverflow.com/questions/37868984/why-we-can-not-define-our-own-folds-when-we-are-using-crossvalidator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-04 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

UPDATE.
I created script to generate data and to confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct  # needed by _create_nested_data below
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

def _create_column_data(cols):
import random
random.seed()
return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
range(cols)}

def _create_sample_df(cols, rows):
rdd = sc.parallelize(range(rows))   
data = rdd.map(lambda r: _create_column_data(cols))
df = sqlctx.createDataFrame(data)
return df

def _create_nested_data(levels, rows):
if len(levels) == 1:
return _create_sample_df(levels[0], rows).cache()
else:
df = _create_nested_data(levels[1:], rows)
return df.select([struct(df.columns).alias("column{}".format(i)) 
for i in range(levels[0])])
df = _create_nested_data(levels, rows)
#return df
df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query

{code}

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?


> Spark 2.0 slower than 1.6 when querying nested columns
> ------
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> UPDATE.
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> 

[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-04 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

UPDATE.
I created script to generate data and to confirm this problem.
{code:java}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct  # needed by _create_nested_data below
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

def _create_column_data(cols):
import random
random.seed()
return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
range(cols)}

def _create_sample_df(cols, rows):
rdd = sc.parallelize(range(rows))   
data = rdd.map(lambda r: _create_column_data(cols))
df = sqlctx.createDataFrame(data)
return df

def _create_nested_data(levels, rows):
if len(levels) == 1:
return _create_sample_df(levels[0], rows).cache()
else:
df = _create_nested_data(levels[1:], rows)
return df.select([struct(df.columns).alias("column{}".format(i)) 
for i in range(levels[0])])
df = _create_nested_data(levels, rows)
#return df
df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query

{code}

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

UPDATE.
I created script to generate data and to confirm this problem.
{code:python}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

def _create_column_data(cols):
import random
random.seed()
return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
range(cols)}

def _create_sample_df(cols, rows):
rdd = sc.parallelize(range(rows))   
data = rdd.map(lambda r: _create_column_data(cols))
df = sqlctx.createDataFrame(data)
return df

def _create_nested_data(levels, rows):
if len(levels) == 1:
return _create_sample_df(levels[0], rows).cache()
else:
df = _create_nested_data(levels[1:], rows)
return df.select([struct(df.columns).alias("column{}".format(i)) 
for i in range(levels[0])])
df = _create_nested_data(levels, rows)
#return df
df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query

{code}


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> S

[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-04 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

UPDATE.
I created script to generate data and to confirm this problem.
{code:python}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct  # needed by _create_nested_data below
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

def _create_column_data(cols):
import random
random.seed()
return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
range(cols)}

def _create_sample_df(cols, rows):
rdd = sc.parallelize(range(rows))   
data = rdd.map(lambda r: _create_column_data(cols))
df = sqlctx.createDataFrame(data)
return df

def _create_nested_data(levels, rows):
if len(levels) == 1:
return _create_sample_df(levels[0], rows).cache()
else:
df = _create_nested_data(levels[1:], rows)
return df.select([struct(df.columns).alias("column{}".format(i)) 
for i in range(levels[0])])
df = _create_nested_data(levels, rows)
#return df
df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query

{code}

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

UPDATE.
I created script to generate data and to confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

def _create_column_data(cols):
import random
random.seed()
return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
range(cols)}

def _create_sample_df(cols, rows):
rdd = sc.parallelize(range(rows))   
data = rdd.map(lambda r: _create_column_data(cols))
df = sqlctx.createDataFrame(data)
return df

def _create_nested_data(levels, rows):
if len(levels) == 1:
return _create_sample_df(levels[0], rows).cache()
else:
df = _create_nested_data(levels[1:], rows)
return df.select([struct(df.columns).alias("column{}".format(i)) 
for i in range(levels[0])])
df = _create_nested_data(levels, rows)
#return df
df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query

{code}


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6

[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-04 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

UPDATE.
I created script to generate data and to confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct  # needed by _create_nested_data below
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

def _create_column_data(cols):
import random
random.seed()
return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
range(cols)}

def _create_sample_df(cols, rows):
rdd = sc.parallelize(range(rows))   
data = rdd.map(lambda r: _create_column_data(cols))
df = sqlctx.createDataFrame(data)
return df

def _create_nested_data(levels, rows):
if len(levels) == 1:
return _create_sample_df(levels[0], rows).cache()
else:
df = _create_nested_data(levels[1:], rows)
return df.select([struct(df.columns).alias("column{}".format(i)) 
for i in range(levels[0])])
df = _create_nested_data(levels, rows)
#return df
df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop

  was:
I did some test on parquet file with many nested columns (about 30G in
400 partitions) and Spark 2.0 is sometimes slower.

I tested following queries:
1) {code}select count(*) where id > some_id{code}
In this query performance is similar. (about 1 sec)

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance ?

I don't know how to prepare sample data to show the problem.
Any ideas ? Or public data with many nested columns ?

UPDATE.
I created script to generate data and to confirm this problem.
{code:java}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

def _create_column_data(cols):
import random
random.seed()
return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
range(cols)}

def _create_sample_df(cols, rows):
rdd = sc.parallelize(range(rows))   
data = rdd.map(lambda r: _create_column_data(cols))
df = sqlctx.createDataFrame(data)
return df

def _create_nested_data(levels, rows):
if len(levels) == 1:
return _create_sample_df(levels[0], rows).cache()
else:
df = _create_nested_data(levels[1:], rows)
return df.select([struct(df.columns).alias("column{}".format(i)) 
for i in range(levels[0])])
df = _create_nested_data(levels, rows)
#return df
df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query

{code}


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
>

[jira] [Commented] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361781#comment-15361781
 ] 

Maciej Bryński commented on SPARK-16320:


[~rxin]
I created a benchmark script and added the results.

Spark 2.0 is about 25% slower than 1.6.

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is sometimes slower.
> I tested following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query performance is similar. (about 1 sec)
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?
> UPDATE.
> I created script to generate data and to confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
> #Data creation
> MAX_SIZE = 2**32 - 1
> path = '/mnt/mfs/parquet_nested'
> def create_sample_data(levels, rows, path):
> 
> def _create_column_data(cols):
> import random
> random.seed()
> return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in 
> range(cols)}
> 
> def _create_sample_df(cols, rows):
> rdd = sc.parallelize(range(rows))   
> data = rdd.map(lambda r: _create_column_data(cols))
> df = sqlctx.createDataFrame(data)
> return df
> 
> def _create_nested_data(levels, rows):
> if len(levels) == 1:
> return _create_sample_df(levels[0], rows).cache()
> else:
> df = _create_nested_data(levels[1:], rows)
> return df.select([struct(df.columns).alias("column{}".format(i)) 
> for i in range(levels[0])])
> df = _create_nested_data(levels, rows)
> #return df
> df.write.mode('overwrite').parquet(path)
> 
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361781#comment-15361781
 ] 

Maciej Bryński edited comment on SPARK-16320 at 7/4/16 9:17 PM:


[~rxin]
I created a benchmark script and added the results.

Spark 2.0 is about *25% slower* than 1.6.


was (Author: maver1ck):
[~rxin]
I created benchmark script and added results.

Spark 2.0 is about 25% slower than 1.6.

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>
> I did some tests on a parquet file with many nested columns (about 30 GB in 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query the performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> UPDATE.
> I created a script to generate data and confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
>
> #Data creation
> MAX_SIZE = 2**32 - 1
>
> path = '/mnt/mfs/parquet_nested'
>
> def create_sample_data(levels, rows, path):
>
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
>
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
>
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
>
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
>
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-04 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some tests on a parquet file with many nested columns (about 30 GB in 400 partitions) and Spark 2.0 is sometimes slower.

I tested the following queries:
1) {code}select count(*) where id > some_id{code}
In this query the performance is similar (about 1 sec).

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance?

I don't know how to prepare sample data to show the problem.
Any ideas? Or public data with many nested columns?

UPDATE.
I created a script to generate data and confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop

  was:
I did some tests on a parquet file with many nested columns (about 30 GB in 400 partitions) and Spark 2.0 is sometimes slower.

I tested the following queries:
1) {code}select count(*) where id > some_id{code}
In this query the performance is similar (about 1 sec).

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance?

I don't know how to prepare sample data to show the problem.
Any ideas? Or public data with many nested columns?

UPDATE.
I created a script to generate data and confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    #return df
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: 

[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-04 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some tests on a parquet file with many nested columns (about 30 GB in 400 partitions) and Spark 2.0 is sometimes slower.

I tested the following queries:
1) {code}select count(*) where id > some_id{code}
In this query the performance is similar (about 1 sec).

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance?

I don't know how to prepare sample data to show the problem.
Any ideas? Or public data with many nested columns?

*UPDATE*
I created a script to generate data and confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop

  was:
I did some tests on a parquet file with many nested columns (about 30 GB in 400 partitions) and Spark 2.0 is sometimes slower.

I tested the following queries:
1) {code}select count(*) where id > some_id{code}
In this query the performance is similar (about 1 sec).

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance?

I don't know how to prepare sample data to show the problem.
Any ideas? Or public data with many nested columns?

UPDATE.
I created a script to generate data and confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Aff

[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-04 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Priority: Critical  (was: Major)

> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I did some tests on a parquet file with many nested columns (about 30 GB in 400 partitions) and Spark 2.0 is sometimes slower.
> I tested the following queries:
> 1) {code}select count(*) where id > some_id{code}
> In this query the performance is similar (about 1 sec).
> 2) {code}select count(*) where nested_column.id > some_id{code}
> Spark 1.6 -> 1.6 min
> Spark 2.0 -> 2.1 min
> Should I expect such a drop in performance?
> I don't know how to prepare sample data to show the problem.
> Any ideas? Or public data with many nested columns?
> *UPDATE*
> I created a script to generate data and confirm this problem.
> {code}
> #Initialization
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import HiveContext
> conf = SparkConf()
> conf.set('spark.cores.max', 15)
> conf.set('spark.executor.memory', '30g')
> conf.set('spark.driver.memory', '30g')
> sc = SparkContext(conf=conf)
> sqlctx = HiveContext(sc)
>
> #Data creation
> MAX_SIZE = 2**32 - 1
>
> path = '/mnt/mfs/parquet_nested'
>
> def create_sample_data(levels, rows, path):
>
>     def _create_column_data(cols):
>         import random
>         random.seed()
>         return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}
>
>     def _create_sample_df(cols, rows):
>         rdd = sc.parallelize(range(rows))
>         data = rdd.map(lambda r: _create_column_data(cols))
>         df = sqlctx.createDataFrame(data)
>         return df
>
>     def _create_nested_data(levels, rows):
>         if len(levels) == 1:
>             return _create_sample_df(levels[0], rows).cache()
>         else:
>             df = _create_nested_data(levels[1:], rows)
>             return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])
>
>     df = _create_nested_data(levels, rows)
>     df.write.mode('overwrite').parquet(path)
>
> #Sample data
> create_sample_data([2,10,200], 100, path)
> #Query
> df = sqlctx.read.parquet(path)
> %%timeit
> df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
> {code}
> Results
> Spark 1.6
> 1 loop, best of 3: *1min 5s* per loop
> Spark 2.0
> 1 loop, best of 3: *1min 21s* per loop



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13645) DAG Diagram not shown properly in Chrome

2016-07-04 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361807#comment-15361807
 ] 

Maciej Bryński commented on SPARK-13645:


I have the same problem.

> DAG Diagram not shown properly in Chrome
> 
>
> Key: SPARK-13645
> URL: https://issues.apache.org/jira/browse/SPARK-13645
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
> Environment: Chrome 49, 64-bit
>Reporter: Todd Leo
> Attachments: Slack for iOS Upload.jpg
>
>
> In my Chrome 49, the execution DAG diagram can't be shown properly. Only a 
> few grey dots lays there. Thought this is what I'm supposed to see at first. 
> It works fine in Firefox, though.
> See the attachment below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16371) IS NOT NULL clause gives false for nested column

2016-07-04 Thread JIRA
Maciej Bryński created SPARK-16371:
--

 Summary: IS NOT NULL clause gives false for nested column
 Key: SPARK-16371
 URL: https://issues.apache.org/jira/browse/SPARK-16371
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Maciej Bryński
Priority: Critical


I have a df where column1 is a struct type and there are 1M rows.
(sample data from https://issues.apache.org/jira/browse/SPARK-16320)

{code}
df.where("column1 is not null").count()
{code}
gives:
1M in Spark 1.6
*0* in Spark 2.0

Is there a change in IS NOT NULL behaviour in Spark 2.0?
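
For illustration, a self-contained spark-shell sketch of this kind of check (the case classes and the tiny in-memory dataset are made up for this example and are not the reporter's parquet data; a small dataset like this may well behave correctly):
{code}
// Illustrative in-memory check (not the reporter's 30 GB nested parquet data).
case class Inner(id: Long)
case class Outer(column1: Inner)

val df = spark.range(1000000).map(n => Outer(Inner(n))).toDF()

// Expected count: 1000000. Per this report, the equivalent count on the
// reporter's nested parquet data returns 1M on Spark 1.6 but 0 on Spark 2.0.0.
df.where("column1 is not null").count()
{code}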



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column

2016-07-04 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16371:
---
Summary: IS NOT NULL clause gives false for nested not empty column  (was: 
IS NOT NULL clause gives false for nested column)

> IS NOT NULL clause gives false for nested not empty column
> --
>
> Key: SPARK-16371
> URL: https://issues.apache.org/jira/browse/SPARK-16371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I have df where column1 is struct type and there is 1M rows.
> (sample data from https://issues.apache.org/jira/browse/SPARK-16320)
> {code}
> df.where("column1 is not null").count()
> {code}
> gives:
> 1M in Spark 1.6
> *0* in Spark 2.0
> Is there a change in IS NOT NULL behaviour in Spark 2.0 ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column

2016-07-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362225#comment-15362225
 ] 

Maciej Bryński commented on SPARK-16371:


I tried your example and it works.
Could you try mine?

> IS NOT NULL clause gives false for nested not empty column
> --
>
> Key: SPARK-16371
> URL: https://issues.apache.org/jira/browse/SPARK-16371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I have df where column1 is struct type and there is 1M rows.
> (sample data from https://issues.apache.org/jira/browse/SPARK-16320)
> {code}
> df.where("column1 is not null").count()
> {code}
> gives:
> 1M in Spark 1.6
> *0* in Spark 2.0
> Is there a change in IS NOT NULL behaviour in Spark 2.0 ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column

2016-07-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362225#comment-15362225
 ] 

Maciej Bryński edited comment on SPARK-16371 at 7/5/16 8:56 AM:


I tried your example and it works.
Could you try mine?

Maybe the problem appears only in some conditions?


was (Author: maver1ck):
I tried your example and it works.
Could you try mine?

> IS NOT NULL clause gives false for nested not empty column
> --
>
> Key: SPARK-16371
> URL: https://issues.apache.org/jira/browse/SPARK-16371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I have df where column1 is struct type and there is 1M rows.
> (sample data from https://issues.apache.org/jira/browse/SPARK-16320)
> {code}
> df.where("column1 is not null").count()
> {code}
> gives:
> 1M in Spark 1.6
> *0* in Spark 2.0
> Is there a change in IS NOT NULL behaviour in Spark 2.0 ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column

2016-07-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362225#comment-15362225
 ] 

Maciej Bryński edited comment on SPARK-16371 at 7/5/16 9:03 AM:


I tried your example and it works.
Could you try mine?

Maybe the problem appears only in some conditions like very wide schema?


was (Author: maver1ck):
I tried your example and it works.
Could you try mine?

Maybe the problem appears only in some conditions?

> IS NOT NULL clause gives false for nested not empty column
> --
>
> Key: SPARK-16371
> URL: https://issues.apache.org/jira/browse/SPARK-16371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I have df where column1 is struct type and there is 1M rows.
> (sample data from https://issues.apache.org/jira/browse/SPARK-16320)
> {code}
> df.where("column1 is not null").count()
> {code}
> gives:
> 1M in Spark 1.6
> *0* in Spark 2.0
> Is there a change in IS NOT NULL behaviour in Spark 2.0 ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16320) Spark 2.0 slower than 1.6 when querying nested columns

2016-07-05 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16320:
---
Description: 
I did some tests on a parquet file with many nested columns (about 30 GB in 400 partitions) and Spark 2.0 is sometimes slower.

I tested the following queries:
1) {code}select count(*) where id > some_id{code}
In this query the performance is similar (about 1 sec).

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance?

I don't know how to prepare sample data to show the problem.
Any ideas? Or public data with many nested columns?

*UPDATE*
I created a script to generate data and confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
from pyspark.sql.functions import struct
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop

  was:
I did some tests on a parquet file with many nested columns (about 30 GB in 400 partitions) and Spark 2.0 is sometimes slower.

I tested the following queries:
1) {code}select count(*) where id > some_id{code}
In this query the performance is similar (about 1 sec).

2) {code}select count(*) where nested_column.id > some_id{code}
Spark 1.6 -> 1.6 min
Spark 2.0 -> 2.1 min

Should I expect such a drop in performance?

I don't know how to prepare sample data to show the problem.
Any ideas? Or public data with many nested columns?

*UPDATE*
I created a script to generate data and confirm this problem.
{code}
#Initialization
from pyspark import SparkContext, SparkConf
from pyspark.sql import HiveContext
conf = SparkConf()
conf.set('spark.cores.max', 15)
conf.set('spark.executor.memory', '30g')
conf.set('spark.driver.memory', '30g')
sc = SparkContext(conf=conf)
sqlctx = HiveContext(sc)

#Data creation
MAX_SIZE = 2**32 - 1

path = '/mnt/mfs/parquet_nested'

def create_sample_data(levels, rows, path):

    def _create_column_data(cols):
        import random
        random.seed()
        return {"column{}".format(i): random.randint(0, MAX_SIZE) for i in range(cols)}

    def _create_sample_df(cols, rows):
        rdd = sc.parallelize(range(rows))
        data = rdd.map(lambda r: _create_column_data(cols))
        df = sqlctx.createDataFrame(data)
        return df

    def _create_nested_data(levels, rows):
        if len(levels) == 1:
            return _create_sample_df(levels[0], rows).cache()
        else:
            df = _create_nested_data(levels[1:], rows)
            return df.select([struct(df.columns).alias("column{}".format(i)) for i in range(levels[0])])

    df = _create_nested_data(levels, rows)
    df.write.mode('overwrite').parquet(path)

#Sample data
create_sample_data([2,10,200], 100, path)

#Query
df = sqlctx.read.parquet(path)

%%timeit
df.where("column1.column5.column50 > {}".format(int(MAX_SIZE / 2))).count()
{code}

Results
Spark 1.6
1 loop, best of 3: *1min 5s* per loop
Spark 2.0
1 loop, best of 3: *1min 21s* per loop


> Spark 2.0 slower than 1.6 when querying nested columns
> --
>
> Key: SPARK-16320
> URL: https://issues.apache.org/jira/browse/SPARK-16320
> Project: Spark
>  Issue Type: 

[jira] [Commented] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column

2016-07-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362305#comment-15362305
 ] 

Maciej Bryński commented on SPARK-16371:


I forgot to add the import.
I repair this.

> IS NOT NULL clause gives false for nested not empty column
> --
>
> Key: SPARK-16371
> URL: https://issues.apache.org/jira/browse/SPARK-16371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I have df where column1 is struct type and there is 1M rows.
> (sample data from https://issues.apache.org/jira/browse/SPARK-16320)
> {code}
> df.where("column1 is not null").count()
> {code}
> gives:
> 1M in Spark 1.6
> *0* in Spark 2.0
> Is there a change in IS NOT NULL behaviour in Spark 2.0 ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16371) IS NOT NULL clause gives false for nested not empty column

2016-07-05 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362305#comment-15362305
 ] 

Maciej Bryński edited comment on SPARK-16371 at 7/5/16 2:18 PM:


I forgot to add the import.
I repaired this.


was (Author: maver1ck):
I forgot to add the import.
I repair this.

> IS NOT NULL clause gives false for nested not empty column
> --
>
> Key: SPARK-16371
> URL: https://issues.apache.org/jira/browse/SPARK-16371
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Priority: Critical
>
> I have df where column1 is struct type and there is 1M rows.
> (sample data from https://issues.apache.org/jira/browse/SPARK-16320)
> {code}
> df.where("column1 is not null").count()
> {code}
> gives:
> 1M in Spark 1.6
> *0* in Spark 2.0
> Is there a change in IS NOT NULL behaviour in Spark 2.0 ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16416) Logging in shutdown hook does not work properly with Log4j 2.x

2016-07-07 Thread JIRA
Mikael Ståldal created SPARK-16416:
--

 Summary: Logging in shutdown hook does not work properly with 
Log4j 2.x
 Key: SPARK-16416
 URL: https://issues.apache.org/jira/browse/SPARK-16416
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.6.2
Reporter: Mikael Ståldal


Spark registers some shutdown hooks, and they log messages during shutdown:

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L58

Since the {{Logging}} trait creates SLF4J loggers lazily:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L47

a SLF4J logger is created during the execution of the shutdown hook.

This does not work when Log4j 2.x is used as SLF4J implementation:
https://issues.apache.org/jira/browse/LOG4J2-1222

Even though Log4j 2.6 handles this more gracefully than before, it still does 
emit a warning and will not be able to process the log message properly.

Proposed solution: make sure to eagerly create the SLF4J logger to be used in 
shutdown hooks when registering the hook.
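
For illustration, a minimal standalone sketch of that idea (plain SLF4J, not Spark's actual {{ShutdownHookManager}} code; the object and method names here are made up):
{code}
import org.slf4j.{Logger, LoggerFactory}

object EagerShutdownLogging {
  // Create the logger eagerly, when this object is initialized, instead of
  // lazily inside the hook, so no logger has to be constructed while the JVM
  // is already shutting down.
  private val log: Logger = LoggerFactory.getLogger(getClass)

  def register(): Unit = {
    // Touching the logger before the hook is registered is the whole point.
    log.debug("Registering shutdown hook")
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      override def run(): Unit = log.info("Shutdown hook invoked")
    }))
  }
}
{code}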



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16416) Logging in shutdown hook does not work properly with Log4j 2.x

2016-07-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367442#comment-15367442
 ] 

Mikael Ståldal commented on SPARK-16416:


Maybe just add a log statement in {{ShutdownHookManager}} before registering 
any hook to force the logger to be created:

{code}
logDebug("Registering shutdown hook")
{code}

can be added here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L56


> Logging in shutdown hook does not work properly with Log4j 2.x
> --
>
> Key: SPARK-16416
> URL: https://issues.apache.org/jira/browse/SPARK-16416
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Mikael Ståldal
>Priority: Minor
>
> Spark registers some shutdown hooks, and they log messages during shutdown:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L58
> Since the {{Logging}} trait creates SLF4J loggers lazily:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L47
> a SLF4J logger is created during the execution of the shutdown hook.
> This does not work when Log4j 2.x is used as SLF4J implementation:
> https://issues.apache.org/jira/browse/LOG4J2-1222
> Even though Log4j 2.6 handles this more gracefully than before, it still does 
> emit a warning and will not be able to process the log message properly.
> Proposed solution: make sure to eagerly create the SLF4J logger to be used in 
> shutdown hooks when registering the hook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16416) Logging in shutdown hook does not work properly with Log4j 2.x

2016-07-08 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15367442#comment-15367442
 ] 

Mikael Ståldal edited comment on SPARK-16416 at 7/8/16 9:26 AM:


Maybe just add a log statement in {{ShutdownHookManager}} before adding any 
hook to force the logger to be created:

{code}
logDebug("Adding shutdown hook")
{code}

can be added here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L56



was (Author: mikaelstaldal):
Maybe just add a log statement in {{ShutdownHookManager}} before registering 
any hook to force the logger to be created:

{code}
logDebug("Registering shutdown hook")
{code}

can be added here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L56


> Logging in shutdown hook does not work properly with Log4j 2.x
> --
>
> Key: SPARK-16416
> URL: https://issues.apache.org/jira/browse/SPARK-16416
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2
>Reporter: Mikael Ståldal
>Priority: Minor
>
> Spark registers some shutdown hooks, and they log messages during shutdown:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ShutdownHookManager.scala#L58
> Since the {{Logging}} trait creates SLF4J loggers lazily:
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/Logging.scala#L47
> a SLF4J logger is created during the execution of the shutdown hook.
> This does not work when Log4j 2.x is used as SLF4J implementation:
> https://issues.apache.org/jira/browse/LOG4J2-1222
> Even though Log4j 2.6 handles this more gracefully than before, it still does 
> emit a warning and will not be able to process the log message properly.
> Proposed solution: make sure to eagerly create the SLF4J logger to be used in 
> shutdown hooks when registering the hook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16439) Incorrect information on SQL Query details

2016-07-08 Thread JIRA
Maciej Bryński created SPARK-16439:
--

 Summary: Incorrect information on SQL Query details
 Key: SPARK-16439
 URL: https://issues.apache.org/jira/browse/SPARK-16439
 Project: Spark
  Issue Type: Bug
  Components: SQL, Web UI
Affects Versions: 2.0.0
Reporter: Maciej Bryński


One picture is worth a thousand words.

Please see attachment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16439) Incorrect information on SQL Query details

2016-07-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16439:
---
Attachment: spark.jpg

> Incorrect information on SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16439) Incorrect information on SQL Query details

2016-07-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16439:
---
Description: 
One picture is worth a thousand words.

Please see attachment

Incorrect values are in fields:
* data size
* number of output rows
* time to collect

  was:
One picture is worth a thousand words.

Please see attachment


> Incorrect information on SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16439) Incorrect information in SQL Query details

2016-07-08 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16439:
---
Summary: Incorrect information in SQL Query details  (was: Incorrect 
information on SQL Query details)

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370435#comment-15370435
 ] 

Maciej Bryński commented on SPARK-16439:


I think that the problem exists when the value is greater than 1024.
Could you repeat your test?

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370463#comment-15370463
 ] 

Maciej Bryński commented on SPARK-16439:


I'll try to prepare something.
This was a screenshot from testing my production env on Spark 2.0, so I can't share the data.

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370555#comment-15370555
 ] 

Maciej Bryński commented on SPARK-16439:


OK. Got this.
Using spark-shell

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.jpg

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16439:
---
Attachment: sample.png

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370555#comment-15370555
 ] 

Maciej Bryński edited comment on SPARK-16439 at 7/11/16 11:02 AM:
--

OK. Got this.
Using spark-shell

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.png


was (Author: maver1ck):
OK. Got this.
Using spark-shell

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.jpg

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370555#comment-15370555
 ] 

Maciej Bryński edited comment on SPARK-16439 at 7/11/16 11:04 AM:
--

OK. Got this.
Using spark-shell

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.png

Tested with:
* Firefox 47
* Chrome 51



was (Author: maver1ck):
OK. Got this.
Using spark-shell

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.png

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370555#comment-15370555
 ] 

Maciej Bryński edited comment on SPARK-16439 at 7/11/16 11:04 AM:
--

OK. Got this.
Using spark-shell (Spark deployed with YARN)

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.png

Tested with:
* Firefox 47
* Chrome 51



was (Author: maver1ck):
OK. Got this.
Using spark-shell

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.png

Tested with:
* Firefox 47
* Chrome 51


> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370555#comment-15370555
 ] 

Maciej Bryński edited comment on SPARK-16439 at 7/11/16 11:19 AM:
--

OK. Got this.
Using spark-shell (Spark deployed with YARN)

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.png

Tested with:
* Firefox 47
* Chrome 51

The problem is that when a number is greater than 1000 we put the special unicode character U+00A0 between the thousands and the hundreds.
So the displayed number contains a U+00A0 in the middle of it.


was (Author: maver1ck):
OK. Got this.
Using spark-shell (Spark deployed with YARN)

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.png

Tested with:
* Firefox 47
* Chrome 51


> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370555#comment-15370555
 ] 

Maciej Bryński edited comment on SPARK-16439 at 7/11/16 11:21 AM:
--

OK. Got this.
Using spark-shell (Spark deployed with YARN)

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.png

Tested with:
* Firefox 47
* Chrome 51

The problem is that when a number is greater than 1000 we put the special unicode character U+00A0 (non-breaking space) between the thousands and the hundreds.
So the displayed number contains a U+00A0 in the middle of it.


was (Author: maver1ck):
OK. Got this.
Using spark-shell (Spark deployed with YARN)

{code}
case class Left(a: Long)
case class Right(b: Long)
val df_left = spark.range(1).map(num => Left(num))
val df_right = spark.range(1).map(num => Right(num))
df_left.registerTempTable("l")
df_right.registerTempTable("r")
spark.sql("select count(*) from l join r on l.a = r.b").collect()
{code}

Screenshot from SQL Query details attached as sample.png

Tested with:
* Firefox 47
* Chrome 51

The problem is that when a number is greater than 1000 we put the special unicode character U+00A0 between the thousands and the hundreds.
So the displayed number contains a U+00A0 in the middle of it.

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16478) strongly connected components doesn't cache returned RDD

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370595#comment-15370595
 ] 

Michał Wesołowski commented on SPARK-16478:
---

If you run the code that I provided on Databricks you can see that, without materializing the graph that is returned, a simple count on its vertices takes about 20 minutes, whereas strongly connected components itself runs in 2 minutes.
I tried to use it on some real data and I wasn't able to save the result because of this. After materializing the graph within every iteration I can save the results with no problem. Materializing only within the outer loop caused less severe problems but wasn't sufficient.
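
For illustration, a caller-side sketch of persisting and materializing the returned graph before taking further actions on it (GraphX in spark-shell; the tiny example graph is made up, only the persist/materialize steps matter):
{code}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.storage.StorageLevel

// Tiny illustrative graph: two separate 2-cycles.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(2L, 1L, 1),
  Edge(3L, 4L, 1), Edge(4L, 3L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

val scc = graph.stronglyConnectedComponents(numIter = 5)

// Workaround: persist the *returned* graph and materialize it once, so that
// later actions (counts, saves) don't recompute the whole SCC lineage.
scc.persist(StorageLevel.MEMORY_AND_DISK)
scc.vertices.count()
{code}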

> strongly connected components doesn't cache returned RDD
> 
>
> Key: SPARK-16478
>     URL: https://issues.apache.org/jira/browse/SPARK-16478
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.2
>Reporter: Michał Wesołowski
>
> The Strongly Connected Components algorithm caches intermediary RDDs but doesn't cache the one that is going to be returned. With a graph large enough relative to the available memory, when one tries to take an action on the returned RDD the whole RDD has to be computed from scratch, which takes much more time than StronglyConnectedComponents alone.
> I managed to replicate the issue on the Databricks platform.
> [Here|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html] is the notebook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16478) strongly connected components doesn't cache returned RDD

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370595#comment-15370595
 ] 

Michał Wesołowski edited comment on SPARK-16478 at 7/11/16 11:47 AM:
-

If you run the code that I provided on Databricks you can see that, without materializing the graph that is returned, a simple count on its vertices takes about 20 minutes, whereas strongly connected components itself runs in 2 minutes.
I tried to use it on some real data and I wasn't able to save the result because of this. After materializing the graph within every iteration I can save the results with no problem. Materializing only within the outer loop caused less severe problems but wasn't sufficient.

In the original implementation there are lots of RDDs cached and immediately materialized. Some of them are removed before scc returns, due to the LRU fashion in which Spark operates, but the returned RDDs are not materialized and depend on the ones already removed from RAM. That is my current understanding of the observed behavior.


was (Author: wesolows):
If you run the code that I provided on Databricks you can see that, without materializing the graph that is returned, a simple count on its vertices takes about 20 minutes, whereas strongly connected components itself runs in 2 minutes.
I tried to use it on some real data and I wasn't able to save the result because of this. After materializing the graph within every iteration I can save the results with no problem. Materializing only within the outer loop caused less severe problems but wasn't sufficient.

> strongly connected components doesn't cache returned RDD
> 
>
> Key: SPARK-16478
>     URL: https://issues.apache.org/jira/browse/SPARK-16478
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.2
>Reporter: Michał Wesołowski
>
> The Strongly Connected Components algorithm caches intermediary RDDs but doesn't cache the one that is going to be returned. With a graph large enough relative to the available memory, when one tries to take an action on the returned RDD the whole RDD has to be computed from scratch, which takes much more time than StronglyConnectedComponents alone.
> I managed to replicate the issue on the Databricks platform.
> [Here|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html] is the notebook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16478) strongly connected components doesn't cache returned RDD

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370595#comment-15370595
 ] 

Michał Wesołowski edited comment on SPARK-16478 at 7/11/16 11:49 AM:
-

If you run the code that I provided on Databricks you can see that, without materializing the graph that is returned, a simple count on its vertices takes about 20 minutes, whereas strongly connected components itself runs in 2 minutes.
I tried to use it on some real data and I wasn't able to save the result because of this. After materializing the graph within every iteration I can save the results with no problem. Materializing only within the outer loop caused less severe problems but wasn't sufficient.

In the original implementation there are lots of RDDs cached and immediately materialized. Some of them are removed before scc returns, due to the LRU fashion in which Spark operates, but the returned RDDs are not materialized and depend on the ones already removed from RAM. That is my current understanding of the observed behavior.


was (Author: wesolows):
If you run the code that I provided on Databricks you can see that, without materializing the graph that is returned, a simple count on its vertices takes about 20 minutes, whereas strongly connected components itself runs in 2 minutes.
I tried to use it on some real data and I wasn't able to save the result because of this. After materializing the graph within every iteration I can save the results with no problem. Materializing only within the outer loop caused less severe problems but wasn't sufficient.

In the original implementation there are lots of RDDs cached and immediately materialized. Some of them are removed before scc returns, due to the LRU fashion in which Spark operates, but the returned RDDs are not materialized and depend on the ones already removed from RAM. That is my current understanding of the observed behavior.

> strongly connected components doesn't cache returned RDD
> 
>
> Key: SPARK-16478
> URL: https://issues.apache.org/jira/browse/SPARK-16478
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.6.2
>Reporter: Michał Wesołowski
>
> The Strongly Connected Components algorithm caches intermediate RDDs but doesn't 
> cache the one that is going to be returned. With a graph that is large relative to 
> the available memory, when one tries to take an action on the returned RDD, the 
> whole RDD has to be computed from scratch, which takes much more time than 
> StronglyConnectedComponents alone. 
> I managed to replicate the issue on the Databricks platform. 
> [Here|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/4889410027417133/3634650767364730/3117184429335832/latest.html]
>  is the notebook. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15370647#comment-15370647
 ] 

Maciej Bryński commented on SPARK-16439:


[~proflin]
I don't know if waiting is the right decision if we want this to be merged into 2.0.0.

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see the attachment.
> Incorrect values are in the following fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details

2016-07-11 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371780#comment-15371780
 ] 

Maciej Bryński commented on SPARK-16439:


I found that the problem is locale-dependent.

The \u00A0 (non-breaking space) character is added by the Java NumberFormat class.
It can be avoided by calling NumberFormat.setGroupingUsed(false).
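A minimal sketch of the behaviour (assuming a locale such as pl-PL, whose grouping separator is a non-breaking space):
{code}
import java.text.NumberFormat
import java.util.Locale

// Locales such as pl-PL use \u00A0 as the grouping separator.
val nf = NumberFormat.getInstance(Locale.forLanguageTag("pl-PL"))
println(nf.format(1234567))   // "1 234 567" - the separators are \u00A0 characters

// Disabling grouping removes the separator entirely.
nf.setGroupingUsed(false)
println(nf.format(1234567))   // "1234567"
{code}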

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see the attachment.
> Incorrect values are in the following fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13645) DAG Diagram not shown properly in Chrome

2016-07-14 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-13645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15377022#comment-15377022
 ] 

Maciej Bryński commented on SPARK-13645:


I think that problem is resolved in 2.0.0

> DAG Diagram not shown properly in Chrome
> 
>
> Key: SPARK-13645
> URL: https://issues.apache.org/jira/browse/SPARK-13645
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
> Environment: Chrome 49, 64-bit
>Reporter: Todd Leo
> Attachments: Slack for iOS Upload.jpg
>
>
> In my Chrome 49, the execution DAG diagram isn't shown properly. Only a 
> few grey dots are displayed. At first I thought this was what I was supposed to see. 
> It works fine in Firefox, though.
> See the attachment below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14234) Executor crashes for TaskRunner thread interruption

2016-07-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-14234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15379041#comment-15379041
 ] 

Josef Lindman Hörnlund commented on SPARK-14234:


+1 for backporting
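For context, a minimal sketch (plain JVM code, not Spark internals) of the underlying failure mode: a thread whose interrupt flag is already set fails like this as soon as it writes through an interruptible NIO channel, which is what the statusUpdate serialization path quoted below does:
{code}
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer
import java.nio.channels.{Channels, ClosedByInterruptException}

// Channels.newChannel wraps the stream in an interruptible channel, the same kind
// of channel the serialized status update is written through.
val channel = Channels.newChannel(new ByteArrayOutputStream())

// Simulate the task-kill interrupt arriving before the status update is written.
Thread.currentThread().interrupt()

try {
  channel.write(ByteBuffer.wrap(Array[Byte](1, 2, 3)))
} catch {
  case e: ClosedByInterruptException =>
    println(s"write failed: $e") // mirrors the exception in the stack trace below
}
{code}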

> Executor crashes for TaskRunner thread interruption
> ---
>
> Key: SPARK-14234
> URL: https://issues.apache.org/jira/browse/SPARK-14234
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Devaraj K
>Assignee: Devaraj K
> Fix For: 2.0.0
>
>
> If the TaskRunner thread gets interrupted while running, due to a task kill or 
> any other reason, the interrupted thread will try to update the task status 
> as part of the exception handling and fails with the exception below. This 
> happens in the statusUpdate calls of all of these catch blocks; the 
> corresponding exceptions for each catch case follow.
> {code:title=Executor.scala|borderStyle=solid}
> case _: TaskKilledException | _: InterruptedException if task.killed 
> =>
>  ..
> case cDE: CommitDeniedException =>
>  ..
> case t: Throwable =>
>  ..
> {code}
> {code:xml}
> 16/03/29 17:32:33 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor task launch worker-2,5,main]
> java.lang.Error: java.nio.channels.ClosedByInterruptException
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1151)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Caused by: java.nio.channels.ClosedByInterruptException
>   at 
> java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
>   at 
> java.nio.channels.Channels$WritableByteChannelImpl.write(Channels.java:460)
>   at 
> org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:49)
>   at 
> org.apache.spark.util.SerializableBuffer$$anonfun$writeObject$1.apply(SerializableBuffer.scala:47)
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1204)
>   at 
> org.apache.spark.util.SerializableBuffer.writeObject(SerializableBuffer.scala:47)
>   at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
>   at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:43)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEnv.serialize(NettyRpcEnv.scala:253)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:513)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.statusUpdate(CoarseGrainedExecutorBackend.scala:135)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   ... 2 more
> {code}
> {code:xml}
> 16/03/29 08:00:29 ERROR SparkUncaughtExceptionHandler: Uncaught exception in 
> thread Thread[Executor

[jira] [Created] (SPARK-16569) Use Cython in Pyspark internals

2016-07-15 Thread JIRA
Maciej Bryński created SPARK-16569:
--

 Summary: Use Cython in Pyspark internals
 Key: SPARK-16569
 URL: https://issues.apache.org/jira/browse/SPARK-16569
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.6.2, 2.0.0
Reporter: Maciej Bryński
Priority: Minor


CC: [~davies]

Many operations I do are like:
{code}
dataframe.rdd.map(some_function)
{code}
In PySpark this means creating a Row object for every record, which is slow.

IDEA:
Use Cython to speed up Pyspark internals

Sample profile:
{code}

Profile of RDD

 2000373036 function calls (2000312850 primitive calls) in 2045.307 
seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    14948  427.117    0.029 1811.622    0.121 {built-in method loads}
    19992  402.086    0.000  937.045    0.000 types.py:1162(_create_row)
    19992  262.708    0.000  262.708    0.000 {built-in method __new__ of type object at 0x9d1c40}
    19992  190.908    0.000 1219.794    0.000 types.py:558(fromInternal)
    19992  153.611    0.000  153.611    0.000 types.py:1280(__setattr__)
199920197  145.022    0.000 2024.126    0.000 rdd.py:1004()
    19992  118.640    0.000  381.348    0.000 types.py:1194(__new__)
    19992  101.272    0.000 1321.067    0.000 types.py:1159()
200189064   91.928    0.000   91.928    0.000 {built-in method isinstance}
    19992   61.608    0.000   61.608    0.000 types.py:1158(_create_row_inbound_converter)
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16569) Use Cython in Pyspark internals

2016-07-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16569:
---
Description: 
CC: [~davies]

Many operations I do are like:
{code}
dataframe.rdd.map(some_function)
{code}
In PySpark this means creating a Row object for every record, which is slow.

IDEA:
Use Cython to speed up Pyspark internals
What do you think?

Sample profile:
{code}

Profile of RDD

 2000373036 function calls (2000312850 primitive calls) in 2045.307 
seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    14948  427.117    0.029 1811.622    0.121 {built-in method loads}
    19992  402.086    0.000  937.045    0.000 types.py:1162(_create_row)
    19992  262.708    0.000  262.708    0.000 {built-in method __new__ of type object at 0x9d1c40}
    19992  190.908    0.000 1219.794    0.000 types.py:558(fromInternal)
    19992  153.611    0.000  153.611    0.000 types.py:1280(__setattr__)
199920197  145.022    0.000 2024.126    0.000 rdd.py:1004()
    19992  118.640    0.000  381.348    0.000 types.py:1194(__new__)
    19992  101.272    0.000 1321.067    0.000 types.py:1159()
200189064   91.928    0.000   91.928    0.000 {built-in method isinstance}
    19992   61.608    0.000   61.608    0.000 types.py:1158(_create_row_inbound_converter)
{code}


  was:
CC: [~davies]

Many operations I do are like:
{code}
dataframe.rdd.map(some_function)
{code}
In Pyspark this mean creating Row object for every record and this is slow.

IDEA:
Use Cython to speed up Pyspark internals

Sample profile:
{code}

Profile of RDD

 2000373036 function calls (2000312850 primitive calls) in 2045.307 
seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    14948  427.117    0.029 1811.622    0.121 {built-in method loads}
    19992  402.086    0.000  937.045    0.000 types.py:1162(_create_row)
    19992  262.708    0.000  262.708    0.000 {built-in method __new__ of type object at 0x9d1c40}
    19992  190.908    0.000 1219.794    0.000 types.py:558(fromInternal)
    19992  153.611    0.000  153.611    0.000 types.py:1280(__setattr__)
199920197  145.022    0.000 2024.126    0.000 rdd.py:1004()
    19992  118.640    0.000  381.348    0.000 types.py:1194(__new__)
    19992  101.272    0.000 1321.067    0.000 types.py:1159()
200189064   91.928    0.000   91.928    0.000 {built-in method isinstance}
    19992   61.608    0.000   61.608    0.000 types.py:1158(_create_row_inbound_converter)
{code}



> Use Cython in Pyspark internals
> ---
>
> Key: SPARK-16569
> URL: https://issues.apache.org/jira/browse/SPARK-16569
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> CC: [~davies]
> Many operations I do are like:
> {code}
> dataframe.rdd.map(some_function)
> {code}
> In PySpark this means creating a Row object for every record, which is slow.
> IDEA:
> Use Cython to speed up Pyspark internals
> What do you think?
> Sample profile:
> {code}
> 
> Profile of RDD
> 
>  2000373036 function calls (2000312850 primitive calls) in 2045.307 
> seconds
>Ordered by: internal time, cumulative time
>ncalls  tottime  percall  cumtime  percall filename:lineno(function)
> 14948  427.117    0.029 1811.622    0.121 {built-in method loads}
> 19992  402.086    0.000  937.045    0.000 types.py:1162(_create_row)
> 19992  262.708    0.000  262.708    0.000 {built-in method __new__ of type object at 0x9d1c40}
> 19992  190.908    0.000 1219.794    0.000 types.py:558(fromInternal)
> 19992  153.611    0.000  153.611    0.000 types.py:1280(__setattr__)
> 199920197  145.022    0.000 2024.126    0.000 rdd.py:1004()
> 19992  118.640    0.000  381.348    0.000 types.py:1194(__new__)
> 19992  101.272    0.000 1321.067    0.000 types.py:1159()
> 200189064   91.928    0.000   91.928    0.000 {built-in method isinstance}
> 19992   61.608    0.000   61.608    0.000 types.py:1158(_create_row_inbound_converter)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16569) Use Cython to speed up Pyspark internals

2016-07-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-16569:
---
Summary: Use Cython to speed up Pyspark internals  (was: Use Cython in 
Pyspark internals)

> Use Cython to speed up Pyspark internals
> 
>
> Key: SPARK-16569
> URL: https://issues.apache.org/jira/browse/SPARK-16569
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> CC: [~davies]
> Many operations I do are like:
> {code}
> dataframe.rdd.map(some_function)
> {code}
> In PySpark this means creating a Row object for every record, which is slow.
> IDEA:
> Use Cython to speed up Pyspark internals
> What do you think?
> Sample profile:
> {code}
> 
> Profile of RDD
> 
>  2000373036 function calls (2000312850 primitive calls) in 2045.307 
> seconds
>Ordered by: internal time, cumulative time
>ncalls  tottime  percall  cumtime  percall filename:lineno(function)
> 14948  427.117    0.029 1811.622    0.121 {built-in method loads}
> 19992  402.086    0.000  937.045    0.000 types.py:1162(_create_row)
> 19992  262.708    0.000  262.708    0.000 {built-in method __new__ of type object at 0x9d1c40}
> 19992  190.908    0.000 1219.794    0.000 types.py:558(fromInternal)
> 19992  153.611    0.000  153.611    0.000 types.py:1280(__setattr__)
> 199920197  145.022    0.000 2024.126    0.000 rdd.py:1004()
> 19992  118.640    0.000  381.348    0.000 types.py:1194(__new__)
> 19992  101.272    0.000 1321.067    0.000 types.py:1159()
> 200189064   91.928    0.000   91.928    0.000 {built-in method isinstance}
> 19992   61.608    0.000   61.608    0.000 types.py:1158(_create_row_inbound_converter)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16571) DataFrame repartition leads to unexpected error during shuffle

2016-07-15 Thread JIRA
Björn-Elmar Macek created SPARK-16571:
-

 Summary: DataFrame repartition leads to unexpected error during 
shuffle
 Key: SPARK-16571
 URL: https://issues.apache.org/jira/browse/SPARK-16571
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.6.1
Reporter: Björn-Elmar Macek


When executing the following code, an exception is thrown.

{code}
val finalProbabilityProxiesDF = 
sqlc.sql(app2AgeNormalissationQuery(transformedAppsCol, bucketedAgeCol, 
probProxyCol, normFacCol, ageFeatureRawTable, normFactorsTableName, 
ageApp2AgeProxyTableName)).repartition(10)


//sort the stats
val finalFeatMap = finalProbabilityProxiesDF.select(transformedAppsCol, 
probProxyCol).map{ row =>
  val probs = 
row.getAs[mutable.WrappedArray[util.ArrayList[Double]]](1).map(array => 
(array.get(0),array.get(1))).toArray
  val bucketsExist = probs.map(_._1)
  val allBuckets = ageCol match {
case "label" => (0 to ageSplits.size - 1).map(_.toDouble)
case "age" => (1 to ageSplits.size - 2).map(_.toDouble)
  }

  val missingBuckets = allBuckets.diff(bucketsExist).map{(_, 0.0)}
  val fixedProbs = probs ++ missingBuckets
  val filteredFixedProbs = ageCol match {
case "label" => fixedProbs
case "age" => fixedProbs.filter(_._1 != 0.0).map(elem => (elem._1 - 
1.0, elem._2))
  }

  val sortedProbs = filteredFixedProbs.sortWith( _._1 < _._1 )
  (row.getInt(0), sortedProbs)
}
{code}

The stacktrace shows:
{code}
java.lang.ClassCastException: java.util.ArrayList cannot be cast to 
java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:114)
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getDouble(GenericArrayData.scala:53)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

If I remove the repartition from finalProbabilityProxiesDF, the code runs 
without problems.

I am unsure about the reasons, though. This should not happen, should it?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16571) DataFrame repartition leads to unexpected error during shuffle

2016-07-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-16571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Björn-Elmar Macek updated SPARK-16571:
--
Description: 
When executing the following code, an exception is thrown.

{code}
val finalProbabilityProxiesDF = 
sqlc.sql(app2AgeNormalissationQuery(transformedAppsCol, bucketedAgeCol, 
probProxyCol, normFacCol, ageFeatureRawTable, normFactorsTableName, 
ageApp2AgeProxyTableName)).repartition(10)


//sort the stats
val finalFeatMap = finalProbabilityProxiesDF.select(transformedAppsCol, 
probProxyCol).map{ row =>
  val probs = 
row.getAs[mutable.WrappedArray[util.ArrayList[Double]]](1).map(array => 
(array.get(0),array.get(1))).toArray
  val bucketsExist = probs.map(_._1)
  val allBuckets = ageCol match {
case "label" => (0 to ageSplits.size - 1).map(_.toDouble)
case "age" => (1 to ageSplits.size - 2).map(_.toDouble)
  }

  val missingBuckets = allBuckets.diff(bucketsExist).map{(_, 0.0)}
  val fixedProbs = probs ++ missingBuckets
  val filteredFixedProbs = ageCol match {
case "label" => fixedProbs
case "age" => fixedProbs.filter(_._1 != 0.0).map(elem => (elem._1 - 
1.0, elem._2))
  }

  val sortedProbs = filteredFixedProbs.sortWith( _._1 < _._1 )
  (row.getInt(0), sortedProbs)
}
{code}

The stacktrace shows:
{code}
java.lang.ClassCastException: java.util.ArrayList cannot be cast to 
java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:114)
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getDouble(GenericArrayData.scala:53)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

If I remove the repartition from finalProbabilityProxiesDF, the code runs 
without problems.

I am unsure about the reasons, though. This should not happen, should it?

  was:
When executing the following code, an exception is thrown.

{code}
val finalProbabilityProxiesDF = 
sqlc.sql(app2AgeNormalissationQuery(transformedAppsCol, bucketedAgeCol, 
probProxyCol, normFacCol, ageFeatureRawTable, normFactorsTableName, 
ageApp2AgeProxyTableName)).repartition(10)


//sort the stats
val finalFeatMap = finalProbabilityProxiesDF.select(transformedAppsCol, 
probProxyCol).map{ row =>
  val probs = 
row.getAs[mutable.WrappedArray[util.ArrayList[Double]]](1).map(array => 
(array.get(0),array.get(1))).toArray
  val bucketsExist = probs.map(_._1)
  val allBuckets = ageCol match {
case "label" => (0 to ageSplits.size - 1).map(_.toDouble)
case "age" => (1 to ageSplits.size - 2).map(_.toDouble)
  }

  val missingBuckets = allBuckets.diff(bucketsExist).map{(_, 0.0)}
  val fixedProbs = probs ++ missingBuckets
  val filteredFixedProbs = ageCol match {
case "label" => fixedProbs
case "age" => fixedProbs.filter(_._1 != 0.0).map(elem => (elem._1 - 
1.0, elem._2))
  }

  val sortedProbs = filteredFixedProbs.sortWith( _._1 < _._1 )
  (row.getInt(0), sortedProbs)
}
{/code}

The stacktrace shows:
{code}
java.lang.ClassCastException: java.util.ArrayList cannot be cast to 
java.lang.Double
at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:114)
at 
org.apache.spark.sql.catalyst.util.GenericArrayData.getDouble(GenericArrayData.scala:53)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMe

[jira] [Commented] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2016-07-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380653#comment-15380653
 ] 

Michał Wesołowski commented on SPARK-5484:
--

[~ankurd] do you plan to prepare another solution? I saw that your pull 
request was closed, but it did solve the problem and was the best option at the 
time. The only thing I found lacking was that the checkpoint directory was not 
cleaned up, and I would change the checkpoint interval to 35 iterations since that 
worked better during my tests.
The more general solution proposed within Spark-5561 needs a change within 
PeriodicRDDCheckpointer and PeriodicGraphCheckpointer - either moving them out of 
mllib or changing their access modifier. 
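For reference, a minimal sketch of the kind of periodic checkpointing being discussed (the driver loop, `step` function, and interval are hypothetical, not the actual Pregel patch):
{code}
import org.apache.spark.graphx.Graph

// Checkpoint the working graph every `checkpointInterval` iterations so the
// lineage chain is truncated before it can cause a StackOverflowError.
def runIterations[VD, ED](initial: Graph[VD, ED],
                          step: Graph[VD, ED] => Graph[VD, ED],
                          maxIter: Int,
                          checkpointInterval: Int = 35): Graph[VD, ED] = {
  var g = initial
  for (i <- 1 to maxIter) {
    g = step(g).cache()
    if (i % checkpointInterval == 0) {
      g.checkpoint()      // requires sc.setCheckpointDir(...) to have been called
      g.vertices.count()  // force the checkpoint to materialize
      g.edges.count()
    }
  }
  g
}
{code}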

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-5484) Pregel should checkpoint periodically to avoid StackOverflowError

2016-07-16 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-5484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15380653#comment-15380653
 ] 

Michał Wesołowski edited comment on SPARK-5484 at 7/16/16 9:33 AM:
---

[~ankurd] do you plan to prepare another solution? I saw that your pull 
request was closed, but it did solve the problem and was the best option at the 
time. The only thing I found lacking was that the checkpoint directory was not 
cleaned up, and I would change the checkpoint interval to 35 iterations since that 
worked better during my tests.
The more general solution proposed within [Spark-5561] needs a change within 
PeriodicRDDCheckpointer and PeriodicGraphCheckpointer - either moving them out of 
mllib or changing their access modifier. 


was (Author: wesolows):
[~ankurd] do you plan to prepare another solution? I could see your pull 
request was closed, but did solve problem and was best thing at a time. The 
only thing I could see lacking was no ckeckpoint directory cleaning, and I 
guess I would change checkpoint iterations to 35 since it worked better during 
my tests.
More general solution proposed within Spark-5561 needs a change withing 
PeriodicRDDCheckpointer and PeriodicGraphCheckpointer - either to move it from 
mllib or change access modifier. 

> Pregel should checkpoint periodically to avoid StackOverflowError
> -
>
> Key: SPARK-5484
> URL: https://issues.apache.org/jira/browse/SPARK-5484
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> Pregel-based iterative algorithms with more than ~50 iterations begin to slow 
> down and eventually fail with a StackOverflowError due to Spark's lack of 
> support for long lineage chains. Instead, Pregel should checkpoint the graph 
> periodically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


