Question Create External table location S3

2016-03-31 Thread Raymond Honderdors
Hi,

I pulled the latest version

git pull git://github.com/apache/spark.git

Compiled:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver 
-DskipTests clean package


now I am getting the following error:
Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
MetaException(message:Got exception: java.io.IOException No FileSystem for 
scheme: s3n) (state=,code=0)

Did anyone else experience this?


Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com
T +972.7325.3569
Herzliya




Re: Any documentation on Spark's security model beyond YARN?

2016-03-31 Thread Steve Loughran

> On 30 Mar 2016, at 21:02, Sean Busbey  wrote:
> 
> On Wed, Mar 30, 2016 at 4:33 AM, Steve Loughran  
> wrote:
>> 
>>> On 29 Mar 2016, at 22:19, Michael Segel  wrote:
>>> 
>>> Hi,
>>> 
>>> So yeah, I know that Spark jobs running on a Hadoop cluster will inherit 
>>> its security from the underlying YARN job.
>>> However… that’s not really saying much when you think about some use cases.
>>> 
>>> Like using the thrift service …
>>> 
>>> I’m wondering what else is new and what people have been thinking about how 
>>> to enhance spark’s security.
>>> 
>> 
>> Been thinking a bit.
>> 
>> One thing to look at is renewal of hbase and hive tokens on long-lived 
>> services, alongside hdfs
>> 
>> 
> 
> I've been looking at this as well. The current work-around I'm using
> is to use keytab logins on the executors, which is less than
> desirable.


OK, let's work together on this ... the current Spark renewal code assumes it's 
only for HDFS (indeed, that the filesystem is HDFS and therefore the # of 
tokens > 0); there's no fundamental reason why the code in 
YarnSparkHadoopUtils can't run in the AM too.

> 
> Since the HBase project maintains Spark integration points, it'd be
> great if there were just a hook for services to provide "here's how to
> renew" to a common renewal service.
> 

1. Wittenauer is doing some work on a tool for doing this; I'm pushing for it 
to be a fairly generic API. Even if Spark has to use reflection to get at it, 
at least it would be consistent across services. See 
https://issues.apache.org/jira/browse/HADOOP-12563

2. The topic of HTTPS-based acquisition/use of HDFS tokens has arisen 
elsewhere; it's needed for long-haul job submission when you don't have a keytab 
to hand. This could be useful as it'd avoid actually needing hbase-*.jar on the 
classpath at submit time.
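
The executor-side keytab workaround Sean describes boils down to a
UserGroupInformation login; a minimal Scala sketch, assuming the keytab has
already been shipped to the executor hosts (the principal and path below are
placeholders):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Placeholder principal and keytab path -- substitute real values.
val principal = "spark-svc@EXAMPLE.COM"
val keytab = "/etc/security/keytabs/spark-svc.keytab"

// Log this JVM in from the keytab; subsequent Hadoop/Hive/HBase client calls
// run as that principal and can re-login when the ticket expires.
UserGroupInformation.loginUserFromKeytab(principal, keytab)

// Or keep a separate UGI and run specific work as that principal.
val ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(principal, keytab)
ugi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    // e.g. open the HBase or Hive connection here
  }
})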


> 
> 
> -- 
> busbey
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
> 



Re: Question Create External table location S3

2016-03-31 Thread Steve Loughran

On 31 Mar 2016, at 10:00, Raymond Honderdors <raymond.honderd...@sizmek.com> wrote:

Hi,

I pulled the latest version

git pull git://github.com/apache/spark.git

Compiled:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver 
-DskipTests clean package


now I am getting the following error:
Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
MetaException(message:Got exception: java.io.IOException No FileSystem for 
scheme: s3n) (state=,code=0)

Did anyone else experience this?


Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com
T +972.7325.3569
Herzliya




You are going to need to add the hadoop-aws JAR to your classpath, along with 
Amazon's aws-java-sdk v1.7.4.

I am actually working on a PR to add the hadoop-aws, openstack and (hadoop 
2.7+) azure JARs to spark-assembly, but you'll still need to add the relevant 
amazon SDK (which has proven brittle across versions):

https://github.com/apache/spark/pull/12004

It's not ready yet; once I've got the 2.7 profile working with tests for 
openstack and azure, *and documentation on use and testing*, then you'll be 
able to play with it.
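
To make the classpath fix concrete, a rough sketch from a spark-shell, assuming
sqlContext is a HiveContext (as in the -Phive build above); the bucket,
credentials, table schema and jar versions are placeholders:

// Launch with the extra jars on the classpath, e.g. (versions illustrative):
//   spark-shell --jars hadoop-aws-2.6.0.jar,aws-java-sdk-1.7.4.jar

// Make the s3n:// scheme resolvable and supply credentials; the same keys can
// instead go in core-site.xml or be passed as --conf spark.hadoop.fs.s3n.*
sqlContext.setConf("fs.s3n.impl",
  "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
sqlContext.setConf("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
sqlContext.setConf("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

// With the filesystem implementation on the classpath, the DDL should get
// past "No FileSystem for scheme: s3n".
sqlContext.sql(
  """CREATE EXTERNAL TABLE lineitem (id INT, name STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LOCATION 's3n://my-bucket/path/to/data/'""".stripMargin)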


Re: Question Create External table location S3

2016-03-31 Thread Raymond Honderdors
Thanks for the insights
I'll try to add it

Sent from Outlook Mobile



On Thu, Mar 31, 2016 at 4:39 AM -0700, "Steve Loughran" <ste...@hortonworks.com> wrote:


On 31 Mar 2016, at 10:00, Raymond Honderdors <raymond.honderd...@sizmek.com> wrote:

Hi,

I pulled the latest version

git pull git://github.com/apache/spark.git

Compiled:
mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver 
-DskipTests clean package


now I am getting the following error:
Error: org.apache.spark.sql.execution.QueryExecutionException: FAILED: 
Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. 
MetaException(message:Got exception: java.io.IOException No FileSystem for 
scheme: s3n) (state=,code=0)

Did anyone else experience this?


Raymond Honderdors
Team Lead Analytics BI
Business Intelligence Developer
raymond.honderd...@sizmek.com
T +972.7325.3569
Herzliya




You are going to need to add the hadoop-aws JAR to your classpath, along with 
Amazon's aws-java-sdk v1.7.4.

I am actually working on a PR to add the hadoop-aws, openstack and (hadoop 
2.7+) azure JARs to spark-assembly, but you'll still need to add the relevant 
amazon SDK (which has proven brittle across versions):

https://github.com/apache/spark/pull/12004

It's not ready yet; once I've got the 2.7 profile working with tests for 
openstack and azure, *and documentation on use and testing*, then you'll be 
able to play with it.


Jenkins PR failing, Mima unhappy: bad constant pool tag 50 at byte 12

2016-03-31 Thread Steve Loughran

A WiP PR of mine is failing in mima: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54525/consoleFull

[info] spark-examples: previous-artifact not set, not analyzing binary 
compatibility
java.lang.RuntimeException: bad constant pool tag 50 at byte 12
at 
com.typesafe.tools.mima.core.ClassfileParser$ConstantPool.errorBadTag(ClassfileParser.scala:204)
at 
com.typesafe.tools.mima.core.ClassfileParser$ConstantPool.<init>(ClassfileParser.scala:106)
at 
com.typesafe.tools.mima.core.ClassfileParser.parseAll(ClassfileParser.scala:67)
at com.typesafe.tools.mima.core.ClassfileParser.parse(ClassfileParser.scala:59)
at com.typesafe.tools.mima.core.ClassInfo.ensureLoaded(ClassInfo.scala:86)
at com.typesafe.tools.mima.core.ClassInfo.methods(ClassInfo.scala:101)
at 
com.typesafe.tools.mima.core.ClassInfo$$anonfun$lookupClassMethods$2.apply(ClassInfo.scala:123)
at 
com.typesafe.tools.mima.core.ClassInfo$$anonfun$lookupClassMethods$2.apply(ClassInfo.scala:123)

...

That's the kind of message which hints at some kind of JVM versioning mismatch, 
but, AFAIK, I'm (a) just pulling in java 6/7 libraries and (b) skipping the 
hadoop-2.6+ module anyway.

Any suggestions to make the stack trace go away?


Re: Spark SQL UDF Returning Rows

2016-03-31 Thread Hamel Kothari
Hi Michael,

Thanks for the response. I am just extracting part of the nested structure
and returning only a piece of that same structure.

I haven't looked at Encoders or Datasets since we're bound to 1.6 for now
but I'll look at encoders to see if that covers it. Datasets seems like it
would solve this problem for sure.

I avoided returning a case object because, even if we use reflection to
build byte code and do it efficiently, I still need to convert my Row to a
case object manually within my UDF, just to have it converted to a Row
again. Even if it's fast, it's still fairly unnecessary.

The thing I guess that threw me off was that UDF1/2/3 was in a "java"
prefixed package although there was nothing that made it Java-specific, and
in fact it was the only way to do what I wanted in Scala. For things like
JavaRDD, etc. it makes sense, but for generic things like UDF, is there a
reason they get put into a package with "java" in the name?

Regards,
Hamel

On Wed, Mar 30, 2016 at 3:47 PM Michael Armbrust 
wrote:

> Some answers and more questions inline
>
> - UDFs can pretty much only take in Primitives, Seqs, Maps and Row objects
>> as parameters. I cannot take in a case class object in place of the
>> corresponding Row object, even if the schema matches because the Row object
>> will always be passed in at Runtime and it will yield a ClassCastException.
>
>
> This is true today, but could be improved using the new encoder
> framework.  Out of curiosity, have you looked at that?  If so, what is
> missing that's leading you back to UDFs?
>
> Is there any way to return a Row object in scala from a UDF and specify
>> the known schema that would be returned at UDF registration time? In
>> python/java this seems to be the case because you need to explicitly
>> specify return DataType of your UDF but using scala functions this isn't
>> possible. I guess I could use the Java UDF1/2/3... API but I wanted to see
>> if there was a first class scala way to do this.
>
>
> I think UDF1/2/3 are the only way to do this today.  Is the problem here
> that you are only changing a subset of the nested data and you want to
> preserve the structure?  What kind of changes are you doing?
>
> 2) Is Spark actually converting the returned case class object when the
>>> UDF is called, or does it use the fact that it's essentially "Product" to
>>> efficiently coerce it to a Row in some way?
>>>
>>
> We use reflection to figure out the schema and extract the data into the
> internal row format.  We actually build bytecode at runtime for this in many
> cases (though not all yet), so it can be pretty fast.
>
>
>> 2.1) If this is the case, we could just take in a case object as a
>> parameter (rather than a Row) and perform manipulation on that and return
>> it. Is this explicitly something we avoided?
>
>
> You can do this with Datasets:
>
> df.as[CaseClass].map(o => do stuff)
>


What influences the space complexity of Spark operations?

2016-03-31 Thread Steve Johnston
*What we’ve observed*
Increasing the number of partitions (and thus decreasing the partition size)
seems to reliably help avoid OOM errors. To demonstrate this we used a
single executor and loaded a small table into a DataFrame, persisted it with
MEMORY_AND_DISK, repartitioned it and joined it to itself. Varying the
number of partitions identifies a threshold between completing the join and
incurring an OOM error. 
lineitem = sc.textFile('lineitem.tbl').map(converter)
lineitem = sqlContext.createDataFrame(lineitem, schema)
lineitem.persist(StorageLevel.MEMORY_AND_DISK)
repartitioned = lineitem.repartition(partition_count)
joined = repartitioned.join(repartitioned)
joined.show()
*Questions*
Generally, what influences the space complexity of Spark operations? Is it
the case that a single partition of each operand's data set + a single
partition of the resulting data set all need to fit in memory at the same
time? We can see where the transformations (for, say, joins) are implemented
in the source code (for the example above, BroadcastNestedLoopJoin), but they
seem to be based on virtualized iterators; where in the code is the
partition data for the inputs and outputs actually materialized?



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/What-influences-the-space-complexity-of-Spark-operations-tp16944.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Re: Making BatchPythonEvaluation actually Batch

2016-03-31 Thread Davies Liu
@Justin, it's fixed by https://github.com/apache/spark/pull/12057

On Thu, Feb 11, 2016 at 11:26 AM, Davies Liu  wrote:
> Had a quick look at your commit; I think that makes sense. Could you
> send a PR for that, then we can review it?
>
> In order to support 2), we need to change the serialized Python
> function from `f(iter)` to `f(x)`, process one row at a time (not a
> partition),
> then we can easily combine them together:
>
> for f1(f2(x))  and g1(g2(x)), we can do this in Python:
>
> for row in reading_stream:
>x1, x2 = row
>y1 = f1(f2(x1))
>y2 = g1(g2(x2))
>yield (y1, y2)
>
> For RDD, we still need to use `f(iter)`, but for SQL UDF, use `f(x)`.
>
> On Sun, Jan 31, 2016 at 1:37 PM, Justin Uang  wrote:
>> Hey guys,
>>
>> BLUF: sorry for the length of this email, trying to figure out how to batch
>> Python UDF executions, and since this is my first time messing with
>> catalyst, would like any feedback
>>
>> My team is starting to use PySpark UDFs quite heavily, and performance is a
>> huge blocker. The extra roundtrip serialization from Java to Python is not a
>> huge concern if we only incur it ~once per column for most workflows, since
>> it'll be in the same order of magnitude as reading files from disk. However,
>> right now each Python UDF leads to its own roundtrip. There is definitely a
>> lot we can do regarding this:
>>
>> (all the prototyping code is here:
>> https://github.com/justinuang/spark/commit/8176749f8a6e6dc5a49fbbb952735ff40fb309fc)
>>
>> 1. We can't chain Python UDFs.
>>
>> df.select(python_times_2(python_times_2("col1")))
>>
>> throws an exception saying that the inner expression isn't evaluable. The
>> workaround is to do
>>
>>
>> df.select(python_times_2("col1").alias("tmp")).select(python_time_2("tmp"))
>>
>> This can be solved in ExtractPythonUDFs by always extracting the inner most
>> Python UDF first.
>>
>>     // Pick the UDF we are going to evaluate (TODO: Support evaluating multiple UDFs at a time)
>>     // If there is more than one, we will add another evaluation operator in a subsequent pass.
>> -    udfs.find(_.resolved) match {
>> +    udfs.find { udf =>
>> +      udf.resolved && udf.children.map { child: Expression =>
>> +        // really hacky way to find if a child of a udf has the PythonUDF node
>> +        child.find {
>> +          case p: PythonUDF => true
>> +          case _ => false
>> +        }.isEmpty
>> +      }.reduce((x, y) => x && y)
>> +    } match {
>>       case Some(udf) =>
>>         var evaluation: EvaluatePython = null
>>
>> 2. If we have a Python UDF applied to many different columns, where they
>> don’t depend on each other, we can optimize them by collapsing them down
>> into a single python worker. Although we have to serialize and send the same
>> amount of data to the python interpreter, in the case where I am applying
>> the same function to 20 columns, the overhead/context_switches of having 20
>> interpreters run at the same time causes huge performance hits. I have
>> confirmed this by manually taking the 20 columns, converting them to a
>> struct, and then writing a UDF that processes the struct at the same time,
>> and the speed difference is 2x. My approach to adding this to catalyst is
>> basically to write an optimizer rule called CombinePython which joins
>> adjacent EvaluatePython nodes that don’t depend on each other’s variables,
>> and then having BatchPythonEvaluation run multiple lambdas at once. I would
>> also like to be able to handle the case
>> df.select(python_times_2(“col1”).alias(“col1x2”)).select(F.col(“col1x2”),
>> python_times_2(“col1x2”).alias(“col1x4”)). To get around that, I add a
>> PushDownPythonEvaluation optimizer that will push the optimization through a
>> select/project, so that the CombinePython rule can join the two.
>>
>> 3. I would like CombinePython to be able to handle UDFs that chain off of
>> each other.
>>
>> df.select(python_times_2(python_times_2(“col1”)))
>>
>> I haven’t prototyped this yet, since it’s a lot more complex. The way I’m
>> thinking about this is to still have a rule called CombinePython, except
>> that the BatchPythonEvaluation will need to be smart enough to build up the
>> dag of dependencies, and then feed that information to the python
>> interpreter, so it can compute things in the right order, and reuse the
>> in-memory objects that it has already computed. Does this seem right? Should
>> the code mainly be in BatchPythonEvaluation? In addition, we will need to
>> change up the protocol between the java and python sides to support sending
>> this information. What is acceptable?
>>
>> Any help would be much appreciated! Especially w.r.t. the design choices,
>> so that the PR has a chance of being accepted.
>>
>> Justin

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org