Re: Change for submitting to yarn in 1.3.1

2015-05-14 Thread Chester At Work
Marcelo, thanks for the comments. All my requirements are from our work over the last year in yarn-cluster mode, so I am biased toward the yarn side. It's true some of the tasks might be accomplished with a separate yarn API call; the API just does not seem to be of that nature any more if w

RE: Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Cheng, Hao
Spark SQL just loads the query result as a new source (via JDBC), so do not confuse it with Spark SQL tables. They are totally independent database systems. From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 1:59 PM To: Cheng, Hao; Dev Subject: Re: Does Spark SQL (JD
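The point above can be sketched with a small helper: the JDBC source hands "dbtable" straight to the underlying database, so a sub-select works only if it is wrapped as a derived table that the database itself evaluates. The helper name and column/table names below are illustrative; the commented load call follows the Spark 1.3-era API.

```scala
// Hypothetical helper: wrap an arbitrary SQL query so a JDBC source can treat
// it as "dbtable" -- the underlying database sees "(query) as alias", i.e. a
// derived table it evaluates itself, independent of Spark SQL's own tables.
def asDbTable(query: String, alias: String): String =
  s"($query) as $alias"

val dbtable = asDbTable("select _id, _name from mock_locations", "locs")
// With a SQLContext in scope, a Spark 1.3-style load would then be roughly:
// sqlContext.load("jdbc", Map("url" -> jdbcUrl, "dbtable" -> dbtable))
```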

Re: Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Yi Zhang
@Hao, Because the query joins more than one table, if I register the data frame as a temp table, Spark can't distinguish which table is correct. I don't know how to set dbtable and register the temp table. Any suggestion? On Friday, May 15, 2015 1:38 PM, "Cheng, Hao" wrote:

RE: Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Cheng, Hao
You need to register the “dataFrame” as a table first and then do queries on it? Do you mean that also failed? From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 1:10 PM To: Yi Zhang; Dev Subject: Re: Does Spark SQL (JDBC) support nest select with current version If I

Re: Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Yi Zhang
If I pass the whole statement as dbtable to the sqlContext.load() method as below: val query = """ (select t1._salory as salory, |t1._name as employeeName, |(select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName |from mock_employees t1 |inner join m
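For reference, the full statement appears in the original message later in this digest; assembled with stripMargin and wrapped as a derived table so a JDBC source will accept it as "dbtable", it would look roughly like this (column/table names are the thread's own; the "emp" alias is an assumption):

```scala
// The query from the thread, rebuilt with stripMargin and wrapped in
// parentheses with an alias. Identifiers (including "_salory") are as posted;
// only the "emp" alias is assumed.
val query = """(select t1._salory as salory,
               |        t1._name as employeeName,
               |        (select _name from mock_locations t3
               |          where t3._id = t1._location_id) as locationName
               |   from mock_employees t1
               |  inner join mock_locations t2 on t1._location_id = t2._id
               |  where t1._salory > t2._max_price) as emp""".stripMargin
```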

Does Spark SQL (JDBC) support nest select with current version

2015-05-14 Thread Yi Zhang
The SQL statement is like this: select t1._salory as salory, t1._name as employeeName, (select _name from mock_locations t3 where t3._id = t1._location_id ) as locationName from mock_employees t1 inner join mock_locations t2 on t1._location_id = t2._id where t1._salory > t2._max_price I noticed th

Re: Change for submitting to yarn in 1.3.1

2015-05-14 Thread Marcelo Vanzin
Hi Chester, Thanks for the feedback. A few of those are great candidates for improvements to the launcher library. On Wed, May 13, 2015 at 5:44 AM, Chester At Work wrote: > 1) client should not be private ( unless alternative is provided) so > we can call it directly. > Patrick already to

RE: testing HTML email

2015-05-14 Thread Ulanov, Alexander
Testing too. Recently I got a few undelivered mails to the dev list. From: Reynold Xin [mailto:r...@databricks.com] Sent: Thursday, May 14, 2015 3:39 PM To: dev@spark.apache.org Subject: testing HTML email Testing html emails ... Hello This is bold This is a link

Spark Summit 2015 - June 15-17 - Dev list invite

2015-05-14 Thread Scott walent
Join the Apache Spark community at the fourth Spark Summit in San Francisco on June 15, 2015. At Spark Summit 2015 you will hear keynotes from NASA, the CIA, Toyota, Databricks, AWS, Intel, MapR, IBM, Cloudera, Hortonworks, Timeful, O'Reilly, and Andreessen Horowitz. 260 talk proposals were submit

Re: s3 vfs on Mesos Slaves

2015-05-14 Thread Haoyuan Li
Another way is to configure S3 as Tachyon's under storage system, and then run Spark on Tachyon. More info: http://tachyon-project.org/Setup-UFS.html Best, Haoyuan On Wed, May 13, 2015 at 10:52 AM, Stephen Carman wrote: > Thank you for the suggestions, the problem exists in the fact we need t

Re: practical usage of the new "exactly-once" supporting DirectKafkaInputDStream

2015-05-14 Thread Cody Koeninger
Sorry, realized I probably didn't fully answer your question about my blog post, as opposed to Michael Noll's. The direct stream is really blunt: a given RDD partition is just a Kafka topic/partition and an upper/lower bound for the range of offsets. When an executor computes the partition, it c
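The shape being described can be sketched with a stand-in for the real OffsetRange class (the actual one lives in org.apache.spark.streaming.kafka; this toy version only illustrates the bookkeeping, not the API):

```scala
// Toy stand-in for direct-stream offset bookkeeping: each RDD partition
// corresponds to one Kafka topic/partition plus a half-open offset interval.
case class OffsetRange(topic: String, partition: Int,
                       fromOffset: Long, untilOffset: Long) {
  def count: Long = untilOffset - fromOffset // messages covered by this partition
}

val ranges = Seq(
  OffsetRange("events", 0, 1000L, 1500L),
  OffsetRange("events", 1, 2000L, 2600L))
val total = ranges.map(_.count).sum // messages across both partitions
```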

Re: Error recovery strategies using the DirectKafkaInputDStream

2015-05-14 Thread Cody Koeninger
I think as long as you have adequate monitoring and Kafka retention, the simplest solution is safest - let it crash. On May 14, 2015 4:00 PM, "badgerpants" wrote: > We've been using the new DirectKafkaInputDStream to implement an exactly > once > processing solution that tracks the provided offs

testing HTML email

2015-05-14 Thread Reynold Xin
Testing html emails ... Hello *This is bold* This is a link

Re: practical usage of the new "exactly-once" supporting DirectKafkaInputDStream

2015-05-14 Thread Cody Koeninger
If the transformation you're trying to do really is per-partition, it shouldn't matter whether you're using scala methods or spark methods. The parallel speedup you're getting is all from doing the work on multiple machines, and shuffle or caching or other benefits of spark aren't a factor. If us
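Cody's point can be illustrated with a purely local sketch: the per-partition function is ordinary Scala iterator code, and Spark's contribution is only running it on many partitions across machines (the parsing logic here is a made-up example):

```scala
// A per-partition transformation: plain Scala Iterator methods, no Spark API.
def parsePartition(records: Iterator[String]): Iterator[Int] =
  records.map(_.trim.toInt).filter(_ >= 0)

// Locally it runs on an ordinary iterator...
val local = parsePartition(Iterator("1", "-2", "3")).toList
// ...and on a cluster the same function would be passed to
// rdd.mapPartitions(parsePartition) -- same logic, just distributed,
// with no shuffle involved.
```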

Error recovery strategies using the DirectKafkaInputDStream

2015-05-14 Thread badgerpants
We've been using the new DirectKafkaInputDStream to implement an exactly once processing solution that tracks the provided offset ranges within the same transaction that persists our data results. When an exception is thrown within the processing loop and the configured number of retries are exhaus
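The pattern described here, committing the results and the consumed offsets together, can be simulated in memory (a real deployment would use a single JDBC transaction instead; the class and method names below are made up for illustration):

```scala
import scala.collection.mutable

// In-memory simulation of transactional offset tracking: a batch is applied
// only if its fromOffset matches the last committed offset, so a replay of an
// already-committed batch after a crash is a harmless no-op.
class AtomicStore {
  val results = mutable.Buffer[String]()
  val offsets = mutable.Map[Int, Long]() // partition -> next offset to consume

  def commitBatch(partition: Int, from: Long, until: Long,
                  data: Seq[String]): Boolean = {
    if (offsets.getOrElse(partition, 0L) != from) return false // replay: skip
    results ++= data           // "persist" the computed results...
    offsets(partition) = until // ...and the offsets, atomically in this model
    true
  }
}

val store  = new AtomicStore
val first  = store.commitBatch(0, 0L, 2L, Seq("a", "b")) // first delivery: applied
val replay = store.commitBatch(0, 0L, 2L, Seq("a", "b")) // replay after crash: skipped
```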

Re: practical usage of the new "exactly-once" supporting DirectKafkaInputDStream

2015-05-14 Thread will-ob
Hey Cody (et al.), A few more questions related to this. It sounds like our missing-data issues appear fixed with this approach. Could you shed some light on a few questions that came up? - Processing our data inside a single foreachPartition function appears to be very differ

Re: How to link code pull request with JIRA ID?

2015-05-14 Thread Patrick Wendell
Yeah I wrote the original script and I intentionally made it easy for other projects to use (you'll just need to tweak some variables at the top). You just need somewhere to run it... we were using a jenkins cluster to run it every 5 minutes. BTW - I looked and there is one instance where it hard

Re: How to link code pull request with JIRA ID?

2015-05-14 Thread Josh Rosen
Spark PRs didn't always handle the JIRA linking. We used to rely on a Jenkins job that ran https://github.com/apache/spark/blob/master/dev/github_jira_sync.py. We switched this over to Spark PRs at a time when the Jenkins GitHub Pull Request Builder plugin was having flakiness issues, but

Simple SQL queries producing invalid results [SPARK-6743]

2015-05-14 Thread Santiago Mola
Hi, Could someone take a look at this issue? [SPARK-6743] Join with empty projection on one side produces invalid results https://issues.apache.org/jira/browse/SPARK-6743 Thank you, -- Santiago M. Mola Vía de las dos Castillas, 33, Ática 4, 3ª Planta 28224 Pozuelo d

Build change PSA: Hadoop 2.2 default; -Phadoop-x.y profile recommended for builds

2015-05-14 Thread Sean Owen
This change will be merged shortly for Spark 1.4, and has a minor implication for those creating their own Spark builds: https://issues.apache.org/jira/browse/SPARK-7249 https://github.com/apache/spark/pull/5786 The default Hadoop dependency has actually been Hadoop 2.2 for some time, but the def
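In practice the PSA means passing an explicit Hadoop profile when building rather than relying on the default; a sketch with illustrative version numbers (profile and property names per the Spark 1.x "Building Spark" documentation):

```shell
# Build against a specific Hadoop version instead of relying on the default
# (versions here are illustrative).
mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
# or via the distribution script:
./make-distribution.sh --tgz -Phadoop-2.4 -Dhadoop.version=2.4.0
```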