Re: Integrating D3 with Spark

2015-04-13 Thread Paolo Platter
Hi,

I integrated charts into spark-notebook, a very similar task. To reduce D3
boilerplate I suggest using dimple.js; it provides D3-based charts out of the box.

Bye

Paolo

Sent from my Windows Phone

From: anshu shukla
Sent: 13/04/2015 05:50
To: Kay Ousterhout
Cc: shroffpradyumn; dev@spark.apache.org
Subject: Re: Integrating D3 with Spark

Hey Ousterhout,

I found it amazing. Before this I used my own D3.js files that subscribe to the
Redis pub/sub channel where the output tuples are published, so that approach
already added latency for pushing data to Redis, although it was very small.
Once again, thanks.

On Sun, Apr 12, 2015 at 10:06 PM, Kay Ousterhout 
wrote:

> Hi Pradyumn,
>
> Take a look at this pull request, which does something similar:
> https://github.com/apache/spark/pull/2342/files
>
> You can put JavaScript in 

Manning looking for a co-author for the GraphX in Action book

2015-04-13 Thread Reynold Xin
Hi all,

Manning (the publisher) is looking for a co-author for the GraphX in Action
book. The book currently has one author (Michael Malak), but they are
looking for a co-author to work closely with Michael to improve the
writing and make the book more consumable.

Early access page for the book: http://www.manning.com/malak/

Let me know if you are interested in that. Cheers.


Spark SQL 1.3.1 "saveAsParquetFile" will output tachyon file with different block size

2015-04-13 Thread zhangxiongfei
Hi experts,
I ran the code below in the Spark shell to access Parquet files in Tachyon.
1. First, created a DataFrame by loading a bunch of Parquet files from Tachyon:
   val ta3 = sqlContext.parquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")
2. Second, set "fs.local.block.size" to 256 MB to make sure that the block size of the
   output files in Tachyon is 256 MB:
   sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456)
3. Third, saved the above DataFrame as Parquet files stored in Tachyon:
   ta3.saveAsParquetFile("tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")
After the above code ran successfully, the output Parquet files were stored in
Tachyon, but these files have different block sizes. Below is the information for
the files in the path
"tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test":
File Name            Size       Block Size  In-Memory  Pin  Creation Time
_SUCCESS             0.00 B     256.00 MB   100%       NO   04-13-2015 17:48:23:519
_common_metadata     1088.00 B  256.00 MB   100%       NO   04-13-2015 17:48:23:741
_metadata            22.71 KB   256.00 MB   100%       NO   04-13-2015 17:48:23:646
part-r-1.parquet     177.19 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:626
part-r-2.parquet     177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:636
part-r-3.parquet     177.02 MB  32.00 MB    100%       NO   04-13-2015 17:46:45:439
part-r-4.parquet     177.21 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:845
part-r-5.parquet     177.40 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:638
part-r-6.parquet     177.33 MB  32.00 MB    100%       NO   04-13-2015 17:46:44:648

It seems that the saveAsParquetFile API does not distribute/broadcast the Hadoop
configuration to the executors the way other APIs such as saveAsTextFile do; the
"fs.local.block.size" setting only takes effect on the driver.
If I set that configuration before loading the Parquet files, the problem is gone.
Could anyone help me verify this problem?
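
For reference, the reordering that makes the problem go away for me (same paths and block size as above):

  // Set the block size first, so the Hadoop configuration that accompanies the
  // subsequent read and write already carries it.
  sc.hadoopConfiguration.setLong("fs.local.block.size", 268435456) // 256 MB

  val ta3 = sqlContext.parquetFile(
    "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m")

  ta3.saveAsParquetFile(
    "tachyon://tachyonserver:19998/apps/tachyon/zhangxf/parquetAdClick-6p-256m-test")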

Thanks
Zhang Xiongfei


Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-13 Thread Sean McNamara
+1

Sean

> On Apr 11, 2015, at 12:07 AM, Patrick Wendell  wrote:
> 
> Please vote on releasing the following candidate as Apache Spark version 
> 1.3.1!
> 
> The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44
> 
> The list of fixes present in this release can be found at:
> http://bit.ly/1C2nVPY
> 
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc3/
> 
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
> 
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1088/
> 
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/
> 
> The patches on top of RC2 are:
> [SPARK-6851] [SQL] Create new instance for each converted parquet relation
> [SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
> [SPARK-6343] Doc driver-worker network reqs
> [SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
> [SPARK-6781] [SQL] use sqlContext in python shell
> [SPARK-6753] Clone SparkConf in ShuffleSuite tests
> [SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed...
> 
> Please vote on releasing this package as Apache Spark 1.3.1!
> 
> The vote is open until Tuesday, April 14, at 07:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Spark 1.3.1
> [ ] -1 Do not release this package because ...
> 
> To learn more about Apache Spark, please see
> http://spark.apache.org/
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Spark ThriftServer encounter java.lang.IllegalArgumentException: Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]

2015-04-13 Thread Andrew Lee
Hi Cheng,
I couldn't find a JIRA component for the Spark ThriftServer; would that be the 'SQL'
component?
JIRA created: https://issues.apache.org/jira/browse/SPARK-6882

> Date: Sun, 15 Mar 2015 21:03:34 +0800
> From: lian.cs@gmail.com
> To: alee...@hotmail.com; dev@spark.apache.org
> Subject: Re: Spark ThriftServer encounter java.lang.IllegalArgumentException: 
> Unknown auth type: null Allowed values are: [auth-int, auth-conf, auth]
> 
> Hey Andrew,
> 
> Would you please create a JIRA ticket for this? To preserve
> compatibility with existing Hive JDBC/ODBC drivers, Spark SQL's
> HiveThriftServer2 intercepts some HiveServer2 components and injects
> Spark code into them. This makes the implementation somewhat hacky
> (e.g. a bunch of reflection tricks are used). We haven't included
> Kerberos tests in Spark's unit/integration test suites, and it's
> possible that HiveThriftServer2 somehow breaks Hive's Kerberos support.
> 
> Cheng
> 
> On 3/14/15 3:43 AM, Andrew Lee wrote:
> > When Kerberos is enabled, I get the following exception (Spark 1.2.1, git commit
> > b6eaf77d4332bfb0a698849b1f5f917d20d70e97, Hive 0.13.1, Apache Hadoop 2.4.1)
> > when starting the Spark ThriftServer.
> > Command to start thriftserver
> > ./start-thriftserver.sh --hiveconf hive.server2.thrift.port=2 
> > --hiveconf hive.server2.thrift.bind.host=$(hostname) --master yarn-client
> > Error message in spark.log
> >
> > 2015-03-13 18:26:05,363 ERROR 
> > org.apache.hive.service.cli.thrift.ThriftCLIService 
> > (ThriftBinaryCLIService.java:run(93)) - Error:
> > java.lang.IllegalArgumentException: Unknown auth type: null Allowed values 
> > are: [auth-int, auth-conf, auth]
> >  at org.apache.hive.service.auth.SaslQOP.fromString(SaslQOP.java:56)
> >  at 
> > org.apache.hive.service.auth.HiveAuthFactory.getSaslProperties(HiveAuthFactory.java:118)
> >  at 
> > org.apache.hive.service.auth.HiveAuthFactory.getAuthTransFactory(HiveAuthFactory.java:133)
> >  at 
> > org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:43)
> >  at java.lang.Thread.run(Thread.java:744)
> >
> > I'm wondering if this is the same problem described in HIVE-8154 and
> > HIVE-7620, caused by an older code base in the Spark ThriftServer?
> > Any insights are appreciated. Currently, I can't get the Spark ThriftServer to
> > run against a Kerberos cluster (Apache Hadoop 2.4.1).
> >
> > My hive-site.xml in spark/conf looks like the following (sensitive values masked with ***):
> >
> > <property>
> >   <name>hive.semantic.analyzer.factory.impl</name>
> >   <value>org.apache.hcatalog.cli.HCatSemanticAnalyzerFactory</value>
> > </property>
> > <property>
> >   <name>hive.metastore.execute.setugi</name>
> >   <value>true</value>
> > </property>
> > <property>
> >   <name>hive.stats.autogather</name>
> >   <value>false</value>
> > </property>
> > <property>
> >   <name>hive.session.history.enabled</name>
> >   <value>true</value>
> > </property>
> > <property>
> >   <name>hive.querylog.location</name>
> >   <value>/home/hive/log/${user.name}</value>
> > </property>
> > <property>
> >   <name>hive.exec.local.scratchdir</name>
> >   <value>/tmp/hive/scratch/${user.name}</value>
> > </property>
> > <property>
> >   <name>hive.metastore.uris</name>
> >   <value>thrift://somehostname:9083</value>
> > </property>
> > <property>
> >   <name>hive.server2.authentication</name>
> >   <value>KERBEROS</value>
> > </property>
> > <property>
> >   <name>hive.server2.authentication.kerberos.principal</name>
> >   <value>***</value>
> > </property>
> > <property>
> >   <name>hive.server2.authentication.kerberos.keytab</name>
> >   <value>***</value>
> > </property>
> > <property>
> >   <name>hive.server2.thrift.sasl.qop</name>
> >   <value>auth</value>
> >   <description>Sasl QOP value; one of 'auth', 'auth-int' and 'auth-conf'</description>
> > </property>
> > <property>
> >   <name>hive.server2.enable.impersonation</name>
> >   <description>Enable user impersonation for HiveServer2</description>
> >   <value>true</value>
> > </property>
> > <property>
> >   <name>hive.metastore.sasl.enabled</name>
> >   <value>true</value>
> > </property>
> > <property>
> >   <name>hive.metastore.kerberos.keytab.file</name>
> >   <value>***</value>
> > </property>
> > <property>
> >   <name>hive.metastore.kerberos.principal</name>
> >   <value>***</value>
> > </property>
> > <property>
> >   <name>hive.metastore.cache.pinobjtypes</name>
> >   <value>Table,Database,Type,FieldSchema,Order</value>
> > </property>
> > <property>
> >   <name>hdfs_sentinel_file</name>
> >   <value>***</value>
> > </property>
> > <property>
> >   <name>hive.metastore.warehouse.dir</name>
> >   <value>/hive</value>
> > </property>
> > <property>
> >   <name>hive.metastore.client.socket.timeout</name>
> >   <value>600</value>
> > </property>
> > <property>
> >   <name>hive.warehouse.subdir.inherit.perms</name>
> >   <value>true</value>
> > </property>
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
> 
  

Re: How is hive-site.xml loaded?

2015-04-13 Thread Steve Loughran
There's some magic in the process that is worth knowing/being cautious of

Those special HdfsConfiguration, YarnConfiguration and HiveConf classes all do
work in their class initializers to call Configuration.addDefaultResource.

This puts their -default and -site XML files onto the list of default
configuration resources. Hadoop then runs through the list of Configuration
instances it is tracking in a WeakHashMap and, for those created with
loadDefaults=true in their constructor, tells them to reload all their "default"
config props (preserving anything set explicitly).

This means you can use/abuse this feature to force properties onto all Hadoop
Configuration instances that asked for the default values - though this doesn't
guarantee the changes will be picked up.

It's generally considered best practice for apps to create an instance of the
configuration classes whose defaults & site files they want picked up as soon as
they can, even if you discard the instance itself. Your goal is to get those
settings in before the defaults get picked up elsewhere.
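
For instance, a minimal sketch of that pattern (assuming Hadoop 2.x classes on the classpath; the property lookup at the end is just for illustration):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.hdfs.HdfsConfiguration

  // Touching the class runs its static initializer, which calls
  // Configuration.addDefaultResource("hdfs-default.xml") and ("hdfs-site.xml").
  new HdfsConfiguration()

  // Any Configuration created afterwards with loadDefaults = true (the no-arg
  // constructor does this) now picks up hdfs-site.xml values automatically.
  val conf = new Configuration()
  val nameservices = conf.get("dfs.nameservices") // null unless set in hdfs-site.xml
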
-steve

> On 13 Apr 2015, at 07:10, Raunak Jhawar  wrote:
> 
> The most obvious path is /etc/hive/conf, but this can be changed to
> look up any other path.
> 
> --
> Thanks,
> Raunak Jhawar
> 
> 
> 
> 
> 
> 
> On Mon, Apr 13, 2015 at 11:22 AM, Dean Chen  wrote:
> 
>> Ah ok, thanks!
>> 
>> --
>> Dean Chen
>> 
>> On Apr 12, 2015, at 10:45 PM, Reynold Xin  wrote:
>> 
>> It is loaded by Hive's HiveConf, which simply searches for hive-site.xml on
>> the classpath.
>> 
>> 
>> On Sun, Apr 12, 2015 at 10:41 PM, Dean Chen  wrote:
>> 
>>> The docs state that:
>>> Configuration of Hive is done by placing your `hive-site.xml` file in
>>> `conf/`.
>>> 
>>> I've searched the codebase for hive-site.xml and didn't find code that
>>> specifically loaded it anywhere so it looks like there is some magic to
>>> autoload *.xml files in /conf? I've skimmed through HiveContext
>>> <
>>> 
>> https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala
 
>>> and didn't see anything obvious in there.
>>> 
>>> The reason I'm asking is that I am working on a feature that needs config
>>> in hbase-site.xml to be available in the spark context and would prefer
>> to
>>> follow the convention set by hive-site.xml.
>>> 
>>> --
>>> Dean Chen
>>> 
>> 


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: SPARK-5364

2015-04-13 Thread Sree V
Thank you, Reynold.

Thanking you.

With Regards
Sree 


 On Sunday, April 12, 2015 11:18 AM, Reynold Xin  
wrote:
   

 I closed it. Thanks.


On Sun, Apr 12, 2015 at 11:08 AM, Sree V 
wrote:

> Hi,
> I was browsing through the JIRAs and found one that can be closed. If anyone
> who has edit permissions on the Spark JIRA, please close this:
> https://issues.apache.org/jira/browse/SPARK-5364
> It is still Open, but its pull request is already merged and
> its parent and grandparent are Resolved.
>
>
> Thanking you.
>
> With Regards
> Sree Vaddi
>


  

Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-13 Thread Sree V
+1
builds - check
tests - check
installs and sample run - check

Thanking you.

With Regards
Sree 


 On Friday, April 10, 2015 11:07 PM, Patrick Wendell  
wrote:
   

 Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1088/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/

The patches on top of RC2 are:
[SPARK-6851] [SQL] Create new instance for each converted parquet relation
[SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
[SPARK-6343] Doc driver-worker network reqs
[SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
[SPARK-6781] [SQL] use sqlContext in python shell
[SPARK-6753] Clone SparkConf in ShuffleSuite tests
[SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed...

Please vote on releasing this package as Apache Spark 1.3.1!

The vote is open until Tuesday, April 14, at 07:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



   

Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-13 Thread Marcelo Vanzin
+1 (non-binding)

Tested 2.6 build with standalone and yarn (no external shuffle service
this time, although it does come up).

On Fri, Apr 10, 2015 at 11:05 PM, Patrick Wendell  wrote:
> Please vote on releasing the following candidate as Apache Spark version 
> 1.3.1!
>
> The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44
>
> The list of fixes present in this release can be found at:
> http://bit.ly/1C2nVPY
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc3/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1088/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/
>
> The patches on top of RC2 are:
> [SPARK-6851] [SQL] Create new instance for each converted parquet relation
> [SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
> [SPARK-6343] Doc driver-worker network reqs
> [SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
> [SPARK-6781] [SQL] use sqlContext in python shell
> [SPARK-6753] Clone SparkConf in ShuffleSuite tests
> [SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed...
>
> Please vote on releasing this package as Apache Spark 1.3.1!
>
> The vote is open until Tuesday, April 14, at 07:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark Sql reading hive partitioned tables?

2015-04-13 Thread Tom Graves
Hey,
I was trying out Spark SQL using the HiveContext and doing a select on a
partitioned table with lots of partitions (16,000+). It took over 6 minutes
before it even started the job. It looks like it was querying the Hive
metastore and getting a good chunk of data back, which I'm guessing is info on
the partitions. Running the same query in Hive takes 45 seconds for the entire
job.
I know Spark SQL doesn't support all of the Hive optimizations. Is this a known
limitation currently?
Thanks,
Tom

Re: Spark Sql reading hive partitioned tables?

2015-04-13 Thread Michael Armbrust
Yeah, we don't currently push down predicates into the metastore, though we do
prune partitions based on predicates (so we don't read the data).
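
For example (hypothetical table and column names): with a table partitioned by dt,

  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  hiveContext.sql("SELECT count(*) FROM click_events WHERE dt = '2015-04-13'").collect()

the dt predicate prunes which partitions' files are actually read, but the full
partition list is still fetched from the metastore before that pruning happens,
which is likely the delay you're seeing with 16,000+ partitions.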

On Mon, Apr 13, 2015 at 2:53 PM, Tom Graves 
wrote:

> Hey,
> I was trying out spark sql using the HiveContext and doing a select on a
> partitioned table with lots of partitions (16,000+). It took over 6 minutes
> before it even started the job. It looks like it was querying the Hive
> metastore and got a good chunk of data back.  Which I'm guessing is info on
> the partitions.  Running the same query using hive takes 45 seconds for the
> entire job.
> I know spark sql doesn't support all the hive optimization.  Is this a
> known limitation currently?
> Thanks,Tom


Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Sean Owen
Pardon, I wanted to call attention to a JIRA I just created...

https://issues.apache.org/jira/browse/SPARK-6889

... in which I propose what I hope are some changes to the
contribution process wiki that could help a bit with the flood of
reviews and PRs. I'd be grateful for your thoughts and comments there,
as it's my current pet issue.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Patrick Wendell
Would just like to encourage everyone who is active in day-to-day
development to give feedback on this (and I will do same). Sean has
spent a lot of time looking through different ways we can streamline
our dev process.

- Patrick

On Mon, Apr 13, 2015 at 3:59 PM, Sean Owen  wrote:
> Pardon, I wanted to call attention to a JIRA I just created...
>
> https://issues.apache.org/jira/browse/SPARK-6889
>
> ... in which I propose what I hope are some changes to the
> contribution process wiki that could help a bit with the flood of
> reviews and PRs. I'd be grateful for your thoughts and comments there,
> as it's my current pet issue.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Nicholas Chammas
Wow, I had an open email draft to whine (yet again) about our open PR count
and provide some suggestions.

Will redirect that to the JIRA Sean created. Sweet!

Nick

On Mon, Apr 13, 2015 at 7:05 PM Patrick Wendell  wrote:

> Would just like to encourage everyone who is active in day-to-day
> development to give feedback on this (and I will do same). Sean has
> spent a lot of time looking through different ways we can streamline
> our dev process.
>
> - Patrick
>
> On Mon, Apr 13, 2015 at 3:59 PM, Sean Owen  wrote:
> > Pardon, I wanted to call attention to a JIRA I just created...
> >
> > https://issues.apache.org/jira/browse/SPARK-6889
> >
> > ... in which I propose what I hope are some changes to the
> > contribution process wiki that could help a bit with the flood of
> > reviews and PRs. I'd be grateful for your thoughts and comments there,
> > as it's my current pet issue.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Query regarding infering data types in pyspark

2015-04-13 Thread Davies Liu
Hey Suraj,

You should cast to the "date" data type, and pass the target column name to
withColumn:

df.withColumn("DateCol", df.DateCol.cast("date"))

Davies

On Sat, Apr 11, 2015 at 10:57 PM, Suraj Shetiya  wrote:
> Humble reminder
>
> On Sat, Apr 11, 2015 at 12:16 PM, Suraj Shetiya 
> wrote:
>>
>> Hi,
>>
>> Below is one line from the JSON file; the field that represents the date is
>> "FL_DATE".
>>
>>
>> "YEAR":2015,"QUARTER":1,"MONTH":1,"DAY_OF_MONTH":31,"DAY_OF_WEEK":6,"FL_DATE":"2015-01-31","UNIQUE_CARRIER":"NK","AI
>> RLINE_ID":20416,"CARRIER":"NK","TAIL_NUM":"N614NK","FL_NUM":126,"ORIGIN_AIRPORT_ID":11697,"ORIGIN_AIRPORT_SEQ_ID":1169
>> 703,"ORIGIN_CITY_MARKET_ID":32467,"ORIGIN":"FLL","ORIGIN_CITY_NAME":"Fort
>> Lauderdale, FL","ORIGIN_STATE_ABR":"FL","ORI
>> GIN_STATE_FIPS":12,"ORIGIN_STATE_NM":"Florida","ORIGIN_WAC":33,"DEST_AIRPORT_ID":13577,"DEST_AIRPORT_SEQ_ID":1357702,"
>> DEST_CITY_MARKET_ID":31135,"DEST":"MYR","DEST_CITY_NAME":"Myrtle Beach,
>> SC","DEST_STATE_ABR":"SC","DEST_STATE_FIPS":45,"DEST_STATE_NM":"South
>> Carolina","DEST_WAC":37,"CRS_DEP_TIME":2010,"DEP_TIME":2009.0,"DEP_DELAY":-1.0,"DEP_DELAY_NEW"
>> :0.0,"DEP_DEL15":0.0,"DEP_DELAY_GROUP":-1.0,"DEP_TIME_BLK":"2000-2059","TAXI_OUT":17.0,"WHEELS_OFF":2026.0,"WHEELS_ON"
>> :2147.0,"TAXI_IN":5.0,"CRS_ARR_TIME":2149,"ARR_TIME":2152.0,"ARR_DELAY":3.0,"ARR_DELAY_NEW":3.0,"ARR_DEL15":0.0,"ARR_DELAY_GROUP":0.0,"ARR_TIME_BLK":"2100-2159","Unnamed:
>> 47":null}
>>
>> Please let me know if you need access to the dataset.
>>
>> On Sat, Apr 11, 2015 at 11:56 AM, Davies Liu 
>> wrote:
>>>
>>> What's the format you have in json file?
>>>
>>> On Fri, Apr 10, 2015 at 6:57 PM, Suraj Shetiya 
>>> wrote:
>>> > Hi,
>>> >
>>> > In pyspark, when I read a JSON file using sqlContext, I find that the date
>>> > field is not inferred as a date; instead it is converted to a string. And
>>> > when I try to convert it to a date using
>>> > df.withColumn(df.DateCol.cast("timestamp"))
>>> > it does not parse successfully and puts a null there instead. Should I
>>> > use a UDF to convert the date? Is this expected behaviour (not throwing
>>> > an error after a failed cast)?
>>> >
>>> > --
>>> > Regards,
>>> > Suraj
>>
>>
>>
>>
>> --
>> Regards,
>> Suraj
>
>
>
>
> --
> Regards,
> Suraj

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.3.1 (RC3)

2015-04-13 Thread GuoQiang Li
+1 (non-binding)





-- Original --
From:  "Patrick Wendell";;
Date:  Sat, Apr 11, 2015 02:05 PM
To:  "dev@spark.apache.org"; 

Subject:  [VOTE] Release Apache Spark 1.3.1 (RC3)



Please vote on releasing the following candidate as Apache Spark version 1.3.1!

The tag to be voted on is v1.3.1-rc2 (commit 3e83913):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=3e8391327ba586eaf54447043bd526d919043a44

The list of fixes present in this release can be found at:
http://bit.ly/1C2nVPY

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1088/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.1-rc3-docs/

The patches on top of RC2 are:
[SPARK-6851] [SQL] Create new instance for each converted parquet relation
[SPARK-5969] [PySpark] Fix descending pyspark.rdd.sortByKey.
[SPARK-6343] Doc driver-worker network reqs
[SPARK-6767] [SQL] Fixed Query DSL error in spark sql Readme
[SPARK-6781] [SQL] use sqlContext in python shell
[SPARK-6753] Clone SparkConf in ShuffleSuite tests
[SPARK-6506] [PySpark] Do not try to retrieve SPARK_HOME when not needed...

Please vote on releasing this package as Apache Spark 1.3.1!

The vote is open until Tuesday, April 14, at 07:00 UTC and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.3.1
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see
http://spark.apache.org/

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org

Eliminate partition filters in execution.Filter after filter pruning

2015-04-13 Thread Yijie Shen
Hi,

Suppose I have a table t(id: String, event: String) saved as Parquet files, with the
directory hierarchy:
hdfs://path/to/data/root/dt=2015-01-01/hr=00
After partition discovery, the resulting schema should be (id: String, event: String,
dt: String, hr: Int).

If I have a query like:

df.select($"id").filter(event match).filter($"dt" > "2015-01-01").filter($"hr" > 13)

In the current implementation, after (dt > '2015-01-01' && hr > 13) is used to filter
partitions, these two filters remain in the execution plan and force each row returned
from Parquet to carry the two extra fields dt and hr, which I think is unnecessary; we
could rewrite execution.Filter's predicate and eliminate them.
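
To make the idea concrete, here is a rough sketch of the split I have in mind
(hypothetical types, plain Scala rather than the actual Catalyst expressions):

  case class Pred(references: Set[String], expr: String)

  // Predicates that touch only partition columns are consumed by partition pruning;
  // everything else stays in execution.Filter.
  def splitPredicates(preds: Seq[Pred], partitionCols: Set[String]): (Seq[Pred], Seq[Pred]) =
    preds.partition(_.references.subsetOf(partitionCols))

  val (partitionOnly, residual) = splitPredicates(
    Seq(Pred(Set("event"), "event = ..."),
        Pred(Set("dt"), "dt > '2015-01-01'"),
        Pred(Set("hr"), "hr > 13")),
    partitionCols = Set("dt", "hr"))
  // partitionOnly is applied once per partition directory; residual is the only
  // per-row filter left, so dt and hr never need to be materialized per row.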

What do you think? Is this a general need, or just my job's specific requirement?

If it's general, I would love to discuss the implementation further.
If it's specific, I will just make my own workaround :)

— 
Best Regards!
Yijie Shen

Using memory mapped file for shuffle

2015-04-13 Thread Kannan Rajah
DiskStore.getBytes uses memory-mapped files if the length is more than a
configured limit. This code path is used during the map-side shuffle in
ExternalSorter. I want to know if it is possible for the length to exceed that
limit in the case of shuffle. The reason I ask is that in Hadoop, each map task
is supposed to produce only as much data as fits within the task's configured
max memory; otherwise it will result in an OOM. Is the behavior the same in
Spark, or can the size of the data generated by a map task exceed what fits in
memory?

  if (length < minMemoryMapBytes) {
    // Below the threshold: read the block into an on-heap buffer.
    val buf = ByteBuffer.allocate(length.toInt)
    // ... (the channel is read into buf here)
  } else {
    // At or above the threshold: memory-map the file region instead.
    Some(channel.map(MapMode.READ_ONLY, offset, length))
  }

--
Kannan


How to connect JDBC DB based on Spark Sql

2015-04-13 Thread doovsaid
Hi all,
According to the official documentation, SQLContext can load a database table into a
DataFrame using the Data Sources API. However, it only supports the following
properties:

  url - The JDBC URL to connect to.
  dbtable - The JDBC table that should be read. Note that anything that is valid in a
    `FROM` clause of a SQL query can be used. For example, instead of a full table you
    could also use a subquery in parentheses.
  driver - The class name of the JDBC driver needed to connect to this URL. This class
    will be loaded on the master and workers before running any JDBC commands, to allow
    the driver to register itself with the JDBC subsystem.
  partitionColumn, lowerBound, upperBound, numPartitions - These options must all be
    specified if any of them is specified. They describe how to partition the table
    when reading in parallel from multiple workers. partitionColumn must be a numeric
    column from the table in question.

This leaves me confused about how to pass the username, password, or other info.
BTW, I am connecting to PostgreSQL like this:

val dataFrame = sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://192.168.1.110:5432/demo",  // how to pass username and password?
  "driver" -> "org.postgresql.Driver",
  "dbtable" -> "schema.tab_users"))

Thanks.

Regards,
Yi





Re: How to connect JDBC DB based on Spark Sql

2015-04-13 Thread Augustin Borsu
Hello Yi,

You can actually pass the username and password in the url. E.g.

val url = "
jdbc:postgresql://ip.ip.ip.ip/ow-feeder?user=MY_LOGIN&password=MY_PASSWORD"
val query = "(SELECT * FROM \"YadaYada\" WHERE type='item' LIMIT 100) as
MY_DB"

val jdbcDF = sqlContext.load("jdbc", Map( "url" -> url, dbtable" -> query))

On Tue, Apr 14, 2015 at 7:48 AM,  wrote:

> Hi all,
> According to the official document, SparkContext can load datatable to
> dataframe using the DataSources API. However, it just supports the
> following properties:Property NameMeaningurlThe JDBC URL to connect
> to.dbtableThe JDBC table that should be read. Note that anything that is
> valid in a `FROM` clause of a SQL query can be used. For example, instead
> of a full table you could also use a subquery in parentheses.driverThe
> class name of the JDBC driver needed to connect to this URL. This class
> with be loaded on the master and workers before running an JDBC commands to
> allow the driver to register itself with the JDBC
> subsystem.partitionColumn, lowerBound, upperBound, numPartitionsThese
> options must all be specified if any of them is specified. They describe
> how to partition the table when reading in parallel from multiple workers.
> partitionColumn must be a numeric column from the table in question.It lets
> me confused how to pass the username, password or other info? BTW, I am
> connecting to Postgresql like this:val dataFrame =
> sqlContext.load("jdbc", Map(  "url" -> "jdbc:postgresql://
> 192.168.1.110:5432/demo",  //how to pass username and password?
> "driver" -> "org.postgresql.Driver",  "dbtable" -> "schema.tab_users"
>   ))
> Thanks.
> RegardsYi
>
>
>
>


Re: Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Sree V
Hi Sean,
This is not the first time I am hearing this, and I agree with the JIRA suggestion.
In most of the companies I have worked at, a JIRA has no status and no type when it
is created, and we set both in sprint planning meetings.

I am not sure how easy that would be for the Apache JIRA, as any change might
affect every Apache project.
Thanking you.

With Regards
Sree 


 On Monday, April 13, 2015 4:20 PM, Nicholas Chammas 
 wrote:
   

 Wow, I had an open email draft to whine (yet again) about our open PR count
and provide some suggestions.

Will redirect that to the JIRA Sean created. Sweet!

Nick

On Mon, Apr 13, 2015 at 7:05 PM Patrick Wendell  wrote:

> Would just like to encourage everyone who is active in day-to-day
> development to give feedback on this (and I will do same). Sean has
> spent a lot of time looking through different ways we can streamline
> our dev process.
>
> - Patrick
>
> On Mon, Apr 13, 2015 at 3:59 PM, Sean Owen  wrote:
> > Pardon, I wanted to call attention to a JIRA I just created...
> >
> > https://issues.apache.org/jira/browse/SPARK-6889
> >
> > ... in which I propose what I hope are some changes to the
> > contribution process wiki that could help a bit with the flood of
> > reviews and PRs. I'd be grateful for your thoughts and comments there,
> > as it's my current pet issue.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


  

Reply: Re: How to connect JDBC DB based on Spark Sql

2015-04-13 Thread doovsaid
Great! It works. Thanks.

Best,
Yi




- Original Message -
From: Augustin Borsu
To: doovs...@sina.com
Cc: dev
Subject: Re: How to connect JDBC DB based on Spark Sql
Date: 2015-04-14 14:14

Hello Yi,
You can actually pass the username and password in the url. E.g.
val url = "
jdbc:postgresql://ip.ip.ip.ip/ow-feeder?user=MY_LOGIN&password=MY_PASSWORD"
val query = "(SELECT * FROM \"YadaYada\" WHERE type='item' LIMIT 100) as
MY_DB"
val jdbcDF = sqlContext.load("jdbc", Map("url" -> url, "dbtable" -> query))
On Tue, Apr 14, 2015 at 7:48 AM,  wrote:
> Hi all,
> According to the official document, SparkContext can load datatable to
> dataframe using the DataSources API. However, it just supports the
> following properties:Property NameMeaningurlThe JDBC URL to connect
> to.dbtableThe JDBC table that should be read. Note that anything that is
> valid in a `FROM` clause of a SQL query can be used. For example, instead
> of a full table you could also use a subquery in parentheses.driverThe
> class name of the JDBC driver needed to connect to this URL. This class
> with be loaded on the master and workers before running an JDBC commands to
> allow the driver to register itself with the JDBC
> subsystem.partitionColumn, lowerBound, upperBound, numPartitionsThese
> options must all be specified if any of them is specified. They describe
> how to partition the table when reading in parallel from multiple workers.
> partitionColumn must be a numeric column from the table in question.It lets
> me confused how to pass the username, password or other info? BTW, I am
> connecting to Postgresql like this:val dataFrame =
> sqlContext.load("jdbc", Map(  "url" -> "jdbc:postgresql://
> 192.168.1.110:5432/demo",  //how to pass username and password?
> "driver" -> "org.postgresql.Driver",  "dbtable" -> "schema.tab_users"
>   ))
> Thanks.
> RegardsYi
>
>
>
>