Hi Suhel,

My team is currently working with a lot of SQL Server databases as one of our many data sources, and ultimately we pull the data into HDFS from SQL Server. Since we had a lot of SQL databases to hit, we used the jTDS driver and Sqoop to extract the data out of SQL Server and into HDFS (a small hit against the SQL databases to extract the data out). We did this for two reasons: 1) to minimize the impact on our SQL Servers, since these were transactional databases and we didn't want our analytics queries to interfere with the transactions, and 2) having the data within HDFS allowed us to centralize our relational source data in one location, so we could join / mash it up with other sources of data more easily. Now that the data is there, we just run our Spark queries against it and everything is humming along nicely.
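For concreteness, our queries against the landed data look roughly like this (just a sketch: the path, table, and column names are made up, and I'm assuming here that the extracts were stored as Parquet):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc is an existing SparkContext

    // Load one of the Sqoop-extracted tables from HDFS
    // (hypothetical path, assuming a Parquet copy of the extract).
    val activity = sqlContext.parquetFile("hdfs:///data/sqlserver/activity")
    activity.registerTempTable("activity")

    // Ad-hoc SQL over the centralized copy; joining this with other
    // registered sources is just as easy.
    sqlContext.sql(
      "SELECT user_id, COUNT(*) AS events FROM activity GROUP BY user_id"
    ).show()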
That said, I have not yet had a chance to try the Spark 1.3 JDBC data source. Cheng, to confirm, the reference for JDBC is http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/api/java/org/apache/spark/sql/jdbc/package-tree.html ?

In the past I have not been able to get SQL queries to run against SQL Server without using the jTDS or Microsoft SQL Server JDBC driver, for various reasons (e.g. authentication, T-SQL vs. ANSI SQL differences, etc.). If I needed to utilize an additional driver like jTDS, can I "plug it in" with the JDBC source and/or potentially build something that will work with the Data Sources API? (I've put a rough sketch of what I have in mind below the quoted thread.)

Thanks!
Denny

On Tue Feb 24 2015 at 3:20:57 AM Cheng Lian <lian.cs....@gmail.com> wrote:

> There is a newly introduced JDBC data source in Spark 1.3.0 (not the
> JdbcRDD in Spark core), which may be useful. However, currently there's
> no SQL Server-specific logic implemented. I'd assume standard SQL
> queries should work.
>
> Cheng
>
> On 2/24/15 7:02 PM, Suhel M wrote:
>
> Hey,
>
> I am trying to work out the best way we can leverage Spark for
> crunching data that is sitting in SQL Server databases.
> The ideal scenario is being able to work efficiently with big data
> (10 billion+ rows of activity data). We need to shape this data for
> machine learning problems, and we want to run ad-hoc and complex
> queries and get results in a timely manner.
>
> All our data crunching is currently done via SQL/MDX queries, but these
> obviously take a very long time to run over a data size this large.
> Also, we currently don't have Hadoop or any other distributed storage.
>
> Keen to hear feedback/thoughts/war stories from the Spark community on
> the best way to approach this situation.
>
> Thanks
> Suhel
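P.S. Cheng, regarding plugging in another driver: below is roughly what I was imagining trying with the 1.3 JDBC data source. This is untested on my end; the jTDS URL and table name are invented, and I'm assuming the jTDS jar is on both the driver and executor classpaths.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Register the jTDS driver so java.sql.DriverManager can locate it.
    Class.forName("net.sourceforge.jtds.jdbc.Driver")

    // Point the generic JDBC data source at SQL Server via jTDS.
    val df = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:jtds:sqlserver://sqlhost:1433/MyDb;user=me;password=pw",
      "dbtable" -> "dbo.Activity"
    ))
    df.registerTempTable("activity")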