Hi Suhel,

My team is currently working with a lot of SQL Server databases as one of our many data sources, and ultimately we pull the data into HDFS from SQL Server. Since we had a lot of SQL databases to hit, we used the jTDS driver and Sqoop to extract the data out of SQL Server and into HDFS (a small hit against the SQL databases to extract the data out). We did this for two reasons: 1) to minimize the impact on our SQL Servers, since these were transactional databases and we didn't want our analytics queries to interfere with the transactions, and 2) having the data within HDFS allowed us to centralize our relational source data in one location, so we could join / mash it up with other sources of data more easily. Now that the data is there, we just run our Spark queries against it and everything is humming along nicely.
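For concreteness, our queries against the landed data look roughly like this (just a sketch: the path, table, and column names are made up, and I'm assuming here that the extracts were stored as Parquet):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc is an existing SparkContext

    // Load one of the Sqoop-extracted tables from HDFS
    // (hypothetical path, assuming a Parquet copy of the extract).
    val activity = sqlContext.parquetFile("hdfs:///data/sqlserver/activity")
    activity.registerTempTable("activity")

    // Ad-hoc SQL over the centralized copy; joining this with other
    // registered sources is just as easy.
    sqlContext.sql(
      "SELECT user_id, COUNT(*) AS events FROM activity GROUP BY user_id"
    ).show()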
That said, I have not yet had a chance to try the Spark 1.3 JDBC data source. Cheng, to confirm, the reference for JDBC is http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/api/java/org/apache/spark/sql/jdbc/package-tree.html ?

In the past I have not been able to get SQL queries to run against SQL Server without using the jTDS or Microsoft SQL Server JDBC driver, for various reasons (e.g. authentication, T-SQL vs. ANSI SQL differences, etc.). If I needed to utilize an additional driver like jTDS, can I "plug it in" with the JDBC source and/or potentially build something that will work with the Data Sources API? (I've put a rough sketch of what I have in mind below the quoted thread.)

Thanks!
Denny

On Tue Feb 24 2015 at 3:20:57 AM Cheng Lian <lian.cs....@gmail.com> wrote:

> There is a newly introduced JDBC data source in Spark 1.3.0 (not the
> JdbcRDD in Spark core), which may be useful. However, currently there's
> no SQL Server-specific logic implemented. I'd assume standard SQL
> queries should work.
>
> Cheng
>
> On 2/24/15 7:02 PM, Suhel M wrote:
>
> Hey,
>
> I am trying to work out the best way we can leverage Spark for
> crunching data that is sitting in SQL Server databases.
> The ideal scenario is being able to work efficiently with big data
> (10 billion+ rows of activity data). We need to shape this data for
> machine learning problems, and we want to run ad-hoc and complex
> queries and get results in a timely manner.
>
> All our data crunching is currently done via SQL/MDX queries, but these
> obviously take a very long time to run over a data size this large.
> Also, we currently don't have Hadoop or any other distributed storage.
>
> Keen to hear feedback/thoughts/war stories from the Spark community on
> the best way to approach this situation.
>
> Thanks
> Suhel
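P.S. Cheng, regarding plugging in another driver: below is roughly what I was imagining trying with the 1.3 JDBC data source. This is untested on my end; the jTDS URL and table name are invented, and I'm assuming the jTDS jar is on both the driver and executor classpaths.

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Register the jTDS driver so java.sql.DriverManager can locate it.
    Class.forName("net.sourceforge.jtds.jdbc.Driver")

    // Point the generic JDBC data source at SQL Server via jTDS.
    val df = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:jtds:sqlserver://sqlhost:1433/MyDb;user=me;password=pw",
      "dbtable" -> "dbo.Activity"
    ))
    df.registerTempTable("activity")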