First, thanks everyone for the warm welcome into this community!

I also want to thank Ryan/Uwe for the initial pointers and information.

Jacques, I’ll share a little bit about my project planning. At this specific 
point in time, I am focusing on a simple POC of single-stream output. My first 
goal is to ensure a strong, flexible foundation in the SQL Engine that can 
easily adapt to any Flight changes with minimal overhead. From the SQL 
Engine/Extensibility point of view, though, parallel output streams should be 
relatively cheap for us once single stream is complete. Today, in parallelized 
execution of external scripts via sp_execute_external_script, we map every SQL 
query worker thread to an external process, effectively giving us the 
degree of parallelism (DOP) of the query execution (QE) itself. Leveraging QE, 
we can partially control the DOP using query options (MAXDOP), but this is just 
a query hint for the maximum number of workers, so unfortunately, at this time, 
the client does not have complete control over the exact DOP. After output, we 
will start looking at integrating Flight into input, but not much has been 
designed there yet.

Ryan, I would be very happy to discuss standardizing the exchange of metadata, 
connection parameters, etc., as this will be super helpful for me. There are a 
couple of non-SQL-specific options that I’ve thought about as well which could 
provide some potential benefits, such as specifying a subset of output 
endpoints for a specific connection/query; this might allow for some 
client-side load balancing or perf benefits if endpoints and clients are 
co-located. Feel free to add me to any thread, or we can set up a time to call.

Here are some reference links that provide more depth on the SQL specifics 
mentioned:
SQL Server Extensibility Framework Architecture: 
https://docs.microsoft.com/en-us/sql/machine-learning/concepts/extensibility-framework?view=sql-server-ver15
Stored procedure sp_execute_external_script: 
https://docs.microsoft.com/en-us/sql/relational-databases/system-stored-procedures/sp-execute-external-script-transact-sql?view=sql-server-ver15
SQL worker threads: 
https://docs.microsoft.com/en-us/sql/relational-databases/thread-and-task-architecture-guide?view=sql-server-ver15
SQL query hints (ex. MAXDOP): 
https://docs.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-ver15

Thanks,
Brendan Niebruegge

From: Ryan Murray <rym...@dremio.com>
Sent: Friday, May 22, 2020 7:22 AM
To: dev@arrow.apache.org
Cc: Brendan Niebruegge <brn...@microsoft.com>; Jasraj Dange 
<jasr...@microsoft.com>; James Rowland-Jones <j...@microsoft.com>; Jarupat 
Jisarojito <jarupat.jisaroj...@microsoft.com>
Subject: [EXTERNAL] Re: Arrow Flight connector for SQL Server

Hey Brendan,

As Jacques promised, here are a few things to act as pointers for your work on 
Flight:
- Our early-release Flight connector[1], which fully supports single Flight 
streams and partially supports parallel streams
- A Spark DataSourceV2 client of mine which may be of interest to you[2]

Both projects make use of the 'doAction' part of the Flight API spec[3] to 
negotiate parallel vs. single stream, among other things. However, this is done 
in an ad-hoc manner, and finding a way to standardise this exchange of 
metadata, catalog info, connection parameters, etc. is, for me, an important 
next step toward a Flight-based protocol that is equivalent to ODBC/JDBC. I 
would be happy to discuss further if you have any thoughts on the topic.

Best,
Ryan

[1] https://github.com/dremio-hub/dremio-flight-connector
[2] https://github.com/rymurr/flight-spark-source
[3] https://github.com/apache/arrow/blob/master/format/Flight.proto


-----Original Message-----
From: Jacques Nadeau <jacq...@apache.org>
Sent: Tuesday, May 19, 2020 7:05 PM
To: dev <dev@arrow.apache.org>
Subject: [EXTERNAL] Re: Arrow Flight connector for SQL Server



Hey Brendan,

Welcome to the community. At Dremio we've exposed Flight as an input and 
output for SQL result datasets. I'll have one of our guys share some details. 
A couple of questions we've been struggling with include how to standardize 
additional metadata operations, what the prepare behavior should be, and 
whether there is a way to standardize exposure of a Flight path as an 
extension of both JDBC and ODBC.

Can you share more about whether you're initially more focused on input or 
output, and on parallel or single stream?

Thanks and welcome,
Jacques

On Thu, May 21, 2020 at 3:08 PM Uwe L. Korn <uw...@xhochy.com> wrote:
Hello Brendan,

welcome to the community. In addition to the folks at Dremio, I wanted to make 
you aware of the Python ODBC client library turbodbc 
(https://github.com/blue-yonder/turbodbc), which provides a high-performance 
ODBC<->Arrow adapter. It is especially popular with MS SQL Server users as the 
fastest known way to retrieve query results as DataFrames in Python from SQL 
Server, considerably faster than pandas.read_sql or using pyodbc directly.

While it is the fastest known option, I can tell you that there is still a lot 
of CPU time spent in the ODBC driver "transforming" results so that they match 
the ODBC interface. Here at least, one could possibly get much better 
performance when retrieving large columnar results from SQL Server by going 
through Arrow Flight as an interface instead of being constrained to the less 
efficient ODBC for this use case. Currently there is a 50x performance 
difference between reading the data from a Parquet file and reading the same 
data from a table in SQL Server (a simple SELECT, no filtering or such). Since 
the client CPU is at 100% for nearly the full retrieval time, using a more 
efficient protocol for data transfer could roughly translate into a 10x speedup.

Best,
Uwe

On Wed, May 20, 2020, at 12:16 AM, Brendan Niebruegge wrote:
> Hi everyone,
>
> I wanted to informally introduce myself. My name is Brendan Niebruegge,
> I'm a Software Engineer in our SQL Server extensibility team here at
> Microsoft. I am leading an effort to explore how we could integrate
> Arrow Flight with SQL Server. We think this could be a very interesting
> integration that would both benefit SQL Server and the Arrow community.
> We are very early in our thoughts so I thought it best to reach out
> here and see if you had any thoughts or suggestions for me. What would
> be the best way to socialize my thoughts to date? I am keen to learn
> and deepen my knowledge of Arrow as well so please let me know how I
> can be of help to the community.
>
> Please feel free to reach out anytime (email: brn...@microsoft.com)
>
> Thanks,
> Brendan Niebruegge
>
>
