Re: [I] Bigquery does not work cleanly with data/job project separation [superset]

via GitHub Fri, 21 Mar 2025 13:06:33 -0700


withnale commented on issue #32789:
URL: https://github.com/apache/superset/issues/32789#issuecomment-2742999053


   Superset's native code makes no provision for separating `project_id` 
usage; it relies mainly on the standard SQLAlchemy `create_engine` calls. 
However, `DB_CONNECTION_MUTATOR` is available, and it seems to run whenever a 
DB connection is created.
   
   Rather than just passing in parameters, it seems possible to use the 
[supplying-your-own-bigquery-client](https://github.com/googleapis/python-bigquery-sqlalchemy?tab=readme-ov-file#supplying-your-own-bigquery-client) 
logic present in python-bigquery-sqlalchemy and make a default project 
decision based on context:
    
   ```python
   # shortened
   from google.cloud import bigquery
   from google.oauth2 import service_account

   def DB_CONNECTION_MUTATOR(uri, params, _username, _security_manager, source):
       credentials_info = params.get('credentials_info', None)
       credentials = service_account.Credentials.from_service_account_info(credentials_info)
       project = 'some magic occurs here'
       # Hand a pre-built client to the dialect in place of its default one.
       client = bigquery.Client(credentials=credentials, project=project)
       params['connect_args'] = {'client': client}
       return uri.update_query_dict({"user_supplied_client": "True"}), params
   ```
   
   Obviously, the key part here is the 'magic': the correct project has to be 
chosen from whatever context `DB_CONNECTION_MUTATOR` has available to it. The 
only real context is the `source` field, and I've done some experimenting with 
setting the project based on that:
   
   ```python
   if source is None:
       project = DATASET_PROJECT
   elif source.name in ['CHART', 'SQL_LAB']:
       project = JOB_PROJECT
   else:
       logger.error(f"DB_CONNECTION_MUTATOR: Unknown source: {source}")
   ```
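
As a self-contained sketch of the same decision, the branch above could be wrapped in a helper that always returns a project. The `QuerySource` enum here is only a stand-in for whatever Superset passes as `source`, the project IDs are hypothetical, and falling back to the dataset project for an unknown source is an assumption rather than something settled in this thread:

```python
import logging
from enum import Enum
from typing import Optional

logger = logging.getLogger(__name__)

# Hypothetical project IDs, for illustration only.
DATASET_PROJECT = "my-data-project"
JOB_PROJECT = "my-job-project"


class QuerySource(Enum):
    """Stand-in for the enum passed as `source` (assumed shape)."""
    CHART = 0
    DASHBOARD = 1
    SQL_LAB = 2


def choose_project(source: Optional[Enum]) -> str:
    """Pick a default project from the mutator's `source` argument."""
    if source is None:
        return DATASET_PROJECT
    if source.name in ("CHART", "SQL_LAB"):
        return JOB_PROJECT
    # Unknown source: log it and fall back (assumed policy) so that
    # `project` is never left unset.
    logger.error("DB_CONNECTION_MUTATOR: unknown source: %s", source)
    return DATASET_PROJECT
```

Inside the mutator this would replace the placeholder with `project = choose_project(source)`. It does not address the `source=None` ambiguity or the pooling concern; it only removes the unset-variable path.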
   
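   This seems too brittle: in many use cases `source` is `None` at the point 
where a decision needs to be made. More fundamentally, I don't think this 
approach is robust with respect to connection reuse and pooling, since a client 
bound to one project may be reused for a later call that needs the other.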
   
   I think it's probably better to fix the "destination decision making" part 
of this issue upstream in the python-bigquery-sqlalchemy repository, since at 
present the logic there doesn't specify a project explicitly in its BigQuery 
client calls (such as the one below). Making those calls explicit seems by far 
the cleanest solution, and it would make the logic available to any product 
built on SQLAlchemy, not just Superset.
   
   ```diff
   # python-bigquery-sqlalchemy/sqlalchemy_bigquery/core.py:1335
   -   datasets = connection.connection._client.list_datasets()
   +   datasets = connection.connection._client.list_datasets(self.project_id)
   ```
   
   
   



