Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Eugene Kirpichov
...Final note: performance when executing queries "limit A, B" and "limit C, D" in sequence may be completely different from performance when executing them in parallel. In particular, if they are run in parallel, most likely a lot less caching will happen. Make sure your benchmarks account for this too

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Eugene Kirpichov
Most likely the identical performance you observed for the "limit" clause is because you are not sorting the rows. Without sorting, a "limit" query is meaningless: the database is technically allowed to return exactly the same result for "limit 0, 10" and "limit 10, 20", because both of these queries are
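
To make the point about sorting concrete, here is a minimal sketch of paged reads that are only well-defined because of the ORDER BY on a unique key. It uses Python's built-in sqlite3 as a stand-in for Aurora/MySQL; the table name `items`, its columns, and the helpers `make_demo_db` and `read_pages` are hypothetical and exist only for illustration.

```python
import sqlite3


def make_demo_db(n):
    """Create an in-memory table with ids 1..n (hypothetical demo schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?)",
        [(i, f"row-{i}") for i in range(1, n + 1)],
    )
    return conn


def read_pages(conn, page_size):
    """Read the whole table page by page.

    ORDER BY on a unique key is what makes consecutive pages disjoint;
    without it the database may return any rows for any page.
    """
    pages = []
    offset = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM items ORDER BY id LIMIT ? OFFSET ?",
            (page_size, offset),
        ).fetchall()
        if not rows:
            break
        pages.append(rows)
        offset += page_size
    return pages


if __name__ == "__main__":
    conn = make_demo_db(25)
    for page in read_pages(conn, 10):
        print(page[0][0], "...", page[-1][0])
```

Note that the sorted query is correct but not necessarily cheap: the database may still have to walk past all skipped rows to reach a large offset.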

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Eugene Kirpichov
Thanks Madhusudan. Please note that in your case, likely, the time was dominated by shipping the rows over the network, rather than executing the query. Please make sure to include benchmarks where the query itself is expensive to evaluate (e.g. "select count(*) from query" takes time comparable to
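
One rough way to check what dominates the time, sketched under assumptions: time a `count(*)` wrapper around the query (mostly evaluation cost) separately from fetching every row (evaluation plus transfer). The helper name `time_count_vs_fetch` is hypothetical, and sqlite3 is only a local stand-in, so the absolute numbers say nothing about Aurora.

```python
import sqlite3
import time


def time_count_vs_fetch(conn, query):
    """Return (count_seconds, fetch_seconds, row_count) for `query`.

    count_seconds approximates the cost of evaluating the query alone;
    fetch_seconds also includes shipping every row to the client.
    """
    start = time.perf_counter()
    (row_count,) = conn.execute(
        f"SELECT count(*) FROM ({query}) AS q"
    ).fetchone()
    count_seconds = time.perf_counter() - start

    start = time.perf_counter()
    conn.execute(query).fetchall()
    fetch_seconds = time.perf_counter() - start

    return count_seconds, fetch_seconds, row_count


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO items VALUES (?)", [(i,) for i in range(1, 101)])
    print(time_count_vs_fetch(conn, "SELECT id FROM items WHERE id % 2 = 0"))
```

If the two timings are close, the query itself is the expensive part; if fetching is far slower, the benchmark is mostly measuring row transfer.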

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Madhusudan Borkar
Hi, Appreciate your questions. One thing I believe is that AWS Aurora, even though it is based on MySQL, is not MySQL. The reason is that AWS has developed this database service, RDS, from the ground up and has improved or completely changed its implementation. That being said, some of the things that one may have experi

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-13 Thread Sourabh Bajaj
+1 for S3 being more of a FS. @Madhusudan, can you point to some documentation on how to do row-range queries in Aurora? From a quick scan, it follows the MySQL 5.6 syntax, so you will still need an ORDER BY for the IO to do exactly-once reads. So wanted to learn more about how the questions raised

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-12 Thread Jean-Baptiste Onofré
Hi, I think it's a mix of filesystem and IO. For S3, I see it more as a Beam filesystem than a pure IO. WDYT? Regards JB On 06/13/2017 02:43 AM, tarush grover wrote: Hi All, I think this can be added under java --> io --> aws-cloud-platform with more io connectors can be added into it eg. S3 al

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-12 Thread tarush grover
Hi All, I think this can be added under java --> io --> aws-cloud-platform, so that more IO connectors (e.g. S3) can be added to it as well. Regards, Tarush On Mon, Jun 12, 2017 at 4:03 AM, Madhusudan Borkar wrote: > Yes, I believe so. Thanks for the Jira. > > Madhu Borkar > > On Sat, Jun 10, 2017 at

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-11 Thread Madhusudan Borkar
Yes, I believe so. Thanks for the Jira. Madhu Borkar On Sat, Jun 10, 2017 at 10:36 PM, Jean-Baptiste Onofré wrote: > Hi, > > I created a Jira to add custom splitting to JdbcIO (but it's not so > trivial depending of the backends. > > Regarding your proposal it sounds interesting, but do you thi

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-10 Thread Eugene Kirpichov
To elaborate a bit on what JB said: Suppose the table has 1,000,000 rows, and suppose you split it into 1000 bundles, 1000 rows per bundle. Does Aurora provide an API that allows efficiently reading the bundle containing rows 999,000-1,000,000, one that does not involve reading and throwing away the
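
For illustration only, one common way to avoid reading and discarding earlier rows is to split on ranges of an indexed key rather than on row offsets. The sketch below assumes a hypothetical table `items` with a dense integer primary key `id`; it uses sqlite3 as a stand-in and is not a claim about what Aurora or JdbcIO provides. The helpers `split_key_range` and `read_bundle` are made-up names.

```python
import sqlite3


def split_key_range(min_id, max_id, num_bundles):
    """Split the inclusive key range [min_id, max_id] into half-open [lo, hi) ranges."""
    if max_id < min_id or num_bundles < 1:
        return []
    total = max_id - min_id + 1
    size = -(-total // num_bundles)  # ceiling division
    return [
        (lo, min(lo + size, max_id + 1))
        for lo in range(min_id, max_id + 1, size)
    ]


def read_bundle(conn, lo, hi):
    """Read one bundle with a range predicate on the indexed key.

    The database can seek directly to `lo` through the index instead of
    scanning and throwing away all preceding rows.
    """
    return conn.execute(
        "SELECT id FROM items WHERE id >= ? AND id < ? ORDER BY id",
        (lo, hi),
    ).fetchall()


if __name__ == "__main__":
    ranges = split_key_range(1, 1_000_000, 1000)
    print(len(ranges), ranges[0], ranges[-1])
```

If the keys are sparse or skewed, equal-width ranges give uneven bundles, so a real splitter would likely need to sample the key distribution first.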

Re: [PROPOSAL] for AWS Aurora relational database connector

2017-06-10 Thread Jean-Baptiste Onofré
Hi, I created a Jira to add custom splitting to JdbcIO (but it's not so trivial, depending on the backends). Regarding your proposal, it sounds interesting, but do you think we will really have a "parallel" read of the splits? I think splitting makes sense if we can do parallel read: if we split

[PROPOSAL] for AWS Aurora relational database connector

2017-06-10 Thread Madhusudan Borkar
Hi, We are proposing to develop a connector for AWS Aurora. Aurora, being a cluster for a relational database (MySQL), has no Java API for reading/writing other than the JDBC client. Although there is a JdbcIO available, it looks like it doesn't work in parallel. The proposal is to provide split functionality