Why not sync the MySQL binlog (assuming the data is immutable and the table is append-only), ship the log through Kafka, and then consume it with Spark Streaming?
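A minimal sketch of the consuming side, assuming the binlog rows are published as JSON change events to a hypothetical topic named mysql-binlog (the topic name, brokers, and schema below are all assumptions, not anything from this thread). The binlog-to-Kafka half would be handled by an external CDC tool, not by Spark itself.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._

    // Requires the spark-sql-kafka-0-10 artifact on the classpath.
    object BinlogConsumer {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("binlog-consumer").getOrCreate()
        import spark.implicits._

        // Hypothetical schema for the change events published to Kafka.
        val changeSchema = new StructType()
          .add("id", LongType)
          .add("event_time", TimestampType)
          .add("payload", StringType)

        // Read the binlog-derived topic as an unbounded streaming DataFrame.
        val events = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "mysql-binlog")
          .load()
          .select(from_json($"value".cast("string"), changeSchema).as("event"))
          .select("event.*")

        // Event-time windowing, joins, etc. would go here; for the sketch,
        // just print each appended micro-batch to the console.
        events.writeStream
          .format("console")
          .outputMode("append")
          .start()
          .awaitTermination()
      }
    }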
On Fri, Dec 30, 2016 at 9:01 AM, Michael Armbrust <mich...@databricks.com> wrote:
> We don't support this yet, but I've opened a JIRA since it sounds generally
> useful: https://issues.apache.org/jira/browse/SPARK-19031
>
> In the meantime you could try implementing your own Source, but that is
> pretty low level and not yet a stable API.
>
> On Thu, Dec 29, 2016 at 4:05 AM, "Yuanzhe Yang (杨远哲)" <yyz1...@gmail.com> wrote:
>
>> Hi all,
>>
>> Thanks a lot for your contributions in bringing us new technologies.
>>
>> I don't want to waste your time, so before writing I googled and checked
>> Stack Overflow and the mailing list archive with the keywords "streaming"
>> and "jdbc", but I could not find any solution to my use case. I hope I can
>> get some clarification from you.
>>
>> The use case is quite straightforward: I need to harvest a relational
>> database via JDBC, do something with the data, and store the result in
>> Kafka. I am stuck at the first step, and the difficulty is as follows:
>>
>> 1. The database is too large to ingest with one thread.
>> 2. The database is dynamic, and time-series data comes in constantly.
>>
>> An ideal workflow would have multiple workers process partitions of data
>> incrementally according to a time window. For example, processing starts
>> from the earliest data, with each batch containing one hour of data. If
>> the ingestion speed is faster than the data production speed, eventually
>> the entire database will be harvested, and the workers will start to
>> "tail" the database for new data, at which point the processing becomes
>> real time.
>>
>> With Spark SQL I can ingest data from a JDBC source with partitions
>> divided by time windows, but how can I increment the time windows
>> dynamically during execution? Assume two workers are ingesting data for
>> 2017-01-01 and 2017-01-02; the one that finishes first should get the next
>> task, for 2017-01-03. I have not been able to find a way to increment
>> those values while the job is running.
>>
>> Then I looked into Structured Streaming. It looks much more promising,
>> because window operations based on event time are supported, which could
>> be the solution to my use case. However, in the documentation and code
>> examples I did not find anything about streaming data from a growing
>> database. Is there anything I can read to achieve my goal?
>>
>> Any suggestion is highly appreciated. Thank you very much and have a nice
>> day.
>>
>> Best regards,
>> Yang
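For reference, the JDBC partitioning Yang describes maps to the predicates variant of spark.read.jdbc, where each predicate string becomes one partition. A minimal sketch of that batch pattern (the connection URL, the table name events, and the column event_time are hypothetical):

    import java.util.Properties
    import org.apache.spark.sql.SparkSession

    object JdbcWindows {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("jdbc-windows").getOrCreate()

        val url = "jdbc:mysql://localhost:3306/mydb" // hypothetical URL
        val props = new Properties()
        props.setProperty("user", "reader")
        props.setProperty("password", "secret")

        // One predicate per partition: each Spark task reads one day's window.
        // The list is fixed when the read is planned; it cannot grow while the
        // job runs, which is exactly the limitation raised in the thread.
        val predicates = Array(
          "event_time >= '2017-01-01' AND event_time < '2017-01-02'",
          "event_time >= '2017-01-02' AND event_time < '2017-01-03'"
        )

        val df = spark.read.jdbc(url, "events", predicates, props)
        println(s"Rows harvested: ${df.count()}")
      }
    }

A driver program could rerun this read in a loop with advancing windows, but that is scheduling outside Spark rather than a true streaming source, which is why the thread points toward a custom Source or the binlog-via-Kafka route instead.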