I’m actually planning on using the in-memory database MemSQL. Creating a file and then ingesting it seems like we’re back to batch processing. I know the definition of real time varies, and any improvement over 24 hours is a good thing, but I’d like to get as close to the actual event happening as possible.
I’ve been studying Storm, Samza, and Spark Streaming. The literature says that Storm is good for ETL, but I’ve also read that the Trident abstraction has a large negative impact on throughput. MemSQL boasts rapid ingestion, so back to my original question: is the method for loading data really just a run-of-the-mill INSERT statement? No other magic than that?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Palmer, Cliff A. (NE)
Sent: Saturday, March 07, 2015 10:44 AM
To: [email protected]
Subject: RE: real time warehouse loads

Bob, if "real time" means "up to a few minutes is acceptable" then I'd recommend you use Storm to do any pre-load processing and write the result to a text/CSV/etc. file in a directory. Then use a separate utility (most databases have something that does this) to load data from the files you create into the database.

This sounds slower, but remember that establishing a connection to a database to run a SQL INSERT has noticeable latency. It's also true that each connection (usually) takes a port/socket and memory, and is often a separate OS task, so you are consuming resources that you would probably want Storm using.

There are other solutions for something closer to real time, but they require an in-memory database or "fun with caching", which will require specialized expertise.

HTH

--------------------------------------------------------------------------------
From: Adaryl "Bob" Wakefield, MBA [[email protected]]
Sent: Friday, March 06, 2015 7:54 PM
To: [email protected]
Subject: real time warehouse loads

I’m looking at Storm as a method to load data warehouses in real time. I am not that familiar with Java. I’m curious about the actual mechanism to load records into tables. Is it just a matter of feeding the final result of processing into an INSERT INTO SQL statement, or is it more complicated than that?
It seems to me that hammering the database with SQL statements of real-time data is a bit inefficient.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData
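
A note on the INSERT question raised in this thread: one common middle ground between issuing a separate INSERT per event and writing files for a bulk loader is micro-batching, where rows are buffered briefly and written as one multi-row statement over a single long-lived connection. That avoids both the per-connection latency and the per-statement overhead mentioned above. The sketch below is only an illustration under stated assumptions, not something any poster here described. It uses Python's built-in sqlite3 module purely as a runnable stand-in; MemSQL speaks the MySQL wire protocol, so in practice a MySQL client library would take its place and the placeholder syntax may differ by driver. The events table, the MicroBatchLoader class, and the batch size are all hypothetical.

```python
import sqlite3


class MicroBatchLoader:
    """Hypothetical helper: buffers rows and writes them in batches
    over one long-lived connection instead of one INSERT per event."""

    def __init__(self, conn, batch_size=500):
        self.conn = conn              # opened once and reused for every batch
        self.batch_size = batch_size  # assumed value; tune for latency vs. throughput
        self.buffer = []
        self.flushes = 0              # number of batches actually written

    def add(self, row):
        """Queue one row; write the batch once the buffer is full."""
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all buffered rows in a single transaction."""
        if not self.buffer:
            return
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO events (event_id, payload) VALUES (?, ?)",
                self.buffer,
            )
        self.buffer.clear()
        self.flushes += 1


def demo():
    # In-memory SQLite stands in for the real warehouse (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (event_id INTEGER PRIMARY KEY, payload TEXT)"
    )
    loader = MicroBatchLoader(conn, batch_size=100)
    for i in range(250):
        loader.add((i, f"event-{i}"))
    loader.flush()  # write whatever is left in the buffer
    count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
    return count, loader.flushes
```

In a streaming topology, a time-based flush (for example, every second) would normally be added alongside the size-based one so that rows do not sit in the buffer during quiet periods; that part is omitted here to keep the sketch short.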
