It depends on several things:
1) What is your data format: CSV (text) or ORC/Parquet?
2) Do you have a data warehouse to summarize/cluster your data?

If your data is text, or you query the raw data directly, it will be slow; Spark cannot do much to optimize that kind of job.
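For example, if your data starts out as text, converting it to Parquet once and querying the columnar copy is usually much faster than scanning raw CSV on every query. A rough sketch (written for Spark 1.5 with the spark-csv package on the classpath; paths and column names are placeholders, not your actual schema):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext, e.g. from spark-shell

    // one-time conversion: read the raw CSV and store it as Parquet
    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/data/products.csv")
    raw.write.parquet("/data/products.parquet")

    // later queries scan only the columns they need from the Parquet copy
    val products = sqlContext.read.parquet("/data/products.parquet")
    products.registerTempTable("products")
    sqlContext.sql(
      "SELECT productFamily, SUM(cost) AS totalCost FROM products GROUP BY productFamily").show()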
> On Dec 2, 2015, at 9:21 AM, Andrés Ivaldi <[email protected]> wrote:
>
> Mark, we have an application that uses data from different kinds of sources, and we built an engine able to handle that, but it can't scale to big data (we could make it scale, but that is too time-expensive), and it doesn't have a machine learning module, etc. We came across Spark, and it looks like it has everything we need; actually, it does. But the latency we need is very low, and when we ran some tests, Spark took too long for the same kind of results, always against an RDBMS, which is our primary source.
>
> So we want to expand our sources to CSV, web services, big data, etc. We can either extend our engine or use something like Spark, which gives us the power of clustering, access to different kinds of sources, streaming, machine learning, easy extensibility, and so on.
>
> On Tue, Dec 1, 2015 at 9:36 PM, Mark Hamstra <[email protected]> wrote:
> I'd ask another question first: if your SQL query can be executed in a performant fashion against a conventional (RDBMS?) database, why are you trying to use Spark? How you answer that question will be the key to deciding among the engineering design tradeoffs needed to use Spark or some other solution effectively.
>
> On Tue, Dec 1, 2015 at 4:23 PM, Andrés Ivaldi <[email protected]> wrote:
> OK, so the latency problem arises because I'm using SQL as the source? What about CSV, Hive, or another source?
>
> On Tue, Dec 1, 2015 at 9:18 PM, Mark Hamstra <[email protected]> wrote:
>
> > It is not designed for interactive queries.
>
> You might want to ask the designers of Spark, Spark SQL, and particularly some things built on top of Spark (such as BlinkDB) about their intent with regard to interactive queries. Interactive queries are not the only designed use of Spark, but it is going too far to claim that Spark is not designed at all to handle interactive queries.
>
> That being said, I think you are correct to question the wisdom of expecting the lowest-latency query response from Spark using SQL (sic, presumably an RDBMS is intended) as the datastore.
>
> On Tue, Dec 1, 2015 at 4:05 PM, Jörn Franke <[email protected]> wrote:
> Hmm, it will never be faster than SQL if you use SQL as the underlying storage. Spark is (currently) an in-memory batch engine for iterative machine learning workloads. It is not designed for interactive queries. Currently, Hive is moving in the direction of interactive queries. Alternatives are Phoenix on HBase, or Impala.
>
> On 01 Dec 2015, at 21:58, Andrés Ivaldi <[email protected]> wrote:
>
>> Yes,
>> the use case would be:
>> run Spark in a service (I didn't investigate this yet); through API calls to this service we perform some aggregations over data in SQL. We are already doing this with an internal development.
>>
>> Nothing complicated. For instance, a table with Product, Product Family, cost, price, etc.,
>> with columns acting as dimensions and measures.
>>
>> I want Spark to query that table and perform a kind of rollup, with cost as the measure and Product, Product Family as the dimensions.
>>
>> Only 3 columns, yet it takes about 20s to perform that query and the aggregation, while the same query run directly against the database, with a GROUP BY on those columns, takes about 1s.
>>
>> regards
>>
>> On Tue, Dec 1, 2015 at 5:38 PM, Jörn Franke <[email protected]> wrote:
>> Can you elaborate more on the use case?
>>
>> > On 01 Dec 2015, at 20:51, Andrés Ivaldi <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I'd like to use Spark to perform some transformations over data stored in SQL, but I need low latency. I'm doing some tests, and I've run into Spark context creation and the data query over SQL taking too long.
>> >
>> > Any ideas for speeding up the process?
>> >
>> > regards.
>> >
>> > --
>> > Ing. Ivaldi Andres
>>
>>
>> --
>> Ing. Ivaldi Andres
>
>
> --
> Ing. Ivaldi Andres
>
>
> --
> Ing. Ivaldi Andres
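
PS: for the rollup you describe, the DataFrame API has had a rollup operator since Spark 1.4, so you do not have to hand-write the grouping sets. A minimal sketch, assuming the table is already loaded as a DataFrame named products and that the column names below match your schema (they are placeholders):

    import org.apache.spark.sql.functions.sum

    // rollup over Product Family and Product, with cost as the measure;
    // this also produces the per-family subtotals and the grand total
    val result = products
      .rollup("productFamily", "product")
      .agg(sum("cost").as("totalCost"))
    result.show()

Note that the 20s you measured most likely includes SparkContext creation and pulling the whole table over JDBC; the aggregation itself is cheap.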

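Also, if the RDBMS stays your primary source, you can push the aggregation down to the database instead of pulling raw rows into Spark, by handing the JDBC source a subquery as its table. Again only a sketch; the URL, driver, and table/column names are placeholders:

    val agg = sqlContext.read.format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")  // placeholder URL
      .option("driver", "org.postgresql.Driver")
      .option("dbtable",
        "(SELECT product_family, product, SUM(cost) AS total_cost " +
        "FROM products GROUP BY product_family, product) AS t")
      .load()
    agg.show()

That way the database does the GROUP BY it already answers in about 1s, and Spark only receives the aggregated rows.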