Qingyang,

Aha. Got it.

800MB of data is pretty small. Loading from Tachyon does have a bit of extra overhead, but the benefit grows as the data size gets larger. Also, if you store the table in Tachyon, multiple Shark servers can query the data at the same time. For more on the trade-offs, please refer to this page: http://tachyon-project.org/Running-Shark-on-Tachyon.html

Best,

Haoyuan


On Wed, Jul 16, 2014 at 12:06 AM, qingyang li <liqingyang1...@gmail.com> wrote:

> Let me describe my setup:
> ----------------------
> I have 8 machines (24 cores, 16G memory per machine) running a Spark cluster
> and a Tachyon cluster. On Tachyon, I created one table which contains 800M of
> data. When I run a SQL query on Shark, it takes 2.43s, but when I create the
> same table in Spark memory and run the same SQL, it takes 1.56s.
> Data on Tachyon costs more time than data in Spark memory. Both runs have
> 150 map tasks, with 16-20 map tasks per node.
> I think the reason is that when data is on Tachyon, Shark makes each Spark
> slave load the data from the Tachyon slave on the same node.
> I have tried setting some configuration to tune Shark and Tachyon, but still
> cannot get the former faster than 2.43s.
> Does anyone have any ideas?
>
> By the way, my Tachyon block size is 1GB now. I want to reset the block size;
> will it work to set tachyon.user.default.block.size.byte=8M? If
> not, what does tachyon.user.default.block.size.byte mean?
>
>
> 2014-07-14 13:13 GMT+08:00 qingyang li <liqingyang1...@gmail.com>:
>
> > Shark, thanks for replying.
> > Let me clarify my question.
> > ----------------------------------------------
> > I create a table using " create table xxx1
> > tblproperties("shark.cache"="tachyon") as select * from xxx2"
> > When executing some SQL (for example, select * from xxx1) using Shark,
> > Shark will read data into Shark's memory from Tachyon's memory.
> > I think if Shark always loads data from Tachyon each time we execute SQL,
> > it is less efficient.
> > Could we use some cache policy (such as CacheAllPolicy, FIFOCachePolicy, or
> > LRUCachePolicy) to cache data and avoid reading it from Tachyon for
> > each SQL query?
> > ----------------------------------------------
> >
> >
> > 2014-07-14 2:47 GMT+08:00 Haoyuan Li <haoyuan...@gmail.com>:
> >
> >> Qingyang,
> >>
> >> Are you asking about Spark or Shark? (The first email said "Shark", the
> >> last email said "Spark".)
> >>
> >> Best,
> >>
> >> Haoyuan
> >>
> >>
> >> On Wed, Jul 9, 2014 at 7:40 PM, qingyang li <liqingyang1...@gmail.com>
> >> wrote:
> >>
> >> > Could I set some cache policy to let Spark load data from Tachyon only
> >> > once for all SQL queries? For example, by using CacheAllPolicy,
> >> > FIFOCachePolicy, or LRUCachePolicy. I have tried those three policies,
> >> > but they did not help.
> >> > I think that if Spark always loads data for each SQL query, it will hurt
> >> > query speed; it will take more time than when the data is managed by
> >> > Spark itself.
> >> >
> >> >
> >> > 2014-07-09 1:19 GMT+08:00 Haoyuan Li <haoyuan...@gmail.com>:
> >> >
> >> > > Yes. For Shark, the two modes, "shark.cache=tachyon" and
> >> > > "shark.cache=memory", have the same ser/de overhead. Shark loads data
> >> > > from outside of the process in Tachyon mode, with the following
> >> > > benefits:
> >> > >
> >> > >
> >> > >    - In-memory data sharing across multiple Shark instances (i.e.
> >> > >      stronger isolation)
> >> > >    - Instant recovery of in-memory tables
> >> > >    - Reduce heap size => faster GC in Shark
> >> > >    - If the table is larger than the memory size, only the hot columns
> >> > >      will be cached in memory
> >> > >
> >> > > from http://tachyon-project.org/master/Running-Shark-on-Tachyon.html
> >> > > and https://github.com/amplab/shark/wiki/Running-Shark-with-Tachyon
> >> > >
> >> > > Haoyuan
> >> > >
> >> > >
> >> > > On Tue, Jul 8, 2014 at 9:58 AM, Aaron Davidson <ilike...@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Shark's in-memory format is already serialized (it's compressed and
> >> > > > column-based).
> >> > > >
> >> > > >
> >> > > > On Tue, Jul 8, 2014 at 9:50 AM, Mridul Muralidharan <mri...@gmail.com>
> >> > > > wrote:
> >> > > >
> >> > > > > You are ignoring serde costs :-)
> >> > > > >
> >> > > > > - Mridul
> >> > > > >
> >> > > > > On Tue, Jul 8, 2014 at 8:48 PM, Aaron Davidson <ilike...@gmail.com>
> >> > > > > wrote:
> >> > > > > > Tachyon should only be marginally less performant than memory_only,
> >> > > > > > because we mmap the data from Tachyon's ramdisk. We do not have to,
> >> > > > > > say, transfer the data over a pipe from Tachyon; we can directly
> >> > > > > > read from the buffers in the same way that Shark reads from its
> >> > > > > > in-memory columnar format.
> >> > > > > >
> >> > > > > >
> >> > > > > > On Tue, Jul 8, 2014 at 1:18 AM, qingyang li <liqingyang1...@gmail.com>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > >> Hi, when I create a table, I can specify the cache strategy using
> >> > > > > >> shark.cache.
> >> > > > > >> I think "shark.cache=memory_only" means the data is managed by
> >> > > > > >> Spark and lives in the same JVM as the executor, while
> >> > > > > >> "shark.cache=tachyon" means the data is managed by Tachyon, which
> >> > > > > >> is off heap, so the data is not in the same JVM as the executor and
> >> > > > > >> Spark will load it from Tachyon for each SQL query. So, is Tachyon
> >> > > > > >> less efficient than the memory_only cache strategy?
> >> > > > > >> If yes, can we let Spark load all the data once from Tachyon for
> >> > > > > >> all SQL queries, if I want to use the Tachyon cache strategy
> >> > > > > >> because Tachyon is more HA than memory_only?
> >> > > > > >>
> >> > >
> >> > > --
> >> > > Haoyuan Li
> >> > > AMPLab, EECS, UC Berkeley
> >> > > http://www.cs.berkeley.edu/~haoyuan/
> >>
> >> --
> >> Haoyuan Li
> >> AMPLab, EECS, UC Berkeley
> >> http://www.cs.berkeley.edu/~haoyuan/

--
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
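
For a rough sense of the numbers quoted in this thread, here is a small Python sketch that works out the slowdown between the two runs (2.43s vs 1.56s) and what an 8M block size would mean for an 800M table. It is only arithmetic on the figures above, not a benchmark. The helper names are made up for illustration, "800M" and "8M" are read as MiB (an assumption), and the thread does not confirm whether tachyon.user.default.block.size.byte accepts a suffix like "8M" or needs a raw byte count; the sketch just shows the raw byte count.

```python
# Figures quoted in the thread; single runs, so treat them as rough.
TACHYON_SECONDS = 2.43            # query against the shark.cache=tachyon table
MEMORY_SECONDS = 1.56             # same query against the in-memory table
TABLE_BYTES = 800 * 1024 * 1024   # "800M" read as 800 MiB (assumption)
MAP_TASKS = 150                   # map tasks reported for both runs


def overhead(slow, fast):
    """Return (absolute seconds, relative slowdown) of slow versus fast."""
    return slow - fast, (slow - fast) / fast


def mib_to_bytes(mib):
    """Convert mebibytes to a raw byte count."""
    return mib * 1024 * 1024


def blocks_needed(total_bytes, block_bytes):
    """Number of blocks required to hold total_bytes (ceiling division)."""
    return -(-total_bytes // block_bytes)


abs_s, rel = overhead(TACHYON_SECONDS, MEMORY_SECONDS)
print(f"extra time on Tachyon: {abs_s:.2f} s ({rel:.0%} slower)")
print(f"8 MiB as a byte count: {mib_to_bytes(8)}")
print(f"blocks at 8 MiB: {blocks_needed(TABLE_BYTES, mib_to_bytes(8))}")
print(f"blocks at 1 GiB: {blocks_needed(TABLE_BYTES, mib_to_bytes(1024))}")
print(f"data per map task: {TABLE_BYTES / MAP_TASKS / 2**20:.1f} MiB")
```

With a table this small, each map task handles only a few MiB, so fixed per-task costs are likely a large share of both timings; that fits the point above that the Tachyon overhead matters less as the data grows.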