Re: reading ORC format on Spark-SQL

2016-02-11 Thread Philip Lee
shows flat scaling. Is that because it is not over capacity yet? But loading a CSV file is not that big a job, I would guess. Could you correct me? Thanks in advance. Best, Phil On Wed, Feb 10, 2016 at 11:17 PM, Philip Lee wrote: > Thanks for your reply! > > According to you, because of its natur

Re: reading ORC format on Spark-SQL

2016-02-10 Thread Philip Lee
Thanks for your reply! According to you, because of the nature of ORC, it cannot be split by the default chunk size, since it is not composed of lines like CSV. Until you run out of capacity, a distributed system *has* to show sub-linear scaling - and will show flat scaling up to a particu

reading ORC format on Spark-SQL

2016-02-10 Thread Philip Lee
What steps are involved when reading ORC format on Spark-SQL? I mean, reading a CSV file usually just reads the dataset directly into memory. But I feel like Spark-SQL has some extra steps when reading ORC format. For example, does it have to create a table to insert the dataset into? and then they insert the

Re: Hive on Spark Engine versus Spark using Hive metastore

2016-02-02 Thread Philip Lee
From my experience, Spark SQL has its own optimizer to support Hive queries and the metastore. After Spark 1.5.2, its optimizer is named Catalyst. On Feb 3, 2016, at 12:12 AM, "Xuefu Zhang" wrote: > I think the diff is not only about which does optimization but more on > feature parity. Hive on Spark offers a

Re: ORC format

2016-02-02 Thread Philip Lee
; > > > ORC does not currently expose a primary key to the user, though we have > talked of having it do that. As Mich says the indexing on ORC is oriented > towards statistics that help the optimizer plan the query. This can be > very important in split generation (determining
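To illustrate the point above about ORC indexing being oriented towards statistics, here is a minimal sketch of how min/max column statistics can let a reader skip whole stripes when filtering. It is not ORC's actual reader code: `StripeStats` and `stripes_to_read` are hypothetical names made up for this example, and real ORC files keep such statistics at file, stripe, and row-group level.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StripeStats:
    """Hypothetical stand-in for the min/max statistics kept for one column of one stripe."""
    stripe_id: int
    min_value: int
    max_value: int


def stripes_to_read(stats: List[StripeStats], lower: int, upper: int) -> List[int]:
    """Return the ids of stripes whose [min, max] range overlaps the filter range [lower, upper]."""
    return [
        s.stripe_id
        for s in stats
        if s.max_value >= lower and s.min_value <= upper
    ]


# Three stripes of a column with non-overlapping value ranges (illustrative data only).
stats = [
    StripeStats(0, 1, 100),
    StripeStats(1, 101, 200),
    StripeStats(2, 201, 300),
]

# A filter such as "WHERE col BETWEEN 150 AND 180" only needs stripe 1.
selected = stripes_to_read(stats, 150, 180)
print(selected)
```

Note that this is a range check, not a per-key index: it only helps when the filter column's values are reasonably clustered within stripes.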

Re: ORC format

2016-02-01 Thread Philip Lee
r endorsed by Peridale Technology > Ltd, its subsidiaries or their employees, unless expressly so stated. It is > the responsibility of the recipient to ensure that this email is virus > free, therefore neither Peridale Technology Ltd, its subsidiaries nor their > employees accept any res

Re: ORC format

2016-02-01 Thread Philip Lee
Also, when creating ORC from CSV, is an index key made for every column, or is a primary key made for the table? If keys are made on each column in a table, accessing any column in operations like filtering should be faster. On Mon, Feb 1, 2016 at 4:21 PM, Philip Lee wrote: > Hello, >

ORC format

2016-02-01 Thread Philip Lee
Hello, I am comparing the performance of some systems on ORC and CSV files. I read the ORC documentation on the Hive website, but I am still curious about some things. I know the ORC format is faster for filtering or reading because it has indexing. Does it also have an advantage when joining two tables of ORC data as wel

Hive bug? about no such table

2015-12-18 Thread Philip Lee
I think this comes from a Hive bug related to the metastore. Here is the thing. After I generated scale factor 300, named bigbench300 (bigbench100 already existed before), I ran "hive job with bigbench300". At first it was really fine. Then I ran the hive job with bigbench100 again. It w

Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-27 Thread Philip Lee
, you defined the partition function for DBY. On Sun, Oct 25, 2015 at 12:59 AM, Philip Lee wrote: > Hello, the same question about DISTRIBUTE BY on Hive. > > According to you, you do not use hashCode of Object class on DBY, > Distribute By. > > I tried to understand how ObjectIn

Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-24 Thread Philip Lee
, you defined the partition function for DBY. Regards, Philip Lee On Thu, Oct 22, 2015 at 7:13 PM, Gopal Vijayaraghavan wrote: > > > so do you think if we want the same result from Hive and Spark or the > >other framework, how could we try this one ? > > There's a spe

Re: Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Philip Lee
Thanks for your help. So if we want the same result from Hive and Spark or another framework, how could we approach this? Could you tell me in detail? Regards, Philip On Thu, Oct 22, 2015 at 6:25 PM, Gopal Vijayaraghavan wrote: > > > When applying [Distribute By] on Hive to the

Hi, Hive People urgent question about [Distribute By] function

2015-10-22 Thread Philip Lee
Hello, I am working on Flink and Spark while majoring in Computer Science in Berlin. I have an important question. It comes from what I am doing these days, which is translating Hive queries to Flink. When applying [Distribute By] on Hive to the framework, the function should be partitionByHash
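For readers following this thread, here is a minimal sketch of what hash partitioning by a key looks like. It uses the `(hashCode & Integer.MAX_VALUE) % numPartitions` scheme of Hadoop's default HashPartitioner, restricted to integer keys. Whether Hive's DISTRIBUTE BY hashes keys the same way is exactly what this thread asks, so the hash function below is an assumption, and `java_int_hash`, `partition_for`, and `distribute_by` are hypothetical names for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

INT_MAX = 0x7FFFFFFF  # value of Java's Integer.MAX_VALUE


def java_int_hash(key: int) -> int:
    """Mimic Java's Integer.hashCode(): the int value itself, wrapped to 32-bit signed."""
    key &= 0xFFFFFFFF
    return key - (1 << 32) if key >= (1 << 31) else key


def partition_for(key: int, num_partitions: int) -> int:
    """Mask off the sign bit so the result is non-negative, then take the modulo."""
    return (java_int_hash(key) & INT_MAX) % num_partitions


def distribute_by(rows: Iterable[Tuple[int, str]], num_partitions: int) -> Dict[int, List[Tuple[int, str]]]:
    """Group (key, value) rows by the partition their key hashes to."""
    parts: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for key, value in rows:
        parts[partition_for(key, num_partitions)].append((key, value))
    return dict(parts)


# Illustrative rows only; keys 1 and 5 land in the same partition when there are 4 partitions.
rows = [(1, "a"), (5, "b"), (2, "c"), (-1, "d")]
partitions = distribute_by(rows, 4)
print(partitions)
```

Two systems produce the same row-to-partition assignment only if they agree on both the hash function and the number of partitions, which is why the choice of hash matters when porting a query from one framework to another.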