shows flat scaling.
Is that because it is not over capacity yet?
But as far as I can tell, loading the CSV file is not that big a job.
Could you correct me?
Thanks in advance.
Best,
Phil
On Wed, Feb 10, 2016 at 11:17 PM, Philip Lee wrote:
Thanks for your reply!
According to you, because of the natural properties of ORC, it cannot be
split into default-sized chunks, since it is not composed of lines like CSV.
Until you run out of capacity, a distributed system *has* to show sub-linear
scaling -
and will show flat scaling up to a particular
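
A hedged illustration of that splitting difference (not from the thread;
paths and counts are made up), using the Spark 1.x Scala API: a line-oriented
text file can be cut at arbitrary byte offsets and resynchronized at newlines,
while an ORC reader derives its splits from the stripe layout recorded in
each file's footer.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("split-demo"))
    val hc = new HiveContext(sc)

    // CSV is line-oriented, so Spark can cut input splits at newline
    // boundaries and honor a requested minimum partition count.
    val csv = sc.textFile("hdfs:///data/lineitem.csv", 64)

    // ORC splits follow the stripes recorded in each file's footer, so
    // the partition count is decided by the files, not the caller.
    val orc = hc.read.format("orc").load("hdfs:///data/lineitem_orc")

    println(csv.partitions.length)      // >= 64, driven by line splits
    println(orc.rdd.partitions.length)  // driven by ORC stripes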
What kind of steps exist when reading the ORC format in Spark SQL?
I mean, reading a CSV file is usually just reading the dataset directly into
memory, but I feel like Spark SQL has some extra steps when reading ORC.
For example, do they have to create a table to insert the dataset, and then
insert the
From my experience, Spark SQL has its own optimizer to support Hive queries
and the metastore. After Spark 1.5.2, its optimizer is named Catalyst.
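
To make the "extra steps" question concrete, here is a hedged sketch of the
two routes in the Spark 1.5-era API (table and path names are invented): ORC
files can be read directly with no CREATE TABLE or INSERT step, and going
through the Hive metastore only registers metadata over the same files.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("orc-read"))
    val hc = new HiveContext(sc)

    // Route 1: read the ORC files directly; no table is required.
    val df = hc.read.format("orc").load("hdfs:///warehouse/store_sales_orc")
    df.filter(df("ss_quantity") > 10).count()

    // Route 2: register an external table in the metastore. This only
    // records metadata; the data is not copied or re-inserted.
    hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS store_sales_orc
              (ss_item_sk INT, ss_quantity INT)
              STORED AS ORC
              LOCATION 'hdfs:///warehouse/store_sales_orc'""")
    hc.sql("SELECT COUNT(*) FROM store_sales_orc WHERE ss_quantity > 10").show()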
On Feb 3, 2016, at 12:12 AM, "Xuefu Zhang" wrote:
> I think the difference is not only about which one does the optimization,
> but more about feature parity. Hive on Spark offers a
> ORC does not currently expose a primary key to the user, though we have
> talked of having it do that. As Mich says, the indexing in ORC is oriented
> towards statistics that help the optimizer plan the query. This can be
> very important in split generation (determining
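
A hedged sketch of how those statistics get exercised from Spark (the config
key is the Spark 1.4+ one; column and path names are invented): with filter
pushdown enabled, the ORC reader can skip stripes and row groups whose
min/max statistics rule out the predicate.

    // Assumes the HiveContext hc from the earlier sketch.
    hc.setConf("spark.sql.orc.filterPushdown", "true")

    val sales = hc.read.format("orc").load("hdfs:///warehouse/store_sales_orc")
    // Only stripes whose ss_item_sk range can contain 42 are read.
    sales.filter(sales("ss_item_sk") === 42).count()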
Also, when making ORC from CSV, is an index key created for every column, or
is a primary key created for the table?
If keys are made on each column of a table, accessing any column in
operations like filtering should be faster.
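
For what it's worth, ORC keeps min/max statistics for every column by
default rather than a primary key; Bloom filters are opt-in per column at
write time. A hedged sketch via the table property (names are invented, and
this assumes the Hive 1.2+ ORC writer):

    // Assumes the HiveContext hc from the earlier sketches. Min/max
    // statistics are always written; the Bloom filter is requested only
    // for the named column.
    hc.sql("""CREATE TABLE store_sales_orc_bf
              STORED AS ORC
              TBLPROPERTIES ("orc.bloom.filter.columns" = "ss_item_sk")
              AS SELECT * FROM store_sales_csv""")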
On Mon, Feb 1, 2016 at 4:21 PM, Philip Lee wrote:
Hello,
I am experimenting with the performance of some systems on ORC versus CSV
files. I read the ORC documentation on the Hive website, but I am still
curious about some things.
I know the ORC format is faster at filtering and reading because it has
indexing. Does it have an advantage when joining two ORC tables as well?
I think it is a Hive bug related to the metastore.
Here is the thing: after I generated scale factor 300, named bigbench300,
alongside bigbench100, which already existed, I ran the Hive job with
bigbench300. At first it was really fine.
Then I ran the Hive job with bigbench100 again. It w
, you
defined the partition function for DBY.
On Sun, Oct 25, 2015 at 12:59 AM, Philip Lee wrote:
> Hello, the same question about DISTRIBUTE BY on Hive.
>
> According to you, Hive does not use the hashCode of the Object class for
> DBY, Distribute By.
>
> I tried to understand how ObjectIn
, you
defined the partition function for DBY.
Regards,
Philip Lee
On Thu, Oct 22, 2015 at 7:13 PM, Gopal Vijayaraghavan
wrote:
>
> > So do you think, if we want the same result from Hive and Spark or
> > another framework, how could we try this one?
>
> There's a spe
Thanks for your help.
So do you think, if we want the same result from Hive and Spark or another
framework, how could we try this one? Could you tell me in detail?
Regards,
Philip
On Thu, Oct 22, 2015 at 6:25 PM, Gopal Vijayaraghavan wrote:
>
> > When applying [Distribute By] on Hive to the
Hello, I am working on Flink and Spark while majoring in Computer Science in
Berlin.
I have an important question.
Well, this question comes from what I am doing these days, which is
translating Hive queries to Flink.
When applying Hive's [Distribute By] to that framework, the function should
be partitionByHash
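
A hedged sketch of that mapping in Flink's Scala DataSet API (field and
value names are invented): Hive's DISTRIBUTE BY hash-partitions rows across
reducers without sorting them, which corresponds to partitionByHash; note
the two engines need not use the same hash function, so row placement can
differ even when the partitioning semantics match.

    import org.apache.flink.api.scala._

    val env = ExecutionEnvironment.getExecutionEnvironment
    val rows = env.fromElements((1, "a"), (2, "b"), (1, "c"))

    // Hive:  SELECT * FROM t DISTRIBUTE BY key
    // Flink: hash-partition on the same key; no sort within partitions,
    // matching DISTRIBUTE BY without a SORT BY clause.
    val distributed = rows.partitionByHash(0)
    distributed.print()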