Understanding reported times on the Spark UI [+ Streaming]

2014-12-08 Thread Gerard Maas
Hi,

I'm confused about the stage times reported on the Spark UI (Spark 1.1.0)
for a Spark Streaming job. I'm hoping somebody can shed some light on it:

Let's do this with an example:

On the /stages page, stage # 232 is reported to have lasted 18 seconds:
  Stage Id: 232
  Description: runJob at RDDFunctions.scala:23
  Submitted: 2014/12/08 15:06:25
  Duration: 18 s
  Tasks (Succeeded/Total): 12/12
When I click on it for details, I see: [1]

Total time across all tasks = 42s

Aggregated metrics by executor:
Executor1 19s
Executor2 24s

Summing the durations of all tasks actually gives 40,009s.

What does the time reported on the overview page (the 18s) represent?

What is the relation between the time reported on the overview page and the
times on the detail page?

My Spark Streaming job is reported to be taking 3m24s, and (I think)
there's only 1 stage in my job. How does the timing per stage relate to the
times reported on the Spark Streaming 'streaming' page (e.g. 'last batch')?

Is there a way to relate a streaming batch to the stages executed to
complete that batch?
The numbers as they are at the moment don't seem to add up.
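
One way to put the two sets of numbers side by side is to register listeners
on both the SparkContext and the StreamingContext. This is only a rough,
untested sketch against the Spark 1.1 listener APIs (it assumes `sc` and `ssc`
are in scope, and it merely prints stage and batch durations; it does not tie
a stage to a specific batch):

  import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
  import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

  // Wall-clock duration per completed stage (roughly what the /stages page shows as Duration).
  sc.addSparkListener(new SparkListener {
    override def onStageCompleted(stageCompleted: SparkListenerStageCompleted) {
      val info = stageCompleted.stageInfo
      val durationMs = for {
        start <- info.submissionTime
        end   <- info.completionTime
      } yield end - start
      println(s"stage ${info.stageId} (${info.name}): ${durationMs.getOrElse(-1L)} ms")
    }
  })

  // Per-batch processing time (roughly what the /streaming page reports).
  ssc.addStreamingListener(new StreamingListener {
    override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
      val info = batchCompleted.batchInfo
      println(s"batch ${info.batchTime}: ${info.processingDelay.getOrElse(-1L)} ms processing")
    }
  })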

Thanks,

Gerard.


[1] https://drive.google.com/file/d/0BznIWnuWhoLlMkZubzY2dTdOWDQ


Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Michael Armbrust
This is by Hive's design. From the Hive documentation:

The column change command will only modify Hive's metadata, and will not
> modify data. Users should make sure the actual data layout of the
> table/partition conforms with the metadata definition.
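
One way to sidestep the mismatch entirely (a rough sketch, not from this
thread, and untested): leave the physical `bag::col` names in the table
untouched, so the metadata always matches the Parquet layout, and expose
cleaner names through a view instead. The view name below is made up, only
`sorted::id` and `sorted::cre_ts` come from the thread, and it is worth
verifying that the 1.1 HiveContext passes CREATE VIEW straight through to
Hive:

  // Alias the awkward column names in a view instead of renaming them.
  sql("""
    CREATE VIEW IF NOT EXISTS pmt_view AS
    SELECT `sorted::id` AS id, `sorted::cre_ts` AS cre_ts
    FROM pmt
  """)
  sql("SELECT cre_ts FROM pmt_view LIMIT 1").collect()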



On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang 
wrote:

> Ok, found another possible bug in Hive.
>
> My current solution is to use ALTER TABLE CHANGE to rename the column
> names.
>
> The problem is that after renaming the columns, their values all
> became NULL.
>
> Before renaming:
> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>
> Execute renaming:
> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
> res13: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[972] at RDD at SchemaRDD.scala:108
> == Query Plan ==
> 
>
> After renaming:
> scala> sql("select cre_ts from pmt limit 1").collect
> res16: Array[org.apache.spark.sql.Row] = Array([null])
>
> I created a JIRA for it:
>
>   https://issues.apache.org/jira/browse/SPARK-4781
>
>
> Jianshi
>
> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang 
> wrote:
>
>> Hmm... another issue I found doing this approach is that ANALYZE TABLE
>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>> later broadcast join and such will fail...
>>
>> Any idea how to fix this issue?
>>
>> Jianshi
>>
>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang 
>> wrote:
>>
>>> Very interesting, the line doing the drop table throws an exception.
>>> After removing it, everything works.
>>>
>>> Jianshi
>>>
>>> On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang 
>>> wrote:
>>>
 Here's the solution I got after talking with Liancheng:

 1) using backquote `..` to wrap up all illegal characters

 val rdd = parquetFile(file)
 val schema = rdd.schema.fields.map(f => s"`${f.name}`
 ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")

 val ddl_13 = s"""
   |CREATE EXTERNAL TABLE $name (
   |  $schema
   |)
   |STORED AS PARQUET
   |LOCATION '$file'
   """.stripMargin

 sql(ddl_13)

 2) create a new Schema and do applySchema to generate a new SchemaRDD,
 had to drop and register table

 val t = table(name)
 val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
 s.name.replaceAll(".*?::", ""))))
 sql(s"drop table $name")
 applySchema(t, newSchema).registerTempTable(name)

 I'm testing it for now.

 Thanks for the help!


 Jianshi

 On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang 
 wrote:

> Hi,
>
> I had to use Pig for some preprocessing and to generate Parquet files
> for Spark to consume.
>
> However, due to Pig's limitation, the generated schema contains Pig's
> identifier
>
> e.g.
> sorted::id, sorted::cre_ts, ...
>
> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>
>   create external table pmt (
> sorted::id bigint
>   )
>   stored as parquet
>   location '...'
>
> Obviously it didn't work, I also tried removing the identifier
> sorted::, but the resulting rows contain only nulls.
>
> Any idea how to create a table in HiveContext from these Parquet files?
>
> Thanks,
> Jianshi
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/

>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>


Re: Handling stale PRs

2014-12-08 Thread Nicholas Chammas
I recently came across this blog post, which reminded me of this thread.

How to Discourage Open Source Contributions


We are currently at 320+ open PRs, many of which haven't been updated in
over a month. We have quite a few PRs that haven't been touched in 3-5
months.

*If you have the time and interest, please hop on over to the Spark PR
Dashboard, sort the PRs by
least-recently-updated, and update them where you can.*

I share the blog author's opinion that letting PRs go stale discourages
contributions, especially from first-time contributors, and especially more
so when the PR author is waiting on feedback from a committer or
contributor.

I've been thinking about simple ways to make it easier for all of us to
chip in on controlling stale PRs in an incremental way. For starters, would
it help if an automated email went out to the dev list once a week that a)
reported the number of stale PRs, and b) directly linked to the 5 least
recently updated PRs?

Nick

On Sat Aug 30 2014 at 3:41:39 AM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell 
> wrote:
>
>> it's actually procedurally difficult for us to close pull requests
>
>
> Just an FYI: Seems like the GitHub-sanctioned work-around to having
> issues-only permissions is to have a second, issues-only repository. Not a
> very attractive work-around...
>
> Nick
>


Re: Handling stale PRs

2014-12-08 Thread Ganelin, Ilya
Thank you for pointing this out, Nick. I know that for myself and my
colleague who are starting to contribute to Spark, it's definitely
discouraging to have fixes sitting in the pipeline. Could you recommend
any other ways that we can facilitate getting these PRs accepted? Clean,
well-tested code is an obvious one but I¹d like to know if there are some
non-obvious things we (as contributors) could do to make the committers¹
lives easier? Thanks!

-Ilya 

On 12/8/14, 11:58 AM, "Nicholas Chammas" 
wrote:

>I recently came across this blog post, which reminded me of this thread.
>
>How to Discourage Open Source Contributions
>
>
>We are currently at 320+ open PRs, many of which haven't been updated in
>over a month. We have quite a few PRs that haven't been touched in 3-5
>months.
>
>*If you have the time and interest, please hop on over to the Spark PR
>Dashboard , sort the PRs by
>least-recently-updated, and update them where you can.*
>
>I share the blog author's opinion that letting PRs go stale discourages
>contributions, especially from first-time contributors, and especially
>more
>so when the PR author is waiting on feedback from a committer or
>contributor.
>
>I've been thinking about simple ways to make it easier for all of us to
>chip in on controlling stale PRs in an incremental way. For starters,
>would
>it help if an automated email went out to the dev list once a week that a)
>reported the number of stale PRs, and b) directly linked to the 5 least
>recently updated PRs?
>
>Nick
>
>On Sat Aug 30 2014 at 3:41:39 AM Nicholas Chammas <
>nicholas.cham...@gmail.com> wrote:
>
>> On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell 
>> wrote:
>>
>>> it's actually procedurally difficult for us to close pull requests
>>
>>
>> Just an FYI: Seems like the GitHub-sanctioned work-around to having
>> issues-only permissions is to have a second, issues-only repository. Not a
>> very attractive work-around...
>>
>> Nick
>>






Re: Handling stale PRs

2014-12-08 Thread Nicholas Chammas
Things that help:

   - Be persistent. People are busy, so just ping them if there’s been no
   response for a couple of weeks. Hopefully, as the project continues to
   develop, this will become less necessary.
   - Only ping reviewers after test results are back from Jenkins. Make
   sure all the tests are clear before reaching out, unless you need help
   understanding why a test is failing.
   - Whenever possible, keep PRs small, small, small.
   - Get buy-in on the dev list before working on something, especially
   larger features, to make sure you are making something that people
   understand and that is in accordance with Spark’s design.

I’m just speaking as a random contributor here, so don’t take this advice
as gospel.

Nick

On Mon Dec 08 2014 at 3:08:02 PM Ganelin, Ilya 
wrote:

> Thank you for pointing this out, Nick. I know that for myself and my
> colleague who are starting to contribute to Spark, it's definitely
> discouraging to have fixes sitting in the pipeline. Could you recommend
> any other ways that we can facilitate getting these PRs accepted? Clean,
> well-tested code is an obvious one but I'd like to know if there are some
> non-obvious things we (as contributors) could do to make the committers'
> lives easier? Thanks!
>
> -Ilya
>
> On 12/8/14, 11:58 AM, "Nicholas Chammas" 
> wrote:
>
> >I recently came across this blog post, which reminded me of this thread.
> >
> >How to Discourage Open Source Contributions
> >
> >
> >We are currently at 320+ open PRs, many of which haven't been updated in
> >over a month. We have quite a few PRs that haven't been touched in 3-5
> >months.
> >
> >*If you have the time and interest, please hop on over to the Spark PR
> >Dashboard , sort the PRs by
> >least-recently-updated, and update them where you can.*
> >
> >I share the blog author's opinion that letting PRs go stale discourages
> >contributions, especially from first-time contributors, and especially
> >more
> >so when the PR author is waiting on feedback from a committer or
> >contributor.
> >
> >I've been thinking about simple ways to make it easier for all of us to
> >chip in on controlling stale PRs in an incremental way. For starters,
> >would
> >it help if an automated email went out to the dev list once a week that a)
> >reported the number of stale PRs, and b) directly linked to the 5 least
> >recently updated PRs?
> >
> >Nick
> >
> >On Sat Aug 30 2014 at 3:41:39 AM Nicholas Chammas <
> >nicholas.cham...@gmail.com> wrote:
> >
> >> On Tue, Aug 26, 2014 at 2:02 AM, Patrick Wendell 
> >> wrote:
> >>
> >>> it's actually procedurally difficult for us to close pull requests
> >>
> >>
> >> Just an FYI: Seems like the GitHub-sanctioned work-around to having
> >> issues-only permissions is to have a second, issues-only repository. Not a
> >> very attractive work-around...
> >>
> >> Nick
> >>
>
>
>


Re: scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-08 Thread Yin Huai
Seems you hit https://issues.apache.org/jira/browse/SPARK-4245. It was
fixed in 1.2.

Thanks,

Yin

On Wed, Dec 3, 2014 at 11:50 AM, invkrh  wrote:

> Hi,
>
> I am using SparkSQL on 1.1.0 branch.
>
> The following code leads to a scala.MatchError
> at
>
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
>
> val scm = StructType(inputRDD.schema.fields.init :+
>   StructField("list",
>     ArrayType(
>       StructType(Seq(
>         StructField("date", StringType, nullable = false),
>         StructField("nbPurchase", IntegerType, nullable = false)))),
>     nullable = false))
>
> // purchaseRDD is an RDD[sql.Row] whose schema corresponds to scm. It is
> // transformed from inputRDD.
> val schemaRDD = hiveContext.applySchema(purchaseRDD, scm)
> schemaRDD.registerTempTable("t_purchase")
>
> Here's the stack trace:
> scala.MatchError: ArrayType(StructType(List(StructField(date,StringType,
> true), StructField(n_reachat,IntegerType, true))),true) (of class
> org.apache.spark.sql.catalyst.types.ArrayType)
> at
>
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
> at
> org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
> at
> org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
> at
>
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
> at
>
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
> at
>
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
> $apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
> at
>
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
> at
>
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
>
> The strange thing is that nullable for the date and nbPurchase fields is
> set to true while it was false in the code. If I set both to true, it
> works. But, in fact, they should not be nullable.
>
> Here's what I find at Cast.scala:247 on 1.1.0 branch
>
>   private[this] lazy val cast: Any => Any = dataType match {
> case StringType => castToString
> case BinaryType => castToBinary
> case DecimalType => castToDecimal
> case TimestampType => castToTimestamp
> case BooleanType => castToBoolean
> case ByteType => castToByte
> case ShortType => castToShort
> case IntegerType => castToInt
> case FloatType => castToFloat
> case LongType => castToLong
> case DoubleType => castToDouble
>   }
>
> Any idea? Thank you.
>
> Hao
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/scala-MatchError-on-SparkSQL-when-creating-ArrayType-of-StructType-tp9623.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>
>


Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0

2014-12-08 Thread Michael Armbrust
This is merged now and should be fixed in the next 1.2 RC.
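
For anyone who wants to double-check this against the RC, a minimal
end-to-end version of the test case quoted below, runnable from a
spark-shell, might look roughly like this (hypothetical table and class
names, untested):

  // Untested sketch of the same CTAS-from-temp-table path.
  case class Rec(id: Int)

  val hc = new org.apache.spark.sql.hive.HiveContext(sc)
  import hc._

  sc.parallelize(Seq(Rec(1))).registerTempTable("rec_tmp")
  hc.sql("CREATE TABLE rec_ctas AS SELECT * FROM rec_tmp")
  hc.sql("SELECT * FROM rec_ctas").collect()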

On Sat, Dec 6, 2014 at 8:28 PM, Cheng, Hao  wrote:

> I've created (reused) the PR https://github.com/apache/spark/pull/3336,
> hopefully we can fix this regression.
>
> Thanks for the reporting.
>
> Cheng Hao
>
> -Original Message-
> From: Michael Armbrust [mailto:mich...@databricks.com]
> Sent: Saturday, December 6, 2014 4:51 AM
> To: kb
> Cc: d...@spark.incubator.apache.org; Cheng Hao
> Subject: Re: CREATE TABLE AS SELECT does not work with temp tables in 1.2.0
>
> Thanks for reporting.  This looks like a regression related to:
> https://github.com/apache/spark/pull/2570
>
> I've filed it here: https://issues.apache.org/jira/browse/SPARK-4769
>
> On Fri, Dec 5, 2014 at 12:03 PM, kb  wrote:
>
> > I am having trouble getting "create table as select" or saveAsTable
> > from a hiveContext to work with temp tables in spark 1.2.  No issues
> > in 1.1.0 or
> > 1.1.1
> >
> > Simple modification to test case in the hive SQLQuerySuite.scala:
> >
> > test("double nested data") {
> > sparkContext.parallelize(Nested1(Nested2(Nested3(1))) ::
> > Nil).registerTempTable("nested")
> > checkAnswer(
> >   sql("SELECT f1.f2.f3 FROM nested"),
> >   1)
> > checkAnswer(sql("CREATE TABLE test_ctas_1234 AS SELECT * from
> > nested"),
> > Seq.empty[Row])
> > checkAnswer(
> >   sql("SELECT * FROM test_ctas_1234"),
> >   sql("SELECT * FROM nested").collect().toSeq)
> >   }
> >
> >
> > output:
> >
> > 11:57:15.974 ERROR org.apache.hadoop.hive.ql.parse.SemanticAnalyzer:
> > org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:45 Table not
> > found 'nested'
> > at
> >
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1243)
> > at
> >
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1192)
> > at
> >
> >
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9209)
> > at
> >
> >
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327)
> > at
> >
> >
> org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation$lzycompute(CreateTableAsSelect.scala:59)
> > at
> >
> >
> org.apache.spark.sql.hive.execution.CreateTableAsSelect.metastoreRelation(CreateTableAsSelect.scala:55)
> > at
> >
> >
> org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult$lzycompute(CreateTableAsSelect.scala:82)
> > at
> >
> >
> org.apache.spark.sql.hive.execution.CreateTableAsSelect.sideEffectResult(CreateTableAsSelect.scala:70)
> > at
> >
> >
> org.apache.spark.sql.hive.execution.CreateTableAsSelect.execute(CreateTableAsSelect.scala:89)
> > at
> >
> >
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:425)
> > at
> >
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:425)
> > at
> > org.apache.spark.sql.SchemaRDDLike$class.$init$(SchemaRDDLike.scala:58)
> > at org.apache.spark.sql.SchemaRDD.(SchemaRDD.scala:105)
> > at
> org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:103)
> > at
> >
> >
> org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply$mcV$sp(SQLQuerySuite.scala:122)
> > at
> >
> >
> org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
> > at
> >
> >
> org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$4.apply(SQLQuerySuite.scala:117)
> > at
> >
> >
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> > at org.scalatest.Transformer.apply(Transformer.scala:22)
> > at org.scalatest.Transformer.apply(Transformer.scala:20)
> > at
> org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> > at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
> > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
> > at
> >
> >
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> > at
> >
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> > at
> >
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> > at
> org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> > at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> > at
> >
> >
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> > at
> >
> >
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> > at
> >
> >
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.ap

Re: Hive Problem in Pig generated Parquet file schema in CREATE EXTERNAL TABLE (e.g. bag::col1)

2014-12-08 Thread Jianshi Huang
Ah... I see. Thanks for pointing it out.

Then it means we cannot mount an external table using customized column names,
hmm...

Then the only option left is to use a subquery to add a bunch of column
aliases. I'll try it later.
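
In case it helps, here is a rough, untested sketch of that idea. Table and
column names are as in the earlier messages, and the pmt_clean temp table
name is made up:

  // Build "`bag::col` AS col" aliases from the table's schema and register the
  // aliased query as a temp table with clean column names.
  val aliases = table("pmt").schema.fields
    .map(f => s"`${f.name}` AS ${f.name.replaceAll(".*?::", "")}")
    .mkString(", ")

  sql(s"SELECT $aliases FROM pmt").registerTempTable("pmt_clean")

  sql("SELECT cre_ts FROM pmt_clean LIMIT 1").collect()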

Thanks,
Jianshi

On Tue, Dec 9, 2014 at 3:34 AM, Michael Armbrust 
wrote:

> This is by Hive's design. From the Hive documentation:
>
> The column change command will only modify Hive's metadata, and will not
>> modify data. Users should make sure the actual data layout of the
>> table/partition conforms with the metadata definition.
>
>
>
> On Sat, Dec 6, 2014 at 8:28 PM, Jianshi Huang 
> wrote:
>
>> Ok, found another possible bug in Hive.
>>
>> My current solution is to use ALTER TABLE CHANGE to rename the column
>> names.
>>
>> The problem is that after renaming the columns, their values all
>> became NULL.
>>
>> Before renaming:
>> scala> sql("select `sorted::cre_ts` from pmt limit 1").collect
>> res12: Array[org.apache.spark.sql.Row] = Array([12/02/2014 07:38:54])
>>
>> Execute renaming:
>> scala> sql("alter table pmt change `sorted::cre_ts` cre_ts string")
>> res13: org.apache.spark.sql.SchemaRDD =
>> SchemaRDD[972] at RDD at SchemaRDD.scala:108
>> == Query Plan ==
>> 
>>
>> After renaming:
>> scala> sql("select cre_ts from pmt limit 1").collect
>> res16: Array[org.apache.spark.sql.Row] = Array([null])
>>
>> I created a JIRA for it:
>>
>>   https://issues.apache.org/jira/browse/SPARK-4781
>>
>>
>> Jianshi
>>
>> On Sun, Dec 7, 2014 at 1:06 AM, Jianshi Huang 
>> wrote:
>>
>>> Hmm... another issue I found doing this approach is that ANALYZE TABLE
>>> ... COMPUTE STATISTICS will fail to attach the metadata to the table, and
>>> later broadcast join and such will fail...
>>>
>>> Any idea how to fix this issue?
>>>
>>> Jianshi
>>>
>>> On Sat, Dec 6, 2014 at 9:10 PM, Jianshi Huang 
>>> wrote:
>>>
 Very interesting, the line doing the drop table throws an exception.
 After removing it, everything works.

 Jianshi

 On Sat, Dec 6, 2014 at 9:11 AM, Jianshi Huang 
 wrote:

> Here's the solution I got after talking with Liancheng:
>
> 1) using backquote `..` to wrap up all illegal characters
>
> val rdd = parquetFile(file)
> val schema = rdd.schema.fields.map(f => s"`${f.name}`
> ${HiveMetastoreTypes.toMetastoreType(f.dataType)}").mkString(",\n")
>
> val ddl_13 = s"""
>   |CREATE EXTERNAL TABLE $name (
>   |  $schema
>   |)
>   |STORED AS PARQUET
>   |LOCATION '$file'
>   """.stripMargin
>
> sql(ddl_13)
>
> 2) create a new Schema and do applySchema to generate a new SchemaRDD,
> had to drop and register table
>
> val t = table(name)
> val newSchema = StructType(t.schema.fields.map(s => s.copy(name =
> s.name.replaceAll(".*?::", ""))))
> sql(s"drop table $name")
> applySchema(t, newSchema).registerTempTable(name)
>
> I'm testing it for now.
>
> Thanks for the help!
>
>
> Jianshi
>
> On Sat, Dec 6, 2014 at 8:41 AM, Jianshi Huang wrote:
>
>> Hi,
>>
>> I had to use Pig for some preprocessing and to generate Parquet files
>> for Spark to consume.
>>
>> However, due to Pig's limitation, the generated schema contains Pig's
>> identifier
>>
>> e.g.
>> sorted::id, sorted::cre_ts, ...
>>
>> I tried to put the schema inside CREATE EXTERNAL TABLE, e.g.
>>
>>   create external table pmt (
>> sorted::id bigint
>>   )
>>   stored as parquet
>>   location '...'
>>
>> Obviously it didn't work, I also tried removing the identifier
>> sorted::, but the resulting rows contain only nulls.
>>
>> Any idea how to create a table in HiveContext from these Parquet
>> files?
>>
>> Thanks,
>> Jianshi
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>



 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/

>>>
>>>
>>>
>>> --
>>> Jianshi Huang
>>>
>>> LinkedIn: jianshi
>>> Twitter: @jshuang
>>> Github & Blog: http://huangjs.github.com/
>>>
>>
>>
>>
>> --
>> Jianshi Huang
>>
>> LinkedIn: jianshi
>> Twitter: @jshuang
>> Github & Blog: http://huangjs.github.com/
>>
>
>


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/