Hi Mich,

Following is a sample code snippet:


val userDF = userRecsDF
  .toDF("idPartitioner", "dtPartitioner", "userId", "userRecord")
  .persist()
System.out.println("userRecsDF.partitions.size: " + userRecsDF.partitions.size)

userDF.registerTempTable("userRecordsTemp")

sqlContext.sql("SET hive.default.fileformat=Orc")
sqlContext.sql("SET hive.enforce.bucketing=true")
sqlContext.sql("SET hive.enforce.sorting=true")

sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS users (userId STRING, userRecord STRING)
    |PARTITIONED BY (idPartitioner STRING, dtPartitioner STRING)
    |STORED AS ORC
    |LOCATION '/user/userId/userRecords'""".stripMargin)

sqlContext.sql(
  """FROM userRecordsTemp ps
    |INSERT OVERWRITE TABLE users PARTITION (idPartitioner, dtPartitioner)
    |SELECT ps.userId, ps.userRecord, ps.idPartitioner, ps.dtPartitioner
    |CLUSTER BY idPartitioner, dtPartitioner""".stripMargin)


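For comparison, an alternative would be to write the partitioned ORC data directly with the DataFrame writer instead of going through the Hive INSERT (sketch only; the output path is reused from the table LOCATION above, and this does not register the partitions in the metastore by itself):

userDF.write
  .partitionBy("idPartitioner", "dtPartitioner")
  .mode("overwrite")
  .orc("/user/userId/userRecords")

Note that "overwrite" here replaces the whole output directory, so it is not a drop-in replacement for a partition-level INSERT OVERWRITE.
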
On Mon, Jun 13, 2016 at 10:57 AM, swetha kasireddy <
swethakasire...@gmail.com> wrote:

> Hi Bijay,
>
> If I am hitting this issue,
> https://issues.apache.org/jira/browse/HIVE-11940, what needs to be done?
> Is upgrading to a higher version of Hive the only solution?
>
> Thanks!
>
> On Mon, Jun 13, 2016 at 10:47 AM, swetha kasireddy <
> swethakasire...@gmail.com> wrote:
>
>> Hi,
>>
>> Following is a sample code snippet:
>>
>>
>> val userDF = userRecsDF.toDF("idPartitioner", "dtPartitioner", "userId",
>> "userRecord").persist()
>> System.out.println(" userRecsDF.partitions.size"+
>> userRecsDF.partitions.size)
>>
>> userDF.registerTempTable("userRecordsTemp")
>>
>> sqlContext.sql("SET hive.default.fileformat=Orc  ")
>> sqlContext.sql("set hive.enforce.bucketing = true; ")
>> sqlContext.sql("set hive.enforce.sorting = true; ")
>> sqlContext.sql("  CREATE EXTERNAL TABLE IF NOT EXISTS users (userId
>> STRING, userRecord STRING) PARTITIONED BY (idPartitioner STRING,
>> dtPartitioner STRING)   stored as ORC LOCATION '/user/userId/userRecords' "
>> )
>> sqlContext.sql(
>>   """ from userRecordsTemp ps   insert overwrite table users
>> partition(idPartitioner, dtPartitioner)  select ps.userId, ps.userRecord,
>> ps.idPartitioner, ps.dtPartitioner CLUSTER BY idPartitioner, dtPartitioner
>> """.stripMargin)
>>
>>
>>
>>
>> On Fri, Jun 10, 2016 at 12:10 AM, Bijay Pathak <
>> bijay.pat...@cloudwick.com> wrote:
>>
>>> Hello,
>>>
>>> Looks like you are hitting this:
>>> https://issues.apache.org/jira/browse/HIVE-11940.
>>>
>>> Thanks,
>>> Bijay
>>>
>>>
>>>
>>> On Thu, Jun 9, 2016 at 9:25 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Can you provide a code snippet of how you are populating the target
>>>> table from the temp table?
>>>>
>>>>
>>>> HTH
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn:
>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 9 June 2016 at 23:43, swetha kasireddy <swethakasire...@gmail.com>
>>>> wrote:
>>>>
>>>>> No, I am reading the data from HDFS, transforming it, registering the
>>>>> data in a temp table using registerTempTable, and then doing an insert
>>>>> overwrite using Spark SQL's hiveContext.
>>>>>
>>>>> On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh <
>>>>> mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> How are you doing the insert? From an existing table?
>>>>>>
>>>>>> Dr Mich Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> LinkedIn:
>>>>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://talebzadehmich.wordpress.com
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9 June 2016 at 21:16, Stephen Boesch <java...@gmail.com> wrote:
>>>>>>
>>>>>>> How many workers (/cpu cores) are assigned to this job?
>>>>>>>
>>>>>>> 2016-06-09 13:01 GMT-07:00 SRK <swethakasire...@gmail.com>:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> How do I insert data into 2000 partitions (directories) of ORC/Parquet
>>>>>>>> at a time using Spark SQL? It does not seem to be performant when I
>>>>>>>> try to insert into 2000 directories of Parquet/ORC using Spark SQL.
>>>>>>>> Has anyone faced this issue?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-data-into-2000-partitions-directories-of-ORC-parquet-at-a-time-using-Spark-SQL-tp27132.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>> Nabble.com.
>>>>>>>>
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
