Suppose I have a set of records keyed by userId that I need to save as
Parquet in a database. I also need to load them back and join them on
userId. My idea is to partition first by a hash of the userId and then by
the userId itself, so that I avoid the performance problems caused by a
large number of small files and can also scan faster.


Something like:

...df.write.format("parquet").partitionBy("userIdHash", "userId")
    .mode(SaveMode.Append).save("userRecords");
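
For illustration, here is a minimal sketch of how the userIdHash value used
in the write call above might be derived. It is plain Java and does not use
Spark. The class name UserIdBucket, the bucket count of 16, and the choice of
String.hashCode() are all assumptions made for this example, not something
taken from Spark or from an existing codebase. The partitionPath helper only
shows the key=value directory layout that a partitioned write produces.

```java
// Hypothetical helper for illustration; not part of Spark.
final class UserIdBucket {
    // Assumed bucket count; choose it based on the actual data volume.
    static final int NUM_BUCKETS = 16;

    // Maps a userId to a bucket in [0, numBuckets). Math.floorMod keeps the
    // result non-negative even when hashCode() returns a negative value.
    static int bucketFor(String userId, int numBuckets) {
        return Math.floorMod(userId.hashCode(), numBuckets);
    }

    // Builds the key=value directory suffix that a write partitioned by
    // ("userIdHash", "userId") would place under the output path.
    static String partitionPath(String userId, int numBuckets) {
        return "userIdHash=" + bucketFor(userId, numBuckets) + "/userId=" + userId;
    }

    public static void main(String[] args) {
        String[] sampleIds = {"alice", "bob", "carol"};
        for (String id : sampleIds) {
            System.out.println(id + " -> " + partitionPath(id, NUM_BUCKETS));
        }
    }
}
```

Math.floorMod is used rather than the % operator because hashCode() can be
negative, and a negative bucket number would make an awkward directory name.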

On Wed, Feb 17, 2016 at 2:16 PM, swetha kasireddy <swethakasire...@gmail.com
> wrote:

> So suppose I have a bunch of userIds and I need to save them as parquet in
> database. I also need to load them back and need to be able to do a join
> on userId. My idea is to partition by userId hashcode first and then on
> userId.
>
>
>
> On Wed, Feb 17, 2016 at 11:51 AM, Michael Armbrust <mich...@databricks.com
> > wrote:
>
>> Can you describe what you are trying to accomplish?  What would the
>> custom partitioner be?
>>
>> On Tue, Feb 16, 2016 at 1:21 PM, SRK <swethakasire...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> How do I use a custom partitioner when I do a saveAsTable in a dataframe.
>>>
>>>
>>> Thanks,
>>> Swetha
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-a-custom-partitioner-in-a-dataframe-in-Spark-tp26240.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>>
>>
>
