Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei,

Have you tried coalesce(numPartitions: Int, shuffle: Boolean = false)?

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395

coalesce is mostly used for reducing the number of partitions before
writing to HDFS, but it might still give you a narrow dependency (satisfying
your requirement) if you increase the number of partitions.
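
For concreteness, the call being suggested looks roughly like this (a sketch
assuming an existing RDD named `rdd`; as later replies in the thread note, it
may not actually increase the partition count without a shuffle):

    val doubled = rdd.coalesce(rdd.getNumPartitions * 2, shuffle = false)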

Best,
Anastasios

On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu  wrote:

> Dear all,
>
> I want to equally divide an RDD partition into two partitions. That means
> the first half of the elements in the partition will create a new partition,
> and the second half of the elements will generate another new partition. But
> the two new partitions are required to be on the same node as their parent
> partition, which helps achieve high data locality.
>
> Is there anyone who knows how to implement it or any hints for it?
>
> Thanks in advance,
> Fei
>
>


-- 
-- Anastasios Zouzias



Re: What about removing TaskContext#getPartitionId?

2017-01-15 Thread Sean Owen
As you mentioned, it's called in ForeachSink. I don't know that the
scaladoc is wrong. You're saying something else: that there's no such thing
as local execution. I confess I don't know whether that's true, but even so,
the doc isn't really wrong in that case.

More broadly, I just don't think this type of thing is worth this small
amount of attention.
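
For reference, a minimal sketch of the two calls being discussed, run inside a
task (assuming some existing `rdd`):

    import org.apache.spark.TaskContext

    rdd.foreachPartition { _ =>
      // Documented to return 0 when there is no active TaskContext.
      val viaStatic = TaskContext.getPartitionId()
      // Would throw a NullPointerException with no active TaskContext.
      val viaGet = TaskContext.get().partitionId()
      assert(viaStatic == viaGet)
    }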

On Sat, Jan 14, 2017 at 8:05 PM Jacek Laskowski  wrote:

> Hi Sean,
>
> Can you elaborate on "it's actually used by Spark"? Where exactly?
> I'd like to be corrected.
>
> What about the scaladoc? Since the method's a public API, I think it
> should be fixed, shouldn't it?
>
> Regards,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Jan 14, 2017 at 6:03 PM, Sean Owen  wrote:
> > It doesn't strike me as something that's problematic to use. There are a
> > thousand things in the API that, maybe in hindsight, could have been done
> > differently, but unless something is bad practice or superseded by
> another
> > superior mechanism, it's probably not worth the bother for maintainers or
> > users. I don't see much problem with this, and it's actually used by
> Spark.
> >
> > On Sat, Jan 14, 2017 at 3:55 PM Jacek Laskowski  wrote:
> >>
> >> Hi,
> >>
> >> Yes, correct. I was too forceful in discouraging people from using it. I
> >> think @deprecated would be the right direction.
> >>
> >> What should be the next step? I think I should file a JIRA so it's in the
> >> release notes. Correct?
> >>
> >> I was very surprised to notice its resurrection in the very latest
> >> module of Spark - Structured Streaming - which will be an inspiration for
> >> others learning Spark.
> >>
> >> Jacek
> >>
> >> On 14 Jan 2017 12:48 p.m., "Mridul Muralidharan" 
> wrote:
> >>>
> >>> Since TaskContext.getPartitionId is part of the public API, it can't be
> >>> removed, as user code may be depending on it (unless we go through a
> >>> deprecation process for it).
> >>>
> >>> Regards,
> >>> Mridul
> >>>
> >>>
> >>> On Sat, Jan 14, 2017 at 2:02 AM, Jacek Laskowski 
> wrote:
> >>> > Hi,
> >>> >
> >>> > Just noticed that TaskContext#getPartitionId [1] is not used and
> >>> > moreover the scaladoc is incorrect:
> >>> >
> >>> > "It will return 0 if there is no active TaskContext for cases like
> >>> > local execution."
> >>> >
> >>> > since there is no local execution (I've seen the comment in the code
> >>> > before but can't find it now).
> >>> >
> >>> > The reason to remove it is that Structured Streaming is giving new
> >>> > birth to the method in ForeachSink [2] which may look like a
> >>> > "resurrection".
> >>> >
> >>> > There's simply TaskContext.get.partitionId.
> >>> >
> >>> > What do you think?
> >>> >
> >>> > [1]
> >>> >
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/TaskContext.scala#L41
> >>> > [2]
> >>> >
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ForeachSink.scala#L50
> >>> >
> >>> > Regards,
> >>> > Jacek Laskowski
> >>> > 
> >>> > https://medium.com/@jaceklaskowski/
> >>> > Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark
> >>> > Follow me at https://twitter.com/jaceklaskowski
> >>> >
> >>> > -
> >>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>> >
>


Re: Both Spark AM and Client are trying to delete Staging Directory

2017-01-15 Thread Liang-Chi Hsieh

Hi,

Will it be a problem if the staging directory is already deleted? Even if
the directory doesn't exist, fs.delete(stagingDirPath, true) won't
cause a failure; it just returns false.
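
For illustration, a minimal sketch of that call against the Hadoop FileSystem
API (variable names such as stagingDirUri and hadoopConf are assumed to be in
scope; this is not Spark's actual code):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val stagingDirPath = new Path(stagingDirUri)
    val fs: FileSystem = stagingDirPath.getFileSystem(hadoopConf)
    // Recursive delete; if the other side already removed the directory,
    // this returns false instead of throwing.
    val deleted: Boolean = fs.delete(stagingDirPath, true)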


Rostyslav Sotnychenko wrote
> Hi all!
> 
> I am a bit confused why Spark AM and Client are both trying to delete
> Staging Directory.
> 
> https://github.com/apache/spark/blob/branch-2.1/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L1110
> https://github.com/apache/spark/blob/branch-2.1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L233
> 
> As you can see, if a job runs on YARN in cluster deployment mode, both the
> AM and the Client will try to delete the staging directory when the job
> succeeds, and one of them will inevitably fail to do so because the other
> one has already deleted the directory.
> 
> Shouldn't we add some check to the Client?
> 
> 
> Thanks,
> Rostyslav





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



unsubscribe

2017-01-15 Thread Boris Lenzinger



Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Rishi,

Thanks for your reply! The RDD has 24 partitions, and the cluster has a
master node + 24 computing nodes (12 cores per node). Each node will have a
partition, and I want to split each partition into two sub-partitions on the
same node to improve parallelism while keeping high data locality.

Thanks,
Fei


On Sun, Jan 15, 2017 at 2:33 AM, Rishi Yadav  wrote:

> Can you provide some more details:
> 1. How many partitions does RDD have
> 2. How big is the cluster
> On Sat, Jan 14, 2017 at 3:59 PM Fei Hu  wrote:
>
>> Dear all,
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>> Thanks in advance,
>> Fei
>>
>>


Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios,

Thanks for your reply. If I just double numPartitions, how does
coalesce(numPartitions: Int, shuffle: Boolean = false) keep
data locality? Do I need to define my own Partitioner?

Thanks,
Fei

On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias 
wrote:

> Hi Fei,
>
> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>
> https://github.com/apache/spark/blob/branch-1.6/core/
> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
>
> coalesce is mostly used for reducing the number of partitions before
> writing to HDFS, but it might still be a narrow dependency (satisfying your
> requirements) if you increase the # of partitions.
>
> Best,
> Anastasios
>
> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu  wrote:
>
>> Dear all,
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>> Thanks in advance,
>> Fei
>>
>>
>
>
> --
> -- Anastasios Zouzias
> 
>


Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Anastasios Zouzias
Hi Fei,

I looked at the code of CoalescedRDD, and what I suggested probably will not
work.

Also, CoalescedRDD is private[spark]. If that were not the case, you could
set balanceSlack to 1 and get what you requested; see

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L75

Maybe you could try to use the CoalescedRDD code to implement your
requirement.

Good luck!
Cheers,
Anastasios


On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu  wrote:

> Hi Anastasios,
>
> Thanks for your reply. If I just increase the numPartitions to be twice
> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
> the data locality? Do I need to define my own Partitioner?
>
> Thanks,
> Fei
>
> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias 
> wrote:
>
>> Hi Fei,
>>
>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>
>> https://github.com/apache/spark/blob/branch-1.6/core/src/
>> main/scala/org/apache/spark/rdd/RDD.scala#L395
>>
>> coalesce is mostly used for reducing the number of partitions before
>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>> requirements) if you increase the # of partitions.
>>
>> Best,
>> Anastasios
>>
>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu  wrote:
>>
>>> Dear all,
>>>
>>> I want to equally divide a RDD partition into two partitions. That
>>> means, the first half of elements in the partition will create a new
>>> partition, and the second half of elements in the partition will generate
>>> another new partition. But the two new partitions are required to be at the
>>> same node with their parent partition, which can help get high data
>>> locality.
>>>
>>> Is there anyone who knows how to implement it or any hints for it?
>>>
>>> Thanks in advance,
>>> Fei
>>>
>>>
>>
>>
>> --
>> -- Anastasios Zouzias
>> 
>>
>
>


-- 
-- Anastasios Zouzias



Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Anastasios,

Thanks for your information. I will look into the CoalescedRDD code.

Thanks,
Fei

On Sun, Jan 15, 2017 at 12:21 PM, Anastasios Zouzias 
wrote:

> Hi Fei,
>
> I looked at the code of CoalescedRDD and probably what I suggested will
> not work.
>
> Speaking of which, CoalescedRDD is private[spark]. If this was not the
> case, you could set balanceSlack to 1, and get what you requested, see
>
> https://github.com/apache/spark/blob/branch-1.6/core/
> src/main/scala/org/apache/spark/rdd/CoalescedRDD.scala#L75
>
> Maybe you could try to use the CoalescedRDD code to implement your
> requirement.
>
> Good luck!
> Cheers,
> Anastasios
>
>
> On Sun, Jan 15, 2017 at 5:39 PM, Fei Hu  wrote:
>
>> Hi Anastasios,
>>
>> Thanks for your reply. If I just increase the numPartitions to be twice
>> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
>> the data locality? Do I need to define my own Partitioner?
>>
>> Thanks,
>> Fei
>>
>> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias 
>> wrote:
>>
>>> Hi Fei,
>>>
>>> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>>>
>>> https://github.com/apache/spark/blob/branch-1.6/core/src/mai
>>> n/scala/org/apache/spark/rdd/RDD.scala#L395
>>>
>>> coalesce is mostly used for reducing the number of partitions before
>>> writing to HDFS, but it might still be a narrow dependency (satisfying your
>>> requirements) if you increase the # of partitions.
>>>
>>> Best,
>>> Anastasios
>>>
>>> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu  wrote:
>>>
 Dear all,

 I want to equally divide a RDD partition into two partitions. That
 means, the first half of elements in the partition will create a new
 partition, and the second half of elements in the partition will generate
 another new partition. But the two new partitions are required to be at the
 same node with their parent partition, which can help get high data
 locality.

 Is there anyone who knows how to implement it or any hints for it?

 Thanks in advance,
 Fei


>>>
>>>
>>> --
>>> -- Anastasios Zouzias
>>> 
>>>
>>
>>
>
>
> --
> -- Anastasios Zouzias
> 
>


unsubscribe

2017-01-15 Thread Hosun Lee
*unsubscribe*


Re: Limit Query Performance Suggestion

2017-01-15 Thread Liang-Chi Hsieh

Hi Sujith,

Thanks for the suggestion.

The code you quoted is from `CollectLimitExec`, which will be in the plan
if a logical `Limit` is the final operator in the logical plan. But in the
physical plan you showed, `GlobalLimit` and `LocalLimit` implement the
logical `Limit` operation, so the `doExecute` method of `CollectLimitExec`
will not be executed.

In the case where `CollectLimitExec` is the final physical operation, its
`executeCollect` will be executed and delegates to `SparkPlan.executeTake`,
which is optimized to retrieve only the required number of rows back to the
driver. So using `limit n` with a huge number of partitions should not
be a problem.

In the case where `GlobalLimit` and `LocalLimit` are the final physical
operations, your concern is that when returning `n` rows from each of `N`
partitions, with `N` huge, the total `n * N` rows will cause heavy memory
pressure on the driver. I am not sure whether you have actually observed this
problem or just think it might be one. In this case there will be a shuffle
exchange between `LocalLimit` and `GlobalLimit` to move data from all
partitions into one partition. In `GlobalLimit` we only take the required
number of rows from the input iterator, which pulls data from local and
remote blocks on demand. Because of this iterator approach, once we have
enough rows in `GlobalLimit`, we won't continue to consume the input
iterator and pull more data back. So I don't think your concern will be a
problem.
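
To illustrate the iterator point, a small standalone Scala sketch (a toy, not
Spark's actual implementation):

    // A toy stand-in for the "pull rows on demand" input of `GlobalLimit`.
    var pulled = 0
    val rows = Iterator.tabulate(1000000) { i =>
      pulled += 1   // count how many rows are actually materialized
      i
    }
    val limit = 10
    // take() is lazy: only `limit` elements are ever pulled from the source.
    val limited = rows.take(limit).toArray
    assert(pulled == limit)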



sujith71955 wrote
> When the limit is placed at the end of the physical plan, there can be a
> memory bottleneck if the limit value is too large, because the system will
> try to aggregate the limited output of every partition into a single
> partition.
> Description:
> Eg:
> create table src_temp as select * from src limit n;
> == Physical Plan ==
> ExecutedCommand
>+- CreateHiveTableAsSelectCommand [Database:spark}, TableName: t2,
> InsertIntoHiveTable]
>  +- GlobalLimit 2
> +- LocalLimit 2
>+- Project [imei#101, age#102, task#103L, num#104,
> level#105, productdate#106, name#107, point#108]
>   +- SubqueryAlias hive
>  +-
> Relation[imei#101,age#102,task#103L,num#104,level#105,productdate#106,name#107,point#108]
> csv  |
> 
> As shown in the above plan, when the limit comes at the end there can be two
> types of performance bottlenecks:
> scenario 1: when the partition count is very high and the limit value is small
> scenario 2: when the limit value is very large
> 
>  protected override def doExecute(): RDD[InternalRow] = {
> val locallyLimited =
> child.execute().mapPartitionsInternal(_.take(limit))
> val shuffled = new ShuffledRowRDD(
>   ShuffleExchange.prepareShuffleDependency(
> locallyLimited, child.output, SinglePartition, serializer))
> shuffled.mapPartitionsInternal(_.take(limit))
>   }
> }
> 
> As per my understanding, the current algorithm first creates a
> MapPartitionsRDD by applying the limit to each partition, and then a
> ShuffledRowRDD is created by grouping the data from all partitions into a
> single partition. This can create overhead, since every partition returns up
> to limit n rows, so while grouping there will be up to N partitions * limit n
> rows, which can be very large; in both scenarios mentioned above this logic
> can be a bottleneck.
> 
> My suggestion for handling scenario 1, where the number of partitions is
> large and the limit value is small: the driver can create an accumulator
> and send it to all partitions, and every executor updates the accumulator
> value based on the data it has fetched.
> e.g.: number of partitions = 100, number of cores = 10
> Tasks will be launched in groups of 10 (10*10 = 100); once the first group
> finishes its tasks, the driver checks whether the accumulator value has
> reached the limit value. If it has, no further tasks are launched on the
> executors and the result is returned.
> 
> Let me know of any further suggestions or solutions.
> 
> Thanks in advance,
> Sujith





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Liang-Chi Hsieh

Hi,

When calling `coalesce` with `shuffle = false`, it produces at most
min(numPartitions, the previous RDD's number of partitions) partitions. So it
can't be used to double the number of partitions.
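
A quick way to see this in spark-shell (a behavior sketch with an arbitrary
example RDD):

    val rdd = sc.parallelize(1 to 100, 24)
    rdd.coalesce(48, shuffle = false).getNumPartitions  // still 24: no growth without a shuffle
    rdd.coalesce(48, shuffle = true).getNumPartitions   // 48, but at the cost of a shuffle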


Anastasios Zouzias wrote
> Hi Fei,
> 
> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
> 
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395
> 
> coalesce is mostly used for reducing the number of partitions before
> writing to HDFS, but it might still be a narrow dependency (satisfying
> your
> requirements) if you increase the # of partitions.
> 
> Best,
> Anastasios
> 
> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <

> hufei68@

> > wrote:
> 
>> Dear all,
>>
>> I want to equally divide a RDD partition into two partitions. That means,
>> the first half of elements in the partition will create a new partition,
>> and the second half of elements in the partition will generate another
>> new
>> partition. But the two new partitions are required to be at the same node
>> with their parent partition, which can help get high data locality.
>>
>> Is there anyone who knows how to implement it or any hints for it?
>>
>> Thanks in advance,
>> Fei
>>
>>
> 
> 
> -- 
> -- Anastasios Zouzias
> <

> azo@.ibm

> >





-
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Liang-Chi,

Yes, you are right. I implemented the following solution for this problem,
and it works, but I am not sure whether it is efficient:

I double the number of partitions of the parent RDD, and then use the new
partitions and the parent RDD to construct the target RDD. In the compute()
function of the target RDD, I use the input partition to look up the
corresponding parent partition, and return half of the elements of that
parent partition as the output of the compute function.
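
A rough sketch of the idea, for anyone interested (class and member names here
are made up for illustration; it materializes each parent partition to find
its midpoint, which is part of why I am unsure about the efficiency):

    import scala.reflect.ClassTag
    import org.apache.spark.{NarrowDependency, Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // One output partition covers one half of one parent partition.
    case class HalfPartition(index: Int, parentIndex: Int, firstHalf: Boolean)
      extends Partition

    class SplitInTwoRDD[T: ClassTag](prev: RDD[T])
      extends RDD[T](prev.sparkContext, Seq(new NarrowDependency[T](prev) {
        // Output partitions 2i and 2i+1 both depend only on parent partition i.
        override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / 2)
      })) {

      override protected def getPartitions: Array[Partition] =
        prev.partitions.flatMap { p =>
          Seq[Partition](
            HalfPartition(2 * p.index, p.index, firstHalf = true),
            HalfPartition(2 * p.index + 1, p.index, firstHalf = false))
        }

      // Keep each half where its parent partition prefers to live.
      override protected def getPreferredLocations(split: Partition): Seq[String] = {
        val hp = split.asInstanceOf[HalfPartition]
        prev.preferredLocations(prev.partitions(hp.parentIndex))
      }

      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val hp = split.asInstanceOf[HalfPartition]
        // Materialize the parent partition to find its midpoint. Each half-task
        // recomputes (or re-reads) the parent partition unless it is cached.
        val all = prev.iterator(prev.partitions(hp.parentIndex), context).toArray
        val mid = all.length / 2
        (if (hp.firstHalf) all.take(mid) else all.drop(mid)).iterator
      }
    }

It is used as new SplitInTwoRDD(parentRDD); caching the parent RDD first avoids
computing each parent partition twice.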

Thanks,
Fei

On Sun, Jan 15, 2017 at 11:01 PM, Liang-Chi Hsieh  wrote:

>
> Hi,
>
> When calling `coalesce` with `shuffle = false`, it is going to produce at
> most min(numPartitions, previous RDD's number of partitions). So I think it
> can't be used to double the number of partitions.
>
>
> Anastasios Zouzias wrote
> > Hi Fei,
> >
> > How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
> >
> > https://github.com/apache/spark/blob/branch-1.6/core/
> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
> >
> > coalesce is mostly used for reducing the number of partitions before
> > writing to HDFS, but it might still be a narrow dependency (satisfying
> > your
> > requirements) if you increase the # of partitions.
> >
> > Best,
> > Anastasios
> >
> > On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu <
>
> > hufei68@
>
> > > wrote:
> >
> >> Dear all,
> >>
> >> I want to equally divide a RDD partition into two partitions. That
> means,
> >> the first half of elements in the partition will create a new partition,
> >> and the second half of elements in the partition will generate another
> >> new
> >> partition. But the two new partitions are required to be at the same
> node
> >> with their parent partition, which can help get high data locality.
> >>
> >> Is there anyone who knows how to implement it or any hints for it?
> >>
> >> Thanks in advance,
> >> Fei
> >>
> >>
> >
> >
> > --
> > -- Anastasios Zouzias
> > <
>
> > azo@.ibm
>
> > >
>
>
>
>
>
> -
> Liang-Chi Hsieh | @viirya
> Spark Technology Center
> http://www.spark.tc/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Error at starting Phoenix shell with HBase

2017-01-15 Thread Chetan Khatri
Any updates on the above error, guys?


On Fri, Jan 13, 2017 at 9:35 PM, Josh Elser  wrote:

> (-cc dev@phoenix)
>
> phoenix-4.8.2-HBase-1.2-server.jar in the top-level binary tarball of
> Apache Phoenix 4.8.0 is the jar which is meant to be deployed to all
> HBase's classpath.
>
> I would check the RegionServer logs -- I'm guessing that it never started
> correctly or failed. The error message is saying that certain regions in
> the system were never assigned to a RegionServer which only happens in
> exceptional cases.
>
> Chetan Khatri wrote:
>
>> Hello Community,
>>
>> I have installed and configured Apache Phoenix on Single Node Ubuntu 16.04
>> machine:
>> - Hadoop 2.7
>> - HBase 1.2.4
>> - Phoenix -4.8.2-HBase-1.2
>>
>> Copied phoenix-core-4.8.2-HBase-1.2.jar to hbase/lib and confirmed
>> with bin/hbase classpath | grep 'phoenix' and I am using embedded
>> zookeeper, so my hbase-site.xml looks like below:
>>
>> <configuration>
>>   <property>
>>     <name>hbase.rootdir</name>
>>     <value>file:///home/hduser/hbase</value>
>>   </property>
>> </configuration>
>>
>> I am able to read / write to HBase from shell and Apache Spark.
>>
>> *Errors while accessing with **sqlline**:*
>>
>>
>> 1) bin/sqlline.py localhost:2181
>>
>> Error:
>>
>> 1. Command made process hang.
>> 2.
>> Error: ERROR 1102 (XCL02): Cannot get all table regions.
>> (state=XCL02,code=1102)
>> java.sql.SQLException: ERROR 1102 (XCL02): Cannot get all table regions.
>> at
>> org.apache.phoenix.exception.SQLExceptionCode$Factory$1.newE
>> xception(SQLExceptionCode.java:455)
>> at
>> org.apache.phoenix.exception.SQLExceptionInfo.buildException
>> (SQLExceptionInfo.java:145)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.getAllT
>> ableRegions(ConnectionQueryServicesImpl.java:546)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.checkCl
>> ientServerCompatibility(ConnectionQueryServicesImpl.java:1162)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureT
>> ableCreated(ConnectionQueryServicesImpl.java:1068)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.createT
>> able(ConnectionQueryServicesImpl.java:1388)
>> at
>> org.apache.phoenix.schema.MetaDataClient.createTableInternal
>> (MetaDataClient.java:2298)
>> at
>> org.apache.phoenix.schema.MetaDataClient.createTable(MetaDat
>> aClient.java:940)
>> at
>> org.apache.phoenix.compile.CreateTableCompiler$2.execute(Cre
>> ateTableCompiler.java:193)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixState
>> ment.java:344)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement$2.call(PhoenixState
>> ment.java:332)
>> at org.apache.phoenix.call.CallRunner.run(CallRunner.java:53)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement.executeMutation(Pho
>> enixStatement.java:331)
>> at
>> org.apache.phoenix.jdbc.PhoenixStatement.executeUpdate(Phoen
>> ixStatement.java:1423)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl$13.
>> call(ConnectionQueryServicesImpl.java:2352)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl$13.
>> call(ConnectionQueryServicesImpl.java:2291)
>> at
>> org.apache.phoenix.util.PhoenixContextExecutor.call(PhoenixC
>> ontextExecutor.java:76)
>> at
>> org.apache.phoenix.query.ConnectionQueryServicesImpl.init(Co
>> nnectionQueryServicesImpl.java:2291)
>> at
>> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServ
>> ices(PhoenixDriver.java:232)
>> at
>> org.apache.phoenix.jdbc.PhoenixEmbeddedDriver.createConnecti
>> on(PhoenixEmbeddedDriver.java:147)
>> at org.apache.phoenix.jdbc.PhoenixDriver.connect(PhoenixDriver.java:202)
>> at sqlline.DatabaseConnection.connect(DatabaseConnection.java:157)
>> at sqlline.DatabaseConnection.getConnection(DatabaseConnection.java:203)
>> at sqlline.Commands.connect(Commands.java:1064)
>> at sqlline.Commands.connect(Commands.java:996)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
>> ssorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
>> thodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:498)
>> at
>> sqlline.ReflectiveCommandHandler.execute(ReflectiveCommandHa
>> ndler.java:36)
>> at sqlline.SqlLine.dispatch(SqlLine.java:803)
>> at sqlline.SqlLine.initArgs(SqlLine.java:588)
>> at sqlline.SqlLine.begin(SqlLine.java:656)
>> at sqlline.SqlLine.start(SqlLine.java:398)
>> at sqlline.SqlLine.main(SqlLine.java:292)
>> Caused by: org.apache.hadoop.hbase.client.NoServerForRegionException: No
>> server address listed in hbase:meta for region
>> SYSTEM.CATALOG,,1484293041241.0b74311f417f83abe84ae1be4e823de8.
>> containing
>> row
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnection
>> Implementation.locateRegionInMeta(ConnectionManager.java:1318)
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnection
>> Implementation.locateRegion(ConnectionManager.java:1181)
>> at
>> org.apache.hadoop.hbase.client.ConnectionManager$HConnection
>> Implementation.relocateRegi

RE: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread jasbir.sing
Hi,

Coalesce is used to decrease the number of partitions. If you give a value of 
numPartitions greater than the current number of partitions, I don’t think the 
RDD's number of partitions will be increased.

Thanks,
Jasbir

From: Fei Hu [mailto:hufe...@gmail.com]
Sent: Sunday, January 15, 2017 10:10 PM
To: zouz...@cs.toronto.edu
Cc: user @spark ; dev@spark.apache.org
Subject: Re: Equally split a RDD partition into two partition at the same node

Hi Anastasios,

Thanks for your reply. If I just increase the numPartitions to be twice larger, 
how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps the data 
locality? Do I need to define my own Partitioner?

Thanks,
Fei

On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias 
mailto:zouz...@gmail.com>> wrote:
Hi Fei,

How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?

https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L395

coalesce is mostly used for reducing the number of partitions before writing to 
HDFS, but it might still be a narrow dependency (satisfying your requirements) 
if you increase the # of partitions.

Best,
Anastasios

On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu 
mailto:hufe...@gmail.com>> wrote:
Dear all,

I want to equally divide a RDD partition into two partitions. That means, the 
first half of elements in the partition will create a new partition, and the 
second half of elements in the partition will generate another new partition. 
But the two new partitions are required to be at the same node with their 
parent partition, which can help get high data locality.

Is there anyone who knows how to implement it or any hints for it?

Thanks in advance,
Fei




--
-- Anastasios Zouzias




This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise confidential information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the e-mail by you is prohibited. Where allowed by local law, electronic 
communications with Accenture and its affiliates, including e-mail and instant 
messaging (including content), may be scanned by our systems for the purposes 
of information security and assessment of internal compliance with Accenture 
policy.
__

www.accenture.com


Re: Equally split a RDD partition into two partition at the same node

2017-01-15 Thread Fei Hu
Hi Jasbir,

Yes, you are right. Do you have any idea about my question?

Thanks,
Fei

On Mon, Jan 16, 2017 at 12:37 AM,  wrote:

> Hi,
>
>
>
> Coalesce is used to decrease the number of partitions. If you give the
> value of numPartitions greater than the current partition, I don’t think
> RDD number of partitions will be increased.
>
>
>
> Thanks,
>
> Jasbir
>
>
>
> *From:* Fei Hu [mailto:hufe...@gmail.com]
> *Sent:* Sunday, January 15, 2017 10:10 PM
> *To:* zouz...@cs.toronto.edu
> *Cc:* user @spark ; dev@spark.apache.org
> *Subject:* Re: Equally split a RDD partition into two partition at the
> same node
>
>
>
> Hi Anastasios,
>
>
>
> Thanks for your reply. If I just increase the numPartitions to be twice
> larger, how coalesce(numPartitions: Int, shuffle: Boolean = false) keeps
> the data locality? Do I need to define my own Partitioner?
>
>
>
> Thanks,
>
> Fei
>
>
>
> On Sun, Jan 15, 2017 at 3:58 AM, Anastasios Zouzias 
> wrote:
>
> Hi Fei,
>
>
>
> How you tried coalesce(numPartitions: Int, shuffle: Boolean = false) ?
>
>
>
> https://github.com/apache/spark/blob/branch-1.6/core/
> src/main/scala/org/apache/spark/rdd/RDD.scala#L395
> 
>
>
>
> coalesce is mostly used for reducing the number of partitions before
> writing to HDFS, but it might still be a narrow dependency (satisfying your
> requirements) if you increase the # of partitions.
>
>
>
> Best,
>
> Anastasios
>
>
>
> On Sun, Jan 15, 2017 at 12:58 AM, Fei Hu  wrote:
>
> Dear all,
>
>
>
> I want to equally divide a RDD partition into two partitions. That means,
> the first half of elements in the partition will create a new partition,
> and the second half of elements in the partition will generate another new
> partition. But the two new partitions are required to be at the same node
> with their parent partition, which can help get high data locality.
>
>
>
> Is there anyone who knows how to implement it or any hints for it?
>
>
>
> Thanks in advance,
>
> Fei
>
>
>
>
>
>
>
> --
>
> -- Anastasios Zouzias
>
>
>
> --
>
> This message is for the designated recipient only and may contain
> privileged, proprietary, or otherwise confidential information. If you have
> received it in error, please notify the sender immediately and delete the
> original. Any other use of the e-mail by you is prohibited. Where allowed
> by local law, electronic communications with Accenture and its affiliates,
> including e-mail and instant messaging (including content), may be scanned
> by our systems for the purposes of information security and assessment of
> internal compliance with Accenture policy.
> 
> __
>
> www.accenture.com
>