Thanks Sebastian! What do you mean by "driver"? Before submitting to the cluster? Knowing the dataset size is OK.
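[Editor's note: Sebastian's driver-side suggestion (quoted below) can be sketched in plain Java, outside Flink. The class and method names (`RandomSplitMapping`, `assignSplits`) are hypothetical, not from the thread: the driver, knowing the dataset size n, shuffles the indices 0..n-1 and cuts them according to the ratios; joining the dataset with this index-to-split mapping then yields splits with exactly the requested sizes.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomSplitMapping {

    /**
     * Builds a mapping from datapoint index to split id with exact split
     * sizes: shuffle all indices, then cut the shuffled list according to
     * the requested ratios.
     */
    static int[] assignSplits(int n, double[] ratios, long seed) {
        // Shuffled list of all datapoint indices 0..n-1.
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        int[] splitOf = new int[n];
        int pos = 0;
        for (int s = 0; s < ratios.length; s++) {
            // The last split takes the remainder so the sizes sum to n.
            int size = (s == ratios.length - 1)
                    ? n - pos
                    : (int) Math.round(n * ratios[s]);
            for (int k = 0; k < size; k++) {
                splitOf[indices.get(pos++)] = s;
            }
        }
        return splitOf;
    }

    public static void main(String[] args) {
        // 60/20/20 train/validation/test split of 10 datapoints.
        int[] splitOf = assignSplits(10, new double[]{0.6, 0.2, 0.2}, 42L);
        int[] counts = new int[3];
        for (int s : splitOf) counts[s]++;
        System.out.println(Arrays.toString(counts)); // exact sizes: [6, 2, 2]
    }
}
```

In Flink terms, the `splitOf` array would be turned into a `DataSet` of (index, splitId) pairs on the driver and joined with the data on the element index; the split sizes are exact regardless of the random seed.
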
On Wed, Jun 24, 2015 at 11:08 AM, Sebastian <s...@apache.org> wrote:

> A very simple way to achieve this is to generate a random variate on the
> driver that describes a mapping of datapoints to samples. Then you simply
> join the dataset with this mapping to generate the samples.
>
> This approach requires you to know the size of the dataset in advance, but
> has the advantage that you can guarantee the sizes of the samples and can
> easily support more involved techniques such as sampling with replacement.
>
> --sebastian
>
> On 24.06.2015 10:38, Maximilian Alber wrote:
>
>> That's not the point. In machine learning one often divides a data set X
>> into, e.g., three sets: one for training, one for validation, and one
>> for the final testing. The sets are usually created randomly according
>> to some ratio. Thus it is important to keep the ratio and to do the
>> whole process randomly.
>>
>> Cheers,
>> Max
>>
>> On Wed, Jun 24, 2015 at 9:51 AM, Stephan Ewen <se...@apache.org> wrote:
>>
>> If you do "rebalance()", it will redistribute elements in a round-robin
>> fashion, which should give you very even partition sizes.
>>
>> On Tue, Jun 23, 2015 at 11:51 AM, Maximilian Alber
>> <alber.maximil...@gmail.com> wrote:
>>
>> Thank you!
>>
>> Still, I cannot guarantee the size of each partition, or can I?
>> Something like randomSplit in Spark.
>>
>> Cheers,
>> Max
>>
>> On Mon, Jun 15, 2015 at 5:46 PM, Matthias J. Sax
>> <mj...@informatik.hu-berlin.de> wrote:
>>
>> Hi,
>>
>> using partitionCustom, the data distribution depends only on your
>> probability distribution. If it is uniform, you should be fine, i.e.,
>> choosing the channel like
>>
>> > private final Random random = new Random(System.currentTimeMillis());
>> >
>> > public int partition(K key, int numPartitions) {
>> >     return random.nextInt(numPartitions);
>> > }
>>
>> should do the trick.
>> -Matthias
>>
>> On 06/15/2015 05:41 PM, Maximilian Alber wrote:
>> > Thanks!
>> >
>> > Ok, so for a random shuffle I need partitionCustom. But in that case
>> > the data might be out of balance?
>> >
>> > For the splitting: is there no way to get exact sizes?
>> >
>> > Cheers,
>> > Max
>> >
>> > On Mon, Jun 15, 2015 at 2:26 PM, Till Rohrmann <trohrm...@apache.org>
>> > wrote:
>> >
>> >     Hi Max,
>> >
>> >     you can always shuffle your elements using the |rebalance| method.
>> >     What Flink does here is distribute the elements of each partition
>> >     among all available TaskManagers. This happens in a round-robin
>> >     fashion and is thus not completely random.
>> >
>> >     A different means is the |partitionCustom| method, which allows
>> >     you to specify for each element the partition to which it shall be
>> >     sent. You would have to specify a |Partitioner| to do this.
>> >
>> >     For the splitting there is at the moment no syntactic sugar. What
>> >     you can do, though, is to assign each item a split ID and then use
>> >     a |filter| operation to extract the individual splits. Depending
>> >     on your split ID distribution you will get differently sized
>> >     splits.
>> >
>> >     Cheers,
>> >     Till
>> >
>> >     On Mon, Jun 15, 2015 at 1:50 PM Maximilian Alber
>> >     <alber.maximil...@gmail.com> wrote:
>> >
>> >         Hi Flinksters,
>> >
>> >         I would like to shuffle the elements in my data set and then
>> >         split it in two according to some ratio. Each element in the
>> >         data set has a unique id. Is there a nice way to do this with
>> >         the Flink API?
>> >         (It would be nice to have guaranteed random shuffling.)
>> >         Thanks!
>> >
>> >         Cheers,
>> >         Max
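[Editor's note: Till's split-ID-plus-filter idea above can also be sketched in plain Java on a collection. The class and method names (`SplitIdExample`, `randomSplit`) are hypothetical; in Flink the tagging step would be a map assigning the split ID and each split a |filter|. Because each element is tagged independently, the split sizes only approximate the ratio, which is exactly the limitation Max ran into.]

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SplitIdExample {

    /**
     * Tags every element with a pseudo-random split id (0 = train,
     * 1 = test) and filters the two splits out. Sizes are only
     * approximately ratio * n, since elements are assigned independently.
     */
    static List<List<Integer>> randomSplit(List<Integer> data,
                                           double trainRatio, long seed) {
        Random random = new Random(seed);

        // Step 1: assign each (distinct) element a split id.
        Map<Integer, Integer> splitId = new HashMap<>();
        for (Integer x : data) {
            splitId.put(x, random.nextDouble() < trainRatio ? 0 : 1);
        }

        // Step 2: one filter pass per split.
        List<Integer> train = data.stream()
                .filter(x -> splitId.get(x) == 0).collect(Collectors.toList());
        List<Integer> test = data.stream()
                .filter(x -> splitId.get(x) == 1).collect(Collectors.toList());
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        List<Integer> data =
                IntStream.range(0, 1000).boxed().collect(Collectors.toList());
        List<List<Integer>> splits = randomSplit(data, 0.7, 42L);
        // Roughly 700/300, but not exactly -- that is the trade-off versus
        // the driver-side mapping, which guarantees exact sizes.
        System.out.println(splits.get(0).size() + " / " + splits.get(1).size());
    }
}
```
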