Thanks Sebastian! What do you mean by "driver"? Before submitting to the cluster? Knowing the dataset size is OK.
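[Editor's note: Sebastian's driver-side suggestion (quoted below) can be sketched in plain Java, outside Flink. The class and method names (`RandomSplitMapping`, `assignSplits`) are hypothetical, not from the thread: the driver, knowing the dataset size n, shuffles the indices 0..n-1 and cuts them according to the ratios; joining the dataset with this index-to-split mapping then yields splits with exactly the requested sizes.]

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class RandomSplitMapping {

    /**
     * Builds a mapping from datapoint index to split id with exact split
     * sizes: shuffle all indices, then cut the shuffled list according to
     * the requested ratios.
     */
    static int[] assignSplits(int n, double[] ratios, long seed) {
        // Shuffled list of all datapoint indices 0..n-1.
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));

        int[] splitOf = new int[n];
        int pos = 0;
        for (int s = 0; s < ratios.length; s++) {
            // The last split takes the remainder so the sizes sum to n.
            int size = (s == ratios.length - 1)
                    ? n - pos
                    : (int) Math.round(n * ratios[s]);
            for (int k = 0; k < size; k++) {
                splitOf[indices.get(pos++)] = s;
            }
        }
        return splitOf;
    }

    public static void main(String[] args) {
        // 60/20/20 train/validation/test split of 10 datapoints.
        int[] splitOf = assignSplits(10, new double[]{0.6, 0.2, 0.2}, 42L);
        int[] counts = new int[3];
        for (int s : splitOf) counts[s]++;
        System.out.println(Arrays.toString(counts)); // exact sizes: [6, 2, 2]
    }
}
```

In Flink terms, the `splitOf` array would be turned into a `DataSet` of (index, splitId) pairs on the driver and joined with the data on the element index; the split sizes are exact regardless of the random seed.
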
On Wed, Jun 24, 2015 at 11:08 AM, Sebastian <s...@apache.org> wrote:

> A very simple way to achieve this is to generate a random variate on the
> driver that describes a mapping of datapoints to samples. Then you simply
> join the dataset with this mapping to generate the samples.
>
> This approach requires you to know the size of the dataset in advance, but
> has the advantage that you can guarantee the sizes of the samples and can
> easily support more involved techniques such as sampling with replacement.
>
> --sebastian
>
> On 24.06.2015 10:38, Maximilian Alber wrote:
>
>> That's not the point. In machine learning one often divides a data set X
>> into, e.g., three sets: one for training, one for validation, and one
>> for the final testing. The sets are usually created randomly according
>> to some ratio. Thus it is important to keep the ratio and to do the
>> whole process randomly.
>>
>> Cheers,
>> Max
>>
>> On Wed, Jun 24, 2015 at 9:51 AM, Stephan Ewen <se...@apache.org> wrote:
>>
>> If you do "rebalance()", it will redistribute elements in a round-robin
>> fashion, which should give you very even partition sizes.
>>
>> On Tue, Jun 23, 2015 at 11:51 AM, Maximilian Alber
>> <alber.maximil...@gmail.com> wrote:
>>
>> Thank you!
>>
>> Still, I cannot guarantee the size of each partition, or can I?
>> Something like randomSplit in Spark.
>>
>> Cheers,
>> Max
>>
>> On Mon, Jun 15, 2015 at 5:46 PM, Matthias J. Sax
>> <mj...@informatik.hu-berlin.de> wrote:
>>
>> Hi,
>>
>> using partitionCustom, the data distribution depends only on your
>> probability distribution. If it is uniform, you should be fine, i.e.,
>> choosing the channel like
>>
>> > private final Random random = new Random(System.currentTimeMillis());
>> >
>> > public int partition(K key, int numPartitions) {
>> >     return random.nextInt(numPartitions);
>> > }
>>
>> should do the trick.
>> -Matthias
>>
>> On 06/15/2015 05:41 PM, Maximilian Alber wrote:
>> > Thanks!
>> >
>> > Ok, so for a random shuffle I need partitionCustom. But in that case
>> > the data might be out of balance?
>> >
>> > For the splitting: is there no way to get exact sizes?
>> >
>> > Cheers,
>> > Max
>> >
>> > On Mon, Jun 15, 2015 at 2:26 PM, Till Rohrmann <trohrm...@apache.org>
>> > wrote:
>> >
>> >     Hi Max,
>> >
>> >     you can always shuffle your elements using the |rebalance| method.
>> >     What Flink does here is distribute the elements of each partition
>> >     among all available TaskManagers. This happens in a round-robin
>> >     fashion and is thus not completely random.
>> >
>> >     A different means is the |partitionCustom| method, which allows
>> >     you to specify for each element the partition to which it shall be
>> >     sent. You would have to specify a |Partitioner| to do this.
>> >
>> >     For the splitting there is at the moment no syntactic sugar. What
>> >     you can do, though, is to assign each item a split ID and then use
>> >     a |filter| operation to extract the individual splits. Depending
>> >     on your split ID distribution you will get differently sized
>> >     splits.
>> >
>> >     Cheers,
>> >     Till
>> >
>> >     On Mon, Jun 15, 2015 at 1:50 PM Maximilian Alber
>> >     <alber.maximil...@gmail.com> wrote:
>> >
>> >         Hi Flinksters,
>> >
>> >         I would like to shuffle the elements in my data set and then
>> >         split it in two according to some ratio. Each element in the
>> >         data set has a unique id. Is there a nice way to do this with
>> >         the Flink API?
>> >         (It would be nice to have guaranteed random shuffling.)
>> >         Thanks!
>> >
>> >         Cheers,
>> >         Max
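[Editor's note: Till's split-ID-plus-filter idea above can also be sketched in plain Java on a collection. The class and method names (`SplitIdExample`, `randomSplit`) are hypothetical; in Flink the tagging step would be a map assigning the split ID and each split a |filter|. Because each element is tagged independently, the split sizes only approximate the ratio, which is exactly the limitation Max ran into.]

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SplitIdExample {

    /**
     * Tags every element with a pseudo-random split id (0 = train,
     * 1 = test) and filters the two splits out. Sizes are only
     * approximately ratio * n, since elements are assigned independently.
     */
    static List<List<Integer>> randomSplit(List<Integer> data,
                                           double trainRatio, long seed) {
        Random random = new Random(seed);

        // Step 1: assign each (distinct) element a split id.
        Map<Integer, Integer> splitId = new HashMap<>();
        for (Integer x : data) {
            splitId.put(x, random.nextDouble() < trainRatio ? 0 : 1);
        }

        // Step 2: one filter pass per split.
        List<Integer> train = data.stream()
                .filter(x -> splitId.get(x) == 0).collect(Collectors.toList());
        List<Integer> test = data.stream()
                .filter(x -> splitId.get(x) == 1).collect(Collectors.toList());
        return Arrays.asList(train, test);
    }

    public static void main(String[] args) {
        List<Integer> data =
                IntStream.range(0, 1000).boxed().collect(Collectors.toList());
        List<List<Integer>> splits = randomSplit(data, 0.7, 42L);
        // Roughly 700/300, but not exactly -- that is the trade-off versus
        // the driver-side mapping, which guarantees exact sizes.
        System.out.println(splits.get(0).size() + " / " + splits.get(1).size());
    }
}
```
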