Hi Sebastian, If T_1 must be processed before T_i, i>1, then you cannot parallelize the algorithm.
If this is not a restriction, then you could; 1) split the text in words and also attach the id of the text they appear in, 2) do a groupBy that will send all the same words to the same node, 3) keep a “per-word” state with the word and the index of the text, 4) when a new word arrives you should check if the word already exists in the state. Regards, Kostas > On Jan 5, 2017, at 11:51 AM, Sebastian Neef <gehax...@mailbox.tu-berlin.de> > wrote: > > Hello, > > I'd like to implement an algorithm which doesn't really look > parallelizable to me, but maybe there's a way around it: > > In general the algorithm looks like this: > > 1. Take a list of texts T_1 ... T_n > 2. For every text T_i (i > 1) do > 2.1: Split text into a list of words W_1 ... W_m > 2.2: For every word W_j do: > 2.2.1.: Check if word already existed in a prior text T_k ( i > k ) > 2.2.2.: If so, mark word W_j with k > 2.2.3.: Else mark word W_j with i > 3. Do something with texts based on marked words... > > I have a DataSet<Text> with all texts T_1...T_n. > > As far as I understand, I cannot simply .map(...) the DataSet, because > T_i's calculation is based on previous results (i.e. T_(i-1)). > > My current solution would be to set the parallelism to 1. > > - Is there an elegant way to parallelize this algorithm? > - Does setting parallelism=1 guarantee a specific order of the DataSet? > - Is there a way to check if an element exists in a DataSet? E.g. > DataSet<>.contains(elem)? > > Best regards, > Sebastian