Hello, I'd like to implement an algorithm which doesn't really look parallelizable to me, but maybe there's a way around it:
In general the algorithm looks like this: 1. Take a list of texts T_1 ... T_n 2. For every text T_i (i > 1) do 2.1: Split text into a list of words W_1 ... W_m 2.2: For every word W_j do: 2.2.1.: Check if word already existed in a prior text T_k ( i > k ) 2.2.2.: If so, mark word W_j with k 2.2.3.: Else mark word W_j with i 3. Do something with texts based on marked words... I have a DataSet<Text> with all texts T_1...T_n. As far as I understand, I cannot simply .map(...) the DataSet, because T_i's calculation is based on previous results (i.e. T_(i-1)). My current solution would be to set the parallelism to 1. - Is there an elegant way to parallelize this algorithm? - Does setting parallelism=1 guarantee a specific order of the DataSet? - Is there a way to check if an element exists in a DataSet? E.g. DataSet<>.contains(elem)? Best regards, Sebastian