The short answer is that DataSet is not serializable.

I think the main underlying problem is that Flink needs to see all
DataSet operations before launching the job. However, if you have a
DataSet<DataSet<A>>, then the operations on the inner DataSets end up
being specified inside the UDFs of the operations on the outer
DataSet. This is a problem, because Flink cannot look inside the UDFs
while it is building the job plan: the UDFs are executed only after
the job has already started.
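
To make this pre-flight/runtime split concrete, here is a minimal,
illustrative Java sketch (the data and names are made up, not from
your program):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    DataSet<Integer> numbers = env.fromElements(1, 2, 3);

    // This map() call is seen by Flink at "pre-flight" time, while the
    // job plan is being built:
    DataSet<Integer> doubled = numbers.map(new MapFunction<Integer, Integer>() {
        @Override
        public Integer map(Integer x) {
            // This body runs only on the workers, after the plan has
            // been built and submitted. Any DataSet operation written
            // here would come too late to become part of the plan.
            return x * 2;
        }
    });

    doubled.print(); // triggers execution of the plan built above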

There are some workarounds though:

1. If you know that your inner DataSets will be small, then you can
replace them with a regular Java/Scala collection class, such as an
Array or a List (see the first sketch after this list).

2. You can often flatten your data, that is, represent your nested
collection with a flat collection. Exactly how to do this depends on
your use case. For example, suppose that we originally wanted to
represent the lengths of the shortest paths between all pairs of
vertices in a graph by a DataSet that, for every vertex, contains a
DataSet with the distances to all the other vertices:
DataSet<Tuple2<Vertex, DataSet<Tuple2<Vertex, Int>>>>
This doesn't work because of the nested DataSets, but you can flatten
it into the following:
DataSet<Tuple3<Vertex, Vertex, Int>>
which is a DataSet that contains pairs of vertices and their
distances (see the second sketch after this list).
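
For the first workaround, a minimal sketch (assuming vertex ids are
Longs and using made-up distances): the per-vertex distances become a
plain java.util.List field of the record instead of a nested DataSet,
and Flink can serialize the List as part of the record (falling back
to its generic/Kryo serializer):

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // (target vertex id, distance) pairs for vertex 1
    List<Tuple2<Long, Integer>> distancesFrom1 =
        new ArrayList<>(Arrays.asList(Tuple2.of(2L, 3), Tuple2.of(3L, 7)));

    // One record per vertex; the inner collection is an ordinary List
    // carried inside the record, not a DataSet.
    DataSet<Tuple2<Long, List<Tuple2<Long, Integer>>>> distancesPerVertex =
        env.fromElements(Tuple2.of(1L, distancesFrom1));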
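
And for the second workaround, a minimal sketch of the flattened
representation (again with Long vertex ids and made-up distances):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple3;

    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

    // One flat record per vertex pair: (source id, target id, distance)
    DataSet<Tuple3<Long, Long, Integer>> distances = env.fromElements(
        Tuple3.of(1L, 2L, 3),
        Tuple3.of(1L, 3L, 7),
        Tuple3.of(2L, 3L, 4));

    // "The distances from vertex 1", which would have been an inner
    // DataSet in the nested representation, is now just a filter on
    // the flat DataSet:
    DataSet<Tuple3<Long, Long, Integer>> fromVertex1 =
        distances.filter(t -> t.f0 == 1L);

    fromVertex1.print();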

By the way, [1] is a paper that shows how graph data structures with
complex nesting can be represented in Flink.

Best,
Gábor

[1] http://dbs.uni-leipzig.de/file/EPGM.pdf





2016-11-15 17:37 GMT+01:00 otherwise777 <wou...@onzichtbaar.net>:
> It seems what i tried did indeed not work.
> Can you explain me why that doesn't work though?
>
>
>
