Hi,
@Ted:
> Is it possible to prune (unneeded) field(s) so that heap requirement is
> lower ?
The XmlInputFormat [0] splits the raw data into smaller chunks, which
are then further processed. I don't think I can reduce the size of the
(Tuple2) fields. The major difference to Mahout's
XmlInputFormat i
Hi Ted,
sure.
Here's the stack trace with .distinct() with the Exception in the
'SortMerger Reading Thread': [1]
Here's the stack trace without .distinct() and the 'Requested array
size exceeds VM limit' error: [2]
If you need anything else, I can more or less reliably reproduce the issue.
T
Hi,
I removed the .distinct() and ran another test.
Without filtering duplicate entries, the Job processes more data and
runs much longer, but eventually fails with the following error:
> java.lang.OutOfMemoryError: Requested array size exceeds VM limit
Even then, playing around with the aforeme
Hi,
the code is part of a bigger project, so I'll try to outline the
methods used and their order:
# Step 1
- Reading a Wikipedia XML Dump into a DataSet of -tag delimited
strings using XmlInputFormat.
- A .distinct() operation removes all duplicates based on the content.
- .map() is used to par
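For illustration, a stripped-down version of that first step could look
like this (a sketch only: paths, tag values, and the readHadoopFile
variant are assumptions, not the actual project code):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
// plus Mahout's XmlInputFormat (its package differs between Mahout versions)

public class Step1Sketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Tell Mahout's XmlInputFormat which tags delimit a record.
        Job job = Job.getInstance();
        job.getConfiguration().set("xmlinput.start", "<page>");
        job.getConfiguration().set("xmlinput.end", "</page>");

        DataSet<Tuple2<LongWritable, Text>> raw = env.readHadoopFile(
            new XmlInputFormat(), LongWritable.class, Text.class,
            "hdfs:///path/to/wikipedia-dump.xml", job);

        DataSet<String> pages = raw
            .map(new MapFunction<Tuple2<LongWritable, Text>, String>() {
                @Override
                public String map(Tuple2<LongWritable, Text> t) {
                    return t.f1.toString(); // keep only the XML payload
                }
            })
            .distinct(); // remove duplicates by content

        pages.writeAsText("hdfs:///path/to/out");
        env.execute("step 1 sketch");
    }
}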
Hi Flavio,
thanks for pointing me to your old thread.
I don't have administrative rights on the cluster, but from what dmesg
reports, I could not find anything that looks like an OOM message.
So no luck for me, I guess...
Best,
Sebastian
Hi Ted,
thanks for bringing this to my attention.
I just rechecked my Java version and it is indeed version 8. Both the
code and the Flink environment run that version.
Cheers,
Sebastian
Hi Kurt,
thanks for the input.
What do you mean by "try to disable your combiner"? Any tips on how I
can do that?
I don't actively use any combine* DataSet API functions, so the calls to
the SynchronousChainedCombineDriver come from Flink.
Kind regards,
Sebastian
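(For anyone finding this later: the only related knob I know of in the
DataSet API is setCombinable on a group-reduce, so I'd guess "disabling
the combiner" means something like the sketch below. Whether this
applies to the combiners Flink inserts on its own is an assumption on
my part.)

import org.apache.flink.api.common.functions.GroupReduceFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class CombinerSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Tuple2<String, Integer>> input =
            env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 2), Tuple2.of("b", 3));

        DataSet<Tuple2<String, Integer>> summed = input
            .groupBy(0)
            .reduceGroup(new GroupReduceFunction<Tuple2<String, Integer>,
                                                 Tuple2<String, Integer>>() {
                @Override
                public void reduce(Iterable<Tuple2<String, Integer>> values,
                                   Collector<Tuple2<String, Integer>> out) {
                    String key = null;
                    int sum = 0;
                    for (Tuple2<String, Integer> v : values) {
                        key = v.f0;
                        sum += v.f1;
                    }
                    out.collect(Tuple2.of(key, sum));
                }
            })
            .setCombinable(false); // explicitly opt out of the combine phase

        summed.print();
    }
}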
Hi Stefan,
thanks for the answer and the advice, which I've already seen in another
email.
Anyway, I played around with the taskmanager.numberOfTaskSlots and
taskmanager.memory.fraction options. I noticed that decreasing the
former and increasing the latter led to longer execution and more
proce
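For reference, both options live in flink-conf.yaml. The values below
only illustrate the direction I changed them in, not a recommendation:

# flink-conf.yaml
taskmanager.numberOfTaskSlots: 2   # decreased
taskmanager.memory.fraction: 0.8   # increased (default is 0.7)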
Hi,
when I'm running my Flink job on a small dataset, it successfully
finishes. However, when a bigger dataset is used, I get multiple exceptions:
- Caused by: java.io.IOException: Cannot write record to fresh sort
buffer. Record too large.
- Thread 'SortMerger Reading Thread' terminated due to
Hi,
thanks for the help!
Making the class fooo static did the trick.
I was just a bit confused, because I'm using a similar construction
somewhere else in the code and it works flawlessly.
Best regards,
Sebastian
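In case someone else trips over this, the difference boils down to the
hidden outer reference (class names here are made up for the example):

import org.apache.flink.api.common.functions.FilterFunction;

public class Job {
    // Non-static inner class: it silently references the enclosing Job
    // instance, so Flink tries to serialize Job as well -- and may fail.
    public class BrokenFilter implements FilterFunction<String> {
        @Override
        public boolean filter(String value) { return !value.isEmpty(); }
    }

    // Static nested class: no outer reference, serializes on its own.
    public static class WorkingFilter implements FilterFunction<String> {
        @Override
        public boolean filter(String value) { return !value.isEmpty(); }
    }
}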
Hi,
I've heard of some methods that trigger an execution when using the
Batch API:
- print
- collect
- count
- execute
Some of them are discussed in older docs [0], but I can't find a good
list or hints in the newer ones. Are there any other methods?
Best regards,
Sebastian
[0]
https://ci.ap
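To make the difference concrete, here's my current understanding as a
small sketch (recent Flink versions; print/collect/count each run the
job themselves, while execute() runs whatever lazy sinks were declared):

import java.util.List;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class TriggerSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Integer> data = env.fromElements(1, 2, 3);

        data.print();                         // eager: runs the job, prints the result
        List<Integer> local = data.collect(); // eager: runs the job, returns a List
        long n = data.count();                // eager: runs the job, returns the count

        data.writeAsText("/tmp/out");         // lazy: only declares a sink ...
        env.execute("trigger sketch");        // ... execute() actually runs it
    }
}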
Hello Apache Flink users,
I implemented a FilterFunction some months ago that worked quite well
back then. However, I wanted to check it out right now and it somehow
broke in the sense that Flink can't serialize it anymore. I might be
mistaken, but afaik I didn't touch the code at all.
I think th
Hi Till,
just to clarify and for my understanding.
Let's assume we have the following Bzip2 file:
|A.BA.B|A...B|A|..BA.|...BA|B|A...B|
|  1   |  2  |3|  4  |  5  |6|  7  | ("block number")
Hi Robert,
sorry for the long delay.
> I wonder why the decompression with the XmlInputFormat doesn't work. Did
> you get any exception?
I didn't get any exception, it just seems to read nothing (or at least
doesn't match any opening/closing tags).
I dug a bit into the code and found out that
Hi,
thanks! That's exactly what I needed.
I'm now using: DataSetA.leftOuterJoin(DataSetB).where(new
KeySelector()).equalTo(new KeySelector()).with(new JoinFunction(...)).
Now I get the following error:
> Caused by: org.apache.flink.optimizer.CompilerException: Error translating
> node 'Map "Ke
Hi,
is it possible to assign a "default" value to elements that didn't match?
For example I have the following two datasets:
|DataSetA | DataSetB|
|---------|---------|
|id=1     | id=1    |
|id=2     | id=3    |
|id=5     | id=4    |
|id=6     | id=6    |
When doing a join with:
A.join(B).where( KeySelector(A.id
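For concreteness, one way this could work is an outer join, where the
unmatched side comes back as null and a JoinFunction substitutes the
default (a sketch; the Long ids and the -1 default are made up):

import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;

public class OuterJoinSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Long> a = env.fromElements(1L, 2L, 5L, 6L);
        DataSet<Long> b = env.fromElements(1L, 3L, 4L, 6L);

        // On a left outer join the right side is null for non-matching ids,
        // so the JoinFunction can fill in a default there.
        DataSet<Tuple2<Long, Long>> joined = a
            .leftOuterJoin(b)
            .where(new KeySelector<Long, Long>() {
                @Override
                public Long getKey(Long v) { return v; }
            })
            .equalTo(new KeySelector<Long, Long>() {
                @Override
                public Long getKey(Long v) { return v; }
            })
            .with(new JoinFunction<Long, Long, Tuple2<Long, Long>>() {
                @Override
                public Tuple2<Long, Long> join(Long left, Long right) {
                    return Tuple2.of(left, right == null ? -1L : right);
                }
            });

        joined.print();
    }
}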
Hi,
what's the best way to read a compressed (bz2 / gz) XML file splitting
it at a specific XML-tag?
So far I've been using hadoop's TextInputFormat in combination with
Mahout's XmlInputFormat ([0]) with env.readHadoopFile(). Whereas the
plain TextInputFormat can handle compressed data, the XmlInp
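For context, the working part looks roughly like this; swapping
TextInputFormat for the XmlInputFormat is where the decompression
support ends (paths are placeholders, and I'm assuming the
mapreduce-API variant of readHadoopFile):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class CompressedReadSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // TextInputFormat picks a codec based on the file extension
        // (.gz, .bz2) and decompresses transparently.
        DataSet<Tuple2<LongWritable, Text>> lines = env.readHadoopFile(
            new TextInputFormat(), LongWritable.class, Text.class,
            "hdfs:///path/to/dump.xml.bz2", Job.getInstance());

        lines.print();
    }
}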
Hi Chesnay,
thanks for the input. Finding a word's first occurrence is part of the
algorithm.
To be exact I'm trying to implement Adler's Text authorship tracking in
flink (http://www2007.org/papers/paper692.pdf, page 266).
Thanks,
Sebastian
Hi Kostas,
thanks for the quick reply.
> If T_1 must be processed before T_i, i>1, then you cannot parallelize the
> algorithm.
What would be the best way to process it anyway?
DataSet.collect() -> loop over List -> env.fromCollection(...) ?
Or with a parallelism of 1 and a .map(...) ?
Howeve
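For the record, the collect-based variant would look something like
this (a sketch; it pulls every text to the client, so it only works
when the data fits into memory):

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class SequentialSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> texts = env.fromElements("T1", "T2", "T3");

        // Pull all texts to the driver and process them strictly in order.
        List<String> ordered = texts.collect();
        List<String> results = new ArrayList<>();
        for (String t : ordered) {
            results.add(t.toLowerCase()); // placeholder for the real per-text step
        }

        // Hand the sequential result back to Flink for parallel follow-up work.
        env.fromCollection(results).print();
    }
}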
Hello,
I'd like to implement an algorithm which doesn't really look
parallelizable to me, but maybe there's a way around it:
In general the algorithm looks like this:
1. Take a list of texts T_1 ... T_n
2. For every text T_i (i > 1) do
2.1: Split text into a list of words W_1 ... W_m
2.2: For ev
Hi,
I'm also interested in that question/solution.
For now, my workaround looks like this:
> DataSet<...> .filter(... object.Id == NeededElement.Id ...
).collect().get(0)
I filter the DataSet for the element I want to find, collect it into a
List which then returns the first element.
That's a
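Spelled out (neededId stands in for NeededElement.Id), the workaround
is roughly this; note that every lookup triggers a full job, and
.get(0) assumes the element actually exists:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class LookupSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Long> ids = env.fromElements(1L, 2L, 5L, 6L);

        final long neededId = 5L; // placeholder for NeededElement.Id
        Long match = ids
            .filter(v -> v == neededId) // keep only the wanted element
            .collect()                  // runs the job, returns a List
            .get(0);                    // fails if nothing matched

        System.out.println(match);
    }
}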
Hi,
Wow, that's indeed a nice solution.
Your version still threw some errors:
> Caused by: org.apache.flink.api.common.InvalidProgramException: Object
> factorytest.Config$1@5143c662 not serializable
> Caused by: java.io.NotSerializableException: factorytest.factory.PearFactory
I fixed this
Hi Chesnay,
thank you for looking into this!
Is there any way to tell Flink to (re)sync the changed classes and/or
tell it to distribute the serialized classes at a given point (e.g.
first on a env.execute() ) or so?
The thing is that I'm working on a small framework which is based on
Flink, so pa
Job.java:
https://gitlab.tubit.tu-berlin.de/gehaxelt/flink-factorytest/blob/master/src/main/java/factorytest/Job.java
Config.java:
https://gitlab.tubit.tu-berlin.de/gehaxelt/flink-factorytest/blob/master/src/main/java/factorytest/Config.java
Example run:
https://gitlab.tubit.tu-berlin.de/gehaxelt/flink-factorytest/blob/master/EXAMPLE_RUN_OUTPUT.txt
Kind regards,
Sebastian Neef