Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-15 Thread Sebastian Neef
Hi, @Ted: > Is it possible to prune (unneeded) field(s) so that heap requirement is > lower ? The XmlInputFormat [0] splits the raw data into smaller chunks, which are then further processed. I don't think I can reduce the field's (Tuple2) sizes. The major difference to Mahout's XmlInputFormat i

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-14 Thread Sebastian Neef
Hi Ted, sure. Here's the stack trace with .distinct() with the Exception in the 'SortMerger Reading Thread': [1] Here's the stack trace without .distinct() and the 'Requested array size exceeds VM limit' error: [2] If you need anything else, I can more or less reliably reproduce the issue. T

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-14 Thread Sebastian Neef
Hi, I removed the .distinct() and ran another test. Without filtering duplicate entries, the Job processes more data and runs much longer, but eventually fails with the following error: > java.lang.OutOfMemoryError: Requested array size exceeds VM limit Even then playing around with the aforeme

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-13 Thread Sebastian Neef
Hi, the code is part of a bigger project, so I'll try to outline the used methods and their order: # Step 1 - Reading a Wikipedia XML Dump into a DataSet of <page>-tag delimited strings using XmlInputFormat. - A .distinct() operation removes all duplicates based on the content. - .map() is used to par
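
A minimal sketch of the pipeline as outlined above; the input path, the placeholder parse step, and the use of readTextFile in place of the actual XmlInputFormat source are all assumptions for brevity:

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class DumpPipelineSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            // Stand-in source: the thread actually uses an XmlInputFormat that
            // splits the dump into <page>-delimited string chunks.
            DataSet<String> chunks = env.readTextFile("hdfs:///wikipedia/dump.xml");
            DataSet<String> unique = chunks.distinct(); // dedup by content
            DataSet<Integer> parsed = unique.map(new MapFunction<String, Integer>() {
                @Override
                public Integer map(String chunk) {
                    return chunk.length(); // placeholder for the real parsing
                }
            });
            parsed.print();
        }
    }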

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-13 Thread Sebastian Neef
Hi Flavio, thanks for pointing me to your old thread. I don't have administrative rights on the cluster, but from what dmesg reports, I could not find anything that looks like an OOM message. So no luck for me, I guess... Best, Sebastian

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-13 Thread Sebastian Neef
Hi Ted, thanks for bringing this to my attention. I just rechecked my Java version and it is indeed version 8. Both the code and the Flink environment run that version. Cheers, Sebastian

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-13 Thread Sebastian Neef
Hi Kurt, thanks for the input. What do you mean by "try to disable your combiner"? Any tips on how I can do that? I don't actively use any combine* DataSet API functions, so the calls to the SynchronousChainedCombineDriver come from Flink. Kind regards, Sebastian
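
For context, a hedged sketch of what "disabling the combiner" could look like in practice: replacing .distinct() with a groupBy/reduceGroup whose GroupReduceFunction does not implement GroupCombineFunction, so Flink schedules no combine phase (String records assumed for brevity):

    import org.apache.flink.api.common.functions.GroupReduceFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.functions.KeySelector;
    import org.apache.flink.util.Collector;

    public class DistinctWithoutCombiner {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<String> input = env.fromElements("a", "b", "a", "c");
            // A plain GroupReduceFunction that does not implement
            // GroupCombineFunction is executed without a combine phase.
            DataSet<String> unique = input
                .groupBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String value) { return value; }
                })
                .reduceGroup(new GroupReduceFunction<String, String>() {
                    @Override
                    public void reduce(Iterable<String> values, Collector<String> out) {
                        out.collect(values.iterator().next()); // keep one record per key
                    }
                });
            unique.print();
        }
    }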

Re: Cannot write record to fresh sort buffer. Record too large.

2017-06-12 Thread Sebastian Neef
Hi Stefan, thanks for the answer and the advice, which I've already seen in another email. Anyway, I played around with the taskmanager.numberOfTaskSlots and taskmanager.memory.fraction options. I noticed that decreasing the former and increasing the latter led to longer execution and more proce
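
For reference, the two options mentioned above live in flink-conf.yaml; the values below are purely illustrative, not recommendations:

    # flink-conf.yaml (illustrative values)
    taskmanager.numberOfTaskSlots: 1   # fewer slots -> more memory per slot
    taskmanager.memory.fraction: 0.8   # larger share of managed memory for sorting/hashing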

Cannot write record to fresh sort buffer. Record too large.

2017-06-12 Thread Sebastian Neef
Hi, when I'm running my Flink job on a small dataset, it successfully finishes. However, when a bigger dataset is used, I get multiple exceptions: - Caused by: java.io.IOException: Cannot write record to fresh sort buffer. Record too large. - Thread 'SortMerger Reading Thread' terminated due to

Re: Weird serialization bug?

2017-05-10 Thread Sebastian Neef
Hi, thanks for the help! Making the class fooo static did the trick. I was just a bit confused, because I'm using a similar construction somewhere else in the code and it works flawlessly. Best regards, Sebastian
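
For readers hitting the same issue: a non-static inner class keeps an implicit reference to its enclosing instance, which drags the outer class into serialization. A minimal sketch of the fix (fooo is the thread's class; MyJob/MyFilter are stand-in names):

    import org.apache.flink.api.common.functions.FilterFunction;

    public class MyJob {
        // Declared static: no hidden reference to the enclosing MyJob instance,
        // so Flink can serialize the function on its own.
        public static class MyFilter implements FilterFunction<String> {
            @Override
            public boolean filter(String value) {
                return !value.isEmpty();
            }
        }
    }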

Methods that trigger execution

2017-05-03 Thread Sebastian Neef
Hi, I've heard of some methods that trigger an execution when using the Batch API: - print - collect - count - execute Some of them are discussed in older docs [0], but I can't find a good list or hints in the newer ones. Are there any other methods? Best regards, Sebastian [0] https://ci.ap
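
A small sketch contrasting lazy transformations with the eagerly-executing methods listed above (batch DataSet API; names and the output path are made up):

    import java.util.List;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class TriggerSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Integer> data = env.fromElements(1, 2, 3);
            DataSet<Integer> doubled = data.map(new MapFunction<Integer, Integer>() {
                @Override
                public Integer map(Integer x) { return x * 2; } // lazy, nothing runs yet
            });

            doubled.print();                        // triggers execution
            long n = doubled.count();               // triggers execution again
            List<Integer> all = doubled.collect();  // triggers execution again

            doubled.writeAsText("/tmp/out");        // only declares a sink...
            env.execute("trigger-sketch");          // ...execute() actually runs it
        }
    }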

Weird serialization bug?

2017-04-29 Thread Sebastian Neef
Hello Apache Flink users, I implemented a FilterFunction some months ago that worked quite well back then. However, I wanted to check it out right now and it somehow broke in the sense that Flink can't serialize it anymore. I might be mistaken, but afaik I didn't touch the code at all. I think th

Re: Reading compressed XML data

2017-02-24 Thread Sebastian Neef
Hi Till, just to clarify and for my understanding. Let's assume we have the following Bzip2 file:

    |A.BA.B|A...B|A|..BA.|...BA|B|A...B|
    |1     |2    |3|4    |5    |6|7    |   ("block number")

Re: Reading compressed XML data

2017-02-16 Thread Sebastian Neef
Hi Robert, sorry for the long delay. > I wonder why the decompression with the XmlInputFormat doesn't work. Did > you get any exception? I didn't get any exception, it just seems to read nothing (or at least doesn't match any opening/closing tags). I dug a bit into the code and found out that

Re: Join with Default-Value

2017-02-10 Thread Sebastian Neef
Hi, thanks! That's exactly what I needed. I'm now using: DataSetA.leftOuterJoin(DataSetB).where(new KeySelector()).equalTo(new KeySelector()).with(new JoinFunction(...)). Now I get the following error: > Caused by: org.apache.flink.optimizer.CompilerException: Error translating > node 'Map "Ke
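
A self-contained sketch of that pattern, using tuple positions instead of KeySelectors for brevity; on a left outer join, the right side is null for unmatched elements, which is where the default value goes (all names and values are assumed):

    import org.apache.flink.api.common.functions.JoinFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class LeftOuterJoinDefault {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Tuple2<Integer, String>> a = env.fromElements(
                Tuple2.of(1, "a1"), Tuple2.of(2, "a2"), Tuple2.of(5, "a5"), Tuple2.of(6, "a6"));
            DataSet<Tuple2<Integer, String>> b = env.fromElements(
                Tuple2.of(1, "b1"), Tuple2.of(3, "b3"), Tuple2.of(4, "b4"), Tuple2.of(6, "b6"));

            DataSet<Tuple2<Integer, String>> joined = a.leftOuterJoin(b)
                .where(0).equalTo(0)
                .with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>,
                                       Tuple2<Integer, String>>() {
                    @Override
                    public Tuple2<Integer, String> join(Tuple2<Integer, String> left,
                                                        Tuple2<Integer, String> right) {
                        // right is null when no element of B matched this key.
                        return Tuple2.of(left.f0, right == null ? "default" : right.f1);
                    }
                });
            joined.print();
        }
    }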

Join with Default-Value

2017-02-10 Thread Sebastian Neef
Hi, is it possible to assign a "default" value to elements that didn't match? For example I have the following two datasets:

    | DataSetA | DataSetB |
    | id=1     | id=1     |
    | id=2     | id=3     |
    | id=5     | id=4     |
    | id=6     | id=6     |

When doing a join with: A.join(B).where( KeySelector(A.id

Reading compressed XML data

2017-01-11 Thread Sebastian Neef
Hi, what's the best way to read a compressed (bz2 / gz) XML file, splitting it at a specific XML tag? So far I've been using Hadoop's TextInputFormat in combination with Mahout's XmlInputFormat ([0]) with env.readHadoopFile(). Whereas the plain TextInputFormat can handle compressed data, the XmlInp
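
For reference, a sketch of the setup described here, assuming Mahout's XmlInputFormat and the readHadoopFile variant that takes a mapreduce Job; the path and the tag names are examples, and this shows only the plain (uncompressed) case:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.mahout.text.wikipedia.XmlInputFormat;

    public class XmlReadSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            Job job = Job.getInstance();
            // XmlInputFormat emits everything between these two markers.
            job.getConfiguration().set(XmlInputFormat.START_TAG_KEY, "<page>");
            job.getConfiguration().set(XmlInputFormat.END_TAG_KEY, "</page>");
            DataSet<Tuple2<LongWritable, Text>> pages = env.readHadoopFile(
                new XmlInputFormat(), LongWritable.class, Text.class,
                "hdfs:///wikipedia/dump.xml", job);
            pages.print();
        }
    }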

Re: Sequential/ordered map

2017-01-05 Thread Sebastian Neef
Hi Chesnay, thanks for the input. Finding a word's first occurrence is part of the algorithm. To be exact, I'm trying to implement Adler's text authorship tracking in Flink (http://www2007.org/papers/paper692.pdf, page 266). Thanks, Sebastian

Re: Sequential/ordered map

2017-01-05 Thread Sebastian Neef
Hi Kostas, thanks for the quick reply. > If T_1 must be processed before T_i, i>1, then you cannot parallelize the > algorithm. What would be the best way to process it anyway? DataSet.collect() -> loop over List -> env.fromCollection(...) ? Or with a parallelism of 1 and a .map(...) ? Howeve
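
A sketch of the first variant mentioned above (collect, loop on the client, re-parallelize); it assumes the DataSet fits on the client and that T_1 ... T_n are already in the required order:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class SequentialSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<String> texts = env.fromElements("T1", "T2", "T3");

            List<String> local = texts.collect();   // triggers execution, pulls to client
            List<String> results = new ArrayList<>();
            for (String t : local) {
                results.add(t.toLowerCase());       // placeholder sequential step
            }
            // Re-parallelize the sequential results for further DataSet operations.
            DataSet<String> processed = env.fromCollection(results);
            processed.print();
        }
    }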

Sequential/ordered map

2017-01-05 Thread Sebastian Neef
Hello, I'd like to implement an algorithm which doesn't really look parallelizable to me, but maybe there's a way around it. In general the algorithm looks like this:
1. Take a list of texts T_1 ... T_n
2. For every text T_i (i > 1) do
2.1: Split text into a list of words W_1 ... W_m
2.2: For ev

Re: Retrieving a single element from a DataSet

2016-10-26 Thread Sebastian Neef
Hi, I'm also interested in that question/solution. For now, my workaround looks like this: > DataSet<...> .filter(... object.Id == NeededElement.Id ... ).collect().get(0) I filter the DataSet for the element I want to find, collect it into a List which then returns the first element. That's a
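
The same workaround as a self-contained sketch (NeededElement.Id replaced by a local constant for illustration); note that get(0) assumes the element actually exists:

    import java.util.List;
    import org.apache.flink.api.common.functions.FilterFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class SingleElementLookup {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            DataSet<Long> ids = env.fromElements(1L, 2L, 3L);
            final long neededId = 2L; // stand-in for NeededElement.Id

            // Filter down to the wanted element, collect the (tiny) result,
            // take the first entry.
            List<Long> hits = ids.filter(new FilterFunction<Long>() {
                @Override
                public boolean filter(Long id) { return id == neededId; }
            }).collect();
            Long element = hits.get(0); // assumes a match was found
            System.out.println(element);
        }
    }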

Re: Flink and factories?

2016-10-19 Thread Sebastian Neef
Hi, wow, oh that's indeed a nice solution. Your version still threw some errors: > Caused by: org.apache.flink.api.common.InvalidProgramException: Object > factorytest.Config$1@5143c662 not serializable > Caused by: java.io.NotSerializableException: factorytest.factory.PearFactory I fixed this
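
For readers following along: the NotSerializableException points at the factory class itself; making the factory types implement Serializable is the usual fix. A minimal sketch (FruitFactory is an assumed name; PearFactory is taken from the log above):

    import java.io.Serializable;

    public class SerializableFactorySketch {
        // Everything a Flink function captures must be serializable; making the
        // factory hierarchy implement Serializable is the usual fix.
        interface FruitFactory extends Serializable {
            String create();
        }

        static class PearFactory implements FruitFactory {
            @Override
            public String create() {
                return "pear";
            }
        }
    }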

Re: Flink and factories?

2016-10-19 Thread Sebastian Neef
Hi Chesnay, thank you for looking into this! Is there any way to tell Flink to (re)sync the changed classes and/or tell it to distribute the serialized classes at a given point (e.g., first on an env.execute()) or so? The thing is that I'm working on a small framework which is based on Flink, so pa

Flink and factories?

2016-10-19 Thread Sebastian Neef
Job.java: https://gitlab.tubit.tu-berlin.de/gehaxelt/flink-factorytest/blob/master/src/main/java/factorytest/Job.java Config.java: https://gitlab.tubit.tu-berlin.de/gehaxelt/flink-factorytest/blob/master/src/main/java/factorytest/Config.java Example run: https://gitlab.tubit.tu-berlin.de/gehaxelt/flink-factorytest/blob/master/EXAMPLE_RUN_OUTPUT.txt Kind regards, Sebastian Neef