I managed to get remote debugging up and running and can in fact
reproduce the error and get a breakpoint triggered as it happens.
But it seems like the code does not go through TextInputFormat, or at
least the breakpoint is not triggered from this class? Don't know what
other class to look for th
I'm pretty confident the lines are encoded correctly since I can read
them both locally and on Spark (by ignoring the faulty line and
proceeding to the next). I also get the correct number of lines through
Spark, again by ignoring the faulty line.
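For illustration, the "decode every line and skip the faulty one" check could be sketched roughly as below. This is a minimal sketch using only the Python standard library; the helper names `is_valid_base64` and `find_faulty_lines` are made up for this example and are not part of Spark or of the original job.

```python
import base64
import binascii


def is_valid_base64(line):
    """Return True if the line decodes as strict base64, False otherwise."""
    text = line.strip()
    if not text:
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet
        base64.b64decode(text, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


def find_faulty_lines(lines):
    """Return (line_number, line) pairs for lines that fail to decode."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if not is_valid_base64(line)
    ]
```

In PySpark the same predicate could presumably be applied with something like `sc.textFile(path).filter(lambda l: not is_valid_base64(l))`. Note that a fragment of a split line can still happen to be valid base64 by itself, so this only catches fragments that fail to decode.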
I get the same error by reading the original file using S
Thanks for your help. Really appreciate it!
Give me some time, I'll come back after I've tried your suggestions.
On Tue, Jun 14, 2016 at 3:28 PM, Kristoffer Sjögren wrote:
> I cannot reproduce it by running the file through Spark in local mode
> on my machine. So it does indeed seem to be somethi
It takes a little setup, but you can do remote debugging:
http://danosipov.com/?p=779 ... and then use similar config to
connect your IDE to a running executor.
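As a rough sketch of what such a config could look like: the JVM's standard JDWP agent flag can be passed to executors through the `spark.executor.extraJavaOptions` property. The helper names, the port 5005, and the application path below are assumptions chosen for illustration, not taken from the linked post.

```python
def jdwp_agent_option(port=5005, suspend=False):
    """Build the JVM flag that starts a JDWP debug agent listening on `port`."""
    return (
        "-agentlib:jdwp=transport=dt_socket,server=y,"
        "suspend={},address={}".format("y" if suspend else "n", port)
    )


def spark_submit_command(app, port=5005, suspend=False):
    """Build a spark-submit argument list that enables the agent on executors.

    `app` is a hypothetical path to the application jar or script.
    """
    return [
        "spark-submit",
        "--conf",
        "spark.executor.extraJavaOptions=" + jdwp_agent_option(port, suspend),
        app,
    ]


if __name__ == "__main__":
    print(" ".join(spark_submit_command("my_job.py")))
```

With a fixed port, two executors on the same host would likely collide, so limiting the job to a single executor while debugging is probably the simplest setup.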
Before that you might strip your program down to only a call to
textFile that then checks the lines according to whatever logic would
de
I cannot reproduce it by running the file through Spark in local mode
on my machine. So it does indeed seem to be something related to
splitting across partitions.
On Tue, Jun 14, 2016 at 3:04 PM, Kristoffer Sjögren wrote:
> Can you do remote debugging in Spark? Didn't know that. Do you have a link?
Can you do remote debugging in Spark? Didn't know that. Do you have a link?
Also noticed isSplitable in
org.apache.hadoop.mapreduce.lib.input.TextInputFormat which checks for
org.apache.hadoop.io.compress.SplittableCompressionCodec. Maybe there
is some way to tell it not to split?
On Tue, Jun 1
It really sounds like the line is being split across partitions. This
is what TextInputFormat does, but it should be perfectly capable of
putting together lines that break across files (partitions). If you're
into debugging, that's where I would start if you can. Breakpoints
around how TextInputFormat
That's funny. The line after is the rest of the line that got
split in half. Every line after that is fine.
I managed to reproduce it without gzip too, so maybe it's not gzip's fault
after all.
I'm clueless...
On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren wrote:
> Seems li
Can you read this file using an MR job?
On Tue, Jun 14, 2016 at 5:26 PM, Sean Owen wrote:
> It's really the MR InputSplit code that splits files into records.
> Nothing particularly interesting happens in that process, except for
> breaking on newlines.
>
> Do you have one huge line in the file? a
It's really the MR InputSplit code that splits files into records.
Nothing particularly interesting happens in that process, except for
breaking on newlines.
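To make the "breaking on newlines" step concrete, here is a simplified model of how a line reader can hand out whole lines even when a split boundary falls in the middle of one. It is a sketch of the general rule (every split except the first discards its first, possibly partial, line, and every split reads past its end to finish the last line it started), not Hadoop's actual code; `read_split` and `read_all` are names invented for this example.

```python
def read_split(data, start, end):
    """Return the lines owned by the byte range [start, end] of `data`."""
    pos = start
    if start != 0:
        # Discard the first (possibly partial) line; the previous split
        # reads it in full because it may read past its own end.
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # A line is owned by this split if it starts at or before `end`.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            records.append(data[pos:])  # last line without a trailing newline
            break
        records.append(data[pos:newline])
        pos = newline + 1
    return records


def read_all(data, split_size):
    """Read `data` split by split and concatenate the records."""
    records = []
    for start in range(0, len(data), split_size):
        end = min(start + split_size, len(data))
        records.extend(read_split(data, start, end))
    return records
```

Under this rule every line comes back exactly once regardless of where the boundaries fall, which is why a line arriving in two halves points at something other than the basic splitting logic.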
Do you have one huge line in the file? Are you reading as a text file?
Can you give any more detail about exactly how you parse it? It could
Hi
We have log files that are written as base64-encoded text files
(gzipped), where each line ends with a newline character.
For some reason a particular line [1] is split by Spark [2], making it
unparsable by the base64 decoder. It does this consistently no matter
if I give it the particular