Re: Spark corrupts text lines

Sean Owen Tue, 14 Jun 2016 05:44:16 -0700

It really sounds like the line is being split across partitions.
TextInputFormat does split its input this way, but it should be
perfectly capable of putting together lines that break across files
(partitions). If you're up for debugging, that's where I would start:
set breakpoints around where TextInputFormat parses lines, and see if
you can catch it returning a line that doesn't contain what you expect.
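
For reference, here is a minimal sketch of the convention Hadoop's line
reader follows so that a line straddling a split boundary still comes out
whole: every split except the first throws away everything up to its first
newline, and every split keeps reading as long as a line starts at or
before its end offset. This is a simplified in-memory simulation, not
Hadoop's actual code; the names SplitLineSketch, readSplit and endOfLine
are made up for illustration, and only '\n' terminators are handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SplitLineSketch {

    // Index just past the next '\n' at or after pos, or data.length if there is none.
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    // Lines owned by the split [start, end) of data (hypothetical helper).
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> lines = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Not the first split: the previous split reads the line that
            // crosses (or starts exactly at) this boundary, so skip it here.
            pos = endOfLine(data, start);
        }
        // Read every line that starts at or before the split end, even if
        // the line itself runs past the end into the next split.
        while (pos <= end && pos < data.length) {
            int next = endOfLine(data, pos);
            int len = next - pos;
            if (data[next - 1] == '\n') {
                len--; // drop the terminator
            }
            lines.add(new String(data, pos, len, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "aaaa\nbbbbbb\ncc\n".getBytes(StandardCharsets.UTF_8);
        // The boundary at byte 8 falls in the middle of "bbbbbb".
        System.out.println(readSplit(data, 0, 8));
        System.out.println(readSplit(data, 8, data.length));
    }
}
```

In this example the first split returns "aaaa" and the complete "bbbbbb",
and the second split returns only "cc". If a line shows up cut in half,
one of these two rules is presumably not being applied at that boundary.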


On Tue, Jun 14, 2016 at 1:38 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
> That's funny. The next line is the rest of the line that got
> split in half. Every line after that is fine.
>
> I managed to reproduce it without gzip too, so maybe it's not gzip's
> fault after all.
>
> I'm clueless...
>
> On Tue, Jun 14, 2016 at 12:53 PM, Kristoffer Sjögren <sto...@gmail.com> wrote:
>> Seems like it's the gzip. It works if I download the file, gunzip it,
>> put it back in another directory and read it the same way.
>>
>> Hm.. I wonder what happens with the lines after it..
>>
>>
>>
>> On Tue, Jun 14, 2016 at 11:52 AM, Sean Owen <so...@cloudera.com> wrote:
>>> What if you read it uncompressed from HDFS?
>>> gzip compression is unfriendly to MR in that it can't split the file.
>>> It still should just work, certainly if the line is in one file. But,
>>> a data point worth having.
>>>
>>> On Tue, Jun 14, 2016 at 10:49 AM, Kristoffer Sjögren <sto...@gmail.com> 
>>> wrote:
>>>> The line is in one file. I did download the file manually from HDFS,
>>>> read and decoded it line-by-line successfully without Spark.
>>>>
>>>>
>>>>
>>>> On Tue, Jun 14, 2016 at 11:44 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>> The only thing I can think of is that a line is being broken across
>>>>> two files? Hadoop easily puts things back together in this case, or
>>>>> should. There could be some weird factor preventing that. A first
>>>>> place to look: are you using an unusual line separator, or at least
>>>>> one different from the host OS?
>>>>>
>>>>> On Tue, Jun 14, 2016 at 10:41 AM, Kristoffer Sjögren <sto...@gmail.com> 
>>>>> wrote:
>>>>>> I should mention that in the end we want to convert the input from
>>>>>> Protobuf binary to Parquet using the following code. But this comes
>>>>>> after the lines have been decoded from base64 into binary.
>>>>>>
>>>>>>
>>>>>> public static <T extends Message> void save(JavaRDD<T> rdd, Class<T> clazz, String path) {
>>>>>>   try {
>>>>>>     Job job = Job.getInstance();
>>>>>>     ParquetOutputFormat.setWriteSupportClass(job, ProtoWriteSupport.class);
>>>>>>     ProtoParquetOutputFormat.setProtobufClass(job, clazz);
>>>>>>     rdd.mapToPair(order -> new Tuple2<>(null, order))
>>>>>>       .saveAsNewAPIHadoopFile(path, Void.class, clazz,
>>>>>>         ParquetOutputFormat.class, job.getConfiguration());
>>>>>>   } catch (IOException e) {
>>>>>>     throw new RuntimeException(e);
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>> <dependency>
>>>>>>   <groupId>org.apache.parquet</groupId>
>>>>>>   <artifactId>parquet-protobuf</artifactId>
>>>>>>   <version>1.8.1</version>
>>>>>> </dependency>
>>>>>>
>>>>>> On Tue, Jun 14, 2016 at 11:37 AM, Kristoffer Sjögren <sto...@gmail.com> 
>>>>>> wrote:
>>>>>>> I'm trying to figure out exactly what information could be useful, but
>>>>>>> it's all pretty straightforward.
>>>>>>>
>>>>>>> - It's text files
>>>>>>> - Lines end with a newline character
>>>>>>> - Files are gzipped before being added to HDFS
>>>>>>> - Files are read as gzipped files from HDFS by Spark
>>>>>>> - There is some extra configuration:
>>>>>>>
>>>>>>> conf.set("spark.files.overwrite", "true");
>>>>>>> conf.set("spark.hadoop.validateOutputSpecs", "false");
>>>>>>>
>>>>>>> Here's the code using Java 8 Base64 class.
>>>>>>>
>>>>>>> context.textFile("/log.gz")
>>>>>>> .map(line -> line.split("&timestamp="))
>>>>>>> .map(split -> Base64.getDecoder().decode(split[0]));
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jun 14, 2016 at 11:26 AM, Sean Owen <so...@cloudera.com> wrote:
>>>>>>>> It's really the MR InputSplit code that splits files into records.
>>>>>>>> Nothing particularly interesting happens in that process, except for
>>>>>>>> breaking on newlines.
>>>>>>>>
>>>>>>>> Do you have one huge line in the file? Are you reading it as a text file?
>>>>>>>> Can you give any more detail about exactly how you parse it? It could
>>>>>>>> be something else in your code.
>>>>>>>>
>>>>>>>> On Tue, Jun 14, 2016 at 10:24 AM, Kristoffer Sjögren 
>>>>>>>> <sto...@gmail.com> wrote:
>>>>>>>>> Hi
>>>>>>>>>
>>>>>>>>> We have log files written as base64-encoded text files (gzipped),
>>>>>>>>> where each line ends with a newline character.
>>>>>>>>>
>>>>>>>>> For some reason a particular line [1] is split by Spark [2], making it
>>>>>>>>> unparsable by the base64 decoder. It does this consistently, no matter
>>>>>>>>> whether I give it the particular file that contains the line or a bunch
>>>>>>>>> of files.
>>>>>>>>>
>>>>>>>>> I know the line is not corrupt because I can manually download the
>>>>>>>>> file from HDFS, gunzip it and read/decode all the lines without
>>>>>>>>> problems.
>>>>>>>>>
>>>>>>>>> I was thinking that maybe there is a limit on the number of characters
>>>>>>>>> per line, but that doesn't sound right? Maybe some combination of
>>>>>>>>> characters makes Spark think it's a new line?
>>>>>>>>>
>>>>>>>>> I'm clueless.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> -Kristoffer
>>>>>>>>>
>>>>>>>>> [1] Original line:
>>>>>>>>>
>>>>>>>>> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0cHpyM3ZzLHBkM2xkM2diaSxwaXVrYzY2ZWUscHl0ejI5OHM0KgkzOTUxLDM5NjAS3gIIxNjxhJTVsJcVEqUBTW96aWxsYS81LjAgKExpbnV4OyBBbmRyb2lkIDUuMS4xOyBTQU1TVU5HIFNNLUczODhGIEJ1aWxkL0xNWTQ4QikgQXBwbGVXZWJLaXQvNTM3LjM2IChLSFRNTCwgbGlrZSBHZWNrbykgU2Ftc3VuZ0Jyb3dzZXIvMy4zIENocm9tZS8zOC4wLjIxMjUuMTAyIE1vYmlsZSBTYWZhcmkvNTM3LjM2IjUKDDYyLjIwLjE5Ni44MBWgd3NBHRgibUIiAlNFKgfDlnJlYnJvMg5UZWxpYVNvbmVyYSBBQigAMdejcD0K1+s/OABCCAiAAhWamRlAQgcIURUAAOBAQggIlAEVzczMP0IHCFQVmpkJQUIICJYBFTMzE0BCBwhYFZqZ+UBCCAj6ARWamdk/QggImwEVzcysQEoHCAYVO6ysPkoHCAQVRYO4PkoHCAEVIg0APw==&timestamp=1465887564
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [2] Line as Spark hands it over:
>>>>>>>>>
>>>>>>>>> CsAJCtwGCghwYWdlVmlldxC4PhjM1v66BSJFaHR0cDovL25hLnNlL3Nwb3J0ZW4vc3BvcnR0dC8xLjM5MTU5MjEtdXBwZ2lmdGVyLXNtZWRlcm5hLW1vdC1rb25rdXJzKjhVcHBnaWZ0ZXI6IFNtZWRlcm5hIG1vdCBrb25rdXJzIC0gU3BvcnQgKFRUKSAtIHd3dy5uYS5zZTJXaHR0cDovL25hLnNlL255aGV0ZXIvb3JlYnJvLzEuMzk2OTU0My1rcnlwaGFsLW9wcG5hci1mb3Itb2JvLWF0dC1iZWhhbGxhLXRqYW5zdGViaWxhcm5hOqECcWdrZWplNGpmLHBkMzBmdDRuNCxwbHB0b3JqNncscGxwczBvamZvLHBkYjVrZGM4eCxwbHBzN293Y3UscGE0czN1bXp5LHBhNHJla25sNyxwYTRyd3dxam4scGE0c21ra2Z4LHBkM2tpa3BmMixwZDNqcjE5dGMscGQ0ZGQ0M2F3LHAwZ3MwbmlqMSxwYTRvZTNrbXoscGE0cWJ3eDZxLHBkM2s2NW00dyxwYTRyazc3Z3IscGQzMHAzdW8wLHBkNGM1ajV5dixwbHB0c211NmcscGM3bXNibmM5LHBhNHFpaTdsZCxwbHB0dnpqdnUscGE0bmlsdmFnLHBhNHB6cjN2cyxwZDNsZDNnYmkscGl1a2M2NmVlLHB5dHoyOThzNErIAgoTNzI0NTY2NzU0MzQxNTUyOTQ4ORAAGAAioQJxZ2tlamU0amYscGQzMGZ0NG40LHBscHRvcmo2dyxwbHBzMG9qZm8scGRiNWtkYzh4LHBscHM3b3djdSxwYTRzM3VtenkscGE0cmVrbmw3LHBhNHJ3d3FqbixwYTRzbWtrZngscGQza2lrcGYyLHBkM2pyMTl0YyxwZDRkZDQzYXcscDBnczBuaWoxLHBhNG9lM2tteixwYTRxYnd4NnEscGQzazY1bTR3LHBhNHJrNzdncixwZDMwcDN1bzAscGQ0YzVqNXl2LHBscHRzbXU2ZyxwYzdtc2JuYzkscGE0cWlpN2xkLHBscHR2emp2dSxwYTRuaWx2YWcscGE0
>>>>>>>>>
>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>>>
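
A truncated record like [2] can be flagged mechanically before decoding.
The sketch below is a plain-JDK check, assuming (as in the sample line
[1]) that every complete record contains the "&timestamp=" separator;
RecordCheck and problem are hypothetical names used only for
illustration. Note that a successful base64 decode alone does not prove
the line is complete: Java's basic decoder does not require padding, so a
payload cut off at a convenient length may still decode without error.

```java
import java.util.Base64;

final class RecordCheck {

    // Separator between the base64 payload and the timestamp, as in line [1].
    static final String SEPARATOR = "&timestamp=";

    // Returns null if the line looks like a complete record,
    // otherwise a short description of what is wrong with it.
    static String problem(String line) {
        int idx = line.indexOf(SEPARATOR);
        if (idx < 0) {
            // The first half of a split line ends up here.
            return "missing " + SEPARATOR + " (length " + line.length() + ")";
        }
        String payload = line.substring(0, idx);
        try {
            Base64.getDecoder().decode(payload);
        } catch (IllegalArgumentException e) {
            // The second half of a split line may end up here.
            return "payload is not valid base64: " + e.getMessage();
        }
        return null;
    }

    public static void main(String[] args) {
        // Short made-up inputs; real records are much longer.
        System.out.println(problem("aGVsbG8=&timestamp=1465887564"));
        System.out.println(problem("aGVsbG8"));
    }
}
```

With Spark, something like
context.textFile(path).filter(line -> RecordCheck.problem(line) != null)
would keep only the suspect lines, whose lengths and contents can then be
compared against the file on HDFS.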
