Hi folks,
Thanks for the help thus far.
I'm trying to track down the source of this error:
java.lang.UnsupportedOperationException:
org.apache.parquet.column.values.dictionary.PlainValuesDictionary
w hen doing a message.show()
Basically I'm reading in a single Parquet file (to try to narrow things
down).
I'm defining the schema in the beginning and loading the parquet with:
message = spark\
.read\
.schema(myMessageSchema)\
.format("parquet")\
.option("mergeSchema", "true")\
.option("badRecordsPath", "/tmp/badRecords/")\
.load("hdfs:///user/hadoop/feb20/part-00000-c6da95c9-9c40-4623-a5c5-851188e236ff-c000.snappy.parquet")
[I've tried with and without the mergeSchema option]
[ sidenote: I was hoping the badRecordPath would help with the truly bad
records, but this seems to do nothing]
I've also tried to cast the potential problematic columns (so Int, Long,
Double, etc) with
message_1 = message\
.withColumn('price', col('price').cast('double'))\
.withColumn('price_eur', col('price_eur').cast('double'))\
.withColumn('cost_usd', col('cost_usd').cast('double'))\
.withColumn('adapter_status', col('adapter_status').cast('long'))
Yet I get this error and I can't figure out:
(a) whether it's some record WITHIN the parquet file that's causing it and
(b) if it is a single record (or a few records) then how do I find those
particular records?
In the previous time I encountered this, there were records that should
have had doubles in them (like "price" above) that actually seemed to have
null.
I did this to fix that particular problem:
if not 'price' in message.columns:
message = message.withColumn('price', message.lit('0'))
Any suggestions or help would be MOST welcome. I have also tried using
pyarrow to take a look at the Parquet schema and it looks fine. I mean, it
doesn't look like the schema in the parquet is the problem - but of course
I'm not ruling that out just yet.
Thanks for any suggestions,
Hamish
--
Cloud-Fundis.co.za
Cape Town, South Africa
+27 79 614 4913