I have 2 files.
1. names.snappy is a file created by Spark with Snappy compression.
CodecFactory(conf).getCodec() correctly returns
org.apache.hadoop.io.compress.SnappyCodec (a sketch of how such a check can
look follows this list).
2. pairs-avro-snappy-compressed.avro is a file created by Spark by reading an
Avro file and writing it back with Snappy compression.
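For reference, here is a minimal sketch of a codec check using Hadoop's
CompressionCodecFactory, which resolves the codec from the file extension.
The class name CodecCheck and the path /tmp/names.snappy are placeholders,
not taken from the original post:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
    public static void main(String[] args) {
        // Ask Hadoop which codec it associates with the file, based on its extension.
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        CompressionCodec codec = factory.getCodec(new Path("/tmp/names.snappy")); // hypothetical path
        System.out.println(codec == null ? "no codec found" : codec.getClass().getName());
    }
}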
Create a Hive table x.
Load your CSV data into table x (LOAD DATA INPATH 'file/path' INTO TABLE x;).
Create a Hive table y with the same structure as x, except add STORED AS PARQUET;
INSERT OVERWRITE TABLE y SELECT * FROM x;
This would get you Parquet files under /user/hive/warehouse/y (as an
example); the same steps driven from Spark are sketched below.
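If you are already in Spark, roughly the same conversion can be issued through
HiveContext. This is only a sketch and assumes a Spark build with Hive support;
the single STRING column and the paths are made up for illustration:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("csv-to-parquet"));
HiveContext hive = new HiveContext(sc.sc());

// Hypothetical one-column schema; replace with your real columns.
hive.sql("CREATE TABLE x (line STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
hive.sql("LOAD DATA INPATH '/user/me/input.csv' INTO TABLE x");
hive.sql("CREATE TABLE y (line STRING) STORED AS PARQUET");
hive.sql("INSERT OVERWRITE TABLE y SELECT * FROM x");
// The Parquet files for y end up under the Hive warehouse directory,
// e.g. /user/hive/warehouse/y on a default setup.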
I changed the code to the following:
// Read the Parquet file through the new Hadoop API, then split each record into words.
JavaPairRDD<NullWritable, String> rdd = sc.newAPIHadoopFile(inputFile,
    ParquetInputFormat.class, NullWritable.class, String.class, mrConf);
JavaRDD<String> words = rdd.values().flatMap(
    new FlatMapFunction<String, String>() {
        public Iterable<String> call(String x) {
            return Arrays.asList(x.split(" "));
        }
    });
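For completeness, the counting step that a word count usually continues with
from the words RDD could look like this (sketch only; the output path is
hypothetical):

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

JavaPairRDD<String, Integer> counts = words
    .mapToPair(new PairFunction<String, String, Integer>() {
        public Tuple2<String, Integer> call(String w) {
            return new Tuple2<String, Integer>(w, 1);
        }
    })
    .reduceByKey(new Function2<Integer, Integer, Integer>() {
        public Integer call(Integer a, Integer b) {
            return a + b;
        }
    });
counts.saveAsTextFile("/tmp/wordcount-output"); // hypothetical output path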
Hi All,
I am a newbie to Spark/Hadoop.
I want to read a Parquet file and perform a simple word-count. Below is my
code; however, I get an error:
Exception in thread "main" java.io.IOException: No input paths specified in
job
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus
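That exception is thrown by FileInputFormat when the job configuration does not
carry any input path. sc.newAPIHadoopFile() should register its first argument
as that path, so the first thing to check is that inputFile is non-empty and
actually reaches the call. One way to set (or double-check) the path explicitly
on the Configuration is sketched below; the path and variable names are
hypothetical, not taken from the original code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Build a Configuration that already carries the input path,
// then pass it to newAPIHadoopFile as the conf argument.
Job job = Job.getInstance(new Configuration());
FileInputFormat.addInputPath(job, new Path("/user/me/names.parquet")); // hypothetical path
Configuration mrConf = job.getConfiguration();
// On Hadoop 2 the path lands under mapreduce.input.fileinputformat.inputdir.
System.out.println(mrConf.get("mapreduce.input.fileinputformat.inputdir"));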