CodecFactory is not able to return a codec for a file created by Spark

2016-10-05 Thread shamu
I have 2 files:
1. names.snappy is a file created by Spark with Snappy compression. CodecFactory(conf).getCodec() correctly returns org.apache.hadoop.io.compress.SnappyCodec.
2. pairs-avro-snappy-compressed.avro is a file created by Spark by reading an Avro file and writing it out with Snappy compression…
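Assuming the CodecFactory here is Hadoop's org.apache.hadoop.io.compress.CompressionCodecFactory (the poster's exact class is not shown), a minimal sketch of the lookup: the factory resolves a codec purely from the filename suffix, which is why a .snappy file matches while an .avro container, which records its snappy codec in the file's own metadata rather than in the name, comes back null:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.CompressionCodecFactory;

  public class CodecProbe {
    public static void main(String[] args) {
      Configuration conf = new Configuration();
      CompressionCodecFactory factory = new CompressionCodecFactory(conf);

      // the .snappy suffix is registered, so a codec is found
      CompressionCodec c1 = factory.getCodec(new Path("names.snappy"));
      System.out.println(c1 == null ? "null" : c1.getClass().getName());
      // prints: org.apache.hadoop.io.compress.SnappyCodec

      // .avro has no registered suffix; the Avro container keeps its
      // codec ("snappy") in file metadata, so the factory returns null
      CompressionCodec c2 = factory.getCodec(new Path("pairs-avro-snappy-compressed.avro"));
      System.out.println(c2 == null ? "null" : c2.getClass().getName());
      // prints: null
    }
  }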

Re: Spark with Parquet

2016-08-22 Thread shamu
Create a Hive table x. Load your CSV data into table x:
  LOAD DATA INPATH 'file/path' INTO TABLE x;
Create a Hive table y with the same structure as x, except add STORED AS PARQUET. Then:
  INSERT OVERWRITE TABLE y SELECT * FROM x;
This would get you Parquet files under /user/hive/warehouse/y (as an example) you…
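The same steps can be driven from Spark itself; a minimal sketch assuming Spark 2.x built with Hive support, where the two-column CSV schema and the column names are purely illustrative (only 'file/path' and the table names x and y come from the message above):

  import org.apache.spark.sql.SparkSession;

  public class CsvToParquetViaHive {
    public static void main(String[] args) {
      SparkSession spark = SparkSession.builder()
          .appName("csv-to-parquet")
          .enableHiveSupport()   // requires a Hive-enabled Spark build
          .getOrCreate();

      // staging table over the raw CSV (schema is illustrative)
      spark.sql("CREATE TABLE IF NOT EXISTS x (word STRING, cnt INT) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','");
      spark.sql("LOAD DATA INPATH 'file/path' INTO TABLE x");

      // same structure, but stored as Parquet
      spark.sql("CREATE TABLE IF NOT EXISTS y (word STRING, cnt INT) STORED AS PARQUET");

      // rewriting the rows produces Parquet files under y's warehouse
      // directory, e.g. /user/hive/warehouse/y
      spark.sql("INSERT OVERWRITE TABLE y SELECT * FROM x");

      spark.stop();
    }
  }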

Re: word count on parquet file

2016-08-22 Thread shamu
I changed the code to below...
  JavaPairRDD<NullWritable, String> rdd = sc.newAPIHadoopFile(inputFile,
      ParquetInputFormat.class, NullWritable.class, String.class, mrConf);
  JavaRDD<String> words = rdd.values().flatMap(
      new FlatMapFunction<String, String>() {
        public Iterable<String> call(String x) {
          return Arrays.asList(x.split(" "));
        }
      });
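One likely gap in the snippet above: parquet-mr's ParquetInputFormat needs a ReadSupport telling it how to materialize records, and with the bundled example GroupReadSupport the values come back as Group objects, not String. A sketch of wiring that in, reusing sc, inputFile, and mrConf from the message (package names assume a parquet-mr release under the org.apache.parquet namespace; older versions used parquet.hadoop):

  import org.apache.hadoop.mapreduce.Job;
  import org.apache.parquet.example.data.Group;
  import org.apache.parquet.hadoop.ParquetInputFormat;
  import org.apache.parquet.hadoop.example.GroupReadSupport;

  // register the read support on the Job whose configuration backs mrConf
  Job job = Job.getInstance(mrConf);
  ParquetInputFormat.setReadSupportClass(job, GroupReadSupport.class);

  // keys are Void and values are Group when reading via GroupReadSupport
  JavaPairRDD<Void, Group> rdd = sc.newAPIHadoopFile(
      inputFile, ParquetInputFormat.class,
      Void.class, Group.class, job.getConfiguration());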

word count on parquet file

2016-08-22 Thread shamu
Hi All, I am a newbie to Spark/Hadoop. I want to read a Parquet file and perform a simple word count. Below is my code; however, I get an error:
  Exception in thread "main" java.io.IOException: No input paths specified in job
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(…
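That IOException is thrown by FileInputFormat.listStatus when the job configuration carries no input directories. One hedged guess at a fix, assuming the inputFile and mrConf names from the follow-up message above: register the path on a Hadoop Job explicitly and hand that Job's configuration to Spark:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  // listStatus throws "No input paths specified in job" when the
  // mapreduce.input.fileinputformat.inputdir property is empty
  Job job = Job.getInstance(new Configuration());
  FileInputFormat.addInputPath(job, new Path(inputFile));
  Configuration mrConf = job.getConfiguration();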