Hi,
Let's take Pig as an example...
collection = LOAD 'test_data.csv' USING PigStorage(';')
    AS (
        col1:chararray,
        col2:chararray
    );
-- use partitions
STORE collection INTO 'import_table_hcat' USING
org.apache.hcatalog.pig.HCatStorer('datestamp=20150420');
How would I implement this with Flink?
Let us brainstorm about the following code snippet...
final ExecutionEnvironment env =
    ExecutionEnvironment.getExecutionEnvironment();
CsvReader csvr = env.readCsvFile(filePath);
// TODO: Get data into a data set - How to read the whole file?
// DataSet<Tuple2<Text, Text>> hadoopResult = csvr.
// TODO: Store data into Hadoop - Write to HDFS / HCatalog.
// Create the Flink wrapper; set the Hadoop OutputFormat and specify the job.
HadoopOutputFormat<Text, IntWritable> hadoopOF =
    new HadoopOutputFormat<Text, IntWritable>(
        new TextOutputFormat<Text, IntWritable>(), job);
hadoopOF.getConfiguration().set("mapreduce.output.textoutputformat.separator", " ");
TextOutputFormat.setOutputPath(job, new Path(outputPath));
// Emit data using the Hadoop TextOutputFormat.
hadoopResult.output(hadoopOF);
My idea is: if I create the tables in HCatalog in advance, I might populate
them by writing directly into the Hive warehouse directory on HDFS. Any
thoughts on this?
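Roughly, assuming the rows were already read into a
DataSet<Tuple2<String, String>> named rows (the warehouse path below is just a
hypothetical example):

// Hypothetical sketch: write ';'-delimited rows straight into the partition
// directory of a table that was created in HCatalog beforehand.
rows.writeAsCsv(
    "hdfs:///apps/hive/warehouse/import_table_hcat/datestamp=20150420",
    "\n", ";");
env.execute();

I guess the partition would still have to be registered in Hive afterwards
(e.g. via ALTER TABLE ... ADD PARTITION).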
Stefan
-----Original Message-----
From: Robert Metzger [mailto:[email protected]]
Sent: Monday, April 20, 2015 3:22 PM
To: [email protected]
Subject: Re: Hadoop ETLing with Flink
Hi Stefan,
you can use Flink to load data into HDFS.
The CSV reader is suited for reading delimiter-separated text files into the
system. But you can also read data from a lot of other sources (Avro, JDBC,
MongoDB, HCatalog).
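For your file, reading it into a DataSet could look roughly like this (a
minimal sketch; the path is a placeholder, the String fields mirror the
chararray columns of your Pig script, and depending on the Flink version
fieldDelimiter takes a char or a String):

final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Parse the ';'-separated file into 2-tuples of Strings.
DataSet<Tuple2<String, String>> rows = env
    .readCsvFile("hdfs:///path/to/test_data.csv")
    .fieldDelimiter(';')
    .types(String.class, String.class);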
We don't have any utilities to make writing to HCatalog very easy, but you can
certainly write to HCatalog with Flink's Hadoop OutputFormat wrappers:
http://ci.apache.org/projects/flink/flink-docs-master/hadoop_compatibility.html#using-hadoop-outputformats
Here is some documentation on how to use the HCatalog output format:
https://cwiki.apache.org/confluence/display/Hive/HCatalog+InputOutput
You probably have to do something like:
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
HCatSchema s = HCatOutputFormat.getTableSchema(job);
HCatOutputFormat.setSchema(job, s);
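An untested, fuller sketch of how this could plug into the Flink wrapper; the
record mapping and key type are guesses on my side, and rows would be the
DataSet<Tuple2<String, String>> read from your CSV file:

// Assumed imports: org.apache.flink.api.common.functions.MapFunction,
// org.apache.flink.api.java.tuple.Tuple2,
// org.apache.flink.hadoopcompatibility.mapreduce.HadoopOutputFormat,
// org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Job,
// org.apache.hcatalog.data.*, org.apache.hcatalog.mapreduce.*.
Job job = Job.getInstance();
HCatOutputFormat.setOutput(job, OutputJobInfo.create(dbName, outputTableName, null));
HCatOutputFormat.setSchema(job, HCatOutputFormat.getTableSchema(job));

// Convert each CSV row into an HCatRecord; HCatOutputFormat ignores the key.
DataSet<Tuple2<WritableComparable<?>, HCatRecord>> records = rows
    .map(new MapFunction<Tuple2<String, String>, Tuple2<WritableComparable<?>, HCatRecord>>() {
        @Override
        public Tuple2<WritableComparable<?>, HCatRecord> map(Tuple2<String, String> row) {
            HCatRecord rec = new DefaultHCatRecord(2);
            rec.set(0, row.f0);
            rec.set(1, row.f1);
            return new Tuple2<WritableComparable<?>, HCatRecord>(NullWritable.get(), rec);
        }
    });

// Wrap the Hadoop HCatOutputFormat in Flink's HadoopOutputFormat and emit.
records.output(new HadoopOutputFormat<WritableComparable<?>, HCatRecord>(
    new HCatOutputFormat(), job));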
Let me know if you need more help writing to HCatalog.
On Mon, Apr 20, 2015 at 1:29 PM, Papp, Stefan <[email protected]>
wrote:
> Hi,
>
>
> I want to load CSV files into a Hadoop cluster. How could I do that with
> Flink?
>
> I know I can load data into a CsvReader and then iterate over rows
> and transform them. Is there an easy way to store the results into
> HDFS+HCatalog within Flink?
>
> Thank you!
>
> Stefan Papp
> Lead Hadoop Consultant
>
> Teradata GmbH
> Mobile: +43 664 22 08 616
> [email protected]<mailto:[email protected]>
> teradata.com<http://www.teradata.com/>
>