Hello,
Thanks. So if I understand correctly, Hive can be usable in my context?

Nicolas

Sent from my Samsung mobile device

Jörn Franke <jornfra...@gmail.com> wrote:
If you use transactional tables in Hive together with INSERT, UPDATE, and DELETE, then it does the "concatenate" for you automatically at regular intervals. Currently this only works with tables in ORC format (stored as ORC).
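
For example, a minimal sketch of such a transactional table in HiveQL (table and column names are hypothetical; assumes Hive 0.14+ with the transaction manager and background compactor enabled):

  CREATE TABLE product_snapshots (
    product_id STRING,
    payload    STRING,
    updated_at TIMESTAMP
  )
  CLUSTERED BY (product_id) INTO 16 BUCKETS  -- transactional tables must be bucketed
  STORED AS ORC                              -- ACID currently requires ORC
  TBLPROPERTIES ('transactional' = 'true');

Hive's compactor then merges the small delta files produced by INSERT/UPDATE/DELETE in the background, which is the automatic "concatenate" behavior described above.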

On Sat, Oct 3, 2015 at 11:45, <nib...@free.fr> wrote:
Hello,
So, is Hive a solution for my need:
- I receive small messages (10 KB) identified by an ID (a product ID, for example)
- Each message I receive is the latest picture of that product, so basically I just want to store the latest product pictures in HDFS in order to process them in batch later.

If I use Hive, I suppose I have to INSERT and UPDATE records and periodically CONCATENATE.
After a CONCATENATE, I suppose the records are still updatable.
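
For instance, with a transactional ORC table like the one sketched above, a record stays updatable at any time, including after compaction (names and values are hypothetical):

  -- replace the stored picture for one product
  UPDATE product_snapshots
  SET payload    = 'latest-picture-for-P12345',
      updated_at = '2015-10-03 11:45:00'
  WHERE product_id = 'P12345';

Note that Hive's UPDATE cannot modify the bucketing column itself (product_id here), which fits this use case since the ID is the key.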

Thanks for confirming whether it can be a solution for my use case, or suggesting any other idea.

Thanks a lot !
Nicolas


----- Original Message -----
From: "Jörn Franke" <jornfra...@gmail.com>
To: nib...@free.fr, "Brett Antonides" <banto...@gmail.com>
Cc: user@spark.apache.org
Sent: Saturday, October 3, 2015 11:17:51
Subject: Re: HDFS small file generation problem



You can update data in Hive if you use the ORC format.



On Sat, Oct 3, 2015 at 10:42, <nib...@free.fr> wrote:


Hello,
Finally, Hive is not a solution, as I cannot update the data.
And for the archive file, I think it would be the same issue.
Any other solutions?

Nicolas

----- Original Message -----
From: nib...@free.fr
To: "Brett Antonides" <banto...@gmail.com>
Cc: user@spark.apache.org
Sent: Friday, October 2, 2015 18:37:22
Subject: Re: HDFS small file generation problem

OK, thanks, but can I also update data instead of inserting it?

----- Original Message -----
From: "Brett Antonides" <banto...@gmail.com>
To: user@spark.apache.org
Sent: Friday, October 2, 2015 18:18:18
Subject: Re: HDFS small file generation problem

I had a very similar problem and solved it with Hive and ORC files using the Spark SQLContext.
* Create a table in Hive stored as an ORC file (I recommend using partitioning too)
* Use SQLContext.sql to INSERT data into the table
* Use SQLContext.sql to periodically run ALTER TABLE...CONCATENATE to merge your many small files into larger files optimized for your HDFS block size (a rough sketch follows below)
* Since the CONCATENATE command operates on files in place, it is transparent to any downstream processing
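
A rough sketch of those steps in HiveQL (table, column, and partition names are hypothetical; each statement can be submitted from Spark via SQLContext.sql, given a Hive-enabled context):

  -- 1) ORC table, partitioned by day
  CREATE TABLE events (
    product_id STRING,
    payload    STRING
  )
  PARTITIONED BY (ds STRING)
  STORED AS ORC;

  -- 2) insert incoming micro-batches (this is what accumulates many small files)
  INSERT INTO TABLE events PARTITION (ds = '2015-10-02')
  SELECT product_id, payload FROM incoming_batch;

  -- 3) periodically merge the small files of a partition, in place
  ALTER TABLE events PARTITION (ds = '2015-10-02') CONCATENATE;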

Cheers,
Brett

On Fri, Oct 2, 2015 at 3:48 PM, <nib...@free.fr> wrote:


Hello,
Yes, but:
- In the Java API I can't find an API to create an HDFS archive
- As soon as I receive a message (with a messageID), I need to replace the old existing file with the new one (the file name being the messageID); is that possible with an archive?

Thanks
Nicolas

----- Original Message -----
From: "Jörn Franke" <jornfra...@gmail.com>
To: nib...@free.fr, "user" <user@spark.apache.org>
Sent: Monday, September 28, 2015 23:53:56
Subject: Re: HDFS small file generation problem

Use hadoop archive
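
For reference, a typical invocation looks like this (paths are hypothetical):

  hadoop archive -archiveName events.har -p /user/nicolas events /user/nicolas/archives

This packs the files under /user/nicolas/events into a single events.har. Keep in mind that a HAR archive is immutable once created, so files inside it cannot be replaced or updated afterwards.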



On Sun, Sep 27, 2015 at 15:36, <nib...@free.fr> wrote:


Hello,
I'm still investigating the small file generation problem created by my Spark Streaming jobs.
Indeed, my Spark Streaming jobs receive a lot of small events (10 KB on average), and I have to store them in HDFS in order to process them with Pig jobs on demand.
The problem is that I generate a lot of small files in HDFS (several million), and that can be problematic.
I looked into using HBase or an archive file, but in the end I don't want to do that.
So, what about this solution:
- Spark Streaming generates several million small files in HDFS on the fly
- Each night I merge them into a big daily file
- I launch my Pig jobs on this big file

Another question I have:
- Is it possible to append to a big (daily) file by adding my events on the fly?

Thanks a lot
Nicolas

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
