In Hadoop you should not keep many small files; pack them into a HAR (Hadoop Archive) instead.
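For reference, the sketch below shows how the archive could be built and then read back from Spark. This is only a minimal sketch: the paths, app name, and archive name are made up for illustration.

    // Build the archive once, outside Spark, with the standard CLI tool:
    //   hadoop archive -archiveName pages.har -p /user/reth/html /user/reth
    // Then read it back through the har:// filesystem from Spark (Scala):
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("read-har"))
    // wholeTextFiles yields (path, content) pairs for each file it finds.
    val pages = sc.wholeTextFiles("har:///user/reth/pages.har")
    println(pages.count())
    sc.stop()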
On 13 Dec 2016, at 05:42, Jakob Odersky wrote:
> Assuming the bottleneck is IO, you could try saving your files to
> HDFS. This will distribute your data and allow for better concurrent
> reads.
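A minimal sketch of that suggestion, assuming the files have already been copied into HDFS; the path and partition count are illustrative. The second argument to wholeTextFiles is a hint for the minimum number of partitions, which raises the number of concurrent read tasks:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("load-pages"))
    // Read (path, content) pairs; minPartitions is only a lower-bound hint.
    val pages = sc.wholeTextFiles("hdfs:///user/reth/html", minPartitions = 200)
    println(pages.count())
    sc.stop()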
On Mon, Dec 12, 2016 at 3:06 PM, Reth RM wrote:
> Hi,
>
> I have millions of html files in a directory, and I am using the "wholeTextFiles"
> API to load them for further processing. Right now, testing with 40k records,
> the load step alone (wholeTextFiles) takes at least 8-9 minutes.
> What are some recommended optimizations? Should I consider any