Hi Xinh,
I tried to wrap it, but it still didn’t work. I got a
"java.util.ConcurrentModificationException".
All,
I have been trying and trying with some help from a coworker, but it’s slow
going. I have been able to gather a list of the S3 files I need to download.
### S3 Lists ###
import scala
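In short, it does something like this (a rough sketch using the AWS Java SDK;
the bucket and prefix are placeholders):

import scala.collection.JavaConverters._
import com.amazonaws.services.s3.AmazonS3Client

val s3 = new AmazonS3Client()

// Placeholder bucket and date prefix; swap in the real layout.
val listing = s3.listObjects("my-bucket", "2016/03/09/")
val keys = listing.getObjectSummaries.asScala.map(_.getKey).toList

// Note: a listing can be truncated; follow up with
// s3.listNextBatchOfObjects(listing) while listing.isTruncated is true.
keys.foreach(println)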
Could you wrap the ZipInputStream in a List, since a subtype of
TraversableOnce[?] is required?
case (name, content) => List(new ZipInputStream(content.open))
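In context it would look roughly like this (just a sketch; the path, and
reading the entry afterwards, are my assumptions):

import java.io.{BufferedReader, InputStreamReader}
import java.util.zip.ZipInputStream

// Placeholder path; sc.binaryFiles yields (name, PortableDataStream) pairs.
val zips = sc.binaryFiles("s3n://my-bucket/2016/03/09/*.zip")

val lines = zips.flatMap { case (name, content) =>
  // The List wrapper satisfies flatMap's TraversableOnce requirement.
  List(new ZipInputStream(content.open)).flatMap { zis =>
    zis.getNextEntry // single-file archive: advance to its one entry
    val reader = new BufferedReader(new InputStreamReader(zis))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }
}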
Xinh
On Wed, Mar 9, 2016 at 7:07 AM, Benjamin Kim wrote:
> Hi Sabarish,
>
> I found a similar posting online where I should use the S3 listKeys.
Hi Sabarish,
I found a similar posting online where I should use the S3 listKeys.
http://stackoverflow.com/questions/24029873/how-to-read-multiple-text-files-into-a-single-rdd
Is this what you were thinking?
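If I read it right, the keys we gather can just be joined into one
comma-separated path (made-up keys below):

// Placeholder keys; compressed files only work through textFile when the
// codec is one Hadoop supports (e.g. gzip).
val keys = List(
  "s3n://my-bucket/2016/03/09/a.csv",
  "s3n://my-bucket/2016/03/09/b.csv")

// sc.textFile accepts a comma-separated list of paths.
val rdd = sc.textFile(keys.mkString(","))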
And, your assumption is correct. The zipped CSV file contains only a single
file. I f
Oozie may be able to do this for you and integrate with Spark.
> On 09 Mar 2016, at 06:03, Benjamin Kim wrote:
>
> I am wondering if anyone can help.
>
> Our company stores zipped CSV files in S3, which has been a big headache from
> the start. I was wondering if anyone has created a way to i
You can use S3's listKeys API and do a diff between consecutive listKeys to
identify what's new.
Are there multiple files in each zip? Single-file archives are processed
just like text, as long as the compression format is a supported one.
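The diff itself is trivial once you keep the last run's keys around (names
here are illustrative):

// Keys remembered from the previous run (persisted between runs in practice).
val previous = Set("2016/03/08/a.csv.zip")

// Keys returned by the latest listKeys call.
val current = Set("2016/03/08/a.csv.zip", "2016/03/09/b.csv.zip")

val newKeys = current -- previous // Set("2016/03/09/b.csv.zip")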
Regards
Sab
On Wed, Mar 9, 2016 at 10:33 AM, Benjamin Kim wrote:
https://issues.apache.org/jira/browse/SPARK-3586 talks about creating a
file DStream that can monitor for new files recursively, but this
functionality has not been added yet.
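What does exist today is single-directory monitoring, roughly like this
(placeholder path):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(60))

// textFileStream watches one directory only -- no recursion into subfolders.
val csv = ssc.textFileStream("s3n://my-bucket/2016/03/09/")
csv.print()

ssc.start()
ssc.awaitTermination()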
I don't see an easy way out. You will have to create your folders based on a
timeline (it looks like you are already doing that) and