I could be wrong, but I think you can use a wildcard:

df = spark.read.format('csv').load('/path/to/file*.csv.gz')
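A self-contained sketch of that approach, with made-up paths; as far as I know, Spark's path strings go through Hadoop glob expansion, so a pattern can also span directories:

from pyspark.sql import SparkSession

# Assumes a local or cluster SparkSession; the app name is arbitrary.
spark = SparkSession.builder.appName("glob-read").getOrCreate()

# One glob matching files in a single directory...
df = spark.read.format('csv').load('/path/to/file*.csv.gz')

# ...or, hypothetically, a glob spanning sibling directories.
df_multi = spark.read.format('csv').load('/path/*/part*.csv.gz')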
Thank You,
Irving Duran

On Fri, May 4, 2018 at 4:38 AM Shuporno Choudhury <shuporno.choudh...@gmail.com> wrote:

> Hi,
>
> I want to read multiple files in parallel into 1 dataframe, but the files
> have random names and don't conform to any pattern (so I can't use a
> wildcard). The files can also be in different directories.
> If I provide the file names in a list to the dataframe reader, it reads
> them sequentially.
> Eg:
> df = spark.read.format('csv').load(['/path/to/file1.csv.gz', '/path/to/file2.csv.gz', '/path/to/file3.csv.gz'])
> This reads the files sequentially. What can I do to read the files
> in parallel?
> I noticed that Spark reads files in parallel if given a directory
> location directly. How can that be extended to multiple random files?
> Suppose my system has 4 cores; how can I make Spark read 4 files at a
> time?
>
> Please suggest.
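For what it's worth, a runnable version of the list-based read from the quoted question (file names and directories here are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("list-read").getOrCreate()

# load() accepts a list of paths, so randomly named files in
# different directories can land in a single DataFrame.
paths = [
    '/path/to/file1.csv.gz',
    '/other/dir/file2.csv.gz',
    '/misc/file3.csv.gz',
]
df = spark.read.format('csv').load(paths)

One note on the 4-cores question: gzip is not a splittable format, so each .csv.gz file typically becomes a single task; if those tasks land in one stage, an executor with 4 cores should be able to work on up to 4 of them concurrently.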