I understand how it works with HDFS. My question is: when HDFS is not the file system, how is the number of partitions calculated? Hope that makes it clearer.
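To make it concrete, this is roughly what I am running to check. It is a minimal PySpark sketch, and the file:// path below is just a placeholder, not a real mount:

    # Inspect how many partitions Spark actually creates when the input
    # is not on HDFS (the NFS path below is a placeholder).
    from pyspark import SparkContext

    sc = SparkContext(appName="nfs-partition-check")

    r = sc.textFile("file:///mnt/nfs/my/file")
    print(r.getNumPartitions())      # partitions Spark chose for the non-HDFS file
    print(sc.defaultMinPartitions)   # minimum textFile uses when no value is passed

    # Asking for more partitions explicitly (the second argument is a minimum):
    r2 = sc.textFile("file:///mnt/nfs/my/file", 64)
    print(r2.getNumPartitions())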
On Mon, 12 Jun 2017 at 2:42 am, vaquar khan <vaquar.k...@gmail.com> wrote:

> As per the Spark doc:
> The textFile method also takes an optional second argument for
> controlling the number of partitions of the file. *By default, Spark
> creates one partition for each block of the file (blocks being 128MB by
> default in HDFS)*, but you can also ask for a higher number of partitions
> by passing a larger value. Note that you cannot have fewer partitions than
> blocks.
>
> sc.textFile doesn't commence any reading. It simply defines a
> driver-resident data structure which can be used for further processing.
>
> It is not until an action is called on an RDD that Spark will build up a
> strategy to perform all the required transforms (including the read) and
> then return the result.
>
> If there is an action called to run the sequence, and your next
> transformation after the read is map, then Spark will need to read a small
> section of lines of the file (according to the partitioning strategy based
> on the number of cores) and then immediately start to map it, until it
> needs to return a result to the driver or shuffle before the next sequence
> of transformations.
>
> If your partitioning strategy (defaultMinPartitions) seems to be swamping
> the workers because the Java representation of your partition (an
> InputSplit in HDFS terms) is bigger than the available executor memory,
> then you need to specify the number of partitions to read as the second
> parameter to textFile. You can calculate the ideal number of partitions by
> dividing your file size by your target partition size (allowing for memory
> growth). A simple check that the file can be read would be:
>
> sc.textFile(file, numPartitions).count()
>
> You can find a good explanation here:
>
> https://stackoverflow.com/questions/29011574/how-does-partitioning-work-for-data-from-files-on-hdfs
>
> Regards,
> Vaquar khan
>
> On Jun 11, 2017 5:28 AM, "ayan guha" <guha.a...@gmail.com> wrote:
>
>> Hi
>>
>> My question is: what happens if I have one file of, say, 100 GB? How
>> many partitions will there be?
>>
>> Best
>> Ayan
>>
>> On Sun, 11 Jun 2017 at 9:36 am, vaquar khan <vaquar.k...@gmail.com>
>> wrote:
>>
>>> Hi Ayan,
>>>
>>> If you have multiple files (for example, 12 files) and you are using
>>> the following code, then you will get 12 partitions:
>>>
>>> r = sc.textFile("file://my/file/*")
>>>
>>> Not sure what you want to know about the file system; please check the
>>> API doc.
>>>
>>> Regards,
>>> Vaquar khan
>>>
>>> On Jun 8, 2017 10:44 AM, "ayan guha" <guha.a...@gmail.com> wrote:
>>>
>>> Anyone?
>>>
>>> On Thu, 8 Jun 2017 at 3:26 pm, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> Hi Guys
>>>>
>>>> Quick one: how does Spark deal with (i.e. create partitions for) large
>>>> files sitting on NFS, assuming all the executors can see the file in
>>>> exactly the same way?
>>>>
>>>> That is, when I run
>>>>
>>>> r = sc.textFile("file://my/file")
>>>>
>>>> what happens if the file is on NFS?
>>>>
>>>> Is there any difference from
>>>>
>>>> r = sc.textFile("hdfs://my/file")
>>>>
>>>> Are the input formats used the same in both cases?
>>>>
>>>> --
>>>> Best Regards,
>>>> Ayan Guha

--
Best Regards,
Ayan Guha
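PS: for anyone reading this later, a sketch of the sizing calculation Vaquar describes above: divide the file size by a target partition size and pass the result as the second argument to textFile. The path, the 128 MB target, and the use of os.path.getsize are placeholders/assumptions of mine, not from the thread:

    # Derive a partition count from file size / target partition size,
    # then do the simple read check suggested above.
    import math
    import os

    from pyspark import SparkContext

    sc = SparkContext(appName="partition-sizing")

    path = "/mnt/nfs/my/file"                    # placeholder NFS path
    target_partition_bytes = 128 * 1024 * 1024   # aim for roughly 128 MB per partition

    file_size = os.path.getsize(path)            # the NFS mount looks like a local file
    num_partitions = max(1, int(math.ceil(float(file_size) / target_partition_bytes)))

    # Simple check that the file can be read with that many partitions:
    print(sc.textFile("file://" + path, num_partitions).count())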