Hi Steve,

Thank you for the more source-oriented answer. It helped, but it didn't
explain the reason for such eagerness. The file(s) might not be on the
driver but only on the executors where the Spark job(s) run. I don't see
why Spark should check the file(s), regardless of whether a glob pattern
is used.

Do you see my way of thinking?
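For reference, my understanding is that the glob detection you mention boils down to scanning the path string for glob metacharacters. A simplified, standalone sketch of that check (a hypothetical rewrite for illustration, not Spark's exact code):

```scala
// Simplified sketch of how SparkHadoopUtil.isGlobPath decides whether a
// path is a glob pattern: the path counts as a glob if it contains any
// glob metacharacter, including a backslash (used for escaping).
def isGlobPath(path: String): Boolean =
  path.exists("{}[]*?\\".contains(_))

// A plain path is checked eagerly; a backslash is enough to defer the check:
// isGlobPath("/var/logs")   -> false (eager existence check)
// isGlobPath("/var/log\\s") -> true  (treated as a glob, checked later)
```

So a path like "/var/log\s" is handed to the globber and escapes the eager existence check, at the cost of silently matching nothing if the pattern is wrong.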

Pozdrawiam,
Jacek Laskowski
----
https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Sep 8, 2016 at 11:20 AM, Steve Loughran <ste...@hortonworks.com> wrote:
> Fail-fast generally means that you find problems sooner rather than later, and
> here, potentially, that your code runs but simply returns empty data without
> any obvious cue as to what is wrong.
>
> As is always good in OSS, follow those stack trace links to see what they say:
>
>         // Check whether the path exists if it is not a glob pattern.
>         // For glob pattern, we do not check it because the glob pattern
>         // might only make sense once the streaming job starts and some
>         // upstream source starts dropping data.
>
> If you specify a glob pattern, you'll get the late check at the expense of
> the risk of an empty data source if the pattern is wrong. Something like
> "/var/log\s" would suffice, as the presence of the backslash is enough for
> SparkHadoopUtil.isGlobPath() to conclude that it's something for the globber.
>
>
>> On 8 Sep 2016, at 07:33, Jacek Laskowski <ja...@japila.pl> wrote:
>>
>> Hi,
>>
>> I'm wondering what the rationale is for checking the path option
>> eagerly in FileStreamSource? My thinking is that until start is called,
>> no processing is going on; the processing is supposed to happen on the
>> executors (not the driver), where the path is available.
>>
>> I could (and perhaps should) use dfs, but IMHO that just hides the real
>> question of the text source's eagerness.
>>
>> Please help me understand the rationale of the choice. Thanks!
>>
>> scala> spark.version
>> res0: String = 2.1.0-SNAPSHOT
>>
>> scala> spark.readStream.format("text").load("/var/logs")
>> org.apache.spark.sql.AnalysisException: Path does not exist: /var/logs;
>>  at 
>> org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:229)
>>  at 
>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:81)
>>  at 
>> org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:81)
>>  at 
>> org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
>>  at 
>> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:142)
>>  at 
>> org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:153)
>>  ... 48 elided
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
