Problem building spark-catalyst_2.12 with Maven

2022-02-10 Thread Martin Grigorov
Hi,

I am not able to build Spark due to the following error:

[ERROR] ## Exception when compiling 543 sources to
/home/martin/git/apache/spark/sql/catalyst/target/scala-2.12/classes
java.lang.BootstrapMethodError: call site initialization exception
java.lang.invoke.CallSite.makeSite(CallSite.java:341)
java.lang.invoke.MethodHandleNatives.linkCallSiteImpl(MethodHandleNatives.java:307)
java.lang.invoke.MethodHandleNatives.linkCallSite(MethodHandleNatives.java:297)
scala.tools.nsc.typechecker.Typers$Typer.typedBlock(Typers.scala:2504)
scala.tools.nsc.typechecker.Typers$Typer.$anonfun$typed1$103(Typers.scala:5711)
scala.tools.nsc.typechecker.Typers$Typer.typedOutsidePatternMode$1(Typers.scala:500)
scala.tools.nsc.typechecker.Typers$Typer.typed1(Typers.scala:5746)
scala.tools.nsc.typechecker.Typers$Typer.typed(Typers.scala:5781)
...
Caused by: java.lang.StackOverflowError
at java.lang.ref.Reference.<init> (Reference.java:303)
at java.lang.ref.WeakReference.<init> (WeakReference.java:57)
at java.lang.invoke.MethodType$ConcurrentWeakInternSet$WeakEntry.<init> (MethodType.java:1269)
at java.lang.invoke.MethodType$ConcurrentWeakInternSet.get (MethodType.java:1216)
at java.lang.invoke.MethodType.makeImpl (MethodType.java:302)
at java.lang.invoke.MethodType.dropParameterTypes (MethodType.java:573)
at java.lang.invoke.MethodType.replaceParameterTypes (MethodType.java:467)
at java.lang.invoke.MethodHandle.asSpreader (MethodHandle.java:875)
at java.lang.invoke.Invokers.spreadInvoker (Invokers.java:158)
at java.lang.invoke.CallSite.makeSite (CallSite.java:324)
at java.lang.invoke.MethodHandleNatives.linkCallSiteImpl (MethodHandleNatives.java:307)
at java.lang.invoke.MethodHandleNatives.linkCallSite (MethodHandleNatives.java:297)
at scala.tools.nsc.typechecker.Typers$Typer.typedBlock (Typers.scala:2504)
at scala.tools.nsc.typechecker.Typers$Typer.$anonfun$typed1$103 (Typers.scala:5711)
at scala.tools.nsc.typechecker.Typers$Typer.typedOutsidePatternMode$1 (Typers.scala:500)
at scala.tools.nsc.typechecker.Typers$Typer.typed1 (Typers.scala:5746)
at scala.tools.nsc.typechecker.Typers$Typer.typed (Typers.scala:5781)

I have experimented a lot with the scala-maven-plugin jvmArgs settings at
[1], but so far nothing has helped.
The same error occurs with both Scala 2.12 and 2.13.

The command I use is: ./build/mvn install -Pkubernetes -DskipTests

I need to create a distribution from the master branch.

Java: 1.8.0_312
Maven: 3.8.4
OS: Ubuntu 21.10

Any hints?
Thank you!

1.
https://github.com/apache/spark/blob/50256bde9bdf217413545a6d2945d6c61bf4cfff/pom.xml#L2845-L2849


Re: Problem building spark-catalyst_2.12 with Maven

2022-02-10 Thread Sean Owen
Yes, I've seen this; the JVM stack size needs to be increased. I'm not sure
whether it's environment-specific (though you and I have both hit it, and I
think others have too) or whether we need to change our build script.
In pom.xml, find the "-Xss..." settings and increase them to something like
"-Xss4m"; see if that works.



Help needed to locate the csv parser (for Spark bug reporting/fixing)

2022-02-10 Thread Marnix van den Broek
hi all,

Yesterday I filed a CSV parsing bug [1] against Spark that leads to
incorrect data when the input contains sequences similar to the one in the
report.

I wanted to take a look at the parsing logic to see if I could spot the
error, update the issue with more information, and possibly contribute a PR
with a bug fix, but I got completely lost navigating my way down the
dependencies in the Spark repository. Can someone point me in the right
direction?

I am looking for the CSV parser itself, which is likely an external
dependency?

The next question might need too much knowledge of Spark internals for me to
know where to look or to understand what I'd be looking at, but I would also
like to see if and why the CSV parsing implementation differs when columns
are projected, as opposed to when the full dataframe is processed. The issue
only occurs when projecting columns, and this inconsistency is a worry in
itself.

Many thanks,

Marnix

1. https://issues.apache.org/jira/browse/SPARK-38167


Re: Help needed to locate the csv parser (for Spark bug reporting/fixing)

2022-02-10 Thread Sean Owen
It starts in org.apache.spark.sql.execution.datasources.csv.CSVDataSource.
Yes, univocity is used for much of the parsing.
I am not sure of the cause, but it does look like a bug indeed. In one case
the parser is asked to read all fields; in the other, to skip one. The
pushdown helps efficiency, but something is going wrong.
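
A minimal sketch of the two modes, assuming the standard univocity-parsers
API (the settings Spark actually passes are more involved):

  import java.io.StringReader
  import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

  object ProjectionDemo {
    def main(args: Array[String]): Unit = {
      val csv = "1,2,3\n"

      // Mode 1: the parser reads every field of each row
      val full = new CsvParser(new CsvParserSettings)
      println(full.parseAll(new StringReader(csv)).get(0).mkString("|"))      // 1|2|3

      // Mode 2: only selected columns are materialized, which is what
      // Spark's column-pruning pushdown asks for
      val settings = new CsvParserSettings
      settings.selectIndexes(Integer.valueOf(0), Integer.valueOf(2))
      val projected = new CsvParser(settings)
      println(projected.parseAll(new StringReader(csv)).get(0).mkString("|")) // 1|3
    }
  }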



Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-10 Thread John Zhuge
The vote is now closed, and it passes. Thank you to everyone who took
the time to review and vote on this SPIP. I’m looking forward to adding
this feature to the next Spark release. The tracking JIRA is
https://issues.apache.org/jira/browse/SPARK-31357.

The tally is:

+1s:

Walaa Eldin Moustafa
Erik Krogen
Holden Karau (binding)
Ryan Blue
Chao Sun
L C Hsieh (binding)
Huaxin Gao
Yufei Gu
Terry Kim
Jacky Lee
Wenchen Fan (binding)

0s:

-1s:

On Mon, Feb 7, 2022 at 10:04 PM Wenchen Fan  wrote:

> +1 (binding)
>
> On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee  wrote:
>
>> +1 (non-binding). Thanks John!
>> It's great to see ViewCatalog moving on, it's a nice feature.
>>
>> On Sat, Feb 5, 2022 at 11:57, Terry Kim wrote:
>>
>>> +1 (non-binding). Thanks John!
>>>
>>> Terry
>>>
>>> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu  wrote:
>>>
 +1 (non-binding)
 Best,

 Yufei

 `This is not a contribution`


 On Fri, Feb 4, 2022 at 11:54 AM huaxin gao 
 wrote:

> +1 (non-binding)
>
> On Fri, Feb 4, 2022 at 11:40 AM L. C. Hsieh  wrote:
>
>> +1
>>
>> On Thu, Feb 3, 2022 at 7:25 PM Chao Sun  wrote:
>> >
>> > +1 (non-binding). Looking forward to this feature!
>> >
>> > On Thu, Feb 3, 2022 at 2:32 PM Ryan Blue  wrote:
>> >>
>> >> +1 for the SPIP. I think it's well designed and it has worked
>> quite well at Netflix for a long time.
>> >>
>> >> On Thu, Feb 3, 2022 at 2:04 PM John Zhuge 
>> wrote:
>> >>>
>> >>> Hi Spark community,
>> >>>
>> >>> I’d like to restart the vote for the ViewCatalog design proposal
>> (SPIP).
>> >>>
>> >>> The proposal is to add a ViewCatalog interface that can be used
>> to load, create, alter, and drop views in DataSourceV2.
>> >>>
>> >>> Please vote on the SPIP until Feb. 9th (Wednesday).
>> >>>
>> >>> [ ] +1: Accept the proposal as an official SPIP
>> >>> [ ] +0
>> >>> [ ] -1: I don’t think this is a good idea because …
>> >>>
>> >>> Thanks!
>> >>
>> >>
>> >>
>> >> --
>> >> Ryan Blue
>> >> Tabular
>>
>>

-- 
John Zhuge


Re: Problem building spark-catalyst_2.12 with Maven

2022-02-10 Thread Martin Grigorov
Hi Sean,

On Thu, Feb 10, 2022 at 5:37 PM Sean Owen wrote:

> In pom.xml, find the "-Xss..." settings and increase them to something
> like "-Xss4m"; see if that works.

It is already set to a much larger value - 128m (
https://github.com/apache/spark/blob/50256bde9bdf217413545a6d2945d6c61bf4cfff/pom.xml#L2845
)
I've tried both smaller and larger values for all the jvmArgs next to this
one; none helped!
I also have a feeling that something in my environment overrides these
values, but so far I cannot identify what.





Re: Problem building spark-catalyst_2.12 with Maven

2022-02-10 Thread Sean Owen
I think this is another occurrence where I had to change that setting or had
to set MAVEN_OPTS. It seems to occur in a way that the pom.xml setting
doesn't affect, though I don't quite understand why. Try increasing the
stack size in the test runner configs as well.
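
For example, something along these lines (illustrative, not a verified fix):

  # Raise the stack size of the Maven JVM itself via MAVEN_OPTS; note that
  # JVMs forked by plugins (scalac, test runners) take their flags from the
  # plugin configuration instead
  MAVEN_OPTS="-Xss4m" ./build/mvn install -Pkubernetes -DskipTests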



Re: Problem building spark-catalyst_2.12 with Maven

2022-02-10 Thread Martin Grigorov
I've found the problem!
It was indeed a local thingy!

$ cat ~/.mavenrc
MAVEN_OPTS='-XX:+TieredCompilation -XX:TieredStopAtLevel=1'

I added this some time ago to speed up the build, but it seems it also
overrides the MAVEN_OPTS environment variable...
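
One way to keep the build-time optimization without clobbering the stack
size would be to merge the flags in the same file (a sketch; the -Xss value
is illustrative):

  $ cat ~/.mavenrc
  # Keep the fast-build flags, but re-add the stack size that the exported
  # MAVEN_OPTS would otherwise have provided
  MAVEN_OPTS='-Xss4m -XX:+TieredCompilation -XX:TieredStopAtLevel=1'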

Now it fails with:

[INFO] --- scala-maven-plugin:4.3.0:compile (scala-compile-first) @
spark-catalyst_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file:
/home/martin/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.15__52.0-1.3.1_20191012T045515.jar
[INFO] compiler plugin:
BasicArtifact(com.github.ghik,silencer-plugin_2.12.15,1.7.6,null)
[INFO] Compiling 372 Scala sources and 171 Java sources to
/home/martin/git/apache/spark/sql/catalyst/target/scala-2.12/classes ...

[ERROR] [Error] : error writing
/home/martin/git/apache/spark/sql/catalyst/target/scala-2.12/classes/org/apache/spark/sql/catalyst/analysis/Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.class:
java.nio.file.FileSystemException
/home/martin/git/apache/spark/sql/catalyst/target/scala-2.12/classes/org/apache/spark/sql/catalyst/analysis/Analyzer$ResolveGroupingAnalytics$$anonfun$org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGroupingAnalytics$$replaceGroupingFunc$1.class:
File name too long
This one, however, is well documented:
https://spark.apache.org/docs/latest/building-spark.html#encrypted-filesystems
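
From memory, the documented workaround caps the length of generated class
file names via the scala-maven-plugin args in pom.xml (the linked page is
authoritative for the exact placement):

  <!-- Inside the scala-maven-plugin <configuration><args> block -->
  <arg>-Xmax-classfile-name</arg>
  <arg>128</arg>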

All works now!
Thank you, Sean!



Re: Help needed to locate the csv parser (for Spark bug reporting/fixing)

2022-02-10 Thread Marnix van den Broek
Thanks, Sean!

It was actually on the Catalyst side of things, but I found where column
pruning pushdown is delegated to univocity, see [1].

I've tried setting the Spark configuration
spark.sql.csv.parser.columnPruning.enabled to false, and this prevents the
bug from happening. I am unfamiliar with Java/Scala, so I might be
misreading things, but to me everything points to a bug in univocity,
specifically in how the selectIndexes parser setting affects the parsing of
the example in the bug report.
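
For anyone hitting this in the meantime, the session-level workaround looks
like this (a sketch; the input path is a placeholder, and disabling pruning
trades away the pushdown optimization):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().getOrCreate()
  // With pruning disabled, univocity parses every field even when the
  // query only selects a subset of columns
  spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", "false")
  val df = spark.read.option("header", "true").csv("/path/to/data.csv")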

This means that to fix this bug, univocity must be fixed and Spark then
needs to depend on the fixed version, correct? Unless someone thinks this
analysis is off, I'll add this info to the Spark issue and file a bug report
with univocity.

1.
https://github.com/apache/spark/blob/6a59fba248359fb2614837fe8781dc63ac8fdc4c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala#L79
