You wrote:

"2. On each machine, I can install pyspark by running `python setup.py install` inside the python directory.
Step 2 would fail because of missing the licenses directory."

That shouldn't depend on the license file, and the script you showed does not fail when the file is not present, so I am wondering what this means. I'm not sure there's a JIRA here yet.

On Fri, May 1, 2020 at 1:46 PM Xiangyu Li <yisky...@gmail.com> wrote:

> Hmm, sorry I don't get what part of my email were you referring to when
> you said "the build fails?".
>
> So I am trying to build a custom spark binary distribution with, say,
> different Hadoop versions and R support.
>
> Then I stored this custom build on S3, so as I am building more machines I
> can just directly download this custom build from S3. But besides
> spark-submit and what not, I also wanted to install the pyspark python
> package to the machine I am building.
>
> The lack of the LICENSE file in the custom build would prevent pyspark
> from being successfully built.
>
> Hopefully this answers your question.
>
> The second part of my last email was about building pyspark inside spark
> source directory, I will raise an issue on Jira for that, as it is more of
> a clean cut problem with the documentation on the website and the comments
> in make-distribution.sh.
>
>
>
> On Fri, May 1, 2020 at 1:31 PM Sean Owen <sro...@gmail.com> wrote:
>
>> Hm, the build fails? you can see this is just skipped if not present, for
>> this reason.
>> I'm not clear why you need the file for its own sake, for your own
>> internal modification that you don't redistribute.
>>
>>
>>
>> On Fri, May 1, 2020 at 11:43 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> Thanks for the quick response! Yes, what you described about how LICENSE
>>> file should be distributed makes sense.
>>>
>>> The reason I learned about this is that I was trying to build
>>> spark-2.4.5-bin-custom.tgz, then distribute this build to multiple
>>> machines, so that:
>>>
>>> 1. These machines can run spark with the build.
>>> 2. On each machine, I can install pyspark by running `python setup.py
>>> install` inside the python directory.
>>>
>>> Step 2 would fail because of missing the licenses directory.
>>>
>>> Building pyspark out of a binary distribution is a bit unconventional,
>>> but I did this after failing to do what the official doc recommended (
>>> https://spark.apache.org/docs/latest/building-spark.html#pyspark-pip-installable),
>>> so taking a step back to describe what I did originally:
>>>
>>> In the spark-2.4.5 src directory, I just did a simple:
>>>
>>> `./build/mvn -DskipTests clean package`
>>>
>>>
>>> And then went to the python directory and did:
>>>
>>>
>>> `python setup.py sdist` followed by `pip install
>>> dist/pyspark-2.4.5.tar.gz` (as mentioned in the make-distribution.sh.)
>>>
>>>
>>> This ran into "error: package directory `deps/jars` does not exist".
>>>
>>>
>>> However, directly running
>>>
>>>
>>> `sudo python setup.py install`
>>>
>>>
>>> worked.
>>>
>>>
>>>
>>> On Fri, May 1, 2020 at 11:30 AM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> The source distribution has the source LICENSE file. The binary
>>>> distribution has the LICENSE-binary license file. The source release isn't
>>>> supposed to have LICENSE-binary as it would not be accurate for that
>>>> release; LICENSE is. If you're redistributing a build, you'll have your own
>>>> process for modifying and building it, including modifying the LICENSE file
>>>> as appropriate; these LICENSE files represent what the project delivers to
>>>> you rather than what you deliver to others. You could get the
>>>> LICENSE-binary file from the right hash commit from git, if desired, as
>>>> part of your build.
>>>>
>>>> On Fri, May 1, 2020 at 10:19 AM Xiangyu Li <yisky...@gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I downloaded spark-2.4.5 source from
>>>>> https://mirrors.ocf.berkeley.edu/apache/spark/spark-2.4.5/spark-2.4.5.tgz
>>>>> After extracting it and running:
>>>>>
>>>>> ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr
>>>>> -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes
>>>>>
>>>>>
>>>>> It creates a Spark binary distribution named:
>>>>> spark-2.4.5-bin-custom-spark.tgz
>>>>>
>>>>> So this file is supposedly a ready-to-distribute Spark binary file
>>>>> like the one you can download from
>>>>> http://mirror.metrocast.net/apache/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz
>>>>>
>>>>> However, one big difference between this custom build and the official
>>>>> build is that you do not have a LICENSE file in the custom build. I don't
>>>>> know much about Apache license, but I would suppose a custom build
>>>>> distribution should have one.
>>>>>
>>>>> The reason we are missing the file is caused by the following code in
>>>>> make-distribution.sh:
>>>>> [image: image.png]
>>>>>
>>>>> There is no LICENSE-binary file in the official spark-2.4.5.tgz file,
>>>>> therefore there will be no LICENSE file in your custom build.
>>>>>
>>>>> I am aware of two pull requests related to this:
>>>>>
>>>>> https://github.com/apache/spark/pull/22436
>>>>> started to use LICENSE-binary instead of just the LICENSE.
>>>>>
>>>>> And
>>>>> https://github.com/apache/spark/pull/22840
>>>>> To avoid failure when there is no LICENSE-binary in spark-2.4.5 source
>>>>> directory.
>>>>>
>>>>> I think we need to change make-distribution.sh to make sure that the
>>>>> LICENSE file is copied over to its corresponding custom build
>>>>> distribution.
>>>>> However, I am not ready to do a pull request, so hopefully we can discuss
>>>>> it here first.
>>>>> --
>>>>> Sincerely
>>>>> Xiangyu Li
>>>>>
>>>>> <yisky...@gmail.com>
>>>>>
>>>>
>>>
>>> --
>>> Sincerely
>>> Xiangyu Li
>>>
>>> <yisky...@gmail.com>
>>>
>>
>
> --
> Sincerely
> Xiangyu Li
>
> <yisky...@gmail.com>
>
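
For readers following the thread without the attached image: the behavior being discussed (copy the binary license into the distribution when LICENSE-binary exists in the source tree, otherwise skip the step without failing) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from make-distribution.sh. The function name `copy_license_files`, its two arguments, and the companion file names `licenses-binary` and `NOTICE-binary` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual code from make-distribution.sh.
# Usage: copy_license_files SRC_DIR DIST_DIR
copy_license_files() {
  local src="$1" dist="$2"
  mkdir -p "$dist"
  if [ -e "$src/LICENSE-binary" ]; then
    # The source tree carries a binary license: ship it as LICENSE.
    cp "$src/LICENSE-binary" "$dist/LICENSE"
    # Companion files (names are assumptions) are copied only when present.
    if [ -d "$src/licenses-binary" ]; then
      cp -r "$src/licenses-binary" "$dist/licenses"
    fi
    if [ -e "$src/NOTICE-binary" ]; then
      cp "$src/NOTICE-binary" "$dist/NOTICE"
    fi
  else
    # No binary license in the source tree: skip the step instead of failing.
    echo "Skipping copying LICENSE files"
  fi
}
```

Called on a source tree without LICENSE-binary, the helper prints the skip message and returns successfully, which matches the "just skipped if not present" behavior described above. The resulting distribution then simply has no LICENSE file.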