+1 (non-binding)
Compiled on Mac OS with:
build/mvn -Pyarn,sparkr,hive,hive-thriftserver -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
Checked around R
Looked into legal files
All looks good.
On Thu, Sep 24, 2015 at 12:27 AM, Reynold Xin wrote:
> Please vote on releasing the
Hi Fengdong,
So I created two files in HDFS under a test folder.
test/dt=20100101.json
{ "key1" : "value1" }
test/dt=20100102.json
{ "key2" : "value2" }
Then inside PySpark shell
rdd = sc.wholeTextFiles('./test/*')
rdd.collect()
[(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', u'{
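From there, a small map over the (path, content) pairs can derive the extra fields. A minimal sketch in the same PySpark shell (assuming the dt=YYYYMMDD.json layout above, one JSON object per file, and an illustrative output path):
import json
import os

rdd = sc.wholeTextFiles('./test/*')  # sc is predefined in the PySpark shell

def tag_with_source(path_and_content):
    path, content = path_and_content
    record = json.loads(content)                                 # e.g. {"key1": "value1"}
    filename = os.path.basename(path)                            # "dt=20100101.json"
    record["date"] = filename.split("=")[1].split(".")[0]        # "20100101"
    record["source"] = os.path.basename(os.path.dirname(path))   # "test"
    return json.dumps(record)

rdd.map(tag_with_source).saveAsTextFile('./test_tagged')  # illustrative output path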
Ran tests + built/ran an internal Spark Streaming app with the 1.5.1 artifacts.
+1
Cheers,
Sean
On Sep 24, 2015, at 1:28 AM, Reynold Xin <r...@databricks.com> wrote:
Please vote on releasing the following candidate as Apache Spark version 1.5.1.
The vote is open until Sun, Sep 27, 2015 at
Yes. For example, I have two data sets:
data set A: /data/test1/dt=20100101
data set B: /data/test2/dt=20100202
All the data has the same JSON format, such as:
{"key1" : "value1", "key2" : "value2"}
My expected output:
{"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
Sure. May I ask for a sample input (could be just a few lines) and the output
you are expecting, to bring clarity to my thoughts?
On Thu, Sep 24, 2015, 23:44 Fengdong Yu wrote:
> Hi Anchit,
>
> Thanks for the quick answer.
>
> my exact question is: I want to add the HDFS location to each line in my
>
Hi Anchit,
Thanks for the quick answer.
My exact question is: I want to add the HDFS location to each line in my JSON
data.
> On Sep 25, 2015, at 11:25, Anchit Choudhry wrote:
>
> Hi Fengdong,
>
> Thanks for your question.
>
> Spark already has a function called wholeTextFiles within spar
+1 Tested MLlib on Mac OS X
On Thu, Sep 24, 2015 at 6:14 PM, Reynold Xin wrote:
> Krishna,
>
> Thanks for testing every release!
>
>
> On Thu, Sep 24, 2015 at 6:08 PM, Krishna Sankar
> wrote:
>
>> +1 (non-binding, of course)
>>
>> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:48 min
>>
Hi Fengdong,
Thanks for your question.
Spark already has a function called wholeTextFiles within sparkContext
which can help you with that:
Python
For example, if you have the following files:
hdfs://a-hdfs-path/part-0
hdfs://a-hdfs-path/part-1
...
hdfs://a-hdfs-path/part-n
then do:
rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
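Unlike textFile, each element of that RDD is a (file_path, whole_file_content) pair, so the source location is available to attach to every record. A tiny illustration (the paths are placeholders from the snippet above):
pairs = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
# each element looks like ("hdfs://a-hdfs-path/part-0", "...contents of part-0...")
tagged = pairs.map(lambda pc: {"source": pc[0], "content": pc[1]})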
Hi,
I have multiple files with JSON format, such as:
/data/test1_data/sub100/test.data
/data/test2_data/sub200/test.data
I can use sc.textFile("/data/*/*"),
but I want to add {"source" : "HDFS_LOCATION"} to each line, then save it to
one target HDFS location.
How can I do it? Thanks.
--
Yes, the current implementation requires the backend to be on the same host as
the SparkR package. But this does not prevent SparkR from connecting to a remote
Spark cluster specified by a Spark master URL. The only thing needed is that
there needs to be a Spark JAR co-located with the SparkR package on
If a user downloads the Spark source, of course he needs to build it before
running it. But a user can download a pre-built Spark binary distribution, and then
he can directly use sparkR after deployment of the Spark cluster.
From: Hossein [mailto:fal...@gmail.com]
Sent: Friday, September 25, 2015 2:37
Krishna,
Thanks for testing every release!
On Thu, Sep 24, 2015 at 6:08 PM, Krishna Sankar wrote:
> +1 (non-binding, of course)
>
> 1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:48 min
> mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
> 2. Tested pyspark, mllib (iPython 4.0, FYI, n
+1 (non-binding, of course)
1. Compiled OSX 10.10 (Yosemite) OK Total time: 26:48 min
mvn clean package -Pyarn -Phadoop-2.6 -DskipTests
2. Tested pyspark, mllib (iPython 4.0, FYI, notebook install is separate
"conda install python" and then "conda install jupyter")
2.1. statistics (min,max,me
For host information, are you looking for something like this (which is
already available today in Spark 1.5)?
# Spark related configuration
Sys.setenv("SPARK_MASTER_IP"="127.0.0.1")
Sys.setenv("SPARK_LOCAL_IP"="127.0.0.1")
# Load libraries
library("rJava")
library(SparkR, lib.loc="/./spark-b
+1 tested SparkR on Mac and Linux.
--Hossein
On Thu, Sep 24, 2015 at 3:10 PM, Xiangrui Meng wrote:
> +1. Checked user guide and API doc, and ran some MLlib and SparkR
> examples. -Xiangrui
>
> On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin wrote:
> > I'm going to +1 this myself. Tested on my lap
+1. Checked user guide and API doc, and ran some MLlib and SparkR
examples. -Xiangrui
On Thu, Sep 24, 2015 at 2:54 PM, Reynold Xin wrote:
> I'm going to +1 this myself. Tested on my laptop.
>
>
>
> On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin wrote:
>>
>> I forked a new thread for this. Please
I'm going to +1 this myself. Tested on my laptop.
On Thu, Sep 24, 2015 at 10:56 AM, Reynold Xin wrote:
> I forked a new thread for this. Please discuss NOTICE file related things
> there so it doesn't hijack this thread.
>
>
> On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote:
>
>> On Thu, Se
Right now in sparkR.R the backend hostname is hard coded to "localhost" (
https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L156).
If we make that address configurable / parameterized, then a user can
connect to a remote Spark cluster with no need to have Spark jars on their
local machine.
Yes, but the ASF's reading seems to be clear:
http://www.apache.org/dev/licensing-howto.html#permissive-deps
"In LICENSE, add a pointer to the dependency's license within the
source tree and a short note summarizing its licensing:"
I'd be concerned if you get a different interpretation from the AS
Hi Sean,
My reading would be that a separate copy of the BSD license, with copyright
years filled in, is required for each BSD-licensed dependency. Same for
MIT-licensed dependencies. Hopefully, we will receive some guidance on
https://issues.apache.org/jira/browse/LEGAL-226
Thanks,
-Rick
Sea
Yes, the issue of where 3rd-party license information goes is
different, and varies by license. I think the BSD/MIT licenses are all
already listed in LICENSE accordingly. Let me know if you spy an
omission.
On Thu, Sep 24, 2015 at 8:36 PM, Richard Hillegas wrote:
> Thanks for that pointer, Sean.
Thanks for that pointer, Sean. It may be that Derby is putting the license
information in the wrong place, viz. in the NOTICE file. But the 3rd party
license text may need to go somewhere else. See for instance the advice a
little further up the page at
http://www.apache.org/dev/licensing-howto.ht
I don't think the crux of the problem is about users who download the
source -- Spark's source distribution is clearly marked as something
that needs to be built and they can run `mvn -DskipTests -Psparkr
package` based on instructions in the Spark docs.
The crux of the problem is that with a sour
Have a look at http://www.apache.org/dev/licensing-howto.html#mod-notice
though, which makes a good point about limiting what goes into NOTICE
to what is required. That's what makes me think we shouldn't do this.
On Thu, Sep 24, 2015 at 7:24 PM, Richard Hillegas wrote:
> To answer Sean's question
Requiring users to download the entire Spark distribution to connect to a
remote cluster (which is already running Spark) seems like overkill. Even
for most Spark users who download the Spark source, it is very unintuitive that
they need to run a script named "install-dev.sh" before they can run SparkR.
--
Thanks for forking the new email thread, Reynold. It is entirely possible
that I am being overly skittish. I have posed a question for our legal
experts: https://issues.apache.org/jira/browse/LEGAL-226
To answer Sean's question on the previous email thread, I would propose
making changes like the
I forked a new thread for this. Please discuss NOTICE file related things
there so it doesn't hijack this thread.
On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote:
> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas
> wrote:
> > Under your guidance, I would be happy to help compile a NOTICE f
Richard,
Thanks for bringing this up and this is a great point. Let's start another
thread for it so we don't hijack the release thread.
On Thu, Sep 24, 2015 at 10:51 AM, Sean Owen wrote:
> On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas
> wrote:
> > Under your guidance, I would be happy t
On Thu, Sep 24, 2015 at 6:45 PM, Richard Hillegas wrote:
> Under your guidance, I would be happy to help compile a NOTICE file which
> follows the pattern used by Derby and the JDK. This effort might proceed in
> parallel with vetting 1.5.1 and could be targeted at a later release
> vehicle. I don
Hi Sean and Wendell,
I share your concerns about how difficult and important it is to get this
right. I think that the Spark community has compiled a very readable and
well organized NOTICE file. A lot of careful thought went into gathering
together 3rd party projects which share the same license
Hey Richard,
My assessment (just looked before I saw Sean's email) is the same as
his. The NOTICE file embeds other projects' licenses. If those
licenses themselves have pointers to other files or dependencies, we
don't embed them. I think this is standard practice.
- Patrick
On Thu, Sep 24, 201
Hi Richard, those are messages reproduced from other projects' NOTICE
files, not created by Spark. They need to be reproduced in Spark's
NOTICE file to comply with the license, but their text may or may not
apply to Spark's distribution. The intent is that users would track
this back to the source
-1 (non-binding)
I was able to build Spark cleanly from the source distribution using the
command in README.md:
build/mvn -DskipTests clean package
However, while I was waiting for the build to complete, I started going
through the NOTICE file. I was confused about where to find licenses fo
...and we're finished and now building!
On Thu, Sep 24, 2015 at 7:19 AM, shane knapp wrote:
> this is happening now.
>
> On Tue, Sep 22, 2015 at 10:07 AM, shane knapp wrote:
>> ok, here's the updated downtime schedule for this week:
>>
>> wednesday, sept 23rd:
>>
>> firewall maintenance cancelle
this is happening now.
On Tue, Sep 22, 2015 at 10:07 AM, shane knapp wrote:
> ok, here's the updated downtime schedule for this week:
>
> wednesday, sept 23rd:
>
> firewall maintenance cancelled, as jon took care of the update
> saturday morning while we were bringing jenkins back up after the co
Thanks, it seems good, though it's a bit of a hack.
And here is another question: updateStateByKey computes over all the data from the
beginning, but in many situations we just need to update the newly arrived data.
This could be a big improvement in speed and resource usage. Will this be supported
in the future?
Shixiong Zhu
+1 non-binding. This is the first time I've seen all tests pass the
first time with Java 8 + Ubuntu + "-Pyarn -Phadoop-2.6 -Phive
-Phive-thriftserver". Clearly the test improvement efforts are paying
off.
As usual the license, sigs, etc are OK.
On Thu, Sep 24, 2015 at 8:27 AM, Reynold Xin wrote:
You can create a connection like this:
val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
  val dbConnection = ... // create a db connection
  iterator.flatMap { case (key, values, stateOption) =>
    if (values.isEmpty) {
      // don't access database
It seems like a workaround. But I don't know how to get the database
connection on the worker nodes.
Shixiong Zhu wrote on Thursday, September 24, 2015 at 5:37 PM:
> Could you write your update func like this?
>
> val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])])
> => {
> iterator.flatMa
Could you write your update func like this?
val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
  iterator.flatMap { case (key, values, stateOption) =>
    if (values.isEmpty) {
      // don't access database
    } else {
      // update to new state
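The snippet above is Scala. For readers using PySpark, a rough sketch of the same "keep the old state and skip database work for keys with no new values" idea (save_to_db and the socket source are hypothetical placeholders; PySpark's updateStateByKey does not expose the partition-iterator variant, so the per-partition connection setup is not shown):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="update-state-sketch")
ssc = StreamingContext(sc, 10)                 # 10-second batches
ssc.checkpoint("hdfs:///tmp/checkpoints")      # updateStateByKey requires a checkpoint dir

def update_func(new_values, last_state):
    if not new_values:
        # no new data for this key in this batch: keep the old state
        # and skip any database work
        return last_state
    new_state = (last_state or 0) + sum(new_values)
    # save_to_db(new_state)  # hypothetical helper, only called when the key changed
    return new_state

counts = (ssc.socketTextStream("localhost", 9999)   # placeholder input source
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .updateStateByKey(update_func))
counts.pprint()

ssc.start()
ssc.awaitTermination()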
Data that are not updated should have been saved earlier: when data is added to
the DStream for the first time, it should be considered updated, so saving
the same data again is a waste.
What is the community doing? Is there any doc or discussion that I can
look at? Thanks.
Shixiong Zhu wrote on Sep 2015
Thanks for the log file. Unfortunately, this is insufficient as it does not show
why the file does not exist. It could be that before the failure the
file was somehow deleted. For that I need to see both the before-failure and
after-recovery logs. If this can be reproduced, could you generate the before and
For data that are not updated, where do you save them? Or do you only want to
avoid accessing the database for those that are not updated?
Besides, the community is working on optimizing updateStateByKey's
performance. Hope it will be delivered soon.
Best Regards,
Shixiong Zhu
2015-09-24 13:45 GMT+08:
Please vote on releasing the following candidate as Apache Spark version
1.5.1. The vote is open until Sun, Sep 27, 2015 at 10:00 UTC and passes if
a majority of at least 3 +1 PMC votes are cast.
[ ] +1 Release this package as Apache Spark 1.5.1
[ ] -1 Do not release this package because ...
The