Possible issues with listing objects in HadoopFsRelation

2015-08-10 Thread Gil Vernik
Just some thoughts; I hope I didn't miss something obvious.

HadoopFsRelation calls the FileSystem class directly to list files in the 
path. It looks like it implements basically the same logic as the 
FileInputFormat.listStatus method (located in 
hadoop-mapreduce-client-core).

The point is that HadoopRDD (or similar) calls the getSplits method, which 
calls FileInputFormat.listStatus, while HadoopFsRelation calls FileSystem 
directly, so both of them implement their own listing of objects.

There might be various issues with this. For example, 
https://issues.apache.org/jira/browse/SPARK-7868 makes sure that 
"_temporary" is not returned in the result, but the listing in 
FileInputFormat contains more logic: it uses a hidden PathFilter like this

  private static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
      String name = p.getName();
      return !name.startsWith("_") && !name.startsWith(".");
    }
  };

In addition, a custom FileOutputCommitter may use a name other than 
"_temporary".

All this may lead to HadoopFsRelation and HadoopRDD providing different 
listings for the same data source.

My question is: what is the roadmap for this listing in HadoopFsRelation? 
Will it implement exactly the same logic as FileInputFormat.listStatus, or 
will HadoopFsRelation one day call FileInputFormat.listStatus and provide a 
custom PathFilter or MultiPathFilter? That way there would be a single piece 
of code that lists objects.
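
To make the difference concrete, here is a toy sketch (plain Python, file
names invented) comparing FileInputFormat's hidden-file policy with a listing
that only special-cases "_temporary", as in the JIRA above:

entries = ["part-00000", "_SUCCESS", "_temporary", ".part-00000.crc"]

def hidden_file_filter(name):
    # Mirrors FileInputFormat's hiddenFileFilter shown above.
    return not name.startswith("_") and not name.startswith(".")

def temporary_only_filter(name):
    # A listing that only skips "_temporary".
    return name != "_temporary"

print([e for e in entries if hidden_file_filter(e)])
# ['part-00000']
print([e for e in entries if temporary_only_filter(e)])
# ['part-00000', '_SUCCESS', '.part-00000.crc']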

Thanks,
Gil.




Re: PySpark on PyPi

2015-08-10 Thread Davies Liu
I think so, any contributions on this are welcome.

On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger  wrote:
> Sorry, trying to follow the context here. Does it look like there is
> support for the idea of creating a setup.py file and pypi package for
> pyspark?
>
> Cheers,
>
> Brian
>
> On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu  wrote:
>> We could do that after 1.5 is released; it will have the same release cycle
>> as Spark in the future.
>>
>> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
>>  wrote:
>>> +1 (once again :) )
>>>
>>> 2015-07-28 14:51 GMT+02:00 Justin Uang :

 // ping

 do we have any signoff from the pyspark devs to submit a PR to publish to
 PyPI?

 On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman 
 wrote:
>
> Hey all, great discussion, just wanted to +1 that I see a lot of value in
> steps that make it easier to use PySpark as an ordinary python library.
>
> You might want to check out findspark (https://github.com/minrk/findspark),
> started by Jupyter project devs, which offers one way to facilitate this.
> I've also cced them here to join the conversation.
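
For reference, a minimal findspark sketch (assuming SPARK_HOME is set or a
path is passed to init(); the master and app name below are just placeholders):

import findspark
findspark.init()  # prepends the pyspark and py4j paths to sys.path

from pyspark import SparkContext
sc = SparkContext(master="local[2]", appName="findspark-demo")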
>
> Also, @Jey, I can also confirm that at least in some scenarios (I’ve done
> it in an EC2 cluster in standalone mode) it’s possible to run PySpark jobs
> just using `from pyspark import SparkContext; sc = 
> SparkContext(master="X")`
> so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are
> set correctly on *both* workers and driver. That said, there’s definitely
> additional configuration / functionality that would require going through
> the proper submit scripts.
>
> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal 
> wrote:
>
> I agree with everything Justin just said. An additional advantage of
> publishing PySpark's Python code in a standards-compliant way is the fact
> that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
> way that pip can use. Contrast this with the current situation, where
> df.toPandas() exists in the Spark API but doesn't actually work until you
> install Pandas.
>
> Punya
> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang 
> wrote:
>>
>> // + Davies for his comments
>> // + Punya for SA
>>
>> For development and CI, like Olivier mentioned, I think it would be
>> hugely beneficial to publish pyspark (only code in the python/ dir) on 
>> PyPI.
>> If anyone wants to develop against PySpark APIs, they need to download 
>> the
>> distribution and do a lot of PYTHONPATH munging for all the tools 
>> (pylint,
>> pytest, IDE code completion). Right now that involves adding python/ and
>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
>> dependencies, we would have to manually mirror all the PYTHONPATH 
>> munging in
>> the ./pyspark script. With a proper pyspark setup.py which declares its
>> dependencies, and a published distribution, depending on pyspark will 
>> just
>> be adding pyspark to my setup.py dependencies.
>>
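For illustration, a minimal setup.py sketch along these lines (the version
pin, dependency list, and extras are assumptions for the sake of the example,
not an actual packaging proposal):

# setup.py (illustrative sketch only)
from setuptools import setup, find_packages

setup(
    name="pyspark",
    version="1.5.0",              # would track the Spark release exactly
    packages=find_packages(),
    install_requires=[
        "py4j==0.8.2.1",          # JVM bridge; the pin mirrors the bundled zip
    ],
    extras_require={
        "pandas": ["pandas"],     # only needed for df.toPandas()
    },
)
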
>> Of course, if we actually want to run parts of pyspark that are backed by
>> Py4J calls, then we need the full spark distribution with either 
>> ./pyspark
>> or ./spark-submit, but for things like linting and development, the
>> PYTHONPATH munging is very annoying.
>>
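For concreteness, a minimal sketch of the driver-side path munging described
above (SPARK_HOME must point at an unpacked distribution; the py4j zip name is
taken from the message above and may differ per release):

import os
import sys

spark_home = os.environ["SPARK_HOME"]
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "lib",
                                "py4j-0.8.2.1-src.zip"))

from pyspark import SparkContext  # importable once the paths above are set
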
>> I don't think the version-mismatch issues are a compelling reason to not
>> go ahead with PyPI publishing. At runtime, we should definitely enforce 
>> that
>> the version has to be exact, which means there is no backcompat 
>> nightmare as
>> suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267.
>> This would mean that even if the user's pip-installed pyspark somehow got 
>> loaded before the pyspark shipped with the Spark distribution, the user 
>> would be alerted immediately.
>>
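A rough sketch of the strict runtime check suggested here (the pinned version
string is hypothetical; sc.version reports the version of the underlying Spark
distribution):

from pyspark import SparkContext

EXPECTED_SPARK_VERSION = "1.5.0"   # what the pip-installed package was built for

sc = SparkContext(master="local[2]", appName="version-check")
if sc.version != EXPECTED_SPARK_VERSION:
    raise RuntimeError(
        "pip-installed pyspark expects Spark %s but found %s"
        % (EXPECTED_SPARK_VERSION, sc.version))
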
>> Davies, if you buy this, should I or someone on my team pick up
>> https://issues.apache.org/jira/browse/SPARK-1267 and
>> https://github.com/apache/spark/pull/464?
>>
>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>>  wrote:
>>>
>>> Ok, I get it. Now what can we do to improve the current situation?
>>> Because right now, if I want to set up a CI env for PySpark, I have to:
>>> 1- download a pre-built version of pyspark and unzip it somewhere on
>>> every agent
>>> 2- define the SPARK_HOME env variable
>>> 3- symlink this distribution's pyspark dir into the Python install's
>>> site-packages/ directory
>>> and if I rely on additional packages (like databricks' Spark-CSV
>>> project), I have to (unless I'm mistaken)
>>> 4- compile/assemble spark-csv and deploy the jar to a specific directory
>>> on every agent
>>> 5- add this jar-filled directory to the Spark distribution's additional
>

Pushing Spark to 10Gb/s

2015-08-10 Thread Starch, Michael D (398M)
All,

I am trying to get data moving in and out of Spark at 10 Gb/s. I currently have 
a very powerful cluster to work on, offering 40 Gb/s InfiniBand links, so I 
believe the network pipe should be fast enough.

Has anyone gotten Spark operating at high data rates before? Any advice would 
be appreciated.

-Michael Starch


Re: PySpark on PyPi

2015-08-10 Thread Matt Goodman
I would tentatively suggest conda packaging as well.

http://conda.pydata.org/docs/

--Matthew Goodman

=
Check Out My Website: http://craneium.net
Find me on LinkedIn: http://tinyurl.com/d6wlch

On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu  wrote:

> I think so, any contributions on this are welcome.
>
> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger 
> wrote:
> > Sorry, trying to follow the context here. Does it look like there is
> > support for the idea of creating a setup.py file and pypi package for
> > pyspark?
> >
> > Cheers,
> >
> > Brian

Re: Should spark-ec2 get its own repo?

2015-08-10 Thread Jeremy Freeman
Hi all, definitely a +1 to this plan.

I also wanted to share this library for Spark + GCE by a collaborator of mine, 
Michael Broxton, which seems to expand and improve on the earlier one Nick 
pointed us to. It's pip-installable and not yet on spark-packages, but I'm sure 
he'd be game to add it.

https://github.com/broxtronix/spark-gce 


> On Aug 3, 2015, at 1:25 PM, Shivaram Venkataraman 
>  wrote:
> 
> I sent a note to the Mesos developers and created
> https://github.com/apache/spark/pull/7899 to change the repository
> pointer. There are 3-4 open PRs right now in the mesos/spark-ec2
> repository and I'll work on migrating them to amplab/spark-ec2 later
> today.
> 
> My thought on moving the Python script is that we should have a
> wrapper shell script that just fetches the latest version of
> spark_ec2.py for the corresponding Spark branch. We already have
> separate branches in our spark-ec2 repository for different Spark
> versions, so it can just be a call to `wget
> https://github.com/amplab/spark-ec2/tree//driver/spark_ec2.py`.
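
For illustration, a rough Python equivalent of that wrapper idea (the raw-file
URL pattern and function name are assumptions, not the actual script):

import urllib2  # Python 2, matching PySpark of that era

def fetch_spark_ec2(branch, dest="spark_ec2.py"):
    # Download spark_ec2.py from the spark-ec2 branch matching the Spark version.
    url = ("https://raw.githubusercontent.com/amplab/spark-ec2/"
           "%s/driver/spark_ec2.py" % branch)
    with open(dest, "w") as out:
        out.write(urllib2.urlopen(url).read())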
> 
> Thanks
> Shivaram
> 
> On Sun, Aug 2, 2015 at 11:34 AM, Nicholas Chammas
>  wrote:
>> On Sat, Aug 1, 2015 at 1:09 PM Matt Goodman  wrote:
>>> 
>>> I am considering porting some of this to a more general spark-cloud
>>> launcher, including google/aliyun/rackspace.  It shouldn't be hard at all
>>> given the current approach for setup/install.
>> 
>> 
>> FWIW, there are already some tools for launching Spark clusters on GCE and
>> Azure:
>> 
>> http://spark-packages.org/?q=tags%3A%22Deployment%22
>> 
>> Nick
>> 
> 



Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-10 Thread Ted Yu
Yan / Bing:
Mind taking a look at HBASE-14181, 'Add Spark DataFrame
DataSource to HBase-Spark Module'?

Thanks

On Wed, Jul 22, 2015 at 4:53 PM, Bing Xiao (Bing) 
wrote:

> We are happy to announce the availability of the Spark SQL on HBase 1.0.0
> release.
> http://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
>
> The main features in this package, dubbed "Astro", include:
>
> · Systematic and powerful handling of data pruning and intelligent scan,
> based on the partial evaluation technique
>
> · HBase pushdown capabilities, such as custom filters and coprocessors, to
> support ultra-low-latency processing
>
> · SQL and DataFrame support
>
> · More SQL capabilities made possible (secondary index, bloom filter,
> primary key, bulk load, update)
>
> · Joins with data from other sources
>
> · Python/Java/Scala support
>
> · Support for the latest Spark 1.4.0 release
>
>
>
> The tests by the Huawei team and community contributors covered these areas:
> bulk load; projection pruning; partition pruning; partial evaluation; code
> generation; coprocessors; custom filtering; DML; complex filtering on keys
> and non-keys; join/union with non-HBase data; DataFrame; and multi-column
> family tests. We will post the test results, including performance tests, in
> the middle of August.
>
> You are very welcome to try out or deploy the package, and to help improve
> the integration tests with various combinations of the settings, extensive
> DataFrame tests, complex join/union tests, and extensive performance tests.
> Please use the "Issues" and "Pull Requests" links on the package homepage if
> you want to report bugs, improvements, or feature requests.
>
> Special thanks to project owner and technical leader Yan Zhou, the Huawei
> global team, community contributors, and Databricks. Databricks has been
> providing great assistance from the design through the release.
>
> "Astro", the Spark SQL on HBase package, will be useful for ultra-low-latency
> query and analytics of large-scale data sets in vertical enterprises. We will
> continue to work with the community to develop new features and improve the
> code base. Your comments and suggestions are greatly appreciated.
>
>
>
> Yan Zhou / Bing Xiao
>
> Huawei Big Data team
>
>
>


Re: Package Release Announcement: Spark SQL on HBase "Astro"

2015-08-10 Thread Yan Zhou.sc
Ted,

I'm in China now and seem to be having difficulty accessing the Apache JIRA. 
Anyway, it appears to me that HBASE-14181 attempts to support Spark DataFrames 
inside HBase. If true, one question for me is whether HBase is intended to have 
a built-in query engine or not, or whether it will stick with the current 
approach of being a k-v store with some built-in processing capabilities (in 
the form of coprocessors, custom filters, etc.) that allows loosely coupled 
query engines to be built on top of it.

Thanks,

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: August 11, 2015, 8:54
To: Bing Xiao (Bing)
Cc: dev@spark.apache.org; u...@spark.apache.org; Yan Zhou.sc
Subject: Re: Package Release Announcement: Spark SQL on HBase "Astro"

Yan / Bing:
Mind taking a look at HBASE-14181, 'Add Spark DataFrame DataSource to
HBase-Spark Module'?

Thanks
