Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Josh Rosen
The updated prototype listed in https://github.com/databricks/spark-pr-dashboard/pull/59 is now running live on spark-prs as part of its PR comment update task. On Fri, Aug 14, 2015 at 10:51 AM, Josh Rosen wrote: > I think that I'm still going to want some custom code to remove the build > start

Jenkins having issues?

2015-08-14 Thread Cheolsoo Park
Hi devs, Jenkins failed twice in my PR with an unknown error: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40930/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/40931/console Can you help? Thank you! Cheolsoo

Re: Reliance on java.math.BigInteger implementation

2015-08-14 Thread Reynold Xin
I pinged the IBM team to submit a patch that would work on the IBM JVM. On Fri, Aug 14, 2015 at 11:27 AM, Pete Robbins wrote: > ref: https://issues.apache.org/jira/browse/SPARK-9370 > > The code to handle BigInteger types in > > org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters.java > > an

SPARK-10000 + now

2015-08-14 Thread Reynold Xin
Five months ago we reached 10,000 commits on GitHub. Today we reached 10,000 JIRA tickets. https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20created%3E%3D-1w%20ORDER%20BY%20created%20DESC Hopefully the extra character we have to type doesn't bring our productivity down much.

Re: Fwd: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-14 Thread Reynold Xin
Is it possible that you have only upgraded some set of nodes but not the others? We have run some performance benchmarks on this, so it definitely runs in some configuration. It could still be buggy in some other configurations, though. On Fri, Aug 14, 2015 at 6:37 AM, mkhaitman wrote: > Has anyone

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Jörn Franke
Well, what do you do in case of failure? I think one should use a professional ingestion tool that ideally does not need to reload everything in case of failure and verifies via checksums that the file has been transferred correctly. I am not sure if Flume supports FTP, but SSH/SCP should be supported

Re: Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Marcelo Vanzin
Why do you need to use Spark or Flume for this? You can just use curl and hdfs: curl ftp://blah | hdfs dfs -put - /blah On Fri, Aug 14, 2015 at 1:15 PM, Varadhan, Jawahar <varad...@yahoo.com.invalid> wrote: > What is the best way to bring such a huge file from a FTP server into > Hadoop to

Setting up Spark/flume/? to Ingest 10TB from FTP

2015-08-14 Thread Varadhan, Jawahar
What is the best way to bring such a huge file from an FTP server into Hadoop to persist in HDFS? Since a single JVM process might run out of memory, I was wondering if I can use Spark or Flume to do this. Any help on this matter is appreciated. I prefer an application/process running inside Hadoop

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Reynold Xin
This is already supported with the new partitioned data sources in DataFrame/SQL, right? On Fri, Aug 14, 2015 at 8:04 AM, Alex Angelini wrote: > Speaking about Shopify's deployment, this would be a really nice-to-have > feature. > > We would like to write data to folders with the structure > `//
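
A minimal sketch of the partitioned write being referred to; Spark 1.4+ DataFrameWriter supports partitionBy, though the column and path names here are made up for illustration:

    // Writes one directory per distinct value of the partition columns,
    // e.g. /data/events/country=CA/date=2015-08-14/part-r-00000...
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // assumes an existing SparkContext `sc`
    val df = sqlContext.read.json("/data/raw/events.json")
    df.write.partitionBy("country", "date").parquet("/data/events")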

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Nicholas Chammas
See: https://issues.apache.org/jira/browse/SPARK-3533 Feel free to comment there and make a case if you think the issue should be reopened. Nick On Fri, Aug 14, 2015 at 11:11 AM Abhishek R. Singh <abhis...@tetrationanalytics.com> wrote: > A workaround would be to have multiple passes on the RD

Reliance on java.math.BigInteger implementation

2015-08-14 Thread Pete Robbins
ref: https://issues.apache.org/jira/browse/SPARK-9370 The code to handle BigInteger types in org.apache.spark.sql.catalyst.expressions.UnsafeRowWriters.java and org.apache.spark.unsafe.Platform.java is dependent on the implementation of java.math.BigInteger, e.g.: try { signumOffs
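
A short sketch of the fragility in question: the unsafe writers locate OpenJDK-specific internal fields of java.math.BigInteger by name, and the lookup fails on JVMs (such as IBM's) whose BigInteger has a different layout. This illustrates the pattern, not the exact Spark code:

    import java.math.BigInteger

    // OpenJDK stores the sign in `signum` and the magnitude in `mag`;
    // on a JVM with different internals these lookups throw
    // NoSuchFieldException at class-initialization time.
    val signumField = classOf[BigInteger].getDeclaredField("signum")
    val magField    = classOf[BigInteger].getDeclaredField("mag")
    signumField.setAccessible(true)
    magField.setAccessible(true)

    val n = new BigInteger("123456789")
    println(signumField.getInt(n))                            // 1
    println(magField.get(n).asInstanceOf[Array[Int]].length)  // magnitude words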

Re: SparkR DataFrame fail to return data of Decimal type

2015-08-14 Thread Shkurenko, Alex
Created https://issues.apache.org/jira/browse/SPARK-9982, working on the PR On Fri, Aug 14, 2015 at 12:43 PM, Shivaram Venkataraman <shiva...@eecs.berkeley.edu> wrote: > Thanks for the catch. Could you send a PR with this diff? > > On Fri, Aug 14, 2015 at 10:30 AM, Shkurenko, Alex > wrote: > >

Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Josh Rosen
I think that I'm still going to want some custom code to remove the build start messages from SparkQA, and it's hardly any code, so I'm going to stick with the custom approach for now. The problem is that I don't want _any_ posts from AmplabJenkins, even if they're improved to be more informative, s
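
For reference, a rough sketch of the kind of cleanup call involved. The endpoint is GitHub's standard one for deleting issue comments; everything else (names, token handling) is illustrative, and spark-prs itself may do this differently:

    import java.net.{HttpURLConnection, URL}

    // DELETE /repos/:owner/:repo/issues/comments/:id removes one PR/issue
    // comment; GitHub returns 204 on success.
    def deleteComment(owner: String, repo: String, id: Long, token: String): Int = {
      val url  = new URL(s"https://api.github.com/repos/$owner/$repo/issues/comments/$id")
      val conn = url.openConnection().asInstanceOf[HttpURLConnection]
      conn.setRequestMethod("DELETE")
      conn.setRequestProperty("Authorization", s"token $token")
      conn.getResponseCode
    }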

Re: SparkR DataFrame fail to return data of Decimal type

2015-08-14 Thread Shivaram Venkataraman
Thanks for the catch. Could you send a PR with this diff? On Fri, Aug 14, 2015 at 10:30 AM, Shkurenko, Alex wrote: > Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, > but with the Decimal datatype coming from a Postgres DB: > > //Set up SparkR > >>Sys.setenv(SPARK_HOME=

SparkR DataFrame fail to return data of Decimal type

2015-08-14 Thread Shkurenko, Alex
Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but with the Decimal datatype coming from a Postgres DB: //Set up SparkR >Sys.setenv(SPARK_HOME="/Users/ashkurenko/work/git_repos/spark") >Sys.setenv(SPARKR_SUBMIT_ARGS="--driver-class-path ~/Downloads/postgresql-9.4-1201.j

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Abhishek R. Singh
A workaround would be to have multiple passes on the RDD, with each pass writing its own output? Or do it in a single pass with foreachPartition (open multiple files per partition to write out)? -Abhishek- On Aug 14, 2015, at 7:56 AM, Silas Davis wrote: > Would it be right to assume that the
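
A minimal sketch of the single-pass foreachPartition variant described above, assuming an RDD[(String, String)] and a local output directory (both are illustrative):

    import java.io.{File, PrintWriter}
    import scala.collection.mutable
    import org.apache.spark.TaskContext

    rdd.foreachPartition { iter =>
      val part    = TaskContext.get.partitionId // avoids collisions across partitions
      val writers = mutable.Map.empty[String, PrintWriter]
      try {
        iter.foreach { case (key, value) =>
          val w = writers.getOrElseUpdate(key, {
            val dir = new File(s"/tmp/out/$key")
            dir.mkdirs()
            new PrintWriter(new File(dir, s"part-$part"))
          })
          w.println(value)
        }
      } finally {
        writers.values.foreach(_.close())
      }
    }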

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Alex Angelini
Speaking about Shopify's deployment, this would be a really nice-to-have feature. We would like to write data to folders with the structure `//` but have had to hold off on that because of the lack of support for MultipleOutputs. On Fri, Aug 14, 2015 at 10:56 AM, Silas Davis wrote: > Would it b

Re: Writing to multiple outputs in Spark

2015-08-14 Thread Silas Davis
Would it be right to assume that the silence on this topic implies others don't really have this issue/desire? On Sat, 18 Jul 2015 at 17:24, Silas Davis wrote: > tl;dr: Hadoop and Cascading provide ways of writing tuples to multiple > output files based on key, but the plain RDD interface doesn
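
For context, the classic RDD-level workaround (separate from the DataFrame partitionBy route mentioned above) routes keys to file names through Hadoop's MultipleTextOutputFormat. A sketch, assuming an RDD[(String, String)]:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Send each record to a subdirectory named after its key, and emit
    // only the value (the key is dropped from the written record).
    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    rdd.saveAsHadoopFile("/out", classOf[String], classOf[String], classOf[KeyBasedOutput])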

Re: Fwd: [ANNOUNCE] Spark 1.5.0-preview package

2015-08-14 Thread mkhaitman
Has anyone had success using this preview? We were able to build the preview and start the spark-master; however, we were unable to connect any spark workers to it. We kept receiving "AkkaRpcEnv address in use" while attempting to connect the spark-worker to the master. Also confirmed that the work

Re: Automatically deleting pull request comments left by AmplabJenkins

2015-08-14 Thread Iulian Dragoș
On Fri, Aug 14, 2015 at 4:21 AM, Josh Rosen wrote: > Prototype is at https://github.com/databricks/spark-pr-dashboard/pull/59 > > On Wed, Aug 12, 2015 at 7:51 PM, Josh Rosen wrote: > >> *TL;DR*: would anyone object if I wrote a script to auto-delete pull >> request comments from AmplabJenkins? >

Re: avoid creating small objects

2015-08-14 Thread Reynold Xin
You can use mapPartitions to do that. On Friday, August 14, 2015, 周千昊 wrote: > I am thinking of creating a shared object outside the closure and using this > object to hold the byte array. > Will this work? > > 周千昊 wrote on Fri, Aug 14, 2015 at 4:02 PM: > >> Hi, >> All I want to do is that, >> 1. read
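
A minimal sketch of the mapPartitions pattern being suggested: allocate one reusable buffer per partition rather than a fresh byte array per record (types and sizes are illustrative; assumes an RDD[String]):

    // One 64 KB scratch buffer per partition, filled via System.arraycopy.
    // Note: emit derived results, not the shared buffer itself, since the
    // same array instance is reused for every record.
    val lengths = rdd.mapPartitions { iter =>
      val buffer = new Array[Byte](64 * 1024)
      iter.map { record =>
        val bytes = record.getBytes("UTF-8")
        val n     = math.min(bytes.length, buffer.length)
        System.arraycopy(bytes, 0, buffer, 0, n)
        // ... do the calculation against `buffer` here ...
        n
      }
    }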

Re: Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-14 Thread pishen tsai
Sorry for the previous line-breaking format; resending the mail. I have written an sbt plugin called spark-deployer, which is able to deploy a standalone spark cluster on AWS EC2 and submit jobs to it. https://github.com/pishen/spark-deployer Compared to the current spark-ec2 script, this design

Re: avoid creating small objects

2015-08-14 Thread 周千昊
I am thinking of creating a shared object outside the closure and using this object to hold the byte array. Will this work? 周千昊 wrote on Fri, Aug 14, 2015 at 4:02 PM: > Hi, > All I want to do is that, > 1. read from some source > 2. do some calculation to get some byte array > 3. write the byte arr

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Akhil Das
Thanks for the clarifications, Mridul. Thanks Best Regards On Fri, Aug 14, 2015 at 1:04 PM, Mridul Muralidharan wrote: > What I understood from Imran's mail (and what was referenced in his > mail), the RDD mentioned seems to be violating some basic contracts on > how partitions are used in spark

avoid creating small objects

2015-08-14 Thread 周千昊
Hi, All I want to do is that, 1. read from some source 2. do some calculation to get some byte array 3. write the byte array to HDFS In Hadoop, I can share an ImmutableByteWritable and do some System.arraycopy; it will prevent the application from creating a lot of small objects

Introduce a sbt plugin to deploy and submit jobs to a spark cluster on ec2

2015-08-14 Thread pishen tsai
Hello, I have written an sbt plugin called spark-deployer, which is able to deploy a standalone spark cluster on AWS EC2 and submit jobs to it. https://github.com/pishen/spark-deployer Compared to the current spark-ec2 script, this design may have several benefits (features): 1. All the code is writt

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-14 Thread Mridul Muralidharan
From what I understood of Imran's mail (and what was referenced in it), the RDD mentioned seems to be violating some basic contracts on how partitions are used in Spark [1]. They cannot be arbitrarily numbered, have duplicates, etc. Extending RDD to add functionality is typically for niche cases
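
A sketch of the contract being referred to: a custom RDD must expose partitions numbered exactly 0..n-1, each matching its position in the array (class and names below are illustrative):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class SimplePartition(override val index: Int) extends Partition

    // The scheduler relies on partitions(i).index == i, with no gaps
    // and no duplicate indices.
    class SimpleRDD(sc: SparkContext, n: Int) extends RDD[Int](sc, Nil) {
      override def getPartitions: Array[Partition] =
        Array.tabulate[Partition](n)(i => new SimplePartition(i))
      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator(split.index)
    }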