Re: Flume integration

2016-11-21 Thread Ian Brooks
pplication. That is something we should be able to manage though. *-Ian * Hi Ian, Flume is great for ingesting data into HDFS and HBase. However, that is part of the batch layer. For real-time processing, I would go through Kafka into Spark Streaming. Except for your case, I have not established

Re: Flume integration

2016-11-21 Thread Ian Brooks
*-Ian* Hi While I am following this discussion with interest, I am trying to comprehend any architectural benefit of a Spark sink. Is there any feature in Flume that makes it more suitable for ingesting stream data than Spark Streaming, so that we should chain them? For example, does it help

Re: Flume integration

2016-11-20 Thread Ian Brooks
tream = FlumeUtils.createPollingStream(ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2(), 100, 10); After setting this, the data is correctly marked as processed by the Spark receiver and the Flume sink is notified. -Ian > Hi Ian, > > Has this been resolved? > > How about data to Flume and
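
A minimal sketch of the pull-based (polling) receiver configured in the snippet above — the Flume host, port, and batch interval here are assumptions, not values from the thread:

    import java.net.InetSocketAddress
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("FlumePolling"), Seconds(5))
    // Hypothetical address of the Flume agent running the Spark sink
    val addresses = Seq(new InetSocketAddress("flume-host", 9988))

    // Pull-based receiver: Spark polls the sink, which buffers events until the
    // receiver acknowledges them; maxBatchSize = 100, parallelism = 10 as in the snippet
    val stream = FlumeUtils.createPollingStream(
      ssc, addresses, StorageLevel.MEMORY_AND_DISK_SER_2, 100, 10)

    stream.map(e => new String(e.event.getBody.array())).print()
    ssc.start()
    ssc.awaitTermination()

The transactional handshake between the polling receiver and the sink is what lets Flume mark a batch as processed, which is the behaviour described above.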

Re: PySpark: preference for Python 2.7 or Python 3.5?

2016-09-02 Thread Ian Stokes Rees
arcel pretty well. It is more a question of whether CDH and Spark "work" better with PySpark on Python 2.7 or Python 3.5. My sense was "you choose: both are fine", but I wanted to ask here before committing to going down one path or another. Thanks, Ian

PySpark: preference for Python 2.7 or Python 3.5?

2016-09-01 Thread Ian Stokes Rees
ld proceed with PySpark on top of Python 2.7 or 3.5. Opinions? Does Cloudera have an official (or unofficial) position on this? Thanks, Ian _______ Ian Stokes-Rees Computational Scientist Continuum Analytics <http://continuum.io> @ijstokes Twitter <http://t

Re: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-07-20 Thread Ian O'Connell
Ravi, did your issue ever get solved? I think I've been hitting the same thing: it looks like the spark.sql.autoBroadcastJoinThreshold stuff isn't kicking in as expected; if I set that to -1, the computation proceeds successfully. On Tue, Jun 14, 2016 at 12:28 AM, Ravi Aggarwal wrote
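
A sketch of the workaround described above — disabling automatic broadcast joins for a session (the app name is a placeholder):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("JoinWithoutBroadcast")
      // -1 disables automatic broadcast joins; the default threshold is 10 MB
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()

This stops the planner from trying to broadcast a table it wrongly estimates will fit in memory, at the cost of falling back to shuffle joins everywhere.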

Flume integration

2016-07-13 Thread Ian Brooks
g to let the Flume server know the batch has been received and processed? *Ian Brooks*

Re: JDBC Cluster

2016-05-30 Thread Ian
Normally, when you start the master, the slaves should also be started automatically. This, however, presupposes that you've configured the slaves. In the $SPARK_HOME/conf directory there should be a slaves or slaves.template file. If it only contains localhost, then you have not set up any worker
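
A minimal sketch of that setup, with hypothetical hostnames:

    # $SPARK_HOME/conf/slaves — one worker host per line
    worker1.example.com
    worker2.example.com

    # then start the master plus all listed workers:
    $SPARK_HOME/sbin/start-all.sh

start-all.sh reaches each listed host over SSH, so passwordless SSH from the master to the workers needs to be configured.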

Re: List of questions about spark

2016-05-30 Thread Ian
No, the limit is given by your setup. If you use Spark on a YARN cluster, then the number of concurrent jobs is really limited to the resources allocated to each job and how the YARN queues are set up. For instance, if you use the FIFO scheduler (default), then it can be the case that the first job

Re: Problem instantiation of HiveContext

2016-05-26 Thread Ian
The exception indicates that Spark cannot invoke the method it's trying to call, which is probably caused by a library missing. Do you have a Hive configuration (hive-site.xml) or similar in your $SPARK_HOME/conf folder? -- View this message in context: http://apache-spark-user-list.1001560.n3.

Re: save RDD of Avro GenericRecord as parquet throws UnsupportedOperationException

2016-05-26 Thread Ian
Have you tried saveAsNewAPIHadoopFile? See: http://stackoverflow.com/questions/29238126/how-to-save-a-spark-rdd-to-an-avro-file -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/save-RDD-of-Avro-GenericRecord-as-parquet-throws-UnsupportedOperationException-tp
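
A sketch of that approach, assuming parquet-avro is on the classpath and that rdd is an RDD[GenericRecord] with schema its Avro Schema (the output path is a placeholder):

    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.mapreduce.Job
    import org.apache.parquet.avro.AvroParquetOutputFormat

    val job = Job.getInstance(sc.hadoopConfiguration)
    AvroParquetOutputFormat.setSchema(job, schema)

    // saveAsNewAPIHadoopFile needs a pair RDD; Parquet ignores the Void key
    rdd.map(r => (null: Void, r))
      .saveAsNewAPIHadoopFile(
        "hdfs:///out/parquet",
        classOf[Void],
        classOf[GenericRecord],
        classOf[AvroParquetOutputFormat[GenericRecord]],
        job.getConfiguration)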

Re: List of questions about spark

2016-05-26 Thread Ian
I'll attempt to answer a few of your questions: There are no limitations with regard to the number of dimension or lookup tables for Spark. As long as you have disk space, you should have no problem. Obviously, if you do joins among dozens or hundreds of tables it may take a while since it's unlik

Re: How to set the degree of parallelism in Spark SQL?

2016-05-26 Thread Ian
The number of executors is set when you launch the shell or an application with /spark-submit/. It's controlled by the /num-executors/ parameter: https://databaseline.wordpress.com/2016/03/12/an-overview-of-apache-streaming-technologies/. Note also that cranking up the number may not cause
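
A sketch of launching with an explicit executor count (all values are placeholders):

    spark-submit \
      --master yarn \
      --num-executors 10 \
      --executor-cores 4 \
      --executor-memory 4g \
      --class com.example.MyApp my-app.jar

Note that --num-executors only applies on YARN; on a standalone cluster the executor count follows from the total cores and the cores per executor.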

Checkpointing in Spark without Streaming

2015-08-31 Thread Ian Wood
particular, with GraphX code)? Any insight into these problems would be very appreciated. Thanks, Ian

Re: Providing query dsl to Elasticsearch for Spark (2.1.0.Beta3)

2014-12-18 Thread Ian Wilkinson
Quick follow-up: this works sweetly with spark-1.1.1-bin-hadoop2.4. > On Dec 3, 2014, at 3:31 PM, Ian Wilkinson wrote: > > Hi, > > I'm trying the Elasticsearch support for Spark (2.1.0.Beta3). > > In the following I provide the query (as query dsl): > >

Providing query dsl to Elasticsearch for Spark (2.1.0.Beta3)

2014-12-03 Thread Ian Wilkinson
ArgumentException: Cannot open stream for resource "{ "query": { ... } } Is the query dsl supported with esRDD, or am I missing something more fundamental? Huge thanks, ian - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark and Stanford CoreNLP

2014-11-24 Thread Ian O'Connell
object MyCoreNLP { @transient lazy val coreNLP = new coreNLP() } and then refer to it from your map/reduce/mapPartitions code and it should be fine (presuming it's thread safe); it will only be initialized once per classloader per JVM On Mon, Nov 24, 2014 at 7:58 AM, Evan Sparks wrote: > We ha
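
A self-contained variant of that pattern — the formatter below is a stand-in for any heavyweight, non-serializable resource such as a CoreNLP pipeline:

    import java.time.LocalDate
    import java.time.format.DateTimeFormatter

    object Expensive {
      // One instance per executor JVM; never shipped with the closure,
      // because object members are not serialized
      @transient lazy val fmt: DateTimeFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd")
    }

    val days = sc.parallelize(Seq("2014-11-24", "2014-11-25"))
      .map(s => LocalDate.parse(s, Expensive.fmt).getDayOfYear)

Because the lazy val is resolved on the executor, each JVM pays the construction cost once rather than once per task.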

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
Algebird 0.8.0 has 2.11 support if you want to run in a 2.11 env. On Thu, Oct 30, 2014 at 10:08 AM, Buntu Dev wrote: > Thanks.. I was using Scala 2.11.1 and was able to > use algebird-core_2.10-0.1.11.jar with spark-shell. > > On Thu, Oct 30, 2014 at 8:22 AM, Ian O'Connell

Re: Algebird using spark-shell

2014-10-30 Thread Ian O'Connell
What's the error with the 2.10 version of algebird? On Thu, Oct 30, 2014 at 12:49 AM, thadude wrote: > I've tried: > > . /bin/spark-shell --jars algebird-core_2.10-0.8.1.jar > > scala> import com.twitter.algebird._ > import com.twitter.algebird._ > > scala> import HyperLogLog._ > import HyperLog

Unsubscribe

2014-10-27 Thread Ian Ferreira
unsubscribe

Re: Kryo UnsupportedOperationException

2014-09-25 Thread Ian O'Connell
I would guess the field serializer is having issues being able to reconstruct the class again; it's pretty much best-effort. Is this an intermediate type? On Thu, Sep 25, 2014 at 2:12 PM, Sandy Ryza wrote: > We're running into an error (below) when trying to read spilled shuffle > data back in.

Re: Querying a parquet file in s3 with an ec2 install

2014-09-08 Thread Ian O'Connell
Mmm, how many days' worth of data / how deep is your data nesting? I suspect you're running into a current issue with Parquet (a fix is in master but I don't believe it's released yet). It reads all the metadata to the submitter node as part of scheduling the job. This can cause long start times (timeouts t

Re: DynamoDB input source

2014-07-21 Thread Ian Wilkinson
...>") jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") var users = sc.hadoopRDD(jobConf, classOf[DynamoDBInputFormat], classOf[Text], classOf[DynamoDBItemWritable]) users.count() This is raising an npe for FileSplit (as be

Problem running Spark shell (1.0.0) on EMR

2014-07-16 Thread Ian Wilkinson
> val logs = sc.textFile("s3n://.../") this produces: 14/07/16 12:40:35 WARN storage.BlockManager: Putting block broadcast_0 failed java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode; Any help mighty welcome, ian

Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
situation… ian On 4 Jul 2014, at 16:58, Nick Pentreath wrote: > I should qualify by saying there is boto support for dynamodb - but not for > the inputFormat. You could roll your own python-based connection but this > involves figuring out how to split the data in dynamo - inputFor

Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Excellent. Let me get browsing on this. Huge thanks, ian On 4 Jul 2014, at 16:47, Nick Pentreath wrote: > No boto support for that. > > In master there is Python support for loading Hadoop inputFormat. Not sure if > it will be in 1.0.1 or 1.1 > > I master docs under the

Re: DynamoDB input source

2014-07-04 Thread Ian Wilkinson
Hi Nick, I’m going to be working with python primarily. Are you aware of comparable boto support? ian On 4 Jul 2014, at 16:32, Nick Pentreath wrote: > You should be able to use DynamoDBInputFormat (I think this should be part of > AWS libraries for Java) and create a HadoopRDD fro

DynamoDB input source

2014-07-04 Thread Ian Wilkinson
thanks, ian

Re: GroupByKey results in OOM - Any other alternative

2014-06-15 Thread Ian O'Connell
Depending on your requirements, when doing hourly metrics that calculate distinct cardinality, a much more scalable method would be to use a HyperLogLog data structure. A Scala impl people have used with Spark would be https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/
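
A sketch of that idea with Algebird's HyperLogLog — approximate distinct users per hour, merging small fixed-size sketches instead of grouping raw values (the sample data and 12-bit precision are assumptions):

    import com.twitter.algebird.HyperLogLogMonoid

    val hll = new HyperLogLogMonoid(12)  // 12 bits of precision, roughly 1-2% error

    val events = sc.parallelize(Seq((0, "u1"), (0, "u2"), (1, "u1")))  // (hour, userId)

    // reduceByKey merges sketches, never raw values, so it scales where groupByKey OOMs
    val hourlyDistinct = events
      .mapValues(id => hll.create(id.getBytes("UTF-8")))
      .reduceByKey(_ + _)
      .mapValues(_.estimatedSize.toLong)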

Re: RDD with a Map

2014-06-03 Thread Ian O'Connell
So if your data can be kept in memory on the driver node, then you don't really need Spark? If you want to use it for Hadoop reading, then I'd immediately call collect after you open it, and then you can do normal Scala collections operations. On Tue, Jun 3, 2014 at 2:56 PM, Amit Kumar wrote: > Hi
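
A minimal sketch of that suggestion — read with Spark, collect to the driver, and continue with ordinary Scala collections (the path is a placeholder):

    // Only sensible when the data genuinely fits in driver memory
    val lookup = sc.textFile("hdfs:///data/lookup.csv")
      .map(_.split(","))
      .map(cols => cols(0) -> cols(1))
      .collect()           // pull everything to the driver
      .toMap

    lookup.get("someKey")  // plain Scala Map operations from here on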

Is Hadoop MR now comparable with Spark?

2014-06-02 Thread Ian Ferreira
http://hortonworks.com/blog/ddm/#.U4yn3gJgfts.twitter

RE: Announcing Spark 1.0.0

2014-05-30 Thread Ian Ferreira
Congrats Sent from my Windows Phone From: Dean Wampler Sent: 5/30/2014 6:53 AM To: user@spark.apache.org Subject: Re: Announcing Spark 1.0.0 Congratulations!! On Fri, May 30, 2014 at 5:12 AM, Patrick

controlling the time in spark-streaming

2014-05-22 Thread Ian Holsman
s of achieving this? I would assume that controlling the windows RDD buckets would be a common use case. TIA Ian -- Ian Holsman i...@holsman.com.au PH: + 61-3-9028 8133 / +1-(425) 998-7083

Re: I want to filter a stream by a subclass.

2014-05-21 Thread Ian Holsman
be > > records.flatMap(_ match { > case i: Int => > Some(i) > case _ => > None > }) > -- Ian Holsman i...@holsman.com.au PH: + 61-3-9028 8133 / +1-(425) 998-7083
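
A self-contained version of that pattern, with a hypothetical Event hierarchy:

    sealed trait Event
    case class Click(url: String) extends Event
    case class View(url: String) extends Event

    val events = sc.parallelize(Seq[Event](Click("/home"), View("/about"), Click("/buy")))

    // flatMap plus a pattern match keeps only the subclass, and the result
    // is statically typed as RDD[Click] rather than RDD[Event]
    val clicks = events.flatMap {
      case c: Click => Some(c)
      case _        => None
    }

The same closure works on a DStream, which was the original streaming question.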

I want to filter a stream by a subclass.

2014-05-21 Thread Ian Holsman
matters. TIA Ian -- Ian Holsman i...@holsman.com.au PH: + 61-3-9028 8133 / +1-(425) 998-7083

Re: Debugging Spark AWS S3

2014-05-16 Thread Ian Ferreira
Did you check the executor stderr logs? On 5/16/14, 2:37 PM, "Robert James" wrote: >I have Spark code which runs beautifully when MASTER=local. When I >run it with MASTER set to a spark ec2 cluster, the workers seem to >run, but the results, which are supposed to be put to AWS S3, don't >appear

Real world

2014-05-15 Thread Ian Ferreira
Folks, I keep getting questioned on real world experience of Spark as in mission critical production deployments. Does anyone have some war stories to share or know of resources to review? Cheers - Ian

Re: Easy one

2014-05-07 Thread Ian Ferreira
xport SPARK_WORKER_MEMORY=4g On Tue, May 6, 2014 at 5:29 PM, Ian Ferreira wrote: > Hi there, > > Why can't I seem to kick the executor memory higher? See below from EC2 > deployment using m1.large > > > And in the spark-env.sh > export SPARK_MEM=6154m > > > And in th

Easy one

2014-05-06 Thread Ian Ferreira
Hi there, Why can't I seem to kick the executor memory higher? See below from EC2 deployment using m1.large And in the spark-env.sh export SPARK_MEM=6154m And in the spark context sconf.setExecutorEnv("spark.executor.memory", "4g") Cheers - Ian
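
For contrast, a sketch of the setting that actually controls executor heap — spark.executor.memory is a Spark property, not an executor environment variable, so setExecutorEnv as used above has no effect on it:

    import org.apache.spark.{SparkConf, SparkContext}

    val sconf = new SparkConf()
      .setAppName("MyApp")                 // hypothetical app name
      .set("spark.executor.memory", "4g")  // read when executors are launched
    val sc = new SparkContext(sconf)

On a standalone cluster the worker must also have enough memory to grant (SPARK_WORKER_MEMORY), which is the fix suggested in the reply above.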

Re: Spark and Java 8

2014-05-06 Thread Ian O'Connell
I think the distinction there might be they never said they ran that code under CDH5, just that spark supports it and spark runs under CDH5. Not that you can use these features while running under CDH5. They could use mesos or the standalone scheduler to run them On Tue, May 6, 2014 at 6:16 AM,

Getting the following error using EC2 deployment

2014-05-01 Thread Ian Ferreira
I have a custom app that was compiled with scala 2.10.3 which I believe is what the latest spark-ec2 script installs. However running it on the master yields this cryptic error which according to the web implies incompatible jar versions. Exception in thread "main" java.lang.NoClassDefFoundError:

Setting the Scala version in the EC2 script?

2014-05-01 Thread Ian Ferreira
Is this possible? It is very annoying to have such a great script but still have to manually update stuff afterwards.

Re: Can't be built on MAC

2014-05-01 Thread Ian Ferreira
Hi Zhige, I had the same issue and reverted to using JDK 1.7.0_55 From: Zhige Xin Reply-To: Date: Thursday, May 1, 2014 at 12:32 PM To: Subject: Can't be built on MAC Hi dear all, When I tried to build Spark 0.9.1 on my Mac OS X 10.9.2 with Java 8, I found the following errors: [error] err

Re: is it okay to reuse objects across RDD's?

2014-04-28 Thread Ian O'Connell
A mutable map in an object should do what you're looking for then, I believe. You just reference the object as an object in your closure, so it won't be swept up when your closure is serialized, and you can reference variables of the object on the remote host then. e.g.: object MyObject { val mmap =
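
A self-contained sketch of that suggestion — each executor JVM gets its own copy of the map, and it is never serialized into the closure:

    import scala.collection.mutable

    object MyObject {
      // One map per executor JVM, not a cluster-wide map
      val mmap: mutable.Map[String, Int] = mutable.Map.empty
    }

    val words = sc.parallelize(Seq("a", "b", "a"))
    words.foreach { w =>
      // Tasks in the same JVM share the map, so synchronize mutations
      MyObject.mmap.synchronized {
        MyObject.mmap(w) = MyObject.mmap.getOrElse(w, 0) + 1
      }
    }

The caveat: updates stay local to each executor and are not visible back on the driver.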

Running parallel jobs in the same driver with Futures?

2014-04-28 Thread Ian Ferreira
I recall asking about this, and I think Matei suggested it was, but is the scheduler thread safe? I am running MLlib libraries as futures in the same driver using the same dataset as input and this error 14/04/28 08:29:48 ERROR TaskSchedulerImpl: Exception in statusUpdate java.util.concurrent.Reje
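
For reference, a sketch of submitting concurrent jobs from one driver with Futures — the scheduler is thread-safe, so actions may be submitted from separate threads (the RDD and operations are placeholders):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    val data = sc.parallelize(1 to 1000000)

    // Each action becomes an independent job that Spark can schedule concurrently
    val f1 = Future { data.map(_ * 2).count() }
    val f2 = Future { data.filter(_ % 2 == 0).count() }

    val Seq(doubled, evens) = Await.result(Future.sequence(Seq(f1, f2)), Duration.Inf)

A RejectedExecutionException like the one in the snippet can appear if the context is stopped while jobs are still in flight, so stop the context only after all futures complete.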

Failed to run count?

2014-04-23 Thread Ian Ferreira
I am getting this cryptic error running LinearRegressionWithSGD Data sample LabeledPoint(39.0, [144.0, 1521.0, 20736.0, 59319.0, 2985984.0]) 14/04/23 15:15:34 INFO SparkContext: Starting job: first at GeneralizedLinearAlgorithm.scala:121 14/04/23 15:15:34 INFO DAGScheduler: Got job 2 (first at G

Adding to an RDD

2014-04-21 Thread Ian Ferreira
, 1 4 1 8, 2 1 4 2, 3 1 6 2, 4 1 8 2) Cheers - Ian

Combining RDD's columns

2014-04-18 Thread Ian Ferreira
a collection of these RDD's to create a "multi-column" RDD rddA = {Names, Age} rddB = {Names, Star Sign} I saw that rdd.union() merges rows, but anything that can combine columns? Cheers - Ian
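
A sketch of combining keyed "columns" with a join rather than a union (hypothetical data):

    // (name, age) and (name, starSign): union would stack rows; join aligns columns
    val rddA = sc.parallelize(Seq(("alice", 30), ("bob", 25)))
    val rddB = sc.parallelize(Seq(("alice", "Leo"), ("bob", "Aries")))

    val combined = rddA.join(rddB)  // RDD[(String, (Int, String))]
    combined.collect().foreach(println)
    // (alice,(30,Leo))
    // (bob,(25,Aries))

join requires pair RDDs keyed by the shared column, which matches the Names column in the question.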

Re: reduceByKey issue in example wordcount (scala)

2014-04-18 Thread Ian Bonnycastle
he ClassNotFoundException error. So, in the future, always "sbt/sbt package" before doing an "sbt/sbt run". Thank you for all your help, Marcelo. Ian On Mon, Apr 14, 2014 at 2:59 PM, Ian Bonnycastle wrote: > Hi Marcelo, > > Changing it to null didn't make any di

RE: Multi-tenant?

2014-04-15 Thread Ian Ferreira
ook at http://spark.apache.org/docs/latest/job-scheduling.html, which includes scheduling concurrent jobs within the same driver. Matei On Apr 15, 2014, at 4:08 PM, Ian Ferreira wrote: > What is the support for multi-tenancy in Spark. > > I assume more than one driver can share the same clus

Multi-tenant?

2014-04-15 Thread Ian Ferreira
What is the support for multi-tenancy in Spark? I assume more than one driver can share the same cluster, but can a driver run two jobs in parallel?

Re: Scala vs Python performance differences

2014-04-15 Thread Ian Ferreira
This would be super useful. Thanks. On 4/15/14, 1:30 AM, "Jeremy Freeman" wrote: >Hi Andrew, > >I'm putting together some benchmarks for PySpark vs Scala. I'm focusing on >ML algorithms, as I'm particularly curious about the relative performance >of >MLlib in Scala vs the Python MLlib API vs pur

Pyspark with Cython

2014-04-14 Thread Ian Ferreira
Has anyone used Cython closures with Spark? We have a large investment in Python code that we don't want to port to Scala. Curious about any performance issues with the interop between the Scala engine and the Cython closures. I believe it is sockets on the driver and pipe on the executors?

Re: Spark resilience

2014-04-14 Thread Ian Ferreira
t does not affect currently-running jobs. Workers can fail and will simply cause jobs to lose their current Executors. New Workers can be added at any point. On Mon, Apr 14, 2014 at 11:00 AM, Ian Ferreira wrote: > Folks, > > I was wondering what the failure support modes where for Spark

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Ian Bonnycastle
ting to all the nodes properly. But why reduceByKey is the only method affected is beyond me. Ian On Mon, Apr 14, 2014 at 2:45 PM, Marcelo Vanzin wrote: > Hi Ian, > > On Mon, Apr 14, 2014 at 11:30 AM, Ian Bonnycastle > wrote: > > val sc = new SparkContext("

Re: reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Ian Bonnycastle
s trouble finding the code that is itself. And why only with the reduceByKey function is it occurring? I have no problems with any other code running except for that. (BTW, I don't use in my code above... I just removed it for security purposes.) Thanks, Ian On Mon, Apr 14, 2014 at 12:45 PM

reduceByKey issue in example wordcount (scala)

2014-04-14 Thread Ian Bonnycastle
g to figure out what I'm missing. The section of code I'm trying to get to work is: val JCountRes = logData.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) "logData" is just an RDD pointing to a large (2gb) file in HDFS. Thanks, Ian
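
For completeness, a runnable version of that word count as a standalone app (the input path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        val logData = sc.textFile("hdfs:///data/large-file.txt")

        val counts = logData.flatMap(_.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        sc.stop()
      }
    }

As the follow-up in this thread notes, the ClassNotFoundException was fixed by running "sbt/sbt package" before "sbt/sbt run".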

Spark resilience

2014-04-14 Thread Ian Ferreira
Zookeeper quorum so that isolates the slaves from a master failure, but what about the masters behind quorum? Cheers - Ian

Re: Spark - ready for prime time?

2014-04-10 Thread Ian Ferreira
Do you have the link to the Cloudera comment? Sent from Windows Mail From: Dean Wampler Sent: Thursday, April 10, 2014 7:39 AM To: Spark Users Cc: Daniel Darabos, Andras Barjak Spark has been endorsed by Cloudera as the successor to MapReduce. That says a lot... On

Re: Avro serialization

2014-04-03 Thread Ian O'Connell
Objects being transformed need to be one of these in flight. Source data can just use the MapReduce input formats, so anything you can do with mapred. Doing an Avro one for this, you probably want one of: https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantb

Re: Error when run Spark on mesos

2014-04-02 Thread Ian Ferreira
I think this is related to a known issue (regression) in 0.9.0. Try using an explicit IP other than loopback. Sent from a mobile device > On Apr 2, 2014, at 8:53 PM, "panfei" wrote: > > any advice ? > > > 2014-04-03 11:35 GMT+08:00 felix : >> I deployed Mesos and tested it using the example/tes

Protobuf 2.5 Mesos

2014-04-01 Thread Ian Ferreira
From what I can tell, I need to use Mesos 0.17 to support protobuf 2.5, which is required for Hadoop 2.3.0. However I still run into the JVM error, which appears to be related to protobuf compatibility. Any recommendations?

Mllib in pyspark for 0.8.1

2014-04-01 Thread Ian Ferreira
Hi there, For some reason the distribution and build for 0.8.1 does not include the MLlib libraries for pyspark, i.e. import from mllib fails. Seems to be addressed in 0.9.0, but that has other issues running on Mesos in standalone mode :) Any pointers? Cheers - Ian

Re: java.lang.NullPointerException met when computing new RDD or use .count

2014-03-17 Thread Ian O'Connell
I'm guessing the other result was wrong, or just never evaluated here. The RDD transforms being lazy may have let it be expressed, but it wouldn't work. Nested RDD's are not supported. On Mon, Mar 17, 2014 at 4:01 PM, anny9699 wrote: > Hi Andrew, > > Thanks for the reply. However I did almost t