Re: Serialization issue when using HBase with Spark

2014-12-15 Thread Shixiong Zhu
Just pointing out a bug in your code: you should not use `mapPartitions` like that. For details, I recommend the section "setup() and cleanup()" in Sean Owen's post: http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/ Best Regards, Shixiong Zhu 2014-12-14 16:35 GMT+08
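A minimal sketch of that "setup() and cleanup()" pattern, assuming a hypothetical non-serializable Resource class: the resource is created once per partition on the executor instead of being captured in the driver-side closure.

import org.apache.spark.{SparkConf, SparkContext}

object MapPartitionsSketch {
  // Stand-in for a non-serializable resource such as an HBase connection.
  class Resource { def lookup(s: String): String = s.reverse; def close(): Unit = () }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[2]"))
    val result = sc.parallelize(Seq("a", "b", "c")).mapPartitions { iter =>
      val res = new Resource()                  // per-partition setup, runs on the executor
      val mapped = iter.map(res.lookup).toList  // materialize before cleaning up
      res.close()                               // per-partition cleanup
      mapped.iterator
    }
    result.collect().foreach(println)
    sc.stop()
  }
}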

Spark inserting into parquet files with different schema

2014-12-15 Thread AdamPD
Hi all, I understand that parquet allows for schema versioning automatically in the format; however, I'm not sure whether Spark supports this. I'm saving a SchemaRDD to a parquet file, registering it as a table, then doing an insertInto with a SchemaRDD with an extra column. The second SchemaRDD

RDD vs Broadcast

2014-12-15 Thread elitejyo
We are developing a Spark framework in which we are moving historical data into RDD sets. Basically, an RDD is an immutable, read-only dataset on which we do operations. Based on that we have moved historical data into RDDs and we do computations like filtering/mapping, etc. on such RDDs. Now there is a use

Re: Adding a column to a SchemaRDD

2014-12-15 Thread Yanbo Liang
Hi Nathan, #1 Spark SQL & DSL can satisfy your requirement. You can refer to the following code snippet: jdata.select(Star(Node), 'seven.getField("mod"), 'eleven.getField("mod")) You need to import org.apache.spark.sql.catalyst.analysis.Star in advance. #2 After you make the transform above, you

Re: Does filter on an RDD scan every data item ?

2014-12-15 Thread nsareen
Thanks! shall try it out. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-filter-on-an-RDD-scan-every-data-item-tp20170p20683.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Tomoya Igarashi
Hi all, I am trying to run a Spark job on Playframework + Spark Master/Worker on one Mac. When the job ran, I encountered java.lang.ClassNotFoundException. Could you tell me how to solve it? Here is my code on Github. https://github.com/TomoyaIgarashi/spark_cluster_sample * Environments: Mac 10.9.5 J

Why my SQL UDF cannot be registered?

2014-12-15 Thread Xuelin Cao
Hi, I tried to create a function to convert a Unix timestamp to the hour number in a day. It works if the code is like this: sqlContext.registerFunction("toHour", (x:Long)=>{new java.util.Date(x*1000).getHours}) But if I do it like this, it doesn't work: def toHour

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Aniket Bhatnagar
Try the workaround (addClassPathJars(sparkContext, this.getClass.getClassLoader) discussed in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3CCAJOb8buD1B6tUtOfG8_Ok7F95C3=r-zqgffoqsqbjdxd427...@mail.gmail.com%3E Thanks, Aniket On Mon Dec 15 2014 at 07:43:24 Tomoya Igarashi < to

Re: Serialization issue when using HBase with Spark

2014-12-15 Thread Aniket Bhatnagar
"The reason not using sc.newAPIHadoopRDD is it only support one scan each time." I am not sure is that's true. You can use multiple scans as following: val scanStrings = scans.map(scan => convertScanToString(scan)) conf.setStrings(MultiTableInputFormat.SCANS, scanStrings : _*) where convertScanT

Re: Spark with HBase

2014-12-15 Thread Aniket Bhatnagar
In case you are still looking for help, there have been multiple discussions on this mailing list that you can try searching for. Or you can simply use https://github.com/unicredit/hbase-rdd :-) Thanks, Aniket On Wed Dec 03 2014 at 16:11:47 Ted Yu wrote: > Which hbase release are you running ? >

HiveQL support in Cassandra-Spark connector

2014-12-15 Thread shahab
Hi, I just wonder whether the Cassandra-Spark connector supports executing HiveQL on Cassandra tables. best, /Shahab

Re: JSON Input files

2014-12-15 Thread Madabhattula Rajesh Kumar
Hi Helena and All, I have found one example of reading a multi-line JSON file into an RDD using https://github.com/alexholmes/json-mapreduce. val data = sc.newAPIHadoopFile( filepath, classOf[MultiLineJsonInputFormat], classOf[LongWritable], classOf[Text], conf ).map(p => (p._1.get,

Re: ...FileNotFoundException: Path is not a file: - error on accessing HDFS with sc.wholeTextFiles

2014-12-15 Thread Karen Murphy
Thanks Akhil, In line with your suggestion I have used the following 2 commands to flatten the directory structure: find . -type f -iname '*' -exec mv '{}' . \; find . -type d -exec rm -rf '{}' \; Kind Regards Karen On 12/12/14 13:25, Akhil Das wrote: I'm not quite sure whether spark wil

Re: SchemaRDD partition on specific column values?

2014-12-15 Thread Nitin Goyal
Hi Michael, I have opened following JIRA for the same :- https://issues.apache.org/jira/browse/SPARK-4849 I am having a look at the code to see what can be done and then we can have a discussion over the approach. Let me know if you have any comments/suggestions. Thanks -Nitin On Sun, Dec 14,

Migrating Parquet inputs

2014-12-15 Thread Marius Soutier
Hi, is there an easy way to “migrate” parquet files or indicate optional values in sql statements? I added a couple of new fields that I also use in a schemaRDD.sql() which obviously fails for input files that don’t have the new fields. Thanks - Marius ---

RE: Why my SQL UDF cannot be registered?

2014-12-15 Thread Cheng, Hao
As the error log shows, you may need to register it as: sqlContext.registerFunction(“toHour”, toHour _) The “_” means you are passing the function as a parameter, not invoking it. Cheng Hao From: Xuelin Cao [mailto:xuelin...@yahoo.com.INVALID] Sent: Monday, December 15, 2014 5:28 PM To: User Subje
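A short sketch of the corrected registration, assuming a Spark 1.1-era SQLContext and a hypothetical events table with a ts column:

def toHour(x: Long): Int = new java.util.Date(x * 1000).getHours

// "toHour _" passes the function value itself; plain "toHour" would try to invoke it.
sqlContext.registerFunction("toHour", toHour _)
sqlContext.sql("SELECT toHour(ts) FROM events").collect().foreach(println)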

Re: JSON Input files

2014-12-15 Thread Peter Vandenabeele
On Sat, Dec 13, 2014 at 5:43 PM, Helena Edelson wrote: > One solution can be found here: > https://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets > > As far as I understand, the people.json file is not really a proper json file, but a file documented as: "... JSON files w

Re: java.lang.IllegalStateException: unread block data

2014-12-15 Thread Akhil
When you say restored, does it mean the internal/public IPs remained unchanged, or did you change them accordingly? (I'm assuming you are using a cloud service like AWS, GCE or Azure.) What is the serializer that you are using? Try to set the following before creating the sparkContext; it might help with

Re: stage failure: java.lang.IllegalStateException: unread block data

2014-12-15 Thread Akhil
Try to run the spark-shell in standalone mode (MASTER=spark://yourmasterurl:7077 $SPARK_HOME/bin/spark-shell), and do a small count ( val d = sc.parallelize(1 to 1000).count() ). If that is failing, then something is wrong with your cluster setup, as it's saying Connection refused: node001/10.180.49.2

is there a way to interact with Spark clusters remotely?

2014-12-15 Thread Xiaoyong Zhu
Hi experts, I am wondering if there is a way to interact with Spark remotely, i.e. no direct access to the cluster required, but submitting Python/Scala scripts to the cluster and getting results via (REST) APIs. That would facilitate the development process a lot. Xiaoyong

IBM open-sources Spark Kernel

2014-12-15 Thread lbustelo
This was posted on the Dev list, but it is very relevant to the user list as well… -- We are happy to announce a developer preview of the Spark Kernel which enables remote applications to dynamically interact

Re: JSON Input files

2014-12-15 Thread Madabhattula Rajesh Kumar
Hi Peter, Thank you for the clarification. So we need to store each JSON object on one line. Is there any limit on the length of a JSON object, so that the JSON object will not go onto the next line? What will happen if the JSON object is a big/huge one? Will it be stored on a single line in HDFS? What will h

Re: Adding a column to a SchemaRDD

2014-12-15 Thread Nathan Kronenfeld
Perfect, that's exactly what I was looking for. Thank you! On Mon, Dec 15, 2014 at 3:32 AM, Yanbo Liang wrote: > > Hi Nathan, > > #1 > > Spark SQL & DSL can satisfy your requirement. You can refer the following > code snippet: > > jdata.select(Star(Node), 'seven.getField("mod"), 'eleven.getField

Re: Pagerank implementation

2014-12-15 Thread kmurph
Hiya, I too am looking for a PageRank solution in GraphX where the probabilities sum to 1. I tried a few modifications, including division by the total number of vertices in the first part of the equation, as well as trying to return full rank instead of delta (though not correctly as evident fro

integrating long-running Spark jobs with Thriftserver

2014-12-15 Thread Tim Schweichler
Hi everybody, I apologize if the answer to my question is obvious but I haven't been able to find a straightforward solution anywhere on the internet. I have a number of Spark jobs written using the python API that do things like read in data from Amazon S3 to a main table in the Hive metastore

Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Albert Manyà
Hi all. I want to serialize and later load a model trained using mllib's ALS. I've tried using Java serialization with something like: val model = ALS.trainImplicit(training, rank, numIter, lambda, 1) val fos = new FileOutputStream("model.bin") val oos = new ObjectOutputStream(f

Re: NullPointerException When Reading Avro Sequence Files

2014-12-15 Thread Simone Franzini
To me this looks like an internal error in the REPL. I am not sure what is causing it. Personally I never use the REPL; can you try typing up your program and running it from an IDE or spark-submit and see if you still get the same error? Simone Franzini, PhD http://www.linkedin.com/in/simonefr

Re: spark kafka batch integration

2014-12-15 Thread Cody Koeninger
For an alternative take on a similar idea, see https://github.com/koeninger/spark-1/tree/kafkaRdd/external/kafka/src/main/scala/org/apache/spark/rdd/kafka An advantage of the approach I'm taking is that the lower and upper offsets of the RDD are known in advance, so it's deterministic. I haven't

Intermittent test failures

2014-12-15 Thread Marius Soutier
Hi, I’m seeing strange, random errors when running unit tests for my Spark jobs. In this particular case I’m using Spark SQL to read and write Parquet files, and one error that I keep running into is this one: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage

Re: spark kafka batch integration

2014-12-15 Thread Koert Kuipers
thanks! i will take a look at your code. didn't realize there was already something out there. good point about upper offsets, i will add that feature to our version as well if you don't mind. i was thinking about making it deterministic for task failure transparently (even if no upper offsets are

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Sean Owen
This class is not going to be serializable, as it contains huge RDDs. Even if the right constructor existed the RDDs inside would not serialize. On Mon, Dec 15, 2014 at 4:33 PM, Albert Manyà wrote: > Hi all. > > I'm willing to serialize and later load a model trained using mllib's > ALS. > > I've

Re: is there a way to interact with Spark clusters remotely?

2014-12-15 Thread Akhil Das
Hi Xiaoyong, You could refer to this post if you are looking for how to run spark jobs remotely: http://apache-spark-user-list.1001560.n3.nabble.com/Submitting-Spark-job-on-Unix-cluster-from-dev-environment-Windows-td16989.html You will of course require network access to the cluster. Thanks Best Reg

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Albert Manyà
In that case, what is the strategy to train a model in some background batch process and make recommendations for some other service in real time? Run both processes in the same spark cluster? Thanks. -- Albert Manyà alber...@eml.cc On Mon, Dec 15, 2014, at 05:58 PM, Sean Owen wrote: > This

Re: is there a way to interact with Spark clusters remotely?

2014-12-15 Thread François Le Lay
Have you seen the recent announcement around Spark Kernel using IPython/0MQ protocol ? https://github.com/ibm-et/spark-kernel On Mon, Dec 15, 2014 at 12:06 PM, Akhil Das wrote: > > Hi Xiaoyong, > > You could refer this post if you are looking on how to run spark jobs > remotely > http://apache-

Re: MLLIB model export: PMML vs MLLIB serialization

2014-12-15 Thread sourabh
Thanks Vincenzo. Are you trying out all the models implemented in mllib? Actually I don't see decision tree there. Sorry if I missed it. When are you planning to merge this to spark branch? Thanks Sourabh On Sun, Dec 14, 2014 at 5:54 PM, selvinsource [via Apache Spark User List] < ml-node+s100156

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread sourabh chaki
Hi Albert, There is some discussion going on here: http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-model-export-PMML-vs-MLLIB-serialization-tc20324.html#a20674 I am also looking for this solution. But it looks like until mllib pmml export is ready, there is no foolproof solution to export th

Re: Serialize mllib's MatrixFactorizationModel

2014-12-15 Thread Sean Owen
The thing about MatrixFactorizationModel, compared to other models, is that it is huge. It's not just a few coefficients, but whole RDDs of coefficients. I think you could save these RDDs of user/product factors to persistent storage, load them, then recreate the MatrixFactorizationModel that way.
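A hedged sketch of that approach with hypothetical paths; recreating the model assumes the MatrixFactorizationModel constructor is accessible in your Spark version, which, as noted earlier in the thread, it may not be.

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel

// Save the factor RDDs rather than the model object itself.
model.userFeatures.saveAsObjectFile("hdfs:///models/als/userFeatures")
model.productFeatures.saveAsObjectFile("hdfs:///models/als/productFeatures")

// Later, in the serving job, load them back and rebuild the model.
val userFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/userFeatures")
val productFeatures = sc.objectFile[(Int, Array[Double])]("hdfs:///models/als/productFeatures")
// The rank must be saved or known separately; it is not stored in the factor files.
val restored = new MatrixFactorizationModel(rank, userFeatures, productFeatures)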

Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI

2014-12-15 Thread shenghua
Hello, I met a problem when using Spark sql CLI. A custom UDTF with lateral view throws ClassNotFound exception. I did a couple of experiments in same environment (spark version 1.1.1): select + same custom UDTF (Passed) select + lateral view + custom UDTF (ClassNotFoundException) select + latera

Re: custom spark app name in yarn-cluster mode

2014-12-15 Thread Tomer Benyamini
Thanks Sandy, passing --name works fine :) Tomer On Fri, Dec 12, 2014 at 9:35 AM, Sandy Ryza wrote: > > Hi Tomer, > > In yarn-cluster mode, the application has already been submitted to YARN > by the time the SparkContext is created, so it's too late to set the app > name there. I believe givin

Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
Hi spark users, Do you know how to access the rows of a row? I have a SchemaRDD called user and registered it as a table with the following schema: root |-- user_id: string (nullable = true) |-- item: array (nullable = true) ||-- element: struct (containsNull = false) |||-- item_id: stri

Re: JSON Input files

2014-12-15 Thread Michael Armbrust
Underneath the covers, jsonFile uses TextInputFormat, which will split files correctly based on new lines. Thus, there is no fixed maximum size for a json object (other than the fact that it must fit into memory on the executors). On Mon, Dec 15, 2014 at 7:22 AM, Madabhattula Rajesh Kumar < mraja
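A small sketch of what that implies for the input layout, with a hypothetical path and fields: each JSON object must sit on exactly one line, however long that line is.

// events.json, one complete object per line (no pretty-printing):
//   {"user_id":"u1","item":[{"item_id":"i1"},{"item_id":"i2"}]}
//   {"user_id":"u2","item":[]}
val events = sqlContext.jsonFile("hdfs:///data/events.json")
events.registerTempTable("events")
sqlContext.sql("SELECT user_id FROM events").collect().foreach(println)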

Re: Custom UDTF with Lateral View throws ClassNotFound exception in Spark SQL CLI

2014-12-15 Thread Michael Armbrust
Can you add this information to the JIRA? On Mon, Dec 15, 2014 at 10:54 AM, shenghua wrote: > > Hello, > I met a problem when using Spark sql CLI. A custom UDTF with lateral view > throws ClassNotFound exception. I did a couple of experiments in same > environment (spark version 1.1.1): > select

Re: Accessing rows of a row in Spark

2014-12-15 Thread Mark Hamstra
scala> val items = Row(1 -> "orange", 2 -> "apple") items: org.apache.spark.sql.catalyst.expressions.Row = [(1,orange),(2,apple)] If you literally want an iterator, then this: scala> items.toIterator.count { case (user_id, name) => user_id == 1 } res0: Int = 1 ...else: scala> items.count

Re: Intermittent test failures

2014-12-15 Thread Michael Armbrust
Is it possible that you are starting more than one SparkContext in a single JVM without stopping previous ones? I'd try testing with Spark 1.2, which will throw an exception in this case. On Mon, Dec 15, 2014 at 8:48 AM, Marius Soutier wrote: > > Hi, > > I’m seeing strange, random errors when r

Re: Spark metrics for ganglia

2014-12-15 Thread danilopds
Thanks tsingfu, I used this configuration based in your post: (with ganglia unicast mode) # Enable GangliaSink for all instances *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink *.sink.ganglia.host=10.0.0.7 *.sink.ganglia.port=8649 *.sink.ganglia.period=15 *.sink.ganglia.unit=seco

Re: printing mllib.linalg.vector

2014-12-15 Thread Xiangrui Meng
you can use the default toString method to get the string representation. if you want to customize it, check the indices/values fields. -Xiangrui On Fri, Dec 5, 2014 at 7:32 AM, debbie wrote: > Basic question: > > What is the best way to loop through one of these and print their > components? Conve
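A tiny sketch of both options on a SparseVector (the values are made up):

import org.apache.spark.mllib.linalg.{SparseVector, Vectors}

val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))
println(v)  // default toString: (5,[1,3],[2.0,4.0])

// Or walk the indices/values fields for custom formatting.
val sv = v.asInstanceOf[SparseVector]
sv.indices.zip(sv.values).foreach { case (i, x) => println(s"component $i = $x") }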

Re: Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
Hi Mark, Thank you for helping out. The items I got back from Spark SQL has the type information as follows: scala> items res16: org.apache.spark.sql.Row = [WrappedArray([1,orange],[2,apple])] I tried to iterate the items as you suggested but no luck. Best Regards, Jerry On Mon, Dec 15, 201

Re: MLlib(Logistic Regression) + Spark Streaming.

2014-12-15 Thread Xiangrui Meng
If you want to train offline and predict online, you can use the current LR implementation to train a model and then apply model.predict on the dstream. -Xiangrui On Sun, Dec 7, 2014 at 6:30 PM, Nasir Khan wrote: > I am new to spark. > Lets say i want to develop a machine learning model. which tr
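A hedged sketch of that split, leaving the training RDD and the input DStream abstract since they depend on the surrounding pipeline:

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

def predictOnStream(training: RDD[LabeledPoint], stream: DStream[Vector]): DStream[Double] = {
  // Train once, offline, on historical labeled data.
  val model = LogisticRegressionWithSGD.train(training, 100)
  // Apply the trained (and serializable) model to each incoming feature vector.
  stream.map(v => model.predict(v))
}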

Re: MLLIb: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-15 Thread Xiangrui Meng
Is it possible that after filtering the feature dimension changed? This may happen if you use LIBSVM format but didn't specify the number of features. -Xiangrui On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak wrote: > Hi All, > > > I was able to run LinearRegressionwithSGD for a largeer dataset (> 2

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Xiangrui Meng
Please check the number of partitions after sc.textFile. Use sc.textFile('...', 8) to have at least 8 partitions. -Xiangrui On Tue, Dec 9, 2014 at 4:58 AM, DB Tsai wrote: > You just need to use the latest master code without any configuration > to get performance improvement from my PR. > > Since

Re: Stack overflow Error while executing spark SQL

2014-12-15 Thread Xiangrui Meng
Could you post the full stacktrace? It seems to be some recursive call in parsing. -Xiangrui On Tue, Dec 9, 2014 at 7:44 PM, wrote: > Hi > > > > I am getting Stack overflow Error > > Exception in main java.lang.stackoverflowerror > > scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.sc

Re: Accessing rows of a row in Spark

2014-12-15 Thread Mark Hamstra
Looks like you've got one more layer of containment than you intend -- i.e. you've got Row[WrappedArray[Row[(Int, String)]] where you want Row[Row[(Int, String)]]. That's easy to do if somewhere along the line you did something like `val row = Row(collection)` instead of `val row = Row.fromSeq(col
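A two-line sketch of that difference, using the 1.1-era catalyst Row:

import org.apache.spark.sql.catalyst.expressions.Row

val collection = Seq(1 -> "orange", 2 -> "apple")
val nested = Row(collection)         // one field that holds the whole Seq (the extra layer)
val flat   = Row.fromSeq(collection) // one field per element: [(1,orange),(2,apple)]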

Re: what is the best way to implement mini batches?

2014-12-15 Thread Imran Rashid
I'm a little confused by some of the responses. It seems like there are two different issues being discussed here: 1. How to turn a sequential algorithm into something that works on spark. Eg deal with the fact that data is split into partitions which are processed in parallel (though within a p

Re: Building Desktop application for ALS-MlLib/ Training ALS

2014-12-15 Thread Xiangrui Meng
On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal wrote: > > > Hi, > > > > I am a new bee in spark and scala world > > > > I have been trying to implement Collaborative filtering using MlLib supplied > out of the box with Spark and Scala > > > > I have 2 problems > > > > 1. The best model was

Re: ALS failure with size > Integer.MAX_VALUE

2014-12-15 Thread Xiangrui Meng
Unfortunately, it will depend on the Sorter API in 1.2. -Xiangrui On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar wrote: > Hi Xiangrui, > > The block size limit was encountered even with reduced number of item blocks > as you had expected. I'm wondering if I could try the new implementation

Re: Building Desktop application for ALS-MlLib/ Training ALS

2014-12-15 Thread Abhi Basu
In case you must write c# code, you can call python code from c# or use IronPython. :) On Mon, Dec 15, 2014 at 12:04 PM, Xiangrui Meng wrote: > > On Sun, Dec 14, 2014 at 3:06 AM, Saurabh Agrawal > wrote: > > > > > > Hi, > > > > > > > > I am a new bee in spark and scala world > > > > > > > > I ha

Re: java.lang.IllegalStateException: unread block data

2014-12-15 Thread Morbious
"Restored" ment reboot slave node with unchanged IP. "Funny" thing is that for small files spark works fine. I checked hadoop with hdfs also and I'm able to run wordcount on it without any problems (i.e. file about 50GB size). -- View this message in context: http://apache-spark-user-list.10015

Re: ERROR YarnClientClusterScheduler: Lost executor Akka client disassociated

2014-12-15 Thread DB Tsai
Hi Muhammad, Maybe next time you can use http://pastebin.com/ to format and paste a cleaner scala code snippet so others can help you more easily. Also, please only paste the significant portion of the stack trace which causes the issue instead of giant logs. First of all, in your log, it seems that you

Re: Including data nucleus tools

2014-12-15 Thread DB Tsai
Just out of curiosity, did you manually apply this patch and see if it actually resolves the issue? It seems that it was merged at some point, but reverted because it caused some stability issues. Sincerely, DB Tsai --- My Blog: https:

Re: Intermittent test failures

2014-12-15 Thread Marius Soutier
Possible, yes, although I’m trying everything I can to prevent it, i.e. fork in Test := true and isolated. Can you confirm that reusing a single SparkContext for multiple tests poses a problem as well? Other than that, just switching from SQLContext to HiveContext also provoked the error. On

Re: MLLIB model export: PMML vs MLLIB serialization

2014-12-15 Thread selvinsource
I am going to try to export decision tree next, so far I focused on linear models and k-means. Regards, Vincenzo sourabh wrote > Thanks Vincenzo. > Are you trying out all the models implemented in mllib? Actually I don't > see decision tree there. Sorry if I missed it. When are you planning t

Re: pyspark is crashing in this case. why?

2014-12-15 Thread Sameer Farooqui
Adding group back. FYI Geneis - this was on a m3.xlarge with all default settings in Spark. I used Spark version 1.3.0. The 2nd case did work for me: >>> a = [1,2,3,4,5,6,7,8,9] >>> b = [] >>> for x in range(100): ... b.append(a) ... >>> rdd1 = sc.parallelize(b) >>> rdd1.first() 14/12/15

Stop streaming context gracefully when SIGTERM is passed

2014-12-15 Thread Budde, Adam
Hi all, We are using Spark Streaming to ETL a large volume of time-series datasets. In our current design, each dataset we ETL will have a corresponding Spark Streaming context + process running on our cluster. Each of these processes will be passed configuration options specifying the data source

Re: Intermittent test failures

2014-12-15 Thread Michael Armbrust
Using a single SparkContext should not cause this problem. In the SQL tests we use TestSQLContext and TestHive which are global singletons for all of our unit testing. On Mon, Dec 15, 2014 at 1:27 PM, Marius Soutier wrote: > > Possible, yes, although I’m trying everything I can to prevent it, i.

Re: Intermittent test failures

2014-12-15 Thread Marius Soutier
Ok, maybe these test versions will help me then. I’ll check it out. On 15.12.2014, at 22:33, Michael Armbrust wrote: > Using a single SparkContext should not cause this problem. In the SQL tests > we use TestSQLContext and TestHive which are global singletons for all of our > unit testing. >

NumberFormatException

2014-12-15 Thread yu
Hello, everyone. I know 'NumberFormatException' means a String cannot be parsed properly, but I really cannot find any mistakes in my code. I hope someone may kindly help me. My hdfs file is as follows: 8,22 3,11 40,10 49,47 48,29 24,28 50,30 33,56 4,20 30,38 ... So each line

Re: Why KMeans with mllib is so slow ?

2014-12-15 Thread Jaonary Rabarisoa
I've tried some additional experiments with kmeans and I finally got it worked as I expected. In fact, the number of partition is critical. I had a data set of 24x784 with 12 partitions. In this case the kmeans algorithm took a very long time (about hours to converge). When I change the partiti

MLLib: Saving and loading a model

2014-12-15 Thread Sameer Tilak
Hi All, Resending this: I am using LinearRegressionWithSGD and then I save the model weights and intercept. The file that contains the weights has this format: 1.204550.13560.000456.. The intercept is 0 since I am using train without setting the intercept, so it can be ignored for the moment. I would now like

Re: Stop streaming context gracefully when SIGTERM is passed

2014-12-15 Thread Soumitra Kumar
Hi Adam, I have following scala actor based code to do graceful shutdown: class TimerActor (val timeout : Long, val who : Actor) extends Actor { def act { reactWithin (timeout) { case TIMEOUT => who ! SHUTDOWN } } } class SSCReactor (val ssc : StreamingContext
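An alternative hedged sketch using a plain JVM shutdown hook, assuming your Spark version has the two-argument stop(stopSparkContext, stopGracefully) overload:

import org.apache.spark.streaming.StreamingContext

def installShutdownHook(ssc: StreamingContext): Unit =
  sys.addShutdownHook {
    // Runs when the JVM receives SIGTERM; lets in-flight batches finish before exiting.
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }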

Re: NumberFormatException

2014-12-15 Thread Sean Owen
That certainly looks surprising. Are you sure there are no unprintable characters in the file? On Mon, Dec 15, 2014 at 9:49 PM, yu wrote: > The exception info is: > 14/12/15 15:35:03 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 > (TID 0, h3): java.lang.NumberFormatException: For inpu

Re: Can spark job have sideeffects (write files to FileSystem)

2014-12-15 Thread Paweł Szulc
Yes, this is what I also found in Spark documentation, that foreach can have side effects. Nevertheless I have this weird error, that sometimes files are just empty. "using" is simply a wrapper that takes our code, makes try-catch-finally and flush & close all resources. I honestly have no clue w

Re: Can spark job have sideeffects (write files to FileSystem)

2014-12-15 Thread Davies Liu
Keep in mind that any task could be launched concurrently on different nodes, so in order to make sure the generated files are valid, you need some atomic operation (such as rename). For example, you could generate a random name for the output file, write the data into it, and rename it to the
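A hedged sketch of that write-then-rename pattern with the Hadoop FileSystem API; the output directory layout and the String record type are placeholders.

import java.util.UUID
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD

def writeWithRename(rdd: RDD[String], outDir: String): Unit =
  rdd.foreachPartition { records =>
    val fs = FileSystem.get(new Configuration())
    val tmp = new Path(s"$outDir/.tmp-${UUID.randomUUID()}")
    val out = fs.create(tmp)
    try records.foreach(r => out.write((r + "\n").getBytes("UTF-8")))
    finally out.close()
    // rename is atomic in HDFS, so readers never observe a partially written file
    fs.rename(tmp, new Path(s"$outDir/part-${UUID.randomUUID()}"))
  }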

Executor memory

2014-12-15 Thread Pala M Muthaia
Hi, Running Spark 1.0.1 on Yarn 2.5 When i specify --executor-memory 4g, the spark UI shows each executor as having only 2.3 GB, and similarly for 8g, only 4.6 GB. I am guessing that the executor memory corresponds to the container memory, and that the task JVM gets only a percentage of the cont

Re: what is the best way to implement mini batches?

2014-12-15 Thread Earthson Lu
Hi Imran, you are right. Sequential processing does not make sense for spark. I think sequential processing works if the batch for each iteration is large enough (this batch could be processed in parallel). My point is that we should not run mini-batches in parallel, but it is still possible to use lar

Re: NotSerializableException in Spark Streaming

2014-12-15 Thread Nicholas Chammas
This still seems to be broken. In 1.1.1, it errors immediately on this line (from the above repro script): liveTweets.map(t => noop(t)).print() The stack trace is: org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureClean

Re: NumberFormatException

2014-12-15 Thread Harihar Nahak
Hi Yu, Try this : val data = csv.map( line => line.split(",").map(elem => elem.trim)) //lines in rows data.map( rec => (rec(0).toInt, rec(1).toInt)) to convert into integer. On 16 December 2014 at 10:49, yu [via Apache Spark User List] < ml-node+s1001560n20694...@n3.nabble.com> wrote: > > He

Re: JSON Input files

2014-12-15 Thread Madabhattula Rajesh Kumar
Thank you Peter for the clarification. Regards, Rajesh On Tue, Dec 16, 2014 at 12:42 AM, Michael Armbrust wrote: > > Underneath the covers, jsonFile uses TextInputFormat, which will split > files correctly based on new lines. Thus, there is no fixed maximum size > for a json object (other than

Re: Executor memory

2014-12-15 Thread sandy . ryza
Hi Pala, Spark executors only reserve spark.storage.memoryFraction (default 0.6) of their spark.executor.memory for caching RDDs. The spark UI displays this fraction. spark.executor.memory controls the executor heap size. spark.yarn.executor.memoryOverhead controls the extra that's tacked on

Fetch Failed caused job failed.

2014-12-15 Thread Mars Max
While I was running a spark MR job, there was FetchFailed(BlockManagerId(47, xx.com, 40975, 0), shuffleId=2, mapId=5, reduceId=286), then there were many retries, and the job finally failed. The log showed the following error; has anybody met this error, or is it a known issue in Spa

Re: Executor memory

2014-12-15 Thread Sean Owen
I believe this corresponds to the 0.6 of the whole heap that is allocated for caching partitions. See spark.storage.memoryFraction on http://spark.apache.org/docs/latest/configuration.html 0.6 of 4GB is about 2.3GB. The note there is important, that you probably don't want to exceed the JVM old ge
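Put differently: a 4g executor ends up with a bit under 4 GB of usable heap, and 0.6 of that is the roughly 2.3 GB the UI reports. A small sketch of the knobs involved, with values that are the era's defaults or assumptions:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")               // executor heap size
  .set("spark.storage.memoryFraction", "0.6")       // share of heap reserved for cached RDDs (what the UI shows)
  .set("spark.yarn.executor.memoryOverhead", "384") // extra MB tacked onto the YARN container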

Re: ALS failure with size > Integer.MAX_VALUE

2014-12-15 Thread Bharath Ravi Kumar
Ok. We'll try using it in a test cluster running 1.2. On 16-Dec-2014 1:36 am, "Xiangrui Meng" wrote: Unfortunately, it will depends on the Sorter API in 1.2. -Xiangrui On Mon, Dec 15, 2014 at 11:48 AM, Bharath Ravi Kumar wrote: > Hi Xiangrui, > > The block size limit was encountered even with r

Fwd: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Tomoya Igarashi
Hi Aniket, Thanks for your reply. I followed your advice and modified my code. Here is the latest version. https://github.com/TomoyaIgarashi/spark_cluster_sample/commit/ce7613c42d3adbe6ae44e264c11f3829460f3c35 As a result, it works correctly! Thank you very much. But an "AssociationError" message appears l

Can I set max execution time for any task in a job?

2014-12-15 Thread Mohamed Lrhazi
Is that possible? If not, how would one do it from PySpark? This probably does not make sense in most cases, but I am writing a script where my job involves downloading and pushing data into cassandra. Sometimes a task hangs forever, and I don't really mind killing it. The job is not actually comp

RE: is there a way to interact with Spark clusters remotely?

2014-12-15 Thread Xiaoyong Zhu
Thanks all for your information! What Pietro mentioned seems to be the appropriate solution. I also found a slide deck talking about it. Several quick questions: 1. Is it already available in the Spark main branch? (seem

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Aniket Bhatnagar
Seems you are using standalone mode. Can you check spark worker logs or application logs in spark work directory to find any errors? On Tue, Dec 16, 2014, 9:09 AM Tomoya Igarashi < tomoya.igarashi.0...@gmail.com> wrote: > Hi Aniket, > Thanks for your reply. > > I followed your advice to modified

Re: Can I set max execution time for any task in a job?

2014-12-15 Thread Akhil Das
There is a spark listener interface which can be used to trigger events like jobStarted, TaskGotResults etc, but I don't think you can set an execution time anywhere. If a task is hung, it's mostly becaus

Re: Fetch Failed caused job failed.

2014-12-15 Thread Akhil Das
You could try setting the following while creating the sparkContext .set("spark.rdd.compress","true") .set("spark.storage.memoryFraction","1") .set("spark.core.connection.ack.wait.timeout","600") .set("spark.akka.frameSize","50") Thanks Best Regards On Tue, Dec 16, 2014 at 8:30 AM, Mars

Re: NumberFormatException

2014-12-15 Thread Akhil Das
There could be some other character like a space or ^M etc. You could try the following and see the actual row: val newstream = datastream.map(row => { try{ val strArray = row.trim().split(",") (strArray(0).toInt, strArray(1).toInt) //Instead try this //*(strArray(0).trim(

Re: Spark Streaming Python APIs?

2014-12-15 Thread smallmonkey...@hotmail.com
Hi Zhu: maybe there is no Python API for spark-streaming yet. baishuo smallmonkey...@hotmail.com From: Xiaoyong Zhu Date: 2014-12-15 10:52 To: user@spark.apache.org Subject: Spark Streaming Python APIs? Hi spark experts Are there any Python APIs for Spark Streaming? I didn’t find the Python APIs

Re: Run Spark job on Playframework + Spark Master/Worker in one Mac

2014-12-15 Thread Tomoya Igarashi
Thanks for the response. Yes, I am using standalone mode. I couldn't find any errors, but "WARN" messages appear in the Spark master logs. Here are the Spark master logs. https://gist.github.com/TomoyaIgarashi/72145c11d3769c7d1ddb FYI here are the Spark worker logs. https://gist.github.com/TomoyaIgarashi/0db77e9

Accessing Apache Spark from Java

2014-12-15 Thread Jai
Hi, I have installed a Spark setup in standalone mode on a Linux server and I am trying to access that spark setup from Java on Windows. When I try connecting to Spark I see the following exception: 14/12/16 12:52:52 WARN TaskSchedulerImpl: Initial job has not accepted any resources; ch