Re: Java 8 vs Scala

2015-07-16 Thread Marius Danciu
If you take the time to actually learn Scala starting from its fundamental concepts AND, quite importantly, get familiar with general functional programming concepts, you'd immediately realize the things that you'd really miss going back to Java (8). On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła wr

Nullpointer when saving as table with a timestamp column type

2015-07-16 Thread Brandon White
So I have a very simple dataframe that looks like df: [name: String, Place: String, time: timestamp]. I build this java.sql.Timestamp from a string and it works really well except when I call saveAsTable("tableName") on this df. Without the timestamp, it saves fine but with the timestamp, it th
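A minimal, self-contained sketch of the setup being described (the case class, table name and values are hypothetical; saveAsTable here assumes a HiveContext):

import java.sql.Timestamp
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Hypothetical case class mirroring the schema in the post.
case class Visit(name: String, place: String, time: Timestamp)

object TimestampSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("timestamp-sketch").setMaster("local[2]"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    // Build the java.sql.Timestamp from a string, as described in the post.
    val df = sc.parallelize(Seq(
      Visit("alice", "paris", Timestamp.valueOf("2015-07-16 12:00:00"))
    )).toDF()

    df.saveAsTable("visits_sketch") // the call that reportedly throws the NPE
  }
}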

what is : ParquetFileReader: reading summary file ?

2015-07-16 Thread shshann
Hi all, our scenario is to generate lots of folders containing parquet files and then use "add partition" to add these folder locations to a hive table; when trying to read the hive table using Spark, the following logs would show up and take a lot of time to read; but this won't happen afte

spark-streaming failed to bind ip address

2015-07-16 Thread ??
Hi all, I submit a spark-streaming job under yarn-cluster mode but an error happens: 15/07/16 14:36:36 ERROR receiver.ReceiverSupervisorImpl: Stopped executor with error: org.jboss.netty.channel.ChannelException: Failed to bind to: sparktest/192.168.1.17:3 15/07/16 14:36:36 ERROR executor.Ex

Re: Java 8 vs Scala

2015-07-16 Thread Wojciech Pituła
IMHO only Scala is an option. Once you're familiar with it you just can't even look at Java code. On Thu, 16.07.2015 at 07:20, spark user wrote: > I struggle a lot in Scala, almost 10 days no improvement, but when I > switch to Java 8, things are so smooth, and I used Data Frame with

Re: [Spark Shell] Could the spark shell be reset to the original status?

2015-07-16 Thread Terry Hole
Hi Ted, Thanks for the information. The post seems a little different from my requirement: suppose we defined different functions to do different streaming work (e.g. 50 functions), I want to test these 50 functions in the spark shell, and the shell will always throw OOM in the middle of the test (yes,

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread N B
Hi TD, Yes, we do have the invertible function provided. However, I am not sure I understood how to use the filterFunction. Is there an example somewhere showing its usage? The header comment on the function says : * @param filterFunc function to filter expired key-value pairs; *

Re: Retrieving Spark Configuration properties

2015-07-16 Thread Yanbo Liang
This is because you did not set the parameter "spark.sql.hive.metastore.version". You can check other parameters that you have set; those will work fine. Or you can first set this parameter and then get it. 2015-07-17 11:53 GMT+08:00 RajG : > I am using this version of Spark : *spark-1.4.0-bi
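A short sketch of both options Yanbo mentions (the version string is illustrative); SQLContext also has a getConf overload that takes a default value:

// Option 1: set the property first, then read it back.
sqlContext.setConf("spark.sql.hive.metastore.version", "0.13.1")
sqlContext.getConf("spark.sql.hive.metastore.version")            // returns "0.13.1"

// Option 2: supply a fallback when reading, so an unset key does not throw.
sqlContext.getConf("spark.sql.hive.metastore.version", "0.13.1")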

Re: [Spark Shell] Could the spark shell be reset to the original status?

2015-07-16 Thread Ted Yu
See this recent thread: http://search-hadoop.com/m/q3RTtFW7iMDkrj61/Spark+shell+oom+&subj=java+lang+OutOfMemoryError+PermGen+space > On Jul 16, 2015, at 8:51 PM, Terry Hole wrote: > > Hi, > > Background: The spark shell will get out of memory error after dealing lots > of spark work. > >

Retrieving Spark Configuration properties

2015-07-16 Thread RajG
I am using this version of Spark : *spark-1.4.0-bin-hadoop2.6* . I want to check few default properties. So I gave the following statement in spark-shell *scala> sqlContext.getConf("spark.sql.hive.metastore.version") *I was expecting the call to method getConf to return a value of 0.13.1 as desrib

[Spark Shell] Could the spark shell be reset to the original status?

2015-07-16 Thread Terry Hole
Hi, Background: The spark shell will get an out of memory error after doing lots of spark work. Is there any method which can reset the spark shell to the startup status? I tried "*:reset*", but it doesn't seem to work: I cannot create a spark context anymore (some compile error as below) after the "*

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
No problem:) Glad to hear that! On Thu, Jul 16, 2015 at 8:22 PM, Koert Kuipers wrote: > that solved it, thanks! > > On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers wrote: > >> thanks i will try 1.4.1 >> >> On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai wrote: >> >>> Hi Koert, >>> >>> For the classlo

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
that solved it, thanks! On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers wrote: > thanks i will try 1.4.1 > > On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai wrote: > >> Hi Koert, >> >> For the classloader issue, you probably hit >> https://issues.apache.org/jira/browse/SPARK-8365, which has been fixed

Setting different amount of cache memory for driver

2015-07-16 Thread Zalzberg, Idan (Agoda)
Hi, I am using the spark thrift server. In my deployment, I need to have more memory for the driver, to be able to get results back from the executors. Currently a lot of the driver memory is spent on caching, but I would prefer the driver would not use memory for that (only the executors) Is tha

Re: SparkSQL 1.4 can't accept registration of UDF?

2015-07-16 Thread Okehee Goh
The same issue (a custom UDF jar added through 'add jar' is not recognized) is observed on Spark 1.4.1. Instead of executing beeline> add jar udf.jar, my workaround is either 1) to pass the udf.jar by using "--jars" while starting the ThriftServer (this didn't work in AWS EMR's Spark 1.4.0.b), or 2)

Re: Console log file of CoarseGrainedExecutorBackend

2015-07-16 Thread Jeff Zhang
By default it is in ${SPARK_HOME}/work/${APP_ID}/${EXECUTOR_ID} On Thu, Jul 16, 2015 at 3:43 PM, Tao Lu wrote: > Hi, Guys, > > Where can I find the console log file of CoarseGrainedExecutorBackend > process? > > Thanks! > > Tao > > -- Best Regards Jeff Zhang

Console log file of CoarseGrainedExecutorBackend

2015-07-16 Thread Tao Lu
Hi, Guys, Where can I find the console log file of CoarseGrainedExecutorBackend process? Thanks! Tao

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
thanks i will try 1.4.1 On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai wrote: > Hi Koert, > > For the classloader issue, you probably hit > https://issues.apache.org/jira/browse/SPARK-8365, which has been fixed in > Spark 1.4.1. Can you try 1.4.1 and see if the exception disappear? > > Thanks, > > Yi

Re: S3 Read / Write makes executors deadlocked

2015-07-16 Thread Hao Ren
I have tested on another PC which has 8 CPU cores. But it hangs when defaultParallelismLevel > 4, e.g. sparkConf.setMaster("local[*]"). local[1] ~ local[3] work well; 4 is the mysterious boundary. It seems that I am not the only one who has encountered this problem: https://issues.apache.org/jira/browse/S

Please add two groups to "Community" page

2015-07-16 Thread Andrew Vykhodtsev
Moscow : http://www.meetup.com/Apache-Spark-in-Moscow/ Slovenija (Ljubljana) http://www.meetup.com/Apache-Spark-Ljubljana-Meetup/

Re: What else is need to setup native support of BLAS/LAPACK with Spark?

2015-07-16 Thread Sean Owen
Yes, that's most of the work, just getting the native libs into the assembly. netlib can find them from there even if you don't have BLAS libs on your OS, since it includes a reference implementation as a fallback. One common reason it won't load is not having libgfortran installed on your OSes th

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
Hi Koert, For the classloader issue, you probably hit https://issues.apache.org/jira/browse/SPARK-8365, which has been fixed in Spark 1.4.1. Can you try 1.4.1 and see if the exception disappear? Thanks, Yin On Thu, Jul 16, 2015 at 2:12 PM, Koert Kuipers wrote: > i am using scala 2.11 > > spar

How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-16 Thread Stahlman, Jonathan
Hello all, I am running the Spark recommendation algorithm in MLlib and I have been studying its output with various model configurations. Ideally I would like to be able to run one job that trains the recommendation model with many different configurations to try to optimize for performance.

Re: Select all columns except some

2015-07-16 Thread Lars Albertsson
The snippet at the end worked for me. We run Spark 1.3.x, so DataFrame.drop is not available to us. As pointed out by Yana, DataFrame operations typically return a new DataFrame, so use as such: import com.foo.sparkstuff.DataFrameOps._ ... val df = ... val prunedDf = df.dropColumns("one_col",
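The snippet itself is cut off in the archive; a hypothetical reconstruction of such a dropColumns helper for Spark 1.3.x (package and import taken from the post, implementation assumed) could look like this:

package com.foo.sparkstuff

import org.apache.spark.sql.DataFrame

// Hypothetical reconstruction of a dropColumns helper for Spark 1.3.x,
// where DataFrame.drop is not yet available.
object DataFrameOps {
  implicit class PimpedDataFrame(val df: DataFrame) extends AnyVal {
    def dropColumns(names: String*): DataFrame = {
      val remaining = df.columns.filterNot(names.toSet)
      df.select(remaining.head, remaining.tail: _*)
    }
  }
}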

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
I am using Scala 2.11. Spark jars are not in my assembly jar (they are "provided"), since I launch with spark-submit. On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers wrote: > spark 1.4.0 > > spark-csv is a normal dependency of my project and in the assembly jar > that i use > > but i also tried ad

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
Spark 1.4.0. spark-csv is a normal dependency of my project and is in the assembly jar that I use, but I also tried adding spark-csv with --packages for spark-submit, and got the same error. On Thu, Jul 16, 2015 at 4:31 PM, Yin Huai wrote: > We do this in SparkILoop ( > https://github.com/apache/spa

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Yin Huai
We do this in SparkILoop ( https://github.com/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037). What is the version of Spark you are using? How did you add the spark-csv jar? On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers wrote: > has a

create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
Has anyone tried to create a HiveContext only if the class is available? I tried this: implicit lazy val sqlc: SQLContext = try { Class.forName("org.apache.spark.sql.hive.HiveContext", true, Thread.currentThread.getContextClassLoader) .getConstructor(classOf[SparkContext]).newInstance(sc).asInst
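The code in the post is truncated; a hedged sketch of the reflection-plus-fallback pattern it describes (method name is illustrative):

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// Instantiate HiveContext reflectively if it is on the classpath,
// otherwise fall back to a plain SQLContext.
def sqlContextFor(sc: SparkContext): SQLContext =
  try {
    Class.forName("org.apache.spark.sql.hive.HiveContext",
        true, Thread.currentThread.getContextClassLoader)
      .getConstructor(classOf[SparkContext])
      .newInstance(sc)
      .asInstanceOf[SQLContext]
  } catch {
    case _: ClassNotFoundException | _: NoClassDefFoundError =>
      new SQLContext(sc)
  }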

Re: Getting not implemented by the TFS FileSystem implementation

2015-07-16 Thread Sean Owen
See also https://issues.apache.org/jira/browse/SPARK-8385 (apologies if someone already mentioned that -- just saw this thread) On Thu, Jul 16, 2015 at 7:19 PM, Jerrick Hoang wrote: > So, this has to do with the fact that 1.4 has a new way to interact with > HiveMetastore, still investigating. W

Re: Getting not implemented by the TFS FileSystem implementation

2015-07-16 Thread Jerrick Hoang
So, this has to do with the fact that 1.4 has a new way to interact with HiveMetastore, still investigating. Would really appreciate if anybody has any insights :) On Tue, Jul 14, 2015 at 4:28 PM, Jerrick Hoang wrote: > Hi all, > > I'm upgrading from spark1.3 to spark1.4 and when trying to run s

Re: PairRDDFunctions and DataFrames

2015-07-16 Thread Michael Armbrust
Instead of using that RDD operation just use the native DataFrame function approxCountDistinct https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$ On Thu, Jul 16, 2015 at 6:58 AM, Yana Kadiyska wrote: > Hi, could someone point me to the recommended way of u
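A minimal sketch of that suggestion (the DataFrame and column names are hypothetical):

import org.apache.spark.sql.functions.approxCountDistinct

// Approximate distinct count per key: the DataFrame analogue of
// PairRDDFunctions.countApproxDistinctByKey.
val counts = df.groupBy("key")
  .agg(approxCountDistinct("value").as("approx_distinct_values"))

counts.show()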

Re: pyspark 1.4 udf change date values

2015-07-16 Thread Davies Liu
Thanks for reporting this, could you file a JIRA for it? On Thu, Jul 16, 2015 at 8:22 AM, Luis Guerra wrote: > Hi all, > > I am having some troubles when using a custom udf in dataframes with pyspark > 1.4. > > I have rewritten the udf to simplify the problem and it gets even weirder. > The udfs

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-16 Thread Cody Koeninger
Not exactly the same issue, but possibly related: https://issues.apache.org/jira/browse/KAFKA-1196 On Thu, Jul 16, 2015 at 12:03 PM, Cody Koeninger wrote: > Well, working backwards down the stack trace... > > at java.nio.Buffer.limit(Buffer.java:275) > > That exception gets thrown if the limit

Re: Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-16 Thread Cody Koeninger
Well, working backwards down the stack trace... at java.nio.Buffer.limit(Buffer.java:275) That exception gets thrown if the limit is negative or greater than the buffer's capacity at kafka.message.Message.sliceDelimited(Message.scala:236) If size had been negative, it would have just returned

Resume checkpoint failed with Spark Streaming Kafka via createDirectStream under heavy reprocessing

2015-07-16 Thread Nicolas Phung
Hello, When I'm reprocessing the data from Kafka (about 40 GB) with the new Spark Streaming Kafka method createDirectStream, everything is fine until a driver error happens (driver is killed, connection lost...). When the driver pops up again, it resumes the processing with the checkpoint in HDFS.

BroadCast on Interval ( eg every 10 min )

2015-07-16 Thread Ashish Soni
Hi All, How can I broadcast a data change to all the executors every 10 min or 1 min? Ashish

Re: Select all columns except some

2015-07-16 Thread Yana Kadiyska
Have you tried to examine what clean_cols contains? I'm suspicious of this part: mkString(", "). Try this: val clean_cols : Seq[String] = df.columns... If you get a type error you need to work on clean_cols (I suspect yours is of type String at the moment and presents itself to Spark as a single col
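A sketch of the fix being suggested (column prefix taken from the original post): keep the surviving column names as a Seq[String] and pass them to select as varargs instead of joining them with mkString:

// Keep the column names as a sequence; do not concatenate them into one String.
val clean_cols: Seq[String] =
  df.columns.filterNot(_.startsWith("STATE_")).toSeq

// select(head, tail: _*) passes each remaining column name separately.
val pruned = df.select(clean_cols.head, clean_cols.tail: _*)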

Re: Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Raghavendra Pandey
Depending on what you do with them, they will get computed separately, because you may have a long DAG in each branch. So Spark tries to run all the transformation functions together rather than trying to optimize things across branches. On Jul 16, 2015 1:40 PM, "Bin Wang" wrote: > What if I would use bo

Re: HiveThriftServer2.startWithContext error with registerTempTable

2015-07-16 Thread Srikanth
Cheng, Yes, "select * from temp_table" was working. I was able to perform some transformation+action on the dataframe and print it on console. HiveThriftServer2.startWithContext was being run on the same session. When you say "try --jars option", are you asking me to pass spark-csv jar? I'm alrea

pyspark 1.4 udf change date values

2015-07-16 Thread Luis Guerra
Hi all, I am having some troubles when using a custom udf in dataframes with pyspark 1.4. I have rewritten the udf to simplify the problem and it gets even weirder. The udfs I am using do absolutely nothing, they just receive some value and output the same value with the same format. I show you

Please add our meetup home page in Japan.

2015-07-16 Thread Kousuke Saruta
Hi folks. We have lots of Spark enthusiasts and some organizations held talk events in Tokyo, Japan. Now we're going to unify those events and have created our home page on meetup.com. http://www.meetup.com/Tokyo-Spark-Meetup/ Could you add this to the list? Thanks. - Kousuke Saruta -

Re: spark streaming job to hbase write

2015-07-16 Thread Michael Segel
You ask an interesting question… Let's set aside Spark and look at the overall ingestion pattern. It's really an ingestion pattern where your input into the system is from a queue. Are the events discrete or continuous? (This is kinda important.) If the events are continuous then more than

Select all columns except some

2015-07-16 Thread Saif.A.Ellafi
Hi, In a hundred-column dataframe, I wish to either select all of them except some, or drop the ones I don't want. I am failing at such a simple task; I tried two ways: val clean_cols = df.columns.filterNot(col_name => col_name.startWith("STATE_").mkString(", ") df.select(clean_cols) But this thro

Re: Research ideas using spark

2015-07-16 Thread Michael Segel
Ok… After having some off-line exchanges with Shashidhar Rao, I came up with an idea… Apply machine learning to either implement or improve autoscaling up or down within a Storm/Akka cluster. While I don't know what constitutes an acceptable PhD thesis, or senior project for undergrads… this is

Re: Indexed Store for lookup table

2015-07-16 Thread Jem Tucker
Thanks! On Thu, Jul 16, 2015 at 1:59 PM Vetle Leinonen-Roeim wrote: > By the way - if you're going this route, see > https://github.com/datastax/spark-cassandra-connector > > On Thu, Jul 16, 2015 at 2:40 PM Vetle Leinonen-Roeim > wrote: > >> You'll probably have to install it separately. >> >>

PairRDDFunctions and DataFrames

2015-07-16 Thread Yana Kadiyska
Hi, could someone point me to the recommended way of using countApproxDistinctByKey with DataFrames? I know I can map to a pair RDD but I'm wondering if there is a simpler method? If someone knows whether this operation is expressible in SQL, that information would be most appreciated as well.

RE: Use rank with distribute by in HiveContext

2015-07-16 Thread java8964
Yes. The Hive UDF and distribute by are both supported by Spark SQL. If you are using Spark 1.4, you can try the Hive analytics window functions (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics), most of which are already supported in Spark 1.4, so you don't need the

Re: Use rank with distribute by in HiveContext

2015-07-16 Thread Todd Nist
Did you take a look at the excellent write up by Yin Huai and Michael Armbrust? It appears that rank is supported in the 1.4.x release. https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html Snippet from above article for your convenience: To answer the first ques
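A minimal sketch of the 1.4 window-function API the article describes (column names are hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// rank() over a partition: the Spark 1.4 equivalent of Hive's rank ... distribute by.
val w = Window.partitionBy("department").orderBy(col("salary").desc)

val ranked = df.withColumn("rank", rank().over(w))
ranked.show()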

[SPARK][GRAPHX] 'Executor Deserialize Time' is too big

2015-07-16 Thread Hlib Mykhailenko
Hello, I use Apache GraphX (version 1.1.0). Sometimes the stage which corresponds to this line of code: val graph = GraphLoader.edgeListFile(...) takes too much time. Looking at the EVENT_LOG_1 file I found out that for some tasks of this stage the 'Executor Deserialize Time' was too big (like ~102

Re: Indexed Store for lookup table

2015-07-16 Thread Vetle Leinonen-Roeim
By the way - if you're going this route, see https://github.com/datastax/spark-cassandra-connector On Thu, Jul 16, 2015 at 2:40 PM Vetle Leinonen-Roeim wrote: > You'll probably have to install it separately. > > On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker wrote: > >> Hi Vetle, >> >> IndexedRDD i

Re: Indexed Store for lookup table

2015-07-16 Thread Vetle Leinonen-Roeim
You'll probably have to install it separately. On Thu, Jul 16, 2015 at 2:29 PM Jem Tucker wrote: > Hi Vetle, > > IndexedRDD is persisted in the same way RDDs are as far as I am aware. Are > you aware if Cassandra can be built into my application or has to be a > stand alone database which is ins

Re: Indexed Store for lookup table

2015-07-16 Thread Jem Tucker
Hi Vetle, IndexedRDD is persisted in the same way RDDs are as far as I am aware. Are you aware if Cassandra can be built into my application or has to be a stand alone database which is installed separately? Thanks, Jem On Thu, Jul 16, 2015 at 12:59 PM Vetle Leinonen-Roeim wrote: > Hi, > > No

Use rank with distribute by in HiveContext

2015-07-16 Thread Lior Chaga
Does spark HiveContext support the rank() ... distribute by syntax (as in the following article- http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive )? If not, how can it be achieved? Thanks, Lior

Re: Indexed Store for lookup table

2015-07-16 Thread Vetle Leinonen-Roeim
Hi, Not sure how IndexedRDD is persisted, but perhaps you're better off using a NOSQL database for lookups (perhaps using Cassandra, with the Cassandra connector)? That should give you good performance on lookups, but persisting those billion records sounds like something that will take some time

Re: DataFrame.write().partitionBy("some_column").parquet(path) produces OutOfMemory with very few items

2015-07-16 Thread Nikos Viorres
Hi Lian, Thank you for the tip. Indeed, there were a lot of distinct values in my result set (approximately 3000). As you suggested, I decided to first partition the data on a column with much smaller cardinality. Thanks n On Thu, Jul 16, 2015 at 2:09 PM, Cheng Lian wrote: > Hi Nikos, > > Ho

DataFrame from RDD[Row]

2015-07-16 Thread Marius Danciu
Hi, This is an ugly solution because it requires pulling out a row: val rdd: RDD[Row] = ... ctx.createDataFrame(rdd, rdd.first().schema) Is there a better alternative to get a DataFrame from an RDD[Row] since toDF won't work as Row is not a Product ? Thanks, Marius
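A sketch of the usual alternative: declare the schema explicitly instead of peeking at rdd.first() (field names and types here are illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Explicit schema: no row needs to be pulled from the RDD.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("place", StringType, nullable = true)
))

val rdd: RDD[Row] = ??? // obtained elsewhere
val df = sqlContext.createDataFrame(rdd, schema)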

Re: DataFrame.write().partitionBy("some_column").parquet(path) produces OutOfMemory with very few items

2015-07-16 Thread Cheng Lian
Hi Nikos, How many columns and distinct values of "some_column" are there in the DataFrame? Parquet writer is known to be very memory consuming for wide tables. And lots of distinct partition column values result in many concurrent Parquet writers. One possible workaround is to first repartit
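A hedged sketch of that workaround; note that column-based DataFrame.repartition assumes a later Spark release (1.6+), so on 1.4 the practical alternative is choosing a lower-cardinality partition column, as the original poster ended up doing:

import org.apache.spark.sql.functions.col

// Cluster rows by the partition column before writing, so each task holds
// only a few concurrent Parquet writers. The output path is a placeholder.
df.repartition(col("some_column"))
  .write
  .partitionBy("some_column")
  .parquet("/path/to/output")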

Invalid HDFS path exception

2015-07-16 Thread wazza
Hi, I am trying to use Apache Spark in a RESTful web service, in which I query data from Hive tables using Spark SQL. This is my Java class: SparkConf sparkConf = new SparkConf().setAppName("Hive").setMaster("local").setSparkHome("Path"); JavaSparkContext

Spark 1.4.0 org.apache.spark.sql.AnalysisException: cannot resolve 'probability' given input columns

2015-07-16 Thread lokeshkumar
Hi forum, I am currently using Spark 1.4.0, and started using the ML pipeline framework. I ran the example program "ml.JavaSimpleTextClassificationPipeline" which uses LogisticRegression. But I wanted to do multiclass classification, so I used DecisionTreeClassifier present in the org.apache.spark

S3 Read / Write makes executors deadlocked

2015-07-16 Thread Hao Ren
Given the following code which just reads from s3, then saves files to s3 val inputFileName: String = "s3n://input/file/path" val outputFileName: String = "s3n://output/file/path" val conf = new SparkConf().setAppName(this.getClass.getName).setMaster("local[4]") val

Re: Sorted Multiple Outputs

2015-07-16 Thread Yiannis Gkoufas
Hi Eugene, thanks for your response! Your recommendation makes sense, that's what I more or less tried. The problem that I am facing is that inside foreachPartition() I cannot create a new rdd and use saveAsTextFile. It would probably make sense to write directly to HDFS using the Java API. When I

spark-streaming with flume run error under yarn mode

2015-07-16 Thread ??
Hi all, I'm using spark-streaming with Spark; I configure Flume like this: a1.channels = c1 a1.sinks = k1 a1.sources = r1 a1.sinks.k1.type = avro a1.sinks.k1.channel = c1 a1.sinks.k1.hostname = localhost a1.sinks.k1.port = 3 a1.sources.r1.type = avro a1.sources.r1.bind = localhost a1

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread Tathagata Das
Make sure you provide the filterFunction with the invertible reduceByKeyAndWindow. Otherwise none of the keys will get removed, and the key space will continue to increase. This is what is leading to the lag. So use the filtering function to filter out the keys that are not needed any more. On Thu, J
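A minimal sketch of an invertible reduceByKeyAndWindow with the filter function (the key/value types, durations and zero-count filter are illustrative):

import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

val pairs: DStream[(String, Long)] = ??? // keyed counts, obtained elsewhere

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,          // reduce
  (a: Long, b: Long) => a - b,          // inverse reduce
  Seconds(300),                         // window duration
  Seconds(10),                          // slide duration
  4,                                    // partitions
  (kv: (String, Long)) => kv._2 != 0    // filterFunc: drop keys whose count fell to 0
)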

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread N B
Thanks Akhil. For doing reduceByKeyAndWindow, one has to have checkpointing enabled. So, yes we do have it enabled. But not Write Ahead Log because we don't have a need for recovery and we do not recover the process state on restart. I don't know if IO Wait fully explains the increasing processing

Re: Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Bin Wang
What if I would use both rdd1 and rdd2 later? Raghavendra Pandey wrote on Thu, Jul 16, 2015 at 4:08 PM: > If you cache rdd it will save some operations. But anyway filter is a lazy > operation. And it runs based on what you will do later on with rdd1 and > rdd2... > > Raghavendra > On Jul 16, 2015 1:33 PM, "Bin

Re: Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Raghavendra Pandey
If you cache rdd it will save some operations. But anyway filter is a lazy operation. And it runs based on what you will do later on with rdd1 and rdd2... Raghavendra On Jul 16, 2015 1:33 PM, "Bin Wang" wrote: > If I write code like this: > > val rdd = input.map(_.value) > val f1 = rdd.filter(_

Will multiple filters on the same RDD optimized to one filter?

2015-07-16 Thread Bin Wang
If I write code like this: val rdd = input.map(_.value) val f1 = rdd.filter(_ == 1) val f2 = rdd.filter(_ == 2) ... Then the DAG of the execution may be this: -> Filter -> ... Map -> Filter -> ... But the two filters are operated on the same RDD, which means it could be done by
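A sketch of the caching suggestion from the reply, so the map is computed only once (input is the RDD from the question):

// Persist the mapped RDD once, so the two filter branches do not each recompute the map.
val rdd = input.map(_.value).cache()

val f1 = rdd.filter(_ == 1)
val f2 = rdd.filter(_ == 2)

// Both actions reuse the cached rdd instead of re-reading the input.
val c1 = f1.count()
val c2 = f2.count()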

Indexed Store for lookup table

2015-07-16 Thread Jem Tucker
Hello, I have been using IndexedRDD as a large lookup (1 billion records) to join with small tables (1 million rows). The performance of indexedrdd is great until it has to be persisted on disk. Are there any alternatives to IndexedRDD or any changes to how I use it to improve performance with big

Re: Spark streaming Processing time keeps increasing

2015-07-16 Thread Akhil Das
What is your data volume? Are you having checkpointing/WAL enabled? In that case make sure you are having SSD disks as this behavior is mainly due to the IO wait. Thanks Best Regards On Thu, Jul 16, 2015 at 8:43 AM, N B wrote: > Hello, > > We have a Spark streaming application and the problem t

Re: Spark cluster read local files

2015-07-16 Thread Akhil Das
Yes you can do that, just make sure you rsync the same file to the same location on every machine. Thanks Best Regards On Thu, Jul 16, 2015 at 5:50 AM, Julien Beaudan wrote: > Hi all, > > Is it possible to use Spark to assign each machine in a cluster the same > task, but on files in each machi
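A tiny sketch of that pattern (the path is hypothetical), assuming the file has been rsynced to the same location on every worker:

// The file:// scheme makes each executor read its own local copy of the file.
val localRdd = sc.textFile("file:///data/shared/input.txt")
println(localRdd.count())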

Re: Job aborted due to stage failure: Task not serializable:

2015-07-16 Thread Akhil Das
Did you try this? *val out=lines.filter(xx=>{* val y=xx val x=broadcastVar.value var flag:Boolean=false for(a<-x) { if(y.contains(a)) flag=true } flag } *})* Thanks Best Regards On Wed, Jul 15, 2015 at 8:10 PM, Naveen Dabas wrote: > > > > > > I am using the
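A self-contained sketch of the broadcast-variable filter being suggested (the keyword list and input lines are made up):

// Broadcast the lookup list once, then keep lines that contain any keyword.
val broadcastVar = sc.broadcast(Seq("error", "warn"))

val lines = sc.parallelize(Seq("an error occurred", "all good", "warn: low disk"))

val out = lines.filter(line => broadcastVar.value.exists(k => line.contains(k)))
println(out.collect().mkString("\n"))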

Re: DataFrame InsertIntoJdbc() Runtime Exception on cluster

2015-07-16 Thread Akhil Das
Which version of Spark are you using? insertIntoJDBC is deprecated (from 1.4.0), you may use write.jdbc() instead. Thanks Best Regards On Wed, Jul 15, 2015 at 2:43 PM, Manohar753 wrote: > Hi All, > > I am trying to add a few new rows to an existing table in MySQL using > DataFrame. But it is adding ne
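A sketch of the non-deprecated path on 1.4+ (URL, table name and credentials are placeholders):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "secret")
props.setProperty("driver", "com.mysql.jdbc.Driver")

// Append the DataFrame's rows to an existing MySQL table.
df.write
  .mode("append")
  .jdbc("jdbc:mysql://dbhost:3306/mydb", "my_table", props)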