Re: Linear search between particular log4j log lines

2015-07-11 Thread Akhil Das
Can you not use sc.wholeTextFiles() and use a custom parser or a regex to extract the TransactionIDs? Thanks Best Regards On Sat, Jul 11, 2015 at 8:18 AM, ssbiox wrote: > Hello, > > I have a very specific question on how to do a search between particular > lines of a log file. I did some resea
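
A minimal sketch of that suggestion, assuming an HDFS path and a TransactionID log format that are made up for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("LogSearch"))
    // wholeTextFiles yields (filePath, fullFileContent) pairs, so a regex
    // can scan across log lines within one file.
    val txnId = """TransactionID=(\w+)""".r   // hypothetical log format
    val ids = sc.wholeTextFiles("hdfs:///logs/")
      .flatMap { case (_, content) => txnId.findAllMatchIn(content).map(_.group(1)) }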

Re: Issues when combining Spark and a third party java library

2015-07-11 Thread Akhil Das
Did you try setting the HADOOP_CONF_DIR? Thanks Best Regards On Sat, Jul 11, 2015 at 3:17 AM, maxdml wrote: > Also, it's worth noting that I'm using the prebuilt version for hadoop 2.4 > and higher from the official website. > > > > -- > View this message in context: > http://apache-spark-user-

Re: Starting Spark-Application without explicit submission to cluster?

2015-07-11 Thread Akhil Das
Yes, that is correct. You can use this boilerplate to avoid spark-submit. //The configurations val sconf = new SparkConf() .setMaster("spark://spark-ak-master:7077") .setAppName("SigmoidApp") .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") .s
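
The snippet above is cut off mid-call; a minimal completion of that boilerplate, assuming the master URL and app name from the thread and that no further settings are needed:

    import org.apache.spark.{SparkConf, SparkContext}

    // The configurations
    val sconf = new SparkConf()
      .setMaster("spark://spark-ak-master:7077")
      .setAppName("SigmoidApp")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    // Creating the context directly lets the app run without spark-submit.
    val sc = new SparkContext(sconf)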

Moving average using Spark and Scala

2015-07-11 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as the programming language: Read the dataset from HDFS. A few sample lines look like this: deviceid,bytes,eventdate 15590657,246620,20150630 14066921,1907,20150621 14066921,1906,20150626 6522013,2349,20150626 6522013,252

Calculating moving average of dataset in Apache Spark and Scala

2015-07-11 Thread Anupam Bagchi
I have to do the following tasks on a dataset using Apache Spark with Scala as the programming language: Read the dataset from HDFS. A few sample lines look like this: deviceid,bytes,eventdate 15590657,246620,20150630 14066921,1907,20150621 14066921,1906,20150626 6522013,2349,20150626 6522013,252
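
One way to frame the task (the same question appears in the previous thread), sketched under the assumption that the input is the CSV shown above, a header line followed by deviceid,bytes,eventdate rows, and that a cumulative moving average per device is wanted:

    val lines = sc.textFile("hdfs:///data/devices.csv")
    val parsed = lines.filter(!_.startsWith("deviceid")).map { l =>
      val Array(id, bytes, date) = l.split(",")
      (id, (date, bytes.toDouble))
    }
    // Group per device, order events by date (yyyyMMdd sorts lexicographically),
    // then emit the running average after each event.
    val movingAvg = parsed.groupByKey().mapValues { events =>
      val byDate = events.toSeq.sortBy(_._1).map(_._2)
      byDate.scanLeft((0.0, 0)) { case ((s, n), b) => (s + b, n + 1) }
            .tail.map { case (s, n) => s / n }
    }

Note that groupByKey pulls each device's whole history onto one executor, which is fine for short histories but worth replacing with a windowed aggregation if a single device can have millions of events.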

Re: Spark performance

2015-07-11 Thread Jörn Franke
Honestly you are addressing this wrongly - you do not seem to have a business case for changing - so why do you want to switch? On Sat, Jul 11, 2015 at 3:28, Mohammed Guller wrote: > Hi Ravi, > > First, neither Spark nor Spark SQL is a database. Both are compute > engines, which need to be pa

Re: Spark performance

2015-07-11 Thread Jörn Franke
On Sat, Jul 11, 2015 at 14:53, Roman Sokolov wrote: > Hello. Had the same question. What if I need to store 4-6 TB and do > queries? Can't find any clue in the documentation. > On 11.07.2015 03:28, "Mohammed Guller" wrote: > >> Hi Ravi, >> >> First, neither Spark nor Spark SQL is a database. Bo

Re: S3 vs HDFS

2015-07-11 Thread Aaron Davidson
Note that if you use multi-part upload, each part becomes 1 block, which allows for multiple concurrent readers. One would typically use a fixed part size that aligns with the default HDFS block size (64 MB, I think) to ensure the reads are aligned. On Sat, Jul 11, 2015 at 11:14 AM, Steve
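
A hedged illustration of the alignment idea: with the s3a client (Hadoop 2.6+), the multipart size is an ordinary Hadoop configuration key, so it can be pinned to a 64 MB part size from the Spark driver. The key below applies only to s3a, not the older s3n client:

    // 64 MB parts so each uploaded part maps to one readable block.
    sc.hadoopConfiguration.set("fs.s3a.multipart.size",
                               (64L * 1024 * 1024).toString)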

Re: S3 vs HDFS

2015-07-11 Thread Steve Loughran
seek() is very, very expensive on S3, even for short forward seeks. If your code does a lot of them, it will kill performance. (Forward seeks are better in s3a, which as of Hadoop 2.7 is now something safe to use, and in the s3 client that Amazon includes in EMR), but it's still sluggish. The other killers

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread Krishna Sankar
Looks like reduceByKey() should work here. Cheers On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna < leonida.gianfa...@gmail.com> wrote: > Thanks a lot oubrik, > > I got your point; my consideration is that sum() should already be a > built-in function for iterators in Python. > Anyway I trie
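
The thread's code is Python, but the reduceByKey suggestion looks the same in any Spark API; a self-contained Scala sketch with an assumed input path:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val words = sc.textFile("hdfs:///input.txt").flatMap(_.split("\\s+"))
    // Sums the 1s per word without ever materializing a per-key iterator.
    val wordCounts = words.map(w => (w, 1)).reduceByKey(_ + _)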

Re: Sum elements of an iterator inside an RDD

2015-07-11 Thread leonida.gianfagna
Thanks a lot oubrik, I got your point; my consideration is that sum() should already be a built-in function for iterators in Python. Anyway I tried your approach def mysum(iter): count = sum = 0 for item in iter: count += 1 sum += item return sum wordCountsGrouped = wor

Worker dies with java.io.IOException: Stream closed

2015-07-11 Thread gaurav sharma
Hi All, I am facing this issue in my production environment. My worker dies by throwing this exception. But I see that space is available on all the partitions of my disk. I did NOT see any abrupt increase in Disk IO, which might have choked the executor writing to the stderr file. But still m

RE: Spark performance

2015-07-11 Thread Mohammed Guller
Hi Roman, Yes, Spark SQL will be a better solution than a standard RDBMS for querying 4-6 TB of data. You can pair Spark SQL with HDFS+Parquet to build a powerful analytics solution. Mohammed From: David Mitchell [mailto:jdavidmitch...@gmail.com] Sent: Saturday, July 11, 2015 7:10 AM To: R
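
A sketch of that pairing with the Spark 1.4-era API, assuming a Parquet dataset already sits on HDFS and has a numeric bytes column (both made up here):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // Parquet's columnar layout plus predicate pushdown is what makes
    // this combination work for large scans.
    val df = sqlContext.read.parquet("hdfs:///warehouse/events")
    df.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events WHERE bytes > 1000").show()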

Re: SparkDriverExecutionException when using actorStream

2015-07-11 Thread Juan Rodríguez Hortalá
Hi, I've finally fixed this. The problem was that I wasn't providing a type for the DStream in ssc.actorStream; with this, the inputDStream is ReceiverInputDStream[Nothing] and we get SparkDriverExecutionException: Execution error, caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;
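
A minimal sketch of the fix, with an illustrative receiver actor (names are made up; ssc is an existing StreamingContext). The essential part is the explicit [String] type parameter, without which the stream is inferred as ReceiverInputDStream[Nothing]:

    import akka.actor.{Actor, Props}
    import org.apache.spark.streaming.receiver.ActorHelper
    import org.apache.spark.streaming.dstream.ReceiverInputDStream

    // Hypothetical actor that forwards whatever it receives into Spark.
    class EchoActor extends Actor with ActorHelper {
      def receive = { case s: String => store(s) }
    }

    val lines: ReceiverInputDStream[String] =
      ssc.actorStream[String](Props[EchoActor], "EchoReceiver")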

Re: Spark performance

2015-07-11 Thread David Mitchell
You can certainly query over 4 TB of data with Spark. However, you will get an answer in minutes or hours, not in milliseconds or seconds. OLTP databases are used for web applications, and typically return responses in milliseconds. Analytic databases tend to operate on large data sets, and retu

Re: How do we control output part files created by Spark job?

2015-07-11 Thread Srikanth
Reducing the no. of partitions may have an impact on memory consumption, especially if there is an uneven distribution of the key used in groupBy. Depends on your dataset. On Sat, Jul 11, 2015 at 5:06 AM, Umesh Kacha wrote: > Hi Srikanth, thanks much, it worked when I set spark.sql.shuffle.partitions=10 > I thin

RE: Spark performance

2015-07-11 Thread Roman Sokolov
Hello. Had the same question. What if I need to store 4-6 TB and do queries? Can't find any clue in the documentation. On 11.07.2015 03:28, "Mohammed Guller" wrote: > Hi Ravi, > > First, neither Spark nor Spark SQL is a database. Both are compute > engines, which need to be paired with a storage sy

Re: Spark performance

2015-07-11 Thread Jörn Franke
What is your business case for the move? On Fri, Jul 10, 2015 at 12:49, Ravisankar Mani wrote: > Hi everyone, > > I have planned to move from MSSQL Server to Spark. I am using around 50,000 > to 1 lakh (100,000) records. > The Spark performance is slow when compared to MSSQL Server. > > What is the best da

Rdd partitioning

2015-07-11 Thread anshu shukla
Suppose I have an RDD with 10 tuples and a cluster with 100 cores (standalone mode); by default, how will the partitioning be done? I did not get how it will divide the 10-tuple set (RDD) across 100 cores (by default). Mentioned in the documentation - *spark.default.parallelism* For distributed shuffle operati
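
A quick way to see what the scheduler actually gets, sketched in spark-shell terms:

    // With no explicit slice count, parallelize uses spark.default.parallelism,
    // so a 10-element RDD on a 100-core standalone cluster typically ends up
    // with 100 partitions, 90 of them empty.
    val rdd = sc.parallelize(1 to 10)
    println(rdd.partitions.length)

    // An explicit second argument overrides the default.
    val rdd4 = sc.parallelize(1 to 10, 4)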

spark streaming doubt

2015-07-11 Thread Shushant Arora
1. Spark Streaming 1.3 creates as many RDD partitions as there are Kafka partitions in the topic. Say I have 300 partitions in the topic and 10 executors, each with 3 cores; does that mean only 10*3=30 partitions are processed at a time, then the next 30, and so on, since executors launch tasks per RDD partition
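
Working through the arithmetic in that first question: 10 executors × 3 cores = 30 concurrently running tasks, so 300 Kafka-aligned partitions would indeed be processed in roughly 300 / 30 = 10 sequential waves of tasks within each batch, rather than all at once.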

Re: Spark Streaming and using Swift object store for checkpointing

2015-07-11 Thread algermissen1971
On 10 Jul 2015, at 23:10, algermissen1971 wrote: > Hi, > > initially today when moving my streaming application to the cluster for the first > time, I ran into the newbie error of using a local file system for checkpointing > and the RDD partition count differences (see exception below). > > Having

Re: How do we control output part files created by Spark job?

2015-07-11 Thread Umesh Kacha
Hi Srikanth, thanks much, it worked when I set spark.sql.shuffle.partitions=10. I think reducing shuffle partitions will slow down my hiveContext group-by query - or won't it slow it down? Please guide. On Sat, Jul 11, 2015 at 7:41 AM, Srikanth wrote: > Is there a join involved in your sql? > Have a lo
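
A sketch of the setting under discussion, on a HiveContext as in the thread; the trade-off is fewer, larger output part files against bigger per-task groups during the shuffle:

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    // 10 shuffle partitions => at most 10 part files from the group-by.
    hiveContext.setConf("spark.sql.shuffle.partitions", "10")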