Re: RE: Fast write datastore...

2017-03-16 Thread Rick Moritz
If you have enough RAM/SSDs available, maybe tiered HDFS storage and Parquet might also be an option. Of course, management-wise it has much more overhead than using ES, since you need to manually define partitions and buckets, which is suboptimal. On the other hand, for querying, you can probably
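For reference, a minimal sketch of the manual partition and bucket definition mentioned above (column names, paths, and the table name are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-serving").getOrCreate()

// Hypothetical source data already sitting on (tiered) HDFS storage.
val df = spark.read.parquet("hdfs:///data/events")

df.write
  .partitionBy("event_date")       // one directory per date
  .bucketBy(32, "user_id")         // 32 hash buckets on user_id
  .sortBy("user_id")
  .format("parquet")
  .saveAsTable("events_bucketed")  // bucketBy requires saveAsTable in Spark 2.x
```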

RE: RE: Fast write datastore...

2017-03-16 Thread yohann jardin
Hello everyone, I'm also really interested in the answers as I will be facing the same issue soon. Muthu, if you evaluate Apache Ignite again, can you share your results? I also noticed Alluxio for storing Spark results in memory, which you might want to investigate. In my case I want to use them t

RE: RE: Fast write datastore...

2017-03-16 Thread Mal Edwin
Hi All, I believe what we are looking for here is a serving layer where user queries can be executed on a subset of processed data. In this scenario we are using Impala, as it provides layered caching; in our use case it caches some set in memory and then some in HDFS and the full set

[Spark Streaming+Kafka][How-to]

2017-03-16 Thread OUASSAIDI, Sami
Hi all, So I need to specify how an executor should consume data from a kafka topic. Let's say I have 2 topics : t0 and t1 with two partitions each, and two executors e0 and e1 (both can be on the same node so assign strategy does not work since in the case of a multi executor node it works based
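The "assign strategy" referred to here is presumably ConsumerStrategies.Assign from the spark-streaming-kafka-0-10 integration; a rough sketch under that assumption (broker address, group id, and partition choices are illustrative):

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// Illustrative only: subscribe this stream to fixed partitions of t0 and t1.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "broker:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "example-group"
)
val partitions = List(new TopicPartition("t0", 0), new TopicPartition("t1", 0))

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,                                  // an existing StreamingContext
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](partitions, kafkaParams)
)
```

Note this controls which partitions a stream consumes, not which executor processes them, which is the limitation the question is about.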

Re: RE: Fast write datastore...

2017-03-16 Thread Sudhir Menon
I am extremely leery about pushing product on this forum and have refrained from it in the past. But since you are talking about loading parquet data into Spark, run some aggregate queries and then write the results to a fast data store, and specifically asking for product options, it makes absolu

Re: [Spark Streaming+Kafka][How-to]

2017-03-16 Thread Cody Koeninger
Spark just really isn't a good fit for trying to pin particular computation to a particular executor, especially if you're relying on that for correctness. On Thu, Mar 16, 2017 at 7:16 AM, OUASSAIDI, Sami wrote: > > Hi all, > > So I need to specify how an executor should consume data from a kafk

Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
Hi, While saving a dataset using * mydataset.write().csv("outputlocation") * I am running into an exception *"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)"* Does it mean that for saving a dataset whole of

CSV empty columns handling in Spark 2.0.2

2017-03-16 Thread George Obama
Hello, I am using spark 2.0.2 to read the CSV file with empty columns and is hitting the issue: scala>val df = sqlContext.read.option("header", true).option("inferSchema", true).csv("file location") 17/03/13 07:26:26 WARN DataSource: Error while looking for metadata directory. scala> df.show(
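A reproduction sketch of the read in question, assuming the Spark 2.0.x DataFrameReader API (the file path is a placeholder):

```scala
// Read a CSV containing empty columns with header and schema inference enabled.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/path/to/file.csv")

df.printSchema()
df.show()
```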

Re: [Spark Streaming+Kafka][How-to]

2017-03-16 Thread Michael Armbrust
I think it should be straightforward to express this using structured streaming. You could ensure that data from a given partition ID is processed serially by performing a group by on the partition column. spark.readStream .format("kafka") .option("kafka.bootstrap.servers", "...") .option("
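A rough sketch of that suggestion, filling in the truncated snippet above (topic names, servers, and the aggregation itself are placeholders):

```scala
import org.apache.spark.sql.functions._

// Read from Kafka and group by the source partition column so that records
// from a given Kafka partition are aggregated together.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "t0,t1")
  .load()

val perPartition = df
  .groupBy(col("partition"))   // the Kafka source exposes a `partition` column
  .count()

val query = perPartition.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```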

Hive on Spark Job Monitoring

2017-03-16 Thread Ninad Shringarpure
Hi Team, I wanted to understand how Hive on Spark actually maps to the Spark jobs triggered underneath by Hive. AFAIK each Hive query would trigger a new Spark job, but this was contradicted by someone and I wanted to confirm what the real design implementation is. Please let me know if there is refere

[Spark Streaming] Checkpoint backup (.bk) file purpose

2017-03-16 Thread Bartosz Konieczny
Hello, I'm currently studying the metadata checkpoint implementation in Spark Streaming and I was wondering about the purpose of the so-called "backup files". CheckpointWriter snippet: > // We will do checkpoint when generating a batch and completing a batch. > When the processing > // time of a batch is great

Streaming 2.1.0 - window vs. batch duration

2017-03-16 Thread Dominik Safaric
Hi all, As I’ve implemented a streaming application pulling data from Kafka every 1 second (batch interval), I am observing some quite strange behaviour (I didn’t use Spark extensively in the past, but continuous operator based engines instead). Namely the dstream.window(Seconds(60)) windowe
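For reference, a minimal sketch of the setup being described: a 1-second batch interval with a 60-second window (the source is a placeholder):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-example")
val ssc = new StreamingContext(conf, Seconds(1))      // 1-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
val windowed = lines.window(Seconds(60))              // window length must be a multiple of the batch interval
windowed.count().print()

ssc.start()
ssc.awaitTermination()
```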

Re: Dataset : Issue with Save

2017-03-16 Thread Yong Zhang
You can take a look at https://issues.apache.org/jira/browse/SPARK-12837 Yong Spark driver requires large memory space for serialized ... issues.apache.org Executing a sql statement with a large number of partitions requires a high memory spac

Re: CSV empty columns handling in Spark 2.0.2

2017-03-16 Thread Hyukjin Kwon
I think this is fixed in https://github.com/apache/spark/pull/15767 This should be fixed in 2.1.0. 2017-03-17 3:28 GMT+09:00 George Obama : > Hello, > > > > I am using spark 2.0.2 to read the CSV file with empty columns and is > hitting the issue: > > scala>val df = sqlContext.read.option("head

Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Everett Anderson
Hi, We're using Dataset union() in Spark 2.0.2 to concatenate a bunch of tables together and save as Parquet to S3, but it seems to take a long time. We're using the S3A FileSystem implementation under the covers, too, if that helps. Watching the Spark UI, the executors all eventually stop (we're
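A sketch of the pattern being described, assuming the inputs are Parquet tables on S3 (paths are placeholders):

```scala
// Read several tables, concatenate them with union, and write the result back to S3.
val paths = Seq("s3a://bucket/table_a", "s3a://bucket/table_b", "s3a://bucket/table_c")

val combined = paths
  .map(spark.read.parquet(_))
  .reduce(_ union _)   // Dataset.union in 2.0.x; see the replies below on union vs unionAll

combined.write.mode("overwrite").parquet("s3a://bucket/combined")
```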

Re: Streaming 2.1.0 - window vs. batch duration

2017-03-16 Thread Michael Armbrust
Have you considered trying event time aggregation in structured streaming instead? On Thu, Mar 16, 2017 at 12:34 PM, Dominik Safaric wrote: > Hi all, > > As I’ve implemented a streaming application pulling data from Kafka every > 1 second (batch interval), I am observing some quite strange behav
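A sketch of what event-time aggregation might look like here (source options and column names are illustrative):

```scala
import org.apache.spark.sql.functions._

// Count records per 60-second event-time window, keyed on the record timestamp.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

val counts = events
  .groupBy(window(col("timestamp"), "60 seconds"))
  .count()

counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```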

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Everett Anderson
Looks like the Dataset version of union may also fail with the following on larger data sets, which again seems like it might be drawing everything into the driver for some reason -- 7/03/16 22:28:21 WARN TaskSetManager: Lost task 1.0 in stage 91.0 (TID 5760, ip-10-8-52-198.us-west-2.compute.inter

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Burak Yavuz
Hi Everett, IIRC we added unionAll in Spark 2.0 which is the same implementation as rdd union. The union in DataFrames with Spark 2.0 does deduplication, and that's why you should be seeing the slowdown. Best, Burak On Thu, Mar 16, 2017 at 4:14 PM, Everett Anderson wrote: > Looks like the Dat

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Everett Anderson
Hi! On Thu, Mar 16, 2017 at 5:20 PM, Burak Yavuz wrote: > Hi Everett, > > IIRC we added unionAll in Spark 2.0 which is the same implementation as > rdd union. The union in DataFrames with Spark 2.0 does deduplication, and > that's why you should be seeing the slowdown. > I thought it was the o

spark streaming executors memory increasing and executor killed by yarn

2017-03-16 Thread darin
Hi, I got this exception after the streaming program had run for some hours. ``` *User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 1194.0 failed 4 times, most recent failure: Lost task 21.3 in stage 1194.0 (TID 2475, 2.dev3, executor 66): ExecutorL

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
Hi, Was this not yet resolved? It's a very common requirement to save a dataframe; is there a better way to save a dataframe that avoids data being sent to the driver? *"Total size of serialized results of 3722 tasks (1024.0 MB) is bigger than spark.driver.maxResultSize (1024.0 MB) "* Thanks, Baahu

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-16 Thread Yong Zhang
For this kind of question, you always want to tell us the Spark version. Yong From: darin Sent: Thursday, March 16, 2017 9:59 PM To: user@spark.apache.org Subject: spark streaming executors memory increasing and executor killed by yarn Hi, I got this exception w

Re: Dataset : Issue with Save

2017-03-16 Thread Yong Zhang
Did you read the JIRA ticket? Are you confirming that it is fixed in Spark 2.0, or complaining that it still exists in Spark 2.0? First, you didn't tell us which version of Spark you are using. The JIRA clearly said that it is a bug in Spark 1.x and should be fixed in Spark 2.0. So help

Re: Dataset : Issue with Save

2017-03-16 Thread Bahubali Jain
I am using SPARK 2.0. There are comments in the ticket since Oct-2016 which clearly mention that the issue still persists even in 2.0. I agree 1G is very small in today's world, and I have already resolved it by increasing *spark.driver.maxResultSize.* I was more intrigued as to why the data is being se
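For reference, a sketch of how that limit is typically raised (the value is arbitrary; it can also be passed to spark-submit as --conf spark.driver.maxResultSize=4g):

```scala
import org.apache.spark.sql.SparkSession

// Raise the driver-side cap on the total size of serialized task results.
val spark = SparkSession.builder()
  .appName("save-example")
  .config("spark.driver.maxResultSize", "4g")
  .getOrCreate()
```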

RDD can not convert to df, thanks

2017-03-16 Thread ??????????
Hi all, when I try to convert an RDD to a DF, I get errors like the following: value toDF is not a member of org.apache.spark.rdd.RDD[X] possible cause: maybe a semicolon is missing before value toDF My code looks like: spark.textfile().map(s=>_.split(".")).map( s=>AAA(1,2)).toDF the code can

Re: RDD can not convert to df, thanks

2017-03-16 Thread ??????????
More info: I have imported the implicits of SparkSession. ---Original--- From: "user-return-68576-1427357147=qq.com" Date: 2017/3/17 11:48:43 To: "user"; Subject: RDD can not convert to df, thanks Hi all, when I try to convert an RDD to a DF, I get errors like the following: value toDF is not a membe
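For comparison, a minimal working sketch of the conversion, assuming the element type is a case class defined at the top level and the implicits are imported from the concrete SparkSession instance (paths and names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

// Case class defined outside any method body.
case class AAA(a: Int, b: Int)

object ToDfExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-df").getOrCreate()
    import spark.implicits._   // brings toDF into scope for RDDs of case classes

    val df = spark.sparkContext
      .textFile("hdfs:///input.txt")   // placeholder path
      .map(_.split("\\."))
      .map(parts => AAA(1, 2))         // mirrors the original example
      .toDF()

    df.show()
  }
}
```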