Re: Getting all files of a table

2015-12-01 Thread Krzysztof Zarzycki
…but should work for Hive tables). > Michael > On Tue, Dec 1, 2015 at 10:55 AM, Krzysztof Zarzycki wrote: >> Hi there, >> Do you know how easily I can get a list of all files of a Hive table? >> What I want to achieve is to get all files that…

Getting all files of a table

2015-12-01 Thread Krzysztof Zarzycki
Hi there, Do you know how easily I can get a list of all files of a Hive table? What I want to achieve is to get all files that are underneath a Parquet table, using the sparksql-protobuf[1] library (a really handy library!) and its helper class ProtoParquetRDD: val protobufsRdd = new ProtoParquetRDD(s…
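
For the archive: newer Spark releases expose inputFiles on DataFrame, which (per Michael's reply above) should work for Hive tables as well. A minimal sketch of that route, with placeholder names, assuming your Spark version has it:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("list-table-files"))
    val hiveContext = new HiveContext(sc)

    // inputFiles asks the table's underlying relation for the concrete files
    // (e.g. the Parquet part-files) that a scan would read
    val files: Array[String] = hiveContext.table("my_db.my_table").inputFiles
    files.foreach(println)

The resulting paths could then be fed to a file-level reader such as ProtoParquetRDD.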

Spark using Yarn timelineserver - High CPU usage

2015-11-05 Thread Krzysztof Zarzycki
Hi there, I have a serious problem in my Hadoop cluster: the YARN Timeline Server generates very high load (800% CPU) when there are 8 Spark Streaming jobs running in parallel. I'm discussing this problem on the Hadoop group in parallel: http://mail-archives.apache.org/mod_mbox/hadoop-user/201509.mbox/%3CC…
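
One workaround sometimes used, assuming the load really comes from the Spark apps publishing events to ATS (and assuming you don't rely on ATS for Spark history): switch off the timeline integration from the Spark side. A sketch:

    import org.apache.spark.SparkConf

    // spark.hadoop.* properties are copied into the application's Hadoop
    // Configuration, so this overrides yarn.timeline-service.enabled for
    // this app only; the cluster-wide Timeline Server keeps running
    val conf = new SparkConf()
      .setAppName("my-streaming-job")
      .set("spark.hadoop.yarn.timeline-service.enabled", "false")

The same setting can be passed on the command line via --conf spark.hadoop.yarn.timeline-service.enabled=false.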

Re: Notification on Spark Streaming job failure

2015-10-06 Thread Krzysztof Zarzycki
…advice on this, from people who have implemented anything on this. >> On Fri, Sep 18, 2015 at 2:35 AM, Krzysztof Zarzycki wrote: >>> Hi there Spark Community, I would like to ask you for advice: I'm running Spark Streaming…

Re: Store DStreams into Hive using Hive Streaming

2015-10-05 Thread Krzysztof Zarzycki
I'm also interested in this feature. Did you guys find any information about how to use Hive Streaming with Spark Streaming? Thanks, Krzysiek. 2015-07-17 20:16 GMT+02:00 unk1102: > Hi, I have a similar use case. Did you find a solution for this problem of loading DStreams into Hive using Spark St…
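
For what it's worth, short of the real Hive Streaming (ACID) API, a plain micro-batch append from Spark Streaming can look like the sketch below. The table and stream names are placeholders, and this gives you ordinary appends, not Hive's transactional streaming ingest:

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.sql.hive.HiveContext

    def appendToHive(ssc: StreamingContext, jsonStream: DStream[String]): Unit = {
      val hiveContext = new HiveContext(ssc.sparkContext)
      jsonStream.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          // read.json infers a schema per batch; an explicit schema is safer
          hiveContext.read.json(rdd)
            .write.mode("append").saveAsTable("my_db.events")
        }
      }
    }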

Notification on Spark Streaming job failure

2015-09-17 Thread Krzysztof Zarzycki
Hi there Spark Community, I would like to ask you for advice: I'm running Spark Streaming jobs in production. Sometimes these jobs fail and I would like to get an email notification about it. Do you know how I can set up Spark to notify me by email if my job fails? Or do I have to use external moni…
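
Pending better answers, a crude sketch that can work: awaitTermination rethrows the exception that stopped the job, so the driver can catch it and send the mail itself. The helper below shells out to the standard mail utility, which is an assumption about what's installed on the driver host:

    import org.apache.spark.streaming.StreamingContext
    import scala.sys.process._

    // hypothetical helper: pipes the body into `mail`; assumes the utility
    // is installed and configured for outbound mail on the driver host
    def notifyByEmail(subject: String, body: String): Unit =
      (Seq("echo", body) #| Seq("mail", "-s", subject, "oncall@example.com")).!

    def runAndReport(ssc: StreamingContext): Unit = {
      ssc.start()
      try {
        ssc.awaitTermination()  // rethrows the error that killed the job
      } catch {
        case e: Exception =>
          notifyByEmail("Spark Streaming job failed", e.toString)
          throw e
      }
    }

Note this only covers failures that propagate to a live driver; if the driver process itself dies, you still need external monitoring, e.g. polling the YARN application state.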

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Krzysztof Zarzycki
…frameworks that might do it more conveniently (Samza, Flink, or the just-being-designed Kafka Streams <https://cwiki.apache.org/confluence/display/KAFKA/KIP-28+-+Add+a+processor+client>). Thanks Dibyendu for your note, I will strongly consider it when falling back to the receiver-based approach. Cheers, Kr…

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Krzysztof Zarzycki
Thanks guys for your answers. I put my answers inline, below. Cheers, Krzysztof Zarzycki. 2015-09-10 15:39 GMT+02:00 Cody Koeninger: > The Kafka direct stream meets those requirements. You don't need checkpointing for exactly-once. Indeed, unless your output operations are…
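
To make Cody's point concrete: with the direct stream, each RDD carries its Kafka offset ranges, and an output operation keyed on those ranges overwrites on replay instead of duplicating. A sketch, where saveBatch is a hypothetical stand-in for your own idempotent sink:

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.dstream.DStream
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

    // hypothetical sink: must be idempotent per (topic, partition, fromOffset),
    // e.g. an upsert keyed on exactly those three values
    def saveBatch(range: OffsetRange, records: Iterator[(String, String)]): Unit =
      println(s"writing ${range.topic}/${range.partition} " +
              s"[${range.fromOffset}, ${range.untilOffset})")  // stand-in

    // `stream` should be the DStream returned by KafkaUtils.createDirectStream
    def writeIdempotently(stream: DStream[(String, String)]): Unit =
      stream.foreachRDD { rdd =>
        val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd.foreachPartition { iter =>
          // with the direct stream, RDD partitions map 1:1 to Kafka
          // partitions, so the partition id indexes into the offset ranges
          saveBatch(ranges(TaskContext.get.partitionId), iter)
        }
      }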

Re: Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Krzysztof Zarzycki
…examples of manually managing ZK offsets? Thanks, Krzysztof. 2015-09-10 12:22 GMT+02:00 Akhil Das: > This consumer pretty much covers all those scenarios you listed: github.com/dibbhatt/kafka-spark-consumer. Give it a try. > Thanks, Best Regards > On Thu, Sep 10, 2015…
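
Since the question asks for examples: a hedged sketch of writing offsets to ZooKeeper by hand, against Kafka 0.8.x and Spark 1.x APIs. Hosts, topic, group name, and the ZK path layout are placeholders; double-check the ZkUtils signature against your Kafka version:

    import kafka.serializer.StringDecoder
    import kafka.utils.{ZKStringSerializer, ZkUtils}
    import org.I0Itec.zkclient.ZkClient
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    def trackOffsetsInZk(ssc: StreamingContext): Unit = {
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))

      val zkClient = new ZkClient("zk1:2181", 30000, 30000, ZKStringSerializer)

      stream.foreachRDD { rdd =>
        // write the batch's output first, then record how far we got
        for (r <- rdd.asInstanceOf[HasOffsetRanges].offsetRanges) {
          val path = s"/consumers/my-group/offsets/${r.topic}/${r.partition}"
          ZkUtils.updatePersistentPath(zkClient, path, r.untilOffset.toString)
        }
      }
    }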

Using KafkaDirectStream, stopGracefully and exceptions

2015-09-10 Thread Krzysztof Zarzycki
…be able to upgrade code and not lose Kafka offsets? Thank you a lot for your answers, Krzysztof Zarzycki
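
On restarting after a code upgrade without relying on checkpoints: the direct stream has an overload that takes explicit fromOffsets, so the job can resume from whatever it persisted itself. A sketch, where readOffsetsFromStore is a hypothetical stand-in for reading back your stored offsets (e.g. the ZK paths in the sketch above):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.kafka.KafkaUtils

    // hypothetical: fetch the offsets the job persisted on its previous run
    def readOffsetsFromStore(): Map[TopicAndPartition, Long] =
      Map(TopicAndPartition("events", 0) -> 12345L)  // stand-in values

    def resumeFromStoredOffsets(ssc: StreamingContext) =
      KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
          (String, String)](
        ssc,
        Map("metadata.broker.list" -> "broker1:9092"),
        readOffsetsFromStore(),
        (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))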

Re: Merge metadata error when appending to parquet table

2015-08-09 Thread Krzysztof Zarzycki
…one can help? Of course the original problem stays open. Thanks! Krzysiek. 2015-08-09 14:19 GMT+02:00 Krzysztof Zarzycki: > Hi there, I have a problem with a Spark Streaming job running on Spark 1.4.1 that appends to a Parquet table. > My job receives JSON strings and creates Jso…

Merge metadata error when appending to parquet table

2015-08-09 Thread Krzysztof Zarzycki
Hi there, I have a problem with a Spark Streaming job running on Spark 1.4.1 that appends to a Parquet table. My job receives JSON strings and creates a JsonRDD out of them. The JSONs may come in different shapes, as most of the fields are optional, but they never have conflicting schemas. Next, for e…
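
One way to avoid per-batch schema drift (and hence the metadata merge on append) is to parse every batch with a single explicit schema instead of letting JSON inference run per micro-batch. A sketch with a made-up schema and path:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.types._

    // hypothetical unified schema listing every optional field once, so each
    // micro-batch writes identical Parquet metadata
    val eventSchema = StructType(Seq(
      StructField("id", StringType, nullable = true),
      StructField("ts", LongType, nullable = true),
      StructField("payload", StringType, nullable = true)))

    def appendBatch(sqlContext: SQLContext, jsonRdd: RDD[String]): Unit =
      sqlContext.read.schema(eventSchema).json(jsonRdd)
        .write.mode("append").parquet("/warehouse/events")  // placeholder path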

writing/reading multiple Parquet files: Failed to merge incompatible data types StringType and StructType

2015-07-21 Thread Krzysztof Zarzycki
Hi everyone, I have a pretty challenging problem with reading/writing multiple Parquet files with streaming, but let me introduce my data flow first: I have a lot of JSON events streaming into my platform. All of them have the same structure, but the fields are mostly optional. Some of the fields are arrays wit…
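
If conflicting files are already on disk (the same field inferred as StringType in some files and StructType in others), schema merging will keep failing on the incompatible types. A workaround sketch: load each generation of files separately and union only the columns whose types agree. Paths and column names are placeholders:

    import org.apache.spark.sql.SQLContext

    def unionCompatible(sqlContext: SQLContext) = {
      // each directory holds files written with one consistent schema
      val gen1 = sqlContext.read.parquet("/data/events/gen1").select("id", "ts")
      val gen2 = sqlContext.read.parquet("/data/events/gen2").select("id", "ts")
      gen1.unionAll(gen2)  // only columns with agreeing types survive
    }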

Re: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Krzysztof Zarzycki
…This is a common use of Spark Streaming + Cassandra/HBase. > Regarding the performance of updateStateByKey, we are aware of the limitations, and we will improve it soon :) > TD > On Tue, Apr 14, 2015 at 12:34 PM, Krzysztof Zarzycki wrote: >> H…

Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-04-14 Thread Krzysztof Zarzycki
Hey guys, could you please help me with a question I asked on Stack Overflow: https://stackoverflow.com/questions/29635681/is-it-feasible-to-keep-millions-of-keys-in-state-of-spark-streaming-job-for-two ? I'll be really grateful for your help! I'm also pasting the question below: I'm trying to so…
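
For the archive, beyond TD's suggestion of keeping the state in Cassandra/HBase: the usual way to keep updateStateByKey state bounded is to timestamp each entry and return None for keys idle past a TTL, so they are dropped from the state RDD. A sketch with made-up types and TTL:

    // state for one key: a running count plus when we last saw the key
    case class KeyState(count: Long, lastUpdated: Long)

    val ttlMs = 60L * 24 * 60 * 60 * 1000  // roughly two months

    def update(events: Seq[Long], state: Option[KeyState]): Option[KeyState] = {
      val now = System.currentTimeMillis()
      if (events.nonEmpty)
        Some(KeyState(state.map(_.count).getOrElse(0L) + events.sum, now))
      else
        state.filter(s => now - s.lastUpdated < ttlMs)  // expire idle keys
    }

    // usage, given pairs: DStream[(String, Long)]
    //   val stateStream = pairs.updateStateByKey(update _)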