[Spark Streaming] Unable to write checkpoint on restart

2015-11-21 Thread Sea
When I restart my streaming program, this bug appears and it kills my program. I am using Spark 1.4.1. 15/11/22 03:20:00 WARN CheckpointWriter: Error in attempt 1 of writing checkpoint to hdfs://streaming/user/dm/order_predict/streaming_v2/10/checkpoint/checkpoint-144813360 org.apa
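For reference, a minimal sketch of the restart-with-checkpoint pattern this setup relies on (Spark 1.4 Streaming API; the application name and batch interval below are assumptions, and the checkpoint directory is the HDFS path from the log above):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs://streaming/user/dm/order_predict/streaming_v2/10/checkpoint"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("order-predict-streaming")   // name assumed
      val ssc = new StreamingContext(conf, Seconds(60))                  // batch interval assumed
      ssc.checkpoint(checkpointDir)   // CheckpointWriter writes its files under this directory
      // ... build the DStream lineage here ...
      ssc
    }

    // On restart, recover the context from the checkpoint if one exists,
    // otherwise build a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

StreamingContext.getOrCreate is the standard recovery entry point; the checkpoint files themselves are written by the CheckpointWriter that logs the WARN above.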

Re: newbie : why are thousands of empty files being created on HDFS?

2015-11-21 Thread Sabarish Sasidharan
Those are empty partitions. I don't see the number of partitions specified in the code. That then implies the default parallelism config is being used and is set to a very high number, the sum of empty + non-empty files. Regards Sab On 21-Nov-2015 11:59 pm, "Andy Davidson" wrote: > I start working o
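A hedged illustration of that diagnosis (the config property is real, the RDD name, value and path are assumptions): the number of part-files written equals the RDD's partition count, which falls back to spark.default.parallelism when nothing is specified.

    // Check what the job actually falls back to:
    println(sc.defaultParallelism)

    // Option 1: lower the default when submitting the job
    //   spark-submit ... --conf spark.default.parallelism=8
    // Option 2: shrink the partition count just before writing:
    result.coalesce(8).saveAsTextFile("hdfs:///user/andy/output")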

JavaStreamingContext nullpointer exception while fetching data from Cassandra

2015-11-21 Thread ravi.gawai
I want to read file data and check whether each file line's data is present in Cassandra; if it's present it needs to be merged, otherwise it is a fresh insert to C*. The file data just contains name and address in JSON format; in Cassandra the student table has a UUID as the primary key and there is a secondary index on name. Once data is me
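A hedged sketch of one way to shape the merge with the DataStax spark-cassandra-connector; the keyspace/table "school.student", the column names, and the naive JSON parsing below are all stand-ins for the poster's actual schema:

    import java.util.UUID
    import com.datastax.spark.connector._   // spark-cassandra-connector, assumed on the classpath

    // Hypothetical schema: school.student(id uuid PRIMARY KEY, name text, address text)
    case class Student(id: UUID, name: String, address: String)

    // Naive stand-in for real JSON parsing of lines like {"name":"x","address":"y"}.
    def parseJsonLine(line: String): (String, String) = {
      val kv = line.replaceAll("[{}\"]", "").split(",").map(_.split(":").map(_.trim))
      (kv(0)(1), kv(1)(1))
    }

    val fromFile = sc.textFile("hdfs:///data/students.json").map(parseJsonLine)

    // Existing rows, keyed by name (reads the whole table; fine for modest sizes).
    val existing = sc.cassandraTable[Student]("school", "student").keyBy(_.name)

    // Keep the existing UUID when the name is already present, otherwise mint a new one.
    val merged = fromFile.leftOuterJoin(existing).map {
      case (name, (address, Some(row))) => Student(row.id, name, address)            // merge/update
      case (name, (address, None))      => Student(UUID.randomUUID(), name, address) // fresh insert
    }

    // Cassandra writes are upserts, so a single save covers both cases.
    merged.saveToCassandra("school", "student", SomeColumns("id", "name", "address"))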

Datastore for GraphX

2015-11-21 Thread Ilango Ravi
Hi, I am trying to figure out which datastore I can use for storing data to be used with GraphX. Is there a good graph database out there which I can use for efficient graph data storage/retrieval? Thanks, ravi

Closures sent once per executor or copied with each task?

2015-11-21 Thread emao
Hi, I would like to know how/where the serialized closures are shipped: are they sent once per executor or copied to each task? From my understanding they are copied with each task, but the online documentation contains misleading information. For example, on the http://spark.apache.org/docs
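The distinction is easiest to see next to a broadcast variable; a minimal sketch (the rdd and the lookup map are assumptions):

    // Some sizable driver-side structure.
    val lookup = Map("a" -> 1, "b" -> 2)

    // Captured in the closure: serialized and re-sent with every task.
    val viaClosure = rdd.map(x => lookup.getOrElse(x, 0))

    // Broadcast: shipped to each executor once and cached there; tasks only hold a handle.
    val lookupBc = sc.broadcast(lookup)
    val viaBroadcast = rdd.map(x => lookupBc.value.getOrElse(x, 0))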

newbie : why are thousands of empty files being created on HDFS?

2015-11-21 Thread Andy Davidson
I started working on a very simple ETL pipeline for a POC. It reads in a data set of tweets stored as JSON strings in HDFS, randomly selects 1% of the observations and writes them to HDFS. It seems to run very slowly, e.g. writing 4720 observations takes 1:06:46.577795. I also noticed that R
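A hedged sketch of the pipeline as described (paths, names and the target file count are assumptions): after sampling down to 1%, shrink the partition count before writing, otherwise the save produces one part-file per input partition and most of them end up empty.

    val tweets = sc.textFile("hdfs:///user/andy/tweets/*.json")   // one JSON string per line
    val onePercent = tweets.sample(withReplacement = false, fraction = 0.01, seed = 42L)

    // coalesce keeps the write down to a handful of files instead of thousands.
    onePercent.coalesce(4).saveAsTextFile("hdfs:///user/andy/tweets_sample")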

spark shuffle

2015-11-21 Thread Shushant Arora
Hi, I have a few doubts. 1. Does rdd.saveAsNewAPIHadoopFile(outputdir, keyclass, valueclass, outputformatclass) shuffle data, or will it always create the same number of files in the output dir as the number of partitions in the RDD? 2. How to use multiple outputs in saveAsNewAPIHadoopFile to have the file name generated fro
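On the first doubt, a hedged sketch: saveAsNewAPIHadoopFile itself does not shuffle; it writes one part-file per partition of the RDD, so the file count is controlled by repartitioning beforehand (the Hadoop classes below are standard, the RDD name and path are assumptions).

    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    // pairs: RDD[(Text, Text)] (assumed). repartition(16) is the only shuffle here;
    // the save then writes exactly 16 part-files.
    pairs.repartition(16).saveAsNewAPIHadoopFile(
      "hdfs:///output/dir",
      classOf[Text],
      classOf[Text],
      classOf[TextOutputFormat[Text, Text]])

Deriving the file name from the key (the second doubt) needs a custom OutputFormat; the new mapreduce API has no drop-in equivalent of the old MultipleTextOutputFormat.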

RDD partition after calling mapToPair

2015-11-21 Thread trung kien
Hi all, I am having a problem understanding how an RDD will be partitioned after calling the mapToPair function. Could anyone give me more information about partitioning in this function? I have a simple application doing the following job: JavaPairInputDStream messages = KafkaUtils.createDirectStream(...
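A minimal RDD-level sketch of what map/mapToPair do to partitioning (the same reasoning applies to the RDDs inside each DStream batch; the input and key extraction are assumptions):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    def sketch(lines: RDD[String]): Unit = {
      // map/mapToPair keep the parent's number of partitions, but the result has
      // no partitioner even when the output is key/value pairs.
      val pairs = lines.map(line => (line.split(",")(0), 1))
      assert(pairs.partitioner.isEmpty)

      // A shuffle operator (or an explicit partitionBy) is what establishes a partitioner.
      val counts = pairs.reduceByKey(new HashPartitioner(8), _ + _)
      assert(counts.partitioner == Some(new HashPartitioner(8)))
    }

For a Kafka direct stream, the initial number of partitions equals the number of Kafka topic partitions; mapToPair preserves that count but attaches no partitioner.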

Re: Spark Streaming - stream between 2 applications

2015-11-21 Thread Christian
Instead of sending the results of one Spark app directly to the other, you could write the results to a Kafka topic which is consumed by your other Spark application. On Fri, Nov 20, 2015 at 12:07 PM Saiph Kappa wrote: > I think my problem persists whether I use Kafka or sockets. Or am I
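A hedged sketch of that suggestion (the broker address, topic name and the results DStream are assumptions):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // results: DStream[String] produced by the first application (assumed).
    results.foreachRDD { rdd =>
      rdd.foreachPartition { iter =>
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)   // one producer per partition/task
        iter.foreach(msg => producer.send(new ProducerRecord[String, String]("app-a-results", msg)))
        producer.close()
      }
    }

The second application then consumes the "app-a-results" topic with KafkaUtils.createDirectStream as usual.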

How to adjust Spark shell table width

2015-11-21 Thread Fengdong Yu
Hi, I found that if the column value is too long, the Spark shell only shows a partial result. For example: sqlContext.sql("select url from tableA").show(10) cannot show the whole URL here. So how do I adjust it? Thanks
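If the cluster is on Spark 1.5 or later, show has an overload that disables the 20-character truncation; a small sketch:

    // Spark 1.5+: the second argument turns truncation off.
    sqlContext.sql("select url from tableA").show(10, false)

    // On older versions, bypass show() and print the rows yourself:
    sqlContext.sql("select url from tableA").take(10).foreach(println)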

Spark: merging objects with approximation

2015-11-21 Thread OcterA
Hello, I have a set of X records (around 30M entries), and I have to run a batch job to merge records which are similar; at the end I will have around X/2 records. At this moment I've done the basics: open the files, map to a usable Object, but I'm stuck at the merge part... The merge condition is composed from vario
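A hedged sketch of one common shape for this kind of job; Record, the canonical key and the merge function below are all stand-ins for the poster's real similarity criteria. The idea is to bucket records by a canonical "blocking" key so candidates land in the same group, then merge within each group.

    // Hypothetical record type.
    case class Record(id: Long, name: String, value: Double)

    // Hypothetical canonical key: records considered "similar" map to the same key.
    def canonicalKey(r: Record): (String, Long) = (r.name.trim.toLowerCase, math.round(r.value))

    // Hypothetical merge of two similar records.
    def merge(a: Record, b: Record): Record =
      Record(math.min(a.id, b.id), a.name, (a.value + b.value) / 2)

    // records: RDD[Record] built from the files already mapped to usable objects (assumed).
    val merged = records.keyBy(canonicalKey).reduceByKey(merge).values

This only merges records whose blocking keys collide exactly; if the similarity condition is fuzzier, something like locality-sensitive hashing is usually needed on top.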

Spark-SQL idiomatic way of adding a new partition or writing to Partitioned Persistent Table

2015-11-21 Thread Deenar Toraskar
Hi guys, is it possible to add a new partition to a persistent table using Spark SQL? The following call works and data gets written to the correct directories, but no partition metadata is added to the Hive metastore. In addition, I see nothing preventing any arbitrary schema being appended to
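One hedged workaround sketch (the table name, partition column and location are assumptions, and whether this DDL is accepted depends on the Spark/Hive versions in use): register the freshly written directory with the metastore explicitly through a HiveContext.

    // sqlContext must be a HiveContext for Hive DDL to reach the metastore.
    sqlContext.sql(
      """ALTER TABLE my_partitioned_table
        |ADD IF NOT EXISTS PARTITION (dt='2015-11-21')
        |LOCATION 'hdfs:///warehouse/my_partitioned_table/dt=2015-11-21'
      """.stripMargin)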