1. Can you try count()? take() often does not force the entire computation.
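For illustration, a minimal sketch of the difference (assuming sc is an existing SparkContext):

  val rdd = sc.parallelize(1 to 1000000, 100)
  // take(10) can be satisfied from the first partition(s), so most of the lineage may never run
  rdd.take(10)
  // count() visits every partition, forcing the whole computation
  rdd.count()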
2. Can you give the full log? From the log it seems that the blocks are
added to two nodes, but the tasks seem to be launched on different nodes. I
don't see any message about removing the blocks, so I need the whole log to
debug this.
outdata.foreachRDD( rdd => rdd.foreachPartition(rec => {
  val writer = new KafkaOutputService(otherConf("kafkaProducerTopic").toString, propsMap)
  writer.output(rec)
}) )
So this is creating a new kafka producer for every n
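One way to avoid paying that construction cost on every partition of every batch is to keep a single writer per executor JVM, for example behind a lazily initialized singleton. A minimal sketch, assuming KafkaOutputService can safely be reused across batches and that propsMap is a java.util.Properties (KafkaWriterHolder is a hypothetical helper, not part of the original code):

  // Builds the writer at most once per executor JVM, on first use.
  object KafkaWriterHolder {
    private var writer: KafkaOutputService = null
    def get(topic: String, props: java.util.Properties): KafkaOutputService = synchronized {
      if (writer == null) writer = new KafkaOutputService(topic, props)
      writer
    }
  }

  outdata.foreachRDD { rdd =>
    rdd.foreachPartition { rec =>
      // reuse the per-JVM writer instead of constructing one per partition per batch
      val writer = KafkaWriterHolder.get(otherConf("kafkaProducerTopic").toString, propsMap)
      writer.output(rec)
    }
  }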
Michael,
I haven't been paying close attention to the JIRA tickets for
PrunedFilteredScan but I noticed some weird behavior around the filters
being applied when OR expressions were used in the WHERE clause. From what
I was seeing, it looks like it could be possible that the "start" and "end"
rang
Ok. I just verified that this is the case with a little test:
WHERE (a = 'v1' and b = 'v2'): PrunedFilteredScan passes down 2 filters
WHERE (a = 'v1' and b = 'v2') or (a = 'v3'): PrunedFilteredScan passes down 0 filters
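One way to observe this directly is a dummy relation that prints whatever filters the planner hands it. A rough sketch against the data sources API (the class and column names are made up for illustration; the type imports follow the 1.3+ package layout, in 1.2 StructType and friends are imported from org.apache.spark.sql instead):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
  import org.apache.spark.sql.types.{StringType, StructField, StructType}

  class FilterLoggingRelation(val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
    override def schema: StructType =
      StructType(Seq(StructField("a", StringType, nullable = true), StructField("b", StringType, nullable = true)))

    override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
      // With WHERE (a = 'v1' and b = 'v2') this prints two filters;
      // adding "or (a = 'v3')" made it print none in the test above.
      println("columns=" + requiredColumns.mkString(",") + " filters=" + filters.mkString(","))
      sqlContext.sparkContext.emptyRDD[Row]
    }
  }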
On Fri, Feb 13, 2015 at 12:28 AM, Corey Nolet wrote:
> Michael,
>
> I haven
Thank you. That worked.
2015-02-12 20:03 GMT+04:00 Imran Rashid :
> You need to import the implicit conversions to PairRDDFunctions with
>
> import org.apache.spark.SparkContext._
>
> (note that this requirement will go away in 1.3:
> https://issues.apache.org/jira/browse/SPARK-4397)
>
> On Thu,
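For reference, a minimal sketch of what that import enables (assuming sc is an existing SparkContext; pair operations such as reduceByKey come from the implicit conversion to PairRDDFunctions):

  import org.apache.spark.SparkContext._   // needed before Spark 1.3 for the implicit conversion

  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  // reduceByKey is only available once the conversion to PairRDDFunctions is in scope
  val summed = pairs.reduceByKey(_ + _)
  summed.collect().foreach(println)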
Your earlier call stack clearly states that it fails because the Derby
metastore has already been started by another instance, so I think that is
explained by your attempt to run this concurrently.
Are you running Spark standalone? Do you have a cluster? You should be able to
run spark in yarn-
Hi Tim,
I think this code will still introduce a shuffle even when you call
repartition on each input stream. Actually, this style of implementation
will generate more jobs (one job per input stream) than unioning them into
one stream with DStream.union(), and union normally has no special
overhead as
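As an illustration of the union-then-process approach (a rough sketch; ssc, zkQuorum, groupId, topic and the stream count are placeholders, and the receiver-based KafkaUtils.createStream API is assumed):

  import org.apache.spark.streaming.kafka.KafkaUtils

  // Several receiver-based input streams, unioned into a single DStream so that
  // downstream processing runs as one job per batch instead of one job per stream.
  val numStreams = 5
  val streams = (1 to numStreams).map { _ =>
    KafkaUtils.createStream(ssc, zkQuorum, groupId, Map(topic -> 1))
  }
  val unified = ssc.union(streams).repartition(ssc.sparkContext.defaultParallelism)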
Thanks for the reply. I solved the problem by importing
org.apache.spark.SparkContext._ as Imran Rashid suggested.
For the sake of interest, is JavaPairRDD intended for use from Java? What
is the purpose of this class? Is my RDD implicitly converted to it in
some circumstances?
2015-02-12 19:42 GM
Thanks Kevin for the link. I have had issues trying to install Zeppelin, as
I believe it is not yet supported for CDH 5.3 and Spark 1.2. Please
correct me if I am mistaken.
On Thu, Feb 12, 2015 at 7:33 PM, Kevin (Sangwoo) Kim
wrote:
> Apache Zeppelin also has a scheduler and then you can reload
>
> I haven't been paying close attention to the JIRA tickets for
> PrunedFilteredScan but I noticed some weird behavior around the filters
> being applied when OR expressions were used in the WHERE clause. From what
> I was seeing, it looks like it could be possible that the "start" and "end"
> ra
Just to add to what Arush said, you can go through these links:
http://stackoverflow.com/questions/1162375/apache-port-proxy
http://serverfault.com/questions/153229/password-protect-and-serve-apache-site-by-port-number
Thanks
Best Regards
On Thu, Feb 12, 2015 at 10:43 PM, Arush Kharbanda <
ar..
Increasing your driver memory might help.
Thanks
Best Regards
On Fri, Feb 13, 2015 at 12:09 AM, Manas Kar
wrote:
> Hi,
> I have a Hidden Markov Model running with 200MB data.
> Once the program finishes (i.e. all stages/jobs are done) the program
> hangs for 20 minutes or so before killing ma
Hello Everyone,
I am writing simple word counts to HDFS using
messages.saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class,
String.class, (Class) TextOutputFormat.class);
1) However, every 2 seconds I am getting a new *directory* that is titled as a
csv. So I'll have test.csv, which will be
Hi folks,
My Spark cluster has 8 machines, each of which has 377GB of physical memory,
so the total maximum memory that can be used for Spark is more than
2400GB. In my program, I have to deal with 1 billion (key, value) pairs,
where the key is an integer and the value is an integer array with 43
I am new to Scala. I have a dataset with many columns, and each column has a
name. Given several column names (these column names are not fixed;
they are generated dynamically), I need to sum up the values of these
columns. Is there an efficient way of doing this?
I worked out a way by using f
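One generic way to do this (a rough sketch, not necessarily the approach the poster found): resolve the dynamic column names to positions once, then sum those positions per record. The layout below, an RDD of Array[Double] plus a separate list of column names, is only an assumption for illustration:

  val columnNames = Seq("c1", "c2", "c3", "c4")               // full schema (example values)
  val wanted = Seq("c2", "c4")                                // dynamically supplied column names
  val wantedIdx = wanted.map(name => columnNames.indexOf(name))  // name -> position, resolved once

  val data = sc.parallelize(Seq(
    Array(1.0, 2.0, 3.0, 4.0),
    Array(5.0, 6.0, 7.0, 8.0)))

  // per-record sum over just the requested columns
  val rowSums = data.map(row => wantedIdx.map(i => row(i)).sum)
  rowSums.collect().foreach(println)                          // 6.0, 14.0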
I replaced the writeToKafka statements with an rdd.count() and sure enough,
I have a stable app with total delay well within the batch window (20
seconds). Here are the total delay lines from the driver log:
15/02/13 06:14:26 INFO JobScheduler: Total delay: 6.521 s for time
142380806 ms (execution
Hi Tim,
I think maybe you can try this approach:
create a Receiver per executor and set the thread count for each topic to
larger than 1. The total number of consumer threads will be: total consumers =
(receiver number) * (thread number). Make sure this total is
less than or equal to the Kafka partitio
Hi Saisai,
If I understand correctly, you are suggesting to control parallelism by
having the number of consumers/executors at least 1:1 with the number of Kafka
partitions. For example, if I have 50 partitions for a Kafka topic then
either have:
- 25 or more executors, 25 receivers, each receiver set to
Yes, you can try it. For example, if you have a cluster of 10 executors and 60
Kafka partitions, you can try to choose 10 receivers * 2 consumer threads,
so each thread will ideally consume 3 partitions; if you increase the
threads to 6, each thread will ideally consume 1 partition. What I think
imp
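Spelling out that arithmetic (values taken from the example above, purely illustrative):

  val kafkaPartitions = 60
  val receivers = 10
  val threadsPerReceiver = 2
  val totalConsumers = receivers * threadsPerReceiver          // 20 consumer threads
  val partitionsPerThread = kafkaPartitions / totalConsumers   // 3 partitions per thread
  // with threadsPerReceiver = 6: totalConsumers = 60, i.e. 1 partition per thread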
For a streaming application, every batch will create a new directory
and put the data in it. If you don't want to have multiple part- files inside
the directory, then you can do a repartition before the saveAs*
call.
messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/", "csv", String.class, String.class, (Class) TextOutputFormat.class);