Parse RDD[Seq[String]] to DataFrame with types.

2019-07-17 Thread Guillermo Ortiz Fernández
I'm trying to parse an RDD[Seq[String]] to a DataFrame. Although it's a Seq of Strings, they could have a more specific type such as Int, Boolean, Double, String and so on. For example, a line could be: "hello", "1", "bye", "1.1" "hello1", "11", "bye1", "2.1" ... The first column is always going to be a String,
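A minimal sketch of one way to build such a DataFrame, assuming Spark 2.x, that every row has the same arity, and that each column's type is inferred from the first row only (the column names c0, c1, ... are made up):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types._

def toTypedDf(spark: SparkSession, rdd: RDD[Seq[String]]): DataFrame = {
  // Guess the most specific type a sample value can be parsed as.
  def inferType(v: String): DataType =
    if (scala.util.Try(v.toLong).isSuccess) LongType
    else if (scala.util.Try(v.toDouble).isSuccess) DoubleType
    else if (v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false")) BooleanType
    else StringType

  val types  = rdd.first().map(inferType)   // inferred from the first row only (assumption)
  val schema = StructType(types.zipWithIndex.map { case (t, i) => StructField(s"c$i", t) })
  val rows = rdd.map { seq =>
    Row.fromSeq(seq.zip(types).map {
      case (v, LongType)    => v.toLong
      case (v, DoubleType)  => v.toDouble
      case (v, BooleanType) => v.toBoolean
      case (v, _)           => v
    })
  }
  spark.createDataFrame(rows, schema)
}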

Re: Putting record in HBase with Spark - error getting regions.

2019-05-28 Thread Guillermo Ortiz Fernández
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 19/05/28 11:11:18 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 369 El mar., 28 may. 2019 a las 12:12, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) escri

Putting record in HBase with Spark - error getting regions.

2019-05-28 Thread Guillermo Ortiz Fernández
I'm executing a load process into HBase with Spark (around 150M records). At the end of the process there are a lot of failed tasks. I get this error: 19/05/28 11:02:31 ERROR client.AsyncProcess: Failed to get region location org.apache.hadoop.hbase.TableNotFoundException: my_table at org.

Re: Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-27 Thread Guillermo Ortiz
>>> from beginning for the same groupId. >>> >>> Akshay Bhardwaj >>> +91-97111-33849 >>> >>> >>> On Thu, Feb 21, 2019 at 9:08 PM Gabor Somogyi >>> wrote: >>> >>>> From the info you've provided not much t

Spark Streaming - Problem to manage offset Kafka and starts from the beginning.

2019-02-21 Thread Guillermo Ortiz
I'm working with Spark Streaming 2.0.2 and Kafka 1.0.0 using Direct Stream as the connector. I consume data from Kafka and autosave the offsets. I can see Spark doing commits of the last processed offsets in the logs. Sometimes I have restarted Spark and it starts from the beginning, when I'm using the

DAGScheduler in SparkStreaming

2018-09-14 Thread Guillermo Ortiz
A question: if you use Spark Streaming, is the DAG calculated for each microbatch? Is it possible to calculate it only the first time?

Trying to improve performance of the driver.

2018-09-13 Thread Guillermo Ortiz Fernández
I have a process in Spark Streaming which lasts 2 seconds. When I check where the time is spent, I see about 0.8s-1s of processing time although the global time is 2s. This one second is spent in the driver. I reviewed the code which is executed by the driver and I commented some of this code with th

Re: deploy-mode cluster. FileNotFoundException

2018-09-05 Thread Guillermo Ortiz Fernández
-2.0.2.jar:lib/kafka-clients-0.10.0.1.jar \ --files conf/${1}Conf.json iris-core-0.0.1-SNAPSHOT.jar conf/${1}Conf.json On Wed, Sep 5, 2018 at 11:11, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) wrote: > I want to execute my processes in cluster mode. As I don'

deploy-mode cluster. FileNotFoundException

2018-09-05 Thread Guillermo Ortiz Fernández
I want to execute my processes in cluster mode. As I don't know where the driver will be executed, I have to make all the files it needs available. I understand that there are two options: copy all the files to all nodes or copy them to HDFS. My doubt is, if I want to put all the files in HDFS, isn't

Local mode vs client mode with one executor

2018-08-30 Thread Guillermo Ortiz
I have many Spark processes; some of them are pretty simple and they barely have to process any messages, but they were developed from the same archetype and they use Spark. Some of them are executed with many executors, but for a few of them it doesn't make sense to process with more than 2-4 cores in only

Re: Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I can't... do you think that it's a possible bug of this version? From Spark or Kafka? On Wed, Aug 29, 2018 at 22:28, Cody Koeninger () wrote: > Are you able to try a recent version of spark? > > On Wed, Aug 29, 2018 at 2:10 AM, Guillermo Ortiz Fernández > wrot

java.lang.OutOfMemoryError: Java heap space - Spark driver.

2018-08-29 Thread Guillermo Ortiz Fernández
I got this error from spark driver, it seems that I should increase the memory in the driver although it's 5g (and 4 cores) right now. It seems weird to me because I'm not using Kryo or broadcast in this process but in the log there are references to Kryo and broadcast. How could I figure out the r

Spark Streaming - Kafka. java.lang.IllegalStateException: This consumer has already been closed.

2018-08-29 Thread Guillermo Ortiz Fernández
I'm using Spark Streaming 2.0.1 with Kafka 0.10, sometimes I get this exception and Spark dies. I couldn't see any error or problem among the machines, anybody has the reason about this error? java.lang.IllegalStateException: This consumer has already been closed. at org.apache.kafka.clients

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-24 Thread Guillermo Ortiz
Another test I just did is to execute with local[X], and this problem doesn't happen. Communication problems? 2018-08-23 22:43 GMT+02:00 Guillermo Ortiz : > it's a complex DAG before the point I cache the RDD, they are some joins, > filter and maps before caching data, but

Re: Caching small Rdd's take really long time and Spark seems frozen

2018-08-23 Thread Guillermo Ortiz
mall RDDs created? Could the blockage be in their compute > creation instead of their caching? > > Thanks, > Sonal > Nube Technologies <http://www.nubetech.co> > > <http://in.linkedin.com/in/sonalgoyal> > > > > On Thu, Aug 23, 2018 at 6:38 PM, Guillermo

Caching small Rdd's take really long time and Spark seems frozen

2018-08-23 Thread Guillermo Ortiz
I use Spark with caching via the persist method. I have several RDDs which I cache, but some of them are pretty small (about 300 KB). Most of the time it works well and the whole job usually lasts 1s, but sometimes it takes about 40s to store 300 KB in the cache. If I go to the SparkUI->Cache, I can see

Refresh broadcast variable when it isn't the value.

2018-08-19 Thread Guillermo Ortiz Fernández
Hello, I want to set data in a broadcast (Map) variable in Spark. Sometimes there is new data, so I have to update/refresh the values, but I'm not sure how I could do this. My idea is to use accumulators as a flag when a cache error occurs; at this point I could read the data and reload the broa
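One common workaround (a sketch, not an official refresh API): keep the Broadcast in a mutable reference on the driver and rebuild it when the reference data changes, for example at the top of foreachRDD. loadReferenceData below is a hypothetical function that reads the latest values:

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast

object RefreshableBroadcast {
  @volatile private var current: Broadcast[Map[String, String]] = _

  // Runs on the driver; rebuilds the broadcast only when asked to.
  def get(sc: SparkContext,
          loadReferenceData: () => Map[String, String],
          refresh: Boolean): Broadcast[Map[String, String]] = synchronized {
    if (current == null || refresh) {
      if (current != null) current.unpersist(blocking = false)  // drop the stale copy on executors
      current = sc.broadcast(loadReferenceData())
    }
    current
  }
}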

Reset the offsets, Kafka 0.10 and Spark

2018-06-07 Thread Guillermo Ortiz Fernández
I'm consuming data from Kafka with createDirectStream and store the offsets in Kafka ( https://spark.apache.org/docs/2.1.0/streaming-kafka-0-10-integration.html#kafka-itself ) val stream = KafkaUtils.createDirectStream[String, String]( streamingContext, PreferConsistent, Subscribe[String, S
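For reference, a minimal sketch of the store-in-Kafka pattern from that page, assuming the stream variable created above; offsets are committed only after the batch's output work has finished:

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd and write the results out ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)  // at-least-once semantics
}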

Measure performance time in some spark transformations.

2018-05-12 Thread Guillermo Ortiz Fernández
I want to measure how long some different transformations in Spark take, such as map, joinWithCassandraTable and so on. What is the best approximation to do it? def time[R](block: => R): R = { val t0 = System.nanoTime() val result = block val t1 = System.nanoTime() println("Elap
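A small sketch of such a helper. Note that Spark transformations are lazy, so wrapping a map or joinWithCassandraTable alone mostly measures DAG construction; wrap an action (count, collect, save...) to measure the actual work. The keyspace/table names in the usage comment are hypothetical:

def time[R](label: String)(block: => R): R = {
  val t0 = System.nanoTime()
  val result = block              // force the work here by passing an action
  val t1 = System.nanoTime()
  println(s"$label took ${(t1 - t0) / 1e6} ms")
  result
}

// e.g. time("join + count") { rdd.joinWithCassandraTable("ks", "table").count() }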

Testing spark streaming action

2018-04-10 Thread Guillermo Ortiz
I have a unit test in Spark Streaming which has an input parameter: DStream[String]. Inside the code I want to update a LongAccumulator. When I execute the test I get a NullPointerException because the accumulator doesn't exist. Is there any way to test this? My accumulator is updated in diff
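One pattern that usually helps here (a sketch of the lazy-singleton approach from the Spark Streaming docs, with hypothetical names): register the accumulator lazily against the SparkContext the test actually provides, instead of relying on one created elsewhere:

import org.apache.spark.SparkContext
import org.apache.spark.util.LongAccumulator

object RecordCounter {
  @volatile private var instance: LongAccumulator = _

  def getInstance(sc: SparkContext): LongAccumulator = {
    if (instance == null) synchronized {
      if (instance == null) instance = sc.longAccumulator("records")
    }
    instance
  }
}

// inside the action under test:
// dstream.foreachRDD { rdd => RecordCounter.getInstance(rdd.sparkContext).add(rdd.count()) }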

Testing spark-testing-base. Error multiple SparkContext

2018-04-03 Thread Guillermo Ortiz
I'm doing a Spark test with Spark Streaming, Cassandra and Kafka. I have an action which takes a DStream as input, saves to Cassandra and sometimes puts some elements in Kafka. I'm using https://github.com/holdenk/spark-testing-base with Kafka and Cassandra locally. My method looks like: *def

Testing with spark-base-test

2018-03-28 Thread Guillermo Ortiz
I'm using spark-unit-test and I can't get the code to compile. test("Testging") { val inputInsert = A("data2") val inputDelete = A("data1") val outputInsert = B(1) val outputDelete = C(1) val input = List(List(inputInsert), List(inputDelete)) val output = (List(List(outp

Connection SparkStreaming with SchemaRegistry

2018-03-09 Thread Guillermo Ortiz
I'm trying to integrate schemaRegistry and SparkStreaming. For the moment I want to use GenericRecords. It seems that my producer works and new schemas are published in the _schemas topic. When I try to read with my consumer, I'm not able to deserialize the data. How could I tell Spark that I'm

Spark Streaming reading many topics with Avro

2018-03-02 Thread Guillermo Ortiz
Hello, I want to read several topics with a single Spark Streaming process. I'm using Avro and the data for the different topics have different schemas. Ideally, if I only had one topic I could implement a deserializer, but I don't know if it's possible with many different schemas. val kafk

Re: Testing Spark-Cassandra

2018-01-17 Thread Guillermo Ortiz
link > to install it, and then follow this > <https://github.com/koeninger/spark-cassandra-example> project, but you > will have to adapt the necessary libraries to use spark 2.0.x version. > > Good luck, i would like to see any blog post using this combination. > >

Testing Spark-Cassandra

2018-01-17 Thread Guillermo Ortiz
Hello, I'm using Spark 2.0 and Cassandra. Is there any util to make unit tests easily, or what would be the best way to do it? A library? Cassandra with Docker?

Re: Flume and Spark Streaming

2017-01-16 Thread Guillermo Ortiz
Avro sink --> Spark Streaming 2017-01-16 13:55 GMT+01:00 ayan guha : > With Flume, what would be your sink? > > > > On Mon, Jan 16, 2017 at 10:44 PM, Guillermo Ortiz > wrote: > >> I'm wondering to use Flume (channel file)-Spark Streaming. >> >> I

Flume and Spark Streaming

2017-01-16 Thread Guillermo Ortiz
I'm considering using Flume (file channel) -> Spark Streaming. I have some doubts about it: 1. The RDD size is all the data that arrives in a microbatch which you have defined, right? 2. If there are 2 GB of data, how many RDDs are generated? Just one, and I have to make a repartition? 3. When is the ACK

Number of consumers in Kafka with Spark Streaming

2016-06-21 Thread Guillermo Ortiz
I use Spark Streaming with Kafka and I'd like to know how many consumers are generated. I guess as many as there are partitions in Kafka, but I'm not sure. Is there a way to know the name of the groupId generated by Spark for Kafka?

Re: java.lang.NoClassDefFoundError: kafka/api/TopicMetadataRequest

2016-05-09 Thread Guillermo Ortiz
t > > > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* > > > > http://talebzadehmich.wordpress.com > &

Re: java.lang.NoClassDefFoundError: kafka/api/TopicMetadataRequest

2016-05-09 Thread Guillermo Ortiz
in any of the > jars listed below. > > Cheers > > > On Mon, May 9, 2016 at 4:00 AM, Guillermo Ortiz > wrote: > >> *jar tvf kafka_2.10-0.8.2.1.jar | grep TopicMetadataRequest * >> 1757 Thu Feb 26 14:30:34 CET 2015 >> kafka/api/TopicMetadataRequest$$anon

Re: java.lang.NoClassDefFoundError: kafka/api/TopicMetadataRequest

2016-05-09 Thread Guillermo Ortiz
:36 CET 2015 kafka/javaapi/TopicMetadataRequest.class 2135 Thu Feb 26 14:30:38 CET 2015 kafka/server/KafkaApis$$anonfun$handleTopicMetadataRequest$1.class 2016-05-09 12:51 GMT+02:00 Guillermo Ortiz : > I'm trying to execute a job with Spark and Kafka and I'm getting this > error.

Re: Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-09 Thread Guillermo Ortiz
> On Fri, May 6, 2016 at 4:22 AM, Guillermo Ortiz > wrote: > > This is the complete error. > > > > 2016-05-06 11:18:05,424 [task-result-getter-0] INFO > > org.apache.spark.scheduler.TaskSetManager - Finished task 5.0 in stage > > 13.0 (TID 60) in 11692 ms on x

java.lang.NoClassDefFoundError: kafka/api/TopicMetadataRequest

2016-05-09 Thread Guillermo Ortiz
I'm trying to execute a job with Spark and Kafka and I'm getting this error. I know that it's because the versions are not right, but I have been checking the jar which I import in the SparkUI spark.yarn.secondary.jars and they are right and the class exists inside *kafka_2.10-0.8.2.1.jar. * 2016-

Re: Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-06 Thread Guillermo Ortiz
[JobGenerator] INFO org.apache.spark.streaming.scheduler.JobScheduler - Added jobs for time 146252629 ms 2016-05-06 11:18:10,015 [JobGenerator] INFO org.apache.spark.streaming.scheduler.JobGenerator - Checkpointing graph for time 146252629 ms 2016-05-06 11:11 GMT+02:00 Guillermo Ortiz : >

Re: Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-06 Thread Guillermo Ortiz
I think that it's a Kafka error, but I'm starting to think it could be something about Elasticsearch, since I have seen more people with the same error using Elasticsearch. I have no idea. 2016-05-06 11:05 GMT+02:00 Guillermo Ortiz : > I'm trying to read data from Spark and in

Error Kafka/Spark. Ran out of messages before reaching ending offset

2016-05-06 Thread Guillermo Ortiz
I'm trying to read data with Spark and index it to ES with its library (es-hadoop version 2.2.1). It was working fine for a while, but now this error has started to happen. I have deleted the checkpoint and even the Kafka topic and restarted all the machines with Kafka and ZooKeeper, but it didn't fix

Re: Configuring log4j Spark

2016-03-30 Thread Guillermo Ortiz
m-executors 5 --executor-cores 1 --driver-memory 1024m *--files /opt/myProject/conf/log4j.properties* /opt/myProject/myJar.jar I think I didn't do any others changes. 2016-03-30 15:42 GMT+02:00 Guillermo Ortiz : > I'm trying to configure log4j in Spark. > > spark-submit

Configuring log4j Spark

2016-03-30 Thread Guillermo Ortiz
I'm trying to configure log4j in Spark. spark-submit --conf spark.metrics.conf=metrics.properties --name "myProject" --master yarn-cluster --class myCompany.spark.MyClass *--files /opt/myProject/conf/log4j.properties* --jars $SPARK_CLASSPATH --executor-memory 1024m --num-executors 5 --executor-c

Checkpoints in Spark

2016-03-30 Thread Guillermo Ortiz
I'm curious about what kind of things are saved in the checkpoints. I just changed the number of executors when I execute Spark and it didn't take effect until I removed the checkpoint. I guess that if I'm using log4j.properties and I want to change it, I have to remove the checkpoint as well. When you ne

Problem with union of DirectStream

2016-03-10 Thread Guillermo Ortiz
I have a DirectStream and process data from Kafka, val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams1, topics1.toSet) directKafkaStream.foreachRDD { rdd => val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges When

Re: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-27 Thread Guillermo Ortiz
t; > val verticesWithNoOutEdges = graphWithNoOutEdges.vertices > > > > Mohammed > > Author: Big Data Analytics with Spark > <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/> > > > > *From:* Guillermo Ortiz [mailto:konstt2...@gmail.com] >

Re: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Guillermo Ortiz
Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > > On 26 Feb 2016, at 11:59, Guillermo Ortiz wrote: > > I'm new with graphX. I need to get the vertex without out edges.. > I g

Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Guillermo Ortiz
I'm new to GraphX. I need to get the vertices without out edges. I guess that it's pretty easy, but I did it in a pretty complicated and inefficient way. val vertices: RDD[(VertexId, (List[String], List[String]))] = sc.parallelize(Array((1L, (List("a"), List[String]())), (2L, (List("b"), List[Strin
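A shorter sketch (hypothetical tiny graph): graph.outDegrees only contains vertices with at least one outgoing edge, so the vertices missing from it are exactly the ones wanted:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

val sc = new SparkContext(new SparkConf().setAppName("sinks").setMaster("local[*]"))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "e1"), Edge(2L, 3L, "e2")))
val graph = Graph.fromEdges(edges, defaultValue = "v")

val noOutEdges = graph.vertices
  .leftOuterJoin(graph.outDegrees)                 // outDegrees omits vertices with degree 0
  .filter { case (_, (_, outDeg)) => outDeg.isEmpty }
  .keys                                            // here: vertex 3

noOutEdges.collect().foreach(println)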

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
smartphone > > > ---- Original message > From: Guillermo Ortiz > Date: 02/24/2016 5:26 PM (GMT-05:00) > To: user > Subject: How could I do this algorithm in Spark? > > I want to do some algorithm in Spark.. I know how to do it in a single > machine where all d

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
-b-e) > (b-c) -> (a-b-c, b-c-d) > (c-d) -> (b-c-d) > (b-e) -> (b-e-f) > (e-f) -> (b-e-f, e-f-c) > (f-c) -> (e-f-c) > > filter out keys with less than 2 values > > (b-c) -> (a-b-c, b-c-d) > (e-f) -> (b-e-f, e-f-c) > > mapvalues > > a-b-c

Re: Number partitions after a join

2016-02-25 Thread Guillermo Ortiz
> partition. > > > > Cheers, > > Ximo > > > > *De:* Guillermo Ortiz [mailto:konstt2...@gmail.com] > *Enviado el:* jueves, 25 de febrero de 2016 15:19 > *Para:* Takeshi Yamamuro > *CC:* user > *Asunto:* Re: Number partitions after a join > > > > th

Re: Number partitions after a join

2016-02-25 Thread Guillermo Ortiz
thank you, I didn't see that option. 2016-02-25 14:51 GMT+01:00 Takeshi Yamamuro : > Hi, > > The number depends on `spark.sql.shuffle.partitions`. > See: > http://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options > > On Thu, Feb 25,

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
he algorithm implementation. > > --- > Robin East > *Spark GraphX in Action* Michael Malak and Robin East > Manning Publications Co. > http://www.manning.com/books/spark-graphx-in-action > > > > > > On 25 Feb 2016, at

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
Oh, the letters were just an example; it could be: a , t b, o t, k k, c So.. a -> t -> k -> c and the result is: a,c; t,c; k,c and b,o I don't know if you were thinking about sortBy because of the other example where letters were consecutive. 2016-02-25 9:42 GMT+01:00 Guillermo Orti

Number partitions after a join

2016-02-25 Thread Guillermo Ortiz
When you do a join in Spark, how many partitions result? Is there a default number if you don't specify the number of partitions?
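A quick check in spark-shell (a sketch; the exact default also depends on whether spark.default.parallelism is explicitly set, and DataFrame joins use spark.sql.shuffle.partitions instead):

val a = sc.parallelize(Seq(1 -> "a", 2 -> "b"), 4)
val b = sc.parallelize(Seq(1 -> "x", 2 -> "y"), 6)

a.join(b).partitions.length      // 6 here: the larger parent's partition count
a.join(b, 10).partitions.length  // 10: explicitly requested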

Re: How could I do this algorithm in Spark?

2016-02-25 Thread Guillermo Ortiz
though. > > Thoughts? > > > On Wed, Feb 24, 2016 at 6:43 PM, Jakob Odersky wrote: > >> Hi Guillermo, >> assuming that the first "a,b" is a typo and you actually meant "a,d", >> this is a sorting problem. >> >> You could easily

How could I do this algorithm in Spark?

2016-02-24 Thread Guillermo Ortiz
I want to implement an algorithm in Spark. I know how to do it in a single machine where all the data is together, but I don't know a good way to do it in Spark. If someone has an idea.. I have some data like this: a , b x , y b , c y , y c , d I want something like: a , d b , d c , d x , y y , y I need

Re: Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I think that it's that bug, because the error is the same.. thanks a lot. 2016-01-21 16:46 GMT+01:00 Guillermo Ortiz : > I'm using 1.5.0 of Spark confirmed. Less this > jar file:/opt/centralLogs/lib/spark-catalyst_2.10-1.5.1.jar. > > I'm going to keep looking for,, Th

Re: Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
executor throws ClassNotFoundException on > driver > > FYI > > On Thu, Jan 21, 2016 at 7:10 AM, Guillermo Ortiz > wrote: > >> I'm using CDH 5.5.1 with Spark 1.5.x (I think that it's 1.5.2). >> >> I know that the library is here: >> cloud-user@ose10kafkae

Re: Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
; > Which Spark version are you using ? > > Cheers > > On Thu, Jan 21, 2016 at 6:50 AM, Guillermo Ortiz > wrote: > >> I'm runing a Spark Streaming process and it stops in a while. It makes >> some process an insert the result in ElasticSeach with its library. A

Spark job stops after a while.

2016-01-21 Thread Guillermo Ortiz
I'm running a Spark Streaming process and it stops after a while. It does some processing and inserts the result into Elasticsearch with its library. After a while the process fails. I have been checking the logs and I have seen this error 2016-01-21 14:57:54,388 [sparkDriver-akka.actor.default-dispatcher-17]

Number of executors in Spark - Kafka

2016-01-21 Thread Guillermo Ortiz
I'm using Spark Streaming and Kafka with the Direct Approach. I have created a topic with 6 partitions, so when I execute Spark there are six RDD partitions. I understand that ideally there should be six executors, each one processing one partition. To do it, when I execute spark-submit (I use YARN) I specify the number

Re: Trying to index document in Solr with Spark and solr-spark library

2015-12-16 Thread Guillermo Ortiz
org.apache.solr.common.cloud.ZkStateReader.createClusterStateWatchersAndUpdate(ZkStateReader.java:334) at org.apache.solr.client.solrj.impl.CloudSolrServer.connect(CloudSolrServer.java:243) 2015-12-16 16:26 GMT+01:00 Guillermo Ortiz : > I'm trying to index document to Solr from Spark with the library solr-spark > > I have create a proj

Trying to index document in Solr with Spark and solr-spark library

2015-12-16 Thread Guillermo Ortiz
I'm trying to index documents to Solr from Spark with the solr-spark library. I have created a project with Maven and included all the dependencies when I execute Spark, but I get a ClassNotFoundException. I have checked that the class is in one of the jars that I'm including (solr-solrj-4.10.3.jar). I co

How to config the log in Spark

2015-12-07 Thread Guillermo Ortiz
I can't manage to activate the logs for my classes. I'm using CDH 5.4 with Spark 1.3.0. I have a class in Scala with some log.debug calls. I created an object to log: package example.spark import org.apache.log4j.Logger object Holder extends Serializable { @transient lazy val log = Logger.getLogger(getClass
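For reference, a completed sketch of that holder plus a programmatic fallback for the level, in case the log4j.properties shipped with --files is not being picked up (package and logger names mirror the snippet above):

package example.spark

import org.apache.log4j.{Level, Logger}

object Holder extends Serializable {
  // @transient lazy: the Logger is recreated on each executor instead of being serialized.
  @transient lazy val log: Logger = Logger.getLogger(getClass.getName)
}

object LogSetup {
  // Programmatic fallback if the properties file is not applied on the executors.
  def init(): Unit = Logger.getLogger("example.spark").setLevel(Level.DEBUG)
}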

Re: Spark directStream with Kafka and process the lost messages.

2015-11-30 Thread Guillermo Ortiz
ming-guide.html#checkpointing > > On Mon, Nov 30, 2015 at 9:38 AM, Guillermo Ortiz > wrote: > >> Hello, >> >> I have Spark and Kafka with directStream. I'm trying that if Spark dies >> it could process all those messages when it starts. The offsets are stored >&

Spark directStream with Kafka and process the lost messages.

2015-11-30 Thread Guillermo Ortiz
Hello, I have Spark and Kafka with directStream. I'm trying to make Spark, if it dies, process all those messages when it starts again. The offsets are stored in checkpoints but I don't know how I could tell Spark to start at that point. I saw that there's another createDirectStream method with a fro
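A sketch of that variant (0.8 direct stream): you pass the starting offset of each partition plus a messageHandler; reading the stored offsets back (from ZooKeeper, a database, etc.) is left to your own code:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

def createFromOffsets(ssc: StreamingContext,
                      kafkaParams: Map[String, String],
                      fromOffsets: Map[TopicAndPartition, Long]) =
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets,
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))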

Re: Problems with JobScheduler

2015-07-31 Thread Guillermo Ortiz
rk! 2015-07-31 9:15 GMT+02:00 Guillermo Ortiz : > It doesn't make sense to me. Because in the another cluster process all > data in less than a second. > Anyway, I'm going to set that parameter. > > 2015-07-31 0:36 GMT+02:00 Tathagata Das : > >> Yes, and that

Re: Problems with JobScheduler

2015-07-31 Thread Guillermo Ortiz
le offset >> >> On Thu, Jul 30, 2015 at 10:46 AM, Guillermo Ortiz >> wrote: >> >>> The difference is that one recives more data than the others two. I can >>> pass thought parameters the topics, so, I could execute the code trying >>> with one topic a

Re: Error SparkStreaming after a while executing.

2015-07-31 Thread Guillermo Ortiz
you running it on YARN? Might be a known YARN credential expiring > issue. > > Hari, any insights? > > On Thu, Jul 30, 2015 at 4:04 AM, Guillermo Ortiz > wrote: > >> I'm executing a job with Spark Streaming and got this error all times >> when the job has bee

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
nt clusters and comparing > the results. > > On Thu, Jul 30, 2015 at 9:29 AM, Guillermo Ortiz > wrote: > >> I have three topics with one partition each topic. So each jobs run about >> one topics. >> >> 2015-07-30 16:20 GMT+02:00 Cody Koeninger : >> >&

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
shed: foreachRDD at > MetricsSpark.scala:67, took 60.391761 s > > 15/07/30 14:37:35 INFO DAGScheduler: Job 93 finished: foreachRDD at > MetricsSpark.scala:67, took 0.531323 s > > > Are those jobs running on the same topicpartition? > > > On Thu, Jul 30, 2015 at 8:03 AM, Guillermo Orti

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I read about the maxRatePerPartition parameter; I haven't set this parameter. Could it be the problem? Although this wouldn't explain why it doesn't work in one of the clusters. 2015-07-30 14:47 GMT+02:00 Guillermo Ortiz : > They just share the kafka, the rest of resources are ind

Re: Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
They just share the Kafka; the rest of the resources are independent. I tried to stop one cluster and execute just the cluster that isn't working, but the same thing happens. 2015-07-30 14:41 GMT+02:00 Guillermo Ortiz : > I have some problem with the JobScheduler. I have executed same code in >

Problems with JobScheduler

2015-07-30 Thread Guillermo Ortiz
I have a problem with the JobScheduler. I have executed the same code in two clusters. I read from three topics in Kafka with DirectStream, so I have three tasks. I have checked YARN and there are no more jobs launched. In the cluster where I have trouble I get these logs: 15/07/30 14:32:58 INFO TaskSet

Error SparkStreaming after a while executing.

2015-07-30 Thread Guillermo Ortiz
I'm executing a job with Spark Streaming and I get this error every time the job has been executing for a while (usually hours or days). I have no idea why it's happening. 15/07/30 13:02:14 ERROR LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTarge

Checkpoints in SparkStreaming

2015-07-28 Thread Guillermo Ortiz
I'm using Spark Streaming and I want to configure checkpointing to manage fault-tolerance. I've been reading the documentation. Is it necessary to create and configure the InputDStream in the getOrCreate function? I checked the example in https://github.com/apache/spark/blob/master/examples/src/main/
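The documented pattern is to set up everything, including the input DStream, inside the factory function, so the whole graph can be rebuilt when there is no checkpoint to recover from. A minimal sketch (the checkpoint path is hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///tmp/checkpoints/myApp"   // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpointed-app")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)
  // define the input DStream(s), transformations and output operations here
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()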

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Guillermo Ortiz
renamign that variable and trying it out > again? At least it will reduce one possible source of problem. > > TD > > On Sat, Jun 27, 2015 at 2:32 AM, Guillermo Ortiz > wrote: > >> I'm checking the logs in YARN and I found this error as well >> >> Applicati

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Guillermo Ortiz
utput: Requested user hdfs is not whitelisted and has id 496,which is below the minimum allowed 1000 Container exited with a non-zero exit code 255 Failing this attempt. Failing the application. 2015-06-27 11:25 GMT+02:00 Guillermo Ortiz : > Well SPARK_CLASSPATH it's just a random name, the compl

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Guillermo Ortiz
ures, and > in some cases also for some stateful operations. > > 2. Could you try not using the SPARK_CLASSPATH environment variable. > > TD > > On Sat, Jun 27, 2015 at 1:00 AM, Guillermo Ortiz > wrote: > >> I don't have any checkpoint on my code. Really, I don't

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Guillermo Ortiz
are you trying to execute the code again? From checkpoints, or > otherwise? > Also cc'ed Hari who may have a better idea of YARN related issues. > > On Sat, Jun 27, 2015 at 12:35 AM, Guillermo Ortiz > wrote: > >> Hi, >> >> I'm executing a SparkStreamig code with

Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Guillermo Ortiz
Hi, I'm executing a Spark Streaming job with Kafka. The code was working, but today I tried to execute it again and I got an exception; I don't know what's happening. Right now there are no job executions on YARN. How could I fix it? Exception in thread "main" org.apache.spark.SparkEx

Trying to connect to many topics with several DirectConnect

2015-05-22 Thread Guillermo Ortiz
Hi, I'm trying to connect to two Kafka topics with Spark using DirectStream but I get an error. I don't know if there's any limitation on doing it, because when I just access one topic everything is fine. *val ssc = new StreamingContext(sparkConf, Seconds(5))* *val kafkaParams =

Re: Working with slides. How do I know how many times a RDD has been processed?

2015-05-19 Thread Guillermo Ortiz
; } else { splitRegister(splitRegister.length) = "1" splitRegister.copyToArray(newArray) } (splitRegister(1), newArray) } If I check the length of splitRegister is always 2 in each slide, it is never three. 2015-05-18 15:36 GMT+02:00 Guillermo Ortiz

Working with slides. How do I know how many times a RDD has been processed?

2015-05-18 Thread Guillermo Ortiz
Hi, I have two streams, RDD1 and RDD2, and want to cogroup them. Data don't arrive at the same time and sometimes they could come with some delay. When I get all the data I want to insert it into MongoDB. For example, imagine that I get: RDD1 --> T 0 RDD2 --> T 0.5 I do a cogroup between them but I couldn't st

How to separate messages of different topics.

2015-05-05 Thread Guillermo Ortiz
I want to read from many topics in Kafka and know which topic each message is coming from (topic1, topic2 and so on). val kafkaParams = Map[String, String]("metadata.broker.list" -> "myKafka:9092") val topics = Set("EntryLog", "presOpManager") val directKafkaStream = KafkaUtils.createDirectStream[Str
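One way that works with the 0.8 direct stream (a sketch): the i-th partition of each batch's RDD corresponds to the i-th OffsetRange, so the topic can be attached per partition:

import org.apache.spark.streaming.kafka.HasOffsetRanges

val tagged = directKafkaStream.transform { rdd =>
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, iter) =>
    val topic = ranges(i).topic               // partition index lines up with the offset ranges
    iter.map { case (key, value) => (topic, key, value) }
  }
}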

Re: Spark + Kafka with directStream

2015-05-05 Thread Guillermo Ortiz
Sorry, I had a duplicated kafka dependency with another older version in another pom.xml 2015-05-05 14:46 GMT+02:00 Guillermo Ortiz : > I'm tryting to execute the "Hello World" example with Spark + Kafka ( > https://github.com/apache/spark/blob/master/examples/scala-2.

Spark + Kafka with directStream

2015-05-05 Thread Guillermo Ortiz
I'm trying to execute the "Hello World" example with Spark + Kafka ( https://github.com/apache/spark/blob/master/examples/scala-2.10/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala) with createDirectStream and I get this error. java.lang.NoSuchMethodError: kafka.mes

Re: SparkSQL, executing an "OR"

2015-03-03 Thread Guillermo Ortiz
thanks, it works. 2015-03-03 13:32 GMT+01:00 Cheng, Hao : > Using where('age >=10 && 'age <=4) instead. > > -----Original Message- > From: Guillermo Ortiz [mailto:konstt2...@gmail.com] > Sent: Tuesday, March 3, 2015 5:14 PM > To: user > Subject:

SparkSQL, executing an "OR"

2015-03-03 Thread Guillermo Ortiz
I'm trying to execute a query with Spark. (Example from the Spark Documentation) val teenagers = people.where('age >= 10).where('age <= 19).select('name) Is it possible to execute an OR with this syntax? val teenagers = people.where('age >= 10 'or 'age <= 4).where('age <= 19).select('name) I hav
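As a sketch (assuming the same people table and the implicits that make the 'age syntax compile), the OR is written with the || operator rather than a quoted 'or:

// logical operators in the Scala DSL are || and &&, not the words "or"/"and"
val kidsOrTeenagers = people.where('age >= 10 || 'age <= 4).where('age <= 19).select('name)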

Combiners in Spark

2015-03-02 Thread Guillermo Ortiz
What is the equivalent function to "Combiners" from MapReduce in Spark? I guess that it's combineByKey, but is combineByKey executed locally? I understand that functions such as reduceByKey or foldByKey aren't executed locally. Reading the documentation, it looks like combineByKey is equivalent to reduceByK
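A small sketch of combineByKey computing a per-key average; createCombiner and mergeValue run locally within each map-side partition (the combiner part), and mergeCombiners runs after the shuffle:

val pairs = sc.parallelize(Seq("a" -> 1, "a" -> 3, "b" -> 2))

val avgByKey = pairs.combineByKey(
  (v: Int) => (v, 1),                                             // createCombiner: local
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),          // mergeValue: local, per partition
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)    // mergeCombiners: after the shuffle
).mapValues { case (sum, n) => sum.toDouble / n }

avgByKey.collect().foreach(println)   // (a,2.0), (b,2.0)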

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
py in the driver. > > On Thu, Feb 26, 2015 at 10:47 AM, Guillermo Ortiz > wrote: >> Isn't it "contacts.map(v => (v._1, (usersMap(v._1), v._2))).collect()" >> executed in the executors? why is it executed in the d

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
usersMap is > a local object. This bit has nothing to do with Spark. > > Yes you would have to broadcast it to use it efficient in functions > (not on the driver). > > On Thu, Feb 26, 2015 at 10:24 AM, Guillermo Ortiz > wrote: >> So, on my example, when I execute: >>

Re: CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
' is > more efficient than a join, but you are right that this is mostly > because joins usually involve shuffles. If not, it's not as clear > which way is best. I suppose that if the Map is large-ish, it's safer > to not keep pulling it to the driver. > > On Thu, Feb 26

CollectAsMap, Broadcasting.

2015-02-26 Thread Guillermo Ortiz
I have a question. If I execute this code: val users = sc.textFile("/tmp/users.log").map(x => x.split(",")).map( v => (v(0), v(1))) val contacts = sc.textFile("/tmp/contacts.log").map(y => y.split(",")).map( v => (v(0), v(1))) val usersMap = contacts.collectAsMap() contacts.map(v => (v._1, (users
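A sketch of the broadcast variant (reusing the same hypothetical files): collectAsMap() materializes the map on the driver, and broadcasting it then ships one read-only copy per executor instead of serializing it into every task closure:

val users = sc.textFile("/tmp/users.log").map(_.split(",")).map(v => (v(0), v(1)))
val contacts = sc.textFile("/tmp/contacts.log").map(_.split(",")).map(v => (v(0), v(1)))

val usersMapB = sc.broadcast(users.collectAsMap())   // driver-side collect, then broadcast

val enriched = contacts.map { case (id, contact) =>
  (id, (usersMapB.value.get(id), contact))           // lookup happens on the executors
}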

Re: How Broadcast variable scale?.

2015-02-23 Thread Guillermo Ortiz
low O(log N) scaling. > > Have you tried it on your 300-machine cluster? I'm curious to know what > happened. > > -Mosharaf > > On Mon, Feb 23, 2015 at 8:06 AM, Guillermo Ortiz > wrote: >> >> I'm looking for about how scale broadcast variables in Spar

How Broadcast variable scale?.

2015-02-23 Thread Guillermo Ortiz
I'm looking for information about how broadcast variables scale in Spark and what algorithm is used. I have found http://www.cs.berkeley.edu/~agearh/cs267.sp10/files/mosharaf-spark-bc-report-spring10.pdf I don't know if they're talking about the current version (1.2.1) because the file was created in 2010. I to

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Guillermo Ortiz
If I launch more executors, GC gets worse. 2015-02-06 10:47 GMT+01:00 Guillermo Ortiz : > This is an execution with 80 executors > > Metric | Min | 25th percentile | Median | 75th percentile | Max > Duration | 31s | 44s | 50s | 1.1min | 2.6 min > GC Time | 70ms | 0.1s | 0.3s | 4s | 53 s > Input | 128.0MB | 128.0

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Guillermo Ortiz
Though that wouldn't explain the high > GC. What percent of task time does the web UI report that tasks are > spending in GC? > > On Fri, Feb 6, 2015 at 12:56 AM, Guillermo Ortiz > wrote: >> >> Yes, It's surpressing to me as well >> >> I tried

Re: Problems with GC and time to execute with different number of executors.

2015-02-06 Thread Guillermo Ortiz
you would be hitting a lot of GC for > this scenario. Are you setting --executor-cores and --executor-memory? > What are you setting them to? > > -Sandy > > On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz > wrote: >> >> Any idea why if I use more containers I get

Re: Problems with GC and time to execute with different number of executors.

2015-02-05 Thread Guillermo Ortiz
Any idea why, if I use more containers, I get a lot of stops because of GC? 2015-02-05 8:59 GMT+01:00 Guillermo Ortiz : > I'm not caching the data. with "each iteration I mean,, each 128mb > that a executor has to process. > > The code is pretty simple. > > final Conv

Re: Problems with GC and time to execute with different number of executors.

2015-02-05 Thread Guillermo Ortiz
de, I'd like to understand what's happening. 2015-02-04 18:57 GMT+01:00 Sandy Ryza : > Hi Guillermo, > > What exactly do you mean by "each iteration"? Are you caching data in > memory? > > -Sandy > > On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz > wr

Problems with GC and time to execute with different number of executors.

2015-02-04 Thread Guillermo Ortiz
I execute a job in Spark where I'm processing a file of 80 GB in HDFS. I have 5 slaves: (32 cores / 256 GB / 7 physical disks) x 5 I have been trying many different configurations with YARN. yarn.nodemanager.resource.memory-mb 196Gb yarn.nodemanager.resource.cpu-vcores 24 I have tried to execute the j

Define size partitions

2015-01-30 Thread Guillermo Ortiz
Hi, I want to process some files; they're kind of big, dozens of gigabytes each one. I get them as an array of bytes and there's a structure inside them. I have a header which describes the structure. It could be like: Number(8 bytes) Char(16 bytes) Number(4 bytes) Char(1 byte), .. This
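If, after the header, every record has a fixed byte length, one option is sc.binaryRecords (a sketch; it assumes the header has been stripped or handled separately, since binaryRecords splits the whole file blindly, and the path is hypothetical). The 29-byte length below just mirrors the example layout:

import java.nio.ByteBuffer

val recordLength = 8 + 16 + 4 + 1                       // Number, Char[16], Number, Char
val records = sc.binaryRecords("hdfs:///data/bigfile.bin", recordLength)  // hypothetical path

val parsed = records.map { bytes =>
  val buf = ByteBuffer.wrap(bytes)
  val n1 = buf.getLong                                   // Number (8 bytes)
  val chars = new Array[Byte](16); buf.get(chars)        // Char (16 bytes)
  val n2 = buf.getInt                                    // Number (4 bytes)
  val c = buf.get().toChar                               // Char (1 byte)
  (n1, new String(chars), n2, c)
}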
