Re: Read Local File

2017-06-14 Thread Dirceu Semighini Filho
ectly alright. Typing the path explicitly resolved it. But > this is a corner case. > > Alternately - if the file size is small, you could do spark-submit with a > --files option which will ship the file to every executor and is available > for all executors. > > > > >
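
A minimal sketch of the --files approach for a small file (the file name, path, jar and class are hypothetical):

    spark-submit --master yarn --class com.example.MyApp \
      --files /local/path/lookup.txt my-app.jar

    // inside the job, every executor can resolve its shipped copy by name
    import org.apache.spark.SparkFiles
    val localPath = SparkFiles.get("lookup.txt")
    val lines = scala.io.Source.fromFile(localPath).getLines().toList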

Read Local File

2017-06-13 Thread Dirceu Semighini Filho
Hi all, I'm trying to read a file from the local filesystem. I have 4 workstations, 1 master and 3 slaves, running with Ambari and Yarn, with Spark version 2.1.1.2.6.1.0-129. The code that I'm trying to run is quite simple: spark.sqlContext.read.text("file:///pathToFile").count I've copied the file in al
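
A minimal sketch of the read being described, assuming the file has been copied to the same path on every node (the path is hypothetical):

    val df = spark.sqlContext.read.text("file:///data/shared/myfile.txt")
    println(df.count())

    // On YARN the executors open the file, not the driver, so a file:// path
    // only works if every worker sees the file at exactly that path; otherwise
    // put it on HDFS or ship it with spark-submit --files.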

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-12 Thread Dirceu Semighini Filho
ample").as[A] > df.union(df1) > > It runs ok. And for nullabillity I thought that issue has been fixed: > https://issues.apache.org/jira/browse/SPARK-18058 > I think you can check your spark version and schema of dataset again? Hope > this help. > > Best, > > On 2017年5月9

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
ctually be fixed, and the union's schema > should have the less restrictive of the DataFrames. > > On Mon, May 8, 2017 at 12:46 PM, Dirceu Semighini Filho < > dirceu.semigh...@gmail.com> wrote: > >> HI Burak, >> By nullability you mean that if I have the exac

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
ble column problem. For >> RDD, You may not see any error if you don't use the incompatible column. >> >> Dataset.union requires compatible schema. You can print ds.schema and >> ds1.schema and check if they are same. >> >> On Mon, May 8, 2017 at 11:07 AM,

Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Dirceu Semighini Filho
Hello, I have a very complex case class structure, with a lot of fields. When I try to union two datasets of this class, it fails with the following error: ds.union(ds1) Exception in thread "main" org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatibl
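
A minimal sketch of the schema comparison suggested later in the thread (case class and values are hypothetical; run in spark-shell):

    import spark.implicits._
    case class A(id: Option[Long] = None, name: Option[String] = None)
    val ds  = Seq(A(Some(1L), Some("a"))).toDS()
    val ds1 = Seq(A(Some(2L), Some("b"))).toDS()
    ds.printSchema()    // compare both outputs field by field,
    ds1.printSchema()   // including nullability, before calling union
    val both = ds.union(ds1)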

Cant convert Dataset to case class with Option fields

2017-04-07 Thread Dirceu Semighini Filho
Hi Devs, I've some case classes here, and their fields are all optional: case class A(b: Option[B] = None, c: Option[C] = None, ...) If I read some data in a Dataset and try to convert it to this case class using the as method, it doesn't give me any answer, it simply freezes. If I change the case cl
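
A minimal sketch of the conversion that appears to hang (the case classes and input path are hypothetical; run in spark-shell):

    import spark.implicits._
    case class B(x: Option[Int] = None)
    case class A(b: Option[B] = None, c: Option[String] = None)
    val df = spark.read.json("/tmp/a.json")   // hypothetical input
    val ds = df.as[A]                         // the step described as freezing
    ds.show()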

Re: specifing schema on dataframe

2017-02-04 Thread Dirceu Semighini Filho
Hi Sam Remove the " from the number that it will work Em 4 de fev de 2017 11:46 AM, "Sam Elamin" escreveu: > Hi All > > I would like to specify a schema when reading from a json but when trying > to map a number to a Double it fails, I tried FloatType and IntType with no > joy! > > > When inferr

Re: Time-Series Analysis with Spark

2017-01-11 Thread Dirceu Semighini Filho
Hello Rishabh, We have done some time-series forecasting using ARIMA in our project; it's built on top of Spark and it's open source: https://github.com/eleflow/uberdata Kind Regards, Dirceu 2017-01-11 8:20 GMT-02:00 Sean Owen : > https://github.com/sryza/spark-timeseries ? > > On Wed, Jan 11, 20

Re: How many Spark streaming applications can be run at a time on a Spark cluster?

2016-12-24 Thread Dirceu Semighini Filho
Hi, You can start multiple Spark apps per cluster. You will have one streaming context per app. On Dec 24, 2016 18:22, "shyla deshpande" wrote: > Hi All, > > Thank you for the response. > > As per > > https://docs.cloud.databricks.com/docs/latest/databricks_ > guide/index.html#07%20Spark%2

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Dirceu Semighini Filho
e anything. Out of curiosity, why did you suggest that? > Googling "spark coalesce prime" doesn't give me any clue :-) > Adrian > > > On 14/12/2016 13:58, Dirceu Semighini Filho wrote: > > Hi Adrian, > Which kind of partitioning are you using? > Have you alre

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Dirceu Semighini Filho
Hi Adrian, Which kind of partitioning are you using? Have you already tried to coalesce it to a prime number? 2016-12-14 11:56 GMT-02:00 Adrian Bridgett : > I realise that coalesce() isn't guaranteed to be balanced and adding a > repartition() does indeed fix this (at the cost of a large shuffle
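
A minimal sketch contrasting the two options (the RDD and partition count are hypothetical):

    val rdd = sc.parallelize(1 to 1000000)
    val narrowed = rdd.coalesce(101)      // no shuffle, may stay unbalanced
    val balanced = rdd.repartition(101)   // full shuffle, evens out partition sizes
    balanced.mapPartitions(it => Iterator(it.size)).collect().foreach(println)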

Re: Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-17 Thread Dirceu Semighini Filho
store the block addition event. I need to look into the code > again to see when these files are created new and when they are appended. > > > Thanks, Arijit > > > -- > *From:* Dirceu Semighini Filho > *Sent:* Thursday, November 17, 20

Re: Spark Streaming Data loss on failure to write BlockAdditionEvent failure to WAL

2016-11-17 Thread Dirceu Semighini Filho
Hi Arijit, Have you found a solution for this? I'm facing the same problem in Spark 1.6.1, but here the error happens only a few times, so our hdfs does support append. This is what I can see in the logs: 2016-11-17 13:43:20,012 ERROR [BatchedWriteAheadLog Writer] WriteAheadLogManager for Thread: F

Re: Writing parquet table using spark

2016-11-16 Thread Dirceu Semighini Filho
Hello, Have you configured this property? spark.sql.parquet.compression.codec 2016-11-16 6:40 GMT-02:00 Vaibhav Sinha : > Hi, > I am using hiveContext.sql() method to select data from source table and > insert into parquet tables. > The query executed from spark takes about 3x more disk space t
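
A minimal sketch of setting the codec before the insert (the codec choice and table names are hypothetical):

    hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")
    // or at submit time: --conf spark.sql.parquet.compression.codec=snappy
    hiveContext.sql("INSERT INTO TABLE parquet_table SELECT * FROM source_table")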

Can somebody remove this guy?

2016-09-23 Thread Dirceu Semighini Filho
Can somebody remove this guy from the list: tod...@yahoo-inc.com I just sent a message to the list and received an email from Yahoo saying that this address doesn't exist anymore. This is an automatically generated message. tod...@yahoo-inc.com is no longer with Yahoo! Inc. Your message will not be fo

Re: Reply: Reply: it does not stop at breakpoints which is in an anonymous function

2016-09-23 Thread Dirceu Semighini Filho
* 2 - 1 (breakpoint-1) > val y = random * 2 - 1 > if (x*x + y*y < 1) 1 else 0 > }.reduce(_ + _) > println("Pi is roughly " + 4.0 * count / (n - 1)) > spark.stop() > } > } > > > > > -- > *From:*

Re: Reply: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
a COUNT action in advance and then remove it after debugging. Is > that the right way? > > > ------ > *From:* Dirceu Semighini Filho > *Sent:* September 16, 2016 21:07 > *To:* chen yong > *Cc:* user@spark.apache.org > *Subject:* Re: Reply: Reply: Reply: Reply: t it does

Re: Reply: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
nks you very much > -- > *From:* Dirceu Semighini Filho > *Sent:* September 16, 2016 21:07 > *To:* chen yong > *Cc:* user@spark.apache.org > *Subject:* Re: Reply: Reply: Reply: Reply: t it does not stop at breakpoints which is in > an anonymous function > > Hello Felix

Re: Reply: Reply: Reply: Reply: t it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
> Later, I guess > > the line > > val test = count > > is the key point. Without it, it would not stop at breakpoint-1, right? > > > > -- > *From:* Dirceu Semighini Filho > *Sent:* September 16, 2016 0:39 > *To:* chen yong > *Cc:* user@

Re: Reply: Reply: Reply: t it does not stop at breakpoints which is in an anonymous function

2016-09-15 Thread Dirceu Semighini Filho
; if (x*x + y*y < 1) 1 else 0 > }.reduce(_ + _) > val test = x (breakpoint-2 set in this line) > > > > -- > *From:* Dirceu Semighini Filho > *Sent:* September 14, 2016 23:32 > *To:* chen yong > *Subject:* Re: Reply: Reply: t it does not stop a

Re: t it does not stop at breakpoints which is in an anonymous function

2016-09-14 Thread Dirceu Semighini Filho
Hello Felix, Spark functions run lazily, and that's why it doesn't stop at those breakpoints. They will be executed only when you call an action on your dataframe/rdd, like count, collect, ... Regards, Dirceu 2016-09-14 11:26 GMT-03:00 chen yong : > Hi all, > > > > I am newbie to spark. I a
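
A minimal sketch of why such a breakpoint is only hit once an action runs (the numbers are hypothetical):

    val mapped = sc.parallelize(1 to 1000).map { x =>
      val y = x * 2        // a breakpoint here is not hit yet: map is lazy
      y
    }
    val total = mapped.reduce(_ + _)   // the action; only now do the map closures run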

Re: Forecasting algorithms in spark ML

2016-09-08 Thread Dirceu Semighini Filho
Hi Madabhattula Rajesh Kumar, There is an open source project called sparkts (Time Series for Spark) that implements ARIMA and Holt-Winters algorithms on top of Spark, which can be used for forecasting. In some cases, Linear Regression, which is available i

Re: Debug spark jobs on Intellij

2016-05-31 Thread Dirceu Semighini Filho
Try this. Is this Python, right? I'm not used to it, I'm used to Scala. val toDebug = rdd.foreachPartition(partition -> { //breakpoint stop here *// by val toDebug I mean to assign the result of foreachPartition to a variable* partition.forEachRemaining(message -> { //breakpoint doen

Re: Debug spark jobs on Intellij

2016-05-31 Thread Dirceu Semighini Filho
Hi Marcelo, this is because the operations on an RDD are lazy; you will only stop at the breakpoint inside foreach when an action such as first, collect or reduce runs. That is when Spark actually executes the operations. Have you tried that? Cheers. 2016-05-31 17:18 GMT-03:00 Marcelo Oikawa : > Hell
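
A sketch of a setup where breakpoints inside the closure can be hit from IntelliJ, assuming the job runs with a local master so the closure executes in the same JVM as the debugger (names are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf().setMaster("local[*]").setAppName("debug-session")
    val sc = new SparkContext(conf)
    sc.parallelize(Seq("a", "b", "c")).foreachPartition { partition =>
      partition.foreach(message => println(message))   // breakpoint here is hit locally
    }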

Re: ClassNotFoundException in RDD.map

2016-03-23 Thread Dirceu Semighini Filho
typechecker > could catch, can slip through. > > On Thu, Mar 17, 2016 at 10:25 AM, Dirceu Semighini Filho > wrote: > > Hi Ted, thanks for answering. > > The map is just that, whenever I try inside the map it throws this > > ClassNotFoundException, even if I do map(f =&

Re: Serialization issue with Spark

2016-03-23 Thread Dirceu Semighini Filho
Hello Hafsa, A Task not serializable exception usually means that you are trying to use an object, defined in the driver, in code that runs on the workers. Can you post the code that is generating this error here, so we can better advise you? Cheers. 2016-03-23 14:14 GMT-03:00 Hafsa Asif : > Can anyone pl
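
A minimal sketch of the usual cause and one fix (class and value names are hypothetical):

    class Parser { def parse(s: String): Int = s.length }   // not Serializable

    val parser = new Parser()                  // lives in the driver
    val rdd = sc.parallelize(Seq("a", "bb"))
    // rdd.map(line => parser.parse(line))     // closure captures parser: Task not serializable

    // fix: build the helper on the worker, e.g. once per partition
    val lengths = rdd.mapPartitions { it =>
      val localParser = new Parser()
      it.map(localParser.parse)
    }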

ClassNotFoundException in RDD.map

2016-03-20 Thread Dirceu Semighini Filho
Hello, I found a strange behavior after executing a prediction with MLlib. My code returns an RDD[(Any, Double)] where Any is the id of my dataset, which is a BigDecimal, and Double is the prediction for that line. When I run myRdd.take(10) it returns ok: res16: Array[_ >: (Double, Double) <: (Any, Doubl

Re: ClassNotFoundException in RDD.map

2016-03-19 Thread Dirceu Semighini Filho
ode isn't wrong. Kind Regards, Dirceu 2016-03-17 12:50 GMT-03:00 Ted Yu : > bq. $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1 > > Do you mind showing more of your code involving the map() ? > > On Thu, Mar 17, 2016 at 8:32 AM, Dirceu Semighini Filho < > dirc

Re: SparkR Count vs Take performance

2016-03-02 Thread Dirceu Semighini Filho
an RDD of nonserializable > objects anyway, but they can exist as an intermediate stage. > > We could fix that pretty easily with a little copy and paste of the > take() code; right now isEmpty is simple but has this drawback. > > On Tue, Mar 1, 2016 at 7:18 PM, Dirceu Semighini Fil

Re: SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
's slower than > a count in all but pathological cases. > > > > On Tue, Mar 1, 2016 at 6:03 PM, Dirceu Semighini Filho > wrote: > > Hello all. > > I have a script that create a dataframe from this operation: > > > > mytable <- sql(sqlContext,("S

SparkR Count vs Take performance

2016-03-01 Thread Dirceu Semighini Filho
Hello all. I have a script that creates a dataframe from this operation: mytable <- sql(sqlContext,("SELECT ID_PRODUCT, ... FROM mytable")) rSparkDf <- createPartitionedDataFrame(sqlContext,myRdataframe) dFrame <- join(mytable,rSparkDf,mytable$ID_PRODUCT==rSparkDf$ID_PRODUCT) After filtering this

Re: Client session timed out, have not heard from server in

2015-12-22 Thread Dirceu Semighini Filho
Hi Yash, I've experienced this behavior here when the process freezes in a worker. This mainly happens, in my case, when the worker memory is full and the Java GC isn't able to free memory for the process. Try to search for OutOfMemory errors in your worker logs. Regards, Dirceu 2015-12-22 10:26 G

Re: How to set memory for SparkR with master="local[*]"

2015-10-23 Thread Dirceu Semighini Filho
Hi Matej, I'm also using this and I'm seeing the same behavior here: my driver has only 530mb, which is the default value. Maybe this is a bug. 2015-10-23 9:43 GMT-02:00 Matej Holec : > Hello! > > How to adjust the memory settings properly for SparkR with > master="local[*]" > in R? > > > *When r

Spark 1.5.1 ThriftServer

2015-10-15 Thread Dirceu Semighini Filho
Hello, I'm trying to migrate to Scala 2.11 and I didn't find a spark-thriftserver jar for Scala 2.11 in the Maven repository. I could do a manual build (without tests) of Spark with the thrift server in Scala 2.11. Some time ago the thrift server build wasn't enabled by default, but I can find a 2.10 jar for

Re:

2015-10-15 Thread Dirceu Semighini Filho
Hi Anfernee, A subject in the email sometimes helps ;) Have you checked whether the link is sending you to a hostname that is not accessible from your workstation? Sometimes changing the hostname to the IP solves this kind of issue. 2015-10-15 13:34 GMT-03:00 Anfernee Xu : > Sorry, I have to re-send it again

Re: Null Value in DecimalType column of DataFrame

2015-09-18 Thread Dirceu Semighini Filho
> need a decimal type that has precision - scale >= 2. > > On Tue, Sep 15, 2015 at 6:39 AM, Dirceu Semighini Filho < > dirceu.semigh...@gmail.com> wrote: > >> >> Hi Yin, posted here because I think it's a bug. >> So, it will return null and I can get