Thanks
--
Cesar Flores
I created a Spark application in Eclipse by including the
spark-assembly-1.6.0-hadoop2.6.0.jar file in the build path.
However, this method does not let me browse the Spark source code. Is there an
easy way to include the Spark source code for reference in an application
developed in Eclipse?
Thanks !
--
Cesar
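A sketch for the source-browsing question above, assuming the project is (or can be) built with sbt and the Eclipse files are generated with the sbteclipse plugin; the module list and versions are illustrative. Depending on the published spark-core/spark-sql artifacts instead of the assembly jar lets sbt download the matching source jars, which Eclipse can then attach (they can also be attached by hand via the library's "Source attachment" setting):

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0" % "provided" withSources(),
  "org.apache.spark" %% "spark-sql"  % "1.6.0" % "provided" withSources()
)
// With sbteclipse, the following setting makes the generated .classpath
// files point at the downloaded source jars:
// EclipseKeys.withSource := true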
Please send it to me too!
Thanks!!!
Cesar Flores
On Tue, May 17, 2016 at 4:55 PM, Femi Anthony wrote:
> Please send it to me as well.
>
> Thanks
>
> Sent from my iPhone
>
> On May 17, 2016, at 12:09 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com>
functionality may be useful?
Thanks
--
Cesar Flores
>>> >>
>>> >> From: kpe...@gmail.com
>>> >> Date: Mon, 2 May 2016 12:11:18 -0700
>>> >> Subject: Re: Weird results with Spark SQL Outer joins
>>> >> To: gourav.sengu...@gmail.com
>>> >> CC: user@spark.apache.org
>>> >>
>>> >>
>>> >> Gourav,
>>> >>
>>> >> I wish that was the case, but I have done a select count on each of the
>>> >> two tables individually and they return different numbers of rows:
>>> >>
>>> >>
>>> >> dps.registerTempTable("dps_pin_promo_lt")
>>> >> swig.registerTempTable("swig_pin_promo_lt")
>>> >>
>>> >>
>>> >> dps.count()
>>> >> RESULT: 42632
>>> >>
>>> >>
>>> >> swig.count()
>>> >> RESULT: 42034
>>> >>
>>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>>> >> wrote:
>>> >>
>>> >> This shows that both tables have matching records and no mismatches.
>>> >> Therefore you obviously get the same results irrespective of whether
>>> >> you use a right or left join.
>>> >>
>>> >> I think that there is no problem here, unless I am missing something.
>>> >>
>>> >> Regards,
>>> >> Gourav
>>> >>
>>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 wrote:
>>> >>
>>> >> Also, the results of the inner query produced the same results:
>>> >> sqlContext.sql("SELECT s.date AS edate, s.account AS s_acc,
>>> >>   d.account AS d_acc, s.ad AS s_ad, d.ad AS d_ad, s.spend AS s_spend,
>>> >>   d.spend_in_dollar AS d_spend
>>> >>   FROM swig_pin_promo_lt s INNER JOIN dps_pin_promo_lt d
>>> >>   ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
>>> >>   WHERE s.date >= '2016-01-03' AND d.date >= '2016-01-03'").count()
>>> >> RESULT: 23747
>>> >>
>>> >>
>>> >>
--
Cesar Flores
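For the outer-join thread quoted above, a minimal sketch (Spark 1.6-era API; table and column names taken from the query in the thread) of one way to see where the two tables disagree: run the FULL OUTER JOIN and count the rows that are unmatched on each side.

import org.apache.spark.sql.SQLContext

def diagnoseOuterJoin(sqlContext: SQLContext): Unit = {
  val joined = sqlContext.sql(
    """SELECT s.date AS s_date, d.date AS d_date
      |FROM swig_pin_promo_lt s
      |FULL OUTER JOIN dps_pin_promo_lt d
      |  ON (s.date = d.date AND s.account = d.account AND s.ad = d.ad)
      |WHERE (s.date >= '2016-01-03' OR s.date IS NULL)
      |  AND (d.date >= '2016-01-03' OR d.date IS NULL)""".stripMargin)
  joined.cache()
  // rows from swig with no dps match, and rows from dps with no swig match
  val onlyInSwig = joined.filter("d_date IS NULL").count()
  val onlyInDps  = joined.filter("s_date IS NULL").count()
  val matched    = joined.filter("s_date IS NOT NULL AND d_date IS NOT NULL").count()
  println(s"matched=$matched onlyInSwig=$onlyInSwig onlyInDps=$onlyInDps")
}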
Thanks Ted:
That is the kind of answer I was looking for.
Best,
Cesar Flores
On Wed, Apr 6, 2016 at 3:01 PM, Ted Yu wrote:
> Have you looked at SparkListener ?
>
> /**
>  * Called when the driver registers a new executor.
>  */
> def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit
Hello:
I wonder if there is a way to query the number of running executors (not
the number of requested executors) inside a Spark job?
Thanks
--
Cesar Flores
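Following Ted's SparkListener pointer, a minimal sketch of a listener that keeps a running count of executors; class and variable names are illustrative, not from the thread.

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

class ExecutorCountListener extends SparkListener {
  val running = new AtomicInteger(0)
  override def onExecutorAdded(added: SparkListenerExecutorAdded): Unit =
    running.incrementAndGet()
  override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
    running.decrementAndGet()
}

// Register it early in the job (sc is the SparkContext), then query it from
// the driver at any point:
val listener = new ExecutorCountListener
sc.addSparkListener(listener)
println(s"running executors: ${listener.running.get()}")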
the config parameter spark.sql.shuffle.partitions, which I need to modify on
the fly for group-by clauses, depending on the size of my input.
Thanks
--
Cesar Flores
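For the spark.sql.shuffle.partitions question above, a minimal sketch (1.x SQLContext API) of changing the setting on the fly before a group-by; the sizing heuristic and the column name are only placeholders.

import org.apache.spark.sql.{DataFrame, SQLContext}

def groupWithTunedShuffle(sqlContext: SQLContext, df: DataFrame): DataFrame = {
  // crude heuristic: scale shuffle partitions with the number of input partitions
  val shufflePartitions = math.max(200, df.rdd.partitions.length * 2)
  sqlContext.setConf("spark.sql.shuffle.partitions", shufflePartitions.toString)
  df.groupBy("key").count()   // "key" stands in for the real grouping columns
}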
of time (i.e. less than 12 hours).
Best
--
Cesar Flores
I found my problem. I was calling setParameterValue(defaultValue) more than
once in my class hierarchy.
Thanks!
On Mon, Feb 15, 2016 at 6:34 PM, Cesar Flores wrote:
>
> I have a set of transformers (each with specific parameters) in spark
> 1.3.1. I have two versions,
.
Does anyone have any idea of what I may be doing wrong? My guess is that I am
doing something weird in my class hierarchy, but I cannot figure out what.
Thanks!
--
Cesar Flores
are better off not
>> running the orderBy clause.
>>
>> Maybe someone from the Spark SQL team could answer how the partitioning
>> of the output DF should be handled when doing an orderBy?
>>
>> Hemant
>> www.snappydata.io
>> https://github.com/Snappy
with a single
partition and around 14 million records
val newDF = hc.createDataFrame(rdd, df.schema)
This process is really slow. Is there any other way of achieving this task,
or of optimizing it (perhaps by tweaking a Spark configuration parameter)?
Thanks a lot
--
Cesar Flores
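If the slowness comes from all ~14 million rows sitting in a single partition, one option is to spread the rows out when rebuilding the DataFrame. A minimal sketch; the partition count is illustrative, and repartition() destroys any global ordering, so this only helps if downstream steps do not rely on the sorted order.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.types.StructType

def rebuildSpread(hc: HiveContext, rdd: RDD[Row], schema: StructType): DataFrame =
  hc.createDataFrame(rdd.repartition(200), schema)  // redistribute before rebuilding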
very useful for
performing joins later). Is that true?
And a second question: if I save *df* into a Hive table just after the query,
will Spark remember the partitioning when I reload this table from Hive?
I am using Spark version 1.3.1 at the moment.
Thanks
--
Cesar Flores
ing since it's
> mostly a black box.
>
> 1) could be fixed by adding caching. 2) is on our roadmap (though you'd
> have to use logical DataFrame expressions to do the partitioning instead of
> a class based partitioner).
>
> On Wed, Oct 14, 2015 at 8:45 AM, Cesar Flores wro
x._2)
val partitioned_df = hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)
Thanks a lot
--
Cesar Flores
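For reference, a minimal sketch of the round trip being discussed: key the rows, apply a class-based partitioner, drop the keys, and rebuild the DataFrame. As Michael notes, the resulting DataFrame does not remember this partitioner; it only changes the physical layout of the underlying RDD. The key index and partition count are illustrative.

import org.apache.spark.HashPartitioner
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.hive.HiveContext

def partitionByColumn(hc: HiveContext, df: DataFrame, keyIndex: Int): DataFrame = {
  val keyed = df.rdd.map(row => (row.get(keyIndex), row))          // (key, full row)
  val partitioned = keyed.partitionBy(new HashPartitioner(100)).map(_._2)
  hc.createDataFrame(partitioned, df.schema)
}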
to merge is random?
Thanks
--
Cesar Flores
linux path /home/my_user_name, which fails.
On Thu, Aug 6, 2015 at 3:12 PM, Cesar Flores wrote:
> Well, I tried this approach, and I still have issues. Apparently TestHive
> cannot delete the Hive metastore directory. The complete error I get is:
>
> 15/08/06 15:01:29 ERROR Dr
On Mon, Aug 3, 2015 at 5:56 PM, Michael Armbrust
wrote:
> TestHive takes care of creating a temporary directory for each invocation
> so that multiple test runs won't conflict.
>
> On Mon, Aug 3, 2015 at 3:09 PM, Cesar Flores wrote:
>
>>
>> We are using a local h
looks like:
libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0" % "test",
parallelExecution in Test := false,
fork := true,
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M",
"-XX:+CMSClassUnloadingEnabled")
We are working under Spark 1.3.0
Thanks
--
Cesar Flores
Thanks!!!
--
Cesar Flores
I also tried:
hc.createDataFrame(df.rdd.repartition(100), df.schema)
which appears to produce a random permutation. Can someone confirm that this
last line is in fact a random permutation, or point me to a better approach?
Thanks
--
Cesar Flores
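repartition() only redistributes rows across partitions and is not guaranteed to be a uniform random permutation of the row order. A minimal sketch of an explicitly random shuffle; functions.rand is available from Spark 1.4 on, and on 1.3 the same idea can be done by keying the RDD with scala.util.Random.nextDouble and sorting by that key.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.rand

// tag every row with a uniform random number and sort by it
def shuffleRows(df: DataFrame): DataFrame = df.orderBy(rand())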
cumsum column as the next one:
flag | price            | cumsum_price
-----|------------------|-----------------
   1 | 47.808764653746  | 47.808764653746
   1 | 47.808764653746  | 95.6175293075
   1 | 31.9869279512204 | 127.604457259
Thanks
--
Cesar Flores
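A minimal sketch of the running sum using window functions (available from Spark 1.4 with a HiveContext). It assumes some column that defines the row order, here called "ord", which is not part of the example above.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

def withCumsum(df: DataFrame): DataFrame = {
  // all rows from the start of the flag group up to and including the current row
  val w = Window.partitionBy("flag").orderBy("ord").rowsBetween(Long.MinValue, 0)
  df.withColumn("cumsum_price", sum("price").over(w))
}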
as the next one:
flag | price            | index
-----|------------------|------
   1 | 47.808764653746  | 0
   1 | 47.808764653746  | 1
   1 | 31.9869279512204 | 2
   1 | 47.7907893713564 | 3
   1 | 16.7599200038239 | 4
   1 | 16.7599200038239 | 5
   1 | 20.3916014172137 | 6
--
Cesar Flores
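A minimal sketch of adding such an index by zipping the underlying RDD with its index and rebuilding the DataFrame; the index follows whatever row order the DataFrame currently has.

import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def withIndex(sqlContext: SQLContext, df: DataFrame): DataFrame = {
  val indexed = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  val schema  = StructType(df.schema.fields :+ StructField("index", LongType, nullable = false))
  sqlContext.createDataFrame(indexed, schema)
}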
I have a table in a Hive database partitioned by date. I notice that when
I query this table using HiveContext, the created data frame has a specific
number of partitions.
Does this partitioning correspond to my original table partitioning in Hive?
Thanks
--
Cesar Flores
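A small sketch for inspecting this; table and column names are illustrative. In general the RDD partition count reflects the input splits of the files that are actually read, not the Hive partition columns, so the two numbers need not match, although partition pruning does decide which files get read at all.

import org.apache.spark.sql.hive.HiveContext

def inspectPartitions(hc: HiveContext): Unit = {
  val df = hc.sql("SELECT * FROM my_db.my_table WHERE date_col = '2015-06-01'")
  println(s"DataFrame/RDD partitions: ${df.rdd.partitions.length}")
  hc.sql("SHOW PARTITIONS my_db.my_table").show()   // Hive-level partitions
}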
column to something else on the
fly, and not after performing the aggregation?
thanks
--
Cesar Flores
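If the question is about naming an aggregated column while the aggregation is performed, a minimal sketch of giving the alias directly inside agg(); column names are illustrative.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sum

def totals(df: DataFrame): DataFrame =
  df.groupBy("flag").agg(sum("price").as("total_price"))   // alias set in the same call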
Never mind. I found the solution:
val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd,
hiveLoadedDataFrame.schema)
which amounts to converting the data frame to an RDD and back again to a data
frame. Not the prettiest solution, but at least it solves my problem.
Thanks,
Cesar Flores
On
at all my fields are missing.
Can someone tell me if I need to do some post-processing after loading from
Hive in order to avoid this kind of error?
Thanks
--
Cesar Flores
a lot
--
Cesar Flores
transformer classes for feature extraction, and if I need to save the
input and maybe the output SchemaRDD of the transform function in every
transformer, this may not be very efficient.
Thanks
On Tue, Mar 10, 2015 at 8:20 PM, Tobias Pfeiffer wrote:
> Hi,
>
> On Tue, Mar 10, 2015 at 2:13 PM, Ces
different syntax? Are they interchangeable? Which one has
better performance?
Thanks a lot
--
Cesar Flores
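If the two syntaxes in question are the SQL string interface and the DataFrame operations (the message above is truncated, so this is an assumption), a sketch of the same query in both forms; table and column names are illustrative. Both are parsed into the same logical plan and optimized by Catalyst, so they normally perform the same.

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.sum

def bothForms(sqlContext: SQLContext): Unit = {
  val viaSql = sqlContext.sql(
    "SELECT account, SUM(spend) AS total FROM my_table GROUP BY account")
  val viaDsl = sqlContext.table("my_table")
    .groupBy("account")
    .agg(sum("spend").as("total"))
  viaSql.explain()   // compare the physical plans of the two forms
  viaDsl.explain()
}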
be able to handle user-defined classes too? Will user classes need to extend
something, or will they need to define the same approach?
--
Cesar Flores
uired fields, but would like to hear the opinion
of an expert about it.
Thanks
On Thu, Feb 19, 2015 at 12:01 PM, Cesar Flores wrote:
>
> I am trying to pass a variable number of arguments to the select function
> of a SchemaRDD I created, as I want to select the fields in run time:
ct
function? If not, what would be a better approach for selecting the required
fields at run time?
Thanks in advance for your help
--
Cesar Flores
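A minimal sketch of selecting a run-time list of fields by expanding a Seq into the varargs select(...) call (DataFrame API shown; the same varargs expansion idea applies to the older SchemaRDD select).

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

def selectFields(df: DataFrame, fields: Seq[String]): DataFrame = {
  val columns: Seq[Column] = fields.map(col)   // build Column objects at run time
  df.select(columns: _*)
}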
private to the ml package:
private[ml] def transformSchema(schema: StructType, paramMap: ParamMap):
StructType
Can users create their own transformers? If not, will this functionality be
added in the future?
Thanks
--
Cesar Flores
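For reference, transformSchema later became public and user-defined transformers are supported in current releases. A minimal sketch (Spark 2.x-style ML API; the class name is illustrative) of a custom transformer that maps a string column to its length:

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, IntegerType}

class StringLengthTransformer(override val uid: String)
    extends UnaryTransformer[String, Int, StringLengthTransformer] {

  def this() = this(Identifiable.randomUID("strLen"))

  // the per-row function applied to the input column
  override protected def createTransformFunc: String => Int = _.length

  // type of the generated output column
  override protected def outputDataType: DataType = IntegerType
}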