Is there a simpler way to check if a data frame is cached other than:
dataframe.registerTempTable("cachedOutput")
assert(hc.isCached("cachedOutput"), "The table was not cached")
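For what it may be worth, a hedged sketch of an alternative that avoids registering a temp table (this assumes Spark 2.1+, where a DataFrame exposes its storage level; in earlier versions the isCached-by-table-name check above seems to be the main option):

import org.apache.spark.storage.StorageLevel
// Hypothetical direct check: a cached DataFrame reports a non-NONE storage level (Spark 2.1+ assumption)
val isDataFrameCached = dataframe.storageLevel != StorageLevel.NONE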
Thanks!
--
Cesar Flores
SELECT * FROM tableAlias
"
)
Will the partition information ("id") be stored in whse.someTable so that, when querying that table in a second Spark job, the information can be used to optimize joins, for example?
If this approach does not work, can you suggest one that does?
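For reference, a minimal sketch of how the partition column can be persisted explicitly when writing, assuming the Spark 2.x SparkSession/DataFrameWriter API rather than whatever the snippet above uses; the table name whse.someTable and the "id" column are taken from the question:

// Write the table partitioned by "id" so the partitioning is recorded in the metastore
df.write
  .mode("overwrite")
  .partitionBy("id")
  .saveAsTable("whse.someTable")

// A second job can then read the table back through the catalog
val reloaded = spark.table("whse.someTable")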
Thanks
--
Cesar Flores
something similar to R's output (which clearly indicates which weight corresponds to each feature name, including categorical ones).
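One hedged sketch of pairing weights with feature names, assuming a spark.ml pipeline where a VectorAssembler produced a "features" column (its metadata carries the per-feature attribute names, including expanded categorical levels) and a fitted linear model; the variable names trainingDF and model are hypothetical:

import org.apache.spark.ml.attribute.AttributeGroup

// Feature names recorded in the metadata of the assembled "features" column
val attrs = AttributeGroup.fromStructField(trainingDF.schema("features")).attributes.get
val featureNames = attrs.map(_.name.getOrElse("unnamed"))

// Pair each name with the corresponding coefficient (older mllib models expose model.weights instead)
val weights = model.coefficients.toArray
featureNames.zip(weights).foreach { case (name, w) => println(f"$name%-30s $w%.6f") }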
Thanks a lot !
--
Cesar Flores
Is there a way to release all persisted RDDs/DataFrames in Spark without stopping the SparkContext?
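A hedged sketch of how this can be done without stopping the context, assuming access to the SparkContext (sc) and a SQLContext/HiveContext (sqlContext):

// Unpersist every RDD the context still tracks as persistent
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))

// Drop all cached tables/DataFrames known to the SQL layer
sqlContext.clearCache()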
Thanks a lot
--
Cesar Flores
true.
*
* @group setParam
*/
Specifically, I am having trouble understanding why the solution should converge to the same weight values with and without standardization.
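One hedged way to see the usual argument (a sketch, not specific to Spark's implementation): with no regularization, standardizing x_j -> x_j / sigma_j only rescales the problem. If w~ minimizes the loss on the standardized features, then mapping back with w_j = w~_j / sigma_j gives w . x = w~ . x~ for every row, so the loss is identical and the mapped-back weights coincide with those from fitting on the raw features. With an L1/L2 penalty the regularizer is not invariant under this rescaling, which is why regularized solutions can differ between the two settings.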
Thanks !
--
Cesar Flores
lumns variable?
Thanks
--
Cesar Flores
am looking more for a hack kind of solution.
Thanks a lot !
--
Cesar Flores
I can try ?
Thanks a lot !
--
Cesar Flores
Is there a way to unpersist all DataFrames, Datasets, and/or RDDs in Spark 2.2 in a single call?
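A hedged sketch, assuming a Spark 2.2 SparkSession named spark:

// Clear everything cached through the SQL/Dataset layer
spark.catalog.clearCache()

// Also unpersist RDDs that were persisted directly, if any
spark.sparkContext.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))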
Thanks
--
Cesar Flores
mes?*
thanks
--
Cesar Flores
Thanks for your answers.
The suggested method works when the number of DataFrames is small.
However, I am trying to union more than 30 DataFrames, and building the plan takes longer than the execution itself, which should not be the case.
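One hedged workaround sketch, assuming a Seq[DataFrame] named dfs with identical schemas and a SparkSession named spark: union at the RDD level so the analyzer does not have to fold thirty-plus plans, or truncate the lineage with checkpointing.

// Option A: union the underlying RDDs and rebuild a single DataFrame
val unionedRDD = spark.sparkContext.union(dfs.map(_.rdd))
val unionedDF = spark.createDataFrame(unionedRDD, dfs.head.schema)

// Option B: keep DataFrame unions but checkpoint every few groups to cut the plan
// (Dataset.checkpoint is Spark 2.1+, and spark.sparkContext.setCheckpointDir must be set first)
val result = dfs.grouped(10).map(_.reduce(_ union _).checkpoint()).reduce(_ union _)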
Thanks!
--
Cesar
On Thu, Apr 5, 2018 at 1:29 PM, A
column to something else on the
fly, and not after performing the aggregation?
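A hedged sketch of aliasing the aggregate inside the agg call itself (column names are hypothetical):

import org.apache.spark.sql.functions.sum

// Alias the aggregate expression directly instead of calling withColumnRenamed afterwards
val out = df.groupBy("flag").agg(sum("price").as("total_price"))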
thanks
--
Cesar Flores
I have a table in a Hive database partitioned by date. I notice that when I query this table using HiveContext, the created data frame has a specific number of partitions.
Does this partitioning correspond to my original table partitioning in Hive?
Thanks
--
Cesar Flores
as the next one:
flag | price            | index
-----|------------------|------
1    | 47.808764653746  | 0
1    | 47.808764653746  | 1
1    | 31.9869279512204 | 2
1    | 47.7907893713564 | 3
1    | 16.7599200038239 | 4
1    | 16.7599200038239 | 5
1    | 20.3916014172137 | 6
--
Cesar Flores
cumsum column as the next one:
flag | price            | cumsum_price
-----|------------------|----------------
1    | 47.808764653746  | 47.808764653746
1    | 47.808764653746  | 95.6175293075
1    | 31.9869279512204 | 127.604457259
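A hedged sketch using a window function, assuming window support is available (HiveContext in Spark 1.4/1.5, plain SparkSession later) and assuming an explicit ordering column such as the index from the earlier snippet; column names follow the example:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Running sum of price per flag, ordered by the (assumed) index column;
// Long.MinValue/0 spell "unbounded preceding to current row" in older releases
val w = Window.partitionBy("flag").orderBy("index").rowsBetween(Long.MinValue, 0)
val withCumsum = df.withColumn("cumsum_price", sum("price").over(w))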
Thanks
--
Cesar Flores
tried also:
hc.createDataFrame(df.rdd.repartition(100), df.schema)
which appears to be a random permutation. Can someone confirm that the last line is in fact a random permutation, or point me to a better approach?
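A hedged sketch of a more explicit shuffle, which avoids relying on repartition's ordering behavior (rand lives in org.apache.spark.sql.functions; the seed is arbitrary):

import org.apache.spark.sql.functions.rand

// Attach a random key and sort by it: a random permutation that is reproducible for a fixed seed
val shuffled = df.orderBy(rand(42))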
Thanks
--
Cesar Flores
Thanks!!!
--
Cesar Flores
the config parameter spark.sql.shuffle.partitions, which I need to modify on the fly to run group-by clauses depending on the size of my input.
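A hedged sketch of setting it at runtime (the SQLContext form is the Spark 1.x style, the SparkSession form the 2.x style; the value 400 is arbitrary):

// Spark 1.x
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

// Spark 2.x
spark.conf.set("spark.sql.shuffle.partitions", "400")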
Thanks
--
Cesar Flores
Hello:
I wonder if there is a way to query the number of running executors (not the number of requested executors) from inside a Spark job?
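A hedged sketch of one common approximation (getExecutorMemoryStatus has one entry per block manager, which includes the driver, hence the -1; in Spark 2.0+ sc.statusTracker.getExecutorInfos is another option):

// Rough count of currently registered executors, seen from the driver
val runningExecutors = sc.getExecutorMemoryStatus.size - 1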
Thanks
--
Cesar Flores
Thanks Ted:
That is the kind of answer I was looking for.
Best,
Cesar flores
On Wed, Apr 6, 2016 at 3:01 PM, Ted Yu wrote:
> Have you looked at SparkListener ?
>
> /**
>* Called when the driver registers a new executor.
>*/
> def onExecutorA
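For illustration, a hedged sketch of the kind of listener suggested above; the counter class is hypothetical, and registration assumes the public (developer API) SparkListener callbacks:

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Keeps a live count of executors as the driver registers and removes them
class ExecutorCountListener extends SparkListener {
  val count = new AtomicInteger(0)
  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = count.incrementAndGet()
  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit = count.decrementAndGet()
}

val listener = new ExecutorCountListener
sc.addSparkListener(listener)   // later: listener.count.get()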
gt; >>
>>> >> From: kpe...@gmail.com
>>> >> Date: Mon, 2 May 2016 12:11:18 -0700
>>> >> Subject: Re: Weird results with Spark SQL Outer joins
>>> >> To: gourav.sengu...@gmail.com
>>> >> CC: user@spark.apache.org
>>> >>
>>> >>
>>> >> Gourav,
>>> >>
>>> >> I wish that was case, but I have done a select count on each of the
>>> two
>>> >> tables individually and they return back different number of rows:
>>> >>
>>> >>
>>> >> dps.registerTempTable("dps_pin_promo_lt")
>>> >> swig.registerTempTable("swig_pin_promo_lt")
>>> >>
>>> >>
>>> >> dps.count()
>>> >> RESULT: 42632
>>> >>
>>> >>
>>> >> swig.count()
>>> >> RESULT: 42034
>>> >>
>>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>>> >> wrote:
>>> >>
>>> >> This shows that both the tables have matching records and no
>>> mismatches.
>>> >> Therefore obviously you have the same results irrespective of whether
>>> you
>>> >> use right or left join.
>>> >>
>>> >> I think that there is no problem here, unless I am missing something.
>>> >>
>>> >> Regards,
>>> >> Gourav
>>> >>
>>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 wrote:
>>> >>
>>> >> Also, the results of the inner query produced the same results:
>>> >> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc ,
>>> d.account
>>> >> AS
>>> >> d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend ,
>>> >> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>>> >> dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND
>>> s.ad
>>> >> =
>>> >> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>>> '2016-01-03'").count()
>>> >> RESULT:23747
--
Cesar Flores
functionality may be
useful?*
Thanks
--
Cesar Flores
Please send it to me too!
Thanks!!!
Cesar Flores
On Tue, May 17, 2016 at 4:55 PM, Femi Anthony wrote:
> Please send it to me as well.
>
> Thanks
>
> Sent from my iPhone
>
> On May 17, 2016, at 12:09 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com>
I created a Spark application in Eclipse by including the spark-assembly-1.6.0-hadoop2.6.0.jar file in the path.
However, this method does not let me see the Spark source code. Is there an easy way to include the Spark source code for reference in an application developed in Eclipse?
Thanks !
--
Cesar
?
Thanks
--
Cesar Flores
with a single
partition and around 14 million records
val newDF = hc.createDataFrame(rdd, df.schema)
This process is really slow. Is there any other way of achieving this task, or of optimizing it (perhaps by tweaking a Spark configuration parameter)?
Thanks a lot
--
Cesar Flores
are better off not
>> running the orderBy clause.
>>
>> May be someone from spark sql team could answer that how should the
>> partitioning of the output DF be handled when doing an orderBy?
>>
>> Hemant
>> www.snappydata.io
>> https://github.com/Snappy
Does anyone have any idea of what I may be doing wrong? My guess is that I am doing something weird in my class hierarchy but cannot figure out what.
Thanks!
--
Cesar Flores
I found my problem. I was calling setParameterValue(defaultValue) more than once in my class hierarchy.
Thanks!
On Mon, Feb 15, 2016 at 6:34 PM, Cesar Flores wrote:
>
> I have a set of transformers (each with specific parameters) in spark
> 1.3.1. I have two versions,
of time (i.e. less than 12 hours).
Best
--
Cesar Flores
Hi,
I recently upgraded from 1.2.1 to 1.3.1 (through HDP).
I have a job that does a cartesian product on two datasets (2K and 500K lines
minimum) to do string matching.
I updated it to use Dataframes because the old code wouldn’t run anymore
(deprecated RDD functions).
It used to run very w
3 cores* not 8
César.
> Le 6 oct. 2015 à 19:08, Cesar Berezowski a écrit :
>
> I deployed hdp 2.3.1 and got spark 1.3.1, spark 1.4 is supposed to be
> available as technical preview I think
>
> vendor’s forum ? you mean hortonworks' ?
>
> --
> Update on m
to merge is random?
Thanks
--
Cesar Flores
x._2)
val partitioned_df = hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)
Thanks a lot
--
Cesar Flores
ing since its
> mostly a blackbox.
>
> 1) could be fixed by adding caching. 2) is on our roadmap (though you'd
> have to use logical DataFrame expressions to do the partitioning instead of
> a class based partitioner).
>
> On Wed, Oct 14, 2015 at 8:45 AM, Cesar Flores wro
very useful for performing joins later). Is that true?
And a second question: if I save *df* just after the query into a Hive table, when I reload this table from Hive, will Spark remember the partitioning?
I am currently using Spark 1.3.1.
Thanks
--
Cesar Flores
ooks like:
libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0" % "test",
parallelExecution in Test := false,
fork := true,
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M",
"-XX:+CMSClassUnloadingEnabled")
We are working under Spark 1.3.0
Thanks
--
Cesar Flores
On Mon, Aug 3, 2015 at 5:56 PM, Michael Armbrust
wrote:
> TestHive takes care of creating a temporary directory for each invocation
> so that multiple test runs won't conflict.
>
> On Mon, Aug 3, 2015 at 3:09 PM, Cesar Flores wrote:
>
>>
>> We are using a local h
linux path /home/my_user_name, which fails.
On Thu, Aug 6, 2015 at 3:12 PM, Cesar Flores wrote:
> Well, I try this approach, and still have issues. Apparently TestHive can
> not delete the hive metastore directory. The complete error that I have is:
>
> 15/08/06 15:01:29 ERROR Dr
private to the ml package:
private[ml] def transformSchema(schema: StructType, paramMap: ParamMap):
StructType
Can any user create their own transformers? If not, will this functionality be added in the future?
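For reference, a hedged sketch of a user-defined transformer against the later public API (the Spark 2.x pipeline API, where transformSchema is no longer private[ml]); the class and column names are hypothetical, and a numeric "price" column of DoubleType is assumed:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Minimal custom transformer: appends a doubled copy of the "price" column
class DoublePrice(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("doublePrice"))

  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("price_x2", col("price") * 2)

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("price_x2", DoubleType, nullable = true))

  override def copy(extra: ParamMap): DoublePrice = defaultCopy(extra)
}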
Thanks
--
Cesar Flores
ct
function? If not, what would be a better approach for selecting the required fields at run time?
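A hedged sketch of the varargs expansion on the DataFrame API (Spark 1.3+); colNames stands for a runtime list of field names:

// Expand a runtime sequence of column names into the varargs select
val colNames = Seq("flag", "price", "index")
val projected = df.select(colNames.head, colNames.tail: _*)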
Thanks in advance for your help
--
Cesar Flores
uired fields, but would like to hear the opinion
of an expert about it.
Thanks
On Thu, Feb 19, 2015 at 12:01 PM, Cesar Flores wrote:
>
> I am trying to pass a variable number of arguments to the select function
> of a SchemaRDD I created, as I want to select the fields in run time:
be able to handle user-defined classes too? Will user classes need to extend something, or will they need to follow the same approach?
--
Cesar Flores
different syntax? Are they interchangeable? Which one has
better performance?
Thanks a lot
--
Cesar Flores
transformer classes for feature extraction, and if I need to save the input and maybe the output SchemaRDD of the transform function in every transformer, this may not be very efficient.
Thanks
On Tue, Mar 10, 2015 at 8:20 PM, Tobias Pfeiffer wrote:
> Hi,
>
> On Tue, Mar 10, 2015 at 2:13 PM, Ces
a lot
--
Cesar Flores
at all my fields are missing.
Can someone tell me if I need to do some post-processing after loading from Hive in order to avoid this kind of error?
Thanks
--
Cesar Flores
Never mind. I found the solution:
val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema)
which translates to converting the data frame to an RDD and back again to a data frame. Not the prettiest solution, but at least it solves my problem.
Thanks,
Cesar Flores
On
Hi All:
I was wondering if anybody had bought a ticket for the upcoming Spark
Summit 2014 this coming week and had changed their mind about going.
Let me know; since it has sold out and I can't buy a ticket anymore, I would be interested in buying it.
Best,
--
Cesar Arevalo
Software Eng
-spark-streaming-for-high-velocity-analytics-on-cassandra
Best,
-Cesar
On Jul 4, 2014, at 12:33 AM, zarzyk wrote:
> Hi,
> I bump this thread as I'm also interested in the answer. Can anyone help or
> point to the information on how to do Spark Streaming from/to Cassandra?
>
&
HOT-hadoop2.2.0.jar
I didn't try this, so it may not work.
Best,
-Cesar
On Sat, Jul 5, 2014 at 2:48 AM, Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com> wrote:
> Hi all,
>
> I have cluster with HDP 2.0. I built Spark 1.0 on edge node and trying to
> run with a command
&g
t you are doing wrong.
I've found that following the spark programming guide online usually gives me
enough information, but I guess you've already tried that.
Best,
-Cesar
> On Jul 7, 2014, at 12:41 AM, Praveen R wrote:
>
> I need a variable to be broadcasted from driver to
ar:/opt/spark-poc/lib_managed/bundles/com.jolbox/bonecp/bonecp-0.7.1.RELEASE.jar:/opt/spark-poc/sbt/ivy/cache/com.datastax.cassandra/cassandra-driver-core/bundles/cassandra-driver-core-2.0.4.jar:/opt/spark-poc/lib_managed/jars/org.json/json/json-20090211.jar
Can anybody help me?
Best,
--
Nope, it is NOT null. Check this out:
scala> hiveContext == null
res2: Boolean = false
And thanks for sending that link, but I had already looked at it. Any other
ideas?
I looked through some of the relevant Spark Hive code and I'm starting to
think this may be a bug.
-Cesar
On Mon
not available.
It may be completely missing from the current classpath,
Best,
-Cesar
On Mon, Aug 18, 2014 at 12:47 AM, Akhil Das
wrote:
> Then definitely its a jar conflict. Can you try removing this jar from the
> class path /opt/spark-poc/lib_managed/jars/org.
> spark-project.hive/hi
ack to you.
-Cesar
On Tue, Aug 19, 2014 at 2:04 PM, Yin Huai wrote:
> Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira
> tracking this issue.
>
>
> On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo
> wrote:
>
>> Thanks, Zhan for the follow up.
>>
&
I was thinking of using
the pregel API, so I will continue looking into that. Anyway, I look
forward to a response.
Best,
--
Cesar Arevalo
Software Engineer ❘ Zephyr Health
450 Mission Street, Suite #201 ❘ San Francisco, CA 94105
m: +1 415-571-7687 ❘ s: arevalocesar | t: @zephyrhealth
<https://twitter.
Hey, thanks for your response.
And I had seen the triplets, but I'm not quite sure how the triplets would
get me that V1 is connected to V4. Maybe I need to spend more time
understanding it, I guess.
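For what it may be worth, a hedged sketch of a two-hop lookup built only from triplets; it assumes a Graph whose vertex attribute is a type label like "A"/"B" (variable and attribute names here are guesses based on the thread):

// Edges whose destination is a type-A vertex, keyed by that A vertex
val intoA = graph.triplets.filter(_.dstAttr == "A").map(t => (t.dstId, t.srcId))

// Edges whose source is a type-A vertex, keyed by that A vertex
val outOfA = graph.triplets.filter(_.srcAttr == "A").map(t => (t.srcId, t.dstId))

// Joining on the shared A vertex yields pairs of vertices connected through it (e.g. V1 -> V4)
val connectedThroughA = intoA.join(outOfA).values.filter { case (v1, v4) => v1 != v4 }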
-Cesar
On Wed, Aug 20, 2014 at 10:56 AM, glxc wrote:
> I don't know if Pr
code I need
to modify.
I'll let you know how it goes.
-Cesar
On Wed, Aug 20, 2014 at 2:14 PM, Ankur Dave wrote:
> At 2014-08-20 10:34:50 -0700, Cesar Arevalo
> wrote:
> > I would like to get the type B vertices that are connected through type A
> > vertices where the
Hi everyone!
I am working with multiple time series, and in summary I have to adjust each time series (e.g., inserting average values into data gaps) and then train regression models with MLlib for each one. The adjustment step I did with the adjustment function being mapped for each