Is there a simpler way to check if a data frame is cached other than:
dataframe.registerTempTable("cachedOutput")
assert(hc.isCached("cachedOutput"), "The table was not cached")
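For what it may be worth, a hedged sketch of an alternative that avoids registering a temp table (this assumes Spark 2.1+, where a DataFrame exposes its storage level; in earlier versions the isCached-by-table-name check above seems to be the main option):

import org.apache.spark.storage.StorageLevel
// Hypothetical direct check: a cached DataFrame reports a non-NONE storage level (Spark 2.1+ assumption)
val isDataFrameCached = dataframe.storageLevel != StorageLevel.NONE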
Thanks!
--
Cesar Flores
SELECT * FROM tableAlias
"
)
Will the partition information ("id") be stored in whse.someTable so that, when querying that table in a second Spark job, the information can be used to optimize joins, for example?
If this approach does not work, can you suggest one that does?
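For reference, a minimal sketch of how the partition column can be persisted explicitly when writing, assuming the Spark 2.x SparkSession/DataFrameWriter API rather than whatever the snippet above uses; the table name whse.someTable and the "id" column are taken from the question:

// Write the table partitioned by "id" so the partitioning is recorded in the metastore
df.write
  .mode("overwrite")
  .partitionBy("id")
  .saveAsTable("whse.someTable")

// A second job can then read the table back through the catalog
val reloaded = spark.table("whse.someTable")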
Thanks
--
Cesar Flores
something similar to R's output (which clearly indicates which weight corresponds to each feature name, including categorical ones).
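One hedged sketch of pairing weights with feature names, assuming a spark.ml pipeline where a VectorAssembler produced a "features" column (its metadata carries the per-feature attribute names, including expanded categorical levels) and a fitted linear model; the variable names trainingDF and model are hypothetical:

import org.apache.spark.ml.attribute.AttributeGroup

// Feature names recorded in the metadata of the assembled "features" column
val attrs = AttributeGroup.fromStructField(trainingDF.schema("features")).attributes.get
val featureNames = attrs.map(_.name.getOrElse("unnamed"))

// Pair each name with the corresponding coefficient (older mllib models expose model.weights instead)
val weights = model.coefficients.toArray
featureNames.zip(weights).foreach { case (name, w) => println(f"$name%-30s $w%.6f") }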
Thanks a lot !
--
Cesar Flores
Is there a way to release all persisted RDDs/DataFrames in Spark without stopping the SparkContext?
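A hedged sketch of how this can be done without stopping the context, assuming access to the SparkContext (sc) and a SQLContext/HiveContext (sqlContext):

// Unpersist every RDD the context still tracks as persistent
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))

// Drop all cached tables/DataFrames known to the SQL layer
sqlContext.clearCache()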
Thanks a lot
--
Cesar Flores
true.
*
* @group setParam
*/
Specifically, I am having trouble understanding why the solution should converge to the same weight values with and without standardization.
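One hedged way to see the usual argument (a sketch, not specific to Spark's implementation): with no regularization, standardizing x_j -> x_j / sigma_j only rescales the problem. If w~ minimizes the loss on the standardized features, then mapping back with w_j = w~_j / sigma_j gives w . x = w~ . x~ for every row, so the loss is identical and the mapped-back weights coincide with those from fitting on the raw features. With an L1/L2 penalty the regularizer is not invariant under this rescaling, which is why regularized solutions can differ between the two settings.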
Thanks !
--
Cesar Flores
lumns variable?
Thanks
--
Cesar Flores
am looking more for a hack kind of solution.
Thanks a lot !
--
Cesar Flores
I can try ?
Thanks a lot !
--
Cesar Flores
Is there a way to unpersist all DataFrames, Datasets, and/or RDDs in Spark 2.2 in a single call?
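A hedged sketch, assuming a Spark 2.2 SparkSession named spark:

// Clear everything cached through the SQL/Dataset layer
spark.catalog.clearCache()

// Also unpersist RDDs that were persisted directly, if any
spark.sparkContext.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))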
Thanks
--
Cesar Flores
mes?*
thanks
--
Cesar Flores
Thanks for your answers.
The suggested method works when the number of DataFrames is small.
However, I am trying to union more than 30 DataFrames, and building the plan takes longer than the execution itself, which should not be the case.
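One hedged workaround sketch, assuming a Seq[DataFrame] named dfs with identical schemas and a SparkSession named spark: union at the RDD level so the analyzer does not have to fold thirty-plus plans, or truncate the lineage with checkpointing.

// Option A: union the underlying RDDs and rebuild a single DataFrame
val unionedRDD = spark.sparkContext.union(dfs.map(_.rdd))
val unionedDF = spark.createDataFrame(unionedRDD, dfs.head.schema)

// Option B: keep DataFrame unions but checkpoint every few groups to cut the plan
// (Dataset.checkpoint is Spark 2.1+, and spark.sparkContext.setCheckpointDir must be set first)
val result = dfs.grouped(10).map(_.reduce(_ union _).checkpoint()).reduce(_ union _)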
Thanks!
--
Cesar
On Thu, Apr 5, 2018 at 1:29 PM, A
column to something else on the
fly, and not after performing the aggregation?
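A hedged sketch of aliasing the aggregate inside the agg call itself (column names are hypothetical):

import org.apache.spark.sql.functions.sum

// Alias the aggregate expression directly instead of calling withColumnRenamed afterwards
val out = df.groupBy("flag").agg(sum("price").as("total_price"))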
thanks
--
Cesar Flores
I have a table in a Hive database partitioned by date. I notice that when I query this table using HiveContext, the created data frame has a specific number of partitions.
Does this partitioning correspond to my original table partitioning in Hive?
Thanks
--
Cesar Flores
as the next one:
flag | price            | index
-----|------------------|------
1    | 47.808764653746  | 0
1    | 47.808764653746  | 1
1    | 31.9869279512204 | 2
1    | 47.7907893713564 | 3
1    | 16.7599200038239 | 4
1    | 16.7599200038239 | 5
1    | 20.3916014172137 | 6
--
Cesar Flores
cumsum column as the next one:
flag | price            | cumsum_price
-----|------------------|----------------
1    | 47.808764653746  | 47.808764653746
1    | 47.808764653746  | 95.6175293075
1    | 31.9869279512204 | 127.604457259
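A hedged sketch using a window function, assuming window support is available (HiveContext in Spark 1.4/1.5, plain SparkSession later) and assuming an explicit ordering column such as the index from the earlier snippet; column names follow the example:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Running sum of price per flag, ordered by the (assumed) index column;
// Long.MinValue/0 spell "unbounded preceding to current row" in older releases
val w = Window.partitionBy("flag").orderBy("index").rowsBetween(Long.MinValue, 0)
val withCumsum = df.withColumn("cumsum_price", sum("price").over(w))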
Thanks
--
Cesar Flores
tried also:
hc.createDataFrame(df.rdd.repartition(100), df.schema)
which appears to be a random permutation. Can someone confirm that the last line is in fact a random permutation, or point me to a better approach?
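A hedged sketch of a more explicit shuffle, which avoids relying on repartition's ordering behavior (rand lives in org.apache.spark.sql.functions; the seed is arbitrary):

import org.apache.spark.sql.functions.rand

// Attach a random key and sort by it: a random permutation that is reproducible for a fixed seed
val shuffled = df.orderBy(rand(42))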
Thanks
--
Cesar Flores
Thanks!!!
--
Cesar Flores
the config parameter spark.sql.shuffle.partitions, which I need to modify on the fly to run group-by clauses depending on the size of my input.
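A hedged sketch of setting it at runtime (the SQLContext form is the Spark 1.x style, the SparkSession form the 2.x style; the value 400 is arbitrary):

// Spark 1.x
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

// Spark 2.x
spark.conf.set("spark.sql.shuffle.partitions", "400")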
Thanks
--
Cesar Flores
Hello:
I wonder if there is a way to query the number of running executors (not the number of requested executors) from inside a Spark job?
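A hedged sketch of one common approximation (getExecutorMemoryStatus has one entry per block manager, which includes the driver, hence the -1; in Spark 2.0+ sc.statusTracker.getExecutorInfos is another option):

// Rough count of currently registered executors, seen from the driver
val runningExecutors = sc.getExecutorMemoryStatus.size - 1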
Thanks
--
Cesar Flores
Thanks Ted:
That is the kind of answer I was looking for.
Best,
Cesar flores
On Wed, Apr 6, 2016 at 3:01 PM, Ted Yu wrote:
> Have you looked at SparkListener ?
>
> /**
>* Called when the driver registers a new executor.
>*/
> def onExecutorA
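For illustration, a hedged sketch of the kind of listener suggested above; the counter class is hypothetical, and registration assumes the public (developer API) SparkListener callbacks:

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

// Keeps a live count of executors as the driver registers and removes them
class ExecutorCountListener extends SparkListener {
  val count = new AtomicInteger(0)
  override def onExecutorAdded(e: SparkListenerExecutorAdded): Unit = count.incrementAndGet()
  override def onExecutorRemoved(e: SparkListenerExecutorRemoved): Unit = count.decrementAndGet()
}

val listener = new ExecutorCountListener
sc.addSparkListener(listener)   // later: listener.count.get()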
gt; >>
>>> >> From: kpe...@gmail.com
>>> >> Date: Mon, 2 May 2016 12:11:18 -0700
>>> >> Subject: Re: Weird results with Spark SQL Outer joins
>>> >> To: gourav.sengu...@gmail.com
>>> >> CC: user@spark.apache.org
>>> >>
>>> >>
>>> >> Gourav,
>>> >>
>>> >> I wish that was case, but I have done a select count on each of the
>>> two
>>> >> tables individually and they return back different number of rows:
>>> >>
>>> >>
>>> >> dps.registerTempTable("dps_pin_promo_lt")
>>> >> swig.registerTempTable("swig_pin_promo_lt")
>>> >>
>>> >>
>>> >> dps.count()
>>> >> RESULT: 42632
>>> >>
>>> >>
>>> >> swig.count()
>>> >> RESULT: 42034
>>> >>
>>> >> On Mon, May 2, 2016 at 11:55 AM, Gourav Sengupta
>>> >> wrote:
>>> >>
>>> >> This shows that both the tables have matching records and no
>>> mismatches.
>>> >> Therefore obviously you have the same results irrespective of whether
>>> you
>>> >> use right or left join.
>>> >>
>>> >> I think that there is no problem here, unless I am missing something.
>>> >>
>>> >> Regards,
>>> >> Gourav
>>> >>
>>> >> On Mon, May 2, 2016 at 7:48 PM, kpeng1 wrote:
>>> >>
>>> >> Also, the results of the inner query produced the same results:
>>> >> sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc ,
>>> d.account
>>> >> AS
>>> >> d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend ,
>>> >> d.spend_in_dollar AS d_spend FROM swig_pin_promo_lt s INNER JOIN
>>> >> dps_pin_promo_lt d ON (s.date = d.date AND s.account = d.account AND
>>> s.ad
>>> >> =
>>> >> d.ad) WHERE s.date >= '2016-01-03'AND d.date >=
>>> '2016-01-03'").count()
>>> >> RESULT:23747
--
Cesar Flores
functionality may be
useful?*
Thanks
--
Cesar Flores
Please send it to me too!
Thanks!!!
Cesar Flores
On Tue, May 17, 2016 at 4:55 PM, Femi Anthony wrote:
> Please send it to me as well.
>
> Thanks
>
> Sent from my iPhone
>
> On May 17, 2016, at 12:09 PM, Raghavendra Pandey <
> raghavendra.pan...@gmail.com>
I created a Spark application in Eclipse by including the spark-assembly-1.6.0-hadoop2.6.0.jar file in the path.
However, this method does not let me see the Spark source code. Is there an easy way to include the Spark source code for reference in an application developed in Eclipse?
Thanks !
--
Cesar
?
Thanks
--
Cesar Flores
with a single
partition and around 14 million records
val newDF = hc.createDataFrame(rdd, df.schema)
This process is really slow. Is there any other way of achieving this task, or of optimizing it (perhaps by tweaking a Spark configuration parameter)?
Thanks a lot
--
Cesar Flores
are better off not
>> running the orderBy clause.
>>
>> May be someone from spark sql team could answer that how should the
>> partitioning of the output DF be handled when doing an orderBy?
>>
>> Hemant
>> www.snappydata.io
>> https://github.com/Snappy
Does anyone have any idea of what I may be doing wrong? My guess is that I am doing something weird in my class hierarchy but cannot figure out what.
Thanks!
--
Cesar Flores
I found my problem. I was calling setParameterValue(defaultValue) more than once in my class hierarchy.
Thanks!
On Mon, Feb 15, 2016 at 6:34 PM, Cesar Flores wrote:
>
> I have a set of transformers (each with specific parameters) in spark
> 1.3.1. I have two versions,
of time (i.e. less than 12 hours).
Best
--
Cesar Flores
Hi,
I recently upgraded from 1.2.1 to 1.3.1 (through HDP).
I have a job that does a cartesian product on two datasets (2K and 500K lines
minimum) to do string matching.
I updated it to use Dataframes because the old code wouldn’t run anymore
(deprecated RDD functions).
It used to run very w
3 cores* not 8
César.
> Le 6 oct. 2015 à 19:08, Cesar Berezowski a écrit :
>
> I deployed hdp 2.3.1 and got spark 1.3.1, spark 1.4 is supposed to be
> available as technical preview I think
>
> vendor’s forum ? you mean hortonworks' ?
>
> --
> Update on m
to merge is random?
Thanks
--
Cesar Flores
x._2)
val partitioned_df = hc.createDataFrame(partitioned_rdd, unpartitioned_df.schema)
Thanks a lot
--
Cesar Flores
ing since its
> mostly a blackbox.
>
> 1) could be fixed by adding caching. 2) is on our roadmap (though you'd
> have to use logical DataFrame expressions to do the partitioning instead of
> a class based partitioner).
>
> On Wed, Oct 14, 2015 at 8:45 AM, Cesar Flores wro
very useful for performing joins later). Is that true?
And a second question: if I save *df* just after the query into a Hive table, when I reload this table from Hive, will Spark remember the partitioning?
I am currently using Spark 1.3.1.
Thanks
--
Cesar Flores
ooks like:
libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.0" % "test",
parallelExecution in Test := false,
fork := true,
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M",
"-XX:+CMSClassUnloadingEnabled")
We are working under Spark 1.3.0
Thanks
--
Cesar Flores
On Mon, Aug 3, 2015 at 5:56 PM, Michael Armbrust
wrote:
> TestHive takes care of creating a temporary directory for each invocation
> so that multiple test runs won't conflict.
>
> On Mon, Aug 3, 2015 at 3:09 PM, Cesar Flores wrote:
>
>>
>> We are using a local h
linux path /home/my_user_name, which fails.
On Thu, Aug 6, 2015 at 3:12 PM, Cesar Flores wrote:
> Well, I try this approach, and still have issues. Apparently TestHive can
> not delete the hive metastore directory. The complete error that I have is:
>
> 15/08/06 15:01:29 ERROR Dr
private to the ml package:
private[ml] def transformSchema(schema: StructType, paramMap: ParamMap):
StructType
Can any user create their own transformers? If not, will this functionality be added in the future?
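For reference, a hedged sketch of a user-defined transformer against the later public API (the Spark 2.x pipeline API, where transformSchema is no longer private[ml]); the class and column names are hypothetical, and a numeric "price" column of DoubleType is assumed:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Minimal custom transformer: appends a doubled copy of the "price" column
class DoublePrice(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("doublePrice"))

  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("price_x2", col("price") * 2)

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("price_x2", DoubleType, nullable = true))

  override def copy(extra: ParamMap): DoublePrice = defaultCopy(extra)
}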
Thanks
--
Cesar Flores
ct
function? If not, what would be a better approach for selecting the required fields at run time?
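A hedged sketch of the varargs expansion on the DataFrame API (Spark 1.3+); colNames stands for a runtime list of field names:

// Expand a runtime sequence of column names into the varargs select
val colNames = Seq("flag", "price", "index")
val projected = df.select(colNames.head, colNames.tail: _*)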
Thanks in advance for your help
--
Cesar Flores
uired fields, but would like to hear the opinion
of an expert about it.
Thanks
On Thu, Feb 19, 2015 at 12:01 PM, Cesar Flores wrote:
>
> I am trying to pass a variable number of arguments to the select function
> of a SchemaRDD I created, as I want to select the fields in run time:
be able to handle user-defined classes too? Will user classes need to extend something, or will they need to follow the same approach?
--
Cesar Flores
different syntax? Are they interchangeable? Which one has
better performance?
Thanks a lot
--
Cesar Flores
transformer classes for feature extraction, and if I need to save the input and maybe the output SchemaRDD of the transform function in every transformer, this may not be very efficient.
Thanks
On Tue, Mar 10, 2015 at 8:20 PM, Tobias Pfeiffer wrote:
> Hi,
>
> On Tue, Mar 10, 2015 at 2:13 PM, Ces
a lot
--
Cesar Flores
at all my fields are missing.
Can someone tell me if I need to do some post-processing after loading from Hive in order to avoid this kind of error?
Thanks
--
Cesar Flores
Never mind. I found the solution:
val newDataFrame = hc.createDataFrame(hiveLoadedDataFrame.rdd, hiveLoadedDataFrame.schema)
which translates to converting the data frame to an RDD and back again to a data frame. Not the prettiest solution, but at least it solves my problem.
Thanks,
Cesar Flores
On
Hi All:
I was wondering if anybody had bought a ticket for the upcoming Spark
Summit 2014 this coming week and had changed their mind about going.
Let me know; since it has sold out and I can't buy a ticket anymore, I would be interested in buying it.
Best,
--
Cesar Arevalo
Software Eng
-spark-streaming-for-high-velocity-analytics-on-cassandra
Best,
-Cesar
On Jul 4, 2014, at 12:33 AM, zarzyk wrote:
> Hi,
> I bump this thread as I'm also interested in the answer. Can anyone help or
> point to the information on how to do Spark Streaming from/to Cassandra?
>
&
HOT-hadoop2.2.0.jar
I didn't try this, so it may not work.
Best,
-Cesar
On Sat, Jul 5, 2014 at 2:48 AM, Konstantin Kudryavtsev <
kudryavtsev.konstan...@gmail.com> wrote:
> Hi all,
>
> I have cluster with HDP 2.0. I built Spark 1.0 on edge node and trying to
> run with a command
&g
t you are doing wrong.
I've found that following the spark programming guide online usually gives me
enough information, but I guess you've already tried that.
Best,
-Cesar
> On Jul 7, 2014, at 12:41 AM, Praveen R wrote:
>
> I need a variable to be broadcasted from driver to
ar:/opt/spark-poc/lib_managed/bundles/com.jolbox/bonecp/bonecp-0.7.1.RELEASE.jar:/opt/spark-poc/sbt/ivy/cache/com.datastax.cassandra/cassandra-driver-core/bundles/cassandra-driver-core-2.0.4.jar:/opt/spark-poc/lib_managed/jars/org.json/json/json-20090211.jar
Can anybody help me?
Best,
--
Nope, it is NOT null. Check this out:
scala> hiveContext == null
res2: Boolean = false
And thanks for sending that link, but I had already looked at it. Any other
ideas?
I looked through some of the relevant Spark Hive code and I'm starting to
think this may be a bug.
-Cesar
On Mon
not available.
It may be completely missing from the current classpath,
Best,
-Cesar
On Mon, Aug 18, 2014 at 12:47 AM, Akhil Das
wrote:
> Then definitely its a jar conflict. Can you try removing this jar from the
> class path /opt/spark-poc/lib_managed/jars/org.
> spark-project.hive/hi
ack to you.
-Cesar
On Tue, Aug 19, 2014 at 2:04 PM, Yin Huai wrote:
> Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira
> tracking this issue.
>
>
> On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo
> wrote:
>
>> Thanks, Zhan for the follow up.
>>
&
I was thinking of using
the pregel API, so I will continue looking into that. Anyway, I look
forward to a response.
Best,
--
Cesar Arevalo
Software Engineer ❘ Zephyr Health
450 Mission Street, Suite #201 ❘ San Francisco, CA 94105
m: +1 415-571-7687 ❘ s: arevalocesar | t: @zephyrhealth
<https://twitter.
Hey, thanks for your response.
And I had seen the triplets, but I'm not quite sure how the triplets would
get me that V1 is connected to V4. Maybe I need to spend more time
understanding it, I guess.
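For what it may be worth, a hedged sketch of a two-hop lookup built only from triplets; it assumes a Graph whose vertex attribute is a type label like "A"/"B" (variable and attribute names here are guesses based on the thread):

// Edges whose destination is a type-A vertex, keyed by that A vertex
val intoA = graph.triplets.filter(_.dstAttr == "A").map(t => (t.dstId, t.srcId))

// Edges whose source is a type-A vertex, keyed by that A vertex
val outOfA = graph.triplets.filter(_.srcAttr == "A").map(t => (t.srcId, t.dstId))

// Joining on the shared A vertex yields pairs of vertices connected through it (e.g. V1 -> V4)
val connectedThroughA = intoA.join(outOfA).values.filter { case (v1, v4) => v1 != v4 }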
-Cesar
On Wed, Aug 20, 2014 at 10:56 AM, glxc wrote:
> I don't know if Pr
code I need
to modify.
I'll let you know how it goes.
-Cesar
On Wed, Aug 20, 2014 at 2:14 PM, Ankur Dave wrote:
> At 2014-08-20 10:34:50 -0700, Cesar Arevalo
> wrote:
> > I would like to get the type B vertices that are connected through type A
> > vertices where the
Hi everyone!
I am working with multiple time series, and in summary I have to adjust each time series (e.g., inserting average values into data gaps) and then train regression models with MLlib for each one. The adjustment step I did with the adjustment function being mapped for each