leases/spark-release-3-0-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this release. This release would not have been possible without you.
>>
>>
>> Thanks,
>> Ruifeng Zheng
>>
>>
--
---
Takeshi Yamamuro
they are not used. Do I need to register them somehow?
> Could someone share a relevant example?
> Thx.
>
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> ---------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
---
Takeshi Yamamuro
2020 at 2:31 PM Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> Please see an example code in
>> https://github.com/gaborgsomogyi/spark-jdbc-connection-provider (
>> https://github.com/apache/spark/pull/29024).
>> Since it depends on the service loader, I think you
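For reference, here is a minimal sketch of that service-loader registration (the provider class below is hypothetical, and the exact abstract members of JdbcConnectionProvider can differ slightly between Spark versions):

  // com/example/jdbc/MyConnectionProvider.scala (hypothetical provider)
  package com.example.jdbc

  import java.sql.{Connection, Driver}
  import org.apache.spark.sql.jdbc.JdbcConnectionProvider

  class MyConnectionProvider extends JdbcConnectionProvider {
    override val name: String = "my-provider"
    override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
      options.getOrElse("url", "").startsWith("jdbc:mydb:")
    override def getConnection(driver: Driver, options: Map[String, String]): Connection =
      driver.connect(options("url"), new java.util.Properties())
    override def modifiesSecurityContext(driver: Driver, options: Map[String, String]): Boolean =
      false
  }

The jar must also contain a services file named
META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider whose single line is the
fully qualified class name (com.example.jdbc.MyConnectionProvider here). Without that file the
ServiceLoader never discovers the provider, which matches the "they are not used" symptom above.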
> Is there any performance penalty for using scala BigDecimal? it's more
> convenient from an API point of view than java.math.BigDecimal.
>
--
---
Takeshi Yamamuro
>>> and early feedback to
>>> this release. This release would not have been possible without you.
>>>
>>> To download Spark 3.1.1, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-1-1.html
>>>
>>>
--
---
Takeshi Yamamuro
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Dongjoon Hyun
>>>
>>
>
> --
>
>
--
---
Takeshi Yamamuro
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1007)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
>
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:989)
>
> at BIDMach.RunOnSpark$.runOnSpark(RunOnSpark.scala:50)
>
> ... 50 elided
>
>
--
---
Takeshi Yamamuro
putStream.java:1921)
>>>> at
>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>>> at
>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>>> at
>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>>>> at
>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>>>> at
>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>>>> at
>>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>>>> at
>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>>>> at
>>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>>>>
>>>>
>>>> I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark
>>>> property maximizeResourceAllocation is set to true (executor.memory = 48G
>>>> according to spark ui environment). We're also using kryo serialization and
>>>> Yarn is the resource manager.
>>>>
>>>> Any ideas as what might be going wrong and how to debug this?
>>>>
>>>> Thanks,
>>>> Arash
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
--
---
Takeshi Yamamuro
at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>> at
>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921)
>>
>>
>> I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark
>> property maximizeResourceAllocation is set to true (executor.memory = 48G
>> according to spark ui environment). We're also using kryo serialization and
>> Yarn is the resource manager.
>>
>> Any ideas as what might be going wrong and how to debug this?
>>
>> Thanks,
>> Arash
>>
>>
>>
>
--
---
Takeshi Yamamuro
ype#13666 = TEST)
>> Scan
>>
>> ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][perm_group#13667,user_group#13665,perm_type#13666]
>> ConvertToUnsafe
>> Scan
>>
>> ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117]
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Optimization-tp26548p26553.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
--
---
Takeshi Yamamuro
er-list.1001560.n3.nabble.com/Spark-and-DB-connection-pool-tp26577.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For addi
ent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
---
Takeshi Yamamuro
just re-sent,
-- Forwarded message --
From: Takeshi Yamamuro
Date: Thu, Mar 24, 2016 at 5:19 PM
Subject: Re: Forcing data from disk to memory
To: Daniel Imberman
Hi,
There is no direct approach; you need to unpersist the cached data, then
re-cache it as MEMORY_ONLY.
// maropu
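A minimal sketch of that sequence, assuming an already-cached DataFrame or RDD named df:

  import org.apache.spark.storage.StorageLevel

  df.unpersist(blocking = true)           // drop the currently cached blocks
  df.persist(StorageLevel.MEMORY_ONLY)    // register the new storage level
  df.count()                              // run an action so the data is actually re-cached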
ser-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
---
Takeshi Yamamuro
I run
> multiple queries against it the RDD fully persists, but this means that the
> first 4/5 queries I run are extremely slow.
>
> Is there any way I can make sure that the entire RDD ends up in memory the
> first time I load it?
>
> Thank you
>
> On Thu, Ma
:{0} quotient:{1} remander:{2} repartition({3})"
>
> .format(numRows, quotient, remander, numPartitions))
>
> print(debugStr)
>
>
>
> csvDF = resultDF.coalesce(numPartitions)
>
>
>
> orderByColName = "count"
>
> csvDF = cs
> Best Regards,
> Ayan Guha
>
--
---
Takeshi Yamamuro
an option to create Hive table, driven by job specific
> configuration.
>
> What do you think?
>
> Best
> Ayan
>
> On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> It depends on your use case using sqoop.
>> What
> -XX:PermSize=256m --conf spark.driver.extraJavaOptions
> -XX:MaxPermSize=1024m -XX:PermSize=256m --conf
> spark.yarn.executor.memoryOverhead=1024
>
> Need to know the best practices/better ways to optimize code.
>
> Thanks,
> Divya
>
>
--
---
Takeshi Yamamuro
368
>
> FYI
>
> On Thu, Apr 14, 2016 at 6:53 PM, Edmon Begoli wrote:
>
>> I am needing this fact for the research paper I am writing right now.
>>
>> When did Spark start supporting Parquet and when ORC?
>> (what release)
>>
>> I appreciate any i
>> arkadiusz.b...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> Is there any statistics regarding YARN vs Standalone Spark Usage in
>>> >> production ?
>>> >>
>>> >> I would like to choose most supported and used technology in
>>> >> production for our project.
>>> >>
>>> >>
>>> >> BR,
>>> >>
>>> >> Arkadiusz Bicz
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>> >
>>>
>>
>>
>
--
---
Takeshi Yamamuro
afterwards.
>
>
>
> PS: A workaround is to use toDF(cols) and rename all columns, but I am
> wondering if toDF has any impact on the underlying RDD structure (e.g.
> repartitioning, cache, etc.)
>
>
>
> Appreciated,
>
> Saif
>
>
>
>
>
--
---
Takeshi Yamamuro
thub.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql
>>>>>
>>>>> *2. * Range filter queries on Long types
>>>>>
>>>>> *code*:
>>>>>
>>>>>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
>>>>>> `length` <= '1000'")
>>>>>
>>>>> *Full example*:
>>>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>>>>> *plan*:
>>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql
>>>>>
>>>>> The SolrRelation class we use extends
>>>>> <https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/SolrRelation.scala#L37>
>>>>> the PrunedFilteredScan.
>>>>>
>>>>> Since Solr supports date ranges, I would like for the timestamp
>>>>> filters to be pushed down to the Solr query.
>>>>>
>>>>> Are there limitations on the type of filters that are passed down with
>>>>> Timestamp types ?
>>>>> Is there something that I should do in my code to fix this ?
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> Kiran Chitturi
>>>>>
>>>>>
>>>>
>>
>
--
---
Takeshi Yamamuro
any best practices or has anyone come across a case like this
> before where a write to a database seems to perform poorly?
>
> Thanks,
> Jon
>
--
---
Takeshi Yamamuro
Sorry, the previous message went out mid-way by mistake.
How about trying to increase the `batchsize` JDBC option to improve write
performance?
// maropu
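For example, a sketch of a JDBC write with a larger batch size (the URL, table name, and the value 10000 are placeholders):

  df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/mydb")
    .option("dbtable", "target_table")
    .option("batchsize", "10000")   // rows per JDBC batch; the default is 1000
    .mode("append")
    .save()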
On Thu, Apr 21, 2016 at 2:15 PM, Takeshi Yamamuro
wrote:
> Hi,
>
> How about trying to increase the `batchsize
>
> On Wed, Apr 20, 2016
y contributions in a paper, or add you as a co-author
> if you have any significant contribution (and if interested).
>
> Thank you,
> Edmon
>
--
---
Takeshi Yamamuro
ote:
>
>> Hi,
>> I am using Spark 1.5.2 .
>> I have a use case where I need to join the same dataframe twice on two
>> different columns.
>> I am getting a "missing columns" error.
>>
>> For instance,
>> val df1 = df2.join(df3, "Column1")
>> The line below then throws the missing-columns error:
>> val df4 = df1.join(df3, "Column2")
>>
>> Is this a bug or a valid scenario?
>>
>>
>>
>>
>> Thanks,
>> Divya
>>
>
>
--
---
Takeshi Yamamuro
d(b,IntegerType,false), StructField(b,IntegerType,false))
>
> On Tue, Apr 26, 2016 at 8:54 PM, Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> I tried;
>> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
>> val df2 = Seq((1, 1), (2, 2), (
df2("b").as("2-b"))
val df4 = df3.join(df2, df3("2-b") === df2("b"))
// maropu
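Spelled out as a self-contained sketch (toy data in a spark-shell; aliasing each use of df2 is one way to keep the references unambiguous when joining the same dataframe twice):

  import org.apache.spark.sql.functions.col

  val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
  val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")

  // give each use of df2 its own alias so column references stay unambiguous
  val joinedOnce  = df1.as("l").join(df2.as("r1"), col("l.a") === col("r1.a"))
  val joinedTwice = joinedOnce.join(df2.as("r2"), col("l.b") === col("r2.b"))
  joinedTwice.show()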
On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot
wrote:
> Correct Takeshi
> I am facing the same issue as well.
>
> How can I avoid the ambiguity?
>
>
> On 27 Apri
of csv files. But it seems it works with sqlContext only. Is
> there a way to remove the header from csv files without sqlContext?
>
> Thanks
> Ashutosh
>
--
---
Takeshi Yamamuro
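One common RDD-only approach is to drop the first line of the first partition (a sketch; the path is a placeholder and it assumes the header appears only once, at the start of the input):

  val lines = sc.textFile("hdfs:///data/input.csv")
  val withoutHeader = lines.mapPartitionsWithIndex { (idx, iter) =>
    if (idx == 0) iter.drop(1) else iter   // skip the header line in the first partition
  }

If every file carries its own header, filtering out lines equal to the header string is another option.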
first: string (nullable = true)
>> |-- last: string (nullable = true)
>>
>> *Error while accessing table:*
>> Exception in thread "main" org.apache.spark.sql.AnalysisException: Table
>> not found: person;
>>
>> Does anyone have solution for this?
>>
>> Thanks,
>> Asmath
>>
>
>
--
---
Takeshi Yamamuro
9:24 GMT+01:00 Michael Armbrust :
>
>>
>>> logical plan after optimizer execution:
>>>
>>> Project [id#0L,id#1L]
>>> !+- Filter (id#0L = cast(1 as bigint))
>>> ! +- Join Inner, Some((id#0L = id#1L))
>>> ! :- Subquery t
>>> ! : +- Relation[id#0L] JSONRelation
>>> ! +- Subquery u
>>> ! +- Relation[id#1L] JSONRelation
>>>
>>
>> I think you are mistaken. If this was the optimized plan there would be
>> no subqueries.
>>
>
>
--
---
Takeshi Yamamuro
ver after filling the node's memory.
>
> Thanks
>
>
>
>
--
---
Takeshi Yamamuro
tor$Executor$$reportHeartBeat(Executor.scala:449)
> at
>
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470)
> at
>
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
> at
>
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470)
> at
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765)
> at
> org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
>
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)"
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/GC-overhead-limit-exceeded-tp26966.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
---
Takeshi Yamamuro
ap space"*
> But why doesn't Spark spill the data to disk?
>
> On Mon, May 16, 2016 at 5:11 PM, Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> Why did you think you have enough memory for your task? Did you check the task
>> statistics in your WebUI?
>> An
"Id" from "join_test"
>>>
>>> 0: jdbc:hive2://> *select Id from join_test;*
>>> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is
>>> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>>> 0: jdbc:hive2://>
>>>
>>> Is there a way to merge the value of df1("Id") and df2("Id") into one
>>> "Id"
>>>
>>> Thanks
>>>
>>
>>
>
--
---
Takeshi Yamamuro
gt;>>> Hi,
>>>>
>>>> I tried to join two dataframes:
>>>>
>>>> df_join = df1.join(df2, df1("Id") === df2("Id"), "fullouter")
>>>>
>>>> df_join.registerTempTable("join_test")
>>>>
>>>>
>>>> When querying "Id" from "join_test"
>>>>
>>>> 0: jdbc:hive2://> *select Id from join_test;*
>>>> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is
>>>> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>>>> 0: jdbc:hive2://>
>>>>
>>>> Is there a way to merge the value of df1("Id") and df2("Id") into one
>>>> "Id"
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>
--
---
Takeshi Yamamuro
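One way to end up with a single, unambiguous Id column is to coalesce the two sides into a new column and select only that one (a sketch; the non-Id columns are hypothetical):

  import org.apache.spark.sql.functions.coalesce

  val joined = df1.join(df2, df1("Id") === df2("Id"), "fullouter")
  val merged = joined.select(
    coalesce(df1("Id"), df2("Id")).as("Id"),   // take the non-null Id from either side
    df1("valueLeft"),                          // hypothetical payload columns
    df2("valueRight"))
  merged.registerTempTable("join_test")        // 1.x API, as used in this thread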
ecords to be inserted into an HBase table (Phoenix) as a
> result of a Spark job. I would like to know: if I convert it to a DataFrame
> and save it, will it do a bulk load, or is that not an efficient way to write
> data to an HBase table?
>
> --
> Thanks and Regards
> Mohan
>
--
---
Takeshi Yamamuro
'Id' is
> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0)
>
> I am currently using spark 1.5.2.
> Is there any alternative way in 1.5
>
> Thanks
>
> On Wed, May 18, 2016 at 12:12 PM, Takeshi Yamamuro
> wrote:
>
>> Also, you can p
>
>
> but, when you query the "id"
>
>
> sqlContext.sql("select id from test")
>
> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is
> *ambiguous*, could be: Id#1
---------
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
--
---
Takeshi Yamamuro
ut/userprofile/20160519/part-2
> 2016-05-20 17:01
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-3
> 2016-05-20 17:01
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-4
> 2016-05-20 17:01
> /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-5
>
>
>
>
>
--
---
Takeshi Yamamuro
>> Hi All,
>>
>> I have two RDDs A and B, where A is 30 MB and B is 7 MB in size;
>> A.cartesian(B) is taking too much time. Is there any bottleneck in the
>> cartesian operation?
>>
>> I am using spark 1.6.0 version
>>
>> Regards,
>> Padma Ch
>>
>
>
--
---
Takeshi Yamamuro
d) and using
> saveAsTextFile, trying to save it. However, this is also taking too much
> time.
>
> Thanks,
> Padma Ch
>
> On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> Seems you'd be better off using DataFrame#join inst
(ParquetRelation.scala:65)
>
> at
> org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165)
>
> Please suggest. It seems like it is not able to convert some data
>
--
---
Takeshi Yamamuro
uot;_temp_)("_1"))
> .withColumn("y", col("_temp_")("_2"))
> .drop("_temp_")
>
> this works, but it's not pretty with the temporary-field workaround.
>
> i also tried this:
> val tmp = udf(inputColumns)
> df
> .withColum
figurations is - 3 node cluster (1 master and 2 slaves). Each slave has
> 1 TB hard disk space, 300GB memory and 32 cores.
>
> HDFS block size is 128 MB.
>
> Thanks,
> Padma Ch
>
--
---
Takeshi Yamamuro
Couldn't you include all the needed columns in your input dataframe?
// maropu
On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers wrote:
> that is nice and compact, but it does not add the columns to an existing
> dataframe
>
> On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro
as 2.4 MB, totaling up to an overall input size of 9.7 MB for the
> whole stage? Isn't it just meant to read the metadata and ignore the
> content of the file?
>
>
>
> Regards,
>
> Dennis
>
--
---
Takeshi Yamamuro
", $"cX").select($"r._1",
$"r._2", $"c0", $"c1", $"cX")
// maropu
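Spelled out as a runnable sketch (the UDF and the column names c0/c1 are hypothetical stand-ins, with the usual implicits in scope for $):

  import org.apache.spark.sql.functions.udf

  val f = udf((c0: Int, c1: Int) => (c0 + c1, c0 * c1))   // returns a two-field struct

  val out = df
    .withColumn("r", f($"c0", $"c1"))
    .select($"c0", $"c1", $"r._1".as("x"), $"r._2".as("y"))
  // keeps the existing columns and flattens the struct, with no temporary column to drop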
On Fri, May 27, 2016 at 12:15 PM, Koert Kuipers wrote:
> yes, but i also need all the columns (plus of course the 2 new ones) in my
> output. your sele
e printing.
>
>
>
>
>
>
>
> This message (including any attachments) contains confidential information
> intended for a specific individual and purpose, and is protected by law. If
> you are not the intended recipient, you should delete this message and any
> disclosure, copying, or distribution of this message, or the taking of any
> action based on it, by you is strictly prohibited.
>
> v.E.1
>
>
>
>
>
>
>
>
>
--
---
Takeshi Yamamuro
>
> *Abhishek Kumar*
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
--
---
Takeshi Yamamuro
>
>
>
>
>
>
> NOTE: This message may contain information that is confidential,
> proprietary, privileged or otherwise protected by law. The message is
> intended solely for the named addressee. If received in error, please
> destroy and notify the sender. Any use of this email is prohibited when
> received in error. Impetus does not represent, warrant and/or guarantee,
> that the integrity of this communication has been maintained nor that the
> communication is free of errors, virus, interception or interference.
>
--
---
Takeshi Yamamuro
in(df2, condition, 'inner')??
>
> ps:df1.registertable('t1')
> ps:df2.registertable('t2')
> thanks
>
--
---
Takeshi Yamamuro
, 781, 781, 781, 782, 781, 781, 781, 782, 781,
>> 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781,
>> 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782,
>> 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782)
>>
>> So two partitions are double the size. Is this expected behaviour or is
>> it some kind of bug?
>>
>> Thanks,
>> Maciej Sokołowski
>>
>
>
--
---
Takeshi Yamamuro
1, 781,
>>> 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 781, 782, 781, 781, 781,
>>> 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781,
>>> 781, 782, 781, 781, 781, 781, 1563, 782, 781, 781, 782, 781, 781, 781, 781,
>>> 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781,
>>> 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781,
>>> 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781,
>>> 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782,
>>> 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782)
>>>
>>> So two partitions are double the size. Is this expected behaviour or is
>>> it some kind of bug?
>>>
>>> Thanks,
>>> Maciej Sokołowski
>>>
>>
>>
>
--
---
Takeshi Yamamuro
s
> that mostly end up doing nothing at all. Is it possible to prevent that? I
> assume only if the driver was able to inspect the cached meta data and
> avoid creating tasks for files that aren't used in the first place.
>
>
> On 27 May 2016 at 04:25, Takeshi Yamamuro wrote:
taframes,
> repartitioning is a costly process.
>
>
>
> In short, what are the available strategies or configurations that help
> reading from disk or hdfs with proper executor-data-distribution??
>
>
>
> If this needs to be more specific, I am strictly focused on PARQUET files
> from HDFS. I know there are some MIN
>
>
>
> Really appreciate,
>
> Saif
>
>
>
--
---
Takeshi Yamamuro
t from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
---
Takeshi Yamamuro
ph.pregel in GraphX does not have anything similar to mapPartitions.
>
> Can something like this be done in GraphX Pregel API?
>
--
---
Takeshi Yamamuro
detail ?
>>
>> Which version of Spark are you using ?
>>
>> Thanks
>>
>> On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv <
>> daniel.ha...@veracity-group.com> wrote:
>>
>>> Hi,
>>> Our application is failing due to issues with the torrent broadcast, is
>>> there a way to switch to another broadcast method ?
>>>
>>> Thank you.
>>> Daniel
>>>
>>
>>
>
--
---
Takeshi Yamamuro
see a
>> simple reduceByKey replacement.
>>
>> Regards,
>>
>> Bryan Jeffrey
>>
>>
>
--
---
Takeshi Yamamuro
> been forwarded to you without proper authority, you are notified that any
>> use or dissemination of this information in any manner is strictly
>> prohibited. In such cases, please notify us immediately at i...@yash.com
>> and delete this mail from your records.
>>
>
>
--
---
Takeshi Yamamuro
puted in
> catalyst module.
>
> *Regards,*
> *Srinivasan Hariharan*
> *+91-9940395830 <%2B91-9940395830>*
>
>
>
>
--
---
Takeshi Yamamuro
Is there (i) a way of recovering references to data
> frames that are still persisted in memory, OR (ii) a way of just unpersisting
> all Spark cached variables?
>
>
> Thanks
> --
> Cesar Flores
>
--
---
Takeshi Yamamuro
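For (ii), a small sketch of unpersisting everything without holding references (assumes an active SparkContext sc and SQLContext sqlContext):

  sqlContext.clearCache()   // drops all cached tables and DataFrames

  // and for plain RDDs, walk the persisted-RDD registry:
  sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = true))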
k files the same
>>> way
>>> it's loading 4k files?
>>>
>>> To give you some idea about my setup and data:
>>> - ~70k files across 24 directories in HDFS
>>> - Each directory contains 3k files on average
>>> - Cluster: 200 nodes EMR cluster, each node has 53 GB memory and 8 cores
>>> available to YARN
>>> - Spark 1.6.1
>>>
>>> Thanks.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-limit-on-the-number-of-tasks-in-one-job-tp27158.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com
>>> <http://nabble.com>.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>>
>
--
---
Takeshi Yamamuro
-
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
---
Takeshi Yamamuro
ast_69652 in memory! (computed 496.0 B so far)
>>>>> 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6
>>>>> GB (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6
>>>>> GB.
>>>>> 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to
>>>>> disk instead.
>>>>> 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
>>>>> 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID
>>>>> 452316). 2043 bytes result sent to driver
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> L
>>>>>
>>>>>
>>>> --
>>>
>>> Ben Slater
>>> Chief Product Officer
>>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>>> +61 437 929 798
>>>
>>
>>
>
--
---
Takeshi Yamamuro
subscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
--
---
Takeshi Yamamuro
code which works as expected:
>> df.groupBy("uid").count().select("uid")
>>
>> Thanks!
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience
>>
>>
>
--
---
Takeshi Yamamuro
ood
> description/guide of using this syntax I would be willing to contribute
> some documentation.
>
> Pedro
>
> On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> In 2.0, you can say;
>> val ds = Seq[Tuple2[Int, Int]]((1, 0),
trying to avoid map since my impression is that this uses a Scala
>> closure so is not optimized as well as doing column-wise operations is.
>>
>> Looks like the $ notation is the way to go, thanks for the help. Is there
>> an explanation of how this works? I imagine it is a method/f
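A small runnable sketch of the column-expression style being discussed (Spark 2.0, assuming a spark-shell style SparkSession named spark):

  import spark.implicits._

  val ds = Seq((1, 0), (2, 3), (4, 5)).toDS()
  ds.select($"_1" + $"_2").show()   // column expressions instead of a Scala map closure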
any app I can use safely without exceeding
> memory allocated etc?
>
> Thanking you
>
>
>
--
---
Takeshi Yamamuro
you
>
> What are the main differences between local mode and standalone mode? I
> understand local mode does not support a cluster. Is that the only difference?
>
>
>
> On Sunday, 19 June 2016, 9:52, Takeshi Yamamuro
> wrote:
>
>
> Hi,
>
> In a local mode, spark
object which held SparkContext as a member, eg:
> object A {
> sc: SparkContext = new SparkContext
> def mapFunction {}
> }
>
> and when I called rdd.map(A.mapFunction) it failed as A.sc is not
> serializable.
>
> Thanks,
> Daniel
>
> On Tue, Jun 7
y (unique
> clustered index) on ID column
>
> If I load it through JDBC into a DataFrame and register it
> via registerTempTable, will the data be in the order of ID in the temp table?
>
> Thanks
>
--
---
Takeshi Yamamuro
a BroadcastHashJoin instead of SortMergeJoin automatically. )
>
>
>
>
>
>
--
---
Takeshi Yamamuro
ashjoin automatically when it "finds" that an output RDD is
> small enough?
>
>
> ------
> From: Takeshi Yamamuro
> Sent: Monday, June 20, 2016, 19:16
> To: 梅西0247
> Cc: user
> Subject: Re: Is it possible to turn a S
--
---
Takeshi Yamamuro
I find it confusing that I need to supply a column there. It feels like it
>>>> might be distinct count or something. This can be seen in latest
>>>> documentation
>>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$>
>>>> .
>>>>
>>>> I am considering filling this in spark bug tracker. Any opinions on
>>>> this?
>>>>
>>>> Thanks
>>>>
>>>> Jakub
>>>>
>>>>
>>>
>>
>
--
---
Takeshi Yamamuro
eger merge(Integer b1, Integer b2) {
> if (b1 == null) {
> return b2;
> } else if (b2 == null) {
> return b1;
> } else {
> return b1 + b2;
> }
> }
>
>
--
---
Takeshi Yamamuro
wrote:
> Not sure what the rule is in the case of `b + null = null`, but the same
> code works perfectly in 1.6.1, just tried it..
>
> On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> This behaviour seems to be expected because y
n.
>
> On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro
> wrote:
>
>> Whatever it is, this is expected; if an initial value is null, spark
>> codegen removes all the aggregates.
>> See:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala
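One workaround consistent with that behaviour is to give the aggregator a non-null initial value and handle nulls in reduce instead (a sketch; the Integer types mirror the Java snippet earlier in the thread):

  import org.apache.spark.sql.{Encoder, Encoders}
  import org.apache.spark.sql.expressions.Aggregator

  val sumAgg = new Aggregator[Integer, Integer, Integer] {
    def zero: Integer = Integer.valueOf(0)   // non-null initial value
    def reduce(b: Integer, a: Integer): Integer =
      if (a == null) b else Integer.valueOf(b.intValue + a.intValue)
    def merge(b1: Integer, b2: Integer): Integer =
      Integer.valueOf(b1.intValue + b2.intValue)
    def finish(r: Integer): Integer = r
    def bufferEncoder: Encoder[Integer] = Encoders.INT
    def outputEncoder: Encoder[Integer] = Encoders.INT
  }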
isclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 June 2016 at 14:32, Jestin Ma wrote:
>
>> If it’s not too much trouble, could I get some pointers/help on this?
>> (see link)
>>
>> http://stackoverflow.com/questions/38085801/can-dataframe-joins-in-spark-preserve-order
>>
>> -also, as a side question, do Dataframes support easy reordering of
>> columns?
>>
>> Thank you!
>> Jestin
>>
>
>
--
---
Takeshi Yamamuro
-
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
---
Takeshi Yamamuro
; of memory on my driver machine. Reasons for one instance per machine I do
>>> not understand.
>>>
>>> --
>>>
>>>
>>> *Sincerely yoursEgor Pakhomov*
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>
--
---
Takeshi Yamamuro
gt;>>>>>>> 1) One instance per machine
>>>>>>>> 2) Yarn client only(not yarn cluster)
>>>>>>>>
>>>>>>>> Are there any architectural reasons for such limitations? About
>>>>>>>> yarn-client I might understand in theory - master is the same process
>>>>>>>> as a
>>>>>>>> server, so it makes some sense, but it's really inconvenient - I need
>>>>>>>> a lot
>>>>>>>> of memory on my driver machine. Reasons for one instance per machine I
>>>>>>>> do
>>>>>>>> not understand.
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>>
>>>>>>>> *Sincerely yoursEgor Pakhomov*
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Best Regards
>>>>>>>
>>>>>>> Jeff Zhang
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>
>>>>>> *Sincerely yoursEgor Pakhomov*
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Best Regards
>>>>>
>>>>> Jeff Zhang
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> *Sincerely yoursEgor Pakhomov*
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>>
>>
>> *Sincerely yoursEgor Pakhomov*
>>
>
>
>
> --
>
>
> *Sincerely yoursEgor Pakhomov*
>
--
---
Takeshi Yamamuro
t mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
--
---
Takeshi Yamamuro
l?
>
>
> Thanks,
> Lalitha
>
--
---
Takeshi Yamamuro
ithub.com/EntilZha | LinkedIn
> <https://www.linkedin.com/in/pedrorodriguezscience>
>
--
---
Takeshi Yamamuro
st)?
>
> Is there any documentation demonstrating rule based transformation for
> physical plan trees?
>
> Thanks,
> Lalitha
>
> On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro
> wrote:
>
>> Hi,
>>
>> No, spark has no hint for the hash join.
4, 2016 at 10:17 PM, Takeshi Yamamuro
> wrote:
>
>> The join selection can be described in
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92
>> .
>> If you have join keys, you can set -1 at
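Assuming the setting being referred to is spark.sql.autoBroadcastJoinThreshold (an assumption based on the join-selection code linked above), a sketch of both directions (largeDf, smallDf, and "key" are placeholders):

  // disable automatic broadcast-hash joins entirely
  sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")

  // or mark one side to be broadcast explicitly
  import org.apache.spark.sql.functions.broadcast
  val joined = largeDf.join(broadcast(smallDf), largeDf("key") === smallDf("key"))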
>>> >
>>> > +---+----+
>>> > |_c0|code|
>>> > +---+----+
>>> > | 18|  AS|
>>> > | 16|    |
>>> > | 13|  UK|
>>> > | 14|  US|
>>> > | 20|  As|
>>> > | 15|  IN|
>>> > | 19|  IR|
>>> > | 11|  PK|
>>> > +---+----+
>>> >
>>> > I am expecting the output below. Any idea how to apply IS NOT NULL?
>>> >
>>> > +---+----+
>>> > |_c0|code|
>>> > +---+----+
>>> > | 18|  AS|
>>> > | 13|  UK|
>>> > | 14|  US|
>>> > | 20|  As|
>>> > | 15|  IN|
>>> > | 19|  IR|
>>> > | 11|  PK|
>>> > +---+----+
>>> >
>>> >
>>> >
>>> > Thanks & Regards
>>> >Radha krishna
>>> >
>>> >
>>>
>>
>>
>>
>> --
>>
>>
>>
>>
>>
>>
>>
>>
>> Thanks & Regards
>>Radha krishna
>>
>>
>>
--
---
Takeshi Yamamuro
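Note that the blank row ("| 16|    |") is most likely an empty string rather than a SQL NULL, so a filter may need to handle both (a sketch; the column name code is taken from the output above):

  import org.apache.spark.sql.functions.{col, length, trim}

  df.filter(col("code").isNotNull && length(trim(col("code"))) > 0).show()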
ks:
>
> path = '/data/train_parquet/0_0_0.parquet'
> train0_df = sqlContext.read.load(path)
> train_df.take(1)
>
> Thanks in advance.
>
> Samir
>
--
---
Takeshi Yamamuro
Because It’s only
>> test in local with local mode If I deploy on mesos cluster what would
>> happened?
>>
>> Need you guys suggests some solutions for that. Thanks.
>>
>> Chanh
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>
--
---
Takeshi Yamamuro
iveChunkList.readAll(ParquetFileReader.java:755)
>> at
>> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
>> at
>> org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.checkEndOfRowGroup(UnsafeRowParquetRecord
>>
>
t;>>- *Cores in use:* 28 Total, 2 Used
>>>- *Memory in use:* 56.0 GB Total, 1024.0 MB Used
>>>- *Applications:* 1 Running, 6 Completed
>>>- *Drivers:* 0 Running, 0 Completed
>>>- *Status:* ALIVE
>>>
>>> Each worker has 8 cores and 4GB memory.
>>>
>>> My question is: how do people running in production decide these
>>> properties -
>>>
>>> 1) --num-executors
>>> 2) --executor-cores
>>> 3) --executor-memory
>>> 4) num of partitions
>>> 5) spark.default.parallelism
>>>
>>> Thanks,
>>> Kartik
>>>
>>>
>>>
>>
>
--
---
Takeshi Yamamuro
keley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>
--
---
Takeshi Yamamuro
>> On Jul 24, 2016, at 5:34 PM, janardhan shetty
>> wrote:
>>
>> We have data in Bz2 compression format. Are there any links on converting it
>> to Parquet with Spark, and any performance benchmarks or case-study materials?
>>
>>
>>
>
--
---
Takeshi Yamamuro
Driver.java:425)
>> at
>> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166)
>> at
>> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>> at
>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at
>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at
>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Error in query: cannot recognize input near 'parquetTable' 'USING' 'org'
>> in table name; line 2 pos 0
>>
>>
>> Am I using it in the wrong way?
>>
>>
>>
>>
>>
>> thanks
>>
>
--
---
Takeshi Yamamuro