Re: [ANNOUNCE] Announcing Apache Spark 3.0.1

2020-09-11 Thread Takeshi Yamamuro
leases/spark-release-3-0-1.html >> >> We would like to acknowledge all community members for contributing to >> this release. This release would not have been possible without you. >> >> >> Thanks, >> Ruifeng Zheng >> >> -- --- Takeshi Yamamuro

Re: Custom JdbcConnectionProvider

2020-10-27 Thread Takeshi Yamamuro
they are not used. Do I need to register somehow > them? Could someone share a relevant example? > Thx. > > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > --------- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: Custom JdbcConnectionProvider

2020-10-27 Thread Takeshi Yamamuro
2020 at 2:31 PM Takeshi Yamamuro > wrote: > >> Hi, >> >> Please see an example code in >> https://github.com/gaborgsomogyi/spark-jdbc-connection-provider ( >> https://github.com/apache/spark/pull/29024). >> Since it depends on the service loader, I think you

Re: Spark SQL Dataset and BigDecimal

2021-02-17 Thread Takeshi Yamamuro
> Is there any performance penalty for using scala BigDecimal? it's more > convenient from an API point of view than java.math.BigDecimal. > -- --- Takeshi Yamamuro

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Takeshi Yamamuro
;>> and early feedback to >>> this release. This release would not have been possible without you. >>> >>> To download Spark 3.1.1, head over to the download page: >>> http://spark.apache.org/downloads.html >>> >>> To view the release notes: >>> https://spark.apache.org/releases/spark-release-3-1-1.html >>> >>> -- --- Takeshi Yamamuro

Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Takeshi Yamamuro
e would like to acknowledge all community members for contributing to >>> this >>> release. This release would not have been possible without you. >>> >>> Dongjoon Hyun >>> >> > > -- > > -- --- Takeshi Yamamuro

Re: Spark reduce serialization question

2016-03-06 Thread Takeshi Yamamuro
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1007) > > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > > at org.apache.spark.rdd.RDD.withScope(RDD.scala:310) > > at org.apache.spark.rdd.RDD.reduce(RDD.scala:989) > > at BIDMach.RunOnSpark$.runOnSpark(RunOnSpark.scala:50) > > ... 50 elided > > -- --- Takeshi Yamamuro

Re: OOM exception during Broadcast

2016-03-07 Thread Takeshi Yamamuro
putStream.java:1921) >>>> at >>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) >>>> at >>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) >>>> at >>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) >>>> at >>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) >>>> at >>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) >>>> at >>>> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) >>>> at >>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) >>>> at >>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) >>>> >>>> >>>> I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark >>>> property maximizeResourceAllocation is set to true (executor.memory = 48G >>>> according to spark ui environment). We're also using kryo serialization and >>>> Yarn is the resource manager. >>>> >>>> Any ideas as what might be going wrong and how to debug this? >>>> >>>> Thanks, >>>> Arash >>>> >>>> >>> >>> >>> -- >>> Best Regards >>> >>> Jeff Zhang >>> >> >> > -- --- Takeshi Yamamuro

Re: OOM exception during Broadcast

2016-03-07 Thread Takeshi Yamamuro
at >> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) >> at >> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) >> at >> java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) >> at >> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1997) >> at >> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1921) >> >> >> I'm using spark 1.5.2. Cluster nodes are amazon r3.2xlarge. The spark >> property maximizeResourceAllocation is set to true (executor.memory = 48G >> according to spark ui environment). We're also using kryo serialization and >> Yarn is the resource manager. >> >> Any ideas as what might be going wrong and how to debug this? >> >> Thanks, >> Arash >> >> >> > -- --- Takeshi Yamamuro

Re: Spark SQL Optimization

2016-03-23 Thread Takeshi Yamamuro
ype#13666 = TEST) >> Scan >> >> ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table2.parquet][perm_group#13667,user_group#13665,perm_type#13666] >> ConvertToUnsafe >> Scan >> >> ParquetRelation[snackfs://shared:9042/aladdin_data_beta/table1.parquet][portfolio_code#14119,anc_port_group#14117] >> >> >> >> >> >> >> >> >> >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Optimization-tp26548p26553.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > -- --- Takeshi Yamamuro

Re: Spark and DB connection pool

2016-03-23 Thread Takeshi Yamamuro
er-list.1001560.n3.nabble.com/Spark-and-DB-connection-pool-tp26577.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For addi

Re: Best way to determine # of workers

2016-03-24 Thread Takeshi Yamamuro
ent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- --- Takeshi Yamamuro

Fwd: Forcing data from disk to memory

2016-03-24 Thread Takeshi Yamamuro
just re-sent, -- Forwarded message -- From: Takeshi Yamamuro Date: Thu, Mar 24, 2016 at 5:19 PM Subject: Re: Forcing data from disk to memory To: Daniel Imberman Hi, There is no direct way to do this; you need to unpersist the cached data, then re-cache it as MEMORY_ONLY. // maropu
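A minimal sketch of that unpersist-then-re-cache approach, assuming an already cached RDD bound to the (hypothetical) name rdd:

    import org.apache.spark.storage.StorageLevel

    rdd.unpersist(blocking = true)           // drop the existing, partly disk-backed copy
    rdd.persist(StorageLevel.MEMORY_ONLY)    // mark it for in-memory-only caching
    rdd.count()                              // run an action so the new cache is materialized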

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-24 Thread Takeshi Yamamuro
ser-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: Forcing data from disk to memory

2016-03-25 Thread Takeshi Yamamuro
I run > multiple queries against it the RDD fully persists, but this means that the > first 4/5 queries I run are extremely slow. > > Is there any way I can make sure that the entire RDD ends up in memory the > first time I load it? > > Thank you > > On Thu, Ma

Re: data frame problem preserving sort order with repartition() and coalesce()

2016-03-29 Thread Takeshi Yamamuro
:{0} quotient:{1} remander:{2} repartition({3})" > > .format(numRows, quotient, remander, numPartitions)) > > print(debugStr) > > ​ > > csvDF = resultDF.coalesce(numPartitions) > > ​ > > orderByColName = "count" > > csvDF = cs

Re: Sqoop on Spark

2016-04-05 Thread Takeshi Yamamuro
> Best Regards, > Ayan Guha > -- --- Takeshi Yamamuro

Re: Sqoop on Spark

2016-04-05 Thread Takeshi Yamamuro
an option to create Hive table, driven by job specific > configuration. > > What do you think? > > Best > Ayan > > On Wed, Apr 6, 2016 at 2:30 PM, Takeshi Yamamuro > wrote: > >> Hi, >> >> It depends on your use case using sqoop. >> What

Re: Memory needs when using expensive operations like groupBy

2016-04-14 Thread Takeshi Yamamuro
> -XX:PermSize=256m --conf spark.driver.extraJavaOptions > -XX:MaxPermSize=1024m -XX:PermSize=256m --conf > spark.yarn.executor.memoryOverhead=1024 > > Need to know the best practices/better ways to optimize code. > > Thanks, > Divya > > -- --- Takeshi Yamamuro

Re: When did Spark started supporting ORC and Parquet?

2016-04-14 Thread Takeshi Yamamuro
368 > > FYI > > On Thu, Apr 14, 2016 at 6:53 PM, Edmon Begoli wrote: > >> I am needing this fact for the research paper I am writing right now. >> >> When did Spark start supporting Parquet and when ORC? >> (what release) >> >> I appreciate any i

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Takeshi Yamamuro
>> arkadiusz.b...@gmail.com> >>> > wrote: >>> >> >>> >> Hello, >>> >> >>> >> Is there any statistics regarding YARN vs Standalone Spark Usage in >>> >> production ? >>> >> >>> >> I would like to choose most supported and used technology in >>> >> production for our project. >>> >> >>> >> >>> >> BR, >>> >> >>> >> Arkadiusz Bicz >>> >> >>> >> - >>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> >> For additional commands, e-mail: user-h...@spark.apache.org >>> >> >>> > >>> >> >> > -- --- Takeshi Yamamuro

Re: Strange bug: Filter problem with parenthesis

2016-04-14 Thread Takeshi Yamamuro
afterwards. > > > > PS: Workaround is to use toDF(cols) and rename all columns, but I am > wondering if toDF has any impact on the RDD structure behind (e.g. > repartitioning, cache, etc) > > > > Appreciated, > > Saif > > > > > -- --- Takeshi Yamamuro

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Takeshi Yamamuro
thub.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql >>>>> >>>>> *2. * Range filter queries on Long types >>>>> >>>>> *code*: >>>>> >>>>>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and >>>>>> `length` <= '1000'") >>>>> >>>>> *Full example*: >>>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151 >>>>> *plan*: >>>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql >>>>> >>>>> The SolrRelation class we use extends >>>>> <https://github.com/lucidworks/spark-solr/blob/master/src/main/scala/com/lucidworks/spark/SolrRelation.scala#L37> >>>>> the PrunedFilteredScan. >>>>> >>>>> Since Solr supports date ranges, I would like for the timestamp >>>>> filters to be pushed down to the Solr query. >>>>> >>>>> Are there limitations on the type of filters that are passed down with >>>>> Timestamp types ? >>>>> Is there something that I should do in my code to fix this ? >>>>> >>>>> Thanks, >>>>> -- >>>>> Kiran Chitturi >>>>> >>>>> >>>> >> > -- --- Takeshi Yamamuro

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Takeshi Yamamuro
any best practices or has anyone come across a case like this > before where a write to a database seems to perform poorly? > > Thanks, > Jon > -- --- Takeshi Yamamuro

Re: Spark 1.6.1 DataFrame write to JDBC

2016-04-20 Thread Takeshi Yamamuro
Sorry, the previous message was sent off mid-way. How about trying to increase 'batchsize' in a jdbc option to improve performance? // maropu On Thu, Apr 21, 2016 at 2:15 PM, Takeshi Yamamuro wrote: > Hi, > > How about trying to increase 'batchsize > > On Wed, Apr 20, 2016
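A sketch of that suggestion for a 1.6.x DataFrame write, with df, jdbcUrl and target_table as hypothetical names; whether the batchsize property is honored this way depends on the Spark version:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "dbuser")          // hypothetical credentials
    props.setProperty("password", "dbpass")
    props.setProperty("batchsize", "10000")      // larger insert batches per round trip

    df.write.mode("append").jdbc(jdbcUrl, "target_table", props)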

Re: Profiling memory use and access

2016-04-24 Thread Takeshi Yamamuro
y contributions in a paper, or add you as a co-author > if you have any significant contribution (and if interested). > > Thank you, > Edmon > -- --- Takeshi Yamamuro

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
ote: > >> Hi, >> I am using Spark 1.5.2 . >> I have a use case where I need to join the same dataframe twice on two >> different columns. >> I am getting error missing Columns >> >> For instance , >> val df1 = df2.join(df3,"Column1") >> Below throwing error missing columns >> val df 4 = df1.join(df3,"Column2") >> >> Is the bug or valid scenario ? >> >> >> >> >> Thanks, >> Divya >> > > -- --- Takeshi Yamamuro

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
d(b,IntegerType,false), StructField(b,IntegerType,false)) > > On Tue, Apr 26, 2016 at 8:54 PM, Takeshi Yamamuro > wrote: > >> Hi, >> >> I tried; >> val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b") >> val df2 = Seq((1, 1), (2, 2), (

Re: Cant join same dataframe twice ?

2016-04-26 Thread Takeshi Yamamuro
df2("b").as("2-b")) val df4 = df3.join(df2, df3("2-b") === df2("b")) // maropu On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot wrote: > Correct Takeshi > Even I am facing the same issue . > > How to avoid the ambiguity ? > > > On 27 Apri
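A reconstruction of the pattern in the two truncated replies above (the join condition in the middle of the reply is not visible, so an equi-join on "a" is assumed): alias the column taken from the first join so the second join against the same df2 is unambiguous.

    // assumes import sqlContext.implicits._ (spark.implicits._ in 2.x)
    val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
    val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")

    // First join, keeping df2's "b" under a distinct alias.
    val df3 = df1.join(df2, df1("a") === df2("a")).select(df1("a"), df2("b").as("2-b"))
    // Second join on the same df2 can now reference both sides unambiguously.
    val df4 = df3.join(df2, df3("2-b") === df2("b"))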

Re: removing header from csv file

2016-04-26 Thread Takeshi Yamamuro
of csv files. But it seems it works with sqlcontext only. Is > there a way to remove header from csv files without sqlcontext ? > > Thanks > Ashutosh > -- --- Takeshi Yamamuro
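The reply itself is cut off above; a common RDD-only approach (no sqlContext involved) is to drop the first line of the first partition, assuming the header only appears at the top of the input:

    val rdd = sc.textFile("/path/to/data.csv")       // hypothetical path
    val noHeader = rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (idx == 0) iter.drop(1) else iter           // skip the header line
    }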

Re: DataFrame cannot find temporary table

2016-05-09 Thread Takeshi Yamamuro
first: string (nullable = true) >> |-- last: string (nullable = true) >> >> *Error while accessing table:* >> Exception in thread "main" org.apache.spark.sql.AnalysisException: Table >> not found: person; >> >> Does anyone have solution for this? >> >> Thanks, >> Asmath >> > > -- --- Takeshi Yamamuro

Re: Spark 1.6 Catalyst optimizer

2016-05-12 Thread Takeshi Yamamuro
9:24 GMT+01:00 Michael Armbrust : > >> >>> logical plan after optimizer execution: >>> >>> Project [id#0L,id#1L] >>> !+- Filter (id#0L = cast(1 as bigint)) >>> ! +- Join Inner, Some((id#0L = id#1L)) >>> ! :- Subquery t >>> ! : +- Relation[id#0L] JSONRelation >>> ! +- Subquery u >>> ! +- Relation[id#1L] JSONRelation >>> >> >> I think you are mistaken. If this was the optimized plan there would be >> no subqueries. >> > > -- --- Takeshi Yamamuro

Re: Spark handling spill overs

2016-05-12 Thread Takeshi Yamamuro
ver after filling the node's memory. > > Thanks > > > > -- --- Takeshi Yamamuro

Re: GC overhead limit exceeded

2016-05-16 Thread Takeshi Yamamuro
tor$Executor$$reportHeartBeat(Executor.scala:449) > at > > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470) > at > > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470) > at > > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470) > at > org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765) > at > org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745)" > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/GC-overhead-limit-exceeded-tp26966.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: GC overhead limit exceeded

2016-05-16 Thread Takeshi Yamamuro
ap space"* > But why spark doesn't split data into a disk? > > On Mon, May 16, 2016 at 5:11 PM, Takeshi Yamamuro > wrote: > >> Hi, >> >> Why did you though you have enough memory for your task? You checked task >> statistics in your WebUI? >> An

Re: Error joining dataframes

2016-05-17 Thread Takeshi Yamamuro
"Id" from "join_test" >>> >>> 0: jdbc:hive2://> *select Id from join_test;* >>> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is >>> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0) >>> 0: jdbc:hive2://> >>> >>> Is there a way to merge the value of df1("Id") and df2("Id") into one >>> "Id" >>> >>> Thanks >>> >> >> > -- --- Takeshi Yamamuro

Re: Error joining dataframes

2016-05-17 Thread Takeshi Yamamuro
gt;>>> Hi, >>>> >>>> I tried to join two dataframe >>>> >>>> df_join = df1.*join*(df2, ((df1("Id") === df2("Id")), "fullouter") >>>> >>>> df_join.registerTempTable("join_test") >>>> >>>> >>>> When querying "Id" from "join_test" >>>> >>>> 0: jdbc:hive2://> *select Id from join_test;* >>>> *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is >>>> *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0) >>>> 0: jdbc:hive2://> >>>> >>>> Is there a way to merge the value of df1("Id") and df2("Id") into one >>>> "Id" >>>> >>>> Thanks >>>> >>> >>> >> > -- --- Takeshi Yamamuro

Re: SPARK - DataFrame for BulkLoad

2016-05-17 Thread Takeshi Yamamuro
ecords to be inserted to a HBase table (PHOENIX) as a > result of a Spark Job. I would like to know if i convert it to a Dataframe > and save it, will it do Bulk load (or) it is not the efficient way to write > data to a HBase table > > -- > Thanks and Regards > Mohan > -- --- Takeshi Yamamuro

Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
'Id' is > *ambiguous*, could be: Id#128, Id#155.; line 1 pos 7 (state=,code=0) > > I am currently using spark 1.5.2. > Is there any alternative way in 1.5 > > Thanks > > On Wed, May 18, 2016 at 12:12 PM, Takeshi Yamamuro > wrote: > >> Also, you can p

Re: Error joining dataframes

2016-05-18 Thread Takeshi Yamamuro
--++++ > > > but, when you query the "id" > > > sqlContext.sql("select id from test") > > *Error*: org.apache.spark.sql.AnalysisException: Reference 'Id' is > *ambiguous*, could be: Id#1
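One workaround that stays within 1.5.x (a sketch, with df1/df2 as in the thread): rename the right-hand key before the full outer join and then coalesce the two keys back into a single "Id".

    import org.apache.spark.sql.functions.coalesce

    val right  = df2.withColumnRenamed("Id", "Id_r")
    val joined = df1.join(right, df1("Id") === right("Id_r"), "fullouter")

    // Merge the two key columns into one unambiguous "Id".
    val merged = joined
      .withColumn("MergedId", coalesce(joined("Id"), joined("Id_r")))
      .drop("Id")
      .drop("Id_r")
      .withColumnRenamed("MergedId", "Id")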

Re: Is there a way to merge parquet small files?

2016-05-20 Thread Takeshi Yamamuro
--------- >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > -- --- Takeshi Yamamuro

Re: Spark.default.parallelism can not set reduce number

2016-05-20 Thread Takeshi Yamamuro
ut/userprofile/20160519/part-2 > 2016-05-20 17:01 > /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-3 > 2016-05-20 17:01 > /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-4 > 2016-05-20 17:01 > /warehouse/dmpv3.db/datafile/tmp/output/userprofile/20160519/part-5 > > > > > -- --- Takeshi Yamamuro
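The reply is not visible in this snippet; one setting that is often the relevant knob for SQL/Hive output like the listing above is spark.sql.shuffle.partitions, since spark.default.parallelism only affects RDD shuffles — a sketch:

    // SQL/DataFrame shuffles take their reducer count from this property,
    // not from spark.default.parallelism.
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")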

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
i All, >> >> I have two RDDs A and B where in A is of size 30 MB and B is of size 7 >> MB, A.cartesian(B) is taking too much time. Is there any bottleneck in >> cartesian operation ? >> >> I am using spark 1.6.0 version >> >> Regards, >> Padma Ch >> > > -- --- Takeshi Yamamuro

Re: Cartesian join on RDDs taking too much time

2016-05-25 Thread Takeshi Yamamuro
d) and using > saveAsTextFile, trying to save it. However, this is also taking too much > time. > > Thanks, > Padma Ch > > On Wed, May 25, 2016 at 1:32 PM, Takeshi Yamamuro > wrote: > >> Hi, >> >> Seems you'd be better off using DataFrame#join inst

Re: Facing issues while reading parquet file in spark 1.2.1

2016-05-25 Thread Takeshi Yamamuro
(ParquetRelation.scala:65) > > at > org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:165) > > Please suggest. It seems like it not able to convert some data > -- --- Takeshi Yamamuro

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-25 Thread Takeshi Yamamuro
"_temp_")("_1")) > .withColumn("y", col("_temp_")("_2")) > .drop("_temp_") > > this works, but its not pretty with the temporary field stuff. > > i also tried this: > val tmp = udf(inputColumns) > df > .withColum

Re: Spark Job Execution halts during shuffle...

2016-05-26 Thread Takeshi Yamamuro
figurations is - 3 node cluster (1 master and 2 slaves). Each slave has > 1 TB hard disk space, 300GB memory and 32 cores. > > HDFS block size is 128 MB. > > Thanks, > Padma Ch > -- --- Takeshi Yamamuro

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Takeshi Yamamuro
Couldn't you include all the needed columns in your input dataframe? // maropu On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers wrote: > that is nice and compact, but it does not add the columns to an existing > dataframe > > On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro
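A self-contained sketch of that suggestion, reconstructed from the select($"r._1", ...) fragment later in the thread: apply the tuple-returning UDF once, carry all the needed columns through the same select, then unpack the struct fields (df and the column names c0/c1/cX are placeholders):

    import org.apache.spark.sql.functions.udf
    // assumes import sqlContext.implicits._ (spark.implicits._ in 2.x) for $"..."

    val f = udf((x: Int) => (x + 1, x * 2))   // hypothetical UDF producing two values

    val out = df
      .select(f($"cX").as("r"), $"c0", $"c1", $"cX")
      .select($"r._1".as("x"), $"r._2".as("y"), $"c0", $"c1", $"cX")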

Re: Spark input size when filtering on parquet files

2016-05-26 Thread Takeshi Yamamuro
as 2.4 MB, totaling up to an overall input size of 9.7 MB for the > whole stage? Isn't it just meant to read the metadata and ignore the > content of the file? > > > > Regards, > > Dennis > -- --- Takeshi Yamamuro

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Takeshi Yamamuro
", $"cX").select($"r._1", $"r._2", $"c0", $"c1", $"cX") // maropu On Fri, May 27, 2016 at 12:15 PM, Koert Kuipers wrote: > yes, but i also need all the columns (plus of course the 2 new ones) in my > output. your sele

Re: GraphX Java API

2016-05-29 Thread Takeshi Yamamuro
e printing. > > > > > > > > This message (including any attachments) contains confidential information > intended for a specific individual and purpose, and is protected by law. If > you are not the intended recipient, you should delete this message and any > disclosure, copying, or distribution of this message, or the taking of any > action based on it, by you is strictly prohibited. > > v.E.1 > > > > > > > > > -- --- Takeshi Yamamuro

Re: GraphX Java API

2016-05-29 Thread Takeshi Yamamuro
; > *Abhishek Kumar* > > > > > > > > This message (including any attachments) contains confidential information > intended for a specific individual and purpose, and is protected by law. If > you are not the intended recipient, you should delete this message and any > disclosure, copying, or distribution of this message, or the taking of any > action based on it, by you is strictly prohibited. > > v.E.1 > > > > > > > > > -- --- Takeshi Yamamuro

Re: FAILED_TO_UNCOMPRESS Error - Spark 1.3.1

2016-05-30 Thread Takeshi Yamamuro
> > > > > > > NOTE: This message may contain information that is confidential, > proprietary, privileged or otherwise protected by law. The message is > intended solely for the named addressee. If received in error, please > destroy and notify the sender. Any use of this email is prohibited when > received in error. Impetus does not represent, warrant and/or guarantee, > that the integrity of this communication has been maintained nor that the > communication is free of errors, virus, interception or interference. > -- --- Takeshi Yamamuro

Re: equvalent beewn join sql and data frame

2016-05-30 Thread Takeshi Yamamuro
in(df2,condition,'inner")?? > > ps:df1.registertable('t1') > ps:df2.registertable('t2') > thanks > -- --- Takeshi Yamamuro

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Takeshi Yamamuro
, 781, 781, 781, 782, 781, 781, 781, 782, 781, >> 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, >> 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782, >> 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782) >> >> So two partitions are double the size. Is this expected behaviour or is >> it some kind of bug? >> >> Thanks, >> Maciej Sokołowski >> > > -- --- Takeshi Yamamuro

Re: Splitting RDD to exact number of partitions

2016-05-31 Thread Takeshi Yamamuro
1, 781, >>> 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 781, 782, 781, 781, 781, >>> 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, >>> 781, 782, 781, 781, 781, 781, 1563, 782, 781, 781, 782, 781, 781, 781, 781, >>> 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 782, 781, 781, >>> 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782, 781, >>> 781, 781, 782, 781, 781, 782, 781, 781, 782, 781, 781, 781, 781, 782, 781, >>> 781, 781, 782, 781, 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 782, >>> 781, 781, 782, 781, 781, 781, 781, 782, 781, 781, 781, 782) >>> >>> So two partitions are double the size. Is this expected behaviour or is >>> it some kind of bug? >>> >>> Thanks, >>> Maciej Sokołowski >>> >> >> > -- --- Takeshi Yamamuro

Re: Spark input size when filtering on parquet files

2016-06-01 Thread Takeshi Yamamuro
s > that mostly end up doing nothing at all. Is it possible to prevent that? I > assume only if the driver was able to inspect the cached meta data and > avoid creating tasks for files that aren't used in the first place. > > > On 27 May 2016 at 04:25, Takeshi Yamamuro wrote: &g

Re: Strategies for propery load-balanced partitioning

2016-06-03 Thread Takeshi Yamamuro
taframes, > repartitioning is a costly process. > > > > In short, what are the available strategies or configurations that help > reading from disk or hdfs with proper executor-data-distribution?? > > > > If this needs to be more specific, I am strictly focused on PARQUET files > rom HDFS. I know there are some MIN > > > > Really appreciate, > > Saif > > > -- --- Takeshi Yamamuro

Re: Custom positioning/partitioning Dataframes

2016-06-03 Thread Takeshi Yamamuro
t from the Apache Spark User List mailing list archive at Nabble.com. > > - > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: How to generate seeded random numbers in GraphX Pregel API vertex procedure?

2016-06-03 Thread Takeshi Yamamuro
ph.pregel in GraphX does not have anything similar to mapPartitions. > > Can something like this be done in GraphX Pregel API? > -- --- Takeshi Yamamuro

Re: Switching broadcast mechanism from torrrent

2016-06-07 Thread Takeshi Yamamuro
detail ? >> >> Which version of Spark are you using ? >> >> Thanks >> >> On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv < >> daniel.ha...@veracity-group.com> wrote: >> >>> Hi, >>> Our application is failing due to issues with the torrent broadcast, is >>> there a way to switch to another broadcast method ? >>> >>> Thank you. >>> Daniel >>> >> >> > -- --- Takeshi Yamamuro

Re: Dataset - reduceByKey

2016-06-07 Thread Takeshi Yamamuro
see a >> simple reduceByKey replacement. >> >> Regards, >> >> Bryan Jeffrey >> >> > -- --- Takeshi Yamamuro

Re: Saving Parquet files to S3

2016-06-09 Thread Takeshi Yamamuro
> been forwarded to you without proper authority, you are notified that any >> use or dissemination of this information in any manner is strictly >> prohibited. In such cases, please notify us immediately at i...@yash.com >> and delete this mail from your records. >> > > -- --- Takeshi Yamamuro

Re: Catalyst optimizer cpu/Io cost

2016-06-10 Thread Takeshi Yamamuro
puted in > catalyst module. > > *Regards,* > *Srinivasan Hariharan* > *+91-9940395830 <%2B91-9940395830>* > > > > -- --- Takeshi Yamamuro

Re: Cleaning spark memory

2016-06-10 Thread Takeshi Yamamuro
Is there a way of: (i) is there a way of recovering references to data > frames that are still persisted in memory OR (ii) a way of just unpersist > all spark cached variables? > > > Thanks > -- > Cesar Flores > -- --- Takeshi Yamamuro
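For (ii), a sketch of two calls that together drop everything cached, assuming a live SQLContext and SparkContext:

    // Clear every table/DataFrame cached through the SQL cache manager ...
    sqlContext.clearCache()

    // ... and unpersist every RDD the SparkContext still tracks as persistent.
    sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = true))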

Re: Is there a limit on the number of tasks in one job?

2016-06-13 Thread Takeshi Yamamuro
k files the same >>> way >>> it's loading 4k files? >>> >>> To give you some idea about my setup and data: >>> - ~70k files across 24 directories in HDFS >>> - Each directory contains 3k files on average >>> - Cluster: 200 nodes EMR cluster, each node has 53 GB memory and 8 cores >>> available to YARN >>> - Spark 1.6.1 >>> >>> Thanks. >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-user-list.1001560.n3.nabble.com/Is-there-a-limit-on-the-number-of-tasks-in-one-job-tp27158.html >>> Sent from the Apache Spark User List mailing list archive at Nabble.com >>> <http://nabble.com>. >>> >>> - >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >>> For additional commands, e-mail: user-h...@spark.apache.org >>> >>> >> >> > -- --- Takeshi Yamamuro

Re: how to investigate skew and DataFrames and RangePartitioner

2016-06-14 Thread Takeshi Yamamuro
- > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-15 Thread Takeshi Yamamuro
ast_69652 in memory! (computed 496.0 B so far) >>>>> 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6 >>>>> GB (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6 >>>>> GB. >>>>> 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to >>>>> disk instead. >>>>> 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally >>>>> 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID >>>>> 452316). 2043 bytes result sent to driver >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> L >>>>> >>>>> >>>> -- >>> >>> Ben Slater >>> Chief Product Officer >>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support >>> +61 437 929 798 >>> >> >> > -- --- Takeshi Yamamuro

Re: spark sql broadcast join ?

2016-06-17 Thread Takeshi Yamamuro
subscribe, e-mail: user-unsubscr...@spark.apache.org > For additional commands, e-mail: user-h...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: Dataset Select Function after Aggregate Error

2016-06-17 Thread Takeshi Yamamuro
code which works as expected: >> df.groupBy("uid").count().select("uid") >> >> Thanks! >> -- >> Pedro Rodriguez >> PhD Student in Distributed Machine Learning | CU Boulder >> UC Berkeley AMPLab Alumni >> >> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 >> Github: github.com/EntilZha | LinkedIn: >> https://www.linkedin.com/in/pedrorodriguezscience >> >> > -- --- Takeshi Yamamuro

Re: Dataset Select Function after Aggregate Error

2016-06-17 Thread Takeshi Yamamuro
ood > description/guide of using this syntax I would be willing to contribute > some documentation. > > Pedro > > On Fri, Jun 17, 2016 at 8:53 PM, Takeshi Yamamuro > wrote: > >> Hi, >> >> In 2.0, you can say; >> val ds = Seq[Tuple2[Int, Int]]((1, 0),

Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Takeshi Yamamuro
trying to avoid map since my impression is that this uses a Scala >> closure so is not optimized as well as doing column-wise operations is. >> >> Looks like the $ notation is the way to go, thanks for the help. Is there >> an explanation of how this works? I imagine it is a method/f
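A small, self-contained version of the 2.0-style example referenced above (assuming a SparkSession bound to spark, as in the 2.0 shell); $"col" is just shorthand for a Column provided by the implicits import, so it can name both the grouping column and the generated count column after the aggregation:

    import spark.implicits._   // provides the $"..." column syntax

    val ds = Seq((1, 0), (1, 1), (2, 0)).toDS()
    val counts = ds.groupBy($"_1").count().select($"_1", $"count")
    counts.show()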

Re: Running Spark in local mode

2016-06-19 Thread Takeshi Yamamuro
any app I can use safely without exceeding > memory allocated etc? > > Thanking you > > > -- --- Takeshi Yamamuro

Re: Running Spark in local mode

2016-06-19 Thread Takeshi Yamamuro
you > > What are the main differences between a local mode and standalone mode. I > understand local mode does not support cluster. Is that the only difference? > > > > On Sunday, 19 June 2016, 9:52, Takeshi Yamamuro > wrote: > > > Hi, > > In a local mode, spark

Re: Switching broadcast mechanism from torrrent

2016-06-19 Thread Takeshi Yamamuro
object which held SparkContext as a member, eg: > object A { > sc: SparkContext = new SparkContext > def mapFunction {} > } > > and when I called rdd.map(A.mapFunction) it failed as A.sc is not > serializable. > > Thanks, > Daniel > > On Tue, Jun 7

Re: JDBC load into tempTable

2016-06-20 Thread Takeshi Yamamuro
y (unique > clustered index) on ID column > > If I load it through JDBC into a DataFrame and register it > via registerTempTable will the data will be in the order of ID in tempTable? > > Thanks > -- --- Takeshi Yamamuro

Re: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Takeshi Yamamuro
a BroadcastHashJoin instead of SortMergeJoin automatically. ) > > > > > > -- --- Takeshi Yamamuro

Re: Is it possible to turn a SortMergeJoin into BroadcastHashJoin?

2016-06-20 Thread Takeshi Yamamuro
ashjoin automatically when it "finds" that an output RDD is > small enough? > > > ------ > From: Takeshi Yamamuro > Sent: Monday, June 20, 2016, 19:16 > To: 梅西0247 > Cc: user > Subject: Re: Is it possible to turn a S
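The replies above are cut off; one common way to get a BroadcastHashJoin when the planner would otherwise pick SortMergeJoin is the broadcast() hint, sketched here with hypothetical largeDf/smallDf and join key "key":

    import org.apache.spark.sql.functions.broadcast

    // Hint that smallDf fits in memory so a BroadcastHashJoin is planned.
    val joined = largeDf.join(broadcast(smallDf), largeDf("key") === smallDf("key"))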

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Takeshi Yamamuro
<http://www.xactlycorp.com/email-click/> > > <https://www.nyse.com/quote/XNYS:XTLY> [image: LinkedIn] > <https://www.linkedin.com/company/xactly-corporation> [image: Twitter] > <https://twitter.com/Xactly> [image: Facebook] > <https://www.facebook.com/XactlyCorp> [image: YouTube] > <http://www.youtube.com/xactlycorporation> -- --- Takeshi Yamamuro

Re: Confusing argument of sql.functions.count

2016-06-22 Thread Takeshi Yamamuro
I find it confusing that I need to supply a column there. It feels like it >>>> might be distinct count or something. This can be seen in latest >>>> documentation >>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$> >>>> . >>>> >>>> I am considering filling this in spark bug tracker. Any opinions on >>>> this? >>>> >>>> Thanks >>>> >>>> Jakub >>>> >>>> >>> >> > -- --- Takeshi Yamamuro

Re: Aggregator (Spark 2.0) skips aggregation is zero(0 returns null

2016-06-26 Thread Takeshi Yamamuro
eger merge(Integer b1, Integer b2) { > if (b1 == null) { > return b2; > } else if (b2 == null) { > return b1; > } else { > return b1 + b2; > } > } > > -- --- Takeshi Yamamuro

Re: Aggregator (Spark 2.0) skips aggregation is zero(0 returns null

2016-06-26 Thread Takeshi Yamamuro
wrote: > Not sure about what's the rule in case of `b + null = null` but the same > code works perfectly in 1.6.1, just tried it.. > > On Sun, Jun 26, 2016 at 1:24 PM Takeshi Yamamuro > wrote: > >> Hi, >> >> This behaviour seems to be expected because y

Re: Aggregator (Spark 2.0) skips aggregation is zero(0 returns null

2016-06-26 Thread Takeshi Yamamuro
n. > > On Sun, Jun 26, 2016 at 1:53 PM Takeshi Yamamuro > wrote: > >> Whatever it is, this is expected; if an initial value is null, spark >> codegen removes all the aggregates. >> See: >> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala
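A sketch of the same aggregator with a zero that is a real value instead of null, so the generated code keeps the aggregation (written against the Spark 2.0 Scala Aggregator API; the Java version in the thread changes analogously):

    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.expressions.Aggregator

    object SafeSum extends Aggregator[Int, Int, Int] {
      def zero: Int = 0                                  // non-null initial value
      def reduce(b: Int, a: Int): Int = b + a
      def merge(b1: Int, b2: Int): Int = b1 + b2
      def finish(r: Int): Int = r
      def bufferEncoder: Encoder[Int] = Encoders.scalaInt
      def outputEncoder: Encoder[Int] = Encoders.scalaInt
    }

    // e.g. ds.select(SafeSum.toColumn) on a Dataset[Int]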

Re: Can Spark Dataframes preserve order when joining?

2016-06-30 Thread Takeshi Yamamuro
isclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 29 June 2016 at 14:32, Jestin Ma wrote: > >> If it’s not too much trouble, could I get some pointers/help on this? >> (see link) >> >> http://stackoverflow.com/questions/38085801/can-dataframe-joins-in-spark-preserve-order >> >> -also, as a side question, do Dataframes support easy reordering of >> columns? >> >> Thank you! >> Jestin >> > > -- --- Takeshi Yamamuro

Re: How are threads created in SQL Executor?

2016-07-01 Thread Takeshi Yamamuro
- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-01 Thread Takeshi Yamamuro
; of memory on my driver machine. Reasons for one instance per machine I do >>> not understand. >>> >>> -- >>> >>> >>> *Sincerely yoursEgor Pakhomov* >>> >> >> >> >> -- >> Best Regards >> >> Jeff Zhang >> > > > > -- > > > *Sincerely yoursEgor Pakhomov* > -- --- Takeshi Yamamuro

Re: Thrift JDBC server - why only one per machine and only yarn-client

2016-07-02 Thread Takeshi Yamamuro
gt;>>>>>>> 1) One instance per machine >>>>>>>> 2) Yarn client only(not yarn cluster) >>>>>>>> >>>>>>>> Are there any architectural reasons for such limitations? About >>>>>>>> yarn-client I might understand in theory - master is the same process >>>>>>>> as a >>>>>>>> server, so it makes some sense, but it's really inconvenient - I need >>>>>>>> a lot >>>>>>>> of memory on my driver machine. Reasons for one instance per machine I >>>>>>>> do >>>>>>>> not understand. >>>>>>>> >>>>>>>> -- >>>>>>>> >>>>>>>> >>>>>>>> *Sincerely yoursEgor Pakhomov* >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> Best Regards >>>>>>> >>>>>>> Jeff Zhang >>>>>>> >>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> >>>>>> >>>>>> *Sincerely yoursEgor Pakhomov* >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Best Regards >>>>> >>>>> Jeff Zhang >>>>> >>>> >>>> >>>> >>>> -- >>>> >>>> >>>> *Sincerely yoursEgor Pakhomov* >>>> >>> >>> >>> >>> -- >>> Best Regards >>> >>> Jeff Zhang >>> >> >> >> >> -- >> >> >> *Sincerely yoursEgor Pakhomov* >> > > > > -- > > > *Sincerely yoursEgor Pakhomov* > -- --- Takeshi Yamamuro

Re: spark parquet too many small files ?

2016-07-02 Thread Takeshi Yamamuro
t mailing list archive at Nabble.com. > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- --- Takeshi Yamamuro

Re: Enforcing shuffle hash join

2016-07-02 Thread Takeshi Yamamuro
l? > > > Thanks, > Lalitha > -- --- Takeshi Yamamuro

Re: Custom RDD: Report Size of Partition in Bytes to Spark

2016-07-03 Thread Takeshi Yamamuro
ithub.com/EntilZha | LinkedIn > <https://www.linkedin.com/in/pedrorodriguezscience> > -- --- Takeshi Yamamuro

Re: Enforcing shuffle hash join

2016-07-04 Thread Takeshi Yamamuro
st)? > > Is there any documentation demonstrating rule based transformation for > physical plan trees? > > Thanks, > Lalitha > > On Sat, Jul 2, 2016 at 12:58 AM, Takeshi Yamamuro > wrote: > >> Hi, >> >> No, spark has no hint for the hash join. &g

Re: Enforcing shuffle hash join

2016-07-04 Thread Takeshi Yamamuro
4, 2016 at 10:17 PM, Takeshi Yamamuro > wrote: > >> The join selection can be described in >> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L92 >> . >> If you have join keys, you can set -1 at &g
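The property being set to -1 in the truncated reply is presumably spark.sql.autoBroadcastJoinThreshold; -1 disables the broadcast path so the planner falls back to a shuffle-based join when join keys are present — a sketch:

    // Disable broadcast joins entirely so a shuffle-based join is chosen.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "-1")
    // Spark 2.x equivalent: spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)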

Re: IS NOT NULL is not working in programmatic SQL in spark

2016-07-10 Thread Takeshi Yamamuro
gt;>> > = >>> > +---++ >>> > |_c0|code| >>> > +---++ >>> > | 18| AS| >>> > | 16| | >>> > | 13| UK| >>> > | 14| US| >>> > | 20| As| >>> > | 15| IN| >>> > | 19| IR| >>> > | 11| PK| >>> > +---++ >>> > >>> > i am expecting the below one any idea, how to apply IS NOT NULL ? >>> > >>> > +---++ >>> > |_c0|code| >>> > +---++ >>> > | 18| AS| >>> > | 13| UK| >>> > | 14| US| >>> > | 20| As| >>> > | 15| IN| >>> > | 19| IR| >>> > | 11| PK| >>> > +---++ >>> > >>> > >>> > >>> > Thanks & Regards >>> >Radha krishna >>> > >>> > >>> >> >> >> >> -- >> >> >> >> >> >> >> >> >> Thanks & Regards >>Radha krishna >> >> >> -- --- Takeshi Yamamuro
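Judging from the sample output above, the row with _c0 = 16 holds an empty string rather than a SQL NULL, which IS NOT NULL does not remove; a sketch that filters out both (the table name is a placeholder):

    // Drop NULLs and blank strings in one predicate.
    val result = sqlContext.sql(
      "SELECT _c0, code FROM mytable WHERE code IS NOT NULL AND trim(code) <> ''")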

Re: Spark crashes with two parquet files

2016-07-10 Thread Takeshi Yamamuro
ks: > > path = '/data/train_parquet/0_0_0.parquet' > train0_df = sqlContext.read.load(path) > train_df.take(1) > > Thanks in advance. > > Samir > -- --- Takeshi Yamamuro

Re: How to run Zeppelin and Spark Thrift Server Together

2016-07-10 Thread Takeshi Yamamuro
Because It’s only >> test in local with local mode If I deploy on mesos cluster what would >> happened? >> >> Need you guys suggests some solutions for that. Thanks. >> >> Chanh >> - >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org >> >> > > > -- > Best Regards, > Ayan Guha > -- --- Takeshi Yamamuro

Re: Spark crashes with two parquet files

2016-07-10 Thread Takeshi Yamamuro
iveChunkList.readAll(ParquetFileReader.java:755) >> at >> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494) >> at >> org.apache.spark.sql.execution.datasources.parquet.UnsafeRowParquetRecordReader.checkEndOfRowGroup(UnsafeRowParquetRecord >> >

Re: Spark cluster tuning recommendation

2016-07-12 Thread Takeshi Yamamuro
t;>>- *Cores in use:* 28 Total, 2 Used >>>- *Memory in use:* 56.0 GB Total, 1024.0 MB Used >>>- *Applications:* 1 Running, 6 Completed >>>- *Drivers:* 0 Running, 0 Completed >>>- *Status:* ALIVE >>> >>> Each worker has 8 cores and 4GB memory. >>> >>> My questions is how do people running in production decide these >>> properties - >>> >>> 1) --num-executors >>> 2) --executor-cores >>> 3) --executor-memory >>> 4) num of partitions >>> 5) spark.default.parallelism >>> >>> Thanks, >>> Kartik >>> >>> >>> >> > -- --- Takeshi Yamamuro

Re: Tools for Balancing Partitions by Size

2016-07-12 Thread Takeshi Yamamuro
keley AMPLab Alumni > > ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423 > Github: github.com/EntilZha | LinkedIn: > https://www.linkedin.com/in/pedrorodriguezscience > > -- --- Takeshi Yamamuro

Re: Bzip2 to Parquet format

2016-07-25 Thread Takeshi Yamamuro
>> On Jul 24, 2016, at 5:34 PM, janardhan shetty >> wrote: >> >> We have data in Bz2 compression format. Any links in Spark to convert >> into Parquet and also performance benchmarks and uses study materials ? >> >> >> > -- --- Takeshi Yamamuro

Re: read parquetfile in spark-sql error

2016-07-25 Thread Takeshi Yamamuro
Driver.java:425) >> at >> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:166) >> at >> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:606) >> at >> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) >> at >> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) >> at >> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) >> at >> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) >> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) >> Error in query: cannot recognize input near 'parquetTable' 'USING' 'org' >> in table name; line 2 pos 0 >> >> >> am I use it in the wrong way? >> >> >> >> >> >> thanks >> > -- --- Takeshi Yamamuro
