[ANNOUNCE] Apache Sedona 1.7.2 released

2025-06-09 Thread Jia Yu
Dear all, We are happy to report that we have released Apache Sedona 1.7.2. Thank you again for your help. Apache Sedona is a cluster computing system for processing large-scale spatial data on top of Spark, Flink, and Snowflake. Vote thread (Permalink from https://lists.apache.org/list.html): h

Apache Sedona + Iceberg GEO meetup in San Francisco

2025-05-08 Thread Jia Yu
rates seamlessly into existing workflows and formats like Iceberg and Parquet. Spatial data can be used just like any other data type, unlocking powerful insights for business intelligence, analytics, and more. Speakers: Jia Yu – Co-Founder & Chief Architect, Wherobots (https://wherobots.com/) Mat

[ANNOUNCE] Apache Sedona 1.7.1 released

2025-03-16 Thread Jia Yu
Dear all, We are happy to report that we have released Apache Sedona 1.7.1. Thank you again for your help. Apache Sedona is a cluster computing system for processing large-scale spatial data on top of Apache Spark, Flink and Snowflake. Vote thread (Permalink from https://lists.apache.org/list.ht

[ANNOUNCE] Apache Sedona 1.7.0 released

2024-12-03 Thread Jia Yu
Dear all, We are happy to report that we have released Apache Sedona 1.7.0. Thank you again for your help. Apache Sedona is a cluster computing system for processing large-scale spatial data. Vote thread (Permalink from https://lists.apache.org/list.html): https://lists.apache.org/thread/5hvcr80

Fwd: [ANNOUNCE] Apache Sedona 1.6.1 released

2024-08-27 Thread Jia Yu
Dear all, We are happy to report that we have released Apache Sedona 1.6.1. Apache Sedona is a cluster computing system for processing large-scale spatial data. Website: http://sedona.apache.org/ Release notes: https://github.com/apache/sedona/blob/sedona-1.6.1/docs/setup/release-notes.md Down

Re: Spark Connect, Master, and Workers

2023-09-01 Thread James Yu
Can I simply understand Spark Connect this way: The client process is now the Spark driver? From: Brian Huynh Sent: Thursday, August 10, 2023 10:15 PM To: Kezhi Xiong Cc: user@spark.apache.org Subject: Re: Spark Connect, Master, and Workers Hi Kezhi, Yes, you

[k8s] Fail to expose custom port on executor container specified in my executor pod template

2023-06-26 Thread James Yu
Hi Team, I have no luck in trying to expose port 5005 (for remote debugging purpose) on my executor container using the following pod template and spark configuration s3a://mybucket/pod-template-executor-debug.yaml apiVersion:

Unsubscribe

2023-06-11 Thread Yu voidy

Re: Log4j 1.2.17 spark CVE

2021-12-13 Thread James Yu
Question: Spark uses log4j 1.2.17. If my application jar contains log4j 2.x and gets submitted to the Spark cluster, which version of log4j actually gets used during the Spark session? From: Sean Owen Sent: Monday, December 13, 2021 8:25 AM To: Jörn Franke Cc: P

Re: start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-08 Thread James Yu
amages arising from such loss, damage or destruction. On Wed, 8 Dec 2021 at 19:45, James Yu mailto:ja...@ispot.tv>> wrote: Just thought about another possibility which is to containerize the history server and run the container with proper restart policy. This may be the approach we will

Re: start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-08 Thread James Yu
Sent: Tuesday, December 7, 2021 1:29 PM To: James Yu Cc: user @spark Subject: Re: start-history-server.sh doesn't survive system reboot. Recommendation? The scripts just launch the processes. To make any process restart on system restart, you would need to set it up as a system service

start-history-server.sh doesn't survive system reboot. Recommendation?

2021-12-07 Thread James Yu
Hi Users, We found that the history server launched by using the "start-history-server.sh" command does not survive system reboot. Any recommendation of making it always up even after reboot? Thanks, James

[Spark Core]: Does Spark support group scheduling techniques like Drizzle?

2021-11-25 Thread Bowen Yu
feature in the future? Best, Bowen Yu

Re: Performance Problems Migrating to S3A Committers

2021-08-05 Thread James Yu
See this ticket https://issues.apache.org/jira/browse/HADOOP-17201. It may help your team. From: Johnny Burns Sent: Tuesday, June 22, 2021 3:41 PM To: user@spark.apache.org Cc: data-orchestration-team Subject: Performance Problems Migrating to S3A Committers H

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
rito Sent: Wednesday, February 3, 2021 11:05 AM To: James Yu ; user Subject: Re: Poor performance caused by coalesce to 1 Coalesce is reducing the parallelization of your last stage, in your case to 1 task. So, it’s natural it will give poor performance especially with large data. If you absol

Poor performance caused by coalesce to 1

2021-02-03 Thread James Yu
Hi Team, We are running into this poor performance issue and seeking your suggestion on how to improve it: We have a particular dataset which we aggregate from other datasets and would like to write out to one single file (because it is small enough). We found that after a series of transformations
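
A minimal sketch of the workaround usually suggested in this situation (paths and column names are placeholders, not from the thread): repartition(1) adds a shuffle boundary, so the expensive upstream transformations keep their parallelism and only the final write runs as a single task, whereas coalesce(1) can pull the whole final stage down to one task.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.parquet("s3a://mybucket/input/")   # placeholder input path
    result = df.groupBy("key").count()                  # aggregation stays parallel

    # Collapse to a single output file only at the very end.
    result.repartition(1).write.mode("overwrite").csv("s3a://mybucket/output/")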

Re: Where do the executors get my app jar from?

2020-08-14 Thread James Yu
Henoc, Ok. That is for Yarn with HDFS. What will happen in Kubernetes as resource manager without HDFS scenario? James From: Henoc Sent: Thursday, August 13, 2020 10:45 PM To: James Yu Cc: user ; russell.spit...@gmail.com Subject: Re: Where do the executors

Where do the executors get my app jar from?

2020-08-13 Thread James Yu
Hi, When I spark submit a Spark app with my app jar located in S3, obviously the Driver will download the jar from the s3 location. What is not clear to me is: where do the Executors get the jar from? From the same s3 location, or somehow from the Driver, or they don't need the jar? Thanks i

Unsubscribe

2020-05-05 Thread Zeming Yu
Unsubscribe Get Outlook for Android

unsubscribe

2020-04-29 Thread Zeming Yu
unsubscribe Get Outlook for Android

[no subject]

2020-04-28 Thread Zeming Yu
Unsubscribe Get Outlook for Android

Re: Spark driver thread

2020-03-06 Thread James Yu
Pol, thanks for your reply. Actually I am running Spark apps in CLUSTER mode. Is what you said still applicable in cluster mode? Thanks in advance for your further clarification. From: Pol Santamaria Sent: Friday, March 6, 2020 12:59 AM To: James Yu Cc: user

Spark driver thread

2020-03-05 Thread James Yu
Hi, Does a Spark driver always work single-threaded? If yes, does it mean asking for more than one vCPU for the driver is wasteful? Thanks, James

Re: batch processing in spark

2019-05-05 Thread Genmao Yu
IIUC, you can use the mapPartitions transformation and pass a function f. The function maps an input iterator to an output iterator, so on the input iterator you can process multiple records at a time. > On May 6, 2019, at 2:59 AM, swastik mittal wrote: > > From my experience in spark, whe
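
A minimal PySpark sketch of the mapPartitions approach described above, assuming an existing SparkSession named spark; the batch size and the "processing" step are illustrative. The function receives an iterator over one partition's records and returns an iterator of results, so records can be handled several at a time.

    def process_partition(records):
        # Accumulate records into small batches and process each batch at once.
        batch = []
        for r in records:
            batch.append(r)
            if len(batch) >= 100:          # illustrative batch size
                yield sum(batch)           # stand-in for real batch processing
                batch = []
        if batch:
            yield sum(batch)

    rdd = spark.sparkContext.parallelize(range(1000), 4)
    result = rdd.mapPartitions(process_partition).collect()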

Re: SIGBUS (0xa) when using DataFrameWriter.insertInto

2018-10-27 Thread Ted Yu
I don't seem to find the log. Can you double check ? Thanks Original message From: alexzautke Date: 10/27/18 8:54 AM (GMT-08:00) To: user@spark.apache.org Subject: Re: SIGBUS (0xa) when using DataFrameWriter.insertInto Please also find attached a complete error log. -- Se

Re: error while submitting job

2018-09-29 Thread Ted Yu
Can you tell us the version of Spark and the connector you used ? Thanks  Original message From: yuvraj singh <19yuvrajsing...@gmail.com> Date: 9/29/18 10:42 PM (GMT-08:00) To: user@spark.apache.org Subject: error while submitting job Hi , i am getting this error please help

Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-19 Thread Ted Yu
Hi, w.r.t. ElementTrackingStore, since it is backed by KVStore, there should be other classes which occupy significant memory. Can you pastebin the top 10 entries among the heap dump ? Thanks

Re: how to use the sql join in java please

2018-04-11 Thread Yu, Yucai
Do you really want to do a cartesian product on those two tables? If yes, you can set spark.sql.crossJoin.enabled=true. Thanks, Yucai From: "1427357...@qq.com" <1427357...@qq.com> Date: Wednesday, April 11, 2018 at 3:16 PM To: spark?users Subject: how to use the sql join in java please Hi all,
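
A small sketch of the setting mentioned above, assuming an existing SparkSession named spark and two registered tables t1 and t2 (placeholder names):

    # Allow implicit cartesian products for joins written without a join condition.
    spark.conf.set("spark.sql.crossJoin.enabled", "true")
    result = spark.sql("SELECT * FROM t1, t2")

    # With the DataFrame API, df1.crossJoin(df2) states the intent explicitly.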

Re: KafkaUtils.createStream(..) is removed for API

2018-02-18 Thread Ted Yu
createStream() is still in external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala But it is not in external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaUtils.scala FYI On Sun, Feb 18, 2018 at 5:17 PM, naresh Goud wrote: > Hello Team, > > I s

Re: Broken SQL Visualization?

2018-01-15 Thread Ted Yu
Did you include any picture ? Looks like the picture didn't go thru. Please use third party site.  Thanks Original message From: Tomasz Gawęda Date: 1/15/18 2:07 PM (GMT-08:00) To: d...@spark.apache.org, user@spark.apache.org Subject: Broken SQL Visualization? Hi, today I ha

Re: how to mention others in JIRA comment please?

2017-06-26 Thread Ted Yu
You can find the JIRA handle of the person you want to mention by going to a JIRA where that person has commented. e.g. you want to find the handle for Joseph. You can go to: https://issues.apache.org/jira/browse/SPARK-6635 and click on his name in comment: https://issues.apache.org/jira/secure/V

Re: the compile of spark stoped without any hints, would you like help me please?

2017-06-25 Thread Ted Yu
Does adding -X to the mvn command give you more information? Cheers On Sun, Jun 25, 2017 at 5:29 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote: > Hi all, > > Today I used a new PC to compile SPARK. > At the beginning, it worked well. > But it stopped at some point. > the content in the console is: > ==

Re: examples for flattening dataframes using pyspark

2017-05-27 Thread Zeming Yu
Sorry, sent the incomplete email by mistake. Here's the full email: > Hi, > > I need to flatten a nested dataframe and I'm following this example: > https://docs.databricks.com/spark/latest/spark-sql/complex-types.html > > Just wondering: > 1. how can I test for the existence of an item before ret

examples for flattening dataframes using pyspark

2017-05-27 Thread Zeming Yu
Hi, I need to flatten a nested dataframe and I'm following this example: https://docs.databricks.com/spark/latest/spark-sql/complex-types.html Just wondering: 1. how can I test for the existence of an item before retrieving it? Say, test if "b" exists before adding that into my flat dataframe event
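
A hedged sketch of one way to answer question 1: inspect the DataFrame schema before selecting, so a missing nested field can be skipped instead of raising an AnalysisException. The column name "event" and field name "b" follow the question and are otherwise assumptions.

    from pyspark.sql.types import StructType

    def has_field(df, column, field):
        # True if `column` is a struct column containing `field`.
        dtype = df.schema[column].dataType
        return isinstance(dtype, StructType) and field in dtype.fieldNames()

    if has_field(df, "event", "b"):
        flat = df.select("event.b")
    else:
        flat = df   # or add a default/placeholder column instead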

using pandas and pyspark to run ETL job - always failing after about 40 minutes

2017-05-26 Thread Zeming Yu
Hi, I tried running the ETL job a few times. It always fails after 40 minutes or so. When I relaunch jupyter and rerun the job, it runs without error. Then it fails again after some time. Just wondering if anyone else has encountered this before? Here's the error message: ---

Re: KMeans Clustering is not Reproducible

2017-05-24 Thread Yu Zhang
gives approximation guarantees on the kmeans cost. You could set the initial seeding points which will avoid the 'agnostic' issue. Regards, Yu Zhang On Wed, May 24, 2017 at 1:49 PM, Ankur Srivastava < ankur.srivast...@gmail.com> wrote: > Hi Christoph, > > I am not an ex

Re: what does this error mean?

2017-05-13 Thread Zeming Yu
"server ({0}:{1})".format(self.address, self.port)969 logger.exception(msg)--> 970 raise Py4JNetworkError(msg, e) 971 972 def close(self, reset=False): Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:34166) O

what does this error mean?

2017-05-13 Thread Zeming Yu
My code runs error free on my local pc. Just tried running the same code on a ubuntu machine on ec2, and got the error below. Any idea where to start in terms of debugging? ---Py4JError Tracebac

how to set up h2o sparkling water on jupyter notebook on a windows machine

2017-05-08 Thread Zeming Yu
Hi, I'm a newbie, so please bear with me. *I'm using a windows 10 machine. I installed spark here:* C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7 *I also installed h2o sparkling water here:* C:\sparkling-water-2.1.1 *I use this code in command line to launch a jupyter notebook for pysp

how to check whether spill over to hard drive happened or not

2017-05-06 Thread Zeming Yu
hi, I'm running pyspark on my local PC using the stand alone mode. After a pyspark window function on a dataframe, I did a groupby query on the dataframe. The groupby query turns out to be very slow (10+ minutes on a small data set). I then cached the dataframe and re-ran the same query. The quer

Re: take the difference between two columns of a dataframe in pyspark

2017-05-06 Thread Zeming Yu
OK. I've worked it out. df.withColumn('diff', col('A')-col('B')) On Sun, May 7, 2017 at 11:49 AM, Zeming Yu wrote: > Say I have the following dataframe with two numeric columns A and B, > what's the best way to add a column sh
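
For completeness, a self-contained version of the one-liner above (the sample rows are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(786.31999, 786.12), (786.12, 780.0)], ["A", "B"])
    df.withColumn("diff", col("A") - col("B")).show()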

take the difference between two columns of a dataframe in pyspark

2017-05-06 Thread Zeming Yu
Say I have the following dataframe with two numeric columns A and B, what's the best way to add a column showing the difference between the two columns? +-+--+ |A| B| +-+--+ |786.31999|786.12| | 786.12|

Spark books

2017-05-03 Thread Zeming Yu
I'm trying to decide whether to buy the book learning spark, spark for machine learning etc. or wait for a new edition covering the new concepts like dataframe and datasets. Anyone got any suggestions?

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
ave to say. > Would it be more efficient if a relational database with the right index > (code field in the above case) to perform more efficiently (with spark that > uses predicate push-down)? > Hope this helps. > > Thanks, > Muthu > > On Sun, Apr 30, 2017 at 1:45 AM, Zemi

examples of dealing with nested parquet/ dataframe file

2017-04-30 Thread Zeming Yu
Hi, I'm still trying to decide whether to store my data as deeply nested or flat parquet file. The main reason for storing the nested file is it stores data in its raw format, no information loss. I have two questions: 1. Is it always necessary to flatten a nested dataframe for the purpose of b

Re: Recommended cluster parameters

2017-04-30 Thread Zeming Yu
I've got a similar question. Would you be able to provide some rough guide (even a range is fine) on the number of nodes, cores, and total amount of RAM required? Do you want to store 1 TB, 1 PB or far more? - say 6 TB of data in parquet format on s3 Do you want to just read that data, retrieve

Re: parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
anke wrote: > Depends on your queries, the data structure etc. generally flat is better, > but if your query filter is on the highest level then you may have better > performance with a nested structure, but it really depends > > > On 30. Apr 2017, at 10:19, Zeming Yu wrote: >

parquet optimal file structure - flat vs nested

2017-04-30 Thread Zeming Yu
Hi, We're building a parquet based data lake. I was under the impression that flat files are more efficient than deeply nested files (say 3 or 4 levels down). Is that correct? Thanks, Zeming

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
ndex.html ​ > > > Thank you, > *Pushkar Gujar* > > > On Tue, Apr 25, 2017 at 8:50 AM, Zeming Yu wrote: > >> How could I access the first element of the holiday column? >> >> I tried the following code, but it doesn't work: >> start_date_test2.with

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
How could I access the first element of the holiday column? I tried the following code, but it doesn't work: start_date_test2.withColumn("diff", datediff(start_date_test2.start_date, start_date_test2.holiday*[0]*)).show() On Tue, Apr 25, 2017 at 10:20 PM, Zeming Yu wrote: >

Re: how to find the nearest holiday

2017-04-25 Thread Zeming Yu
-05-30,2017-10-01]| calculate a column called "days_from_nearest_holiday" which calculates the difference between 11 aug 2017 and 1 oct 2017? On Tue, Apr 25, 2017 at 6:00 PM, Wen Pei Yu wrote: > TypeError: unorderable types: str() >= datetime.date() > > Should transf

Re: how to find the nearest holiday

2017-04-25 Thread Wen Pei Yu
TypeError: unorderable types: str() >= datetime.date()   You should convert the string to a Date type when comparing.   Yu Wenpei.   - Original message - From: Zeming Yu To: user Cc: Subject: how to find the nearest holiday Date: Tue, Apr 25, 2017 3:39 PM  I have a column of dates (date type), j

how to find the nearest holiday

2017-04-25 Thread Zeming Yu
I have a column of dates (date type), just trying to find the nearest holiday of the date. Anyone has any idea what went wrong below? start_date_test = flight3.select("start_date").distinct() start_date_test.show() holidays = ['2017-09-01', '2017-10-01'] +--+ |start_date| +--+
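
A hedged sketch combining the fix suggested later in the thread (cast the holiday strings to dates before comparing) with one way to compute the requested "days_from_nearest_holiday" column; the use of abs/least over per-holiday datediffs is an assumption, not taken from the thread.

    from pyspark.sql import functions as F

    holidays = ['2017-09-01', '2017-10-01']

    # One datediff column per holiday, then take the smallest absolute gap.
    diffs = [F.abs(F.datediff(F.col("start_date"), F.to_date(F.lit(h))))
             for h in holidays]
    result = start_date_test.withColumn("days_from_nearest_holiday", F.least(*diffs))
    result.show()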

Re: udf that handles null values

2017-04-24 Thread Zeming Yu
issue today at stackoverflow. > > http://stackoverflow.com/questions/43595201/python-how- > to-convert-pyspark-column-to-date-type-if-there-are-null- > values/43595728#43595728 > > > Thank you, > *Pushkar Gujar* > > > On Mon, Apr 24, 2017 at 8:22 PM, Zeming Yu wro

one hot encode a column of vector

2017-04-24 Thread Zeming Yu
how do I do one hot encode on a column of array? e.g. ['TG', 'CA'] FYI here's my code for one hot encoding normal categorical columns. How do I make it work for a column of array? from pyspark.ml import Pipeline from pyspark.ml.feature import StringIndexer indexers = [StringIndexer(inputCol=co
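
A hedged sketch of one common approach for an array-of-strings column: CountVectorizer accepts array<string> input directly (StringIndexer/OneHotEncoder do not), and with binary=True it produces a 0/1 "multi-hot" vector. Column names are illustrative, and the binary flag is assumed to be available in the Spark version in use.

    from pyspark.ml.feature import CountVectorizer

    df = spark.createDataFrame([(["TG", "CA"],), (["CA"],)], ["codes"])

    cv = CountVectorizer(inputCol="codes", outputCol="codes_vec", binary=True)
    model = cv.fit(df)
    model.transform(df).show(truncate=False)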

pyspark vector

2017-04-24 Thread Zeming Yu
Hi all, Beginner question: what does the 3 mean in the (3,[0,1,2],[1.0,1.0,1.0])? https://spark.apache.org/docs/2.1.0/ml-features.html id | texts | vector |-|--- 0 | Array("a", "b", "c")| (3,[0,1,2],[1.0,1.
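
For reference, (3,[0,1,2],[1.0,1.0,1.0]) is Spark's compact display of a sparse vector: the first number is the vector's size, then the indices, then the values at those indices. A tiny sketch:

    from pyspark.ml.linalg import Vectors

    # Size 3, with value 1.0 at indices 0, 1 and 2 -- i.e. the dense vector [1.0, 1.0, 1.0].
    v = Vectors.sparse(3, [0, 1, 2], [1.0, 1.0, 1.0])
    print(v.size)        # 3
    print(v.toArray())   # [1. 1. 1.]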

udf that handles null values

2017-04-24 Thread Zeming Yu
hi all, I tried to write a UDF that handles null values: def getMinutes(hString, minString): if (hString != None) & (minString != None): return int(hString) * 60 + int(minString[:-1]) else: return None flight2 = (flight2.withColumn("duration_minutes", udfGetMinutes("duration_h", "duratio
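
A hedged sketch of the usual shape of such a UDF: check for None with `is not None`, return None so nulls propagate, and register the UDF with an explicit return type. The second column name ("duration_m") is an assumption, since the original message is truncated.

    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    def get_minutes(h_string, min_string):
        # Return None when either part is missing so the result column stays null.
        if h_string is not None and min_string is not None:
            return int(h_string) * 60 + int(min_string[:-1])
        return None

    udf_get_minutes = udf(get_minutes, IntegerType())

    flight2 = flight2.withColumn(
        "duration_minutes", udf_get_minutes("duration_h", "duration_m"))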

Re: how to add new column using regular expression within pyspark dataframe

2017-04-22 Thread Zeming Yu
lit(flight.duration,'h').getItem(0)) > > > Thank you, > *Pushkar Gujar* > > > On Thu, Apr 20, 2017 at 4:35 AM, Zeming Yu wrote: > >> Any examples? >> >> On 20 Apr. 2017 3:44 pm, "颜发才(Yan Facai)" wrote: >> >>> How about u

Re: how to add new column using regular expression within pyspark dataframe

2017-04-20 Thread Zeming Yu
wal.wordpress.com/2015/10/02/spark-custom-udf-example/ > > > > On Mon, Apr 17, 2017 at 8:25 PM, Zeming Yu wrote: > >> I've got a dataframe with a column looking like this: >> >> display(flight.select("duration").show()) >> >> ++ &g

how to add new column using regular expression within pyspark dataframe

2017-04-17 Thread Zeming Yu
I've got a dataframe with a column looking like this: display(flight.select("duration").show()) ++ |duration| ++ | 15h10m| | 17h0m| | 21h25m| | 14h30m| | 24h50m| | 26h10m| | 14h30m| | 23h5m| | 21h30m| | 11h50m| | 16h10m| | 15h15m| | 21h25m| | 14h25m| | 14h40m| |
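
A hedged sketch of one way to do this with regexp_extract (the replies in this thread use split(...).getItem(...) instead); the column names follow the question.

    from pyspark.sql.functions import regexp_extract, col

    flight2 = (flight
        .withColumn("hours",   regexp_extract(col("duration"), r"(\d+)h", 1).cast("int"))
        .withColumn("minutes", regexp_extract(col("duration"), r"(\d+)m", 1).cast("int"))
        .withColumn("duration_minutes", col("hours") * 60 + col("minutes")))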

Re: optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
ot; wrote: > > > On 11 Apr 2017, at 11:07, Zeming Yu wrote: > > > > Hi all, > > > > I'm a beginner with spark, and I'm wondering if someone could provide > guidance on the following 2 questions I have. > > > > Background: I have a data se

optimising storage and ec2 instances

2017-04-11 Thread Zeming Yu
Hi all, I'm a beginner with spark, and I'm wondering if someone could provide guidance on the following 2 questions I have. Background: I have a data set growing by 6 TB p.a. I plan to use spark to read in all the data, manipulate it and build a predictive model on it (say GBM) I plan to store th

Re: Aggregated column name

2017-03-23 Thread Wen Pei Yu
Thanks, Kevin. This works for a one- or two-column agg, but it does not work for this: val expr = (Map("forCount" -> "count") ++ features.map((_ -> "mean"))) val averageDF = originalDF .withColumn("forCount", lit(0)) .groupBy(col("...")) .agg
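
The Map-based agg above generates its own output column names (e.g. "avg(colname)"). A hedged sketch of the usual way to keep the names predictable, using explicit aggregate expressions with alias (shown in PySpark; column names are illustrative):

    from pyspark.sql import functions as F

    features = ["f1", "f2"]                                   # illustrative feature columns
    exprs = [F.mean(c).alias(c + "_mean") for c in features] + [F.count("*").alias("cnt")]

    averaged = df.groupBy("group_col").agg(*exprs)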

Aggregated column name

2017-03-23 Thread Wen Pei Yu
figure parameter for this, or which PR change this?   Thank you very much. Yu Wenpei. - To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Does the storage handler provide bulk load capability ? Cheers > On Jan 25, 2017, at 3:39 AM, Amrit Jangid wrote: > > Hi chetan, > > If you just need HBase Data into Hive, You can use Hive EXTERNAL TABLE with > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'. > > Try this if you

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
The references are vendor specific. Suggest contacting vendor's mailing list for your PR. My initial interpretation of HBase repository is that of Apache. Cheers On Wed, Jan 25, 2017 at 7:38 AM, Chetan Khatri wrote: > @Ted Yu, Correct but HBase-Spark module available at HBase re

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Though no hbase release has the hbase-spark module, you can find the backport patch on HBASE-14160 (for Spark 1.6) You can build the hbase-spark module yourself. Cheers On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri wrote: > Hello Spark Community Folks, > > Currently I am using HBase 1.2.4 and

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
processing is delivered to hbase. Cheers On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri wrote: > Ok, Sure will ask. > > But what would be generic best practice solution for Incremental load from > HBASE. > > On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu wrote: > >> I haven

Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
I haven't used Gobblin. You can consider asking Gobblin mailing list of the first option. The second option would work. On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri wrote: > Hello Guys, > > I would like to understand different approach for Distributed Incremental > load from HBase, Is there

driver in queued state and not started

2016-12-05 Thread Yu Wei
Hi Guys, I tried to run Spark on a Mesos cluster. However, when I tried to submit jobs via spark-submit, the driver stayed in "Queued" state and was not started. What should I check? Thanks, Jared, Software developer Interested in open source software, big data, Linux

Two questions about running spark on mesos

2016-11-14 Thread Yu Wei
Hi Guys, Two questions about running spark on mesos. 1, Does spark configuration of conf/slaves still work when running spark on mesos? According to my observations, it seemed that conf/slaves still took effect when running spark-shell. However, it doesn't take effect when deploying

If you have used spark-sas7bdat package to transform SAS data set to Spark, please be aware

2016-10-27 Thread Shi Yu
I found some main issues and wrote it on my blog: https://eilianyu.wordpress.com/2016/10/27/be-aware-of-hidden-data-errors-using-spark-sas7bdat-pacakge-to-ingest-sas-datasets-to-spark/

RE: Can we disable parquet logs in Spark?

2016-10-21 Thread Yu, Yucai
I set "log4j.rootCategory=ERROR, console" and using "-file conf/log4f.properties" to make most of logs suppressed, but those org.apache.parquet log still exists. Any way to disable them also? Thanks, Yucai From: Yu, Yucai [mailto:yucai...@intel.com] Sent: Friday, October

Can we disable parquet logs in Spark?

2016-10-20 Thread Yu, Yucai
Hi, I see lots of parquet logs in container logs(YARN mode), like below: stdout: Oct 21, 2016 2:27:30 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 8,448B for [ss_promo_sk] INT32: 5,996 values, 8,513B raw, 8,409B comp, 1 pages, encodings: [PLAIN_DICTIONARY, BIT_PACKED,

Best practice of complicated SQL query in Spark/Hive

2016-10-06 Thread Shi Yu
Hello, I wonder what is the state-of-art best practice to achieve best performance running complicated SQL query today in 2016? I am new to this topic and have read about Hive on Tez Spark on Hive Spark SQL 2.0 (It seems Spark 2.0 supports complicated nest query) The documentation I read sugge

Re: access spark thrift server from another spark session

2016-10-04 Thread Herman Yu
l saved. > > On 4 Oct 2016 11:44, "Takeshi Yamamuro" <mailto:linguin@gmail.com>> wrote: > -dev +user > > Hi, > > Have you try to share a session by > `spark.sql.hive.thriftServer.singleSession`? > > // maropu > > On Tue, Oct 4, 2016 at

Re: udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
df: - a|b|c --- 1|m|n 1|x | j 2|m|x ... import pyspark.sql.functions as F from pyspark.sql.types import MapType, StringType def my_zip(c, d): return dict(zip(c, d)) my_zip = F.udf(_my_zip, MapType(StingType(), StringType(), True), True) df.groupBy('a').agg(my_zip(collect_list
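
A hedged, corrected version of the snippet above (fixing the StingType typo and the _my_zip/my_zip name mismatch): collect the grouped values first, then apply an ordinary UDF to the collected arrays, which sidesteps the "pythonUDF ... not in the group by" AnalysisException mentioned in the thread.

    from pyspark.sql import functions as F
    from pyspark.sql.types import MapType, StringType

    def my_zip(c, d):
        return dict(zip(c, d))

    my_zip_udf = F.udf(my_zip, MapType(StringType(), StringType(), True))

    result = (df.groupBy("a")
                .agg(F.collect_list("b").alias("bs"), F.collect_list("c").alias("cs"))
                .withColumn("zipped", my_zip_udf("bs", "cs")))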

Re: udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
btw, i am using spark 1.6.1 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/udf-of-aggregation-in-pyspark-dataframe-tp27811p27812.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
Hi, is there a way to write a udf in pyspark that supports agg()? I searched all over the docs and internet, and tested it out; some say yes, some say no. And when I try those "yes" code examples, they just complain about AnalysisException: u"expression 'pythonUDF' is neither present in the group by, nor

Re: namespace quota not take effect

2016-08-25 Thread Ted Yu
This question should have been posted to user@ Looks like you were using wrong config. See: http://hbase.apache.org/book.html#quota See 'Setting Namespace Quotas' section further down. Cheers On Tue, Aug 23, 2016 at 11:38 PM, W.H wrote: > hi guys > I am testing the hbase namespace quota at

Re: Apply ML to grouped dataframe

2016-08-23 Thread Wen Pei Yu
3| [2.0,16.0]| |12462589343|3| [1.0,1.0]| +---+-++ From: ayan guha To: Wen Pei Yu/China/IBM@IBMCN Cc: user , Nirmal Fernando Date: 08/23/2016 05:13 PM Subject:Re: Apply ML to grouped dataframe I would suggest you to construct a toy probl

Re: Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu
Hi Nirmal, Filter works fine if I want to handle one of the grouped dataframes. But I have multiple grouped dataframes, and I wish I could apply an ML algorithm to all of them in one job, not in for loops. Wenpei. From: Nirmal Fernando To: Wen Pei Yu/China/IBM@IBMCN Cc: User Date: 08/23/2016

Re: Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu
: Nirmal Fernando To: Wen Pei Yu/China/IBM@IBMCN Cc: User Date: 08/23/2016 01:14 PM Subject:Re: Apply ML to grouped dataframe Hi Wen, AFAIK Spark MLlib implements its machine learning algorithms on top of Spark dataframe API. What did you mean by a grouped dataframe? On T

Re: Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu
Hi Nirmal I didn't get your point. Can you tell me more about how to use MLlib to grouped dataframe? Regards. Wenpei. From: Nirmal Fernando To: Wen Pei Yu/China/IBM@IBMCN Cc: User Date: 08/23/2016 10:26 AM Subject:Re: Apply ML to grouped dataframe You can use

Apply ML to grouped dataframe

2016-08-22 Thread Wen Pei Yu
Hi, We have a dataframe and want to group it and apply an ML algorithm or statistic (say a t-test) to each group. Is there any efficient way for this situation? Currently, we transfer to pyspark, use groupByKey and apply a numpy function to the array. But this isn't an efficient way, right? Regards. Wenpei

Re: Attempting to accept an unknown offer

2016-08-17 Thread Ted Yu
me from a hive sql. There are other > similar jobs which work fine > > On Wed, Aug 17, 2016 at 8:52 AM, Ted Yu wrote: > >> Can you provide more information ? >> >> Were you running on YARN ? >> Which version of Spark are you using ? >> >> Was your job fail

Re: Attempting to accept an unknown offer

2016-08-17 Thread Ted Yu
Can you provide more information ? Were you running on YARN ? Which version of Spark are you using ? Was your job failing ? Thanks On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote: > > W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an > unknown offer b859f2f3-7484-482d-8c0d-3

Re: Undefined function json_array_to_map

2016-08-17 Thread Ted Yu
Can you show the complete stack trace ? Which version of Spark are you using ? Thanks On Wed, Aug 17, 2016 at 8:46 AM, vr spark wrote: > Hi, > I am getting error on below scenario. Please suggest. > > i have a virtual view in hive > > view name log_data > it has 2 columns > > query_map

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Ted Yu
t's a converted dataset of case classes to > dataframe. This is deterministically causing the error in Scala 2.11. > > Once I can get a deterministically breaking test without work code I will > try to file a Jira bug. > > On Tue, Aug 16, 2016, 04:17 Ted Yu wrote: > >> I t

Re: long lineage

2016-08-16 Thread Ted Yu
Have you tried periodic checkpoints ? Cheers > On Aug 16, 2016, at 5:50 AM, pseudo oduesp wrote: > > Hi , > how we can deal after raise stackoverflow trigger by long lineage ? > i mean i have this error and how resolve it wiyhout creating new session > thanks >
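
A minimal sketch of the periodic checkpointing suggested above, assuming an existing SparkContext named sc; the checkpoint directory and interval are illustrative.

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")    # placeholder path

    rdd = sc.parallelize(range(1000))
    for i in range(100):
        rdd = rdd.map(lambda x: x + 1)
        if i % 10 == 0:
            rdd.checkpoint()   # truncate the lineage here
            rdd.count()        # action forces the checkpoint to materialize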

Re: class not found exception Logging while running JavaKMeansExample

2016-08-16 Thread Ted Yu
og4j and sl4j dependencies in pom. I am > still not getting what dependencies I am missing. > > Best Regards, > Subash Basnet > > On Mon, Aug 15, 2016 at 6:50 PM, Ted Yu wrote: > >> Logging has become private in 2.0 release: >> >> private[spark] tra

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Ted Yu
-15285 with master branch. > Should we reopen SPARK-15285? > > Best Regards, > Kazuaki Ishizaki, > > > > From:Ted Yu > To:dhruve ashar > Cc:Aris , "user@spark.apache.org" > > Date:2016/08/15 06:19 > Subject:

Re: class not found exception Logging while running JavaKMeansExample

2016-08-15 Thread Ted Yu
Logging has become private in 2.0 release: private[spark] trait Logging { On Mon, Aug 15, 2016 at 9:48 AM, subash basnet wrote: > Hello all, > > I am trying to run JavaKMeansExample of the spark example project. I am > getting the classnotfound exception error: > *Exception in thread "main" jav

Re: Spark 2.0.0 JaninoRuntimeException

2016-08-14 Thread Ted Yu
Looks like the proposed fix was reverted: Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB" This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf. Maybe this was fixed in some other JIRA ? On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar w

Re: Why I can't use broadcast var defined in a global object?

2016-08-13 Thread Ted Yu
Can you (or David) resend David's reply ? I don't see the reply in this thread. Thanks > On Aug 13, 2016, at 8:39 PM, yaochunnan wrote: > > Hi David, > Your answers have solved my problem! Detailed and accurate. Thank you very > much! > > > > -- > View this message in context: > http://a

Re: Single point of failure with Driver host crashing

2016-08-11 Thread Ted Yu
Have you read https://spark.apache.org/docs/latest/spark-standalone.html#high-availability ? FYI On Thu, Aug 11, 2016 at 12:40 PM, Mich Talebzadeh wrote: > > Hi, > > Although Spark is fault tolerant when nodes go down like below: > > FROM tmp > [Stage 1:===>

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-08 Thread Ted Yu
lacesUnchanged.unionAll(placesAddedWithMerchantId). > unionAll(placesUpdatedFromHotelsWithMerchantId).unionAll(pla > cesUpdatedFromRestaurantsWithMerchantId).unionAll(placesChanged) > > I'm using Spark 1.6.2. > > On Mon, Aug 8, 2016 at 3:11 PM, Ted Yu wrote: > >&g

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-08 Thread Ted Yu
Can you show the code snippet for unionAll operation ? Which Spark release do you use ? BTW please use user@spark.apache.org in the future. On Mon, Aug 8, 2016 at 11:47 AM, max square wrote: > Hey guys, > > I'm trying to save Dataframe in CSV format after performing unionAll > operations on it

Re: Multiple Sources Found for Parquet

2016-08-08 Thread Ted Yu
Can you examine the classpath to see where DefaultSource comes from? Thanks On Mon, Aug 8, 2016 at 2:34 AM, 金国栋 wrote: > I'm using Spark 2.0.0 to do sql analysis over parquet files, when using > `read().parquet("path")`, or `write().parquet("path")` in Java (I followed > the example java file in

Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Ted Yu
The link in Jerry's response was quite old. Please see: http://hbase.apache.org/book.html#security Thanks On Sun, Aug 7, 2016 at 6:55 PM, Saisai Shao wrote: > 1. Standalone mode doesn't support accessing kerberized Hadoop, simply > because it lacks the mechanism to distribute delegation tokens
