RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, that is at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... this was introduced in Spark 3.4.0 to handle temp table query and CTE query against MSSQL server since what you send in is not actually what get
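A minimal sketch of how that option is typically used (the connection string, credentials, and CTE text below are hypothetical; prepareQuery is available from Spark 3.4.0 onward):

    // prepareQuery is prepended to the generated query, so CTE / temp-table setup
    // that SQL Server cannot accept inside a subquery runs before the main SELECT.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://myhost:1433;databaseName=mydb")
      .option("prepareQuery", "WITH filtered AS (SELECT * FROM dbo.big_table WHERE status = 'OPEN')")
      .option("query", "SELECT * FROM filtered")
      .option("user", dbUser)         // assumed to be defined elsewhere
      .option("password", dbPassword) // assumed to be defined elsewhere
      .load()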

Unsubscribe

2023-07-27 Thread Kevin Wang
Unsubscribe please!

RE: determine week of month from date in spark3

2022-02-11 Thread Appel, Kevin
-03-30| 2| | 4|2014-03-31| 3| | 5|2015-03-07| 7| | 6|2015-03-08| 1| | 7|2015-03-30| 2| | 8|2015-03-31| 3| +---+--++ From: Appel, Kevin Sent: Friday, February 11, 2022 2:35 PM To: user@spark.apache.org; 'Sean

determine week of month from date in spark3

2022-02-11 Thread Appel, Kevin
there any caveats or items to be aware of that might get us later? For example, in a future Spark 3.3.x is this option going to be deprecated? This was an item we ran into in the Spark 2 to Spark 3 conversion, and we are trying to see how best to handle it. Thanks for your feedback, Kevin -
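For reference, one way to derive a week-of-month value in Spark 3 without the legacy 'W' datetime pattern, as a sketch only, assuming a date column named dt and weeks aligned to the calendar row that contains the 1st of the month:

    import org.apache.spark.sql.functions._

    // dayofweek(trunc(dt, "MM")) is the weekday (1 = Sunday) of the first day of dt's month;
    // shifting dayofmonth by that offset and dividing by 7 gives the calendar-row number.
    val withWeekOfMonth = df.withColumn(
      "week_of_month",
      ceil((dayofmonth(col("dt")) + dayofweek(trunc(col("dt"), "MM")) - lit(1)) / 7.0)
    )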

How to run spark benchmark on standalone cluster?

2021-07-02 Thread Kevin Su
Hi all, I want to run spark benchmark on a standalone cluster, and I have changed the DataSourceReadBenchmark.scala setting. (Remove "spark.master") --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala +++ b/sql/core/src/test/scala/org/apache/spar

Fwd: Fail to run benchmark in Github Action

2021-06-26 Thread Kevin Su
-- Forwarded message - From: Kevin Su Date: Friday, June 25, 2021 at 8:23 PM Subject: Fail to run benchmark in Github Action To: Hi all, I tried to run a benchmark test in GitHub Actions in my fork, and I faced the below error. https://github.com/pingsutw/spark/runs/2867617238

Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-29 Thread chen kevin
big table that will be joined. * I think frequent I/O actions like select may cause memory or I/O issues. 2. You can use PostgreSQL connection pools to avoid making connections frequently. -- Best, Kevin Chen From: Geervan Hayatnagarkar Date: Sunday, November 29, 2020 at 6:20 PM
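A rough sketch of the pooling idea from point 2, assuming HikariCP is on the classpath and using a hypothetical PostgreSQL URL; the pool lives in a JVM-level object so each executor reuses connections across micro-batches instead of opening one per record:

    import java.sql.Connection
    import com.zaxxer.hikari.{HikariConfig, HikariDataSource}
    import org.apache.spark.sql.Row

    // One pool per executor JVM; lazy, so it is created on first use on each executor.
    object PgPool {
      lazy val dataSource: HikariDataSource = {
        val cfg = new HikariConfig()
        cfg.setJdbcUrl("jdbc:postgresql://pg-host:5432/mydb")
        cfg.setUsername("user")
        cfg.setPassword("secret")
        cfg.setMaximumPoolSize(5)
        new HikariDataSource(cfg)
      }
    }

    // e.g. inside foreachBatch: every partition borrows a pooled connection
    // instead of opening a fresh one for each lookup.
    val lookupPartition: Iterator[Row] => Unit = { rows =>
      val conn: Connection = PgPool.dataSource.getConnection
      try rows.foreach { row => () }   // query the needed slice of the static table per row
      finally conn.close()             // returns the connection to the pool
    }
    batchDf.foreachPartition(lookupPartition)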

Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-29 Thread chen kevin
Hi, you can use Debezium to capture the row-level changes in PostgreSQL in real time, then stream them to Kafka, and finally ETL and write the data to HBase with Flink/Spark Streaming. So you can join the data in HBase directly. In consideration of the particularly big table, the scan performance in

Re: how to manage HBase connections in Executors of Spark Streaming ?

2020-11-25 Thread chen kevin
1. The issue of Kerberos tickets expiring: * Usually you don't need to worry about it; you can use the local keytab on every node in the Hadoop cluster. * If the keytab is not present in your Hadoop cluster, you will need to update your keytab in every executor periodically. 2. bes

Re: Using two WriteStreams in same spark structured streaming job

2020-11-08 Thread Kevin Pis
h function, then I may need to use custom Kafka stream > writer > right ?! > > And I might not be able to use default writestream.format(Kafka) method ?! > > -- Best, Kevin Pis

Re: Spark streaming with Kafka

2020-11-03 Thread Kevin Pis
t; > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best, Kevin Pis

Re: spark-submit parameters about two keytab files to yarn and kafka

2020-11-01 Thread kevin chen
g to SASL_PLAINTEXT, if your spark version is 1.6. *note:* my test env: spark 2.0.2 kafka 0.10 references 1. using-spark-streaming <https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.0/bk_spark-component-guide/content/using-spark-streaming.html> -- Best, Kevin Pis Gabor Somogyi wrote on 2020

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-11-01 Thread kevin chen
Perhaps adding random numbers to the entity_id column can avoid errors (exhausting executor and driver memory) when you solve the issue Patrick's way. Daniel Chalef wrote on Saturday, October 31, 2020 at 12:42 AM: > Yes, the resulting matrix would be sparse. Thanks for the suggestion. Will > explore ways of doing
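A small sketch of that salting idea (the column name and salt range are assumptions): appending a random suffix to entity_id spreads a hot key across several partitions, at the cost of a second aggregation to merge the partial results:

    import org.apache.spark.sql.functions._

    val saltBuckets = 16  // assumed; tune to the observed skew
    val salted = df.withColumn(
      "salted_entity_id",
      concat_ws("_", col("entity_id"), (rand() * saltBuckets).cast("int"))
    )
    // Aggregate on salted_entity_id first, then strip the salt and aggregate once more.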

Re: Exception when reading multiline JSON file

2019-09-12 Thread Kevin Mellott
easier to troubleshoot because you can execute the Spark code one step at a time using their visual notebook experience. Hope that helps point you in the right direction. https://spark.apache.org/docs/latest/monitoring.html https://m.youtube.com/watch?v=KscZf1y97m8 Kevin On Thu, Sep 12, 2019 at 12

Re: How to sleep Spark job

2019-01-22 Thread Kevin Mellott
I’d recommend using a scheduler of some kind to trigger your job each hour, and have the Spark job exit when it completes. Spark is not meant to run in any type of “sleep mode”, unless you want to run a structured streaming job and create a separate process to pull data from Cassandra and publish it

spark jdbc postgres query results don't match those of postgres query

2018-03-29 Thread Kevin Peng
I am running into a weird issue in Spark 1.6, which I was wondering if anyone has encountered before. I am running a simple select query from spark using a jdbc connection to postgres: val POSTGRES_DRIVER: String = "org.postgresql.Driver" val srcSql = """select total_action_value, last_updated from

NullPointerException issue in LDA.train()

2018-02-09 Thread Kevin Lam
I've heavily followed the code outlined here: http://sean.lane.sh/blog/2016/PySpark_and_LDA Any ideas or help is appreciated!! Thanks in advance, Kevin Example trace of output: 16:22:55 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 8.0 in >> stage 42.0 (TID 16163, >

How to preserve the order of parquet files?

2018-02-07 Thread Kevin Jung
Hi all, In spark 2.2.1, when I load parquet files, it shows differently ordered result of original dataset. It seems like FileSourceScanExec.createNonBucketedReadRDD method sorts parquet file splits by their own lengths. - val splitFiles = selectedPartitions.flatMap { partition =>

[Spark ML] LogisticRegressionWithSGD

2017-06-29 Thread Kevin Quinn
Hello, I'd like to build a system that leverages semi-online updates and I wanted to use stochastic gradient descent. However, after looking at the documentation it looks like that method is deprecated. Is there a reason why it was deprecated? Is there a planned replacement? As far as I know L

Re: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Kevin Wang
I am also interested in this topic. Anything else anyone can recommend? Thanks. Best, Kevin On Tue, Apr 11, 2017 at 5:00 AM, Alonso Isidoro Roman wrote: > i did not use it yet, but this library looks promising: > > https://github.com/databricks/spark-corenlp > > > Al

Re: Aggregated column name

2017-03-23 Thread Kevin Mellott
")).agg(count("number"))*.alias("ColumnNameCount")* Hope that helps! Kevin On Thu, Mar 23, 2017 at 2:41 AM, Wen Pei Yu wrote: > Hi All > > I found some spark version (spark 1.4) return upper case aggregated > column, and some return lower case. > As below code,
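Spelled out, one arrangement that names the aggregated column directly (column names here are placeholders) is to attach the alias to the aggregate expression inside agg:

    import org.apache.spark.sql.functions._

    val counted = df.groupBy(col("someCol"))
      .agg(count("number").alias("ColumnNameCount"))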

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Kevin Peng
Mohini, We set that parameter before we went and played with the number of executors and that didn't seem to help at all. Thanks, KP On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar wrote: > Hi, > > try using this parameter --conf spark.sql.shuffle.partitions=1000 > > Thanks, > Mohini > > On

Re: pivot over non numerical data

2017-02-01 Thread Kevin Mellott
This should work for non-numerical data as well - can you please elaborate on the error you are getting and provide a code sample? As a preliminary hint, you can "aggregate" text values using *max*. df.groupBy("someCol") .pivot("anotherCol") .agg(max($"textC
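Filled in, the hint looks roughly like this (column names are placeholders); max works on text because string comparison is lexicographic, so it simply keeps one deterministic value per cell:

    import org.apache.spark.sql.functions._

    val pivoted = df.groupBy("someCol")
      .pivot("anotherCol")
      .agg(max(col("textCol")))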

Re: spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Kevin Burton
ed SparkEvents of which stick around in order > for the UI to render. There are some options under `spark.ui.retained*` to > limit that if it's a problem. > > > On Mon, Jan 9, 2017 at 6:00 PM, Kevin Burton wrote: > >> We've had various OOM issues with spark and ha

spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Kevin Burton
going to try to give it more of course but would be nice to know if this is a legitimate memory constraint or there is a bug somewhere. PS: One thought I had was that it would be nice to have spark keep track of where an OOM was encountered, in what component. Kevin -- We’re hiring if you know

Re: Spark app write too many small parquet files

2016-12-08 Thread Kevin Tran
. What is the practice for file size and number of files? How do I compact small parquet files into a smaller number of bigger parquet files? Thanks, Kevin. On Tue, Nov 29, 2016 at 3:01 AM, Chin Wei Low wrote: > Try limiting the partitions. spark.sql.shuffle.partitions > > This controls the number
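One common way to cut the file count, a sketch not taken from the thread, is to repartition (or coalesce) just before writing so each task emits one reasonably sized file; the partition count below is an assumption you would derive from the batch's data volume:

    df.repartition(8)               // or coalesce(8) to avoid a full shuffle
      .write
      .mode("append")
      .parquet("/path/to/output")   // hypothetical path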

OutOfMemoryError while running job...

2016-12-06 Thread Kevin Burton
I am trying to run a Spark job which reads from ElasticSearch and should write it's output back to a separate ElasticSearch index. Unfortunately I keep getting `java.lang.OutOfMemoryError: Java heap space` exceptions. I've tried running it with: --conf spark.memory.offHeap.enabled=true --conf spark

Re: Spark app write too many small parquet files

2016-11-28 Thread Kevin Tran
Hi Denny, Thank you for your input. I also use 128 MB, but the Spark app still generates too many files of only ~14 KB each! That's why I'm asking whether there is a solution, in case someone has had the same issue. Cheers, Kevin. On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee wrote: > G

Spark app write too many small parquet files

2016-11-27 Thread Kevin Tran
Should it write each chunk at a bigger data size (such as 128 MB) with a proper number of files? Has anyone found any performance changes when changing the data size of each parquet file? Thanks, Kevin.

Re: Nearest neighbour search

2016-11-14 Thread Kevin Mellott
You may be able to benefit from Soundcloud's open source implementation, either as a solution or as a reference implementation. https://github.com/soundcloud/cosine-lsh-join-spark Thanks, Kevin On Sun, Nov 13, 2016 at 2:07 PM, Meeraj Kunnumpurath < mee...@servicesymphony.com> wrote:

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
have to check out HBase as well; I've heard good things! Thanks, Kevin On Mon, Oct 10, 2016 at 11:38 AM, Mich Talebzadeh wrote: > Hi Kevin, > > What is the streaming interval (batch interval) above? > > I do analytics on streaming trade data but after manipulation of > indi

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
om 30 seconds to around 1 second. // ssc = instance of SparkStreamingContext ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") I've also verified that the parquet files being generated are usable by both Hive and Impala. Hope that helps!

Spark Streaming Advice

2016-10-06 Thread Kevin Mellott
run into a similar situation regarding data ingestion with Spark Streaming and do you have any tips to share? Our end goal is to store the information in a way that makes it efficient to query, using a tool like Hive or Impala. Thanks, Kevin

Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread Kevin Mellott
The documentation details the algorithm being used at http://spark.apache.org/docs/latest/mllib-decision-tree.html Thanks, Kevin On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty wrote: > Hi, > > Any help here is appreciated .. > > On Wed, Sep 28, 2016 at 11:34 AM, janardhan

Re: Dataframe Grouping - Sorting - Mapping

2016-09-30 Thread Kevin Mellott
api/scala/index.html#org.apache.spark.sql.DataFrame Thanks, Kevin On Fri, Sep 30, 2016 at 5:46 AM, AJT wrote: > I'm looking to do the following with my Spark dataframe > (1) val df1 = df.groupBy() > (2) val df2 = df1.sort() > (3) val df3 = df2.mapPartitions() > > I can alre

Extract timestamp from Kafka message

2016-09-25 Thread Kevin Tran
, kafkaParams, topics ); Thanks, Kevin.

Re: Optimal/Expected way to run demo spark-scala scripts?

2016-09-23 Thread Kevin Mellott
://databricks.com/try-databricks Thanks, Kevin On Fri, Sep 23, 2016 at 2:37 PM, Dan Bikle wrote: > hello spark-world, > > I am new to spark and want to learn how to use it. > > I come from the Python world. > > I see an example at the url below: > > http://spark.apache.org/d

Re: In Spark-scala, how to fill Vectors.dense in DataFrame from CSV?

2016-09-22 Thread Kevin Mellott
You'll want to use the spark-csv package, which is included in Spark 2.0. The repository documentation has some great usage examples. https://github.com/databricks/spark-csv Thanks, Kevin On Thu, Sep 22, 2016 at 8:40 PM, Dan Bikle wrote: > hello spark-world, > > I am new t
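A sketch of the end-to-end idea on Spark 2.x (the file path and column names are hypothetical): read the CSV with the built-in reader, then let VectorAssembler produce the dense feature vectors rather than filling Vectors.dense by hand:

    import org.apache.spark.ml.feature.VectorAssembler

    val raw = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/path/to/data.csv")

    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2", "x3"))  // numeric input columns
      .setOutputCol("features")

    val withVectors = assembler.transform(raw)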

Re: unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df: not found

2016-09-21 Thread Kevin Mellott
3 You can verify the available versions by searching Maven at http://search.maven.org. Thanks, Kevin On Wed, Sep 21, 2016 at 3:38 AM, muhammet pakyürek wrote: > while i run the spark-shell as below > > spark-shell --jars '/home/ktuser/spark-cassandra- > connector/target/scala

Re: Similar Items

2016-09-20 Thread Kevin Mellott
Using the Soundcloud implementation of LSH, I was able to process a 22K product dataset in a mere 65 seconds! Thanks so much for the help! On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott wrote: > Thanks Nick - those examples will help a ton!! > > On Tue, Sep 20, 2016 at 12:20 PM, Nick

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
that one also > > On Sep 20, 2016 10:44 PM, "Kevin Mellott" > wrote: > >> Instead of *mode="append"*, try *mode="overwrite"* >> >> On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally < >> sankar.mittapa...@creditvidya.com> wrot

Re: Similar Items

2016-09-20 Thread Kevin Mellott
hub.com/soundcloud/cosine-lsh-join-spark - not used this but > looks like it should do exactly what you need. > https://github.com/mrsqueeze/*spark*-hash > <https://github.com/mrsqueeze/spark-hash> > > > On Tue, 20 Sep 2016 at 18:06 Kevin Mellott > wrote: > >>

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
") > > I tried these two commands. > write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",header="true") > > saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true") > > > > On Tue,

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Can you please post the line of code that is doing the df.write command? On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Hey Kevin, > > It is a empty directory, It is able to write part files to the directory > but while mergin

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
/api/R/write.df.html Thanks, Kevin On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally < sankar.mittapa...@creditvidya.com> wrote: > We have setup a spark cluster which is on NFS shared storage, there is no > permission issues with NFS storage, all the users are able to write to NFS

Similar Items

2016-09-19 Thread Kevin Mellott
orward way to do this in Spark? I tried creating a UDF (that used the Breeze linear algebra methods internally); however, that did not scale well. Thanks, Kevin

Re: study materials for operators on Dataframe

2016-09-19 Thread Kevin Mellott
I would recommend signing up for a Databricks Community Edition account. It will give you access to a 6GB cluster, with many different example programs that you can use to get started. https://databricks.com/try-databricks If you are looking for a more formal training method, I just completed the

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Kevin Mellott
ning.html Hope that helps! Kevin On Mon, Sep 19, 2016 at 9:32 AM, Anand Viswanathan < anand_v...@ymail.com.invalid> wrote: > Hi, > > Spark version :spark-1.5.2-bin-hadoop2.6 ,using pyspark. > > I am running a machine learning program, which runs perfectly by > specifyi

Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-19 Thread Kevin Burton
I tried with write.json and write.csv. The write.text method won't work because I have more than one column and refuses to execute. Doesn't seem to work on any data. On Sat, Sep 17, 2016 at 10:52 PM, Hyukjin Kwon wrote: > Hi Kevin, > > I have few questions on this. > >

Re: Missing output partition file in S3

2016-09-19 Thread Chen, Kevin
, Kevin From: Steve Loughran <ste...@hortonworks.com> Date: Friday, September 16, 2016 at 3:46 AM To: Chen Kevin <kevin.c...@neustar.biz> Cc: "user@spark.apache.org" Subject: Re: Missing o

take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Kevin Burton
I'm seeing some weird behavior and wanted some feedback. I have a fairly large, multi-hour job that operates over about 5TB of data. It builds it out into a ranked category index of about 25000 categories sorted by rank, descending. I want to write this to a file but it's not actually writing an

Missing output partition file in S3

2016-09-15 Thread Chen, Kevin
Hi, Has anyone encountered an issue of a missing output partition file in S3? My Spark job writes output to an S3 location. Occasionally, I noticed one partition file is missing. As a result, one chunk of data was lost. If I rerun the same job, the problem usually goes away. This has been happen

Add sqldriver.jar to Spark 1.6.0 executors

2016-09-14 Thread Kevin Tran
me ! Has anyone gotten a Spark app to work with a driver jar on the executors before? Please give me your ideas. Thank you. Cheers, Kevin.

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
on the command line if using > the shell. > > > On Tue, Sep 13, 2016, 19:22 Kevin Burton wrote: > >> The problem is that without a new spark context, with a custom conf, >> elasticsearch-hadoop is refusing to read in settings about the ES setup... >> >> if I

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 13 September 2016 at 18:57, Sean Owen wrote: > >> But you're in

Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
I'm rather confused here as to what to do about creating a new SparkContext. Spark 2.0 prevents it... (exception included below) yet a TON of examples I've seen basically tell you to create a new SparkContext as standard practice: http://spark.apache.org/docs/latest/configuration.html#dynamicall

"Too many elements to create a power set" on Elasticsearch

2016-09-11 Thread Kevin Burton
1.6.1 and 1.6.2 don't work on our Elasticsearch setup because we use daily indexes. We get the error: "Too many elements to create a power set" It works on SINGLE indexes.. but if I specify content_* then I get this error. I don't see this documented anywhere. Is this a known issue? Is there

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
, Sep 10, 2016 at 7:42 PM, Kevin Burton wrote: > Ah.. might actually. I'll have to mess around with that. > > On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley wrote: > >> Would `topByKey` help? >> >> https://github.com/apache/spark/blob/master/mllib/src/main/

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
> Karl > > On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton wrote: > >> I'm trying to figure out a way to group by and return the top 100 records >> in that group. >> >> Something like: >> >> SELECT TOP(100, user_id) FROM posts GROUP BY user_id; >

Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
I'm trying to figure out a way to group by and return the top 100 records in that group. Something like: SELECT TOP(100, user_id) FROM posts GROUP BY user_id; But I can't really figure out the best way to do this... There is a FIRST and LAST aggregate function but this only returns one column.
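For reference, the usual DataFrame approach is a window function: number the rows within each group by some ordering column (assumed here to be called score) and keep the first 100:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val byUser = Window.partitionBy(col("user_id")).orderBy(col("score").desc)

    val top100PerUser = posts
      .withColumn("rn", row_number().over(byUser))
      .filter(col("rn") <= 100)
      .drop("rn")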

Re: call() function being called 3 times

2016-09-07 Thread Kevin Tran
h worker-0] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 12.0 (TID 12). 2518 bytes result sent to driver Does anyone have any ideas? On Wed, Sep 7, 2016 at 7:30 PM, Kevin Tran wrote: > Hi Everyone, > Does anyone know why call() function being called *3 tim

call() function being called 3 times

2016-09-07 Thread Kevin Tran
SQLContext(rdd.context()); > > >> JavaRDD rowRDD = rdd.map(new Function() { > > public JavaBean call(String record) { >> *<== being called 3 times* > > What I tried: * *cache()* * cleaning up *checkpoint dir* Thanks, Kevin.

Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi Mich, Thank you for your input. Does the monotonically increasing id guard against race conditions, or can it produce duplicate ids at some point with multiple threads, multiple instances, ...? Can even System.currentTimeMillis() still produce duplicates? Cheers, Kevin. On Mon, Sep 5, 2016 at 12:30 AM, Mich
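For context, a small sketch of the built-in generator under discussion: monotonically_increasing_id packs the partition id into the upper bits, so values are unique within a single DataFrame without cross-executor coordination, though they are not consecutive and not guaranteed unique across separate jobs or tables:

    import org.apache.spark.sql.functions._

    val withId = df.withColumn("id", monotonically_increasing_id())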

Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi everyone, Please give me your opinions: what is the best ID generator for an ID field in parquet? UUID.randomUUID(); AtomicReference<Long> currentTime = new AtomicReference<>(System.currentTimeMillis()); AtomicLong counter = new AtomicLong(0); Thanks, Kevin. https://issues.apac

Re: Best practises to storing data in Parquet files

2016-08-28 Thread Kevin Tran
reference architecture which HBase is a part of? Please share with me the best practices you might know or your favourite designs. Thanks, Kevin. On Mon, Aug 29, 2016 at 5:18 AM, Mich Talebzadeh wrote: > Hi, > > Can you explain your particular stack? > > Example what i

Best practises to storing data in Parquet files

2016-08-28 Thread Kevin Tran
Hi, Does anyone know the best practices for storing data in parquet files? Does a parquet file have a size limit (1 TB)? Should we use SaveMode.APPEND for a long-running streaming app? How should we store the data in HDFS (directory structure, ...)? Thanks, Kevin.

Spark StringType could hold how many characters ?

2016-08-28 Thread Kevin Tran
could handle ? In the Spark code: org.apache.spark.sql.types.StringType /** * The default size of a value of the StringType is 4096 bytes. */ override def defaultSize: Int = 4096 Thanks, Kevin.

Write parquet file from Spark Streaming

2016-08-27 Thread Kevin Tran
Hi Everyone, Does anyone know how to write parquet file after parsing data in Spark Streaming? Thanks, Kevin.

Re: tpcds for spark2.0

2016-08-01 Thread kevin
29 21:17 GMT+08:00 Olivier Girardot : > I have the same kind of issue (not using spark-sql-perf), just trying to > deploy 2.0.0 on mesos. > I'll keep you posted as I investigate > > > > On Wed, Jul 27, 2016 1:06 PM, kevin kiss.kevin...@gmail.com wrote: > >> hi,all:

Re: spark.read.format("jdbc")

2016-08-01 Thread kevin
,'email','gender')" > statement.executeUpdate(sql_insert) > > > Also you should specify the path to your jdbc jar file in the --driver-class-path > variable when you run spark-submit: > > spark-shell --master "local[2]" --driver-class-path > /opt/cl

Re: spark.read.format("jdbc")

2016-07-31 Thread kevin
maybe there is another Spark version on the classpath? 2016-08-01 14:30 GMT+08:00 kevin : > hi, all: > I try to load data from a jdbc data source, but I got this error: > java.lang.RuntimeException: Multiple sources found

spark.read.format("jdbc")

2016-07-31 Thread kevin
hi, all: I try to load data from a jdbc data source, but I got this error: java.lang.RuntimeException: Multiple sources found for jdbc (org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider, org.apache.spark.sql.execution.datasources.jdbc.DefaultSource), please specify the fully quali
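As the error text suggests, one way around the clash while the classpath still carries two JDBC providers (a sketch; removing the duplicate Spark jars is the real fix) is to pass the fully qualified provider class instead of the short jdbc alias:

    val jdbcDF = spark.read
      .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")
      .option("url", "jdbc:mysql://master1:3306/demo")
      .option("dbtable", "some_table")   // hypothetical table
      .option("user", "user")
      .option("password", "secret")
      .load()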

tpcds for spark2.0

2016-07-27 Thread kevin
hi, all: I want to run a test of the tpcds99 SQL queries on Spark 2.0. I use https://github.com/databricks/spark-sql-perf at the master version; when I run: val tpcds = new TPCDS (sqlContext = sqlContext) I got an error: scala> val tpcds = new TPCDS (sqlContext = sqlContext) error: missing or invalid

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread kevin
l you call collect, Spark *does nothing*, so your df would not > have any data -> you can't call foreach. Calling collect executes the process -> gets the data -> then foreach is ok. > > On Jul 26, 2016, at 2:30 PM, kevin wrote: > blacklistDF.collect() > >

dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread kevin
Hi all: I don't quite understand the difference between dataframe.foreach and dataframe.collect().foreach. When should I use dataframe.foreach? I use Spark 2.0 and I want to iterate over a dataframe to get one column's value. This works: blacklistDF.collect().foreach { x => println(s">
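A minimal illustration of the difference (the getString(0) access is an assumption about the schema): df.foreach runs on the executors, so its println output lands in the executor logs, whereas collect() first brings every row back to the driver and the loop then runs locally:

    import org.apache.spark.sql.Row

    val printRow: Row => Unit = row => println(row.getString(0))

    // Runs on the executors; output appears in executor stderr, not the driver console.
    blacklistDF.foreach(printRow)

    // Pulls all rows to the driver first (fine for small results), then iterates locally.
    blacklistDF.collect().foreach(printRow)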

Re: spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
thanks a lot Terry 2016-07-26 12:03 GMT+08:00 Terry Hoo : > Kevin, > > Try to create the StreamingContext as following: > > val ssc = new StreamingContext(spark.sparkContext, Seconds(2)) > > > > On Tue, Jul 26, 2016 at 11:25 AM, kevin wrote: > >> hi,all: &

spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
hi, all: I want to read data from Kafka and register it as a table, then join it with a jdbc table. My sample looks like this: val spark = SparkSession .builder .config(sparkConf) .getOrCreate() val jdbcDF = spark.read.format("jdbc").options(Map("url" -> "jdbc:mysql://master1:3306/demo", "drive

Re: Odp.: spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
------- > *From:* kevin > *Sent:* July 25, 2016 11:33 > *To:* user.spark; dev.spark > *Subject:* spark2.0 can't run SqlNetworkWordCount > > hi, all: > I downloaded the Spark 2.0 pre-built release. I can run the SqlNetworkWordCount test using: > bin/run-example org.apache.spark.exa

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
1.6. There is also Kafka 0.10 > support in > > dstream. > > > > On July 25, 2016 at 10:26:49 AM, Andy Davidson > > (a...@santacruzintegration.com) wrote: > > > > Hi Kevin > > > > Just a heads up at the recent spark summit in S.F. There was a > presen

spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
hi, all: I downloaded the Spark 2.0 pre-built release. I can run the SqlNetworkWordCount test using: bin/run-example org.apache.spark.examples.streaming.SqlNetworkWordCount master1 but when I take the Spark 2.0 example source code SqlNetworkWordCount.scala and build it into a jar with dependencies ( JDK 1.8 AND SCALA

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
I have compiled it from source code. 2016-07-25 12:05 GMT+08:00 kevin : > hi, all: > I try to run the example org.apache.spark.examples.streaming.KafkaWordCount, > and I got this error: > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/spark/streami

where I can find spark-streaming-kafka for spark2.0

2016-07-24 Thread kevin
hi, all: I try to run the example org.apache.spark.examples.streaming.KafkaWordCount, and I got this error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$ at org.apache.spark.examples.streaming.KafkaWordCount$.main(KafkaWordCount.scala:57) at org.apache

ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread kevin
hi, all: I build Spark using: ./make-distribution.sh --name "hadoop2.7.1" --tgz "-Pyarn,hadoop-2.6,parquet-provided,hive,hive-thriftserver" -DskipTests -Dhadoop.version=2.7.1 I can run the example: ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master spark://master1:7077 \ --

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
Yeah.. thanks Nick. Figured that out since your last email... I deleted the 2.10 by accident but then put 2+2 together. Got it working now. Still sticking to my story that it's somewhat complicated to setup :) Kevin On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath wrote: > Which Scala

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
Spark versions). > > > > On Thu, 2 Jun 2016 at 15:34 Kevin Burton wrote: > >> I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say it's >> not super easy. >> >> I wish there was an easier way to get this stuff to work.. Last tim

Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
t elasticsearch-hadoop-2.3.2.jar and try again. Lots of trial and error here :-/ Kevin -- We’re hiring if you know of any awesome Java Devops or Linux Operations Engineers! Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my

Compute the global rank of the column

2016-05-31 Thread Dai, Kevin
Hi all, I want to compute the rank of some column in a table. Currently I use a window function to do it; however, all the data ends up in one partition. Is there a better solution? Regards, Kevin.
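One way to sidestep the single-partition window, sketched here under the assumptions of a numeric column named value and that a dense, 1-based rank over distinct values is acceptable, is to rank the distinct values with a sorted RDD and zipWithIndex (both run in parallel) and join the ranks back:

    import org.apache.spark.sql.functions._
    import spark.implicits._   // assumes `spark` is the active SparkSession

    val rankedValues = df.select(col("value")).distinct()
      .rdd.map(_.getDouble(0))
      .sortBy(identity)
      .zipWithIndex()
      .map { case (v, idx) => (v, idx + 1) }   // 1-based rank
      .toDF("value", "rank")

    val withRank = df.join(rankedValues, Seq("value"))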

Re: Alternative to groupByKey() + mapValues() for non-commutative, non-associative aggregate?

2016-05-03 Thread Kevin Mellott
If you put this into a dataframe then you may be able to use one hot encoding and treat these as categorical features. I believe that the ml pipeline components use project tungsten so the performance will be very fast. After you process the result on the dataframe you would then need to assemble y

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
all the rows with null columns in those fields. In other > words you > >> are doing a inner join in all your queries. > >> > >> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> > >> wrote: > >>> > >&

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
join. > > In Spark 2.0, we turn these join into inner join actually. > > On Tue, May 3, 2016 at 9:50 AM, Cesar Flores wrote: > > Hi > > > > Have you tried the joins without the where clause? When you use them you > are > > filtering all the rows with

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
at 11:16 PM, Davies Liu wrote: > as @Gourav said, all the join with different join type show the same > results, > which meant that all the rows from left could match at least one row from > right, > all the rows from right could match at least one row from left, even > the numb

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Yong, Sorry, let me explain my deduction; it is going to be difficult to get sample data out since the dataset I am using is proprietary. From the above set of queries (the ones mentioned in the comments above), both the inner and outer joins are producing the same counts. They are basically pulling out selected co

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, I wish that were the case, but I have done a select count on each of the two tables individually and they return different numbers of rows: dps.registerTempTable("dps_pin_promo_lt") swig.registerTempTable("swig_pin_promo_lt") dps.count() RESULT: 42632 swig.count() RESULT: 42034 On

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, Apologies. I edited my post with this information: Spark version: 1.6 Result from spark shell OS: Linux version 2.6.32-431.20.3.el6.x86_64 ( mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014 Thanks, KP On Mon,

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-16 Thread Kevin Eid
One last email to announce that I've fixed all of the issues. Don't hesitate to contact me if you encounter the same. I'd be happy to help. Regards, Kevin On 14 Apr 2016 12:39 p.m., "Kevin Eid" wrote: > Hi all, > > I managed to copy my .py files from loca

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Kevin Eid
ions about how to move those files from local to the cluster? Thanks in advance, Kevin On 12 April 2016 at 12:19, Sun, Rui wrote: > Which py file is your main file (primary py file)? Zip the other two py > files. Leave the main py file alone. Don't copy them to S3 because it seems >

Introducing Spark User Group in Korea & Question on creating non-software goods (stickers)

2016-04-01 Thread Kevin (Sangwoo) Kim
Hi all! I'm Kevin, one of the contributors to Spark, and I'm organizing the Spark User Group in Korea. We have 2,500 members in the community, and it's growing even faster today. https://www.facebook.com/groups/sparkkoreauser/

Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Kevin Peng
Ted, What triggerAndWait does is perform a REST call to a specified URL and then wait until a field in the JSON status message returned by that URL says complete. The issue is that I put a println at the very top of the method and it doesn't get printed out, and I know that println isn

java.lang.OutOfMemoryError: Direct buffer memory when using broadcast join

2016-03-21 Thread Dai, Kevin
2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) Can anyone tell me what's wrong and how to fix it? Best Regards, Kevin.

Re: How to convert Parquet file to a text file.

2016-03-15 Thread Kevin Mellott
I'd recommend reading the parquet file into a DataFrame object, and then using spark-csv to write to a CSV file. On Tue, Mar 15, 2016 at 3:34 PM, Shishir Anshuman wrote: > I need to convert the parquet file generated by the spark to a text (csv > prefera
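A sketch of that two-step conversion (paths are hypothetical); on Spark 2.x the CSV writer is built in, and on 1.x the same write goes through the com.databricks.spark.csv format from the spark-csv package:

    val parquetDF = spark.read.parquet("/path/to/input.parquet")

    parquetDF.write
      .option("header", "true")
      .csv("/path/to/output_csv")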
