ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread kevin
hi, all: I built Spark using: ./make-distribution.sh --name "hadoop2.7.1" --tgz "-Pyarn,hadoop-2.6,parquet-provided,hive,hive-thriftserver" -DskipTests -Dhadoop.version=2.7.1 I can run the example: ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master spark://master1:7077 \ --

where I can find spark-streaming-kafka for spark2.0

2016-07-24 Thread kevin
hi, all: I tried to run the example org.apache.spark.examples.streaming.KafkaWordCount and I got this error: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$ at org.apache.spark.examples.streaming.KafkaWordCount$.main(KafkaWordCount.scala:57) at org.apache

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
I have compiled it from the source code 2016-07-25 12:05 GMT+08:00 kevin : > hi,all : > I try to run example org.apache.spark.examples.streaming.KafkaWordCount , > I got error : > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/spark/streami

spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
hi, all: I downloaded the Spark 2.0 pre-built package. I can run the SqlNetworkWordCount test using: bin/run-example org.apache.spark.examples.streaming.SqlNetworkWordCount master1 but when I take the Spark 2.0 example source code SqlNetworkWordCount.scala and build it into a jar with dependencies ( JDK 1.8 AND SCALA

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
1.6. There is also Kafka 0.10 > support in > > dstream. > > > > On July 25, 2016 at 10:26:49 AM, Andy Davidson > > (a...@santacruzintegration.com) wrote: > > > > Hi Kevin > > > > Just a heads up at the recent spark summit in S.F. There was a > presen
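For reference, a hedged sketch of the sbt coordinates this thread converges on; Spark 2.0 splits the Kafka integration by broker version, so the exact artifact depends on your Kafka cluster:

// build.sbt sketch -- Scala 2.11 artifacts, Spark 2.0.0 assumed
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"
// or, for the Kafka 0.10 consumer API mentioned above:
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"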

Re: Odp.: spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
------- > *From:* kevin > *Sent:* 25 July 2016 11:33 > *To:* user.spark; dev.spark > *Subject:* spark2.0 can't run SqlNetworkWordCount > > hi,all: > I downloaded the Spark 2.0 pre-built package. I can run the SqlNetworkWordCount test using: > bin/run-example org.apache.spark.exa

spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
hi, all: I want to read data from Kafka and register it as a table, then join it with a JDBC table. My sample is like this: val spark = SparkSession .builder .config(sparkConf) .getOrCreate() val jdbcDF = spark.read.format("jdbc").options(Map("url" -> "jdbc:mysql://master1:3306/demo", "drive

Re: spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
thanks a lot Terry 2016-07-26 12:03 GMT+08:00 Terry Hoo : > Kevin, > > Try to create the StreamingContext as following: > > val ssc = new StreamingContext(spark.sparkContext, Seconds(2)) > > > > On Tue, Jul 26, 2016 at 11:25 AM, kevin wrote: > >> hi,all: &
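A minimal sketch of Terry's suggestion, assuming Spark 2.0 in Scala (the app name is hypothetical): the streaming context reuses the SparkContext that already backs the SparkSession instead of creating a second one.

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val sparkConf = new SparkConf().setAppName("KafkaJoinJdbc")   // hypothetical app name
val spark = SparkSession.builder.config(sparkConf).getOrCreate()
// Reuse the session's SparkContext for the StreamingContext
val ssc = new StreamingContext(spark.sparkContext, Seconds(2))
// Inside foreachRDD, spark.createDataFrame(...) can turn each batch into a table
// that is then joined against the jdbc DataFrame from the question.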

dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread kevin
HI ALL: I don't quite understand the difference between dataframe.foreach and dataframe.collect().foreach. When should I use dataframe.foreach? I use Spark 2.0, and I want to iterate over a dataframe to get one column's value: this works: blacklistDF.collect().foreach { x => println(s">

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread kevin
l you call collect spark* do nothing* so you df would not > have any data -> can’t call foreach. > Call collect execute the process -> get data -> foreach is ok. > > > On Jul 26, 2016, at 2:30 PM, kevin wrote: > > blacklistDF.collect() > > >
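A short sketch of the difference being discussed, assuming Spark 2.0 and the blacklistDF from the question (not the poster's exact code):

// Runs on the executors: the println output lands in executor stdout/logs,
// not in the driver console, which is why it looks like nothing happens.
blacklistDF.foreach { row => println(row) }

// collect() is an action that ships all rows to the driver first, so the println
// output appears in the driver console. Only safe for small results like a blacklist.
blacklistDF.collect().foreach { row => println(row) }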

tpcds for spark2.0

2016-07-27 Thread kevin
hi, all: I want to test the TPC-DS 99 SQL queries on Spark 2.0. I use https://github.com/databricks/spark-sql-perf (the master version). When I run: val tpcds = new TPCDS (sqlContext = sqlContext) I got this error: scala> val tpcds = new TPCDS (sqlContext = sqlContext) error: missing or invalid

spark.read.format("jdbc")

2016-07-31 Thread kevin
hi, all: I try to load data from a JDBC datasource, but I got this error: java.lang.RuntimeException: Multiple sources found for jdbc (org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider, org.apache.spark.sql.execution.datasources.jdbc.DefaultSource), please specify the fully quali

Re: spark.read.format("jdbc")

2016-07-31 Thread kevin
maybe there is another Spark version on the classpath? 2016-08-01 14:30 GMT+08:00 kevin : > hi,all: >I try to load data from jdbc datasource,but I got error with : > java.lang.RuntimeException: Multiple sources found
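A hedged workaround sketch while the duplicate jar is still on the classpath: the error message itself asks for the fully qualified class name, so the Spark 2.0 provider can be named explicitly instead of the short "jdbc" alias (connection values reuse the thread's; the table name is hypothetical).

val jdbcDF = spark.read
  .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")
  .option("url", "jdbc:mysql://master1:3306/demo")
  .option("dbtable", "some_table")              // hypothetical table name
  .option("driver", "com.mysql.jdbc.Driver")
  .load()
// The proper fix, as suggested above, is to remove the older Spark build from the classpath.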

Re: spark.read.format("jdbc")

2016-08-01 Thread kevin
,'email','gender')" > statement.executeUpdate(sql_insert) > > > Also you should specify path your jdbc jar file in --driver-class-path > variable when you running spark-submit: > > spark-shell --master "local[2]" --driver-class-path > /opt/cl

Re: tpcds for spark2.0

2016-08-01 Thread kevin
29 21:17 GMT+08:00 Olivier Girardot : > I have the same kind of issue (not using spark-sql-perf), just trying to > deploy 2.0.0 on mesos. > I'll keep you posted as I investigate > > > > On Wed, Jul 27, 2016 1:06 PM, kevin kiss.kevin...@gmail.com wrote: > >> hi,all:

how to build spark with out hive

2016-01-25 Thread kevin
HI, all, I need to test Hive on Spark, using Spark as Hive's execution engine. I downloaded the Spark 1.5.2 source from the Apache web site. I have installed Maven 3.3.9 and Scala 2.10.6, so I changed ./make-distribution.sh to point to the mvn location where I installed it. Then I run

hive1.2.1 on spark 1.5.2

2016-01-26 Thread kevin
hi, all, I tried Hive on Spark with Hive 1.2.1 and Spark 1.5.2. I built Spark without -Phive, and I tested the standalone Spark cluster with spark-submit and it is OK. But when I use Hive, on the Spark web UI I can see the Hive on Spark application; finally I got this error: 16/01/26 16:23:42 INFO slf

Keep state inside map function

2014-07-30 Thread Kevin
passed into the map function and then pass it along to the reduce function. Thanks in advance. -Kevin

Re: Keep state inside map function

2014-07-30 Thread Kevin
't share state across Mappers, or Mappers and Reducers in > Hadoop. (At least there was no direct way.) Same here. But you can > maintain state across many map calls. > > On Wed, Jul 30, 2014 at 6:07 PM, Kevin wrote: > > Hi, > > > > Is it possible to maintain state
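A small sketch of the "state across many map calls" idea from this reply, using mapPartitions so one object lives for a whole partition (hypothetical data, assuming an existing SparkContext sc):

val rdd = sc.parallelize(1 to 100, 4)
val withRunningTotal = rdd.mapPartitions { iter =>
  var total = 0L                       // state shared by every map call in this partition
  iter.map { x =>
    total += x
    (x, total)                         // emit the element with the partition-local running total
  }
}
withRunningTotal.collect()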

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-11-01 Thread kevin chen
Perhaps adding random numbers to the entity_id column can avoid errors (exhausting executor and driver memory) when you solve the issue Patrick's way. Daniel Chalef wrote on Saturday, 31 October 2020, at 12:42 AM: > Yes, the resulting matrix would be sparse. Thanks for the suggestion. Will > explore ways of doing

Re: spark-submit parameters about two keytab files to yarn and kafka

2020-11-01 Thread kevin chen
g to SASL_PLAINTEXT, if your spark version is 1.6. *note:* my test env: spark 2.0.2 kafka 0.10 references 1. using-spark-streaming <https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.0/bk_spark-component-guide/content/using-spark-streaming.html> -- Best, Kevin Pis Gabor Somogyi 于2020

Re: Spark streaming with Kafka

2020-11-03 Thread Kevin Pis
t; > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best, Kevin Pis

Re: Using two WriteStreams in same spark structured streaming job

2020-11-08 Thread Kevin Pis
h function, then I may need to use custom Kafka stream > writer > right ?! > > And I might not be able to use default writestream.format(Kafka) method ?! > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best, Kevin Pis

Re: how to manage HBase connections in Executors of Spark Streaming ?

2020-11-25 Thread chen kevin
1. On the issue of Kerberos expiring: * Usually you don't need to worry about it; you can use the local keytab at every node in the Hadoop cluster. * If the keytab is not available in your Hadoop cluster, you will need to update your keytab in every executor periodically. 2. bes

Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-29 Thread chen kevin
Hi, you can use Debezium to capture the row-level changes in PostgreSQL in real time, then stream them to Kafka, and finally ETL and write the data to HBase with Flink/Spark Streaming. So you can join the data in HBase directly. In consideration of the particularly big table, the scan performance in

Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-29 Thread chen kevin
big table that will be joined. * I think frequent I/O actions like select may cause memory or I/O issues. 2. You can use PostgreSQL connection pools to avoid making connections frequently. -- Best, Kevin Chen From: Geervan Hayatnagarkar Date: Sunday, November 29, 2020 at 6:20 PM

Fwd: Fail to run benchmark in Github Action

2021-06-26 Thread Kevin Su
-- Forwarded message - From: Kevin Su Date: Friday, 25 June 2021, 8:23 PM Subject: Fail to run benchmark in Github Action To: Hi all, I tried to run a benchmark test in GitHub Actions in my fork, and I hit the error below. https://github.com/pingsutw/spark/runs/2867617238

How to run spark benchmark on standalone cluster?

2021-07-02 Thread Kevin Su
Hi all, I want to run the Spark benchmarks on a standalone cluster, and I have changed the DataSourceReadBenchmark.scala settings (removed "spark.master"). --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala +++ b/sql/core/src/test/scala/org/apache/spar

determine week of month from date in spark3

2022-02-11 Thread Appel, Kevin
there any caveats or items to be aware of that might get us later? For example in a future Spark 3.3.X is this option going to be deprecated This was an item that we ran into from Spark2 to Spark3 conversion and trying to see how to best handle this Thanks for your feedback, Kevin -

RE: determine week of month from date in spark3

2022-02-11 Thread Appel, Kevin
-03-30| 2| | 4|2014-03-31| 3| | 5|2015-03-07| 7| | 6|2015-03-08| 1| | 7|2015-03-30| 2| | 8|2015-03-31| 3| +---+--++ From: Appel, Kevin Sent: Friday, February 11, 2022 2:35 PM To: user@spark.apache.org; 'Sean

Unsubscribe

2023-07-27 Thread Kevin Wang
Unsubscribe please!

RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, which is documented at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... this was introduced in Spark 3.4.0 to handle temp table queries and CTE queries against MSSQL Server, since what you send in is not actually what get
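A hedged sketch of the prepareQuery usage described in the linked docs (Spark 3.4+; the connection string, table, and CTE are hypothetical):

val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://host:1433;databaseName=demo")   // hypothetical connection
  .option("prepareQuery", "WITH filtered AS (SELECT * FROM dbo.sales WHERE region = 'US')")
  .option("query", "SELECT * FROM filtered")
  .option("user", "sa")
  .option("password", "secret")
  .load()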

IPv6 support

2015-05-20 Thread Kevin Liu
inconclusive. Can someone help clarify the current status for IPv6? Thanks Kevin -- errors -- 15/05/20 10:17:30 INFO Executor: Fetching http://2401:db00:2030:709b:face:0:9:0:51453/jars/spark-examples-1.3.1-hadoop2.6.0.jar with timestamp 1432142250197 15/05/20 10:17:30 INFO Executor: Fetching http

Re: how to implement ALS with csv file? getting error while calling Rating class

2016-03-07 Thread Kevin Mellott
If you are using DataFrames, then you can also specify the schema when loading, as an alternate solution. I've found Spark-CSV to be a very useful library when working with CSV data. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.

Re: RDD recomputation

2016-03-10 Thread Kevin Mellott
I've had very good success troubleshooting this type of thing by using the Spark Web UI, which will depict a breakdown of all tasks. This also includes the RDDs being used, as well as any cached data. Additional information about this tool can be found at http://spark.apache.org/docs/latest/monitor

Re: How to convert Parquet file to a text file.

2016-03-15 Thread Kevin Mellott
I'd recommend reading the parquet file into a DataFrame object, and then using spark-csv to write to a CSV file. On Tue, Mar 15, 2016 at 3:34 PM, Shishir Anshuman wrote: > I need to convert the parquet file generated by the spark to a text (csv > prefera
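A minimal sketch of that suggestion, assuming a Spark 1.x sqlContext and the spark-csv package on the classpath (paths are hypothetical):

val df = sqlContext.read.parquet("/data/input.parquet")
df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/data/output_csv")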

java.lang.OutOfMemoryError: Direct buffer memory when using broadcast join

2016-03-21 Thread Dai, Kevin
2.run(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) Can anyone tell me what's wrong and how to fix it? Best Regards, Kevin.

Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Kevin Peng
Ted, What triggerAndWait does is perform a REST call to a specified URL and then wait until the status message returned by that URL in a JSON field says complete. The issue is that I put a println at the very top of the method and that doesn't get printed out, and I know that println isn

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Kevin Eid
ions about how to move those files from local to the cluster? Thanks in advance, Kevin On 12 April 2016 at 12:19, Sun, Rui wrote: > Which py file is your main file (primary py file)? Zip the other two py > files. Leave the main py file alone. Don't copy them to S3 because it seems >

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-16 Thread Kevin Eid
One last email to announce that I've fixed all of the issues. Don't hesitate to contact me if you encounter the same. I'd be happy to help. Regards, Kevin On 14 Apr 2016 12:39 p.m., "Kevin Eid" wrote: > Hi all, > > I managed to copy my .py files from loca

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, Apologies. I edited my post with this information: Spark version: 1.6 Result from spark shell OS: Linux version 2.6.32-431.20.3.el6.x86_64 ( mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014 Thanks, KP On Mon,

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, I wish that was the case, but I have done a select count on each of the two tables individually and they return different numbers of rows: dps.registerTempTable("dps_pin_promo_lt") swig.registerTempTable("swig_pin_promo_lt") dps.count() RESULT: 42632 swig.count() RESULT: 42034 On

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Yong, Sorry, let me explain my deduction; it is going to be difficult to get sample data out since the dataset I am using is proprietary. From the above set of queries (the ones mentioned in the comments above), both inner and outer join are producing the same counts. They are basically pulling out selected co

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
at 11:16 PM, Davies Liu wrote: > as @Gourav said, all the join with different join type show the same > results, > which meant that all the rows from left could match at least one row from > right, > all the rows from right could match at least one row from left, even > the numb

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
join. > > In Spark 2.0, we turn these join into inner join actually. > > On Tue, May 3, 2016 at 9:50 AM, Cesar Flores wrote: > > Hi > > > > Have you tried the joins without the where clause? When you use them you > are > > filtering all the rows with

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
all the rows with null columns in those fields. In other > words you > >> are doing a inner join in all your queries. > >> > >> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> > >> wrote: > >>> > >&

Re: Alternative to groupByKey() + mapValues() for non-commutative, non-associative aggregate?

2016-05-03 Thread Kevin Mellott
If you put this into a dataframe then you may be able to use one hot encoding and treat these as categorical features. I believe that the ml pipeline components use project tungsten so the performance will be very fast. After you process the result on the dataframe you would then need to assemble y
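A brief sketch of the one-hot encoding idea with the spark.ml pipeline (the column names and the input DataFrame df are hypothetical):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
val pipeline = new Pipeline().setStages(Array(indexer, encoder))
val encoded = pipeline.fit(df).transform(df)   // categoryVec is a sparse vector column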

Compute the global rank of the column

2016-05-31 Thread Dai, Kevin
Hi, all, I want to compute the rank of some column in a table. Currently I use a window function to do it; however, all the data ends up in one partition. Is there a better solution? Regards, Kevin.

Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
t elasticsearch-hadoop-2.3.2.jar and try again. Lots of trial and error here :-/ Kevin -- We’re hiring if you know of any awesome Java Devops or Linux Operations Engineers! Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
Spark versions). > > > > On Thu, 2 Jun 2016 at 15:34 Kevin Burton wrote: > >> I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say it's >> not super easy. >> >> I wish there was an easier way to get this stuff to work.. Last tim

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
Yeah.. thanks Nick. Figured that out since your last email... I deleted the 2.10 by accident but then put 2+2 together. Got it working now. Still sticking to my story that it's somewhat complicated to setup :) Kevin On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath wrote: > Which Scala

Re: rdd join very slow when rdd created from data frame

2016-01-12 Thread Kevin Mellott
Can you please provide the high-level schema of the entities that you are attempting to join? I think that you may be able to use a more efficient technique to join these together; perhaps by registering the Dataframes as temp tables and constructing a Spark SQL query. Also, which version of Spark

Re: yarn-client: SparkSubmitDriverBootstrapper not found in yarn client mode (1.6.0)

2016-01-13 Thread Kevin Mellott
Lin - if you add "--verbose" to your original *spark-submit* command, it will let you know the location in which Spark is running. As Marcelo pointed out, this will likely indicate version 1.3, which may help you confirm if this is your problem. On Wed, Jan 13, 2016 at 12:06 PM, Marcelo Vanzin wr

Re: Hive is unable to avro file written by spark avro

2016-01-13 Thread Kevin Mellott
element). - Kevin On Wed, Jan 13, 2016 at 7:20 PM, Siva wrote: > Hi Everyone, > > Avro data written by dataframe in hdfs in not able to read by hive. Saving > data avro format with below statement. > > df.save("com.databricks.spark.avro", SaveMode.Append, Map("path

Re: sqlContext.cacheTable("tableName") vs dataFrame.cache()

2016-01-15 Thread Kevin Mellott
.select("col1", "col2").show() Here, the usage of *cacheTable* will affect ONLY the *sqlContext.sql* query. > sqlContext.cacheTable("myData") sqlContext.sql("SELECT col1, col2 FROM myData").show() Thanks, Kevin On Fri, Jan 15, 2016 at 7:00 AM, George Sigletos w

Re: Multi tenancy, REST and MLlib

2016-01-15 Thread Kevin Mellott
It sounds like you may be interested in a solution that implements the Lambda Architecture, such as Oryx2. At a high level, this gives you the ability to request and receive information immediately (serving layer), generating the

Re: trouble implementing complex transformer in java that can be used with Pipeline. Scala to Java porting problem

2016-01-20 Thread Kevin Mellott
g like: df.sqlContext.udf.register(...) Thanks, Kevin On Wed, Jan 20, 2016 at 6:15 PM, Andy Davidson < a...@santacruzintegration.com> wrote: > For clarity callUDF() is not defined on DataFrames. It is defined on > org.apache.spark.sql.functions > . Strange the class name starts with lower ca

Re: Passing binding variable in query used in Data Source API

2016-01-21 Thread Kevin Mellott
Another alternative that you can consider is to use Sqoop to move your data from PostgreSQL to HDFS, and then just load it into your DataFrame without needing to use JDBC drivers. I've had success using this approach, and depending on your setup you can easily manage/sche

Re: [Spark] Reading avro file in Spark 1.3.0

2016-01-25 Thread Kevin Mellott
I think that you may be looking at documentation pertaining to the more recent versions of Spark. Try looking at the examples linked below, which applies to the Spark 1.3 version. There aren't many Java examples, but the code should be very similar to the Scala ones (i.e. using "load" instead of "r

Re: Spark SQL . How to enlarge output rows ?

2016-01-27 Thread Kevin Mellott
I believe that *show* should work if you provide it with both the number of rows and the truncate flag. ex: df.show(10, false) http://spark.apache.org/docs/1.5.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.show On Wed, Jan 27, 2016 at 2:39 AM, Akhil Das wrote: > Why would you want to pr

Re: Spark Distribution of Small Dataset

2016-01-28 Thread Kevin Mellott
. I'd recommend taking a look at the video below, which will explain this concept in much greater detail. It also goes through an example and shows you how to use the logging tools to understand what is happening within your program. https://www.youtube.com/watch?v=dmL0N3qfSc8 Thanks, Kevin O

Re: unsubscribe email

2016-02-01 Thread Kevin Mellott
Take a look at the first section on http://spark.apache.org/community.html. You basically just need to send an email from the aliased email to user-unsubscr...@spark.apache.org. If you cannot log into that email directly, then I'd recommend using a mail client that allows for the "send-as" function

Re: how to covert millisecond time to SQL timeStamp

2016-02-01 Thread Kevin Mellott
I've had pretty good success using Joda-Time for date/time manipulations within Spark applications. You may be able to use the *DateTIme* constructor below, if you are starting with milliseconds. DateTime public DateTime(long instant) Constructs an inst
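A tiny sketch of that constructor in use, converting epoch milliseconds into something a Spark SQL TimestampType column will accept (the millisecond value is made up):

import java.sql.Timestamp
import org.joda.time.DateTime

val millis = 1454284800000L
val dt = new DateTime(millis)              // Joda-Time constructor mentioned above
val sqlTs = new Timestamp(dt.getMillis)    // usable as a Spark SQL timestamp value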

Best way to bring up Spark with Cassandra (and Elasticsearch) in production.

2016-02-14 Thread Kevin Burton
with Cassandra but instead of reading from a file it reads/writes to C*. Then once testing is working I'm going to setup spark in cluster mode with the same dependencies. Does this sound like a reasonable strategy? Kevin -- We’re hiring if you know of any awesome Java Devops or Linux

Re: Using functional programming rather than SQL

2016-02-22 Thread Kevin Mellott
In your example, the *rs* instance should be a DataFrame object. In other words, the result of *HiveContext.sql* is a DataFrame that you can manipulate using *filter, map, *etc. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext On Mon, Feb 22, 2016 at
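A short sketch of that point, assuming a Spark 1.x HiveContext and a hypothetical table and columns: the result of sql(...) is just a DataFrame, so the rest of the query can be expressed functionally.

val rs = hiveContext.sql("SELECT col1, col2 FROM some_table")
val result = rs.filter(rs("col2") > 10)    // functional style instead of a WHERE clause
  .select("col1")
result.show()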

Re: How to get progress information of an RDD operation

2016-02-23 Thread Kevin Mellott
Have you considered using the Spark Web UI to view progress on your job? It does a very good job showing the progress of the overall job, as well as allows you to drill into the individual tasks and server activity. On Tue, Feb 23, 2016 at 12:53 PM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexi

Re: Network Spark Streaming from multiple remote hosts

2016-02-23 Thread Kevin Mellott
aming application is then receiving messages from the queue and performing whatever processing you'd like. http://kafka.apache.org/documentation.html#introduction Thanks, Kevin On Tue, Feb 23, 2016 at 3:13 PM, Vinti Maheshwari wrote: > Hi All > > I wrote program for Spark Streaming

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Kevin Mellott
Once you have loaded information into a DataFrame, you can use the *mapPartitions* or *forEachPartition* operations to both identify the partitions and operate against them. http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame On Thu, Feb 25, 2016 at 9:24 AM, De

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Kevin Mellott
, Deenar Toraskar wrote: > Kevin > > I meant the partitions on disk/hdfs not the inmemory RDD/Dataframe > partitions. If I am right mapPartitions or forEachPartitions would identify > and operate on the in memory partitions. > > Deenar > > On 25 February 2016 at 15

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
You can use the constructor that accepts a BoostingStrategy object, which will allow you to set the tree strategy (and other hyperparameters as well). *GradientBoostedTrees

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
I believe that you can instantiate an instance of the AbsoluteError class for the *Loss* object, since that object implements the Loss interface. For example. val loss = new AbsoluteError() boostingStrategy.setLoss(loss) On Mon, Feb 29, 2016 at 9:33 AM, diplomatic Guru wrote: > Hi Ke
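A hedged Scala sketch of wiring this loss into the boosting strategy; in more recent MLlib versions AbsoluteError is exposed as a Scala object, so it is passed directly rather than constructed (the two-point training RDD is made up, and an existing SparkContext sc is assumed):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.loss.AbsoluteError

val trainingData = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.5)),
  LabeledPoint(2.0, Vectors.dense(1.5))))

val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.setNumIterations(3)
boostingStrategy.setLoss(AbsoluteError)    // absolute error instead of the default squared error
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)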

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
I'm using Spark 1.3.0, maybe it is not ready for this version? > > > > On 29 February 2016 at 15:38, Kevin Mellott > wrote: > >> I believe that you can instantiate an instance of the AbsoluteError class >> for the *Loss* object, since that object implements the Lo

Re: [MLlib] How to set Loss to Gradient Boosted Tree in Java

2016-02-29 Thread Kevin Mellott
; } > > override private[mllib] def computeError(prediction: Double, label: > Double): Double = { > val err = label - prediction > math.abs(err) > } > } > > > On 29 February 2016 at 15:49, Kevin Mellott > wrote: > >> Looks like it should be present

Flattening Data within DataFrames

2016-02-29 Thread Kevin Mellott
because I am not able to JOIN the *categories* table more than once. Has anybody dealt with this type of use case before, and if so how did you achieve the desired behavior? Thank you in advance for your thoughts. Kevin

Re: Flattening Data within DataFrames

2016-02-29 Thread Kevin Mellott
Thanks Michal - this is exactly what I need. On Mon, Feb 29, 2016 at 11:40 AM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > Hi Kevin, > > This should help: > > https://databricks.com/blog/2016/02/09/reshaping-data-with-pivot-in-spark.html > > On 29 Fe

GenericRowWithSchema is too heavy

2015-07-27 Thread Kevin Jung
w.toSeq) <= underlying sequence of a row 4) saveAsObjectFile or use org.apache.spark.util.SizeEstimator.estimate And my results are (dataframe with 5 columns): GenericRowWithSchema => 13gb GenericRow => 8.2gb Seq => 7gb Best regards Kevin -- View this message in context: http:/

Re: What is the optimal approach to do Secondary Sort in Spark?

2015-08-11 Thread Kevin Jung
You should create the key as a tuple type. In your case, RDD[((id, timeStamp), value)] is the proper way to do it. Kevin --- Original Message --- Sender : swetha Date : 2015-08-12 09:37 (GMT+09:00) Title : What is the optimal approach to do Secondary Sort in Spark? Hi, What is the optimal
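One way to exploit that key layout is a compact sketch like the following: partition on the id alone so that values arrive sorted by timestamp within each id's partition (all data here is made up, and an existing SparkContext sc is assumed):

import org.apache.spark.Partitioner

class IdPartitioner(val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case (id: String, _) => math.abs(id.hashCode) % numPartitions
    case _ => 0
  }
}

val events = sc.parallelize(Seq((("a", 2L), "v2"), (("a", 1L), "v1"), (("b", 9L), "v9")))
// Secondary sort: partition by id, sort by (id, timeStamp) inside each partition.
val sorted = events.repartitionAndSortWithinPartitions(new IdPartitioner(4))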

Can't find directory after resetting REPL state

2015-08-15 Thread Kevin Jung
'/tmp/spark-f47f3917-ac31-4138-bf1a-a8cefd094ac3' but it is not a directory ~~~impossible to command anymore~~~ I figured out that the reset() method in SparkIMain tries to delete virtualDirectory and then create it again, but virtualDirectory.create() makes a file, not a directory. Does anyone face the same prob

Re: Can't find directory after resetting REPL state

2015-08-16 Thread Kevin Jung
Thanks Ted, it may be a bug. This is a jira ticket. https://issues.apache.org/jira/browse/SPARK-10039 Kevin --- Original Message --- Sender : Ted Yu Date : 2015-08-16 11:29 (GMT+09:00) Title : Re: Can't find directory after resetting REPL state I tried with master branch and go

SaveAsTable changes the order of rows

2015-08-18 Thread Kevin Jung
my case, the order is not important but some users may want to keep their keys ordered. Kevin

Drop table and Hive warehouse

2015-08-24 Thread Kevin Jung
sTable with the same name before dropping the table. Is this a normal situation? If it is, I will delete the files manually ;) Kevin

Re: Drop table and Hive warehouse

2015-08-24 Thread Kevin Jung
files in HDFS2. So I cannot reproduce a table with the same location and the same name. If I update the DBS table in the metastore DB to set the default database URI to HDFS1, it works perfectly. Kevin --- Original Message --- Sender : Michael Armbrust Date : 2015-08-25 00:43 (GMT+09:00) Title : Re: Drop

Re: Unable to build Spark 1.5, is build broken or can anyone successfully build?

2015-08-30 Thread Kevin Jung
I expect it is because the versions are not in the range defined in pom.xml. You should upgrade your Maven version to 3.3.3 and your JDK to 1.7. The Spark team already knows about this issue, so you can find more information on the developers' community board. Kevin -- View this message in context: http://apache

What happens to this RDD? OutOfMemoryError

2015-09-04 Thread Kevin Mandich
ull error is shown below. Please let me know if I'm missing something obvious. Thank you! Kevin Mandich Exception in thread "refresh progress" Exception in thread "SparkListenerBus" [2015-09-04 20:43:14,385] {bash_operator.py:58} INFO - Exception: java.lang.OutOfMemoryEr

Spark summit Asia

2015-09-07 Thread Kevin Jung
Is there any plan to hold Spark summit in Asia? I'm very much looking forward to it. Kevin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-summit-Asia-tp24598.html Sent from the Apache Spark User List mailing list archive at Nabbl

Write parquet file from Spark Streaming

2016-08-27 Thread Kevin Tran
Hi Everyone, Does anyone know how to write parquet file after parsing data in Spark Streaming? Thanks, Kevin.

Spark StringType could hold how many characters ?

2016-08-28 Thread Kevin Tran
could handle ? In the Spark code: org.apache.spark.sql.types.StringType /** * The default size of a value of the StringType is 4096 bytes. */ override def defaultSize: Int = 4096 Thanks, Kevin.

Best practises to storing data in Parquet files

2016-08-28 Thread Kevin Tran
Hi, Does anyone know the best practices for storing data in parquet files? Do parquet files have a size limit (1 TB)? Should we use SaveMode.APPEND for a long-running streaming app? How should we store the data in HDFS (directory structure, ...)? Thanks, Kevin.

Re: Best practises to storing data in Parquet files

2016-08-28 Thread Kevin Tran
reference architecture which HBase is a part of? Please share with me any best practices you might know or your favourite designs. Thanks, Kevin. On Mon, Aug 29, 2016 at 5:18 AM, Mich Talebzadeh wrote: > Hi, > > Can you explain about you particular stack. > > Example what i

Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi everyone, Please give me your opinions on the best ID generator for an ID field in parquet. UUID.randomUUID(); AtomicReference currentTime = new AtomicReference<>(System.currentTimeMillis()); AtomicLong counter = new AtomicLong(0); Thanks, Kevin. https://issues.apac

Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi Mich, Thank you for your input. Does monotonically incrementing guard against race conditions, and does it duplicate the IDs at some point with multiple threads or multiple instances, ...? Can even System.currentTimeMillis() still produce duplicates? Cheers, Kevin. On Mon, Sep 5, 2016 at 12:30 AM, Mich
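A hedged sketch of the two options as they look in Spark 2.0 Scala (df is a hypothetical DataFrame): monotonically_increasing_id is unique within one DataFrame because the partition id sits in the upper bits, but the values are neither consecutive nor stable across jobs, while a UUID UDF trades ordering for global uniqueness.

import org.apache.spark.sql.functions.{monotonically_increasing_id, udf}

// Partition-aware 64-bit ids, unique within this DataFrame
val withId = df.withColumn("id", monotonically_increasing_id())

// Globally unique string ids via a zero-argument UDF
val uuidUdf = udf(() => java.util.UUID.randomUUID().toString)
val withUuid = df.withColumn("id", uuidUdf())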

call() function being called 3 times

2016-09-07 Thread Kevin Tran
SQLContext(rdd.context()); > > >> JavaRDD rowRDD = rdd.map(new Function() { > > public JavaBean call(String record) { >> *<== being called 3 times* > > What I tried: * *cache()* * cleaning up *checkpoint dir* Thanks, Kevin.

Re: call() function being called 3 times

2016-09-07 Thread Kevin Tran
h worker-0] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 12.0 (TID 12). 2518 bytes result sent to driver Does anyone have any ideas? On Wed, Sep 7, 2016 at 7:30 PM, Kevin Tran wrote: > Hi Everyone, > Does anyone know why call() function being called *3 tim

Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
I'm trying to figure out a way to group by and return the top 100 records in that group. Something like: SELECT TOP(100, user_id) FROM posts GROUP BY user_id; But I can't really figure out the best way to do this... There is a FIRST and LAST aggregate function but this only returns one column.

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
> Karl > > On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton wrote: > >> I'm trying to figure out a way to group by and return the top 100 records >> in that group. >> >> Something like: >> >> SELECT TOP(100, user_id) FROM posts GROUP BY user_id; >

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
, Sep 10, 2016 at 7:42 PM, Kevin Burton wrote: > Ah.. might actually. I'll have to mess around with that. > > On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley wrote: > >> Would `topByKey` help? >> >> https://github.com/apache/spark/blob/master/mllib/src/main/
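One common way to express "top N per group" in Spark SQL is a window function; a hedged sketch (the posts DataFrame and the score column used for ranking are hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val byUser = Window.partitionBy("user_id").orderBy(col("score").desc)
val top100PerUser = posts
  .withColumn("rn", row_number().over(byUser))   // rank posts within each user_id
  .where(col("rn") <= 100)
  .drop("rn")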

"Too many elements to create a power set" on Elasticsearch

2016-09-11 Thread Kevin Burton
1.6.1 and 1.6.2 don't work on our Elasticsearch setup because we use daily indexes. We get the error: "Too many elements to create a power set" It works on SINGLE indexes.. but if I specify content_* then I get this error. I don't see this documented anywhere. Is this a known issue? Is there

Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
I'm rather confused here as to what to do about creating a new SparkContext. Spark 2.0 prevents it... (exception included below) yet a TON of examples I've seen basically tell you to create a new SparkContext as standard practice: http://spark.apache.org/docs/latest/configuration.html#dynamicall

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 13 September 2016 at 18:57, Sean Owen wrote: > >> But you're in

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
on the command line if using > the shell. > > > On Tue, Sep 13, 2016, 19:22 Kevin Burton wrote: > >> The problem is that without a new spark context, with a custom conf, >> elasticsearch-hadoop is refusing to read in settings about the ES setup... >> >> if I
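A hedged sketch of the Spark 2.0 pattern being pointed at here: feed the custom settings (for example the elasticsearch-hadoop ones) into the builder of the existing session, or pass them with --conf when launching spark-shell, rather than constructing a second SparkContext (the app name and setting values below are illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("es-queries")                            // hypothetical app name
  .config("es.nodes", "localhost:9200")             // hypothetical ES endpoint
  .config("es.index.read.missing.as.empty", "true")
  .getOrCreate()
val sc = spark.sparkContext   // reuse this instead of new SparkContext(conf)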
