Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-07 Thread Koert Kuipers
it seems to me with SPARK-20202 we are no longer planning to support hadoop2 + hive 1.2. is that correct? so basically spark 3.1 will no longer run on say CDH 5.x or HDP2.x with hive? my use case is building spark 3.1 and launching on these existing clusters that are not managed by me. e.g. i do

understanding spark shuffle file re-use better

2021-01-13 Thread Koert Kuipers
is shuffle file re-use based on identity or equality of the dataframe? for example if i run the exact same code twice to load data and do transforms (joins, aggregations, etc.) but without re-using any actual dataframes, will i still see skipped stages thanks to shuffle file re-use? thanks! koert

Re: Current state of dataset api

2021-10-05 Thread Koert Kuipers
the encoder api remains a pain point due to its lack of composability. serialization overhead is also still there i believe. i dont remember what has happened to the predicate pushdown issues, i think they are mostly resolved? we tend to use dataset api on our methods/interfaces where its fitting b

Re: When should we cache / persist ? After or Before Actions?

2022-04-27 Thread Koert Kuipers
we have quite a few persist statements in our codebase whenever we are reusing a dataframe. we noticed that it slows things down quite a bit (sometimes doubles the runtime), while providing little benefit, since spark already re-uses the shuffle files underlying the dataframe efficiently even if

spark re-use shuffle files not happening

2022-07-16 Thread Koert Kuipers
i have seen many jobs where spark re-uses shuffle files (and skips a stage of a job), which is an awesome feature given how expensive shuffles are, and i generally now assume this will happen. however i feel like i am going a little crazy today. i did the simplest test in spark 3.3.0, basically i

Re: [EXTERNAL] spark re-use shuffle files not happening

2022-07-16 Thread Koert Kuipers
ion), not cross jobs. > -- > *From:* Koert Kuipers > *Sent:* Saturday, July 16, 2022 6:43 PM > *To:* user > *Subject:* [EXTERNAL] spark re-use shuffle files not happening

Re: Elasticsearch support for Spark 3.x

2023-09-01 Thread Koert Kuipers
could the provided scope be the issue? On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev wrote: > Using the following dependency for Spark 3 in POM file (My Scala version > is 2.12.14) > > > > > > > *org.elasticsearch > elasticsearch-spark-30_2.12 > 7.12.0provided* > > > The code throws error

Re: Does Spark support role-based authentication and access to Amazon S3? (Kubernetes cluster deployment)

2023-12-13 Thread Koert Kuipers
yes it does using IAM roles for service accounts. see: https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html i wrote a little bit about this also here: https://technotes.tresata.com/spark-on-k8s/ On Wed, Dec 13, 2023 at 7:52 AM Atul Patil wrote: > Hello Team, > >
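
For reference, a rough sketch (untested; the service account name and bucket path are placeholders) of the kind of configuration that lets the S3A connector pick up the web-identity token that IAM roles for service accounts (IRSA) mounts into the pods:

import org.apache.spark.sql.SparkSession

// run the driver/executor pods under a service account annotated with the IAM role
val spark = SparkSession.builder()
  .appName("irsa-s3-example")
  .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark-irsa-sa")
  // let hadoop-aws resolve credentials from the mounted web identity token
  .config("spark.hadoop.fs.s3a.aws.credentials.provider",
    "com.amazonaws.auth.WebIdentityTokenCredentialsProvider")
  .getOrCreate()

val df = spark.read.parquet("s3a://my-bucket/some/path")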

Re: [spark.local.dir] comma separated list does not work

2024-01-12 Thread Koert Kuipers
try it without spaces? export SPARK_LOCAL_DIRS="/tmp,/share/" On Fri, Jan 12, 2024 at 5:00 PM Andrew Petersen wrote: > Hello Spark community > > SPARK_LOCAL_DIRS or > spark.local.dir > is supposed to accept a list. > > I want to list one local (fast) drive, followed by a gpfs network drive,
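
The same setting can also be passed programmatically; a minimal sketch (the paths are placeholders), with the list given as a comma-separated string and no spaces around the commas:

import org.apache.spark.{SparkConf, SparkContext}

// spark.local.dir takes a comma-separated list of directories,
// equivalent to SPARK_LOCAL_DIRS="/tmp,/share/"
val conf = new SparkConf()
  .setAppName("local-dirs-example")
  .set("spark.local.dir", "/tmp,/share/")
val sc = new SparkContext(conf)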

Re: Recommended Scala version

2015-05-26 Thread Koert Kuipers
we are still running into issues with spark-shell not working on 2.11, but we are running on somewhat older master so maybe that has been resolved already. On Tue, May 26, 2015 at 11:48 AM, Dean Wampler wrote: > Most of the 2.11 issues are being resolved in Spark 1.4. For a while, the > Spark pr

Re: Spark SQL and Skewed Joins

2015-06-16 Thread Koert Kuipers
a skew join (where the dominant key is spread across multiple executors) is pretty standard in other frameworks, see for example in scalding: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/JoinAlgorithms.scala this would be a great addition to sp

Re: Spark SQL and Skewed Joins

2015-06-17 Thread Koert Kuipers
could it be composed maybe? a general version and then a sql version that exploits the additional info/abilities available there and uses the general version internally... i assume the sql version can benefit from the logical phase optimization to pick join details. or is there more? On Tue, Jun

org.apache.spark.sql.ScalaReflectionLock

2015-06-23 Thread Koert Kuipers
just a heads up, i was doing some basic coding using DataFrame, Row, StructType, etc. and i ended up with deadlocks in my sbt tests due to the usage of ScalaReflectionLock.synchronized in the spark sql code. the issue went away when i changed my tests to run consecutively...

sql dataframe internal representation

2015-06-25 Thread Koert Kuipers
i noticed in DataFrame that to get the rdd out of it some conversions are done: val converter = CatalystTypeConverters.createToScalaConverter(schema) rows.map(converter(_).asInstanceOf[Row]) does this mean DataFrame internally does not use the standard scala types? why not?

Re: Join highly skewed datasets

2015-06-26 Thread Koert Kuipers
we went through a similar process, switching from scalding (where everything just works on large datasets) to spark (where it does not). spark can be made to work on very large datasets, it just requires a little more effort. pay attention to your storage levels (should be memory-and-disk or disk-

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
you need 1) to publish to inhouse maven, so your application can depend on your version, and 2) use the spark distribution you compiled to launch your job (assuming you run with yarn so you can launch multiple versions of spark on same cluster) On Sun, Jun 28, 2015 at 4:33 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote

Re: What does "Spark is not just MapReduce" mean? Isn't every Spark job a form of MapReduce?

2015-06-28 Thread Koert Kuipers
spark is partitioner aware, so it can exploit a situation where 2 datasets are partitioned the same way (for example by doing a map-side join on them). map-red does not expose this. On Sun, Jun 28, 2015 at 12:13 PM, YaoPau wrote: > I've heard "Spark is not just MapReduce" mentioned during Spark
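
A minimal sketch of what being partitioner aware buys you: if both pair RDDs are partitioned with the same partitioner, the join builds each output partition from already co-located inputs instead of shuffling both sides again.

import org.apache.spark.HashPartitioner

// assumes an existing SparkContext `sc`
val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.parallelize(Seq(("a", "x"), ("b", "y")))

val p = new HashPartitioner(8)

// co-partition both sides once (and keep them around)...
val leftPart  = left.partitionBy(p).persist()
val rightPart = right.partitionBy(p).persist()

// ...then the join sees matching partitioners and does not re-shuffle either side
val joined = leftPart.join(rightPart)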

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
t; >> + '[' 1 == 1 ']' >> >> + cp >> '/Users/dvasthimal/ebay/projects/ep/spark-1.4.0/lib_managed/jars/datanucleus*.jar' >> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/lib/ >> >> cp: >> /Users/dvasthimal/ebay/p

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
cts/ep/spark-1.4.0/dist/lib/ >>> >>> + mkdir -p >>> /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/dist/examples/src/main >>> >>> + cp -r /Users/dvasthimal/ebay/projects/ep/spark-1.4.0/examples/src/main >>> /Users/dvasthimal/ebay/projects/ep/spark-1.4

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
of partitions (should > be large, multiple of num executors)," > > > https://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose > > When do i choose this setting ? (Attached is my code for reference) > > > > On Sun, Jun 28, 2015 at

Re: Join highly skewed datasets

2015-06-28 Thread Koert Kuipers
. > > And is my assumptions on replication levels correct. > > Did you get a chance to look at my processing. > > > > On Sun, Jun 28, 2015 at 3:31 PM, Koert Kuipers wrote: > >> regarding your calculation of executors... RAM in executor is not really >> comparable

Re: Fine control with sc.sequenceFile

2015-06-29 Thread Koert Kuipers
see also: https://github.com/apache/spark/pull/6848 On Mon, Jun 29, 2015 at 12:48 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.split.maxsize", > "67108864") > > sc.sequenceFile(getMostRecentDirectory(tablePath, _.startsWith("_")).get > + "/*", classO

Re: StorageLevel.MEMORY_AND_DISK_SER

2015-07-01 Thread Koert Kuipers
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER) On Wed, Jul 1, 2015 at 11:01 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > How do i persist an RDD using StorageLevel.MEMORY_AND_DISK_SER ? > > > -- > Deepak > >

duplicate names in sql allowed?

2015-07-02 Thread Koert Kuipers
i am surprised this is allowed... scala> sqlContext.sql("select name as boo, score as boo from candidates").schema res7: org.apache.spark.sql.types.StructType = StructType(StructField(boo,StringType,true), StructField(boo,IntegerType,true)) should StructType check for duplicate field names?

Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
, "Akhil Das" wrote: > I think you can open up a jira, not sure if this PR > <https://github.com/apache/spark/pull/2209/files> (SPARK-2890 > <https://issues.apache.org/jira/browse/SPARK-2890>) broke the validation > piece. > > Thanks > Best Regards > &

Re: duplicate names in sql allowed?

2015-07-03 Thread Koert Kuipers
https://issues.apache.org/jira/browse/SPARK-8817 On Fri, Jul 3, 2015 at 11:43 AM, Koert Kuipers wrote: > i see the relaxation to allow duplicate field names was done on purpose, > since some data sources can have dupes due to case insensitive resolution. > > apparently the issue

master compile broken for scala 2.11

2015-07-14 Thread Koert Kuipers
it works for scala 2.10, but for 2.11 i get: [ERROR] /home/koert/src/spark/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java:135: error: is not abstract and does not override abstract method minBy(Function1,Ordering) in TraversableOnce [ERROR] return new

create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
has anyone tried to make HiveContext only if the class is available? i tried this: implicit lazy val sqlc: SQLContext = try { Class.forName("org.apache.spark.sql.hive.HiveContext", true, Thread.currentThread.getContextClassLoader) .getConstructor(classOf[SparkContext]).newInstance(sc).asInst
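
A rough, untested sketch (Spark 1.x API) of the full fallback this message describes: load HiveContext reflectively and fall back to a plain SQLContext when the hive classes are missing.

import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

def makeSqlContext(sc: SparkContext): SQLContext =
  try {
    // only touch HiveContext through reflection so the hive jars stay optional
    Class.forName("org.apache.spark.sql.hive.HiveContext",
        true, Thread.currentThread.getContextClassLoader)
      .getConstructor(classOf[SparkContext])
      .newInstance(sc)
      .asInstanceOf[SQLContext]
  } catch {
    case _: ClassNotFoundException | _: NoClassDefFoundError =>
      new SQLContext(sc)
  }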

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023-L1037). > What is the version of Spark you are using? How did you add the spark-csv > jar? > > On Thu, Jul 16, 2015 at 1:21 PM, Koert Kuipers wrote: > >> has anyone tried to m

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
i am using scala 2.11 spark jars are not in my assembly jar (they are "provided"), since i launch with spark-submit On Thu, Jul 16, 2015 at 4:34 PM, Koert Kuipers wrote: > spark 1.4.0 > > spark-csv is a normal dependency of my project and in the assembly jar > that i us

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
ear? > > Thanks, > > Yin > > On Thu, Jul 16, 2015 at 2:12 PM, Koert Kuipers wrote: > >> i am using scala 2.11 >> >> spark jars are not in my assembly jar (they are "provided"), since i >> launch with spark-submit >> >> On Thu, Jul 16

Re: create HiveContext if available, otherwise SQLContext

2015-07-16 Thread Koert Kuipers
that solved it, thanks! On Thu, Jul 16, 2015 at 6:22 PM, Koert Kuipers wrote: > thanks i will try 1.4.1 > > On Thu, Jul 16, 2015 at 5:24 PM, Yin Huai wrote: > >> Hi Koert, >> >> For the classloader issue, you probably hit >> https://issues.apache.org/ji

Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-03-02 Thread Koert Kuipers
worried that at some point the legacy memory management will be deprecated and then i am stuck with this performance issue. On Mon, Feb 29, 2016 at 12:47 PM, Koert Kuipers wrote: > setting spark.shuffle.reduceLocality.enabled=false worked for me, thanks > > > is there any refe

Re: AVRO vs Parquet

2016-03-03 Thread Koert Kuipers
well can you use orc without bringing in the kitchen sink of dependencies also known as hive? On Thu, Mar 3, 2016 at 11:48 PM, Jong Wook Kim wrote: > How about ORC? I have experimented briefly with Parquet and ORC, and I > liked the fact that ORC has its schema within the file, which makes it >

Re: Does anyone implement org.apache.spark.serializer.Serializer in their own code?

2016-03-07 Thread Koert Kuipers
we are not, but it seems reasonable to me that a user has the ability to implement their own serializer. can you refactor and break compatibility, but not make it private? On Mon, Mar 7, 2016 at 9:57 PM, Josh Rosen wrote: > Does anyone implement Spark's serializer interface > (org.apache.spark.

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Koert Kuipers
if its based on HadoopFsRelation shouldn't it support it? HadoopFsRelation handles globs On Wed, Mar 9, 2016 at 8:56 AM, Ted Yu wrote: > This is currently not supported. > > On Mar 9, 2016, at 4:38 AM, Jakub Liska wrote: > > Hey, > > is something like this possible? > > sqlContext.read.json("/m

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Koert Kuipers
i use multi level wildcard with hadoop fs -ls, which is the exact same glob function call On Wed, Mar 9, 2016 at 9:24 AM, Ted Yu wrote: > Hadoop glob pattern doesn't support multi level wildcard. > > Thanks > > On Mar 9, 2016, at 6:15 AM, Koert Kuipers wrote:

Re: DataFrame support for hadoop glob patterns

2016-03-09 Thread Koert Kuipers
i tried with avro input, something like /data/txn_*/* and it works for me On Wed, Mar 9, 2016 at 12:12 PM, Ted Yu wrote: > Koert: > I meant org.apache.hadoop.mapred.FileInputFormat doesn't support multi > level wildcard. > > Cheers > > On Wed, Mar 9, 2016 at 8:2

Re: YARN process with Spark

2016-03-11 Thread Koert Kuipers
you get a spark executor per yarn container. the spark executor can have multiple cores, yes. this is configurable. so the number of partitions that can be processed in parallel is num-executors * executor-cores. and for processing a partition the available memory is executor-memory / executor-core
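
A small worked example of the arithmetic in the message (the numbers are made up, and it follows the message's rule of thumb while ignoring overhead and memory fractions):

// spark-submit --num-executors 10 --executor-cores 4 --executor-memory 8g
val numExecutors  = 10
val executorCores = 4
val executorMemGb = 8.0

val maxParallelTasks = numExecutors * executorCores      // 40 partitions in flight at once
val memPerTaskGb     = executorMemGb / executorCores     // roughly 2 GB per concurrent task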

spark shuffle service on yarn

2016-03-18 Thread Koert Kuipers
spark on yarn is nice because i can bring my own spark. i am worried that the shuffle service forces me to use some "sanctioned" spark version that is officially "installed" on the cluster. so... can i safely install the spark 1.3 shuffle service on yarn and use it with other 1.x versions of spark

Re: spark 1.6.0 connect to hive metastore

2016-03-23 Thread Koert Kuipers
y with Spark 1.6 but with beeline as well. > I resolved it via installation & running hiveserver2 role instance at the > same server wher metastore is. <http://metastore.mycompany.com:9083> > > On Tue, Feb 9, 2016 at 10:58 PM, Koert Kuipers wrote: > >> has anyone suc

nullable in spark-sql

2016-03-24 Thread Koert Kuipers
In spark 2, is nullable treated as reliable? or is it just a hint for efficient code generation, the optimizer etc. The reason i ask is i see a lot of code generated with if statements handling null for struct fields where nullable=false

Re: Databricks fails to read the csv file with blank line at the file header

2016-03-26 Thread Koert Kuipers
To me this is expected behavior that I would not want fixed, but if you look at the recent commits for spark-csv it has one that deals with this... On Mar 26, 2016 21:25, "Mich Talebzadeh" wrote: > > Hi, > > I have a standard csv file (saved as csv in HDFS) that has first line of > blank at the header

Re: RDD Partitions not distributed evenly to executors

2016-04-04 Thread Koert Kuipers
rectly propagated to all nodes? Are they identical? > Yes; these files are stored on a shared memory directory accessible to > all nodes. > > Koert Kuipers: > > we ran into similar issues and it seems related to the new memory > > management. can you try: > > spark.me

Re: Datasets combineByKey

2016-04-10 Thread Koert Kuipers
yes it is On Apr 10, 2016 3:17 PM, "Amit Sela" wrote: > I think *org.apache.spark.sql.expressions.Aggregator* is what I'm looking > for, makes sense ? > > On Sun, Apr 10, 2016 at 4:08 PM Amit Sela wrote: > >> I'm mapping RDD API to Datasets API and I was wondering if I was missing >> something o

Aggregator support in DataFrame

2016-04-11 Thread Koert Kuipers
i like the Aggregator a lot (org.apache.spark.sql.expressions.Aggregator), but i find the way to use it somewhat confusing. I am supposed to simply call aggregator.toColumn, but that doesn't allow me to specify which fields it operates on in a DataFrame. i would basically like to do something like
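
For context, a minimal sketch (using the Spark 2.x form of the API) of how an Aggregator is defined and applied via toColumn on a typed Dataset, which is exactly why pointing it at particular DataFrame columns is awkward:

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Sale(item: String, amount: Double)

// sums the `amount` field of Sale rows
object SumAmount extends Aggregator[Sale, Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Sale): Double = b + a.amount
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(b: Double): Double = b
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// usage, assuming a SparkSession `spark`:
// import spark.implicits._
// val ds = Seq(Sale("a", 1.0), Sale("a", 2.5)).toDS()
// ds.select(SumAmount.toColumn)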

Re: Aggregator support in DataFrame

2016-04-11 Thread Koert Kuipers
/github.com/apache/spark/commit/520dde48d0d52de1710a3275fdd5355dd69d > > I'm not sure that solves your problem though... > > On Mon, Apr 11, 2016 at 4:45 PM, Koert Kuipers wrote: > >> i like the Aggregator a lot >> (org.apache.spark.sql.expressions.Aggregator

Re: Aggregator support in DataFrame

2016-04-12 Thread Koert Kuipers
better because i have encoders so i can use kryo). On Mon, Apr 11, 2016 at 10:53 PM, Koert Kuipers wrote: > saw that, dont think it solves it. i basically want to add some children > to the expression i guess, to indicate what i am operating on? not sure if > even makes sense > >

Re: Aggregator support in DataFrame

2016-04-12 Thread Koert Kuipers
l Armbrust wrote: > Did you see these? > > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/scala/typed.scala#L70 > > On Tue, Apr 12, 2016 at 9:46 AM, Koert Kuipers wrote: > >> i dont really see how Aggregator can be

Re: Apache Flink

2016-04-17 Thread Koert Kuipers
i never found much info that flink was actually designed to be fault tolerant. if fault tolerance is more bolt-on/add-on/afterthought then that doesn't bode well for large scale data processing. spark was designed with fault tolerance in mind from the beginning. On Sun, Apr 17, 2016 at 9:52 AM, Mi

Re: VectorAssembler handling null values

2016-04-20 Thread Koert Kuipers
thanks for that, its good to know that functionality exists. but shouldn't a decision tree be able to handle missing (aka null) values more intelligently than simply using replacement values? see for example here: http://stats.stackexchange.com/questions/96025/how-do-decision-tree-learning-algori

Re: Is there a way to run a jar built for scala 2.11 on spark 1.6.1 (which is using 2.10?)

2016-05-18 Thread Koert Kuipers
no but you can trivially build spark 1.6.1 for scala 2.11 On Wed, May 18, 2016 at 6:11 PM, Sergey Zelvenskiy wrote: > >

feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
we currently have 2 explode definitions in Dataset: def explode[A <: Product : TypeTag](input: Column*)(f: Row => TraversableOnce[A]): DataFrame def explode[A, B : TypeTag](inputColumn: String, outputColumn: String)(f: A => TraversableOnce[B]): DataFrame 1) the separation of the functions into

Re: feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
as("Item")). It would be >> great to understand more why you are using these instead. >> >> On Wed, May 25, 2016 at 8:49 AM, Koert Kuipers wrote: >> >>> we currently have 2 explode definitions in Dataset: >>> >>> def explode[A <: Produc

unsure how to create 2 outputs from spark-sql udf expression

2016-05-25 Thread Koert Kuipers
hello all, i have a single udf that creates 2 outputs (so a tuple 2). i would like to add these 2 columns to my dataframe. my current solution is along these lines: df .withColumn("_temp_", udf(inputColumns)) .withColumn("x", col("_temp_")("_1")) .withColumn("y", col("_temp_")("_2")) .drop
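
A minimal sketch of that pattern, plus the select-based variant suggested in the follow-up replies below (the DataFrame `df` and the column names are made up):

import org.apache.spark.sql.functions.{col, udf}

// one udf call producing a tuple, i.e. a struct column with fields _1 and _2
val splitUdf = udf((s: String) => (s.take(1), s.drop(1)))

// assumes a DataFrame `df` with a string column "a"
val viaTemp = df
  .withColumn("_temp_", splitUdf(col("a")))
  .withColumn("x", col("_temp_")("_1"))
  .withColumn("y", col("_temp_")("_2"))
  .drop("_temp_")

// the variant from the replies: select the struct once and pull out its fields
val viaSelect = df
  .select(splitUdf(col("a")).as("r"))
  .select(col("r._1").as("x"), col("r._2").as("y"))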

Re: feedback on dataset api explode

2016-05-25 Thread Koert Kuipers
25/16 12:30 PM, Reynold Xin wrote: >> >> Based on this discussion I'm thinking we should deprecate the two explode >> functions. >> >> On Wednesday, May 25, 2016, Koert Kuipers < >> ko...@tresata.com> wrote: >> >>> wenchen, >>>

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
uot;b") > df.select(func($"a").as("r")).select($"r._1", $"r._2") > > // maropu > > > On Thu, May 26, 2016 at 5:11 AM, Koert Kuipers wrote: > >> hello all, >> >> i have a single udf that creates 2 outputs (so a

Re: Pros and Cons

2016-05-26 Thread Koert Kuipers
We do disk-to-disk iterative algorithms in spark all the time, on datasets that do not fit in memory, and it works well for us. I usually have to do some tuning of number of partitions for a new dataset but that's about it in terms of inconveniences. On May 26, 2016 2:07 AM, "Jörn Franke" wrote:

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
maropu > > On Fri, May 27, 2016 at 1:46 AM, Koert Kuipers wrote: > >> that is nice and compact, but it does not add the columns to an existing >> dataframe >> >> On Wed, May 25, 2016 at 11:39 PM, Takeshi Yamamuro > > wrote: >> >>> Hi, >&

Re: unsure how to create 2 outputs from spark-sql udf expression

2016-05-26 Thread Koert Kuipers
toDF("input", "c0", "c1", other > needed columns, "cX") > df.select(func($"a").as("r"), $"c0", $"c1", $"cX").select($"r._1", > $"r._2", $"c0", $"c1", .

Re: I'm pretty sure this is a Dataset bug

2016-05-27 Thread Koert Kuipers
i am glad to see this, i think we ran into this as well (in 2.0.0-SNAPSHOT) but i couldn't reproduce it nicely. my observation was that joins of 2 datasets that were derived from the same datasource gave this kind of trouble. i changed my datasource from val to def (so it got created twice) as a w

Re: Apache Spark security.NosuchAlgorithm exception on changing from java 7 to java 8

2016-06-06 Thread Koert Kuipers
mhh i would not be very happy if the implication is that i have to start maintaining separate spark builds for client clusters that use java 8... On Mon, Jun 6, 2016 at 4:34 PM, Ted Yu wrote: > Please see: > https://spark.apache.org/docs/latest/security.html > > w.r.t. Java 8, probably you need

setting column names on dataset

2016-06-07 Thread Koert Kuipers
for some operators on Dataset, like joinWith, one needs to use an expression which means referring to columns by name. how can i set the column names for a Dataset before doing a joinWith? currently i am aware of: df.toDF("k", "v").as[(K, V)] but that seems inefficient/anti-pattern? i shouldn't
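
A minimal sketch (Spark 2.x) of the workaround the message refers to, for concreteness:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("joinWith-names").getOrCreate()
import spark.implicits._

val left  = Seq(("a", 1), ("b", 2)).toDS()
val right = Seq(("a", 10), ("c", 30)).toDS()

// rename the columns, then re-apply the element type
val leftNamed  = left.toDF("k", "v").as[(String, Int)]
val rightNamed = right.toDF("k", "v").as[(String, Int)]

// joinWith needs a join expression, which in turn needs the column names
val joined = leftNamed.joinWith(rightNamed, leftNamed("k") === rightNamed("k"))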

Re: setting column names on dataset

2016-06-07 Thread Koert Kuipers
quot;).show(false) > +++ > |_1 |_2 | > +++ > |[foo,42]|[foo,42]| > |[bar,24]|[bar,24]| > +++ > > Pozdrawiam, > Jacek Laskowski > > https://medium.com/@jaceklaskowski/ > Mastering Apache Spark http://bit.ly/mastering-

Re: UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Koert Kuipers
Sets are not supported. you basically need to stick to products (tuples, case classes), Seq and Map (and in spark 2 also Option). Or you need to resort to the kryo-based encoder. On Wed, Jun 8, 2016 at 3:45 PM, Peter Halliday wrote: > I have some code that was producing OOM during shuffle a

Re: UnsupportedOperationException: converting from RDD to DataSets on 1.6.1

2016-06-08 Thread Koert Kuipers
You can try passing in an explicit encoder: org.apache.spark.sql.Encoders.kryo[Set[com.wix.accord.Violation]] Although this might only be available in spark 2, i don't remember top of my head... On Wed, Jun 8, 2016 at 11:57 PM, Koert Kuipers wrote: > Sets are not supported. you basica
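
A minimal sketch of supplying that encoder explicitly (Spark 2.x; the element type is just an example):

import org.apache.spark.sql.{Encoder, Encoders}

// a kryo-based encoder for a type the built-in encoders cannot handle
implicit val setEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]

// usage, assuming a SparkSession `spark` and an rdd: RDD[Set[String]]:
// val ds = spark.createDataset(rdd)   // picks up the implicit kryo encoder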

Re: Option Encoder

2016-06-23 Thread Koert Kuipers
an implicit encoder for Option[X] given an implicit encoder for X would be nice, i run into this often too. i do not think it exists. your best bet is to hope ExpressionEncoder will do... On Thu, Jun 23, 2016 at 2:16 PM, Richard Marscher wrote: > Is there a proper way to make or get an Encoder for

Re: Aggregator (Spark 2.0) skips aggregation is zero(0 returns null

2016-06-29 Thread Koert Kuipers
its the difference between a semigroup and a monoid, and yes max does not easily fit into a monoid. see also discussion here: https://issues.apache.org/jira/browse/SPARK-15598 On Mon, Jun 27, 2016 at 3:19 AM, Amit Sela wrote: > OK. I see that, but the current (provided) implementations are very

Re: Aggregator (Spark 2.0) skips aggregation is zero(0 returns null

2016-07-01 Thread Koert Kuipers
onality discussed in SPARK-15598 ? > without changing how the Aggregator works. > > I bypassed it by using Optional (Guava) because I'm using the Java API, > but it's a bit cumbersome... > > Thanks, > Amit > > On Thu, Jun 30, 2016 at 1:54 AM Koert Kuipers wrot

Re: Question regarding structured data and partitions

2016-07-06 Thread Koert Kuipers
spark does keep some information on the partitions of an RDD, namely the partitioning/partitioner. GroupSorted is an extension for key-value RDDs that also keeps track of the ordering, allowing for faster joins, non-reduce type operations on very large groups of values per key, etc. see here: http

Re: Question regarding structured data and partitions

2016-07-07 Thread Koert Kuipers
, tan shai wrote: > Using partitioning with dataframes, how can we retrieve informations about > partitions? partitions bounds for example > > Thanks, > Shaira > > 2016-07-07 6:30 GMT+02:00 Koert Kuipers : > >> spark does keep some information on the partitions of an R

Re: Extend Dataframe API

2016-07-07 Thread Koert Kuipers
i dont see any easy way to extend the plans, beyond creating a custom version of spark. On Thu, Jul 7, 2016 at 9:31 AM, tan shai wrote: > Hi, > > I need to add new operations to the dataframe API. > Can any one explain to me how to extend the plans of query execution? > > Many thanks. >

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Koert Kuipers
repartitionAndSortWithinPartitions sorts by keys, not values per key, so not really secondary sort by itself. for secondary sort also check out: https://github.com/tresata/spark-sorted On Thu, Jul 14, 2016 at 1:09 PM, Punit Naik wrote: > Hi guys > > In my spark/scala code I am implementing secondar
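
A minimal sketch of the classic secondary-sort pattern built on repartitionAndSortWithinPartitions (partition on the primary key only, sort on the composite key); for a packaged approach see the spark-sorted link above.

import org.apache.spark.Partitioner

// partition on the first element of the composite key only
class KeyOnlyPartitioner(partitions: Int) extends Partitioner {
  def numPartitions: Int = partitions
  def getPartition(key: Any): Int = key match {
    case (k, _) =>
      val h = k.hashCode % partitions
      if (h < 0) h + partitions else h
  }
}

// assumes an existing SparkContext `sc`
val data = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))

// composite key (k, v): the implicit tuple Ordering sorts by k first, then v,
// so within each partition every k sees its values in sorted order
val secondarySorted = data
  .map { case (k, v) => ((k, v), ()) }
  .repartitionAndSortWithinPartitions(new KeyOnlyPartitioner(4))
  .map { case ((k, v), _) => (k, v) }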

Re: repartitionAndSortWithinPartitions HELP

2016-07-14 Thread Koert Kuipers
t; I have already used "repartitionAndSortWithinPartitions" for secondary > sorting and it works fine. Just wanted to know whether it will sort the > entire RDD or not. > > On Thu, Jul 14, 2016 at 11:25 PM, Koert Kuipers wrote: > >> repartitionAndSortWithinPartit

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Koert Kuipers
used for > "repartitionAndSortWithinPartitions" as an argument to "sortByKey"? > > On 14-Jul-2016 11:38 PM, "Koert Kuipers" wrote: > >> repartitionAndSortWithinPartitions partitions the rdd and sorts within >> each partition. so each partition is

Re: repartitionAndSortWithinPartitions HELP

2016-07-15 Thread Koert Kuipers
Jul 14, 2016 at 11:52 PM, Punit Naik >> wrote: >> >>> Okay. Can't I supply the same partitioner I used for >>> "repartitionAndSortWithinPartitions" as an argument to "sortByKey"? >>> >>> On 14-Jul-2016 11:38 PM, "Koert Kuipers&qu

transtition SQLContext to SparkSession

2016-07-18 Thread Koert Kuipers
in my codebase i would like to gradually transition to SparkSession, so while i start using SparkSession i also want a SQLContext to be available as before (but with a deprecated warning when i use it). this should be easy since SQLContext is now a wrapper for SparkSession. so basically: val sessi
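
A minimal sketch of that transition, for concreteness (Spark 2.x):

import org.apache.spark.sql.{SparkSession, SQLContext}

val session: SparkSession = SparkSession.builder()
  .appName("transition-example")
  .getOrCreate()

// older code paths can keep using a SQLContext, which in Spark 2 just wraps the session;
// mark it deprecated in your own codebase to nudge callers toward SparkSession
val sqlContext: SQLContext = session.sqlContext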

Re: Execute function once on each node

2016-07-19 Thread Koert Kuipers
The whole point of a well designed global filesystem is to not move the data On Jul 19, 2016 10:07, "Koert Kuipers" wrote: > If you run hdfs on those ssds (with low replication factor) wouldn't it > also effectively write to local disk with low latency? > > On Jul 18

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
when parquet came out it was developed by a community of companies, and was designed as a library to be supported by multiple big data projects. nice. orc on the other hand initially only supported hive. it wasn't even designed as a library that can be re-used. even today it brings in the kitchen s

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
> [1]https://parquet.apache.org/documentation/latest/ > [2]https://orc.apache.org/docs/ > [3] > http://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet > > On 26 Jul 2016, at 15:19, Koert Kuipers wrote: > > when parquet came out it was developed by a co

Re: ORC v/s Parquet for Spark 2.0

2016-07-26 Thread Koert Kuipers
leveraging this new repo? > > > org.apache.orc > orc > 1.1.2 > pom > > > > > > > > > > Sent from my iPhone > On Jul 26, 2016, at 4:50 PM, Koert Kuipers wrote: > > parquet was inspired by dremel but written from the ground up

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Koert Kuipers
we share a single sparksession across tests, and they can run in parallel. it is pretty fast On Mon, Aug 1, 2016 at 12:02 PM, Everett Anderson wrote: > Hi, > > Right now, if any code uses DataFrame/Dataset, I need a test setup that > brings up a local master as in this article >

spark historyserver backwards compatible

2016-08-05 Thread Koert Kuipers
we have spark 1.5.x, 1.6.x and 2.0.0 jobs running on yarn but yarn can have only one spark history server. what to do? is it safe to use the spark 2 history server to report on spark 1 jobs?

Re: spark historyserver backwards compatible

2016-08-05 Thread Koert Kuipers
thanks On Fri, Aug 5, 2016 at 5:21 PM, Marcelo Vanzin wrote: > Yes, the 2.0 history server should be backwards compatible. > > On Fri, Aug 5, 2016 at 2:14 PM, Koert Kuipers wrote: > > we have spark 1.5.x, 1.6.x and 2.0.0 job running on yarn > > > > but yarn can

type inference csv dates

2016-08-12 Thread Koert Kuipers
i generally like the type inference feature of the spark-sql csv datasource, however i have been stung several times by date inference. the problem is that when a column is converted to a date type the original data is lost. this is not a lossless conversion. and i often have a requirement where i
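
One workaround consistent with that concern, sketched here (the path and column names are placeholders): supply an explicit schema so the date column stays a plain string and nothing is lost, then parse it only where needed.

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// assumes an existing SparkSession `spark` (Spark 2.x csv reader)
val schema = StructType(Seq(
  StructField("id", StringType, nullable = true),
  StructField("event_date", StringType, nullable = true)   // keep the raw text
))

val df = spark.read
  .option("header", "true")
  .schema(schema)          // an explicit schema means no lossy inference on this column
  .csv("/path/to/data.csv")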

Re: Losing executors due to memory problems

2016-08-12 Thread Koert Kuipers
you could have a very large key? perhaps a token value? i love the rdd api but have found that for joins dataframe/dataset performs better. maybe you can do the joins in that? On Thu, Aug 11, 2016 at 7:41 PM, Muttineni, Vinay wrote: > Hello, > > I have a spark job that basically reads data from

Re: Spark 2 and existing code with sqlContext

2016-08-12 Thread Koert Kuipers
you can get it from the SparkSession for backwards compatibility: val sqlContext = spark.sqlContext On Mon, Aug 8, 2016 at 9:11 AM, Mich Talebzadeh wrote: > Hi, > > In Spark 1.6.1 this worked > > scala> sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/ > HH:mm:ss.ss') ").collect

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Koert Kuipers
you cannot mix spark 1 and spark 2 jars change this libraryDependencies += "org.apache.spark" %% "spark-hive" % "1.5.1" to libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" On Sun, Aug 14, 2016 at 11:58 AM, Mich Talebzadeh wrote: > Hi, > > In Spark 2 I am using sbt or mvn to c

Re: Issue with compiling Scala with Spark 2

2016-08-14 Thread Koert Kuipers

Re: Change nullable property in Dataset schema

2016-08-15 Thread Koert Kuipers
why do you want the array to have nullable = false? what is the benefit? On Wed, Aug 3, 2016 at 10:45 AM, Kazuaki Ishizaki wrote: > Dear all, > Would it be possible to let me know how to change nullable property in > Dataset? > > When I looked for how to change nullable property in Dataframe sch

create SparkSession without loading defaults for unit tests

2016-08-16 Thread Koert Kuipers
for unit tests i would like to create a SparkSession that does not load anything from system properties, similar to: new SQLContext(new SparkContext(new SparkConf(loadDefaults = false))) how do i go about doing this? i dont see a way. thanks! koert
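
A rough, untested sketch of one possible approach: build the SparkContext yourself from a SparkConf with loadDefaults = false and let the SparkSession builder pick up that existing context (whether this fully avoids system properties is exactly the open question here).

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

val conf = new SparkConf(loadDefaults = false)
  .setMaster("local[2]")
  .setAppName("unit-test")

val sc = new SparkContext(conf)
val spark = SparkSession.builder().getOrCreate()   // reuses the active SparkContext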

Re: Preventing an RDD from shuffling

2015-12-16 Thread Koert Kuipers
a join needs a partitioner, and will shuffle the data as needed for the given partitioner (or if the data is already partitioned then it will leave it alone), after which it will process with something like a map-side join. if you can specify a partitioner that meets the exact layout of your data

Re: Large number of conf broadcasts

2015-12-17 Thread Koert Kuipers
l request from the master branch in github? > > Thanks, > Prasad. > > From: Anders Arpteg > Date: Thursday, October 22, 2015 at 10:37 AM > To: Koert Kuipers > Cc: user > Subject: Re: Large number of conf broadcasts > > Yes, seems unnecessary. I actually tried patc

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
rhel/centos 6 ships with python 2.6, doesnt it? if so, i still know plenty of large companies where python 2.6 is the only option. asking them for python 2.7 is not going to work so i think its a bad idea On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland wrote: > I don't see a reason Spark 2.0 w

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
n a couple of projects using Spark (banking industry) where >> CentOS + Python 2.6 is the toolbox available. >> >> That said, I believe it should not be a concern for Spark. Python 2.6 is >> old and busted, which is totally opposite to the Spark philosophy IMO. >> >&

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
s the Python versioning concerns for RHEL users? > > On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers wrote: > >> yeah, the practical concern is that we have no control over java or >> python version on large company clusters. our current reality for the vast >> majority o

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
ng your changes open source. The GPL-compatible > licenses make it possible to combine Python with other software that is > released under the GPL; the others don’t. > > Nick > ​ > > On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers wrote: > >> i do not think so. >> >> do

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
if python 2.7 only has to be present on the node that launches the app (does it?) then that could be important indeed. On Tue, Jan 5, 2016 at 6:02 PM, Koert Kuipers wrote: > interesting i didnt know that! > > On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas < > nicholas.cham...@g

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Koert Kuipers
t;>> I think all the slaves need the same (or a compatible) version of Python >>> installed since they run Python code in PySpark jobs natively. >>> >>> On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers wrote: >>> >>>> interesting i didnt know that

Re: Spark on Apache Ingnite?

2016-01-11 Thread Koert Kuipers
where is ignite's resilience/fault-tolerance design documented? i can not find it. i would generally stay away from it if fault-tolerance is an afterthought. On Mon, Jan 11, 2016 at 10:31 AM, RodrigoB wrote: > Although I haven't work explicitly with either, they do seem to differ in > design and
