We are hitting the same issue on Spark 1.6.1 with Tungsten enabled, Kryo
enabled & sort-based shuffle.
Did you find a resolution?
On Sat, Apr 9, 2016 at 6:31 AM, Ted Yu wrote:
> Not much.
>
> So no chance of different snappy version ?
>
> On Fri, Apr 8, 2016 at 1:26 PM, Nicolas Tilmans
> wrote
Hi all,
I'm doing some simple column transformations (e.g. trimming strings) on a
DataFrame using UDFs. The DataFrame is in Avro format and is being loaded off
HDFS. The job has about 16,000 parts/tasks.
About halfway through, the job fails with this message:
org.apache.spark.SparkException: Job ab
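For reference, the transformation itself is roughly of this shape (a minimal sketch only; df and the "product_name" column are placeholders, not the poster's actual names):

  import org.apache.spark.sql.functions.{col, udf}

  // Null-safe trimming UDF applied to a single string column.
  val trimUdf = udf((s: String) => if (s == null) null else s.trim)
  val cleaned = df.withColumn("product_name", trimUdf(col("product_name")))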
me an array of doubles with 3 fields: the prediction,
the class A probability and the class B probability. How could I turn those
into 3 columns from my expression? Clearly .withColumn only expects 1
column back.
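One approach that is often suggested is to return a struct (e.g. a case class) from the UDF and then select its fields out into separate columns. A rough sketch under that assumption; Scores, the scoring logic and the "feature" column are invented for illustration:

  import org.apache.spark.sql.functions.{col, udf}

  // Hypothetical result type; a case class return value becomes a struct column.
  case class Scores(prediction: Double, probA: Double, probB: Double)
  val scoreUdf = udf((feature: Double) =>
    Scores(if (feature > 0.5) 1.0 else 0.0, 1.0 - feature, feature))

  val scored   = df.withColumn("scores", scoreUdf(col("feature")))
  val expanded = scored.select(
    col("scores.prediction").as("prediction"),
    col("scores.probA").as("probA"),
    col("scores.probB").as("probB"))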
On Tue, Sep 8, 2015 at 6:21 PM, Night Wolf wrote:
> Sorry for the spam - I had some
t 5:47 PM, Night Wolf wrote:
> So basically I need something like
>
> df.withColumn("score", new Column(new Expression {
> ...
>
> def eval(input: Row = null): EvaluatedType = myModel.score(input)
> ...
>
> }))
>
> But I can't do this, so how can I
e value or some struct...
On Tue, Sep 8, 2015 at 5:33 PM, Night Wolf wrote:
> Not sure how that would work. Really I want to tack on an extra column
> onto the DF with a UDF that can take a Row object.
>
> On Tue, Sep 8, 2015 at 1:54 AM, Jörn Franke wrote:
>
>> Can you use a m
rs are Comma-separated...
>
> Le lun. 7 sept. 2015 à 8:35, Night Wolf a écrit :
>
>> Is it possible to have a UDF which takes a variable number of arguments?
>>
>> e.g. df.select(myUdf($"*")) fails with
>>
>> org.apache.spark.sql.AnalysisException: u
Is it possible to have a UDF which takes a variable number of arguments?
e.g. df.select(myUdf($"*")) fails with
org.apache.spark.sql.AnalysisException: unresolved operator 'Project
[scalaUDF(*) AS scalaUDF(*)#26];
What I would like to do is pass in a generic data frame which can then be
passed t
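myUdf($"*") doesn't resolve, but one common workaround is to pack the columns into an array column explicitly and hand that to the UDF. A sketch, assuming every column can be cast to a string (combineAll and the "|" separator are invented):

  import org.apache.spark.sql.functions.{array, col, udf}

  // UDF over all columns of the frame, received as a single Seq[String].
  val combineAll = udf((values: Seq[String]) => values.mkString("|"))
  val allCols    = array(df.columns.map(c => col(c).cast("string")): _*)
  df.select(combineAll(allCols).as("combined"))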
Hey all,
I'm trying to do some stuff with a YAML file in the Spark driver using the
SnakeYAML library in Scala.
When I put the snakeyaml v1.14 jar on the SPARK_DIST_CLASSPATH and try to
de-serialize some objects from YAML into classes in my app JAR on the
driver (only the driver), I get the exception
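If the failure is SnakeYAML not being able to see the classes in the app JAR (a common symptom when the snakeyaml jar sits on the distribution classpath), one frequently suggested workaround is to construct the Yaml instance with the classloader that loaded the application classes. A sketch under that assumption; MyAppConfig and yamlText are placeholders:

  import org.yaml.snakeyaml.Yaml
  import org.yaml.snakeyaml.constructor.CustomClassLoaderConstructor

  // Hand SnakeYAML the classloader that loaded the app JAR so it can resolve app classes.
  val yaml   = new Yaml(new CustomClassLoaderConstructor(classOf[MyAppConfig].getClassLoader))
  val parsed = yaml.loadAs(yamlText, classOf[MyAppConfig])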
Hi guys,
I'm trying to do a cross join (cartesian product) with 3 tables stored as
parquet. Each table has 1 column, a long key.
Table A has 60,000 keys with 1000 partitions
Table B has 1000 keys with 1 partition
Table C has 4 keys with 1 partition
The output should be 240 million row combinations
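For reference, the shape of the job being described is roughly this (a sketch only; the paths are placeholders):

  // Three single-column parquet tables, cross-joined.
  val a = sqlContext.read.parquet("/data/tableA")   // 60,000 keys, 1000 partitions
  val b = sqlContext.read.parquet("/data/tableB")   // 1,000 keys, 1 partition
  val c = sqlContext.read.parquet("/data/tableC")   // 4 keys, 1 partition

  // A join with no condition is a cartesian product: 60,000 * 1,000 * 4 = 240,000,000 rows.
  val product = a.join(b).join(c)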
ive
> tasks or regular tasks (the first attempt of the task)? Is this error
> deterministic (can you reproduce every time you run this command)?
>
> Thanks,
>
> Yin
>
> On Mon, Jun 15, 2015 at 8:59 PM, Night Wolf
> wrote:
>
>> Looking at the logs of the execut
: Running task 11093.0 in stage 0.0
(TID 9552)
15/06/16 13:43:22 INFO executor.CoarseGrainedExecutorBackend: Got assigned
task 9553
15/06/16 13:43:22 INFO executor.Executor: Running task 10323.1 in stage 0.0
(TID 9553)
On Tue, Jun 16, 2015 at 1:47 PM, Night Wolf wrote:
> Hi guys,
>
> Using
Hi guys,
Using Spark 1.4, I'm trying to save a DataFrame as a table, a really simple
test, but I'm getting a bunch of NPEs.
The code I'm running is very simple:
qc.read.parquet("/user/sparkuser/data/staged/item_sales_basket_id.parquet").write.format("parquet").saveAsTable("is_20150617_test2")
Logs
How far did you get?
On Tue, Jun 2, 2015 at 4:02 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
> We use Scoobi + MR to perform joins and we particularly use blockJoin()
> API of scoobi
>
>
> /** Perform an equijoin with another distributed list where this list is
> considerably smaller
> * than the right (but too la
ain?
>
> spark.sql.hive.metastore.sharedPrefixes
> com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni
>
> https://issues.apache.org/jira/browse/SPARK-7819 has more context about
> it.
>
> On Wed, Jun 3, 2015 at 9:38 PM, Nig
Hi all,
Trying out Spark 1.4 RC4 on MapR4/Hadoop 2.5.1 running in yarn-client mode with
Hive support.
*Build command;*
./make-distribution.sh --name mapr4.0.2_yarn_j6_2.10 --tgz -Pyarn -Pmapr4
-Phadoop-2.4 -Pmapr4 -Phive -Phadoop-provided
-Dhadoop.version=2.5.1-mapr-1501 -Dyarn.version=2.5.1-mapr
1.4; it also has been
> working fine for me.
>
> Are you sure you're using exactly the same Hadoop libraries (since you're
> building with -Phadoop-provided) and Hadoop configuration in both cases?
>
> On Tue, Jun 2, 2015 at 5:29 PM, Night Wolf wrote:
>
>> Hi a
tderr)
15/06/03 10:34:26 INFO impl.ContainerManagementProtocolProxy: Opening proxy
: qtausc-pphd0177.hadoop.local:40237
15/06/03 10:34:31 INFO impl.AMRMClientImpl: Received new token for :
qtausc-pphd0132.hadoop.local:44108
15/06/03 10:34:31 INFO yarn.YarnAllocator: Received 1 containers from YARN
Hi all,
Trying out Spark 1.4 on MapR Hadoop 2.5.1 running in yarn-client mode.
Seems the application master doesn't work anymore; I get a 500/connection
refused even when I hit the IP/port of the Spark UI directly. The logs
don't show much.
I built Spark with Java 6, Hive & Scala 2.10 and 2.11. I'v
Hi all,
I have a job that, for every row, creates about 20 new objects (i.e. RDD of
100 rows in = RDD 2000 rows out). The reason for this is each row is tagged
with a list of the 'buckets' or 'windows' it belongs to.
The actual data is about 10 billion rows. Each executor has 60GB of memory.
Cur
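A rough sketch of the kind of row expansion being described, purely for illustration; Event, events (an RDD[Event]) and the windowing rule are invented, not the poster's actual code:

  // Each input row is tagged with every window it belongs to, ~20 output rows per input row.
  case class Event(id: Long, timestamp: Long)

  def windowsFor(e: Event): Seq[Long] = {
    val firstWindow = e.timestamp / 3600000L     // hourly windows, as an example
    firstWindow until firstWindow + 20           // each event falls into ~20 windows
  }

  val tagged = events.flatMap { e =>
    windowsFor(e).map(window => (window, e))
  }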
Hi guys,
If I load a DataFrame via a SQL context with a SORT BY in the query and
then repartition the DataFrame, will it keep the sort order in each
partition?
I want to repartition because I'm going to run a map that generates lots of
data internally, so to avoid Out Of Memory errors I n
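As far as I know, repartition() shuffles the data, so the per-partition ordering from SORT BY is generally not preserved. On releases that have DataFrame.sortWithinPartitions (1.6+), one option is to re-sort after repartitioning. A sketch with a placeholder query and key column:

  // The sort has to be re-applied after the shuffle introduced by repartition().
  val df       = sqlContext.sql("SELECT key, value FROM some_table SORT BY key")
  val resorted = df.repartition(2000).sortWithinPartitions("key")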
I'm seeing a similar thing with a slightly different stack trace. Ideas?
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
org.apache.spark.util.collection.E
Seeing similar issues, did you find a solution? One would be to increase
the number of partitions if you're doing lots of object creation.
On Thu, Feb 12, 2015 at 7:26 PM, fightf...@163.com
wrote:
> Hi, patrick
>
> Really glad to get your reply.
> Yes, we are doing group by operations for our wo
What was the answer? Was it only setting spark.sql.shuffle.partitions?
On Thu, Apr 30, 2015 at 12:14 PM, Ulanov, Alexander wrote:
> After day of debugging (actually, more), I can answer my question:
>
> The problem is that the default value 200 of
> “spark.sql.shuffle.partitions” is too small f
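For anyone landing on this thread, the setting in question can be raised from its default of 200 like so (2000 is just an example value):

  // Raise the shuffle partition count for SQL/DataFrame shuffles.
  sqlContext.setConf("spark.sql.shuffle.partitions", "2000")
  // or at submit time: spark-submit --conf spark.sql.shuffle.partitions=2000 ...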
was experimenting with the Row class in
> Python and apparently partitionBy automatically takes the first column as the key.
> However, I am not sure how you can access a part of an object without
> deserializing it (either explicitly or Spark doing it for you)
>
> On Wed, May 6, 2015 at 7:14 PM,
Hi,
If I have an RDD[MyClass] and I want to partition it by the hash code of
MyClass for performance reasons, is there any way to do this without
converting it into a PairRDD (RDD[(K,V)]) and calling partitionBy?
Mapping it to a tuple2 seems like a waste of space/computation.
It looks like the P
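As far as I know there is no partitionBy on a plain RDD, so the usual route is keyBy followed by partitionBy. A sketch; MyClass, rdd and the partition count are placeholders:

  import org.apache.spark.HashPartitioner

  // Key by hash code, then hash-partition the pair RDD.
  val keyed       = rdd.keyBy((x: MyClass) => x.hashCode)        // RDD[(Int, MyClass)]
  val partitioned = keyed.partitionBy(new HashPartitioner(200))
  val values      = partitioned.values  // back to RDD[MyClass]; note the partitioner info is dropped here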
Thanks Andrew. What version of HS2 is the SparkSQL thrift server using?
What would be involved in updating? Is it a simple case of bumping the
dependency version in one of the project POMs?
Cheers,
~N
On Sat, May 2, 2015 at 11:38 AM, Andrew Lee wrote:
> Hi N,
>
> See: https://issues.apache.org/jir
Hi guys,
Trying to use the SparkSQL Thrift server with the Hive metastore. It seems that
Hive metastore impersonation works fine (when running Hive tasks). However, when
spinning up the SparkSQL Thrift server, impersonation doesn't seem to work...
What settings do I need to enable impersonation?
I've copied the sa
luster, into a common
> location.
>
> On Thu, Apr 23, 2015 at 6:38 PM, Night Wolf
> wrote:
> > Hi guys,
> >
> > Having a problem building a DataFrame in Spark SQL from a JDBC data source
> when
> > running with --master yarn-client and adding the JDBC driver JAR
Hi guys,
Having a problem building a DataFrame in Spark SQL from a JDBC data source
when running with --master yarn-client and adding the JDBC driver JAR with
--jars. If I run with a local[*] master all works fine.
./bin/spark-shell --jars /tmp/libs/mysql-jdbc.jar --master yarn-client
sqlContext.lo
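The load being attempted presumably looks something like this (a sketch using the Spark 1.3-style API; the URL, table name and driver class are placeholders). On YARN the driver jar often also needs to be visible via spark.driver.extraClassPath / spark.executor.extraClassPath, not just --jars:

  // JDBC source loaded through the generic load() API.
  val jdbcDf = sqlContext.load("jdbc", Map(
    "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=me&password=secret",
    "dbtable" -> "my_table",
    "driver"  -> "com.mysql.jdbc.Driver"))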
Hey,
Trying to build Spark 1.3 with Scala 2.11 supporting yarn & hive (with
thrift server).
Running;
*mvn -e -DskipTests -Pscala-2.11 -Dscala-2.11 -Pyarn -Pmapr4 -Phive
-Phive-thriftserver clean install*
The build fails with;
INFO] Compiling 9 Scala sources to
/var/lib/jenkins/workspace/cse-Ap
Was a solution ever found for this? I'm trying to run some test cases with sbt
test which use Spark SQL, and in the Spark 1.3.0 release with Scala 2.11.6 I get
this error. Setting fork := true in sbt seems to work, but it's a less than
ideal workaround.
On Tue, Mar 17, 2015 at 9:37 PM, Eric Charles wrote:
Tried with that. No luck. Same error on the sbt-interface jar. I can see Maven
downloaded that jar into my .m2 cache.
On Friday, March 6, 2015, 鹰 <980548...@qq.com> wrote:
> try it with mvn -DskipTests -Pscala-2.11 clean install package
Hey,
Trying to build latest spark 1.3 with Maven using
-DskipTests clean install package
But I'm getting errors with zinc; in the logs I see:
[INFO]
*--- scala-maven-plugin:3.2.0:compile (scala-compile-first) @
spark-network-common_2.11 --- *
...
[error] Required file not found: sbt-interface
Hey guys,
Trying to build Spark 1.3 for Scala 2.11.
I'm running with the following Maven command:
-DskipTests -Dscala-2.11 clean install package
*Exception*:
[ERROR] Failed to execute goal on project spark-core_2.10: Could not
resolve dependencies for project
org.apache.spark:spark-core_2.10:
to Spark SQL and is used by default
>> when you run .cache on a SchemaRDD or CACHE TABLE.
>>
>> I'd also look at parquet which is more efficient and handles nested data
>> better.
>>
>> On Fri, Feb 13, 2015 at 7:36 AM, Night Wolf wrote:
>>
Hi all,
I'd like to build/use column oriented RDDs in some of my Spark code. A
normal Spark RDD is stored as row-oriented objects, if I understand
correctly.
I'd like to leverage some of the advantages of a columnar memory format.
Shark (used to) and SparkSQL uses a columnar storage format using pr
Did you find a workaround for this?
Could it be classpath ordering? I would expect the "file://..." protocol
to work when you have the MapR jars on the classpath...?
On Tue, Jan 20, 2015 at 4:36 AM, Ted Yu wrote:
> Your classpath has some MapR jar.
>
> Is that intentional ?
>
> Cheers
>
> On M
Hi,
I just built Spark 1.3 master using maven via make-distribution.sh;
./make-distribution.sh --name mapr3 --skip-java-test --tgz -Pmapr3 -Phive
-Phive-thriftserver -Phive-0.12.0
When trying to start the standalone spark master on a cluster I get the
following stack trace;
15/02/04 08:53:56 I
In Spark SQL we have Row objects which contain a list of fields that make
up a row. A Row has ordinal accessors such as .getInt(0) or getString(2).
Say ordinal 0 = ID and ordinal 1 = Name. It becomes hard to remember which
ordinal is which, making the code confusing.
Say for example I have the follo
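One way to avoid the ordinals, assuming a reasonably recent Spark where Row.getAs accepts a field name (the "id" and "name" columns below are hypothetical):

  // Rows coming out of a DataFrame carry a schema, so fields can be looked up by name.
  df.collect().foreach { row =>
    val id   = row.getAs[Int]("id")        // instead of row.getInt(0)
    val name = row.getAs[String]("name")   // instead of row.getString(1)
    println(s"$id: $name")
  }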
Hi all,
I'd like to leverage some of the fast Spark collection implementations in
my own code.
Particularly for doing things like distinct counts in a mapPartitions
loop.
Are there any plans to make the org.apache.spark.util.collection
implementations public? Is there any other library out ther
test" as Intellij won't provide the "provided" scope
> libraries when running code in "main" source (but it will for sources under
> "test").
>
> With this config you can "sbt assembly" in order to get the fat jar
> without Spar
Hi,
I'm trying to load up an SBT project in IntelliJ 14 (Windows) running a 1.7
JDK and SBT 0.13.5 - I seem to be getting errors with the project.
The build.sbt file is super simple;
name := "scala-spark-test1"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "s
Hi,
Just to give some context: we are using the Hive metastore with CSV & Parquet
files as part of our ETL pipeline. We query these with SparkSQL to do
some downstream work.
I'm curious what's the best way to go about testing Hive & SparkSQL? I'm
using 1.1.0.
I see that the LocalHiveContext has bee