Hi,
What about...libraryDependencies in build.sbt with % Provided +
sbt-assembly + sbt assembly = DONE.
Not much has changed since.
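A minimal sketch of that setup (assuming sbt-assembly 0.14.x and Spark 2.0 artifacts; adjust versions to whatever you are on):

  // project/plugins.sbt
  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

  // build.sbt -- mark Spark itself as Provided so it stays out of the fat jar
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "2.0.0" % Provided,
    "org.apache.spark" %% "spark-sql"  % "2.0.0" % Provided
  )

  // then build the fat jar with: sbt assembly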
Jacek
On 11 Aug 2016 11:29 a.m., "Efe Selcuk" wrote:
> Bump!
>
> On Wed, Aug 10, 2016 at 2:59 PM, Efe Selcuk wrote:
>
>> Thanks for the replies, folks.
>>
>> My
Hi,
While going through the logs of an application, I noticed that I could not find
any logs to dig deeper into any of the shuffle phases.
I am interested in finding out the time taken by each shuffle phase and the
size of any data spilled to disk, among other things.
Does anyone know how
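One way to surface those numbers yourself, sketched below, is a SparkListener that logs per-task shuffle and spill metrics (this assumes Spark 2.0; the metric accessors differ slightly in 1.x):

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  class ShuffleMetricsListener extends SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val m = taskEnd.taskMetrics               // can be null for failed tasks
      if (m != null) {
        println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
          s"shuffleBytesWritten=${m.shuffleWriteMetrics.bytesWritten} " +
          s"fetchWaitMs=${m.shuffleReadMetrics.fetchWaitTime} " +
          s"memorySpilled=${m.memoryBytesSpilled} diskSpilled=${m.diskBytesSpilled}")
      }
    }
  }

  sc.addSparkListener(new ShuffleMetricsListener)   // register before running the job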
Do you mind if I ask which format you used to save the data?
I guess you used CSV and there is a related PR open here
https://github.com/apache/spark/pull/14279#issuecomment-237434591
2016-08-12 6:04 GMT+09:00 Jestin Ma :
> When I load in a timestamp column and try to save it immediately witho
Hello,
I have a Spark job that basically reads data from two tables into two
DataFrames, which are subsequently converted to RDDs. I then join them
on a common key.
Each table is about 10 TB in size, but after filtering, the two RDDs are about
500GB each.
I have 800 executors with 8GB me
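For what it's worth, a rough sketch of that shape of job (the table names, filter condition and key column below are placeholders for whatever the real job uses):

  // read, filter, key by the join column, then join as RDDs
  val a = spark.table("table_a").filter("some_flag = 1")
    .rdd.map(r => (r.getAs[Long]("id"), r))
  val b = spark.table("table_b").filter("some_flag = 1")
    .rdd.map(r => (r.getAs[Long]("id"), r))

  val joined = a.join(b)   // RDD[(Long, (Row, Row))]; shuffles both sides by key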
Hi Ben, Already tried that. The thing is that any form of shuffle on the big
dataset (which repartition will cause) puts a node's chunk into /tmp, and that
fills up the disk. I solved the problem by storing the 1.5GB dataset in an
embedded Redis instance on the driver, and doing a straight flatMap of
When I load in a timestamp column and try to save it immediately without
any transformations, the output time is Unix time padded with 0's until
there are 16 digits.
For example,
loading in a time of August 3, 2016, 00:36:25 GMT, which is 1470184585 in
UNIX time, saves as 147018458500.
When I
And Spark even reads ORC data very slowly. And if the Hive table is
partitioned, it just hangs.
Regards,
Gourav
On Thu, Aug 11, 2016 at 6:02 PM, Mich Talebzadeh
wrote:
>
>
> This does not work with CLUSTERED BY clause in Spark 2 now!
>
> CREATE TABLE test.dummy2
> (
> ID INT
>
Hmm, hashing will probably send all of the records with the same key to the
same partition / machine.
I'd try it out, and hope that if you have a few super-large keys bigger than the
RAM available on one node, they spill to disk. Maybe play with persist() and
a different storage level.
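Something along these lines (a sketch; the RDD name joined is a placeholder):

  import org.apache.spark.storage.StorageLevel

  // allow partitions that do not fit in memory to spill to disk instead of failing
  joined.persist(StorageLevel.MEMORY_AND_DISK_SER)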
> O
Ok, interesting. Would be interested to see how it compares.
By the way, the feature size you select for the hasher should be a power of
2 (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes
are evenly distributed (see the section on HashingTF under
http://spark.apache.org/docs
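For example, with the spark.ml HashingTF (the input DataFrame tokenized and the column names are placeholders):

  import org.apache.spark.ml.feature.HashingTF

  val hasher = new HashingTF()
    .setInputCol("tokens")
    .setOutputCol("features")
    .setNumFeatures(1 << 24)   // a power of 2, ~16.8M hash buckets

  val hashed = hasher.transform(tokenized)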
Thanks Nick, I played around with the hashing trick. When I set numFeatures to
the number of distinct values for the largest sparse feature, I ended up with
half of them colliding. When raising the numFeatures to have fewer collisions, I
soon ended up with the same memory problems as before. To b
Thanks Ted,
In this case we were using Standalone with Standalone master started on
another node.
The app was started on a node but not the master node. The master node was
not affected. The node in question was the edge node (running spark-submit).
From the link I was not sure this matter would hav
Have you read
https://spark.apache.org/docs/latest/spark-standalone.html#high-availability
?
FYI
On Thu, Aug 11, 2016 at 12:40 PM, Mich Talebzadeh wrote:
>
> Hi,
>
> Although Spark is fault tolerant when nodes go down like below:
>
> FROM tmp
> [Stage 1:===>
When you read both 'a' and 'b', can you try repartitioning both by column 'id'?
If you .cache() and .count() to force a shuffle, it'll push the records that
will be joined to the same executors.
So:
a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.pa
Hi,
Although Spark is fault tolerant when nodes go down like below:
FROM tmp
[Stage 1:===> (20 + 10) /
100]16/08/11 20:21:34 ERROR TaskSchedulerImpl: Lost executor 3 on
xx.xxx.197.216: worker lost
[Stage 1:>
The algorithm update is just broken into 2 steps: trainOn, which learns/updates
the cluster centers, and predictOn, which predicts cluster assignments on data.
The StreamingKMeansExample you reference breaks up data into training and
test because you might want to score the predictions. If you don't care
ab
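Roughly like this sketch (mllib StreamingKMeans; k, the dimension and the two DStreams are placeholders):

  import org.apache.spark.mllib.clustering.StreamingKMeans

  val model = new StreamingKMeans()
    .setK(3)
    .setDecayFactor(1.0)
    .setRandomCenters(dim = 5, weight = 0.0)

  model.trainOn(trainingStream)                  // step 1: update the cluster centers
  val assignments = model.predictOn(testStream)  // step 2: assign points to clusters
  assignments.print()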
Figured it out. All I was doing wrong was testing it out in a pseudo-distributed
VM with 1 core. The tasks were waiting for CPU.
In the production cluster this works just fine.
On Thu, Aug 11, 2016 at 12:45 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
> Checked executor logs and UI . There
Bump!
On Wed, Aug 10, 2016 at 2:59 PM, Efe Selcuk wrote:
> Thanks for the replies, folks.
>
> My specific use case is maybe unusual. I'm working in the context of the
> build environment in my company. Spark was being used in such a way that
> the fat assembly jar that the old 'sbt assembly' com
Can someone also provide input on why my code may not be working? Below, I
have pasted part of my previous reply which describes the issue I am having
here. I am really more perplexed about the first set of code (in bold). I
know why the second set of code doesn't work, it is just something I
initi
I should be more clear, since the outcome of the discussion above was
not that obvious actually.
- I agree a change should be made to StandardScaler, and not VectorAssembler
- However I do think withMean should still be false by default and be
explicitly enabled (see the sketch below)
- The 'offset' idea is orthogonal,
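The explicit-enable case would look like this sketch with spark.ml's StandardScaler (column names are placeholders):

  import org.apache.spark.ml.feature.StandardScaler

  val scaler = new StandardScaler()
    .setInputCol("features")
    .setOutputCol("scaledFeatures")
    .setWithStd(true)
    .setWithMean(true)   // explicitly requested, not the default

  val scaled = scaler.fit(df).transform(df)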
This does not work with CLUSTERED BY clause in Spark 2 now!
CREATE TABLE test.dummy2
(
ID INT
, CLUSTERED INT
, SCATTERED INT
, RANDOMISED INT
, RANDOM_STRING VARCHAR(50)
, SMALL_VC VARCHAR(10)
, PADDING VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPRO
The data is not corrupted, as I can create the dataframe from the underlying raw
parquet in Spark 2.0.0 if, instead of using SparkSession.sql() to create a
dataframe, I use SparkSession.read.parquet().
This is Spark 2. You create a temp table from the DF using HiveContext:
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> s.registerTempTable("tmp")
scala> HiveContext.sql("select count(1) from tmp")
res18: org.apache.spark.sql.DataFrame = [count(1): bigint]
scala> HiveContext.sql("s
I'm attempting to access a dataframe from jdbc:
However this temp table is not accessible from beeline when connected to
this instance of HiveServer2.
How are you calling registerTempTable from hiveContext? It appears to be a
private method.
I am running HiveServer2 as well and when I connect with beeline I get the
following:
org.apache.spark.sql.internal.SessionState cannot be cast to
org.apache.spark.sql.hive.HiveSessionState
Do you know how to resolve this?
Dear All,
I was wondering why there is training data and testing data in kmeans?
Shouldn't it be unsupervised learning with just access to stream data?
I found a similar question but couldn't understand the answer.
http://stackoverflow.com/questions/30972057/is-the-streaming-k-means-clustering-pr
Opening this follow-up question to the entire mailing list. Anyone
have thoughts
on how I can add a column of dense vectors (created by converting a column
of sparse features) to a data frame? My efforts are below.
Although I know this is not the best approach for something I plan to put
in produc
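One hedged sketch of the usual route (Spark 2.0 Scala; the ml.linalg vector type and the column names are assumptions about the setup):

  import org.apache.spark.ml.linalg.Vector
  import org.apache.spark.sql.functions.{col, udf}

  // the returned DenseVector is still a Vector, so the column type stays VectorUDT
  val toDense = udf((v: Vector) => v.toDense)

  val withDense = df.withColumn("denseFeatures", toDense(col("sparseFeatures")))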
Alternatively, you should be able to write to a new table and use a trigger
or some other mechanism to update the particular row. I don't have any
experience with this myself but just looking at this documentation:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#03%20Data%20
I have data which is JSON in this format:
myList: array
 |  |-- elem: struct
 |  |  |-- nm: string (nullable = true)
 |  |  |-- vList: array (nullable = true)
 |  |  |  |-- element: string (containsNull = true)
From my Kafka stream, I created a dataframe usin
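If the goal is to flatten that structure, one hedged sketch (column aliases are placeholders) is to explode the outer array first and then the inner one:

  import org.apache.spark.sql.functions.{col, explode}

  val elems = df.select(explode(col("myList")).as("e"))       // one row per struct
  val flat  = elems.select(col("e.nm").as("nm"),
                           explode(col("e.vList")).as("v"))   // one row per inner value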
Thanks Aseem, I'll check this.
Samir
On Aug 11, 2016 4:39 AM, "Aseem Bansal" wrote:
> Check the schema of the data frame. It may be that your columns are
> String. You are trying to give default for numerical data.
>
> On Thu, Aug 11, 2016 at 6:28 AM, Javier Rey wrote:
>
>> Hi everybody,
>>
>>
I ran into a similar problem while using QDigest.
Shading the clearspring and fastutil classes solves the problem.
Snippet from pom.xml:
<phase>package</phase>
<goals><goal>shade</goal></goals>
Hi
I have a Dataset
I will change a String to a String, so there will be no schema changes.
Is there a way I can run a map on it? I have seen the function at
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html#map(org.apache.spark.api.java.function.MapFunction,%20org.apac
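In Scala a minimal sketch looks like the following, assuming a Dataset[String] named ds and spark.implicits._ in scope for the encoder (the Java overload you link to takes a MapFunction plus an explicit Encoders.STRING() instead):

  import spark.implicits._   // brings the String encoder into scope

  val mapped = ds.map(s => s.trim.toLowerCase)   // Dataset[String] -> Dataset[String]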
In that case one alternative would be to save the new table on HDFS and
then, using some simple ETL, load it into a staging table in MySQL and
update the original table from the staging table.
The whole thing can be done in a shell script.
HTH
Dr Mich Talebzadeh
I read the table via Spark SQL and perform some ML activity on the data,
and the result will be to update some specific columns with the ML-improved
result, hence I do not have the option to do the whole operation in MySQL.
Thanks,
Sujeet
On Thu, Aug 11, 2016 at 3:29 PM, Mich Talebzadeh
No, that doesn't describe the change being discussed, since you've
copied the discussion about adding an 'offset'. That's orthogonal.
You're also suggesting making withMean=True the default, which we
don't want. The point is that if this is *explicitly* requested, the
scaler shouldn't refuse to sub
OK, it is clearer now.
You are using Spark as the query tool on an RDBMS table? Read the table via
JDBC, write back updating certain records.
I have not done this myself, but I suspect the issue would be whether the Spark
write will commit the transaction and maintain ACID compliance (locking the
rows etc.).
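For reference, the read side via JDBC is straightforward (the connection details below are placeholders); the DataFrame write side only offers modes like append and overwrite, so a row-level update would have to go through plain JDBC/SQL outside Spark:

  val df = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/mydb")   // placeholder URL
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    .load()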
1) using MySQL DB
2) will be inserting/updating/overwriting the same table
3) I want to update a specific column in a record; the data is read via
Spark SQL
On the below table, which is read via Spark SQL, I would like to update the
NumOfSamples column.
consider DF as the dataFrame which holds
Check the schema of the data frame. It may be that your columns are String
and you are trying to give a default for numerical data.
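A sketch of both cases (the column name is a placeholder):

  // numeric columns can be filled with a numeric default
  val filledNumeric = df.na.fill(0)

  // string columns need a string default, optionally limited to specific columns
  val filledStrings = df.na.fill("0", Seq("some_string_column"))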
On Thu, Aug 11, 2016 at 6:28 AM, Javier Rey wrote:
> Hi everybody,
>
> I have a data frame after many transformation, my final task is fill na's
> with zeros, but I run
Here you can find more information about the code of my class
"RandomForestRegression..java":
http://spark.apache.org/docs/latest/mllib-ensembles.html#regression
2016-08-11 10:18 GMT+02:00 Zakaria Hili :
> Hi,
>
> I recognize that spark can't save generated model on HDFS (I'm used random
> fo
Hi,
I notice that Spark can't save a generated model on HDFS (I used random
forest regression and linear regression for this test).
It can save only the data directory, as you can see in the picture below:
[image: inline image 1]
but to load a model I will need some data from the metadata di
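For reference, the documented save/load pattern for the mllib tree models looks like this sketch (the HDFS path is a placeholder):

  import org.apache.spark.mllib.tree.model.RandomForestModel

  model.save(sc, "hdfs:///models/myRandomForest")   // writes both data/ and metadata/
  val sameModel = RandomForestModel.load(sc, "hdfs:///models/myRandomForest")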
The currently available output modes are Complete and Append. Complete mode is
for stateful processing (aggregations), and Append mode for stateless
processing (i.e., map/filter). See:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
Dataset#writeStream
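For example (a sketch against Spark 2.0 structured streaming; the aggregated Dataset counts and the console sink are placeholders):

  val query = counts.writeStream
    .outputMode("complete")    // stateful aggregation result, so Complete mode
    .format("console")
    .start()

  query.awaitTermination()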