Hi,
What about...libraryDependencies in build.sbt with % Provided +
sbt-assembly + sbt assembly = DONE.
Not much has changed since.
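A minimal sketch of that setup (assuming sbt-assembly 0.14.x and Spark 2.0 artifacts; adjust versions to whatever you are on):

  // project/plugins.sbt
  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

  // build.sbt -- mark Spark itself as Provided so it stays out of the fat jar
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "2.0.0" % Provided,
    "org.apache.spark" %% "spark-sql"  % "2.0.0" % Provided
  )

  // then build the fat jar with: sbt assembly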
Jacek
On 11 Aug 2016 11:29 a.m., "Efe Selcuk" wrote:
> Bump!
>
> On Wed, Aug 10, 2016 at 2:59 PM, Efe Selcuk wrote:
>
>> Thanks for the replies, folks.
>>
>> My
Hi,
While going through the logs of an application, I noticed that I could not find
any logs to dig deeper into any of the shuffle phases.
I am interested in finding out the time taken by each shuffle phase and the
size of any data spilled to disk, among other things.
Does anyone know how
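One way to surface those numbers yourself, sketched below, is a SparkListener that logs per-task shuffle and spill metrics (this assumes Spark 2.0; the metric accessors differ slightly in 1.x):

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  class ShuffleMetricsListener extends SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
      val m = taskEnd.taskMetrics               // can be null for failed tasks
      if (m != null) {
        println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
          s"shuffleBytesWritten=${m.shuffleWriteMetrics.bytesWritten} " +
          s"fetchWaitMs=${m.shuffleReadMetrics.fetchWaitTime} " +
          s"memorySpilled=${m.memoryBytesSpilled} diskSpilled=${m.diskBytesSpilled}")
      }
    }
  }

  sc.addSparkListener(new ShuffleMetricsListener)   // register before running the job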
Do you mind if I ask which format you used to save the data?
I guess you used CSV and there is a related PR open here
https://github.com/apache/spark/pull/14279#issuecomment-237434591
2016-08-12 6:04 GMT+09:00 Jestin Ma :
> When I load in a timestamp column and try to save it immediately witho
Hello,
I have a Spark job that basically reads data from two tables into two
DataFrames, which are subsequently converted to RDDs. I then join them
on a common key.
Each table is about 10 TB in size, but after filtering, the two RDDs are about
500GB each.
I have 800 executors with 8GB me
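For what it's worth, a rough sketch of that shape of job (the table names, filter condition and key column below are placeholders for whatever the real job uses):

  // read, filter, key by the join column, then join as RDDs
  val a = spark.table("table_a").filter("some_flag = 1")
    .rdd.map(r => (r.getAs[Long]("id"), r))
  val b = spark.table("table_b").filter("some_flag = 1")
    .rdd.map(r => (r.getAs[Long]("id"), r))

  val joined = a.join(b)   // RDD[(Long, (Row, Row))]; shuffles both sides by key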
Hi Ben, Already tried that. The thing is that any form of shuffle on the big
dataset (which repartition will cause) puts a node's chunk into /tmp, and that
fills up the disk. I solved the problem by storing the 1.5GB dataset in an
embedded Redis instance on the driver, and doing a straight flatMap of
When I load in a timestamp column and try to save it immediately without
any transformations, the output time is Unix time padded with 0's until
there are 16 digits.
For example,
loading in a time of August 3, 2016, 00:36:25 GMT, which is 1470184585 in
UNIX time, saves as 147018458500.
When I
And Spark even reads ORC data very slowly. And if the Hive table is
partitioned, it just hangs.
Regards,
Gourav
On Thu, Aug 11, 2016 at 6:02 PM, Mich Talebzadeh
wrote:
>
>
> This does not work with CLUSTERED BY clause in Spark 2 now!
>
> CREATE TABLE test.dummy2
> (
> ID INT
>
Hmm, hashing will probably send all of the records with the same key to the
same partition / machine.
I'd try it out, and hope that if you have a few super-large keys bigger than the
RAM available on one node, they spill to disk. Maybe play with persist() and
a different storage level.
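Something along these lines (a sketch; the RDD name joined is a placeholder):

  import org.apache.spark.storage.StorageLevel

  // allow partitions that do not fit in memory to spill to disk instead of failing
  joined.persist(StorageLevel.MEMORY_AND_DISK_SER)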
> O
Ok, interesting. Would be interested to see how it compares.
By the way, the feature size you select for the hasher should be a power of
2 (e.g. 2**24 to 2**26 may be worth trying) to ensure the feature indexes
are evenly distributed (see the section on HashingTF under
http://spark.apache.org/docs
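For example, with the spark.ml HashingTF (the input DataFrame tokenized and the column names are placeholders):

  import org.apache.spark.ml.feature.HashingTF

  val hasher = new HashingTF()
    .setInputCol("tokens")
    .setOutputCol("features")
    .setNumFeatures(1 << 24)   // a power of 2, ~16.8M hash buckets

  val hashed = hasher.transform(tokenized)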
Thanks Nick, I played around with the hashing trick. When I set numFeatures to
the number of distinct values for the largest sparse feature, I ended up with
half of them colliding. When raising the numFeatures to have fewer collisions, I
soon ended up with the same memory problems as before. To b
Thanks Ted,
In this case we were using Standalone with Standalone master started on
another node.
The app was started on a node but not the master node. The master node was
not affected. The node in question was the edge node (running spark-submit).
From the link I was not sure this matter would hav
Have you read
https://spark.apache.org/docs/latest/spark-standalone.html#high-availability
?
FYI
On Thu, Aug 11, 2016 at 12:40 PM, Mich Talebzadeh wrote:
>
> Hi,
>
> Although Spark is fault tolerant when nodes go down like below:
>
> FROM tmp
> [Stage 1:===>
When you read both 'a' and 'b', can you try repartitioning both by column 'id'?
If you .cache() and .count() to force a shuffle, it'll push the records that
will be joined to the same executors.
So:
a = spark.read.parquet('path_to_table_a').repartition('id').cache()
a.count()
b = spark.read.pa
Hi,
Although Spark is fault tolerant when nodes go down like below:
FROM tmp
[Stage 1:===> (20 + 10) /
100]16/08/11 20:21:34 ERROR TaskSchedulerImpl: Lost executor 3 on
xx.xxx.197.216: worker lost
[Stage 1:>
The algorithm update is just broken into 2 steps: trainOn, which learns/updates
the cluster centers, and predictOn, which predicts cluster assignments on data.
The StreamingKMeansExample you reference breaks up data into training and
test because you might want to score the predictions. If you don't care
ab
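Roughly like this sketch (mllib StreamingKMeans; k, the dimension and the two DStreams are placeholders):

  import org.apache.spark.mllib.clustering.StreamingKMeans

  val model = new StreamingKMeans()
    .setK(3)
    .setDecayFactor(1.0)
    .setRandomCenters(dim = 5, weight = 0.0)

  model.trainOn(trainingStream)                  // step 1: update the cluster centers
  val assignments = model.predictOn(testStream)  // step 2: assign points to clusters
  assignments.print()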
Figured it out. All I was doing wrong was testing it out in a pseudo-distributed
VM with 1 core. The tasks were waiting for CPU.
In the production cluster this works just fine.
On Thu, Aug 11, 2016 at 12:45 AM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
> Checked executor logs and UI . There
Bump!
On Wed, Aug 10, 2016 at 2:59 PM, Efe Selcuk wrote:
> Thanks for the replies, folks.
>
> My specific use case is maybe unusual. I'm working in the context of the
> build environment in my company. Spark was being used in such a way that
> the fat assembly jar that the old 'sbt assembly' com
Can someone also provide input on why my code may not be working? Below, I
have pasted part of my previous reply which describes the issue I am having
here. I am really more perplexed about the first set of code (in bold). I
know why the second set of code doesn't work, it is just something I
initi
I should be more clear, since the outcome of the discussion above was
not that obvious actually.
- I agree a change should be made to StandardScaler, and not VectorAssembler
- However I do think withMean should still be false by default and be
explicitly enabled (see the sketch below)
- The 'offset' idea is orthogonal,
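The explicit-enable case would look like this sketch with spark.ml's StandardScaler (column names are placeholders):

  import org.apache.spark.ml.feature.StandardScaler

  val scaler = new StandardScaler()
    .setInputCol("features")
    .setOutputCol("scaledFeatures")
    .setWithStd(true)
    .setWithMean(true)   // explicitly requested, not the default

  val scaled = scaler.fit(df).transform(df)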
This does not work with CLUSTERED BY clause in Spark 2 now!
CREATE TABLE test.dummy2
(
ID INT
, CLUSTERED INT
, SCATTERED INT
, RANDOMISED INT
, RANDOM_STRING VARCHAR(50)
, SMALL_VC VARCHAR(10)
, PADDING VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPRO
The data is not corrupted, as I can create the dataframe from the underlying raw
parquet in Spark 2.0.0 if, instead of using SparkSession.sql() to create a
dataframe, I use SparkSession.read.parquet().
This is Spark 2. You create a temp table from the DF using HiveContext:
val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> s.registerTempTable("tmp")
scala> HiveContext.sql("select count(1) from tmp")
res18: org.apache.spark.sql.DataFrame = [count(1): bigint]
scala> HiveContext.sql("s
I'm attempting to access a dataframe from jdbc:
However this temp table is not accessible from beeline when connected to
this instance of HiveServer2.
How are you calling registerTempTable from hiveContext? It appears to be a
private method.
I am running HiveServer2 as well and when I connect with beeline I get the
following:
org.apache.spark.sql.internal.SessionState cannot be cast to
org.apache.spark.sql.hive.HiveSessionState
Do you know how to resolve this?
Dear All,
I was wondering why there is training data and testing data in kmeans?
Shouldn't it be unsupervised learning with just access to stream data?
I found a similar question but couldn't understand the answer.
http://stackoverflow.com/questions/30972057/is-the-streaming-k-means-clustering-pr
Opening this follow-up question to the entire mailing list. Anyone
have thoughts
on how I can add a column of dense vectors (created by converting a column
of sparse features) to a data frame? My efforts are below.
Although I know this is not the best approach for something I plan to put
in produc
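One hedged sketch of the usual route (Spark 2.0 Scala; the ml.linalg vector type and the column names are assumptions about the setup):

  import org.apache.spark.ml.linalg.Vector
  import org.apache.spark.sql.functions.{col, udf}

  // the returned DenseVector is still a Vector, so the column type stays VectorUDT
  val toDense = udf((v: Vector) => v.toDense)

  val withDense = df.withColumn("denseFeatures", toDense(col("sparseFeatures")))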
Alternatively, you should be able to write to a new table and use a trigger
or some other mechanism to update the particular row. I don't have any
experience with this myself but just looking at this documentation:
https://docs.cloud.databricks.com/docs/latest/databricks_guide/index.html#03%20Data%20
I have data which is JSON in this format:
myList: array
 |  |-- elem: struct
 |  |  |-- nm: string (nullable = true)
 |  |  |-- vList: array (nullable = true)
 |  |  |  |-- element: string (containsNull = true)
From my Kafka stream, I created a dataframe usin
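If the goal is to flatten that structure, one hedged sketch (column aliases are placeholders) is to explode the outer array first and then the inner one:

  import org.apache.spark.sql.functions.{col, explode}

  val elems = df.select(explode(col("myList")).as("e"))       // one row per struct
  val flat  = elems.select(col("e.nm").as("nm"),
                           explode(col("e.vList")).as("v"))   // one row per inner value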
Thanks Aseem, I'll check this.
Samir
On Aug 11, 2016 4:39 AM, "Aseem Bansal" wrote:
> Check the schema of the data frame. It may be that your columns are
> String. You are trying to give default for numerical data.
>
> On Thu, Aug 11, 2016 at 6:28 AM, Javier Rey wrote:
>
>> Hi everybody,
>>
>>
I ran into a similar problem while using QDigest.
Shading the clearspring and fastutil classes solves the problem.
Snippet from pom.xml:
<phase>package</phase>
<goals><goal>shade</goal></goals>
Hi
I have a Dataset
I will change a String to a String, so there will be no schema changes.
Is there a way I can run a map on it? I have seen the function at
https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/Dataset.html#map(org.apache.spark.api.java.function.MapFunction,%20org.apac
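In Scala a minimal sketch looks like the following, assuming a Dataset[String] named ds and spark.implicits._ in scope for the encoder (the Java overload you link to takes a MapFunction plus an explicit Encoders.STRING() instead):

  import spark.implicits._   // brings the String encoder into scope

  val mapped = ds.map(s => s.trim.toLowerCase)   // Dataset[String] -> Dataset[String]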
In that case one alternative would be to save the new table on HDFS and
then, using some simple ETL, load it into a staging table in MySQL and
update the original table from the staging table.
The whole thing can be done in a shell script.
HTH
Dr Mich Talebzadeh
I read the table via Spark SQL and perform some ML activity on the data,
and the result will be to update some specific columns with the ML-improved
result, hence I do not have the option to do the whole operation in MySQL.
Thanks,
Sujeet
On Thu, Aug 11, 2016 at 3:29 PM, Mich Talebzadeh
No, that doesn't describe the change being discussed, since you've
copied the discussion about adding an 'offset'. That's orthogonal.
You're also suggesting making withMean=True the default, which we
don't want. The point is that if this is *explicitly* requested, the
scaler shouldn't refuse to sub
OK, it is clearer now.
You are using Spark as the query tool on an RDBMS table? Read the table via
JDBC, write back updating certain records.
I have not done this myself, but I suspect the issue would be whether the Spark
write will commit the transaction and maintain ACID compliance (locking the
rows etc.).
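For reference, the read side via JDBC is straightforward (the connection details below are placeholders); the DataFrame write side only offers modes like append and overwrite, so a row-level update would have to go through plain JDBC/SQL outside Spark:

  val df = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://dbhost:3306/mydb")   // placeholder URL
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    .load()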
1) using MySQL DB
2) will be inserting/updating/overwriting the same table
3) I want to update a specific column in a record; the data is read via
Spark SQL
On the below table, which is read via Spark SQL, I would like to update the
NumOfSamples column.
consider DF as the dataFrame which holds
Check the schema of the data frame. It may be that your columns are String
and you are trying to give a default for numerical data.
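A sketch of both cases (the column name is a placeholder):

  // numeric columns can be filled with a numeric default
  val filledNumeric = df.na.fill(0)

  // string columns need a string default, optionally limited to specific columns
  val filledStrings = df.na.fill("0", Seq("some_string_column"))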
On Thu, Aug 11, 2016 at 6:28 AM, Javier Rey wrote:
> Hi everybody,
>
> I have a data frame after many transformation, my final task is fill na's
> with zeros, but I run
Here you can find more information about the code of my class
"RandomForestRegression..java":
http://spark.apache.org/docs/latest/mllib-ensembles.html#regression
2016-08-11 10:18 GMT+02:00 Zakaria Hili :
> Hi,
>
> I recognize that spark can't save generated model on HDFS (I'm used random
> fo
Hi,
I notice that Spark can't save a generated model on HDFS (I used random
forest regression and linear regression for this test).
It can save only the data directory, as you can see in the picture below:
[image: inline image 1]
but to load a model I will need some data from the metadata di
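For reference, the documented save/load pattern for the mllib tree models looks like this sketch (the HDFS path is a placeholder):

  import org.apache.spark.mllib.tree.model.RandomForestModel

  model.save(sc, "hdfs:///models/myRandomForest")   // writes both data/ and metadata/
  val sameModel = RandomForestModel.load(sc, "hdfs:///models/myRandomForest")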
The currently available output modes are Complete and Append. Complete mode is
for stateful processing (aggregations), and Append mode for stateless
processing (i.e., map/filter). See:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-modes
Dataset#writeStream
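For example (a sketch against Spark 2.0 structured streaming; the aggregated Dataset counts and the console sink are placeholders):

  val query = counts.writeStream
    .outputMode("complete")    // stateful aggregation result, so Complete mode
    .format("console")
    .start()

  query.awaitTermination()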