[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all, I've noticed that spark's xxhash64 output doesn't match other tools' due to using seed=42 as a default. I've looked at a few libraries and they use 0 as the default seed: - python https://github.com/ifduyue/python-xxhash - java https://github.com/OpenHFT/Zero-Allocation-Hashing/ - java (slic
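[Editor's note] A minimal sketch of the mismatch, assuming a local Spark 3.x session:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Spark's xxhash64 hashes with a fixed seed of 42, so this value will not
// match xxHash64 implementations that default to seed 0.
spark.sql("SELECT xxhash64('spark') AS h").show(false)

// To reproduce Spark's output in another tool, that tool must be invoked
// with seed = 42 over the same byte representation of the input.
```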

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Igor Calabria
You might be affected by this issue: https://github.com/apache/iceberg/issues/8601 It was already patched but it isn't released yet. On Thu, Oct 5, 2023 at 7:47 PM Prashant Sharma wrote: > Hi Sanket, more details might help here. > > What does your spark configuration look like? > > What exactly

Re: Efficiently updating running sums only on new data

2022-10-13 Thread Igor Calabria
You can tag the last entry for each key using the same window you're using for your rolling sum. Something like this: "LEAD(1) OVER your_window IS NULL as last_record". Then, you just UNION ALL the last entry of each key (which you tagged) with the new data and run the same query over the windowed
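[Editor's note] A sketch of that approach with made-up data and column names (Spark 2.x+ API):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Previously computed running sums, plus a new batch (hypothetical data).
Seq(("a", 1L, 10L), ("a", 2L, 25L), ("b", 1L, 7L))
  .toDF("key", "ts", "running_sum").createOrReplaceTempView("sums_so_far")
val newBatch = Seq(("a", 3L, 5L), ("b", 2L, 3L)).toDF("key", "ts", "value")

// Tag each key's last row, as the reply suggests, and keep only those rows.
val lastPerKey = spark.sql("""
  SELECT key, ts, running_sum AS value,
         LEAD(1) OVER (PARTITION BY key ORDER BY ts) IS NULL AS last_record
  FROM sums_so_far
""").where("last_record").drop("last_record")

// UNION ALL with the new data and rerun the same windowed sum.
val result = lastPerKey.union(newBatch)
  .selectExpr("key", "ts",
    "SUM(value) OVER (PARTITION BY key ORDER BY ts) AS running_sum")
```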

Re: Help with Shuffle Read performance

2022-09-30 Thread Igor Calabria
needs, given > the fact that you only have 128GB RAM. > > Hope this helps... > > On 9/29/22 2:12 PM, Igor Calabria wrote: > > Hi Everyone, > > I'm running spark 3.2 on kubernetes and have a job with a decently sized > shuffle of almost 4TB. The relevant cluster

Re: Help with Shuffle Read performance

2022-09-29 Thread Igor Calabria
IOPS limits, since most of our jobs use instances with local disks instead. On Thu, Sep 29, 2022 at 7:44 PM Vladimir Prus wrote: > Igor, > > what exact instance types do you use? Unless you use local instance > storage and have actually configured your Kubernetes and Spark to use >

Help with Shuffle Read performance

2022-09-29 Thread Igor Calabria
Hi Everyone, I'm running spark 3.2 on kubernetes and have a job with a decently sized shuffle of almost 4TB. The relevant cluster config is as follows: - 30 Executors. 16 physical cores, configured with 32 Cores for spark - 128 GB RAM - shuffle.partitions is 18k which gives me tasks of around 15
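[Editor's note] For reference, the described setup expressed as session configuration (values copied from the post; Kubernetes submission flags omitted) — a sketch only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.executor.instances", "30")
  .config("spark.executor.cores", "32")            // 16 physical cores, oversubscribed to 32
  .config("spark.executor.memory", "128g")
  .config("spark.sql.shuffle.partitions", "18000") // ~4TB shuffle split into 18k tasks
  .getOrCreate()
```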

Re: Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

2022-07-06 Thread igor cabral uchoa
configs to support the BHJ, I would like to understand whether this is a new behavior in Spark 3 or a bug. I couldn’t find anything useful on the internet about it. Best regards, Igor Uchôa Sent from Yahoo Mail for iPhone On Wednesday, July 6, 2022, 12:47 PM, Tufan Rakshit wrote: There are a few

Migration from Spark 2.4.0 to Spark 3.1.1 caused SortMergeJoin to change to BroadcastHashJoin

2022-07-06 Thread igor cabral uchoa
Hi all, I hope everyone is doing well. I'm currently working on a Spark migration project that aims to migrate all Spark SQL pipelines to Spark 3.x and take advantage of all the performance improvements in it. My company is using Spark 2.4.0 but we are targeting to use officially the 3.1.1
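[Editor's note] Two knobs that commonly explain such a plan change after a 2.4-to-3.x migration (a hedged sketch, not the thread's resolution; largeDf/otherDf are placeholders):

```scala
// Spark 3 can choose BroadcastHashJoin both statically (size estimate under
// the threshold) and at runtime (AQE converting a sort-merge join).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // disable size-based BHJ
spark.conf.set("spark.sql.adaptive.enabled", "false")        // disable AQE conversions

largeDf.join(otherDf, "id").explain() // check which join the physical plan picks
```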

Re: repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
> where determinism is assumed. I don't think that supports such a drastic > statement. > > On Wed, Jun 22, 2022 at 12:39 PM Igor Berman > wrote: > >> Hi All >> tldr; IMHO repartition(n) should be deprecated or red-flagged, so that >> everybody will under

repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
mechanism to work in a consistent way and not to drop data silently (nor to produce duplicates) Any thoughts? thanks in advance Igor
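[Editor's note] A sketch of the distinction behind the concern (hypothetical sizes):

```scala
import spark.implicits._

val ds = spark.range(0, 1000000)

// Round-robin repartition: a row's target partition depends on the order rows
// arrive in, so a recomputed (retried) input partition can route rows
// differently - the failure mode described above.
val roundRobin = ds.repartition(200)

// Hash repartition by a column: the same row always lands in the same
// partition, regardless of retries.
val deterministic = ds.repartition(200, $"id")
```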

Re: Spark job fails because of timeout to Driver

2019-10-04 Thread igor cabral uchoa
Maybe it is a basic question, but does your cluster have enough resources to run your application? It is requesting 208G of RAM  Thanks, Sent from Yahoo Mail for iPhone On Friday, October 4, 2019, 2:31 PM, Jochen Hebbrecht wrote: Hi Igor, We are deploying by submitting a batch job on a Livy server

Re: Spark job fails because of timeout to Driver

2019-10-04 Thread igor cabral uchoa
Hi Roland! What deploy mode are you using when you submit your applications? Is it client or cluster mode? Regards, Sent from Yahoo Mail for iPhone On Friday, October 4, 2019, 12:37 PM, Roland Johann wrote: These are dynamic port ranges and depend on the configuration of your cluster. Per job

Re: In spark streaming application how to distinguish between normal and abnormal termination of application?

2018-04-15 Thread Igor Makhlin
looks like nobody knows the answer to this question ;) On Sat, Mar 31, 2018 at 1:59 PM, Igor Makhlin wrote: > Hi All, > > I'm looking for a way to distinguish between normal and abnormal > termination of a spark streaming application (with checkpointing enabled). > > Addi
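[Editor's note] One driver-side way to tell the two apart (a sketch; checkpointDir and createContext are placeholders): awaitTermination() rethrows the error that brought the context down, while a clean stop() returns normally.

```scala
import org.apache.spark.streaming.StreamingContext

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
try {
  ssc.awaitTermination()
  // normal termination: someone called ssc.stop()
} catch {
  case e: Throwable =>
    // abnormal termination: the failure that stopped the streams surfaces here
    throw e
}
```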

In spark streaming application how to distinguish between normal and abnormal termination of application?

2018-03-31 Thread Igor Makhlin
stream in this particular application). -- Sincerely, Igor Makhlin

Initial job has not accepted any resources

2017-01-04 Thread Igor Berman
be "WAITING" indefinitely... I've thought of implementing periodic check(by calling rest api /json) that will kill application when waiting time > 10-15 mins for some activeapp any advice will be appreciated, thanks in advance Igor

Re: need help to have a Java version of this scala script

2016-12-17 Thread Igor Berman
do you mind showing what you have in java? in general $"bla" is col("bla") as soon as you import the appropriate function import static org.apache.spark.sql.functions.callUDF; import static org.apache.spark.sql.functions.col; udf should be callUDF e.g. ds.withColumn("localMonth", callUDF("toLocalMonth"

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-10-01 Thread Igor Berman
Takeshi, why are you saying this? How have you checked that it's only used from 2.7.3? We use spark 2.0 which is shipped with a hadoop dependency of 2.7.2 and we use this setting. We've sort of "verified" it's used by enabling logging of the file output committer On 30 September 2016 at 03:12, Takeshi Yamamuro

Re: Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-28 Thread Igor Berman
Michael, can you explain please why bucketBy is supported when using saveAsTable() to parquet but not with parquet()? Is this the only difference between the table api and the dataframe/dataset api, or are there others? org.apache.spark.sql.AnalysisException: 'save' does not support bucketing right now; at o
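[Editor's note] The asymmetry in one sketch (df and the column are hypothetical):

```scala
// Bucketing works through the table API...
df.write.bucketBy(8, "user_id").sortBy("user_id").saveAsTable("events_bucketed")

// ...but the path-based writer raises the AnalysisException quoted above.
df.write.bucketBy(8, "user_id").parquet("/tmp/events")
```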

Re: Missing output partition file in S3

2016-09-16 Thread Igor Berman
are you using speculation? On 15 September 2016 at 21:37, Chen, Kevin wrote: > Hi, > > Has anyone encountered an issue of missing output partition files in S3? > My spark job writes output to a S3 location. Occasionally, I noticed one > partition file is missing. As a result, one chunk of data

Re: Spark_JDBC_Partitions

2016-09-13 Thread Igor Racic
readers) Regards, Igor 2016-09-11 0:46 GMT+02:00 Mich Talebzadeh : > Good points > > Unfortunately databump. expr, imp use binary format for import and export. > that cannot be used to import data into HDFS in a suitable way. > > One can use what is known as flat,sh scr
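[Editor's note] For context, the standard way to get N parallel JDBC readers (all connection details below are placeholders):

```scala
val df = spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db-host:1521/service")
  .option("dbtable", "big_table")
  .option("user", "scott").option("password", "secret")
  .option("partitionColumn", "id") // numeric column to split on
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("numPartitions", "32")   // 32 concurrent readers, one per range slice
  .load()
```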

Re: Using spark to distribute jobs to standalone servers

2016-08-25 Thread Igor Berman
imho, you'll need to implement a custom rdd with your locality settings (i.e. a custom implementation of discovering where each partition is located) + a setting for spark.locality.wait On 24 August 2016 at 03:48, Mohit Jaggi wrote: > It is a bit hacky but possible. A lot depends on what kind of querie
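[Editor's note] A very rough sketch of that idea (host list and payload are made up): a custom RDD that pins each partition to a specific server via getPreferredLocations.

```scala
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

case class HostPartition(index: Int, host: String) extends Partition

class ServerBoundRDD(sc: SparkContext, hosts: Seq[String])
    extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => HostPartition(i, h) }.toArray

  // The scheduler tries to honour this, waiting up to spark.locality.wait
  // before falling back to another node.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HostPartition].host)

  override def compute(split: Partition, ctx: TaskContext): Iterator[String] =
    Iterator(s"work for ${split.asInstanceOf[HostPartition].host}")
}
```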

Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-20 Thread Igor Berman
in addition check which ip the master is binding to (with netstat) On 20 July 2016 at 06:12, Andrew Ehrlich wrote: > Troubleshooting steps: > > $ telnet localhost 7077 (on master, to confirm port is open) > $ telnet 7077 (on slave, to confirm port is blocked) > > If the port is available on the ma

streaming new data into bigger parquet file

2016-07-06 Thread Igor Berman
or can it append the new data at the end? 2. while the appending process happens - how can I ensure that readers of the big parquet files are not blocked and won't get any errors? (i.e. are the files "available" while appending new data to them?) I will highly appreciate any pointers thanks in advance, Igor
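[Editor's note] The append side of the question, as a sketch (path and dataframe are placeholders):

```scala
import org.apache.spark.sql.SaveMode

// Each batch adds new part-files under the directory instead of rewriting it;
// a file only becomes visible to readers once its task has committed.
newData.write.mode(SaveMode.Append).parquet("hdfs:///warehouse/events")
```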

Re: Spark streaming readind avro from kafka

2016-06-01 Thread Igor Berman
An Avro file contains metadata with the schema (writer schema); in Kafka there is no such thing, so you should put a message that contains some reference to a known schema (embedding the whole schema would have big overhead) some people use a schema registry solution On 1 June 2016 at 21:02, justneeraj wrote: > +1 > > I

different SqlContext with same udf name with different meaning

2016-05-08 Thread Igor Berman
Hi, suppose I have a multitenant environment and I want to give my users additional functions, but for each user/tenant the meaning of the same function depends on the user's specific configuration is it possible to register the same function several times under different SqlContexts? are several SqlContext
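[Editor's note] In today's API the same question maps to SparkSession.newSession(), which gets its own function registry; a sketch with made-up UDFs:

```scala
val tenantA = spark.newSession()
val tenantB = spark.newSession()

// Same name, per-tenant meaning.
tenantA.udf.register("normalize", (s: String) => s.toLowerCase)
tenantB.udf.register("normalize", (s: String) => s.trim)

tenantA.sql("SELECT normalize('  MiXeD  ')").show() // lowercased, not trimmed
tenantB.sql("SELECT normalize('  MiXeD  ')").show() // trimmed, case preserved
```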

Re: Apache Flink

2016-04-17 Thread Igor Berman
latency in Flink is not eliminated, but it might be smaller since Flink processes each event 1-by-1 while Spark does microbatching (so you can't achieve latency lower than your microbatch config) probably Spark will have better throughput due to this microbatching On 17 April 2016 at 14:47, Ovidiu

Re: streaming app performance when would increasing execution size or adding more cores

2016-03-07 Thread Igor Berman
maybe you are experiencing a problem with FileOutputCommiter vs DirectCommiter while working with s3? do you have hdfs so you can try it to verify? committing to s3 will copy all partitions 1-by-1 to your final destination bucket from _temporary, so this stage might become a bottleneck (so reducing

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-05 Thread Igor Berman
it's not safe to use the direct committer with append mode, you may lose your data.. On 4 March 2016 at 22:59, Jelez Raditchkov wrote: > Working on a streaming job with DirectParquetOutputCommitter to S3 > I need to use PartitionBy and hence SaveMode.Append > > Apparently when using SaveMode.Append

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Igor Berman
your field name is *enum1_values* but you have data { "foo1": "test123", *"enum1"*: "BLUE" } i.e. since you defined an enum and not union(null, enum) it tries to find a value for enum1_values and doesn't find one... On 3 March 2016 at 11:30, Chris Miller wrote: > I've been digging into this a littl

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Igor Berman
spark.driver.extraClassPath spark.executor.extraClassPath 2016-03-02 18:01 GMT+02:00 Matthias Niehoff : > Hi, > > we want to add jars to the Master and Worker class path mainly for logging > reason (we have a redis appender to send logs to redis -> logstash -> > elasticsearch). > > While it is wo

Re: .cache() changes contents of RDD

2016-02-27 Thread Igor Berman
are you using the avro format by any chance? there are some formats that need to be "deep"-copied before caching or aggregating try something like val input = sc.newAPIHadoopRDD(...) val rdd = input.map(deepCopyTransformation).map(...) rdd.cache() rdd.saveAsTextFile(...) where deepCopyTransformation is f
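[Editor's note] One possible deepCopyTransformation for Avro specific records (a sketch; MyRecord stands in for a generated Avro class):

```scala
import org.apache.avro.mapred.AvroKey
import org.apache.avro.specific.SpecificData
import org.apache.hadoop.io.NullWritable

def deepCopyTransformation(pair: (AvroKey[MyRecord], NullWritable)): MyRecord = {
  val datum = pair._1.datum()
  // Detach the record from the reader's reused buffer before caching.
  SpecificData.get().deepCopy(datum.getSchema, datum)
}
```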

Re: DirectFileOutputCommiter

2016-02-27 Thread Igor Berman
t won't be "silent" failure. On 26 February 2016 at 11:50, Reynold Xin wrote: > It could lose data in speculation mode, or if any job fails. > > On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman > wrote: > >> Takeshi, do you know the reason why they wanted to remov

Re: Bug in DiskBlockManager subDirs logic?

2016-02-26 Thread Igor Berman
I've experienced this kind of output when an executor was killed (e.g. by the OOM killer) or was lost for some reason i.e. check the machine to see whether the executor was restarted... On 26 February 2016 at 08:37, Takeshi Yamamuro wrote: > Hi, > > Could you make simple codes to reproduce the issue? > I'm not

Re: Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Igor Berman
Imho most production clusters are standalone. there was a presentation from spark summit with some stats inside (can't find it right now), standalone was in 1st place it was from Matei https://databricks.com/resources/slides On 26 February 2016 at 13:40, Petr Novak wrote: > Hi all, > I belie

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
Takeshi, do you know the reason why they wanted to remove this committer in SPARK-10063? the jira has no info inside as far as I understand the direct committer can't be used when either of two is true 1. speculation mode 2. append mode (i.e. not creating a new version of data but appending to existing

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
Alexander, the implementation you've attached supports both modes, configured by the property "mapred.output.direct." + fs.getClass().getSimpleName() as soon as you see a _temporary dir, probably the mode is off i.e. the default impl is working and you are experiencing some other problem. On 26 February 2016 at

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
the performance gain is in the commit stage, when data is moved from the _temporary directory to the destination directory since s3 is a key-value store, the move operation is really a copy operation On 26 February 2016 at 08:24, Takeshi Yamamuro wrote: > Hi, > > Great work! > What is the concrete performance ga

Re: reasonable number of executors

2016-02-23 Thread Igor Berman
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications there is a section that is connected to your question On 23 February 2016 at 16:49, Alex Dzhagriev wrote: > Hello all, > > Can someone please advise me on the pros and cons on how to allocate the >

Re: SPARK-9559

2016-02-18 Thread Igor Berman
what are you trying to solve? killing the worker jvm is like killing the yarn node manager...why would you do this? usually the worker jvm is an "agent" on each worker machine which opens executors for each application, so it doesn't work hard or have a big memory footprint yes it can fail, but it's rather a corner sit

How to train and predict in parallel via Spark MLlib?

2016-02-18 Thread Igor L.
tion! -- BR, Junior Scala/Python Developer Igor L. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html Sent from the Apache Spark User List m

Re: newbie unable to write to S3 403 forbidden error

2016-02-12 Thread Igor Berman
String dirPath = "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/*json” * not sure, but can you try to remove s3-us-west-1.amazonaws.com from the path? On 11 February 2016 at 23:15, Andy Davidson wrote: > I am using spark 1.6.0 in a cluster c

Re: Shuffle memory woes

2016-02-08 Thread Igor Berman
ey: >>>"The dataset is 100gb at most, the spills can up to 10T-100T", Are >>> your input files lzo format, and you use sc.text() ? If memory is not >>> enough, spark will spill 3-4x of input data to disk. >>> >>> >>>

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread Igor Berman
show has a truncate argument; pass false so it won't truncate your results On 7 February 2016 at 11:01, SLiZn Liu wrote: > Plus, I’m using *Spark 1.5.2*, with *spark-csv 1.3.0*. Also tried > HiveContext, but the result is exactly the same. > > On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu wrote:
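[Editor's note] The overload in question, as a one-line sketch (df is a placeholder):

```scala
df.show(20, false) // numRows = 20, truncate = false: prints full cell contents
```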

Re: Shuffle memory woes

2016-02-07 Thread Igor Berman
ce" side is not configured properly? I mean if map side is ok, and you just reducing by key or something it should be ok, so some detail is missing...skewed data? aggregate by key? On 6 February 2016 at 20:13, Corey Nolet wrote: > Igor, > > Thank you for the response but unfortunately,

Re: Shuffle memory woes

2016-02-06 Thread Igor Berman
Hi, usually you can solve this in 2 steps make the rdd have more partitions play with the shuffle memory fraction in spark 1.6 cache vs shuffle memory fractions are adjusted automatically On 5 February 2016 at 23:07, Corey Nolet wrote: > I just recently had a discovery that my jobs were taking sever
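[Editor's note] The two levers sketched for pre-1.6 Spark with static memory management (rdd is a placeholder; both fractions carve up the same executor heap):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.memoryFraction", "0.4") // default 0.2
  .set("spark.storage.memoryFraction", "0.4") // default 0.6

// More partitions means smaller per-task shuffle buffers:
val smallerTasks = rdd.repartition(rdd.getNumPartitions * 4)
```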

Re: multi-threaded Spark jobs

2016-01-25 Thread Igor Berman
IMHO, you are making a mistake. spark manages tasks and cores internally. when you open new threads inside an executor you are "over-provisioning" the executor (e.g. tasks on other cores will be preempted) On 26 January 2016 at 07:59, Elango Cheran wrote: > Hi everyone, > I've gone through the eff

Re: Converting CSV files to Avro

2016-01-17 Thread Igor Berman
https://github.com/databricks/spark-avro ? On 17 January 2016 at 13:46, Gideon wrote: > Hi everyone, > > I'm writing a Scala program which uses Spark CSV > to read CSV files from a > directory. After reading the CSVs as data frames I need to convert t

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
e the number of cores the job was using on one worker, > but it would not use any other worker (and it would not start if the number > of cores the job wanted was higher than the number available on one worker). > > > On 07.01.2016 22:51, Igor Berman wrote: > > read about *--tota
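[Editor's note] The flag being referenced is --total-executor-cores; a sketch of the equivalent application config for standalone mode:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cores.max", "6") // same effect as spark-submit --total-executor-cores 6
```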

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
.jar > > Cheers, > Michael > > > On 07.01.2016 22:41, Igor Berman wrote: > > share how you submit your job > what cluster(yarn, standalone) > > On 7 January 2016 at 23:24, Michael Pisula > wrote: > >> Hi there, >> >> I ran a simple Batch Appl

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
share how you submit your job what cluster(yarn, standalone) On 7 January 2016 at 23:24, Michael Pisula wrote: > Hi there, > > I ran a simple Batch Application on a Spark Cluster on EC2. Despite having > 3 > Worker Nodes, I could not get the application processed on more than one > node, regardl

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Igor Berman
another option would be to try rdd.toLocalIterator() not sure if it will help though I had the same problem and ended up moving all parts to local disk (with the Hadoop FileSystem api) and then processing them locally On 5 January 2016 at 22:08, Alexander Pivovarov wrote: > try coalesce(1, true). > > O
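[Editor's note] A sketch of that alternative (rdd is a placeholder RDD[String]): pull partitions to the driver one at a time and write a single local file, avoiding the coalesce(1) shuffle.

```scala
import java.io.PrintWriter

val out = new PrintWriter("/tmp/single-output.txt")
try rdd.toLocalIterator.foreach(line => out.println(line))
finally out.close()
```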

Re: partitioning json data in spark

2015-12-27 Thread Igor Berman
have you tried to specify the format of your output? parquet might be the default format. df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path"); On 27 December 2015 at 15:18, Նարեկ Գալստեան wrote: > Hey all! > I am willing to partition *json *data by a column name and store the > resul
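[Editor's note] A sketch of column-based partitioning combined with an explicit format (df and the column are made up); output lands under /tmp/path/country=*/:

```scala
import org.apache.spark.sql.SaveMode

df.write.format("json")
  .partitionBy("country")
  .mode(SaveMode.Overwrite)
  .save("/tmp/path")
```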

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Igor Berman
sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 supported) or use Dmitriy's solution. select() defines your projection when you've specified entire query On 25 December 2015 at 15:42, Василец Дмитрий wrote: > hello > you can try to use df.limit(5).show() > just trick :

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
imho, if you succeeded to fetch something from your mysql with same jar in classpath, then Manifest is ok and you indeed should look at your spark sql - jdbc configs On 22 December 2015 at 12:21, David Yerrington wrote: > Igor, I think it's available. After I extract the jar file,

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
David, can you verify that the mysql connector classes are indeed in your single jar? open it with the zip tool available on your platform another option that might be a problem - if there is some dependency in the MANIFEST (not sure though whether this is the case for the mysql connector) then it might be broken after prepar

Re: Spark with log4j

2015-12-21 Thread Igor Berman
I think the log4j.properties under the conf dir are those that are relevant for the worker JVMs, and not the one that you pack within your jar On 21 December 2015 at 14:07, Kalpesh Jadhav wrote: > Hi Ted, > > > > Thanks for your response, But it doesn’t solve my issue. > > Still prints logs on con

Re: fishing for help!

2015-12-21 Thread Igor Berman
look for differences: package versions, cpu/network/memory diff etc etc On 21 December 2015 at 14:53, Eran Witkon wrote: > Hi, > I know it is a wide question but can you think of reasons why a pyspark > job which runs from server 1 using user 1 will run faster than the same > job when runni

Re: Preventing an RDD from shuffling

2015-12-16 Thread Igor Berman
imho, you should implement your own rdd with mongo sharding awareness, then this rdd will have a mongo-aware partitioner, and then incoming data will be partitioned by this partitioner in the join not sure if it's a simple task...but you have to get the partitioner into your mongo rdd. On 16 December 2015 a
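[Editor's note] A very rough sketch of such a partitioner (the modulo below is a stand-in for Mongo's real shard-key routing, which it would have to reproduce exactly; keyedRdd is a placeholder pair RDD):

```scala
import org.apache.spark.Partitioner

class ShardAwarePartitioner(numShards: Int) extends Partitioner {
  override def numPartitions: Int = numShards
  override def getPartition(key: Any): Int = {
    val mod = key.hashCode % numShards
    if (mod < 0) mod + numShards else mod // keep the result non-negative
  }
}

// Pre-partitioning lets the join reuse the partitioner instead of shuffling.
val prePartitioned = keyedRdd.partitionBy(new ShardAwarePartitioner(16))
```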

Re: NPE in using AvroKeyValueInputFormat for newAPIHadoopFile

2015-12-16 Thread Igor Berman
check version compatibility, I think the avro lib should be 1.7.4 check that no other lib brings a transitive dependency on another avro version On 16 December 2015 at 09:44, Jinyuan Zhou wrote: > Hi, I tried to load avro files in hdfs but keep getting NPE. > I am using AvroKeyValueInputFormat inside n

Re: Need Help Diagnosing/operating/tuning

2015-11-23 Thread Igor Berman
you should check why the executor is killed. as soon as it's killed you can get all kinds of strange exceptions... either give your executors more memory (4G is rather small for spark) or try to decrease your input, or maybe split it into more partitions in the input format 23G in lzo might expand to x?

Re: newbie: unable to use all my cores and memory

2015-11-20 Thread Igor Berman
you've asked for total cores of 2, + 1 for the driver (since you are running in cluster mode, so it's running on one of the slaves) change total cores to 3*2 change submit mode to client - you'll have full utilization (btw it's not advisable to use all cores of a slave...since there are OS processes and

Re: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-20 Thread Igor Berman
try to assemble log4j.xml or log4j.properties in your jar...probably you'll get what you want, however pay attention that when you move to a multinode cluster - there will be differences On 20 November 2015 at 05:10, Afshartous, Nick wrote: > > < log4j.properties file only exists on the master a

Re: Incorrect results with reduceByKey

2015-11-17 Thread Igor Berman
you should clone your data after reading avro On 18 November 2015 at 06:28, tovbinm wrote: > Howdy, > > We've noticed a strange behavior with Avro serialized data and reduceByKey > RDD functionality. Please see below: > > // We're reading a bunch of Avro serialized data > val data: RDD[T] = spa

Re: status of slaves in standalone cluster rest/rpc call

2015-11-09 Thread Igor Berman
further reading of the MasterPage code gave me what I want: http://<master>:8080/json returns a json view of all the info presented on the main page On 9 November 2015 at 22:41, Igor Berman wrote: > Hi, > How do I get status of workers(slaves) from driver? > why I need it - I want to autoscale new workers
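[Editor's note] A sketch of the polling idea (master host is a placeholder, and the payload's exact shape should be verified against your Spark version; a real implementation would use a JSON parser):

```scala
import scala.io.Source

val json = Source.fromURL("http://spark-master:8080/json").mkString
// Crude count of workers reporting state ALIVE:
val alive = "\"ALIVE\"".r.findAllIn(json).length
println(s"$alive workers alive")
```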

status of slaves in standalone cluster rest/rpc call

2015-11-09 Thread Igor Berman
Hi, How do I get the status of workers (slaves) from the driver? why I need it - I want to autoscale new workers and want to poll the status of the cluster (e.g. number of alive slaves connected) so that I'll submit the job only after the expected number of slaves has joined the cluster I've found the MasterPage class that produces u

Re: Whether Spark is appropriate for our use case.

2015-11-07 Thread Igor Berman
1. if you have a join by some specific field (e.g. user id or account-id or whatever) you may try to partition the parquet file by this field and then the join will be more efficient. 2. you need to see in spark metrics what the performance of the particular join is, how many partitions there are, what the shuffle size

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
convert a parquet file that is saved in hdfs to an RDD after > reading the file from hdfs? > > On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman > wrote: > >> Hi, >> we are using avro with compression(snappy). As soon as you have enough >> partitions, the saving won&

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
Hi, we are using avro with compression (snappy). As soon as you have enough partitions, the saving won't be a problem imho. in general hdfs is pretty fast, s3 is less so the issue with storing data is that you will lose your partitioner (even though the rdd has it) at loading moment. There is a PR that tr

Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Igor Berman
many use it. how do you add the aws sdk to the classpath? check in the environment ui what is on the cp. you should make sure that the version on your cp is compatible with the one that spark was compiled with I think 1.7.4 is compatible (at least we use it) make sure that you don't get other versions from other transitiv

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Igor Berman
check spark.rdd.compress On 19 October 2015 at 21:13, ahaider3 wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like normal but stores > the data uncompressed in memory. For example, suppose /data/ is

Re: spark straggle task

2015-10-20 Thread Igor Berman
We know that the JobScheduler has the function to assign the straggler task to another node - only if you enable and configure spark.speculation On 20 October 2015 at 15:20, Triones,Deng(vip.com) wrote: > Hi All > > We run an application with version 1.4.1 standalone mode. We saw two tasks > in
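[Editor's note] The settings behind that answer, as a sketch (defaults noted in comments):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")           // disabled by default
  .set("spark.speculation.multiplier", "1.5") // straggler = 1.5x slower than the median task
  .set("spark.speculation.quantile", "0.75")  // only after 75% of the stage's tasks finish
```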

Re: In-memory computing and cache() in Spark

2015-10-18 Thread Igor Berman
Do your iterations really submit a job? I don't see any action there On Oct 17, 2015 00:03, "Jia Zhan" wrote: > Hi all, > > I am running Spark locally in one node and trying to sweep the memory size > for performance tuning. The machine has 8 CPUs and 16G main memory, the > dataset in my local d

Re: our spark gotchas report while creating batch pipeline

2015-10-18 Thread Igor Berman
thanks Ted :) On 18 October 2015 at 19:07, Ted Yu wrote: > Interesting reading material. > > bq. transformations that loose partitioner > > lose partitioner > > bq. Spark looses the partitioner > > loses the partitioner > > bq. Tunning number of partitions > > Should be tuning. > > bq. or incre

Re: Question about GraphX connected-components

2015-10-10 Thread Igor Berman
let's start from some basics: maybe you need to split your data into more partitions? spilling depends on your configuration when you create the graph (look for the storage level param) and your global configuration. in addition, your assumption of 64GB/100M is probably wrong, since spark divides memory int

Re: Issue with the class generated from avro schema

2015-10-09 Thread Igor Berman
saving those results with > exactly the same schema. > > Thank you for the answer, at least I know that there is no way to make it > works. > > > 2015-10-09 20:19 GMT+02:00 Igor Berman : > >> u should create copy of your avro data before working with it, i.e. just >

Re: Issue with the class generated from avro schema

2015-10-09 Thread Igor Berman
you should create a copy of your avro data before working with it, i.e. just after loadFromHDFS map it into a new instance that is a deep copy of the object it's connected to the way the spark/avro reader reads avro files (it reuses some buffer or something) On 9 October 2015 at 19:05, alberskib wrote: > Hi

Re: RDD of ImmutableList

2015-10-05 Thread Igor Berman
kryo doesn't support guava's collections by default I remember encountering a project on github that fixes this (not sure though). I've ended up avoiding guava collections as far as spark rdds are concerned. On 5 October 2015 at 21:04, Jakub Dubovsky wrote: > Hi all, > > I would like to have an

Q: optimal way to calculate aggregates on a stream

2015-10-03 Thread igor
Hello I'm new to spark, so sorry if my question looks dumb. I have a problem which I hope to solve using spark. Here is a short description: 1. I have a simple flow of 600k tuples per minute. Each tuple is a structured metric name and its value: (a.b.c.d, value) (a.b.x, value) (g

Re: Getting spark application driver ID programmatically

2015-10-02 Thread Igor Berman
if driver id is application id then yes you can do it with String appId = ctx.sc().applicationId(); //when ctx is java context On 1 October 2015 at 20:44, Snehal Nagmote wrote: > Hi , > > I have use case where we need to automate start/stop of spark streaming > application. > > To stop spark jo

Re: Troubleshooting "Task not serializable" in Spark/Scala environments

2015-09-21 Thread Igor Berman
Try to broadcast the header On Sep 22, 2015 08:07, "Balaji Vijayan" wrote: > Howdy, > > I'm a relative novice at Spark/Scala and I'm puzzled by some behavior that > I'm seeing in 2 of my local Spark/Scala environments (Scala for Jupyter and > Scala IDE) but not the 3rd (Spark Shell). The following co
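[Editor's note] The suggestion as a sketch (header and lines are placeholders): broadcasting the header keeps the closure from capturing a non-serializable enclosing scope.

```scala
val headerB = sc.broadcast(header)
val body = lines.filter(line => line != headerB.value)
```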

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
none of the jars listed on the classpath contain this class. The > only jar that does is the fat jar that I'm submitting with spark-submit, > which as mentioned isn't showing up on the classpath anywhere. > > -- Nick > > On Tue, Sep 8, 2015 at 8:26 AM Igor Berman wr

Re: Java vs. Scala for Spark

2015-09-08 Thread Igor Berman
we are using java7..it's much more verbose than the java8 or scala examples in addition there are sometimes libraries that have no java api, so you need to write that yourself (e.g. graphx) on the other hand, scala is not a trivial language like java, so it depends on your team On 8 September 2015 at 17:44

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
t/Document$1.class > com/i2028/Document/Document.class > > What else can I do? Is there any way to get more information about the > classes available to the particular classloader kryo is using? > > On Tue, Sep 8, 2015 at 6:34 AM Igor Berman wrote: > >> java

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
ace: (Sorry for the duplicate, Igor -- I forgot to > include the list.) > > > 15/09/08 05:56:43 WARN scheduler.TaskSetManager: Lost task 183.0 in stage > 41.0 (TID 193386, ds-compute2.lumiata.com): java.io.IOException: > com.esotericsoftware.kryo.KryoException: Error constructing

Re: Spark on Yarn: Kryo throws ClassNotFoundException for class included in fat jar

2015-09-08 Thread Igor Berman
as a starting point, attach your stacktrace... ps: look for duplicates in your classpath, maybe you include another jar with the same class On 8 September 2015 at 06:38, Nicholas R. Peterson wrote: > I'm trying to run a Spark 1.4.1 job on my CDH5.4 cluster, through Yarn. > Serialization is set to us

Re: Problem to persist Hibernate entity from Spark job

2015-09-05 Thread Igor Berman
how do you create your session? do you reuse it across threads? how do you create/close the session manager? look for the problem in session creation, probably something deadlocked; as far as I remember a hibernate session should be created per thread On 6 September 2015 at 07:11, Zoran Jeremic wrote: > Hi,

Re: Tuning - tasks per core

2015-09-03 Thread Igor Berman
suppose you have 1 job that does some transformation, suppose you have X cores in your cluster and you are willing to give all of them to your job suppose you have no shuffles (to keep it simple) set the number of partitions of your input data to be 3X or 2X, thus you'll get 2-3 tasks per core On 3
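[Editor's note] The rule of thumb in code (the core count is hypothetical):

```scala
val coresGivenToJob = 16
// ~2-3 waves of tasks per core:
val rdd = sc.textFile("hdfs:///input", 3 * coresGivenToJob)
```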

Re: Managing httpcomponent dependency in Spark/Solr

2015-09-03 Thread Igor Berman
not sure if it will help, but have you checked https://maven.apache.org/plugins/maven-shade-plugin/examples/class-relocation.html On 31 August 2015 at 19:33, Oliver Schrenk wrote: > Hi, > > We are running a distibuted indexing service for Solr (4.7) on a Spark > (1.2) cluster. Now we wanted to u

Re: bulk upload to Elasticsearch and shuffle behavior

2015-09-01 Thread Igor Berman
Hi Eric, I see that you solved your problem. Imho, when you do repartition you split your work into 2 stages, so your hbase lookup happens in the first stage, and the upload to ES happens after the shuffle in the next stage, so without repartition it's hard to tell which is ES upload time and which is Hbase lookup time

Re: Parallel execution of RDDs

2015-08-31 Thread Igor Berman
what is the size of the pool you're submitting spark jobs from (the futures you've mentioned)? is it 8? I think you have a fixed thread pool of 8 so there can't be more than 8 parallel jobs running...so try to increase it what is the number of partitions of each of your rdds? how many cores does your worker machine have (t
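[Editor's note] A sketch of the pool-size point (rdds is a placeholder collection of RDDs): concurrent jobs are capped by the pool the futures run on.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// A fixed pool of 8 allows at most 8 parallel jobs; widen it for more.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))

val counts = rdds.map(rdd => Future(rdd.count()))
```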

Re: spark-submit issue

2015-08-31 Thread Igor Berman
2015 at 10:42, Pranay Tonpay wrote: > Igor,, this seems to be the cause, however i am not sure at the moment how > to resolve it ... what i tried just now was that after " > > SparkSubmitDriverBootstrapper" process reaches the hung stage... i went > inside /proc//fd ..

Re: spark-submit issue

2015-08-31 Thread Igor Berman
you might need to drain the stdout/stderr of the subprocess...otherwise the subprocess can deadlock http://stackoverflow.com/questions/3054531/correct-usage-of-processbuilder On 27 August 2015 at 16:11, pranay wrote: > I have a java program that does this - (using Spark 1.3.1 ) Create a > command > strin
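[Editor's note] A sketch of the fix (command line is a placeholder): merge stderr into stdout and keep consuming it so the child never blocks on a full pipe buffer.

```scala
import scala.io.Source

val pb = new ProcessBuilder("spark-submit", "--class", "Main", "app.jar")
pb.redirectErrorStream(true) // stderr -> stdout
val proc = pb.start()
Source.fromInputStream(proc.getInputStream).getLines().foreach(println)
val exitCode = proc.waitFor()
```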

Re: Help Explain Tasks in WebUI:4040

2015-08-31 Thread Igor Berman
are there other processes on sk3? or more generally, are you sharing resources with somebody else, virtualization etc does your transformation consume other services? (e.g. reading from s3, so it can happen that s3 latency plays a role...) can it be that the task for some key will take longer than sa

Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Igor Berman
you don't need to register, search in youtube for this video... On 19 August 2015 at 18:34, Gourav Sengupta wrote: > Excellent resource: http://www.oreilly.com/pub/e/3330 > > And more amazing is the fact that the presenter actually responds to your > questions. > > Regards, > Gourav Sengupta > >

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
sted, there are different defaults for the two means of job submission > that come into play in a non-transparent fashion (i.e. not visible in > SparkConf). > > On Wed, Aug 19, 2015 at 1:36 PM, Igor Berman > wrote: > >> any differences in number of cores, memory settings f

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
any differences in number of cores, memory settings for executors? On 19 August 2015 at 09:49, Rick Moritz wrote: > Dear list, > > I am observing a very strange difference in behaviour between a Spark > 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin > interpreter (comp

Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Igor Berman
by default standalone creates 1 executor on every worker machine per application the number of overall cores is configured with --total-executor-cores so in general if you specify --total-executor-cores=1 then there will be only 1 core on some executor and you'll get what you want on the other han

Re: Spark Job Hangs on our production cluster

2015-08-11 Thread Igor Berman
how do you want to process 1T of data when you set your executor memory to be 2g? look at the spark ui, metrics of tasks...if any look at spark logs on the executor machine under the work dir (unless you configured log4j) I think your executors are thrashing or spilling to disk. check memory metrics/swapping O

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Igor Berman
check on which ip/port master listens netstat -a -t --numeric-ports On 7 August 2015 at 20:48, Jeff Jones wrote: > Thanks. Added this to both the client and the master but still not getting > any more information. I confirmed the flag with ps. > > > > jjones53222 2.7 0.1 19399412 549656 p

Re: Enum values in custom objects mess up RDD operations

2015-08-06 Thread Igor Berman
an enum's hashcode is jvm-instance specific (i.e. different jvms will give you different values), so use the enum's ordinal (or the hashCode of the ordinal) as part of your hashCode computation On 6 August 2015 at 11:41, Warfish wrote: > Hi everyone, > > I was working with Spark for a
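[Editor's note] A sketch of that workaround (Color stands in for any Java enum used inside a key class):

```scala
// Enum.hashCode is identity-based and varies across JVMs, so derive the key's
// hashCode from stable data such as ordinal() or name().
final case class TaggedKey(id: Long, color: Color) {
  override def hashCode: Int = 31 * id.hashCode + color.ordinal()
}
```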

  1   2   >