Q: optimal way to calculate aggregates on a stream

2015-10-03 Thread igor
Hello I'm new to spark, so sorry if my question looks dumb. I have a problem which I hope to solve using spark. Here is short description: 1. I have a simple flow of the 600k tuples per minute. Each tuple is structured metric name and its value: (a.b.c.d, value) (a.b.x, value) (g

repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
mechanism to work in a consistent way and not to drop data silently(neither to produce duplicates) Any thoughts? thanks in advance Igor

Re: repartition(n) should be deprecated/alerted

2022-06-22 Thread Igor Berman
> where determinism is assumed. I don't think that supports such a drastic > statement. > > On Wed, Jun 22, 2022 at 12:39 PM Igor Berman > wrote: > >> Hi All >> tldr; IMHO repartition(n) should be deprecated or red-flagged, so that >> everybody will under

Help with Shuffle Read performance

2022-09-29 Thread Igor Calabria
Hi Everyone, I'm running spark 3.2 on kubernetes and have a job with a decently sized shuffle of almost 4TB. The relevant cluster config is as follows: - 30 Executors. 16 physical cores, configured with 32 Cores for spark - 128 GB RAM - shuffle.partitions is 18k which gives me tasks of around 15

Re: Help with Shuffle Read performance

2022-09-29 Thread Igor Calabria
IOPS limits, since most of our jobs use instances with local disks instead. On Thu, Sep 29, 2022 at 7:44 PM Vladimir Prus wrote: > Igor, > > what exact instance types do you use? Unless you use local instance > storage and have actually configured your Kubernetes and Spark to use >

Re: Help with Shuffle Read performance

2022-09-30 Thread Igor Calabria
needs, given >the fact that you only have 128GB RAM. > > Hope this helps... > > On 9/29/22 2:12 PM, Igor Calabria wrote: > > Hi Everyone, > > I'm running spark 3.2 on kubernetes and have a job with a decently sized > shuffle of almost 4TB. The relevant cluster

Re: Efficiently updating running sums only on new data

2022-10-13 Thread Igor Calabria
You can tag the last entry with each key using the same window you're using for your rolling sum. Something like this: "LEAD(1) OVER your_window IS NULL as last_record". Then, you just UNION ALL the last entry of each key(which you tagged) with the new data and run the same query over the windowed
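
A minimal sketch of that tagging approach, assuming a registered temp view named events with illustrative columns key, ts and value:

    // tag the last row per key; only these rows need to be carried into the next run
    val tagged = spark.sql("""
      SELECT *,
             LEAD(ts) OVER (PARTITION BY key ORDER BY ts) IS NULL AS last_record
      FROM events""")
    val lastPerKey = tagged.where("last_record")   // UNION ALL this with the new data and re-run the same windowed query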

Re: Connection pool shut down in Spark Iceberg Streaming Connector

2023-10-05 Thread Igor Calabria
You might be affected by this issue: https://github.com/apache/iceberg/issues/8601 It was already patched but it isn't released yet. On Thu, Oct 5, 2023 at 7:47 PM Prashant Sharma wrote: > Hi Sanket, more details might help here. > > How does your spark configuration look like? > > What exactly

[Spark SQL] xxhash64 default seed of 42 confusion

2024-04-16 Thread Igor Calabria
Hi all, I've noticed that spark's xxhash64 output doesn't match other tools' due to using seed=42 as a default. I've looked at a few libraries and they use 0 as a default seed: - python https://github.com/ifduyue/python-xxhash - java https://github.com/OpenHFT/Zero-Allocation-Hashing/ - java (slic
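
For reference, the built-in function in question; the seed of 42 is applied internally, and df and the column name here are illustrative:

    import org.apache.spark.sql.functions.{col, xxhash64}
    // Spark hashes with seed 42 under the hood, so this won't match libraries that default to seed 0
    val hashed = df.select(xxhash64(col("payload")).as("h"))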

Re: Batch aggregation by sliding window + join

2015-05-29 Thread Igor Berman
will be partitioned and collocated by the same >> "partitioner"(which is absent for hadooprdd) ... somehow, so that only >> small >> rdds will be sent over network. >> >> Another idea I had - somehow split baseBlock into 2 parts with filter by >> keys o

Re: Batch aggregation by sliding window + join

2015-05-30 Thread Igor Berman
decide to recompute last 3 days > everyday. > On 29 May 2015 23:52, "Igor Berman" wrote: > >> Hi Ayan, >> thanks for the response >> I'm using 1.3.1. I'll check window queries(I dont use spark-sql...only >> core, might be I should?) >> W

Re: union and reduceByKey wrong shuffle?

2015-05-31 Thread Igor Berman
Hi We are using spark 1.3.1 Avro-chill (tomorrow will check if its important) we register avro classes from java Avro 1.7.6 On May 31, 2015 22:37, "Josh Rosen" wrote: > Which Spark version are you using? I'd like to understand whether this > change could be caused by recent Kryo serializer re-us

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Igor Berman
oblem/misconfiguration? On 31 May 2015 at 22:48, Igor Berman wrote: > Hi > We are using spark 1.3.1 > Avro-chill (tomorrow will check if its important) we register avro classes > from java > Avro 1.7.6 > On May 31, 2015 22:37, "Josh Rosen" wrote: > >> Which Spark vers

Re: union and reduceByKey wrong shuffle?

2015-06-01 Thread Igor Berman
rk is to produce a small standalone reproduction? Can you > create an Avro file with some mock data, maybe 10 or so records, then > reproduce this locally? > > On Mon, Jun 1, 2015 at 12:31 PM, Igor Berman > wrote: > >> switching to use simple pojos instead of using avro for s

Re: Managing spark processes via supervisord

2015-06-03 Thread Igor Berman
assuming you are talking about a standalone cluster imho, with workers you won't get any problems and it's straightforward since they are usually foreground processes with the master it's a bit more complicated, ./sbin/start-master.sh goes into the background which is not good for supervisor, but anyway I think i

Re: Scaling spark jobs returning large amount of data

2015-06-04 Thread Igor Berman
Hi, as far as I understand you shouldn't send data to driver. Suppose you have file in hdfs/s3 or cassandra partitioning, you should create your job such that every executor/worker of spark will handle part of your input, transform, filter it and at the end write back to cassandra as output(once ag

Re: union and reduceByKey wrong shuffle?

2015-06-05 Thread Igor Berman
n Mon, Jun 1, 2015 at 11:41 PM, Igor Berman > wrote: > >> Hi, >> small mock data doesn't reproduce the problem. IMHO problem is reproduced >> when we make shuffle big enough to split data into disk. >> We will work on it to understand and reproduce the problem(not

Re: SparkContext & Threading

2015-06-05 Thread Igor Berman
+1 to the question about serialization. SparkContext is still in the driver process (even if it has several threads from which you submit jobs) as for the problem, check your classpath, scala version, spark version etc. such errors usually happen when there is some conflict in the classpath. Maybe you compiled

Re: SparkContext & Threading

2015-06-05 Thread Igor Berman
Lee, what cluster do you use? standalone, yarn-cluster, yarn-client, mesos? in yarn-cluster the driver program is executed inside one of nodes in cluster, so might be that driver code needs to be serialized to be sent to some node On 5 June 2015 at 22:55, Lee McFadden wrote: > > On Fri, Jun 5, 2

Re: How to set KryoRegistrator class in spark-shell

2015-06-11 Thread Igor Berman
Another option would be to close sc and open new context with your custom configuration On Jun 11, 2015 01:17, "bhomass" wrote: > you need to register using spark-default.xml as explained here > > > https://books.google.com/books?id=WE_GBwAAQBAJ&pg=PA239&lpg=PA239&dq=spark+shell+register+kryo+ser
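
A rough sketch of that in a 1.x spark-shell (the registrator class name is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}
    sc.stop()   // close the shell's context first
    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.MyKryoRegistrator")   // hypothetical class
    val sc2 = new SparkContext(conf)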

Re: Spark standalone cluster - resource management

2015-06-23 Thread Igor Berman
probably there are already running jobs there in addition, memory is also a resource, so if you are running 1 application that took all your memory and then you are trying to run another application that asks for the memory the cluster doesn't have then the second app wont be running so why are u

Re: Dependency Injection with Spark Java

2015-06-26 Thread Igor Berman
asked myself same question today...actually depends on what you are trying to do if you want injection into workers code I think it will be a bit hard... if only in code that driver executes i.e. in main, it's straight forward imho, just create your classes from injector(e.g. spring's application c

Re: Create RDD from output of unix command

2015-07-14 Thread Igor Berman
haven't you thought about spark streaming? there is thread that could help https://www.mail-archive.com/user%40spark.apache.org/msg30105.html On 14 July 2015 at 18:20, Hafsa Asif wrote: > Your question is very interesting. What I suggest is, that copy your output > in some text file. Read text f

Re: Add Jars to Master/Worker classpath

2016-03-02 Thread Igor Berman
spark.driver.extraClassPath spark.executor.extraClassPath 2016-03-02 18:01 GMT+02:00 Matthias Niehoff : > Hi, > > we want to add jars to the Master and Worker class path mainly for logging > reason (we have a redis appender to send logs to redis -> logstash -> > elasticsearch). > > While it is wo
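
In spark-defaults.conf those two settings would look roughly like this (the jar path is illustrative):

    spark.driver.extraClassPath   /opt/spark-extra/redis-log4j-appender.jar
    spark.executor.extraClassPath /opt/spark-extra/redis-log4j-appender.jar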

Re: Avro SerDe Issue w/ Manual Partitions?

2016-03-03 Thread Igor Berman
your field name is *enum1_values* but you have data { "foo1": "test123", *"enum1"*: "BLUE" } i.e. since you defined enum and not union(null, enum) it tries to find value for enum1_values and doesn't find one... On 3 March 2016 at 11:30, Chris Miller wrote: > I've been digging into this a littl

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-03-05 Thread Igor Berman
it's not safe to use the direct committer with append mode, you may lose your data.. On 4 March 2016 at 22:59, Jelez Raditchkov wrote: > Working on a streaming job with DirectParquetOutputCommitter to S3 > I need to use PartitionBy and hence SaveMode.Append > > Apparently when using SaveMode.Append

Re: streaming app performance when would increasing execution size or adding more cores

2016-03-07 Thread Igor Berman
may be you are experiencing problem with FileOutputCommiter vs DirectCommiter while working with s3? do you have hdfs so you can try it to verify? committing in s3 will copy 1-by-1 all partitions to your final destination bucket from _temporary, so this stage might become a bottleneck(so reducing

Re: Apache Flink

2016-04-17 Thread Igor Berman
latency in Flink is not eliminated, but it might be smaller since Flink processes each event 1-by-1 while Spark does microbatching (so you can't achieve latency lower than your microbatch config) probably Spark will have better throughput due to this microbatching On 17 April 2016 at 14:47, Ovidiu

different SqlContext with same udf name with different meaning

2016-05-08 Thread Igor Berman
Hi, suppose I have multitenant environment and I want to give my users additional functions but for each user/tenant the meaning of same function is dependent on user's specific configuration is it possible to register same function several times under different SqlContexts? are several SqlContext

Re: Spark streaming readind avro from kafka

2016-06-01 Thread Igor Berman
An Avro file contains metadata with the schema (writer schema); in Kafka there is no such thing, you should put a message that contains some reference to a known schema (putting the whole schema would have a big overhead) some people use a schema registry solution On 1 June 2016 at 21:02, justneeraj wrote: > +1 > > I

streaming new data into bigger parquet file

2016-07-06 Thread Igor Berman
or it can append the new data at the end? 2. while the appending process happens - how can I ensure that readers of the big parquet files are not blocked and won't get any errors? (i.e. are the files "available" while new data is appended to them?) I will highly appreciate any pointers thanks in advance, Igor

Re: spark worker continuously trying to connect to master and failed in standalone mode

2016-07-20 Thread Igor Berman
in addition check what ip the master is binding to(with nestat) On 20 July 2016 at 06:12, Andrew Ehrlich wrote: > Troubleshooting steps: > > $ telnet localhost 7077 (on master, to confirm port is open) > $ telnet 7077 (on slave, to confirm port is blocked) > > If the port is available on the ma

Re: NPE in using AvroKeyValueInputFormat for newAPIHadoopFile

2015-12-16 Thread Igor Berman
check version compatibility; I think the avro lib should be 1.7.4. check that no other lib brings a transitive dependency on another avro version On 16 December 2015 at 09:44, Jinyuan Zhou wrote: > Hi, I tried to load avro files in hdfs but keep getting NPE. > I am using AvroKeyValueInputFormat inside n

Re: Preventing an RDD from shuffling

2015-12-16 Thread Igor Berman
imho, you should implement your own rdd with mongo sharding awareness, then this rdd will have this mongo-aware partitioner, and then incoming data will be partitioned by this partitioner in the join not sure if it's a simple task...but you have to get the partitioner in your mongo rdd. On 16 December 2015 a

Re: fishing for help!

2015-12-21 Thread Igor Berman
look for differences: packages versions, cpu/network/memory diff etc etc On 21 December 2015 at 14:53, Eran Witkon wrote: > Hi, > I know it is a wide question but can you think of reasons why a pyspark > job which runs on from server 1 using user 1 will run faster then the same > job when runni

Re: Spark with log4j

2015-12-21 Thread Igor Berman
I think the log4j.properties under the conf dir is the one relevant for the workers' jvms, and not the one that you pack within your jar On 21 December 2015 at 14:07, Kalpesh Jadhav wrote: > Hi Ted, > > > > Thanks for your response, But it doesn’t solve my issue. > > Still print logs on con

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
David, can you verify that the mysql connector classes are indeed in your single jar? open it with a zip tool available on your platform. another thing that might be a problem - if there is some dependency in MANIFEST(not sure though this is the case of mysql connector) then it might be broken after prepar

Re: Fat jar can't find jdbc

2015-12-22 Thread Igor Berman
imho, if you succeeded to fetch something from your mysql with same jar in classpath, then Manifest is ok and you indeed should look at your spark sql - jdbc configs On 22 December 2015 at 12:21, David Yerrington wrote: > Igor, I think it's available. After I extract the jar file,

Re: Stuck with DataFrame df.select("select * from table");

2015-12-25 Thread Igor Berman
sqlContext.sql("select * from table limit 5").show() (not sure if limit 5 supported) or use Dmitriy's solution. select() defines your projection when you've specified entire query On 25 December 2015 at 15:42, Василец Дмитрий wrote: > hello > you can try to use df.limit(5).show() > just trick :

Re: partitioning json data in spark

2015-12-27 Thread Igor Berman
have you tried to specify format of your output, might be parquet is default format? df.write().format("json").mode(SaveMode.Overwrite).save("/tmp/path"); On 27 December 2015 at 15:18, Նարեկ Գալստեան wrote: > Hey all! > I am willing to partition *json *data by a column name and store the > resul

Re: coalesce(1).saveAsTextfile() takes forever?

2016-01-05 Thread Igor Berman
another option would be to try rdd.toLocalIterator(), not sure if it will help though. I had the same problem and ended up moving all parts to local disk (with the Hadoop FileSystem api) and then processing them locally On 5 January 2016 at 22:08, Alexander Pivovarov wrote: > try coalesce(1, true). > > O
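
A small sketch of the toLocalIterator alternative, assuming an RDD of strings (the output path is illustrative):

    import java.io.PrintWriter
    // pulls one partition at a time to the driver instead of funnelling everything through a single task
    val writer = new PrintWriter("/tmp/single-output.txt")
    try rdd.toLocalIterator.foreach(line => writer.println(line)) finally writer.close()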

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
share how you submit your job what cluster(yarn, standalone) On 7 January 2016 at 23:24, Michael Pisula wrote: > Hi there, > > I ran a simple Batch Application on a Spark Cluster on EC2. Despite having > 3 > Worker Nodes, I could not get the application processed on more than one > node, regardl

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
.jar > > Cheers, > Michael > > > On 07.01.2016 22:41, Igor Berman wrote: > > share how you submit your job > what cluster(yarn, standalone) > > On 7 January 2016 at 23:24, Michael Pisula > wrote: > >> Hi there, >> >> I ran a simple Batch Appl

Re: Spark job uses only one Worker

2016-01-07 Thread Igor Berman
e the number of cores the job was using on one worker, > but it would not use any other worker (and it would not start if the number > of cores the job wanted was higher than the number available on one worker). > > > On 07.01.2016 22:51, Igor Berman wrote: > > read about *--tota

Re: Converting CSV files to Avro

2016-01-17 Thread Igor Berman
https://github.com/databricks/spark-avro ? On 17 January 2016 at 13:46, Gideon wrote: > Hi everyone, > > I'm writing a Scala program which uses Spark CSV > to read CSV files from a > directory. After reading the CSVs as data frames I need to convert t
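
A rough 1.x-era sketch combining spark-csv for reading with the spark-avro package linked above for writing (paths are illustrative):

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("/data/incoming/*.csv")
    df.write.format("com.databricks.spark.avro").save("/data/out/avro")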

Re: multi-threaded Spark jobs

2016-01-25 Thread Igor Berman
IMHO, you are making a mistake. spark manages tasks and cores internally. when you open new threads inside an executor you are "over-provisioning" the executor (e.g. tasks on other cores will be preempted) On 26 January 2016 at 07:59, Elango Cheran wrote: > Hi everyone, > I've gone through the eff

Re: Shuffle memory woes

2016-02-06 Thread Igor Berman
Hi, usually you can solve this in 2 steps: make the rdd have more partitions, and play with the shuffle memory fraction. in spark 1.6 cache vs shuffle memory fractions are adjusted automatically On 5 February 2016 at 23:07, Corey Nolet wrote: > I just recently had a discovery that my jobs were taking sever
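
A sketch of those two knobs (values are illustrative; spark.shuffle.memoryFraction applies to pre-1.6 static memory management):

    import org.apache.spark.SparkConf
    val conf = new SparkConf().set("spark.shuffle.memoryFraction", "0.4")
    // and/or spread the shuffle over more, smaller partitions
    val spread = rdd.repartition(rdd.partitions.length * 4)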

Re: Shuffle memory woes

2016-02-07 Thread Igor Berman
ce" side is not configured properly? I mean if map side is ok, and you just reducing by key or something it should be ok, so some detail is missing...skewed data? aggregate by key? On 6 February 2016 at 20:13, Corey Nolet wrote: > Igor, > > Thank you for the response but unfortunately,

Re: Imported CSV file content isn't identical to the original file

2016-02-07 Thread Igor Berman
show has a truncate argument; pass false so it won't truncate your results On 7 February 2016 at 11:01, SLiZn Liu wrote: > Plus, I’m using *Spark 1.5.2*, with *spark-csv 1.3.0*. Also tried > HiveContext, but the result is exactly the same. > > > On Sun, Feb 7, 2016 at 3:44 PM SLiZn Liu wrote:
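
i.e., with the Spark 1.5-era DataFrame API (df is whatever you loaded):

    df.show(20, false)   // second argument disables truncation of long cell values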

Re: Shuffle memory woes

2016-02-08 Thread Igor Berman
ey: >>>"The dataset is 100gb at most, the spills can up to 10T-100T", Are >>> your input files lzo format, and you use sc.text() ? If memory is not >>> enough, spark will spill 3-4x of input data to disk. >>> >>> >>>

Re: newbie unable to write to S3 403 forbidden error

2016-02-12 Thread Igor Berman
String dirPath = "s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/*json” * not sure, but can you try to remove s3-us-west-1.amazonaws.com from path ? On 11 February 2016 at 23:15, Andy Davidson wrote: > I am using spark 1.6.0 in a cluster c

How to train and predict in parallel via Spark MLlib?

2016-02-18 Thread Igor L.
tion! -- BR, Junior Scala/Python Developer Igor L. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html Sent from the Apache Spark User List m

Re: SPARK-9559

2016-02-18 Thread Igor Berman
what are you trying to solve? killing worker jvm is like killing yarn node manager...why would you do this? usually worker jvm is "agent" on each worker machine which opens executors per each application, so it doesn't works hard or has big memory footprint yes it can fail, but it rather corner sit

Re: reasonable number of executors

2016-02-23 Thread Igor Berman
http://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications there is a section that is connected to your question On 23 February 2016 at 16:49, Alex Dzhagriev wrote: > Hello all, > > Can someone please advise me on the pros and cons on how to allocate the >

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
the performance gain is for the commit stage, when data is moved from the _temporary directory to the destination directory; since s3 is really a key-value store, the move operation is like a copy operation On 26 February 2016 at 08:24, Takeshi Yamamuro wrote: > Hi, > > Great work! > What is the concrete performance ga

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
Alexander, the implementation you've attached supports both modes configured by property " mapred.output.direct." + fs.getClass().getSimpleName() as soon as you see a _temporary dir, probably the mode is off, i.e. the default impl is working and you are experiencing some other problem. On 26 February 2016 at

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
Takeshi, do you know the reason why they wanted to remove this committer in SPARK-10063? the jira has no info inside. as far as I understand, the direct committer can't be used when either of the two is true: 1. speculation mode 2. append mode (i.e. not creating a new version of data but appending to existing

Re: Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Igor Berman
Imho most production clusters are standalone; there was some presentation from spark summit with some stats inside (can't find it right now) where standalone was in 1st place. it was from Matei https://databricks.com/resources/slides On 26 February 2016 at 13:40, Petr Novak wrote: > Hi all, > I belie

Re: Bug in DiskBlockManager subDirs logic?

2016-02-26 Thread Igor Berman
I've experienced such kind of outputs when executor was killed(e.g. by OOM killer) or was lost for some reason i.e. try to look at machine if executor wasn't restarted... On 26 February 2016 at 08:37, Takeshi Yamamuro wrote: > Hi, > > Could you make simple codes to reproduce the issue? > I'm not

Re: DirectFileOutputCommiter

2016-02-27 Thread Igor Berman
t won't be "silent" failure. On 26 February 2016 at 11:50, Reynold Xin wrote: > It could lose data in speculation mode, or if any job fails. > > On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman > wrote: > >> Takeshi, do you know the reason why they wanted to remov

Re: .cache() changes contents of RDD

2016-02-27 Thread Igor Berman
are you using avro format by any chance? there are some formats that need to be "deep"-copied before caching or aggregating try something like val input = sc.newAPIHadoopRDD(...) val rdd = input.map(deepCopyTransformation).map(...) rdd.cache() rdd.saveAsTextFile(...) where deepCopyTransformation is f

Re: Getting spark application driver ID programmatically

2015-10-02 Thread Igor Berman
if driver id is application id then yes you can do it with String appId = ctx.sc().applicationId(); //when ctx is java context On 1 October 2015 at 20:44, Snehal Nagmote wrote: > Hi , > > I have use case where we need to automate start/stop of spark streaming > application. > > To stop spark jo

Re: RDD of ImmutableList

2015-10-05 Thread Igor Berman
kryo doesn't support guava's collections by default I remember encountering a project on github that fixes this (not sure though). I've ended up not using guava collections where spark rdds are concerned. On 5 October 2015 at 21:04, Jakub Dubovsky wrote: > Hi all, > > I would like to have an

Re: Issue with the class generated from avro schema

2015-10-09 Thread Igor Berman
u should create a copy of your avro data before working with it, i.e. just after loadFromHDFS map it into a new instance that is a deep copy of the object. it's connected to the way the spark/avro reader reads avro files (it reuses some buffer or something) On 9 October 2015 at 19:05, alberskib wrote: > Hi

Re: Issue with the class generated from avro schema

2015-10-09 Thread Igor Berman
saving those results with > exactly the same schema. > > Thank you for the answer, at least I know that there is no way to make it > works. > > > 2015-10-09 20:19 GMT+02:00 Igor Berman : > >> u should create copy of your avro data before working with it, i.e. just >

Re: Question about GraphX connected-components

2015-10-10 Thread Igor Berman
let's start from some basics: might be you need to split your data into more partitions? spilling depends on your configuration when you create the graph (look for the storage level param) and your global configuration. in addition, your assumption of 64GB/100M is probably wrong, since spark divides memory int

Re: our spark gotchas report while creating batch pipeline

2015-10-18 Thread Igor Berman
thanks Ted :) On 18 October 2015 at 19:07, Ted Yu wrote: > Interesting reading material. > > bq. transformations that loose partitioner > > lose partitioner > > bq. Spark looses the partitioner > > loses the partitioner > > bq. Tunning number of partitions > > Should be tuning. > > bq. or incre

Re: In-memory computing and cache() in Spark

2015-10-18 Thread Igor Berman
Do your iterations really submit a job? I don't see any action there On Oct 17, 2015 00:03, "Jia Zhan" wrote: > Hi all, > > I am running Spark locally in one node and trying to sweep the memory size > for performance tuning. The machine has 8 CPUs and 16G main memory, the > dataset in my local d

Re: spark straggle task

2015-10-20 Thread Igor Berman
We know that the JobScheduler have the function to assign the straggle task to another node. - only if you enable and configure spark.speculation On 20 October 2015 at 15:20, Triones,Deng(vip.com) wrote: > Hi All > > We run an application with version 1.4.1 standalone mode. We saw two tasks > in
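
A minimal sketch of enabling it (the multiplier value is illustrative):

    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.speculation", "true")              // off by default
      .set("spark.speculation.multiplier", "1.5")    // how much slower than the median before a task is re-launched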

Re: Storing Compressed data in HDFS into Spark

2015-10-22 Thread Igor Berman
check spark.rdd.compress On 19 October 2015 at 21:13, ahaider3 wrote: > Hi, > A lot of the data I have in HDFS is compressed. I noticed when I load this > data into spark and cache it, Spark unrolls the data like normal but stores > the data uncompressed in memory. For example, suppose /data/ is
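
Note that spark.rdd.compress only applies to serialized storage levels, so a sketch would pair it with MEMORY_ONLY_SER (rdd stands for whatever you are caching):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    val conf = new SparkConf().set("spark.rdd.compress", "true")
    // later, when caching:
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)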

Re: Spark 1.5.1+Hadoop2.6 .. unable to write to S3 (HADOOP-12420)

2015-10-22 Thread Igor Berman
many use it. how do you add aws sdk to classpath? check in environment ui what is in cp. you should make sure that in your cp the version is compatible with one that spark compiled with I think 1.7.4 is compatible(at least we use it) make sure that you don't get other versions from other transitiv

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
Hi, we are using avro with compression (snappy). As soon as you have enough partitions, the saving won't be a problem imho. in general hdfs is pretty fast, s3 less so. the issue with storing data is that you will lose your partitioner (even though the rdd has it) at loading moment. There is PR that tr

Re: Efficient approach to store an RDD as a file in HDFS and read it back as an RDD?

2015-11-05 Thread Igor Berman
convert a parquet file that is saved in hdfs to an RDD after > reading the file from hdfs? > > On Thu, Nov 5, 2015 at 10:02 AM, Igor Berman > wrote: > >> Hi, >> we are using avro with compression(snappy). As soon as you have enough >> partitions, the saving won&

Re: Whether Spark is appropriate for our use case.

2015-11-07 Thread Igor Berman
1. if you have join by some specific field(e.g. user id or account-id or whatever) you may try to partition parquet file by this field and then join will be more efficient. 2. you need to see in spark metrics what is performance of particular join, how much partitions is there, what is shuffle size

status of slaves in standalone cluster rest/rpc call

2015-11-09 Thread Igor Berman
Hi, How do I get status of workers(slaves) from driver? why I need it - I want to autoscale new workers and want to poll status of cluster(e.g. number of alive slaves connected) so that I'll submit job only after expected number of slaves joined cluster I've found MasterPage class that produces u

Re: status of slaves in standalone cluster rest/rpc call

2015-11-09 Thread Igor Berman
further reading code of MasterPage gave me what I want: http://:8080/json returns json view of all info presented in main page On 9 November 2015 at 22:41, Igor Berman wrote: > Hi, > How do I get status of workers(slaves) from driver? > why I need it - I want to autoscale new workers

Re: Incorrect results with reduceByKey

2015-11-17 Thread Igor Berman
you should clone your data after reading avro On 18 November 2015 at 06:28, tovbinm wrote: > Howdy, > > We've noticed a strange behavior with Avro serialized data and reduceByKey > RDD functionality. Please see below: > > // We're reading a bunch of Avro serialized data > val data: RDD[T] = spa

Re: Configuring Log4J (Spark 1.5 on EMR 4.1)

2015-11-20 Thread Igor Berman
try to assemble log4j.xml or log4j.properties in your jar...probably you'll get what you want, however pay attention that when you'll move to multinode cluster - there will be difference On 20 November 2015 at 05:10, Afshartous, Nick wrote: > > < log4j.properties file only exists on the master a

Re: newbie: unable to use all my cores and memory

2015-11-20 Thread Igor Berman
you've asked for total cores of 2, + 1 for the driver (since you are running in cluster mode, it's running on one of the slaves). change total cores to 3*2 and change submit mode to client - you'll have full utilization (btw it's not advisable to use all cores of a slave...since there are OS processes and

Re: Need Help Diagnosing/operating/tuning

2015-11-23 Thread Igor Berman
you should check why executor is killed. as soon as it's killed you can get all kind of strange exceptions... either give your executors more memory(4G is rather small for spark ) or try to decrease your input or maybe split it into more partitions in input format 23G in lzo might expand to x?

Re: Is spark suitable for real time query

2015-07-22 Thread Igor Berman
you can use spark rest job server (or any other solution that provides a long-running spark context) so that you won't pay this bootstrap time on each query. in addition: if you have some rdd that you want your queries to be executed on, you can cache this rdd in memory (depends on ur cluster memory size

Re: Spark Number of Partitions Recommendations

2015-07-29 Thread Igor Berman
imho, you need to take into account the size of your data too. if your cluster is relatively small, you may cause memory pressure on your executors by repartitioning to some #cores-connected number of partitions. better to take some max between the initial number of partitions (assuming your data is o
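
A sketch of that heuristic (the per-core multiplier is an illustrative rule of thumb; rdd is whatever you loaded):

    val byCores = sc.defaultParallelism * 3          // a few tasks per core
    val target  = math.max(rdd.partitions.length, byCores)
    val sized   = rdd.repartition(target)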

Re: Too many open files

2015-07-29 Thread Igor Berman
you probably should increase file handles limit for user that all processes are running with(spark master & workers) e.g. http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/ On 29 July 2015 at 18:39, wrote: > Hello, > > I’ve seen a couple emails on this issue but could

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
What kind of cluster? How many cores on each worker? Is there config for http solr client? I remember standard httpclient has limit per route/host. On Aug 2, 2015 8:17 PM, "Sujit Pal" wrote: > No one has any ideas? > > Is there some more information I should provide? > > I am looking for ways to

Re: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Igor Berman
(assuming you are using HttpClient from 4.x and not 3.x version) not sure what are the defaults... On 2 August 2015 at 23:42, Sujit Pal wrote: > Hi Igor, > > The cluster is a Databricks Spark cluster. It consists of 1 master + 4 > workers, each worker has 60GB RAM and 4 CPUs. The

Re: About memory leak in spark 1.4.1

2015-08-03 Thread Igor Berman
in general, what is your configuration? use --conf "spark.logConf=true" we have 1.4.1 in production standalone cluster and haven't experienced what you are describing can you verify in web-ui that indeed spark got your 50g per executor limit? I mean in configuration page.. might be you are using

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Igor Berman
zer.KryoSerializer > spark.kryoserializer.buffer 32 > spark.kryoserializer.buffer.max 256 > spark.shuffle.consolidateFiles true > spark.io.compression.codec org.apache.spark.io.LZ4CompressionCodec > > > > > > -- Original Message -- > *From:* "Igor Berman";;

Re: Combining Spark Files with saveAsTextFile

2015-08-04 Thread Igor Berman
using coalesce might be dangerous, since 1 worker process will need to handle the whole file and if the file is huge you'll get OOM, however it depends on the implementation, I'm not sure how it will be done nevertheless, worth trying the coalesce method (please post your results) another option would be

Re: Combining Spark Files with saveAsTextFile

2015-08-05 Thread Igor Berman
seems that coalesce does work, see the following thread https://www.mail-archive.com/user%40spark.apache.org/msg00928.html On 5 August 2015 at 09:47, Igor Berman wrote: > using coalesce might be dangerous, since 1 worker process will need to > handle whole file and if the file is huge you'
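
i.e. something like this (the path is illustrative):

    // shuffle = true keeps the upstream stages parallel; the single writing task still handles all the data,
    // so this can be slow or memory-heavy for very large outputs
    rdd.coalesce(1, shuffle = true).saveAsTextFile("hdfs:///tmp/combined")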

Re: Enum values in custom objects mess up RDD operations

2015-08-06 Thread Igor Berman
an enum's hashCode is jvm-instance specific (i.e. different jvms will give you different values), so you can use the enum's ordinal (or the ordinal's hashCode) as part of your hashCode computation On 6 August 2015 at 11:41, Warfish wrote: > Hi everyone, > > I was working with Spark for a

Re: All masters are unresponsive! Giving up.

2015-08-07 Thread Igor Berman
check on which ip/port master listens netstat -a -t --numeric-ports On 7 August 2015 at 20:48, Jeff Jones wrote: > Thanks. Added this to both the client and the master but still not getting > any more information. I confirmed the flag with ps. > > > > jjones53222 2.7 0.1 19399412 549656 p

Re: Spark Job Hangs on our production cluster

2015-08-11 Thread Igor Berman
how do u want to process 1T of data when you set your executor memory to be 2g? look at spark ui, metrics of tasks...if any look at spark logs on executor machine under work dir(unless you configured log4j) I think your executors are thrashing or spilling to disk. check memory metrics/swapping O

Re: how do I execute a job on a single worker node in standalone mode

2015-08-18 Thread Igor Berman
by default standalone creates 1 executor on every worker machine per application number of overall cores is configured with --total-executor-cores so in general if you'll specify --total-executor-cores=1 then there would be only 1 core on some executor and you'll get what you want on the other han
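
The programmatic equivalent of that flag for standalone mode is spark.cores.max, e.g.:

    import org.apache.spark.SparkConf
    val conf = new SparkConf().set("spark.cores.max", "1")   // caps the app to a single core on a single executor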

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
any differences in number of cores, memory settings for executors? On 19 August 2015 at 09:49, Rick Moritz wrote: > Dear list, > > I am observing a very strange difference in behaviour between a Spark > 1.4.0-rc4 REPL (locally compiled with Java 7) and a Spark 1.4.0 zeppelin > interpreter (comp

Re: Strange shuffle behaviour difference between Zeppelin and Spark-shell

2015-08-19 Thread Igor Berman
sted, there are different defaults for the two means of job submission > that come into play in a non-transparent fashion (i.e. not visible in > SparkConf). > > On Wed, Aug 19, 2015 at 1:36 PM, Igor Berman > wrote: > >> any differences in number of cores, memory settings f

Re: blogs/articles/videos on how to analyse spark performance

2015-08-19 Thread Igor Berman
you don't need to register, search in youtube for this video... On 19 August 2015 at 18:34, Gourav Sengupta wrote: > Excellent resource: http://www.oreilly.com/pub/e/3330 > > And more amazing is the fact that the presenter actually responds to your > questions. > > Regards, > Gourav Sengupta > >

Re: Help Explain Tasks in WebUI:4040

2015-08-31 Thread Igor Berman
are there other processes on sk3? or more generally are you sharing resources with somebody else, virtualization etc does your transformation consume other services? (e.g. reading from s3, so it can happen that s3 latency plays a role...) can it be that a task for some key will take longer than sa

Re: spark-submit issue

2015-08-31 Thread Igor Berman
might be you need to drain stdout/stderr of subprocess...otherwise subprocess can deadlock http://stackoverflow.com/questions/3054531/correct-usage-of-processbuilder On 27 August 2015 at 16:11, pranay wrote: > I have a java program that does this - (using Spark 1.3.1 ) Create a > command > strin
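
A small sketch of draining the child's output with scala.sys.process (the command is illustrative):

    import scala.sys.process._
    // consuming stdout/stderr keeps the OS pipe buffers from filling up and blocking the child
    val log  = ProcessLogger(out => println(out), err => Console.err.println(err))
    val exit = Seq("spark-submit", "--class", "com.example.Main", "app.jar").!(log)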

Re: spark-submit issue

2015-08-31 Thread Igor Berman
2015 at 10:42, Pranay Tonpay wrote: > Igor,, this seems to be the cause, however i am not sure at the moment how > to resolve it ... what i tried just now was that after " > > SparkSubmitDriverBootstrapper" process reaches the hung stage... i went > inside /proc//fd ..

Re: Parallel execution of RDDs

2015-08-31 Thread Igor Berman
what is size of the pool you submitting spark jobs from(futures you've mentioned)? is it 8? I think you have fixed thread pool of 8 so there can't be more than 8 parallel jobs running...so try to increase it what is number of partitions of each of your rdds? how many cores has your worker machine(t
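
A sketch of widening that pool (the pool size is illustrative, and myRdds stands for whatever collection of RDDs you're processing):

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}
    // one thread per concurrently-submitted job; each Future triggers an independent Spark job
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))
    val jobs = myRdds.map(rdd => Future { rdd.count() })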
