Re: Foreachpartition in spark streaming

2017-03-20 Thread Ryan
foreachPartition is an action but runs on each worker, which means you won't see anything on the driver. mapPartitions is a transformation, which is lazy and won't do anything until an action. Which one is better depends on the specific use case. To output something (like a print on a single machine) you could r
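
For illustration, a minimal sketch of the difference (assuming an existing SparkContext `sc`; the sample data is made up):

    val rdd = sc.parallelize(Seq("a", "bb", "ccc"), 2)

    // foreachPartition is an action: the body runs on the executors,
    // so these printlns show up in executor logs, not on the driver.
    rdd.foreachPartition { iter =>
      iter.foreach(record => println(s"worker saw: $record"))
    }

    // mapPartitions is a lazy transformation: nothing runs until an action
    // such as collect() pulls the results back to the driver.
    val lengths = rdd.mapPartitions(iter => iter.map(_.length))
    lengths.collect().foreach(println)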

Re: Best way to deal with skewed partition sizes

2017-03-22 Thread Ryan
Could you give the event timeline and DAG for the time-consuming stages on the Spark UI? On Thu, Mar 23, 2017 at 4:30 AM, Matt Deaver wrote: > For various reasons, our data set is partitioned in Spark by customer id > and saved to S3. When trying to read this data, however, the larger > partitions m

Re: Converting dataframe to dataset question

2017-03-23 Thread Ryan
You should import either spark.implicits or sqlContext.implicits, not both. Otherwise the compiler will be confused by two implicit conversions. The following code works for me, Spark version 2.1.0: object Test { def main(args: Array[String]) { val spark = SparkSession .builder
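
For reference, a minimal self-contained sketch of that (truncated) example, assuming Spark 2.x; the class and column names are illustrative:

    import org.apache.spark.sql.SparkSession

    case class Person(name: String, age: Int)

    object Test {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]")
          .appName("df-to-ds")
          .getOrCreate()

        // Import implicits from exactly one place so the compiler
        // finds a single encoder, not two competing ones.
        import spark.implicits._

        val ds = Seq(Person("a", 1), Person("b", 2)).toDF().as[Person]
        ds.show()
        spark.stop()
      }
    }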

Re: Does spark's random forest need categorical features to be one hot encoded?

2017-03-23 Thread Ryan
No, you don't need one-hot encoding. But since the feature column is a vector, and a vector only accepts numbers, if your feature is a string then a StringIndexer is needed. http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier has an example. On Thu, Mar 23, 2017 at
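
For illustration, a rough sketch of that setup (the column names here are hypothetical, not taken from the linked docs):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

    // Turn the string feature into numeric indices; no one-hot encoding needed.
    val colorIndexer = new StringIndexer()
      .setInputCol("color")
      .setOutputCol("colorIndex")

    // Pack numeric columns (including the indexed one) into the feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("colorIndex", "price"))
      .setOutputCol("features")

    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(colorIndexer, assembler, rf))
    // val model = pipeline.fit(trainingDF)  // trainingDF is assumed to exist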

Re: Groupby in fast in Impala than spark sql - any suggestions

2017-03-28 Thread Ryan
How long does it take if you remove the repartition and just collect the result? I don't think repartition is needed here; there's already a shuffle for the group by. On Tue, Mar 28, 2017 at 10:35 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am working on requirement where i

Re: Groupby in fast in Impala than spark sql - any suggestions

2017-03-28 Thread Ryan
And could you paste the stage and task information from the Spark UI? On Wed, Mar 29, 2017 at 11:30 AM, Ryan wrote: > how long does it take if you remove the repartition and just collect the > result? I don't think repartition is needed here. There's already a shuffle > for group

Why VectorUDT private?

2017-03-29 Thread Ryan
I'm writing a transformer and the input column is vector type (which is the output column from another transformer). But as the VectorUDT is private, how can I check/transform the schema for the vector column?

Re: Why VectorUDT private?

2017-03-29 Thread Ryan
Spark version 2.1.0, the vector is from the ml package. The Vector in mllib has a public VectorUDT type. On Thu, Mar 30, 2017 at 10:57 AM, Ryan wrote: > I'm writing a transformer and the input column is vector type(which is the > output column from other transformer). But as the VectorUDT

Re: How to convert Spark MLlib vector to ML Vector?

2017-04-09 Thread Ryan
You could write a udf using the asML method along with some type casting, then apply the udf to the data after PCA. When using a pipeline, that udf needs to be wrapped in a customized transformer, I think. On Sun, Apr 9, 2017 at 10:07 PM, Nick Pentreath wrote: > Why not use the RandomForest from Spark
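
For illustration, a rough sketch of such a udf (assuming a DataFrame `df` with an old mllib-style vector column named "features" produced by the mllib PCA step):

    import org.apache.spark.mllib.linalg.{Vector => OldVector}
    import org.apache.spark.sql.functions.udf

    // asML converts org.apache.spark.mllib.linalg.Vector to org.apache.spark.ml.linalg.Vector
    val toML = udf((v: OldVector) => v.asML)

    val converted = df.withColumn("features_ml", toML(df("features")))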

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-16 Thread Ryan
You can build a search tree using the ids within each partition to act like an index, or create a Bloom filter to see if the current partition would have any hit. What's your expected QPS and response time for the filter request? On Mon, Apr 17, 2017 at 2:23 PM, MoTao wrote: > Hi all, > > I have 10M (

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-16 Thread Ryan
I don't think you can do parallel inserts into a Hive table without dynamic partitioning; for Hive locking please refer to https://cwiki.apache.org/confluence/display/Hive/Locking. Other than that, it should work. On Mon, Apr 17, 2017 at 6:52 AM, Amol Patil wrote: > Hi All, > > I'm writing generic pys

Re: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
Apr 17, 2017 at 3:01 PM, 莫涛 wrote: > Hi Ryan, > > > 1. "expected qps and response time for the filter request" > > I expect that only the requested BINARY are scanned instead of all > records, so the response time would be "10K * 5MB / disk read speed", or >

Re: 答复: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-17 Thread Ryan
row group wouldn't be read if the predicate isn't satisfied due to index. 2. It is absolutely true the performance gain depends on the id distribution... On Mon, Apr 17, 2017 at 4:23 PM, 莫涛 wrote: > Hi Ryan, > > > The attachment is a screen shot for the spark job and this

Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-17 Thread Ryan
It shouldn't be a problem then. We've done a similar thing in Scala. I don't have much experience with Python threads, but maybe the code related to reading/writing the temp table isn't thread-safe. On Mon, Apr 17, 2017 at 9:45 PM, Amol Patil wrote: > Thanks Ryan, > &

Re: 答复: 答复: 答复: How to store 10M records in HDFS to speed up further filtering?

2017-04-20 Thread Ryan
After some searching I find that SequenceFile might be an alternative to HAR you may be interested in. Thanks to all the people involved. I've learnt a lot too :-) On Thu, Apr 20, 2017 at 5:25 PM, 莫涛 wrote: > Hi Ryan, > > > The attachment is the event timeline on executors. They are always

Re: Convert the feature vector to raw data

2017-06-07 Thread Ryan
If you use StringIndexer to categorize the data, IndexToString can convert it back. On Wed, Jun 7, 2017 at 6:14 PM, kundan kumar wrote: > Hi Yan, > > This doesnt work. > > thanks, > kundan > > On Wed, Jun 7, 2017 at 2:53 PM, 颜发才(Yan Facai) > wrote: > >> Hi, kumar. >> >> How about removing the `
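
For illustration, a short sketch of that round trip (the column names are made up; `df` is assumed to exist):

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")

    val indexerModel = indexer.fit(df)
    val indexed = indexerModel.transform(df)

    // IndexToString reverses the mapping using the labels learned by the indexer.
    val converter = new IndexToString()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryOriginal")
      .setLabels(indexerModel.labels)

    val restored = converter.transform(indexed)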

Re: Java SPI jar reload in Spark

2017-06-07 Thread Ryan
I'd suggest scripts like JS, Groovy, etc. To my understanding, the service loader mechanism isn't a good fit for runtime reloading. On Wed, Jun 7, 2017 at 4:55 PM, Jonnas Li(Contractor) < zhongshuang...@envisioncn.com> wrote: > To be more explicit, I used mapwithState() in my application, just li

Re: Question about mllib.recommendation.ALS

2017-06-07 Thread Ryan
1. Could you give the job, stage & task status from the Spark UI? I found it extremely useful for performance tuning. 2. Use model.transform for predictions. Usually we have a pipeline for preparing training data, and using the same pipeline to transform the data you want to predict could give us the predictio

Re: good http sync client to be used with spark

2017-06-07 Thread Ryan
We use AsyncHttpClient (from the Java world) and simply call future.get as a synchronous call. On Thu, Jun 1, 2017 at 4:08 AM, vimal dinakaran wrote: > Hi, > In our application pipeline we need to push the data from spark streaming > to a http server. > > I would like to have a http client with be
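
For illustration, a hedged sketch of that pattern inside foreachPartition (the endpoint, the record handling, and the RDD `rdd` are made up; assumes the org.asynchttpclient artifact is on the classpath):

    import org.asynchttpclient.DefaultAsyncHttpClient

    rdd.foreachPartition { records =>
      val client = new DefaultAsyncHttpClient()   // one client per partition
      try {
        records.foreach { rec =>
          // execute() returns a ListenableFuture; get() blocks, so the call
          // behaves synchronously even though the client is async.
          val response = client.preparePost("http://example.com/ingest")
            .setBody(rec.toString)
            .execute()
            .get()
          // inspect response.getStatusCode here if needed
        }
      } finally {
        client.close()
      }
    }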

Re: Worker node log not showed

2017-06-07 Thread Ryan
I think you need to get the logger within the lambda; otherwise it's the logger on the driver side, which can't work. On Wed, May 31, 2017 at 4:48 PM, Paolo Patierno wrote: > No it's running in standalone mode as Docker image on Kubernetes. > > > The only way I found was to access "stderr" file creat
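
For illustration, a minimal sketch of that suggestion (the logger name, the processing, and the RDD `rdd` are placeholders):

    import org.slf4j.LoggerFactory

    rdd.foreachPartition { records =>
      // Obtained inside the closure, so the logger is created on the executor
      // instead of being captured (and lost) from the driver.
      val log = LoggerFactory.getLogger("worker-side")
      records.foreach(r => log.info(s"processing $r"))
    }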

Re: No TypeTag Available for String

2017-06-07 Thread Ryan
did you include the proper scala-reflect dependency? On Wed, May 31, 2017 at 1:01 AM, krishmah wrote: > I am currently using Spark 2.0.1 with Scala 2.11.8. However same code works > with Scala 2.10.6. Please advise if I am missing something > > import org.apache.spark.sql.functions.udf > > val g

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-16 Thread Ryan
I don't think Broadcast itself can be serialized. You can get the value out on the driver side and refer to it in foreach; then the value would be serialized with the lambda expression and sent to the workers. On Fri, Jun 16, 2017 at 2:29 AM, Anton Kravchenko < kravchenko.anto...@gmail.com> wrote: > How on
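
A sketch of that workaround (in Scala rather than the Java API of the original question; assumes an RDD[String] named `rdd` and made-up lookup data):

    val bcast = sc.broadcast(Map("a" -> 1, "b" -> 2))
    val lookup = bcast.value    // extracted on the driver

    rdd.foreachPartition { records =>
      // `lookup` (the plain value, not the Broadcast handle) is serialized
      // with this lambda and shipped to the workers.
      records.foreach(r => println(lookup.getOrElse(r, -1)))
    }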

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
Why would you like to do so? I think there's no need for us to explicitly ask for a forEachPartition in Spark SQL, because Tungsten is smart enough to figure out whether a SQL operation can be applied on each partition or whether there has to be a shuffle. On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi w

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
rmane. > > 2017-06-25 19:18 GMT-07:00 Ryan : > >> Why would you like to do so? I think there's no need for us to explicitly >> ask for a forEachPartition in spark sql because tungsten is smart enough to >> figure out whether a sql operation could be applied on each part

Re: access a broadcasted variable from within ForeachPartitionFunction Java API

2017-06-25 Thread Ryan
WrapBCV { > private static Broadcast bcv; > public static void setBCV(Broadcast setbcv){ bcv = setbcv; } > public static Integer getBCV() > { > return bcv.value(); > } > } > > > On Fri, Jun 16, 2017 at 3:35 AM, Ryan wrote: > >> I don't thin

Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Ryan
apPartitions() give us the ability to invoke this in bulk. We're > looking for a similar approach in SQL. > > > -- > *From:* Ryan > *Sent:* Sunday, June 25, 2017 7:18:32 PM > *To:* jeff saremi > *Cc:* user@spark.apache.org > *Subje

Re: Understanding how spark share db connections created on driver

2017-06-29 Thread Ryan
I think it creates a new connection on each worker: whenever the Processor references Resource, it gets initialized. There's no need for the driver to connect to the db in this case. On Thu, Jun 29, 2017 at 5:52 PM, salvador wrote: > Hi all, > > I am writing a spark job from which at some point I wa
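
For illustration, a rough sketch of that pattern (the JDBC URL, credentials, and RDD `rdd` are placeholders):

    import java.sql.{Connection, DriverManager}

    object Resource {
      // Initialized lazily the first time a task on a given executor touches it,
      // so each executor JVM opens its own connection; the driver never connects.
      lazy val connection: Connection =
        DriverManager.getConnection("jdbc:postgresql://db-host/mydb", "user", "password")
    }

    rdd.foreachPartition { records =>
      val conn = Resource.connection   // reuses the executor-local connection
      records.foreach { r =>
        // use conn here, e.g. prepare and execute statements
      }
    }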

Re: How to insert a dataframe as a static partition to a partitioned table

2017-07-19 Thread Ryan
Not sure about the writer API, but you could always register a temp table for that dataframe and execute an INSERT HQL statement. On Thu, Jul 20, 2017 at 6:13 AM, ctang wrote: > I wonder if there are any easy ways (or APIs) to insert a dataframe (or > DataSet), which does not contain the partition columns, a
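
For illustration, a minimal sketch of the temp-table approach (the table, partition column, and value are made up; assumes a Hive-enabled SparkSession `spark` and a DataFrame `df` that lacks the partition column):

    df.createOrReplaceTempView("staging")

    spark.sql(
      """INSERT INTO TABLE events PARTITION (dt = '2017-07-19')
        |SELECT * FROM staging""".stripMargin)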

Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
It's just sort of an inner join operation... If the second dataset isn't very large it's ok (btw, you can use flatMap directly instead of map followed by flatMap/flatten); otherwise you can register the second one as an rdd/dataset and join them on user id. On Wed, Aug 9, 2017 at 4:29 PM, wrote: >
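
For illustration, a sketch of the two options (record types and field names are made up; `users` is an RDD with an `id` field, `items` a small local Seq, `itemsRdd` a large RDD with a `userId` field):

    // Option 1: the second dataset is small, so expand directly with flatMap.
    val expanded = users.flatMap(user => items.map(item => (user.id, item)))

    // Option 2: the second dataset is large, so key both sides by user id and join.
    val joined = users.keyBy(_.id).join(itemsRdd.keyBy(_.userId))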

Re: Re: Is there an operation to create multi record for every element in a RDD?

2017-08-09 Thread Ryan
rdd has a cartesian method On Wed, Aug 9, 2017 at 5:12 PM, ayan guha wrote: > If you use join without any condition in becomes cross join. In sql, it > looks like > > Select a.*,b.* from a join b > > On Wed, 9 Aug 2017 at 7:08 pm, wrote: > >> Riccardo and Ryan >&g

Re: How can I tell if a Spark job is successful or not?

2017-08-10 Thread Ryan
You could exit with an error code just like a normal Java/Scala application, and get it from the driver/YARN. On Fri, Aug 11, 2017 at 9:55 AM, Wei Zhang wrote: > I suppose you can find the job status from Yarn UI application view. > > > > Cheers, > > -z > > > > *From:* 陈宇航 [mailto:yuhang.c...@foxmail.com]
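
For illustration, a minimal sketch (the `runJob` call stands in for the application's own logic):

    try {
      runJob(spark)            // hypothetical application logic
    } catch {
      case e: Exception =>
        e.printStackTrace()
        sys.exit(1)            // non-zero exit code marks the driver as failed
    }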

Re: Does Spark SQL uses Calcite?

2017-08-11 Thread Ryan
The Thrift server is a JDBC server, Kanth. On Fri, Aug 11, 2017 at 2:51 PM, wrote: > I also wonder why there isn't a jdbc connector for spark sql? > > Sent from my iPhone > > On Aug 10, 2017, at 2:45 PM, Jules Damji wrote: > > Yes, it's more used in Hive than Spark > > Sent from my iPhone > Pard

Re: Spark GroupBy Save to different files

2017-09-01 Thread Ryan
you may try foreachPartition On Fri, Sep 1, 2017 at 10:54 PM, asethia wrote: > Hi, > > I have list of person records in following format: > > case class Person(fName:String, city:String) > > val l=List(Person("A","City1"),Person("B","City2"),Person("C","City1")) > > val rdd:RDD[Person]=sc.parall

Re: Different watermark for different kafka partitions in Structured Streaming

2017-09-01 Thread Ryan
I don't think Structured Streaming currently supports a "partitioned" watermark. And why would different partitions' consumption rates vary? If the handling logic is quite different, using different topics is a better way. On Fri, Sep 1, 2017 at 4:59 PM, 张万新 wrote: > Thanks, it's true that looser watermark can guarantee more data

Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
house is in the local filesystem. When I create a spark application JAR and try to run it from my laptop, I get the same problem as #1, namely that it tries to find the warehouse directory on my laptop itself. Am I crazy? Perhaps this isn't a supported way to use Spark? Any help or insights are much appreciated! -Ryan Victory

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
Thanks Apostolos, I'm trying to avoid standing up HDFS just for this use case (single node). -Ryan On Wed, Nov 25, 2020 at 8:56 AM Apostolos N. Papadopoulos < papad...@csd.auth.gr> wrote: > Hi Ryan, > > since the driver is at your laptop, in order to access a remote file y

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Ryan Victory
JAR to the server and execute it from there. This isn't ideal but it's better than nothing. -Ryan On Wed, Nov 25, 2020 at 9:13 AM Chris Coutinho wrote: > I'm also curious if this is possible, so while I can't offer a solution > maybe you could try the following. >

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
I think this is causing issues upgrading ADAM to Spark 1.3.1 (cf. adam#690 ); attempting to build against Hadoop 1.0.4 yields errors like: 2015-06-02 15:57:44 ERROR Executor:96 - Exce

Re: Re: spark 1.3.1 jars in repo1.maven.org

2015-06-02 Thread Ryan Williams
Thanks so much Shixiong! This is great. On Tue, Jun 2, 2015 at 8:26 PM Shixiong Zhu wrote: > Ryan - I sent a PR to fix your issue: > https://github.com/apache/spark/pull/6599 > > Edward - I have no idea why the following error happened. "ContextCleaner" > doesn't

Re: Do we need schema for Parquet files with Spark?

2016-03-04 Thread Ryan Blue
words > if > >> there is no schema provided by user? Where/how to specify my schema / > >> config for Parquet format? > >> > >> Could not find Apache Parquet mailing list in the official site. It > would > >> be great if anyone could share it as well. > >> > >> Regards > >> Ashok > >> > > > > > -- Ryan Blue Software Engineer Netflix

Re: Appropriate Apache Users List Uses

2016-02-09 Thread Ryan Victory
Yeah, a little disappointed with this, I wouldn't expect to be sent unsolicited mail based on my membership to this list. -Ryan Victory On Tue, Feb 9, 2016 at 1:36 PM, John Omernik wrote: > All, I received this today, is this appropriate list use? Note: This was > unsolicited.

Re: SequenceFile and object reuse

2015-11-18 Thread Ryan Williams
Hey Jeff, in addition to what Sandy said, there are two more reasons that this might not be as bad as it seems; I may be incorrect in my understanding though. First, the "additional step" you're referring to is not likely to be adding any overhead; the "extra map" is really just materializing the

Spree: a live-updating web UI for Spark

2015-07-27 Thread Ryan Williams
hey sound useful to you, and let me know if you have questions or comments! -Ryan

updateStateByKey when the state is very large

2015-09-11 Thread Brush,Ryan
ing. We can simulate something like this in user code over Spark, although it's not trivial. Does this path seem reasonable? If others are have similar needs, I'd be interested in working out a common solution. Best, Ryan CONFIDENTIALITY NOTICE This message and any included attachm

Re: MLib : Non Linear Optimization

2016-10-05 Thread Blake Ryan
Outside of Spark, we'd recommend looking at IpOpt as an open source alternative for this: http://www.coin-or.org/projects/Ipopt.xml Blake On Wed, Oct 5, 2016 at 8:59 AM Robin East wrote: > The TFOCS package is announced here: > https://databricks.com/blog/2015/11/02/announcing-the-spark-tfocs

Re: Apache Hive with Spark Configuration

2017-01-03 Thread Ryan Blue
ll me which > version is more compatible with Spark 2.0.2 ? > > THanks > -- Ryan Blue Software Engineer Netflix

Re: Driver hung and happend out of memory while writing to console progress bar

2017-02-10 Thread Ryan Blue
java.lang.String.(String.java:207) at > java.lang.StringBuilder.toString(StringBuilder.java:407) at > scala.collection.mutable.StringBuilder.toString(StringBuilder.scala:430) > at org.apache.spark.ui.ConsoleProgressBar.show(ConsoleProgressBar.scala:101) > at > org.apache.spark.ui.ConsoleProgressBar.org$apache$spark$ui$ConsoleProgressBar$$refresh(ConsoleProgressBar.scala:71) > at > org.apache.spark.ui.ConsoleProgressBar$$anon$1.run(ConsoleProgressBar.scala:55) > at java.util.TimerThread.mainLoop(Timer.java:555) at > java.util.TimerThread.run(Timer.java:505) > > -- Ryan Blue Software Engineer Netflix

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
ve set spark.task.maxFailures to 8 for my job. Seems > like all task retries happen on the same slave in case of failure. My > expectation was that task will be retried on different slave in case of > failure, and chance of all 8 retries to happen on same slave is very less. > > > Regards >

Re: What is correct behavior for spark.task.maxFailures?

2017-04-24 Thread Ryan Blue
stage. In that version, you probably want to set spark.blacklist.task.maxTaskAttemptsPerExecutor. See the settings docs <http://spark.apache.org/docs/latest/configuration.html> and search for “blacklist” to see all the options. rb ​ On Mon, Apr 24, 2017 at 9:41 AM, Ryan Blue wrote: > Chawl

Re: Reparitioning Hive tables - Container killed by YARN for exceeding memory limits

2017-08-02 Thread Ryan Blue
gt;> >> Driver memory=4g, executor mem=12g, num-executors=8, executor core=8 >> >> Do you think below setting can help me to overcome above issue: >> >> spark.default.parellism=1000 >> spark.sql.shuffle.partitions=1000 >> >> Because default max number of partitions are 1000. >> >> >> > -- Ryan Blue Software Engineer Netflix

Unsubscribe

2018-02-19 Thread Ryan Myer
- To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-01 Thread Ryan Blue
huffleBlockFetcherIterator.scala:419) > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:349) > > -- Ryan Blue Software Engineer Netflix

Re: org.apache.spark.shuffle.FetchFailedException: Too large frame:

2018-05-03 Thread Ryan Blue
Let me know if this understanding is correct > > On Tue, May 1, 2018 at 9:37 PM, Ryan Blue wrote: > >> This is usually caused by skew. Sometimes you can work around it by in >> creasing the number of partitions like you tried, but when that doesn’t >> work you need to

Re: testing frameworks

2018-06-12 Thread Ryan Adams
aggregate baseline. Ryan Ryan Adams radams...@gmail.com On Tue, Jun 12, 2018 at 11:51 AM, Lars Albertsson wrote: > Hi, > > I wrote this answer to the same question a couple of years ago: > https://www.mail-archive.com/user%40spark.apache.org/msg48032.html > > I hav

unsubscribe

2018-08-10 Thread Ryan Adams
Ryan Adams radams...@gmail.com

unsubscribe

2018-09-20 Thread Ryan Adams
unsubscribe Ryan Adams radams...@gmail.com

Re: DataSourceV2 APIs creating multiple instances of DataSourceReader and hence not preserving the state

2018-10-19 Thread Ryan Blue
59 PM Mendelson, Assaf >>> wrote: >>> >>> Could you add a fuller code example? I tried to reproduce it in my >>> environment and I am getting just one instance of the reader… >>> >>> >>> >>> Thanks, >>> >>>

Re: DataSourceV2 producing wrong date value in Custom Data Writer

2019-02-05 Thread Ryan Blue
r.write: " + record.get(0, > DataTypes.DateType)); > > } > > It prints an integer as output: > > MyDataWriter.write: 17039 > > > Is this a bug? or I am doing something wrong? > > Thanks, > Shubham > -- Ryan Blue Software Engineer Netflix

Re: Manually reading parquet files.

2019-03-21 Thread Ryan Blue
tate.newHadoopConfWithOptions(relation.options)) > ) > > *import *scala.collection.JavaConverters._ > > *val *rows = readFile(pFile).flatMap(_ *match *{ > *case *r: InternalRow => *Seq*(r) > > // This doesn't work. vector mode is doing something screwy > *case *b: ColumnarBatch => b.rowIterator().asScala > }).toList > > *println*(rows) > //List([0,1,5b,24,66647361]) > //??this is wrong I think > > > > Has anyone attempted something similar? > > > > Cheers Andrew > > > -- Ryan Blue Software Engineer Netflix

Unsubscribe

2019-12-11 Thread Ryan Victory

Monitoring Spark with Graphite and Grafana

2015-02-26 Thread Ryan Williams
fair amount and have a bunch of ideas about where it should go. Thanks, -Ryan

Re: bitten by spark.yarn.executor.memoryOverhead

2015-03-02 Thread Ryan Williams
For reference, the initial version of #3525 (still open) made this fraction a configurable value, but consensus went against that being desirable so I removed it and marked SPARK-4665 as "won't fix". My

Re: Java IO Stream Corrupted - Invalid Type AC?

2014-06-06 Thread Ryan Compton
Just ran into this today myself. I'm on branch-1.0 using a CDH3 cluster (no modifications to Spark or its dependencies). The error appeared trying to run GraphX's .connectedComponents() on a ~200GB edge list (GraphX worked beautifully on smaller data). Here's the stacktrace (it's quite similar to

Re: best practice: write and debug Spark application in scala-ide and maven

2014-06-07 Thread Ryan Compton
Sounds like there are two questions here: First, from the command line, if you "mvn package" and then run the code with "java -cp target/*jar-with-dependencies.jar com.ibm.App" do you still get the error? Second, for quick debugging, I agree that it's a pain to wait for mvn package to finish every t

Interconnect benchmarking

2014-06-27 Thread Ryan Compton
We are going to upgrade our cluster from 1g to 10g ethernet. I'd like to run some benchmarks before and after the upgrade. Can anyone suggest a few typical Spark workloads that are network-bound?

Issue with Spark on EC2 using spark-ec2 script

2014-07-31 Thread Ryan Tabora
pergroup 0 2014-07-31 00:10 /tachyon/workers drwxr-xr-x - root supergroup 0 2014-07-31 01:01 /tmp -rw-r--r-- 3 root supergroup 281471 2014-07-31 00:17 /tmp/CHANGES.txt -rw-r--r-- 3 root supergroup 4221 2014-07-31 01:01 /tmp/README.md Regards, Ryan Tabora http://ryantabora.com

NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass with spark-submit

2014-08-03 Thread Ryan Braley
.yahoomail...@web160503.mail.bf1.yahoo.com%3E  that this is a similar error. I am open to recompiling spark to fix this, but I would like to run my job on my cluster rather than just locally.  Thanks, Ryan Ryan Braley  |  Founder  http://traintracks.io/  US: +1 (206) 866 5661 CN: +86 156 1153

scalac crash when compiling DataTypeConversions.scala

2014-10-22 Thread Ryan Williams
I started building Spark / running Spark tests this weekend and on maybe 5-10 occasions have run into a compiler crash while compiling DataTypeConversions.scala. Here <https://gist.github.com/ryan-williams/7673d7da928570907f4d> is a full gist of an innocuous test command (mvn test -D

Re: scalac crash when compiling DataTypeConversions.scala

2014-10-26 Thread Ryan Williams
encountering this issue. > Typically you would have changed one or more of the profiles/options - > which leads to this occurring. > > 2014-10-22 22:00 GMT-07:00 Ryan Williams : > > I started building Spark / running Spark tests this weekend and on maybe >> 5-10 occasions have

FileNotFoundException in appcache shuffle files

2014-10-28 Thread Ryan Williams
regressed, etc., so hoping for some guidance there. So! Anyone else seen this? Is this related to the "bug in shuffle file consolidation"? Was it fixed? Did it regress? Are my confs or other steps unreasonable in some way? Any assistance would be appreciated, thanks. -Ryan [1] https:/

Re: filtering out non English tweets using TwitterUtils

2014-11-11 Thread Ryan Compton
Fwiw if you do decide to handle language detection on your machine this library works great on tweets https://github.com/carrotsearch/langid-java On Tue, Nov 11, 2014, 7:52 PM Tobias Pfeiffer wrote: > Hi, > > On Wed, Nov 12, 2014 at 5:42 AM, SK wrote: >> >> But getLang() is one of the methods o

Re: Data Loss - Spark streaming

2014-12-16 Thread Ryan Williams
TD's portion seems to start at 27:24: http://youtu.be/jcJq3ZalXD8?t=27m24s On Tue Dec 16 2014 at 7:13:43 AM Gerard Maas wrote: > Hi Jeniba, > > The second part of this meetup recording has a very good answer to your > question. TD explains the current behavior and the on-going work in Spark > S

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Ryan Compton
In 0.8 I had problems broadcasting variables around that size, for more info see here: https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201310.mbox/%3ccamgysq9sivs0j9dhv9qgdzp9qxgfadqkrd58b3ynbnhdgkp...@mail.gmail.com%3E On Wed, Mar 12, 2014 at 2:12 PM, Matei Zaharia wrote: > You sh

Re: Is it common in spark to broadcast a 10 gb variable?

2014-03-12 Thread Ryan Compton
Have not upgraded yet... On Wed, Mar 12, 2014 at 3:06 PM, Aureliano Buendia wrote: > Thanks, Ryan. Was your problem solved in spark 0.9? > > > On Wed, Mar 12, 2014 at 9:59 PM, Ryan Compton > wrote: >> >> In 0.8 I had problems broadcasting variables around that size

Re: distinct on huge dataset

2014-03-22 Thread Ryan Compton
Does it work without .distinct() ? Possibly related issue I ran into: https://mail-archives.apache.org/mod_mbox/spark-user/201401.mbox/%3CCAMgYSQ-3YNwD=veb1ct9jro_jetj40rj5ce_8exgsrhm7jb...@mail.gmail.com%3E On Sat, Mar 22, 2014 at 12:45 AM, Kane wrote: > It's 0.9.0 > > > > -- > View this messag

All pairs shortest paths?

2014-03-26 Thread Ryan Compton
No idea how feasible this is. Has anyone done it?

Re: All pairs shortest paths?

2014-03-26 Thread Ryan Compton
To clarify: I don't need the actual paths, just the distances. On Wed, Mar 26, 2014 at 3:04 PM, Ryan Compton wrote: > No idea how feasible this is. Has anyone done it?

Re: All pairs shortest paths?

2014-03-26 Thread Ryan Compton
our graph is small enough that storing all-pairs is feasible, you can > probably run this as an iterative algorithm: > http://en.wikipedia.org/wiki/Floyd–Warshall_algorithm, though I haven’t tried > it. It may be tough to do with GraphX. > > Matei > > On Mar 26, 2014, at 3:51 PM,

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Does this continue in newer versions? (I'm on 0.8.0 now) When I use .distinct() on moderately large datasets (224GB, 8.5B rows, I'm guessing about 500M are distinct) my jobs fail with: 14/04/17 15:04:02 INFO cluster.ClusterTaskSetManager: Loss was due to java.io.FileNotFoundException java.io.File

Re: distinct on huge dataset

2014-04-17 Thread Ryan Compton
Btw, I've got System.setProperty("spark.shuffle.consolidate.files", "true") and use ext3 (CentOS...) On Thu, Apr 17, 2014 at 3:20 PM, Ryan Compton wrote: > Does this continue in newer versions? (I'm on 0.8.0 now) > > When I use .distinct() on moderately

GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
I am trying to read an edge list into a Graph. My data looks like 394365859 --> 136153151 589404147 --> 1361045425 I read it into a Graph via: val edgeFullStrRDD: RDD[String] = sc.textFile(unidirFName) val edgeTupRDD = edgeFullStrRDD.map(x => x.split("\t")) .ma

Re: GraphX: .edges.distinct().count() is 10?

2014-04-22 Thread Ryan Compton
Try this: https://www.dropbox.com/s/xf34l0ta496bdsn/.txt This code: println(g.numEdges) println(g.numVertices) println(g.edges.distinct().count()) gave me 1 9294 2 On Tue, Apr 22, 2014 at 5:14 PM, Ankur Dave wrote: > I wasn't able to reproduce this with a small test file

GraphX, Kryo and BoundedPriorityQueue?

2014-04-23 Thread Ryan Compton
For me, PageRank fails when I use Kryo (works fine if I don't). I found the same problem reported here: https://groups.google.com/forum/#!topic/spark-users/unngi3JdRk8 . Has this been resolved? I'm not launching code from spark-shell. I tried registering GraphKryoRegistrator (instead of my own cl

GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ryan Compton
I'm trying shoehorn a label propagation-ish algorithm into GraphX. I need to update each vertex with the median value of their neighbors. Unlike PageRank, which updates each vertex with the mean of their neighbors, I don't have a simple commutative and associative function to use for mergeMsg. Wha

Re: GraphX: Help understanding the limitations of Pregel

2014-04-23 Thread Ryan Compton
over, you could iteratively find the median using > bisection, which would be associative and commutative. It's easy to think > of improvements that would make this approach give a reasonable answer in a > few iterations. I have no idea about mixing algorithmic iterations with > median-findi

Re: maven for building scala simple program

2014-05-06 Thread Ryan Compton
I've been using this (you'll need Maven 3). http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"> 4.0.0 com.mycompany.app my-app 1.0-SN

Spark 1.0: slf4j version conflicts with pig

2014-05-27 Thread Ryan Compton
I use both Pig and Spark. All my code is built with Maven into a giant *-jar-with-dependencies.jar. I recently upgraded to Spark 1.0 and now all my pig scripts fail with: Caused by: java.lang.RuntimeException: Could not resolve error that occured when launching map reduce job: java.lang.NoSuchMeth

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
to diagnose things > like this. > > My hunch is that something is depending on an old slf4j in your build > and it's overwriting Spark et al. > > On Tue, May 27, 2014 at 10:45 PM, Ryan Compton wrote: >> I use both Pig and Spark. All my code is built with Maven into

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
works/bidirectional-network-current/part-r-1' USING PigStorage() AS (id1:long, id2:long, weight:int); ttt = LIMIT edgeList0 10; DUMP ttt; On Wed, May 28, 2014 at 12:55 PM, Ryan Compton wrote: > It appears to be Spark 1.0 related. I made a pom.xml with a single > dependency on Spark, registeri

Re: Spark 1.0: slf4j version conflicts with pig

2014-05-28 Thread Ryan Compton
posted a JIRA https://issues.apache.org/jira/browse/SPARK-1952 On Wed, May 28, 2014 at 1:14 PM, Ryan Compton wrote: > Remark, just including the jar built by sbt will produce the same > error. i,.e this pig script will fail: > > REGISTER > /usr/share/osi1/spark-1.0.0/assembly/ta

Re: spark streaming

2016-03-02 Thread Shixiong(Ryan) Zhu
Hey, KafkaUtils.createDirectStream doesn't need a StorageLevel as it doesn't store blocks to the BlockManager. However, the error is not related to StorageLevel. It may be a bug. Could you provide more info about it? E.g., Spark version, your code, logs. On Wed, Mar 2, 2016 at 3:02 AM, Vinti Maheshw
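
For reference, a minimal sketch of the direct stream call for the Spark 1.6 / Kafka 0.8 integration in question (the broker address and topic are placeholders; `ssc` is an existing StreamingContext); note that no StorageLevel argument is passed:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic"))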

Re: Spark Streaming 1.6 mapWithState not working well with Kryo Serialization

2016-03-02 Thread Shixiong(Ryan) Zhu
See https://issues.apache.org/jira/browse/SPARK-12591 After applying the patch, it should work. However, if you want to enable "registrationRequired", you still need to register "org.apache.spark.streaming.util.OpenHashMapBasedStateMap", "org.apache.spark.streaming.util.EmptyStateMap" and "org.apa

Re: getPreferredLocations race condition in spark 1.6.0?

2016-03-02 Thread Shixiong(Ryan) Zhu
I think it's a bug. Could you open a ticket here: https://issues.apache.org/jira/browse/SPARK On Wed, Mar 2, 2016 at 3:46 PM, Andy Sloane wrote: > We are seeing something that looks a lot like a regression from spark 1.2. > When we run jobs with multiple threads, we have a crash somewhere inside

Re: Docker configuration for akka spark streaming

2016-03-14 Thread Shixiong(Ryan) Zhu
Could you use netstat to show the ports that the driver is listening on? On Mon, Mar 14, 2016 at 1:45 PM, David Gomez Saavedra wrote: > hi everyone, > > I'm trying to set up spark streaming using akka with a similar example of > the word count provided. When using spark master in local mode everyth

Re: Can not kill driver properly

2016-03-21 Thread Shixiong(Ryan) Zhu
Could you post the log of Master? On Mon, Mar 21, 2016 at 9:25 AM, Hao Ren wrote: > Update: > > I am using --supervise flag for fault tolerance. > > > > On Mon, Mar 21, 2016 at 4:16 PM, Hao Ren wrote: > >> Using spark 1.6.1 >> Spark Streaming Jobs are submitted via spark-submit (cluster mode) >

Re: how to deploy new code with checkpointing

2016-04-11 Thread Shixiong(Ryan) Zhu
You cannot. Streaming doesn't support it because code changes will break Java serialization. On Mon, Apr 11, 2016 at 4:30 PM, Siva Gudavalli wrote: > hello, > > i am writing a spark streaming application to read data from kafka. I am > using no receiver approach and enabled checkpointing to make

Re: Spark and Kafka direct approach problem

2016-05-04 Thread Shixiong(Ryan) Zhu
It's because the Scala version of Spark and the Scala version of Kafka don't match. Please check them. On Wed, May 4, 2016 at 6:17 AM, أنس الليثي wrote: > NoSuchMethodError usually appears because of a difference in the library > versions. > > Check the version of the libraries you downloaded, t

Re: Unresponsive Spark Streaming UI in YARN cluster mode - 1.5.2

2016-07-08 Thread Shixiong(Ryan) Zhu
Hey Tom, Could you provide all blocked threads? Perhaps due to some potential deadlock. On Fri, Jul 8, 2016 at 10:30 AM, Ellis, Tom (Financial Markets IT) < tom.el...@lloydsbanking.com.invalid> wrote: > Hi There, > > > > We’re currently using HDP 2.3.4, Spark 1.5.2 with a Spark Streaming job in

Re: How to register a Tuple3 with KryoSerializer?

2015-12-30 Thread Shixiong(Ryan) Zhu
You can use "_", e.g., sparkConf.registerKryoClasses(Array(classOf[scala.Tuple3[_, _, _]])) Best Regards, Shixiong(Ryan) Zhu Software Engineer Databricks Inc. shixi...@databricks.com databricks.com <http://databricks.com/> On Wed, Dec 30, 2015 at 10:16 AM, Russ wrote: >

Re: 2 of 20,675 Spark Streaming : Out put frequency different from read frequency in StatefulNetworkWordCount

2015-12-30 Thread Shixiong(Ryan) Zhu
You can use "reduceByKeyAndWindow", e.g., val lines = ssc.socketTextStream("localhost", ) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1)).reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(60), Seconds(60)) wordCounts.print() On Wed, Dec 3
