Re: How to set the file size for parquet Part

2015-05-21 Thread Akhil Das
How many part files do you have? Did you try re-partitioning to a smaller number of partitions so that you end up with fewer, bigger files? Thanks Best Regards On Wed, May 20, 2015 at 3:06 AM, Richard Grossman wrote: > Hi > > I'm using spark 1.3.1 and now I can't set the size of the part generate
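A minimal Scala sketch of that suggestion (Spark 1.3-era API; the paths, the SQLContext setup and the partition count of 20 are placeholders, not part of the original thread):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)                // assumes an existing SparkContext sc
    val df = sqlContext.parquetFile("/path/to/input")  // hypothetical input
    // Fewer partitions means fewer, larger part files; tune the count to reach the size you want.
    df.repartition(20).saveAsParquetFile("/path/to/output")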

Re: Read multiple files from S3

2015-05-21 Thread Akhil Das
textFile does read all files in a directory. We have modified the Spark Streaming code base to read nested files from S3; you can check this function
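For reference, a small sketch of reading many S3 files with textFile (the bucket and prefixes are made up; glob matching comes from the underlying Hadoop FileSystem):

    // All files directly under one prefix:
    val flat = sc.textFile("s3n://my-bucket/logs/2015-05-21/*")
    // One extra level of nesting, matched with a glob:
    val nested = sc.textFile("s3n://my-bucket/logs/*/*.gz")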

Question about Serialization in Storage Level

2015-05-21 Thread Jiang, Zhipeng
Hi there, This question may seem to be kind of naïve, but what's the difference between MEMORY_AND_DISK and MEMORY_AND_DISK_SER? If I call rdd.persist(StorageLevel.MEMORY_AND_DISK), the BlockManager won't serialize the rdd? Thanks, Zhipeng

Re: Storing spark processed output to Database asynchronously.

2015-05-21 Thread Tathagata Das
If you cannot push data as fast as you are generating it, then async isn't going to help either. The "work" is just going to keep piling up as many async jobs even though your batch processing times will be low, as that processing time is not going to reflect how much of the overall work is pending

Re: Storing spark processed output to Database asynchronously.

2015-05-21 Thread Gautam Bajaj
That is completely alright, as the system will make sure the work gets done. My major concern is the data drop. Will using async stop data loss? On Thu, May 21, 2015 at 4:55 PM, Tathagata Das wrote: > If you cannot push data as fast as you are generating it, then async isnt > going to help eit

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Allan Jie
Hi, Just check the logs of datanode, it looks like this: 2015-05-20 11:42:14,605 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / 10.9.0.48:50676, dest: /10.9.0.17:50010, bytes: 134217728, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_804680172_54, offset: 0, srvID: 39fb78

Re: java program got Stuck at broadcasting

2015-05-21 Thread allanjie
Hi, Just check the logs of datanode, it looks like this: * 2015-05-20 11:42:14,605 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.9.0.48:50676, dest: /10.9.0.17:50010, bytes: 134217728, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_804680172_54, offset: 0, srvID: 39fb78

Re: Storing spark processed output to Database asynchronously.

2015-05-21 Thread Tathagata Das
Can you elaborate on how the data loss is occurring? On Thu, May 21, 2015 at 1:10 AM, Gautam Bajaj wrote: > That is completely alright, as the system will make sure the works get > done. > > My major concern is, the data drop. Will using async stop data loss? > > On Thu, May 21, 2015 at 4:55 PM

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2015-05-21 Thread Aniket Bhatnagar
I ran into one of the issues that are potentially caused because of this and have logged a JIRA bug - https://issues.apache.org/jira/browse/SPARK-7788 Thanks, Aniket On Wed, Sep 24, 2014 at 12:59 PM Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Hi all > > Reading through Spark streamin

Re: Spark and Flink

2015-05-21 Thread Pa Rö
thanks a lot for your help; now I split my project and it works. 2015-05-19 15:44 GMT+02:00 Alexander Alexandrov < alexander.s.alexand...@gmail.com>: > Sorry, we're using a forked version which changed groupID. > > 2015-05-19 15:15 GMT+02:00 Till Rohrmann : > >> I guess it's a typo: "eu.stratosphere

Re: spark mllib kmeans

2015-05-21 Thread Pa Rö
I want to evaluate some different distance measures for time-space clustering, so I need an API to implement my own function in Java. 2015-05-19 22:08 GMT+02:00 Xiangrui Meng : > Just curious, what distance measure do you need? -Xiangrui > > On Mon, May 11, 2015 at 8:28 AM, Jaonary Rabarisoa > wrote

Re: [Streaming] Non-blocking recommendation in custom receiver documentation and KinesisReceiver's worker.run blocking call

2015-05-21 Thread Tathagata Das
Thanks for the JIRA. I will look into this issue. TD On Thu, May 21, 2015 at 1:31 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > I ran into one of the issues that are potentially caused because of this > and have logged a JIRA bug - > https://issues.apache.org/jira/browse/SPARK-7788

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Akhil Das
Can you share the code? Maybe I or someone else can help you out. Thanks Best Regards On Thu, May 21, 2015 at 1:45 PM, Allan Jie wrote: > Hi, > > Just check the logs of datanode, it looks like this: > > 2015-05-20 11:42:14,605 INFO > org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: / >

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Allan Jie
Sure, the code is very simple. I think you can understand it from the main function. public class Test1 { public static double[][] createBroadcastPoints(String localPointPath, int row, int col) throws IOException{ BufferedReader br = RAWF.reader(localPointPath); String line = null; int rowIndex =

Re: java program got Stuck at broadcasting

2015-05-21 Thread allanjie
Sure, the code is very simple. I think you can understand it from the main function. public class Test1 { public static double[][] createBroadcastPoints(String localPointPath, int row, int col) throws IOException{ BufferedReader br = RAWF.reader(localPointPath);

Re: How to use spark to access HBase with Security enabled

2015-05-21 Thread donhoff_h
Hi, Many thanks for the help. My Spark version is 1.3.0 too and I run it on Yarn. According to your advice I have changed the configuration. Now my program can read the hbase-site.xml correctly. And it can also authenticate with zookeeper successfully. But I meet a new problem that is my prog

Re: Spark Streaming - Design considerations/Knobs

2015-05-21 Thread Hemant Bhanawat
Honestly, given the length of my email, I didn't expect a reply. :-) Thanks for reading and replying. However, I have a follow-up question: I don't think I understand block replication completely. Are the blocks replicated immediately after they are received by the receiver? Or are they kep

DataFrame Column Alias problem

2015-05-21 Thread SLiZn Liu
Hi Spark Users Group, I’m doing groupBy operations on my DataFrame *df* as follows, to get a count for each value of col1: > df.groupBy("col1").agg("col1" -> "count").show // I don't know if I should > write it like this. col1 COUNT(col1#347) aaa 2 bbb 4 ccc 4 ... and more... As I ‘d li

Re: rdd.saveAsTextFile problem

2015-05-21 Thread Akhil Das
On Thu, May 21, 2015 at 4:17 PM, Howard Yang wrote: > follow > http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os > to build latest version Hadoop in my windows machine, > and Add Environment Variable *HADOOP_HOME* and edit *Path* Variable to

Re: PySpark Logs location

2015-05-21 Thread Oleg Ruchovets
It doesn't work for me so far; I used the command but got the output below. What should I check to fix the issue? Any configuration parameters ... [root@sdo-hdp-bd-master1 ~]# yarn logs -applicationId application_1426424283508_0048 15/05/21 13:25:09 INFO impl.TimelineClientImpl: Timeline service address:

Re: Unable to use hive queries with constants in predicates

2015-05-21 Thread Yana Kadiyska
I have not seen this error but have seen another user have weird parser issues before: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3ccag6lhyed_no6qrutwsxeenrbqjuuzvqtbpxwx4z-gndqoj3...@mail.gmail.com%3E I would attach a debugger and see what is going on -- if I'm looking a

Re: Question about Serialization in Storage Level

2015-05-21 Thread Todd Nist
From the docs, https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence: Storage Level: MEMORY_ONLY. Meaning: Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're n
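A tiny illustration of the difference (toy RDDs; pick one storage level per RDD):

    import org.apache.spark.storage.StorageLevel

    val rddA = sc.parallelize(1 to 1000000)
    val rddB = sc.parallelize(1 to 1000000)
    rddA.persist(StorageLevel.MEMORY_AND_DISK)      // cached as deserialized Java objects
    rddB.persist(StorageLevel.MEMORY_AND_DISK_SER)  // cached as serialized bytes: more CPU to read, less memory used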

map reduce ?

2015-05-21 Thread Yasemin Kaya
Hi, I have a JavaPairRDD> and, as an example, here is what I want to get:
user_id cat1 cat2 cat3 cat4
522 0 1 2 0
62 1 0 3 0
661 1 2 0 1
query: the users who have a number (other than 0) in the cat1 and cat3 columns
answer: cat2 -> 522,611 & cat3 -> 522,62 = user 522
How can I get this solution?
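A hedged Scala sketch of one way to express such a query; the row layout (user_id, Array(cat1..cat4)) and the choice of columns are assumptions made for illustration:

    // Sample data mirroring the table above.
    val rows = sc.parallelize(Seq(
      (522L, Array(0, 1, 2, 0)),
      (62L,  Array(1, 0, 3, 0)),
      (661L, Array(1, 2, 0, 1))))

    // Users with a non-zero value in both chosen columns (here indices 1 and 2, i.e. cat2 and cat3).
    val users = rows.filter { case (_, cats) => cats(1) != 0 && cats(2) != 0 }.keys.collect()
    // users == Array(522)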

Re: java program Get Stuck at broadcasting

2015-05-21 Thread Allan Jie
Hey, I think I found the problem. Turns out that the file I saved is too large. On 21 May 2015 at 16:44, Akhil Das wrote: > Can you share the code? Maybe I or someone else can help you out > > Thanks > Best Regards > > On Thu, May 21, 2015 at 1:45 PM, Allan Jie wrote: > >> Hi, >> >> Just check th

Re: Multi user setup and saving a DataFrame / RDD to a network exported file system

2015-05-21 Thread Tomasz Fruboes
Hi, thanks for the answer, I'll open a ticket. In the meantime I have found a workaround. The recipe is the following: 1. Create a new account/group on all machines (let's call it sparkuser). Run spark from this account. 2. Add your user to the group sparkuser. 3. If you decide to write RDD/parq

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-21 Thread Grega Kešpret
Hi, is this fixed in master? Grega On Thu, May 14, 2015 at 7:50 PM, Michael Armbrust wrote: > End of the month is the target: > https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage > > On Thu, May 14, 2015 at 3:45 AM, Ishwardeep Singh < > ishwardeep.si...@impetus.co.in> wrote: > >>

RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
Is there any other way to solve the problem? Let me state the use case: I have an RDD[Document] containing over 7 million items. The RDD needs to be saved on persistent storage (currently I save it as an object file on disk). Then I need to get a small random sample of Document objects (e.g. 10,000 d

saveAsTextFile() part- files are missing

2015-05-21 Thread rroxanaioana
Hello! I just started with Spark. I have an application which counts words in a file (1 MB file). The file is stored locally. I loaded the file using native code and then created the RDD from it. JavaRDD rddFromFile = context.parallelize(myFile, 2);

Re: DataFrame Column Alias problem

2015-05-21 Thread Ram Sriharsha
df.groupBy($"col1").agg(count($"col1").as("c")).show On Thu, May 21, 2015 at 3:09 AM, SLiZn Liu wrote: > Hi Spark Users Group, > > I’m doing groupby operations on my DataFrame *df* as following, to get > count for each value of col1: > > > df.groupBy("col1").agg("col1" -> "count").show // I don'

Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
I guess the fundamental issue is that these aren't stored in a way that allows random access to a Document. Underneath, Hadoop has a concept of a MapFile, which is like a SequenceFile with an index of offsets into the file where records begin. Although Spark doesn't use it, you could maybe create s

Re: java program got Stuck at broadcasting

2015-05-21 Thread Akhil Das
Can you try commenting out the saveAsTextFile and doing a simple count()? If it's a broadcast issue, then it would throw up the same error. On 21 May 2015 14:21, "allanjie" wrote: > Sure, the code is very simple. I think you can understand it from the main > function. > > public class Test1 { > >

Spark HistoryServer not coming up

2015-05-21 Thread roy
Hi, After restarting the Spark HistoryServer, it failed to come up. I checked the logs for the Spark HistoryServer and found the following messages: 2015-05-21 11:38:03,790 WARN org.apache.spark.scheduler.ReplayListenerBus: Log path provided contains no log files. 2015-05-21 11:38:52,319 INFO org.apache.spark.

Re: saveAsTextFile() part- files are missing

2015-05-21 Thread Tomasz Fruboes
Hi, it looks like you are writing to a local filesystem. Could you try writing to a location visible to all nodes (master and workers), e.g. an NFS share? HTH, Tomasz On 21.05.2015 at 17:16, rroxanaioana wrote: Hello! I just started with Spark. I have an application which counts words in a fi
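For example (rddFromFile is the RDD from the original post; the output paths are placeholders, the point is simply that the driver and every worker must see the same location):

    // Write to HDFS, or any filesystem mounted identically on all nodes, instead of a worker-local path.
    rddFromFile.saveAsTextFile("hdfs:///user/roxana/wordcount-output")
    // or, on a shared NFS mount:
    // rddFromFile.saveAsTextFile("file:///mnt/shared/wordcount-output")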

Re: Spark HistoryServer not coming up

2015-05-21 Thread Marcelo Vanzin
Seems like there might be a mismatch between your Spark jars and your cluster's HDFS version. Make sure you're using the Spark jar that matches the hadoop version of your cluster. On Thu, May 21, 2015 at 8:48 AM, roy wrote: > Hi, > >After restarting Spark HistoryServer, it failed to come up,

Re: Spark Streaming graceful shutdown in Spark 1.4

2015-05-21 Thread Shixiong Zhu
My 2 cents: As per javadoc: https://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#addShutdownHook(java.lang.Thread) "Shutdown hooks should also finish their work quickly. When a program invokes exit the expectation is that the virtual machine will promptly shut down and exit. When the

Query a Dataframe in rdd.map()

2015-05-21 Thread ping yan
I have a dataframe as a reference table for IP frequencies, e.g.:
ip freq
10.226.93.67 1
10.226.93.69 1
161.168.251.101 4
10.236.70.2 1
161.168.251.105 14
All I need is to query the df in a map. rdd = sc.parallelize(['208.51.22.18', '31.207.6.17

Re: Spark Streaming + Kafka failure recovery

2015-05-21 Thread Bill Jay
Hi Cody, That is clear. Thanks! Bill On Tue, May 19, 2015 at 1:27 PM, Cody Koeninger wrote: > If you checkpoint, the job will start from the successfully consumed > offsets. If you don't checkpoint, by default it will start from the > highest available offset, and you will potentially lose da

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Holden Karau
So DataFrames, like RDDs, can only be accessed from the driver. If your IP frequency table is small enough you could collect it and distribute it as a hashmap with broadcast, or you could also join your RDD with the IP frequency table. Hope that helps :) On Thursday, May 21, 2015, ping yan wrote:
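A rough Scala sketch of the broadcast approach; ipFreqDf and ips are assumed names for the small frequency DataFrame and the RDD of IP strings, and the column types may differ in real data:

    // Collect the small table to the driver and ship it to the executors as a broadcast variable.
    val freqMap = ipFreqDf.collect().map(r => r.getString(0) -> r.getLong(1)).toMap
    val freqBc  = sc.broadcast(freqMap)

    // The lookup then happens inside map() without touching the DataFrame from the executors.
    val freqs = ips.map(ip => ip -> freqBc.value.getOrElse(ip, 0L))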

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread ping yan
Thanks. I suspected that, but a df query inside a map sounds so intuitive that I didn't want to just give up. I tried join, and even better with a DStream.transform(), and it works! freqs = testips.transform(lambda rdd: rdd.join(kvrdd).map(lambda (x,y): y[1])) Thank you for the help!

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Ram Sriharsha
Your original code snippet seems incomplete and there isn't enough information to figure out what problem you actually ran into. From your original code snippet, there is an rdd variable which is well defined and a df variable that is not defined in the snippet of code you sent. One way to make thi

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Ram Sriharsha
Never mind... I didn't realize you were referring to the first table as df. So you want to do a join between the first table and an RDD? The right way to do it within the DataFrame construct is to think of it as a join: map the second RDD to a DataFrame and do an inner join on ip On Thu, May 21

RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
It doesn't need to be 100% random. How about randomly picking a few partitions and returning all docs in those partitions? Is rdd.mapPartitionsWithIndex() the right method to use to process just a small portion of the partitions? Ningjun -Original Message- From: Sean Owen [mailto:so...@cloudera.co

Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
If sampling whole partitions is sufficient (or a part of a partition), sure you could mapPartitionsWithIndex and decide whether to process a partition at all based on its # and skip the rest. That's much faster. On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV) wrote: > I don't need to be
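A sketch of that idea (rdd stands in for the saved RDD[Document]; the 10% fraction and the per-partition take of 100 are arbitrary):

    import scala.util.Random

    // Decide on the driver which partitions to keep, then skip the rest entirely.
    val keep = rdd.partitions.indices.filter(_ => Random.nextDouble() < 0.1).toSet
    val sample = rdd.mapPartitionsWithIndex { (idx, iter) =>
      if (keep(idx)) iter.take(100) else Iterator.empty
    }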

Pipelining with Spark

2015-05-21 Thread dgoldenberg
From the performance and scalability standpoint, is it better to plug in, say, a multi-threaded pipeliner into a Spark job, or to implement pipelining via Spark's own transformation mechanisms such as map or filter? I'm seeing some reference architectures where things like 'morphlines' are plugg

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Tathagata Das
Looks like somehow the file size reported by the FSInputDStream of Tachyon's FileSystem interface, is returning zero. On Mon, May 11, 2015 at 4:38 AM, Dibyendu Bhattacharya < dibyendu.bhattach...@gmail.com> wrote: > Just to follow up this thread further . > > I was doing some fault tolerant testi

Official Docker container for Spark

2015-05-21 Thread tridib
Hi, I am using spark 1.2.0. Can you suggest Docker containers which can be deployed in production? I found a lot of Spark images in https://registry.hub.docker.com/ but could not figure out which one to use. None of them seems like an official image. Does anybody have any recommendation? Thanks Tri

Re: Spark Streaming on top of Cassandra?

2015-05-21 Thread tshah77
Can someone provide an example of Spark Streaming using Java? I have Cassandra running but did not configure Spark, and would like to create a DStream. Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-on-top-of-Cassandra-tp1283p22978.html

Re: Connecting to an inmemory database from Spark

2015-05-21 Thread tshah77
TD, Do you have any example of reading from Cassandra using Spark Streaming in Java? I am trying to connect to Cassandra using Spark Streaming and it is throwing an error: could not parse master URL. Thanks Tejas -- View this message in context: http://apache-spark-user-list.1001560.n3.

Re: Spark Streaming on top of Cassandra?

2015-05-21 Thread jay vyas
Hi. I have a Spark Streaming -> Cassandra application which you can probably borrow pretty easily. You can always rewrite a part of it in Java if you need to, or else you can just use Scala (see the blog post below if you want a Java-style dev workflow with Scala using IntelliJ). This application

Spark MOOC - early access

2015-05-21 Thread Marco Shaw
Hi Spark Devs and Users, BerkeleyX and Databricks are currently developing two Spark-related MOOCs on edX (intro, ml), the first of which

Re: Connecting to an inmemory database from Spark

2015-05-21 Thread Tathagata Das
Doesn't seem like a Cassandra-specific issue. Could you give us more information (code, errors, stack traces)? On Thu, May 21, 2015 at 1:33 PM, tshah77 wrote: > TD, > > Do you have any example about reading from cassandra using spark streaming > in java? > > I am trying to connect to cassandra u

Re: How to use spark to access HBase with Security enabled

2015-05-21 Thread Ted Yu
Are the worker nodes colocated with HBase region servers? Were you running as the hbase superuser? You may need to login, using code similar to the following: if (isSecurityEnabled()) { SecurityUtil.login(conf, fileConfKey, principalConfKey, localhost); } SecurityUtil is ha
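A rough Scala rendering of that login call; the configuration key names and the hostname below are placeholders, so substitute whatever keys your deployment actually defines:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.{SecurityUtil, UserGroupInformation}

    val conf = new Configuration()
    if (UserGroupInformation.isSecurityEnabled) {
      // the keytab-file key, principal key and host here are made-up placeholders
      SecurityUtil.login(conf, "hbase.keytab.file", "hbase.kerberos.principal", "worker01.example.com")
    }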

Re: How to use spark to access HBase with Security enabled

2015-05-21 Thread Bill Q
What I found with the CDH-5.4.1 Spark 1.3, the spark.executor.extraClassPath setting is not working. Had to use SPARK_CLASSPATH instead. On Thursday, May 21, 2015, Ted Yu wrote: > Are the worker nodes colocated with HBase region servers ? > > Were you running as hbase super user ? > > You may ne

Re: Spark HistoryServer not coming up

2015-05-21 Thread roy
This got resolved after cleaning "/user/spark/applicationHistory/*" -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-HistoryServer-not-coming-up-tp22975p22981.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

running spark on yarn

2015-05-21 Thread Nathan Kronenfeld
Hello, folks. We just recently switched to using Yarn on our cluster (when upgrading to cloudera 5.4.1) I'm trying to run a spark job from within a broader application (a web service running on Jetty), so I can't just start it using spark-submit. Does anyone know of an instructions page on how t

foreach plus accumulator Vs mapPartitions performance

2015-05-21 Thread ben
Hi, everybody. There are some cases in which I can obtain the same results by using the mapPartitions and the foreach methods. For example, in a typical MapReduce approach one would perform a reduceByKey immediately after a mapPartitions that transforms the original RDD into a collection of tuples (ke

foreach vs foreachPartitions

2015-05-21 Thread ben
I would like to know if foreachPartitions will result in better performance, due to a higher level of parallelism, compared to the foreach method, considering the case in which I'm flowing through an RDD in order to perform some sums into an accumulator variable. Thank you, Beniamino. -
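A small sketch contrasting the two, assuming the goal is summing into an accumulator (nums is a made-up RDD):

    val nums = sc.parallelize(1L to 1000000L)

    val accA = sc.accumulator(0L)
    nums.foreach(n => accA += n)               // one accumulator update per element

    val accB = sc.accumulator(0L)
    nums.foreachPartition { iter =>
      var local = 0L
      iter.foreach(local += _)                 // sum locally first...
      accB += local                            // ...then a single accumulator update per partition
    }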

Re: Spark MOOC - early access

2015-05-21 Thread Kartik Mehta
Awesome, thanks a ton for helping us all and for the forward planning. Much appreciated. Regards, Kartik On May 21, 2015 4:41 PM, "Marco Shaw" wrote: > Hi Spark Devs and Users, BerkeleyX and Databricks are currently developing > two Spark-related MOOCs on edX (intro

Re: S3NativeFileSystem inefficient implementation when calling sc.textFile

2015-05-21 Thread Peng Cheng
I stumbled upon this thread and I conjecture that this may affect restoring a checkpointed RDD as well: http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928 In my case I have 1600+ fragmented che

Pandas timezone problems

2015-05-21 Thread Def_Os
After deserialization, something seems to be wrong with my pandas DataFrames. It looks like the timezone information is lost, and subsequent errors ensue. Serializing and deserializing a timezone-aware DataFrame tests just fine, so it must be Spark that somehow changes the data. My program runs t

Re: NegativeArraySizeException when doing joins on skewed data

2015-05-21 Thread jstripit
I ran into this problem yesterday, but outside of the context of Spark. It's a limitation of Kryo's IdentityObjectIntMap. In Spark you might try using Java's internal serializer instead. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NegativeArraySizeE

Re: Pandas timezone problems

2015-05-21 Thread Xiangrui Meng
These are relevant: JIRA: https://issues.apache.org/jira/browse/SPARK-6411 PR: https://github.com/apache/spark/pull/6250 On Thu, May 21, 2015 at 3:16 PM, Def_Os wrote: > After deserialization, something seems to be wrong with my pandas DataFrames. > It looks like the timezone information is lost

Re: PySpark Logs location

2015-05-21 Thread Ruslan Dautkhanov
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application When log aggregation isn’t turned on, logs are retained locally on each machine under YARN_APP_LOGS_DIR, which is usually configured to /tmp/logs or $HADOOP_HOME/logs/userlogs depending on the Hadoop version and in

Re: [pyspark] Starting workers in a virtualenv

2015-05-21 Thread Davies Liu
Could you try specifying PYSPARK_PYTHON as the path to the python in your virtualenv, for example PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py On Mon, Apr 20, 2015 at 12:51 AM, Karlson wrote: > Hi all, > > I am running the Python process that communicates with Spark in a > virtua

Kmeans Labeled Point RDD

2015-05-21 Thread anneywarlord
Hello, New to Spark. I wanted to know if it is possible to use a Labeled Point RDD in org.apache.spark.mllib.clustering.KMeans. After I cluster my data I would like to be able to identify which observations were grouped with each centroid. Thanks -- View this message in context: http://apache

Re: foreach plus accumulator Vs mapPartitions performance

2015-05-21 Thread Burak Yavuz
Or you can simply use `reduceByKeyLocally` if you don't want to worry about implementing accumulators and such, and assuming that the reduced values will fit in memory of the driver (which you are assuming by using accumulators). Best, Burak On Thu, May 21, 2015 at 2:46 PM, ben wrote: > Hi, eve
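For illustration, with a toy RDD of (key, value) pairs; whether the driver-side map fits in memory is the same caveat as with accumulators:

    val pairs = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))

    // Reduces within each partition, then merges the per-partition maps on the driver:
    val onDriver = pairs.reduceByKeyLocally(_ + _)   // Map("a" -> 4, "b" -> 2)

    // The fully distributed alternative, if the result does not need to live on the driver:
    val distributed = pairs.reduceByKey(_ + _)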

Re: Kmeans Labeled Point RDD

2015-05-21 Thread Krishna Sankar
You can predict and then zip the predictions with the points RDD to get approximately the same thing as a LabeledPoint. Cheers On Thu, May 21, 2015 at 6:19 PM, anneywarlord wrote: > Hello, > > New to Spark. I wanted to know if it is possible to use a Labeled Point RDD > in org.apache.spark.mllib.clustering.KMeans. After I cluster my d
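A small sketch of that zip approach with toy vectors (assuming the mllib KMeans API; strip the label off each LabeledPoint first if that is your input type):

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.1), Vectors.dense(9.0, 8.0), Vectors.dense(0.2, 0.0)))
    val model = KMeans.train(points, 2, 20)               // k = 2, maxIterations = 20

    // Pair each original point with the id of the centroid it was assigned to.
    val assignments = model.predict(points).zip(points)   // RDD[(clusterId, point)]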

task all finished, while the stage marked finish long time later problem

2015-05-21 Thread 邓刚 [Technology Center]
Hi all, We are running Spark Streaming with version 1.1.1. Recently we found an odd problem. In stage 44554, all the tasks finished, but the stage was marked finished a long time later, as you can see in the log below; the last task finished @15/05/21 21:17:36 And also the sta

Re: Spark Streaming with Tachyon : Data Loss on Receiver Failure due to WAL error

2015-05-21 Thread Dibyendu Bhattacharya
Hi Tathagata, Thanks for looking into this. Investigating further, I found that the issue is that Tachyon does not support file append. The streaming receiver that writes to the WAL, once it fails and is restarted, is not able to append to the same WAL file after the restart. I raised this with the Tachyon user gr

LDA prediction on new document

2015-05-21 Thread Dani Qiu
Hi, guys, I'm pretty new to LDA. I notice Spark 1.3.0 MLlib provides an EM-based LDA implementation. It returns both the topics and the topic distributions. My question is how can I use these parameters to predict on a new document? And I notice there is an Online LDA implementation in the Spark master branch, it

Re: How to use spark to access HBase with Security enabled

2015-05-21 Thread donhoff_h
Hi, Thanks very much for the reply. I have tried the "SecurityUtil". I can see from the log that this statement executed successfully, but I still can not pass the authentication of HBase. And with more experiments, I found a new interesting scenario. If I run the program with yarn-client mode, the

Re: DataFrame Column Alias problem

2015-05-21 Thread SLiZn Liu
However, this returns a single column c, without showing the original col1. On Thu, May 21, 2015 at 11:25 PM Ram Sriharsha wrote: > df.groupBy($"col1").agg(count($"col1").as("c")).show > > On Thu, May 21, 2015 at 3:09 AM, SLiZn Liu wrote: > >> Hi Spark Users Group, >> >> I’m doing groupby

Re: LDA prediction on new document

2015-05-21 Thread Ken Geis
Dani, this appears to be addressed in SPARK-5567, scheduled for Spark 1.5.0. Ken On May 21, 2015, at 11:12 PM, user-digest-h...@spark.apache.org wrote: > From: Dani Qiu > Subject: LDA prediction on new document > Date: May 21, 2015 at 8:48:40 PM PDT > To: user@spark.apache.org > > > Hi, guys

Re: DataFrame Column Alias problem

2015-05-21 Thread Reynold Xin
In 1.4 it actually shows col1 by default. In 1.3, you can add "col1" to the output, i.e. df.groupBy($"col1").agg($"col1", count($"col1").as("c")).show() On Thu, May 21, 2015 at 11:22 PM, SLiZn Liu wrote: > However this returns a single column of c, without showing the original > col1. > ​ > >

Re: rdd.sample() methods very slow

2015-05-21 Thread Reynold Xin
You can do something like this:
val myRdd = ...
// this samples 10% of the partitions
val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1)
// take the first 10 elements out of each partition
rddSampledByPartition.mapPartitions { iter => iter.take(10) }

Re: [pyspark] Starting workers in a virtualenv

2015-05-21 Thread Karlson
That works, thank you! On 2015-05-22 03:15, Davies Liu wrote: Could you try with specify PYSPARK_PYTHON to the path of python in your virtual env, for example PYSPARK_PYTHON=/path/to/env/bin/python bin/spark-submit xx.py On Mon, Apr 20, 2015 at 12:51 AM, Karlson wrote: Hi all, I am running