How can I retrieve item-pair after calculating similarity using RowMatrix

2015-04-24 Thread amghost
I have encountered the "all-pairs similarity" problem in my recommendation system. Thanks to this Databricks blog, it seems RowMatrix may come to help. However, RowMatrix is a matrix type without meaningful row indices, so I don't know how to retrieve the similarity result after invoking colu
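
A minimal sketch of one way to recover item pairs, assuming the matrix was built so that each column corresponds to an item and a column-index-to-item-id lookup was kept on the side (the item ids, data, and spark-shell sc here are hypothetical):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical lookup from column index to item id, kept when the matrix was built.
val itemIds: Map[Long, String] = Map(0L -> "item-0", 1L -> "item-1", 2L -> "item-2")

val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 0.0, 3.0),
  Vectors.dense(0.0, 2.0, 1.0)))
val mat = new RowMatrix(rows)

// columnSimilarities() returns a CoordinateMatrix; each MatrixEntry holds
// (column index i, column index j, cosine similarity), so the indices can be
// mapped back to item pairs.
val itemPairSimilarities = mat.columnSimilarities().entries.map { e =>
  (itemIds(e.i), itemIds(e.j), e.value)
}
itemPairSimilarities.collect().foreach(println)
```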

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
Here you go:
t = [["A",10,"A10"],["A",20,"A20"],["A",30,"A30"],["B",15,"B15"],["C",10,"C10"],["C",20,"C200"]]
TRDD = sc.parallelize(t).map(lambda t: Row(name=str(t[0]), age=int(t[1]), other=str(t[2])))
TDF = ssc.createDataFrame(TRDD)
print TDF.printSchema()
TDF.registerTempTable("

Re: Customized Aggregation Query on Spark SQL

2015-04-24 Thread ayan guha
Can you give an example set of data and the desired output? On Sat, Apr 25, 2015 at 2:32 PM, Wenlei Xie wrote: > Hi, > > I would like to answer the following customized aggregation query on Spark > SQL > 1. Group the table by the value of Name > 2. For each group, choose the tuple with the max value

Customized Aggregation Query on Spark SQL

2015-04-24 Thread Wenlei Xie
Hi, I would like to answer the following customized aggregation query on Spark SQL 1. Group the table by the value of Name 2. For each group, choose the tuple with the max value of Age (the ages are distinct for every name) I am wondering what's the best way to do it on Spark SQL? Should I use UD
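
A Scala sketch of one way to express this in Spark SQL, following the sample data discussed in this thread (assumes a spark-shell sqlContext; schema and names are illustrative):

```scala
case class Record(name: String, age: Int, other: String)

val df = sqlContext.createDataFrame(Seq(
  Record("A", 10, "A10"), Record("A", 20, "A20"), Record("A", 30, "A30"),
  Record("B", 15, "B15"),
  Record("C", 10, "C10"), Record("C", 20, "C200")))
df.registerTempTable("t")

// Because ages are distinct per name, joining back on the per-group max age
// selects exactly one tuple per group.
val result = sqlContext.sql("""
  SELECT t.name, t.age, t.other
  FROM t
  JOIN (SELECT name, MAX(age) AS maxAge FROM t GROUP BY name) m
    ON t.name = m.name AND t.age = m.maxAge
""")
result.show()
```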

Re: sparksql - HiveConf not found during task deserialization

2015-04-24 Thread Manku Timma
Setting SPARK_CLASSPATH is triggering other errors. Not working. On 25 April 2015 at 09:16, Manku Timma wrote: > Actually found the culprit. The JavaSerializerInstance.deserialize is > called with a classloader (of type MutableURLClassLoader) which has access > to all the hive classes. But inte

Re: sparksql - HiveConf not found during task deserialization

2015-04-24 Thread Manku Timma
Actually found the culprit. The JavaSerializerInstance.deserialize is called with a classloader (of type MutableURLClassLoader) which has access to all the hive classes. But internally it triggers a call to loadClass with the default classloader. Below is the stacktrace (line numbers in the Jav

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread ayan guha
I just tested your pr On 25 Apr 2015 10:18, "Ali Bajwa" wrote: > Any ideas on this? Any sample code to join 2 data frames on two columns? > > Thanks > Ali > > On Apr 23, 2015, at 1:05 PM, Ali Bajwa wrote: > > > Hi experts, > > > > Sorry if this is a n00b question or has already been answered...

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread ayan guha
I just tested; your observation about the DataFrame API is correct. It behaves weirdly in the case of a multiple-column join. (Maybe we should report a JIRA?) Solution: You can go back to our good old composite-key field-concatenation method. Not ideal, but a workaround. (Of course you can use real SQL as well,

Re: Number of input partitions in SparkContext.sequenceFile

2015-04-24 Thread Wenlei Xie
Hi, I checked the number of partitions by System.out.println("INFO: RDD with " + rdd.partitions().size() + " partitions created."); Each single split is about 100MB. I am currently loading the data from the local file system; would this explain this observation? Thank you! Best, Wenlei On Tue,

Re: ORCFiles

2015-04-24 Thread Ted Yu
Please see SPARK-2883. There is no Fix Version yet. On Fri, Apr 24, 2015 at 5:45 PM, David Mitchell wrote: > Does anyone know in which version of Spark will there be support for > ORCFiles via spark.sql.hive? Will it be in 1.4? > > David > > > >

ORCFiles

2015-04-24 Thread David Mitchell
Does anyone know in which version of Spark there will be support for ORCFiles via spark.sql.hive? Will it be in 1.4? David

Re: Creating a Row in SparkSQL 1.2 from ArrayList

2015-04-24 Thread Wenlei Xie
Using Object[] in Java just works :). On Fri, Apr 24, 2015 at 4:56 PM, Wenlei Xie wrote: > Hi, > > I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java > by using a List? It looks like > > ArrayList something; > Row.create(something) > > will create a row with a single column
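
For reference, a sketch of the Scala-side equivalent; the method name is from the Scala API of later 1.x releases, so treat it as an assumption rather than the 1.2 Java API discussed above:

```scala
import org.apache.spark.sql.Row

// Row.fromSeq spreads a collection across columns, whereas passing a single
// collection argument (as in Row.create(list) above) yields one column that
// contains the whole collection.
val values = Seq("alice", 42, 3.14)
val row = Row.fromSeq(values)   // a Row with three columns
```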

Re: Question regarding join with multiple columns with pyspark

2015-04-24 Thread Ali Bajwa
Any ideas on this? Any sample code to join 2 data frames on two columns? Thanks Ali On Apr 23, 2015, at 1:05 PM, Ali Bajwa wrote: > Hi experts, > > Sorry if this is a n00b question or has already been answered... > > Am trying to use the data frames API in python to join 2 dataframes > with mor

RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
Hi Burak, Thanks for this insight. I’m curious to know, how did you reach the conclusion that GC pauses were to blame? I’d like to gather some more diagnostic information to determine whether or not I’m facing a similar scenario. ~ Andrew From: Burak Yavuz [mailto:brk...@gmail.com] Sent: Th

RE: indexing an RDD [Python]

2015-04-24 Thread Pagliari, Roberto
Hi, I may need to read many values. The list [0,4,5,6,8] contains the locations of the rows I'd like to extract from the RDD (of LabeledPoints). Could you possibly provide a quick example? Also, I'm not quite sure how this works, but the resulting RDD should be a clone, as I may need to modify the valu

RE: Understanding Spark/MLlib failures

2015-04-24 Thread Andrew Leverentz
Hi Reza, I’m trying to identify groups of similar variables, with the ultimate goal of reducing the dimensionality of the dataset. I believe SVD would be sufficient for this, although I also tried running RowMatrix.computeSVD and observed the same behavior: frequent task failures, with crypti

Re: contributing code - how to test

2015-04-24 Thread Sean Owen
The standard incantation -- which is a little different from standard Maven practice -- is: mvn -DskipTests [your options] clean package, followed by mvn [your options] test. Some tests require the assembly, so you have to do it this way. I don't know what the test failures were since you didn't post them, but I'm

Re: StreamingContext.textFileStream issue

2015-04-24 Thread Prannoy
Try putting in files with different file names and see if the stream is able to detect them. On 25-Apr-2015 3:02 am, "Yang Lei [via Apache Spark User List]" < ml-node+s1001560n22650...@n3.nabble.com> wrote: > I hit the same issue "as if the directory has no files at all" when > running the sample "exa

contributing code - how to test

2015-04-24 Thread Deborah Siegel
Hi, I selected a "starter task" in JIRA, and made changes to my github fork of the current code. I assumed I would be able to build and test. % mvn clean compile was fine but %mvn package failed [ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18:test (default-test

Re: Pyspark where do third parties libraries need to be installed under Yarn-client mode

2015-04-24 Thread Christian Perez
To run MLlib, you only need numpy on each node. For additional dependencies, you can call the spark-submit with --py-files option and add the .zip or .egg. https://spark.apache.org/docs/latest/submitting-applications.html Cheers, Christian On Fri, Apr 24, 2015 at 1:56 AM, Hoai-Thu Vuong wrote:

Re: indexing an RDD [Python]

2015-04-24 Thread Sven Krasser
The solution depends largely on your use case. I assume the index is in the key. In that case, you can make a second RDD out of the list of indices and then use cogroup() on both. If the list of indices is small, just using filter() will work well. If you need to read back a few select values to

Re: Spark on Mesos

2015-04-24 Thread Yang Lei
in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage > 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor > 20150424-104711-1375862026-5050-20113-S1 lost) > Driver stacktrace: > at org.apache.spark.scheduler.DAGScheduler.org > $apache$spark$scheduler

Re: Spark on Mesos

2015-04-24 Thread Tim Chen
ob aborted due to stage failure: Task 5 > in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage > 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor > 20150424-104711-1375862026-5050-20113-S1 lost) > Driver stacktrace: > at org.apache.spark.scheduler.D

Re: StreamingContext.textFileStream issue

2015-04-24 Thread Yang Lei
I hit the same issue "as if the directory has no files at all" when running the sample "examples/src/main/python/streaming/hdfs_wordcount.py" with a local directory, and adding file into that directory . Appreciate comments on how to resolve this. -- View this message in context: http://apache-

Spark on Mesos

2015-04-24 Thread Stephen Carman
aborted due to stage failure: Task 5 in stage 0.0 failed 4 times, most recent failure: Lost task 5.3 in stage 0.0 (TID 23, 10.253.1.117): ExecutorLostFailure (executor 20150424-104711-1375862026-5050-20113-S1 lost) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-24 Thread N B
Hi TD, That little experiment helped a bit. This time we did not see any exceptions for about 16 hours but eventually it did throw the same exceptions as before. The cleaning of the shuffle files also stopped much before these exceptions happened - about 7-1/2 hours after startup. I am not quite

indexing an RDD [Python]

2015-04-24 Thread Pagliari, Roberto
I have an RDD of LabeledPoints. Is it possible to select a subset of it based on a list of indices? For example with idx=[0,4,5,6,8], I'd like to be able to create a new RDD with elements 0,4,5,6 and 8.

Creating a Row in SparkSQL 1.2 from ArrayList

2015-04-24 Thread Wenlei Xie
Hi, I am wondering if there is any way to create a Row in SparkSQL 1.2 in Java by using a List? It looks like ArrayList something; Row.create(something) will create a row with a single column (and the single column contains the array) Best, Wenlei

Spark Internal Job Scheduling

2015-04-24 Thread Arpit1286
I came across the feature in Spark that allows you to schedule different tasks within a Spark context. I want to implement this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], subsequently make a composite key-value RDD [(K1,K2),V], and a filtered

DAG

2015-04-24 Thread Giovanni Paolo Gibilisco
Hi, I would like to know if it is possible to build the DAG before actually executing the application. My guess is that in the scheduler the DAG is built dynamically at runtime since it might depend on the data, but I was wondering if there is a way (and maybe a tool already) to analyze the code an

Re: tachyon on machines launched with spark-ec2 scripts

2015-04-24 Thread Haoyuan Li
Daniel, Instead of using localhost:19998, you may want to use the real IP address that TachyonMaster is configured with. You should be able to see more info in Tachyon's UI as well. More info could be found here: http://tachyon-project.org/master/Running-Tachyon-on-EC2.html Best, Haoyuan On Fri, Apr 24,

tachyon on machines launched with spark-ec2 scripts

2015-04-24 Thread Daniel Mahler
I have a cluster launched with spark-ec2. I can see a TachyonMaster process running, but I do not seem to be able to use tachyon from the spark-shell. if I try rdd.saveAsTextFile("tachyon://localhost:19998/path") I get 15/04/24 19:18:31 INFO TaskSetManager: Starting task 12.2 in stage 1.0 (TID 2

Re: Non-Deterministic Graph Building

2015-04-24 Thread hokiegeek2
Hi Everyone, Here's the Scala code for generating the EdgeRDD, VertexRDD, and Graph:
//Generate a mapping of vertex (edge) names to VertexIds
val vertexNameToIdRDD = rawEdgeRDD.flatMap(x => Seq(x._1.src, x._1.dst)).distinct.zipWithUniqueId.cache
//Generate VertexRDD with vertex data (in my case,

Re: Join on DataFrames from the same source (Pyspark)

2015-04-24 Thread Michael Armbrust
fixed in master: https://github.com/apache/spark/commit/2d010f7afe6ac8e67e07da6bea700e9e8c9e6cc2 On Wed, Apr 22, 2015 at 12:19 AM, Karlson wrote: > DataFrames do not have the attributes 'alias' or 'as' in the Python API. > > > On 2015-04-21 20:41, Michael Armbrust wrote: > >> This is https://iss

Re: How to debug Spark on Yarn?

2015-04-24 Thread Sven Krasser
On Fri, Apr 24, 2015 at 11:31 AM, Marcelo Vanzin wrote: > > Spark 1.3 should have links to the executor logs in the UI while the > application is running. Not yet in the history server, though. You're absolutely correct -- didn't notice it until now. This is a great addition! -- www.skrasser.

Re: How to debug Spark on Yarn?

2015-04-24 Thread Marcelo Vanzin
On top of what's been said... On Wed, Apr 22, 2015 at 10:48 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > 1) I can go to Spark UI and see the status of the APP but cannot see the > logs as the job progresses. How can i see logs of executors as they progress > ? Spark 1.3 should have links to the executor logs in t

Re: Spark Cluster Setup

2015-04-24 Thread Dean Wampler
It's mostly manual. You could try automating with something like Chef, of course, but there's nothing already available in terms of automation. dean Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sean Owen
No, it prints each Long in that stream, forever. Have a look at the DStream API. On Fri, Apr 24, 2015 at 2:24 PM, Sergio Jiménez Barrio wrote: > But if I use messages.count().print this shows a single number :/

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sergio Jiménez Barrio
But if I use messages.count().print this shows a single number :/ 2015-04-24 20:22 GMT+02:00 Sean Owen : > It's not a Long; it's an infinite stream of Longs. > > On Fri, Apr 24, 2015 at 2:20 PM, Sergio Jiménez Barrio > wrote: > > It isn't the sum. This is the code: > > > > val messages = KafkaUtil

Re: How to debug Spark on Yarn?

2015-04-24 Thread Sven Krasser
For #1, click on a worker node on the YARN dashboard. From there, Tools->Local logs->Userlogs has the logs for each application, and you can view them by executor even while an application is running. (This is for Hadoop 2.4, things may have changed in 2.6.) -Sven On Thu, Apr 23, 2015 at 6:27 AM,

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
oh, I missed that. It is fixed in 1.3.0. Also, Jianshi, the dataset was not generated by Spark SQL, right? On Fri, Apr 24, 2015 at 11:09 AM, Ted Yu wrote: > Yin: > Fix Version of SPARK-4520 is not set. > I assume it was fixed in 1.3.0 > > Cheers > Fix Version > > On Fri, Apr 24, 2015 at 11:00 A

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sean Owen
The sum? you just need to use an accumulator to sum the counts or something. On Fri, Apr 24, 2015 at 2:14 PM, Sergio Jiménez Barrio wrote: > > Sorry for my explanation, my English is bad. I just need obtain the Long > containing of the DStream created by messages.count(). Thanks for all. > -

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sergio Jiménez Barrio
Sorry for my explanation, my English is bad. I just need to obtain the Long contained in the DStream created by messages.count(). Thanks for all. 2015-04-24 20:00 GMT+02:00 Sean Owen : > Do you mean an RDD? I don't think it makes sense to ask if the DStream > has data; it may have no data so far bu

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Ted Yu
Yin: Fix Version of SPARK-4520 is not set. I assume it was fixed in 1.3.0 Cheers Fix Version On Fri, Apr 24, 2015 at 11:00 AM, Yin Huai wrote: > The exception looks like the one mentioned in > https://issues.apache.org/jira/browse/SPARK-4520. What is the version of > Spark? > > On Fri, Apr 24,

Re: Parquet error reading data that contains array of structs

2015-04-24 Thread Yin Huai
The exception looks like the one mentioned in https://issues.apache.org/jira/browse/SPARK-4520. What is the version of Spark? On Fri, Apr 24, 2015 at 2:40 AM, Jianshi Huang wrote: > Hi, > > My data looks like this: > > +---++--+ > | col_name |

Re: regarding ZipWithIndex

2015-04-24 Thread Jeetendra Gangele
Can anyone guide me on how to reduce the size from Long to Int, since I don't need a Long index? I have huge data and this index is taking 8 bytes; if I can reduce it to 4 bytes it will be a great help. On 22 April 2015 at 22:46, Jeetendra Gangele wrote: > Sure thanks. if you can guide me how to do this wil
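
A minimal sketch of the narrowing, assuming the RDD is known to have fewer than Int.MaxValue elements (the String element type and sample data are just for illustration):

```scala
import org.apache.spark.rdd.RDD

// zipWithIndex always hands back Long indices; mapping them to Int right away
// keeps only 4 bytes per record for the index.
val data: RDD[String] = sc.parallelize(Seq("a", "b", "c"))
val indexed: RDD[(String, Int)] =
  data.zipWithIndex().map { case (value, idx) => (value, idx.toInt) }
```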

Re: Slower performance when bigger memory?

2015-04-24 Thread Sven Krasser
FWIW, I ran into a similar issue on r3.8xlarge nodes and opted for more/smaller executors. Another observation was that one large executor results in less overall read throughput from S3 (using Amazon's EMRFS implementation) in case that matters to your application. -Sven On Thu, Apr 23, 2015 at 1

Re: Convert DStream to DataFrame

2015-04-24 Thread Yin Huai
Hi Sergio, I missed this thread somehow... For the error "case classes cannot have more than 22 parameters.", it is the limitation of scala (see https://issues.scala-lang.org/browse/SI-7296). You can follow the instruction at https://spark.apache.org/docs/latest/sql-programming-guide.html#programm

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
Thanks that's why I was worried and tested my application again :). On 24 April 2015 at 23:22, Michal Michalski wrote: > Yes. > > Kind regards, > Michał Michalski, > michal.michal...@boxever.com > > On 24 April 2015 at 17:12, Jeetendra Gangele wrote: > >> you used ZipWithUniqueID? >> >> On 24 A

Re: Convert DStream[Long] to Long

2015-04-24 Thread Sean Owen
foreachRDD is an action and doesn't return anything. It seems like you want one final count, but that's not possible with a stream, since there is conceptually no end to a stream of data. You can get a stream of counts, which is what you have already. You can sum those counts in another data struct
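
A minimal sketch of that "sum the counts in another data structure" idea, assuming a StreamingContext ssc and an input DStream called messages already exist (those names are hypothetical):

```scala
// Keep a driver-side running total of the per-batch counts; each batch of
// messages.count() is a single-element RDD[Long].
val totalCount = ssc.sparkContext.accumulator(0L, "total messages")

messages.count().foreachRDD { rdd =>
  val batchCount = rdd.first()
  totalCount += batchCount
  if (batchCount == 0L) println("no messages in this batch")
  else println(s"received $batchCount messages in this batch, ${totalCount.value} so far")
}
```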

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
Yes. Kind regards, Michał Michalski, michal.michal...@boxever.com On 24 April 2015 at 17:12, Jeetendra Gangele wrote: > you used ZipWithUniqueID? > > On 24 April 2015 at 21:28, Michal Michalski > wrote: > >> I somehow missed zipWithIndex (and Sean's email), thanks for hint. I mean >> - I saw i

Convert DStream[Long] to Long

2015-04-24 Thread Sergio Jiménez Barrio
Hi, I need to check whether the count of received messages is 0 or not, but messages.count() returns a DStream[Long]. I tried this solution: val cuenta = messages.count().foreachRDD{ rdd => rdd.first() } But th

Re: Convert DStream to DataFrame

2015-04-24 Thread Sergio Jiménez Barrio
Solved! I have solved the problem by combining both solutions. The result is this:
messages.foreachRDD { rdd =>
  val message: RDD[String] = rdd.map { y => y._2 }
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
you used ZipWithUniqueID? On 24 April 2015 at 21:28, Michal Michalski wrote: > I somehow missed zipWithIndex (and Sean's email), thanks for hint. I mean > - I saw it before, but I just thought it's not doing what I want. I've > re-read the description now and it looks like it might be actually w

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
I somehow missed zipWithIndex (and Sean's email), thanks for hint. I mean - I saw it before, but I just thought it's not doing what I want. I've re-read the description now and it looks like it might be actually what I need. Thanks. Kind regards, Michał Michalski, michal.michal...@boxever.com On

Re: Some questions on Multiple Streams

2015-04-24 Thread Iulian Dragoș
On Fri, Apr 24, 2015 at 4:56 PM, Laeeq Ahmed wrote: > Thanks Dragos, > > Earlier test shows spark.streaming.concurrentJobs has worked. > Glad to hear it worked! iulian > > Regards, > Laeeq > > > > > On Friday, April 24, 2015 11:58 AM, Iulian Dragoș < > iulian.dra...@typesafe.com> wrote: > >

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
I have an RDD which I get from an HBase scan using newAPIHadoopRDD. I am running zipWithIndex here and it is preserving the order: the first object got 1, the second got 2, the third got 3, and so on; the nth object got n. On 24 April 2015 at 20:56, Ganelin, Ilya wrote: > To maintain the order you can use zipWithInd

Re: Spark Cluster Setup

2015-04-24 Thread James King
Thanks Dean. Sure, I have that set up locally and am testing it with ZK. But to start my multiple Masters, do I need to go to each host and start one there, or is there a better way to do this? Regards jk On Fri, Apr 24, 2015 at 5:23 PM, Dean Wampler wrote: > The convention for standalone cluster is to

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
To maintain the order you can use zipWithIndex as Sean Owen pointed out. This is the same as zipWithUniqueId except the assigned number is the index of the data in the RDD which I believe matches the order of data as it's stored on HDFS. Sent with Good (www.good.com) -Original Message--
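
A small spark-shell sketch contrasting the two methods on a 2-partition RDD (the outputs in the comments assume exactly this partitioning):

```scala
val data = sc.parallelize(Seq("a", "b", "c", "d", "e"), 2)

// zipWithIndex: consecutive indices 0..n-1 following partition order.
data.zipWithIndex().collect()
// Array((a,0), (b,1), (c,2), (d,3), (e,4))

// zipWithUniqueId: partition k gets ids k, k+numPartitions, k+2*numPartitions, ...
// Unique, but not consecutive and not tied to the element order across partitions.
data.zipWithUniqueId().collect()
// Array((a,0), (b,2), (c,1), (d,3), (e,5))
```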

Re: Spark Cluster Setup

2015-04-24 Thread Dean Wampler
The convention for standalone cluster is to use Zookeeper to manage master failover. http://spark.apache.org/docs/latest/spark-standalone.html Dean Wampler, Ph.D. Author: Programming Scala, 2nd Edition (O'Reilly) Typesafe @d

[Ml][Dataframe] Ml pipeline & dataframe repartitioning

2015-04-24 Thread Peter Rudenko
Hi, I have the following problem. I have a dataset with 30 columns (15 numeric, 15 categorical) and am using ml transformers/estimators to transform each column (StringIndexer for categorical & MeanImputor for numeric). This creates 30 more columns in the dataframe. After that I'm using VectorAssembler to combin
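
A trimmed sketch of the kind of pipeline being described, with two categorical and two numeric columns instead of 15+15, and only StringIndexer/VectorAssembler since MeanImputor appears to be a custom transformer; the DataFrame df and the column names are assumptions:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

val catCols = Seq("cat1", "cat2")   // hypothetical categorical columns
val numCols = Seq("num1", "num2")   // hypothetical numeric columns

// One StringIndexer per categorical column, producing <col>_idx.
val indexers = catCols.map { c =>
  new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
}

// Assemble the indexed categorical columns and the numeric columns into one vector.
val assembler = new VectorAssembler()
  .setInputCols((catCols.map(_ + "_idx") ++ numCols).toArray)
  .setOutputCol("features")

val stages: Array[PipelineStage] = (indexers :+ assembler).toArray
val withFeatures = new Pipeline().setStages(stages).fit(df).transform(df)
```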

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
I read it one by one as I need to maintain the order, but it doesn't mean that I process them one by one later. Input lines refer to different entities I update, so once I read them in order, I group them by the id of the entity I want to update, sort the updates on per-entity basis and process the

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
If you're reading a file line by line then you should simply use Java's Hadoop FileSystem class to read the file with a BufferedInputStream. I don't think you need an RDD here. Sent with Good (www.good.com) -Original Message- From: Michal Michalski [michal.michal...@boxever.com

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
> I'd prefer to avoid preparing the file in advance by adding ordinals before / after each line I mean - I want to avoid doing it outside of spark of course. That's why I want to achieve the same effect with Spark by reading the file as single partition and zipping it with unique id which - I hope

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
The problem I'm facing is that I need to process lines from the input file in the order they're stored in the file, as they define the order of updates I need to apply to some data, and these updates are not commutative, so the order matters. Unfortunately the input is purely order-based; there's no time

What are the likely causes of org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle?

2015-04-24 Thread Peng Cheng
I'm deploying a Spark data processing job on an EC2 cluster, the job is small for the cluster (16 cores with 120G RAM in total), the largest RDD has only 76k+ rows. But heavily skewed in the middle (thus requires repartitioning) and each row has around 100k of data after serialization. The job alwa

RE: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Ganelin, Ilya
Michael - you need to sort your RDD. Check out the shuffle documentation on the Spark Programming Guide. It talks about this specifically. You can resolve this in a couple of ways - either by collecting your RDD and sorting it, using sortBy, or not worrying about the internal ordering. You can s

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Imran Rashid
Another issue is that HadoopRDD (which sc.textFile uses) might split input files, and even if it doesn't split, it doesn't guarantee that part-file numbers go to the corresponding partition number in the RDD. E.g. part-0 could go to partition 27 On Apr 24, 2015 7:41 AM, "Michal Michalski" wrote

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Sean Owen
The order of elements in an RDD is in general not guaranteed unless you sort. You shouldn't expect to encounter the partitions of an RDD in any particular order. In practice, you probably find the partitions come up in the order Hadoop presents them in this case. And within a partition, in this ca

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
Of course after you do it, you probably want to call repartition(somevalue) on your RDD to "get your parallelism back". Kind regards, Michał Michalski, michal.michal...@boxever.com On 24 April 2015 at 15:28, Michal Michalski wrote: > I did a quick test as I was curious about it too. I created a

Disable partition discovery

2015-04-24 Thread cosmincatalin
How can one disable *Partition discovery* in *Spark 1.3.0 * when using *sqlContext.parquetFile*? Alternatively, is there a way to load /.parquet/ files without *Partition discovery*? - https://www.linkedin.com/in/cosmincatalinsanda -- View this message in context: http://apache-spark-user-

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Michal Michalski
I did a quick test as I was curious about it too. I created a file with numbers from 0 to 999, in order, line by line. Then I did:
scala> val numbers = sc.textFile("./numbers.txt")
scala> val zipped = numbers.zipWithUniqueId
scala> zipped.foreach(i => println(i))
Expected result if the order was

Re: Spark RDD sortByKey triggering a new job

2015-04-24 Thread Sean Owen
Yes, I think this is a known issue, that sortByKey actually runs a job to assess the distribution of the data. https://issues.apache.org/jira/browse/SPARK-1021 I think further eyes on it would be welcome as it's not desirable. On Fri, Apr 24, 2015 at 9:57 AM, Spico Florin wrote: > I have tested s

Spark RDD sortByKey triggering a new job

2015-04-24 Thread Spico Florin
I have tested the sortByKey method with the following code and I have observed that it triggers a new job when it is called. I could find this neither in the API nor in the code. Is this an intended behavior? For example, the RDD zipWithIndex method API specifies that it will trigger a new job. But what

Re: Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Jeetendra Gangele
zipWithIndex will preserve whatever order is in your val lines. I am not sure about val lines = sc.textFile("hdfs://mytextFile"); if this line maintains the order, the next will maintain it for sure. On 24 April 2015 at 18:35, Spico Florin wrote: > Hello! > I know that HadoopRDD partiti

Re: Slower performance when bigger memory?

2015-04-24 Thread Shawn Zheng
this is not about gc issue itself. The memory is On Friday, April 24, 2015, Evo Eftimov wrote: > You can resort to Serialized storage (still in memory) of your RDDs – this > will obviate the need for GC since the RDD elements are stored as > serialized objects off the JVM heap (most likely in Ta

Does HadoopRDD.zipWithIndex method preserve the order of the input data from Hadoop?

2015-04-24 Thread Spico Florin
Hello! I know that HadoopRDD partitions are built based on the number of splits in HDFS. I'm wondering if these partitions preserve the initial order of data in file. As an example, if I have an HDFS (myTextFile) file that has these splits: split 0-> line 1, ..., line k split 1->line k+1,..., li

Multiclass classification using Ml logisticRegression

2015-04-24 Thread Selim Namsi
Hi, I just started using the Spark ML pipeline to implement a multiclass classifier using LogisticRegressionWithLBFGS (which accepts the number of classes as a parameter). I followed the Pipeline example in the ML guide and used the LogisticRegression class, which calls the LogisticRegressionWithLBFGS class: va

Re: JAVA_HOME problem

2015-04-24 Thread sourabh chaki
Yes Akhil. This is the same issue. I have updated my comment in that ticket. Thanks Sourabh On Fri, Apr 24, 2015 at 12:02 PM, Akhil Das wrote: > Isn't this related to this > https://issues.apache.org/jira/browse/SPARK-6681 > > Thanks > Best Regards > > On Fri, Apr 24, 2015 at 11:40 AM, sourabh

Re: spark-ec2 s3a filesystem support and hadoop versions

2015-04-24 Thread Steve Loughran
S3a isn't ready for production use on anything below Hadoop 2.7.0. I say that as the person who mentored in all the patches for it between Hadoop 2.6 & 2.7 you need everything in https://issues.apache.org/jira/browse/HADOOP-11571 in your code -Hadoop 2.6.0 doesn't have any of the HADOOP-11571 p

Spark Cluster Setup

2015-04-24 Thread James King
I'm trying to find out how to set up a resilient Spark cluster. Things I'm thinking about include: - How to start multiple masters on different hosts? - there isn't a conf/masters file from what I can see Thank you.

Re: Some questions on Multiple Streams

2015-04-24 Thread Iulian Dragoș
It looks like you’re creating 23 actions in your job (one per DStream). As far as I know by default Spark Streaming executes only one job at a time. So your 23 actions are executed one after the other. Try setting spark.streaming.concurrentJobs to something higher than one. iulian ​ On Fri, Apr 2
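
A minimal sketch of where that setting goes; the value 4 is just an illustrative assumption, and the property is undocumented, so test carefully:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("multi-stream-app")                 // hypothetical app name
  .set("spark.streaming.concurrentJobs", "4")     // default is 1: one output job at a time
val ssc = new StreamingContext(conf, Seconds(2))
```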

Parquet error reading data that contains array of structs

2015-04-24 Thread Jianshi Huang
Hi, My data looks like this:
+----------+-----------+---------+
| col_name | data_type | comment |
+----------+-----------+---------+
| cust_id  | string    |         |
| part_num | int

Re: Some questions on Multiple Streams

2015-04-24 Thread Dibyendu Bhattacharya
You can probably try the Low Level Consumer from spark-packages ( http://spark-packages.org/package/dibbhatt/kafka-spark-consumer ). How many partitions are there for your topics? Let's say you have 10 topics, each having 3 partitions; ideally you can create a max of 30 parallel Receivers and 30 str
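
For comparison, a sketch of the "many receivers, then union" layout with the built-in high-level Kafka receiver (the low-level consumer linked above has its own factory methods); it assumes an existing StreamingContext ssc, and the topic names, ZooKeeper quorum, and group id are hypothetical:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver per topic; each receiver reads the topic's 3 partitions.
val topics = (1 to 10).map(i => s"topic$i")
val streams = topics.map { topic =>
  KafkaUtils.createStream(ssc, "zk1:2181", "my-group", Map(topic -> 3))
}

// Union the 10 receiver streams into one DStream for downstream processing.
val allMessages = ssc.union(streams)
```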

Re: A Spark Group by is running forever

2015-04-24 Thread Iulian Dragoș
On Thu, Apr 23, 2015 at 6:09 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote: > > I have seen multiple blogs stating to use reduceByKey instead of > groupByKey. Could someone please help me in converting below code to use > reduceByKey > > > Code > > some spark processing > ... > > Below > val viEventsWithListingsJ
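
A generic sketch of the rewrite being asked about, with a hypothetical pair RDD standing in for the thread's viEventsWithListings data:

```scala
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Long)] = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))

// groupByKey ships every value across the shuffle before the per-key sum:
val viaGroupBy = pairs.groupByKey().mapValues(_.sum)

// reduceByKey applies the combine function map-side first, shuffling far less data:
val viaReduceBy = pairs.reduceByKey(_ + _)
```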

Re: Pyspark where do third parties libraries need to be installed under Yarn-client mode

2015-04-24 Thread Hoai-Thu Vuong
I use sudo pip install ... on each machine in the cluster, and don't think about how to submit the library. On Fri, Apr 24, 2015 at 4:21 AM dusts66 wrote: > I am trying to figure out python library management. So my question is: > Where do third party Python libraries(ex. numpy, scipy, etc.) need to exist > if

Re: Some questions on Multiple Streams

2015-04-24 Thread Laeeq Ahmed
Hi, Any comments please? Regards, Laeeq On Friday, April 17, 2015 11:37 AM, Laeeq Ahmed wrote: Hi, I am working with multiple Kafka streams (23 streams) and currently I am processing them separately. I receive one stream from each topic. I have the following questions. 1. Spark

RE: Slower performance when bigger memory?

2015-04-24 Thread Evo Eftimov
You can resort to serialized storage (still in memory) of your RDDs - this will obviate the need for GC since the RDD elements are stored as serialized objects off the JVM heap (most likely in Tachyon, which is a distributed in-memory file system used by Spark internally). Also review the Object O
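
A minimal sketch of the two storage options mentioned; someRdd and otherRdd are hypothetical RDDs. MEMORY_ONLY_SER keeps serialized byte arrays in memory, while OFF_HEAP is the Tachyon-backed level in Spark 1.x:

```scala
import org.apache.spark.storage.StorageLevel

// Serialized in-memory storage: compact byte arrays instead of many
// deserialized objects, which typically eases GC pressure.
val cachedSer = someRdd.persist(StorageLevel.MEMORY_ONLY_SER)

// Off-heap storage (backed by Tachyon in Spark 1.x) takes the cached data out
// of the JVM heap altogether.
val cachedOffHeap = otherRdd.persist(StorageLevel.OFF_HEAP)
```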

RE: Tasks run only on one machine

2015-04-24 Thread Evo Eftimov
# of tasks = # of partitions; hence you can provide the desired number of partitions to the textFile API, which should result in a) a better spatial distribution of the RDD and b) each partition being operated upon by a separate task. You can provide the number of p -Original Message- Fro
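
A minimal sketch (the path and partition counts are hypothetical):

```scala
// Ask textFile for a minimum number of partitions up front...
val lines = sc.textFile("hdfs:///data/input.txt", 64)

// ...or redistribute an already-loaded RDD so more tasks (and hence more
// machines) share the work.
val spread = lines.repartition(64)
println(s"partitions: ${spread.partitions.length}")
```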

Re: Understanding Spark/MLlib failures

2015-04-24 Thread Hoai-Thu Vuong
Hi Andrew, according to you, should we balance the time when GC runs and the batch time in which the RDD is processed? On Fri, Apr 24, 2015 at 6:58 AM Reza Zadeh wrote: > Hi Andrew, > > The .principalComponents feature of RowMatrix is currently constrained to > tall and skinny matrices. Your matrix is b

Re: what is the best way to transfer data from RDBMS to spark?

2015-04-24 Thread ayan guha
What is the specific use case? I can think of a couple of ways (write to HDFS and then read from Spark, or stream data to Spark). Also, I have seen people using MySQL jars to bring data in. Essentially you want to simulate the creation of an RDD. On 24 Apr 2015 18:15, "sequoiadb" wrote: > If I run spark in s
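
A sketch of the "MySQL jars to bring data in" route using the JDBC data source that ships with Spark SQL 1.3+; the URL, credentials, and table name are hypothetical, and the MySQL connector jar must be on the classpath:

```scala
// Load a table straight from the RDBMS into a DataFrame, then expose it to SQL.
val orders = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:mysql://dbhost:3306/mydb?user=reader&password=secret",
  "dbtable" -> "orders"))
orders.registerTempTable("orders")
sqlContext.sql("SELECT COUNT(*) FROM orders").show()
```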

what is the best way to transfer data from RDBMS to spark?

2015-04-24 Thread sequoiadb
If I run Spark in standalone mode (not YARN mode), is there any tool like Sqoop that is able to transfer data from an RDBMS to Spark storage? Thanks

Re: Shuffle files not cleaned up (Spark 1.2.1)

2015-04-24 Thread N B
Hi TD, That may very well have been the case. There may be some delay on our output side. I have made a change just for testing that sends the output nowhere. I will see if that helps get rid of these errors. Then we can try to find out how we can optimize so that we do not lag. Questions: How ca