Getting : format(target_id, ".", name), value) .. error

2021-02-08 Thread shahab
Hello, I am getting this unclear error message when I read a parquet file; it seems something is wrong with the data, but what? I googled a lot but did not find any clue. I hope some Spark experts could help me with this. best, Shahab Traceback (most recent call last): File "/usr/lib/

Re: Error using .collect()

2019-05-13 Thread Shahab Yunus
after grouping by, can you perform a join between this map and your other dataset rather than trying to fit the map in memory? Regards, Shahab On Mon, May 13, 2019 at 3:58 PM Kumar sp wrote: > I have a use case where i am using collect().toMap (Group by certain > column and finding count ,cr
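A minimal sketch of that suggestion (column names are made up for illustration): keep the per-key counts as a distributed DataFrame and join, instead of collect()-ing them into a driver-side Map.

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("join-not-collect").master("local[*]").getOrCreate()
  import spark.implicits._

  val events = Seq(("u1", "click"), ("u1", "view"), ("u2", "click")).toDF("userId", "action")
  val other  = Seq(("u1", "US"), ("u2", "SE")).toDF("userId", "country")

  // The grouped counts stay distributed ...
  val counts = events.groupBy("userId").count()
  // ... and are joined onto the other dataset, so nothing has to fit in driver memory.
  val joined = other.join(counts, Seq("userId"), "left")
  joined.show()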

Spark SQL LIMIT Gets Stuck

2019-05-01 Thread Shahab Yunus
Hi there. I have a Hive external table (storage format is ORC, data stored on S3, partitioned on one bigint-type column) that I am trying to query through the pyspark (or spark-shell) shell. df.count() fails for lower values of the LIMIT clause with the following exception (seen in the Spark UI). df.show() w

Re: How to get all input tables of a SPARK SQL 'select' statement

2019-01-23 Thread Shahab Yunus
Could be a tangential idea but might help: Why not use queryExecution and logicalPlan objects that are available when you execute a query using SparkSession and get a DataFrame back? The Json representation contains almost all the info that you need and you don't need to go to Hive to get this info
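A hedged sketch of that idea against Spark 2.x internals (the SQL and table names are placeholders, and the plan node classes differ between Spark versions):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.catalyst.catalog.HiveTableRelation
  import org.apache.spark.sql.execution.datasources.LogicalRelation

  val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
  val df = spark.sql("SELECT a.id FROM db.table_a a JOIN db.table_b b ON a.id = b.id")

  // Walk the analyzed logical plan and keep every relation that is backed by a catalog table.
  val inputTables = df.queryExecution.analyzed.collect {
    case r: HiveTableRelation => r.tableMeta.identifier.unquotedString
    case l: LogicalRelation if l.catalogTable.isDefined => l.catalogTable.get.identifier.unquotedString
  }.distinct

  println(inputTables.mkString(", "))
  // df.queryExecution.logical.prettyJson is the JSON representation mentioned above.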

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Shahab Yunus
> > > Original DF -> Iterate -> Pass every element to a function that takes the > element of the original DF and returns a new dataframe including all the > matching terms > > > > > > *From:* Andrew Melo > *Sent:* Friday, December 28, 2018 8:48 PM > *To:* e

Re: What are the alternatives to nested DataFrames?

2018-12-28 Thread Shahab Yunus
Can you have a dataframe with a column which stores json (type string)? Or you can also have a column of array type in which you store all cities matching your query. On Fri, Dec 28, 2018 at 2:48 AM wrote: > Hi community , > > > > As shown in other answers online , Spark does not support the n
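A rough sketch of the array-column alternative (all names invented for illustration): fold the matching rows into an ArrayType column with collect_list rather than nesting a DataFrame inside a DataFrame.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.collect_list

  val spark = SparkSession.builder().appName("array-column").master("local[*]").getOrCreate()
  import spark.implicits._

  val people = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
  val cities = Seq((1, "Oslo"), (1, "Bergen"), (2, "Malmo")).toDF("id", "city")

  // One row per person, with all matching cities gathered into a single array column.
  val withCities = people.join(
    cities.groupBy("id").agg(collect_list("city").as("matchingCities")),
    Seq("id"), "left")
  withCities.show(truncate = false)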

Re: Add column value in the dataset on the basis of a condition

2018-12-18 Thread Shahab Yunus
-conditionally On Tue, Dec 18, 2018 at 9:55 AM Shahab Yunus wrote: > Have you tried using withColumn? You can add a boolean column based on > whether the age exists or not and then drop the older age column. You > wouldn't need union of dataframes then > > On Tue, Dec 18, 2018

Re: Add column value in the dataset on the basis of a condition

2018-12-18 Thread Shahab Yunus
Have you tried using withColumn? You can add a boolean column based on whether the age exists or not and then drop the older age column. You wouldn't need union of dataframes then On Tue, Dec 18, 2018 at 8:58 AM Devender Yadav wrote: > Hi All, > > > useful code: > > public class EmployeeBean imp
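A minimal sketch of that suggestion (the column names are assumptions, not from the thread):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, when}

  val spark = SparkSession.builder().appName("with-column").master("local[*]").getOrCreate()
  import spark.implicits._

  val df = Seq(("a", Some(30)), ("b", Option.empty[Int])).toDF("name", "age")

  // Derive the boolean column from the existing one, then drop the original -- no union of dataframes needed.
  val withFlag = df
    .withColumn("hasAge", when(col("age").isNotNull, true).otherwise(false))
    .drop("age")
  withFlag.show()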

Re: Parallel read parquet file, write to postgresql

2018-12-03 Thread Shahab Yunus
Hi James. --num-executors is used to control the number of executors (and thus the number of parallel tasks, one per executor) running for your application. For reading and writing data in parallel, data partitioning is employed. You can look here for a quick intro to how data partitioning works: https://jaceklaskowski.gitbooks.io/maste
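A hedged sketch of such a partitioned load (the path, JDBC URL, table and credentials are placeholders): each partition is written by its own task over its own connection, so the repartition count bounds the write parallelism.

  import java.util.Properties
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("parquet-to-postgres").getOrCreate()

  val props = new Properties()
  props.setProperty("user", "myuser")
  props.setProperty("password", "mypassword")
  props.setProperty("driver", "org.postgresql.Driver")

  spark.read.parquet("s3://my-bucket/my-data/")
    .repartition(32)            // 32 write tasks running in parallel across the executors
    .write
    .mode("append")
    .jdbc("jdbc:postgresql://dbhost:5432/mydb", "my_table", props)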

Re: Convert RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]

2018-12-03 Thread Shahab Yunus
Curious why you think this is not smart code? On Mon, Dec 3, 2018 at 8:04 AM James Starks wrote: > By taking with your advice flatMap, now I can convert result from > RDD[Iterable[MyCaseClass]] to RDD[MyCaseClass]. Basically just to perform > flatMap in the end before starting to convert RDD obj
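A small sketch of the conversion being discussed (MyCaseClass and the data are stand-ins):

  import org.apache.spark.sql.SparkSession

  case class MyCaseClass(id: Long, value: String)

  val spark = SparkSession.builder().appName("flatten-rdd").master("local[*]").getOrCreate()

  val nested = spark.sparkContext.parallelize(Seq(
    Seq(MyCaseClass(1L, "a"), MyCaseClass(2L, "b")),
    Seq(MyCaseClass(3L, "c"))
  ))

  // flatMap flattens RDD[Seq[MyCaseClass]] into RDD[MyCaseClass].
  val flat = nested.flatMap(identity)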

Re: Creating spark Row from database values

2018-09-26 Thread Shahab Yunus
Hi there. Have you seen this link? https://medium.com/@mrpowers/manually-creating-spark-dataframes-b14dae906393 It shows you multiple ways to manually create a dataframe. Hope it helps. Regards, Shahab On Wed, Sep 26, 2018 at 8:02 AM Kuttaiah Robin wrote: > Hello, > > Current
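One common way to build a DataFrame row by row, sketched here with an assumed two-column schema:

  import org.apache.spark.sql.{Row, SparkSession}
  import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

  val spark = SparkSession.builder().appName("manual-df").master("local[*]").getOrCreate()

  val schema = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))

  // Rows built from whatever the database query returned.
  val rows = Seq(Row(1L, "alice"), Row(2L, "bob"))
  val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  df.show()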

Unsubscribe

2018-04-23 Thread Shahab Yunus
Unsubscribe

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
well? Regards, Shahab On Tue, Apr 10, 2018 at 10:05 AM, Nick Pentreath wrote: > Also check out FeatureHasher in Spark 2.3.0 which is designed to handle > this use case in a more natural way than HashingTF (and handles multiple > columns at once). > > > > On Tue, 10 Apr
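A hedged sketch of the FeatureHasher route mentioned above (Spark 2.3+; the column names and sizes are made up). Unlike StringIndexer, it never builds a per-category label-to-index map in memory:

  import org.apache.spark.ml.feature.FeatureHasher
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("feature-hasher").master("local[*]").getOrCreate()
  import spark.implicits._

  val df = Seq(("Oslo", "ios", "books"), ("Malmo", "android", "music")).toDF("city", "device", "category")

  val hasher = new FeatureHasher()
    .setInputCols("city", "device", "category")   // several high-cardinality columns at once
    .setOutputCol("features")
    .setNumFeatures(1 << 18)                      // fixed-size output vector

  hasher.transform(df).show(truncate = false)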

StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
to be indexed is huge and columns to be indexed are high cardinality (or with lots of categories) and more than one such column need to be indexed? Meaning it wouldn't fit in memory. Thanks. Regards, Shahab

Sparse Matrix to Matrix multiplication in Spark

2018-04-01 Thread Shahab Yunus
hanks. Regards, Shahab

Warnings on data insert into Hive Table using PySpark

2018-03-19 Thread Shahab Yunus
ot match expected pattern for group* *Usage: hadoop fs [generic options] -chgrp [-R] GROUP PATH...* Software: Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121) Spark 2.0.2 Hadoop 2.7.3-amzn-0 Thanks & Regards, Shahab

Accessing Scala RDD from pyspark

2018-03-15 Thread Shahab Yunus
Thanks & Regards, Shahab

unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found

2015-10-06 Thread shahab
Hi, I am trying to use Spark 1.5 with MLlib, but I keep getting "sbt.ResolveException: unresolved dependency: org.apache.spark#spark-streaming_2.10;1.5.0: not found". It is weird that this happens, but I could not find any solution for this. Has anyone faced the same issue? best, /Sh

Re: Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submi

2015-09-18 Thread shahab
It works using yarn-client, but I want to make it run on the cluster. Is there any way to do so? best, /Shahab On Fri, Sep 18, 2015 at 12:54 PM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > Can you try yarn-client mode? > > On Fri, Sep 18, 2015, 3:38 PM shahab

Zeppelin on Yarn : org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

2015-09-18 Thread shahab
e use spark-submit. Anyone knows What's the solution to this? best, /Shahab

Re: [Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-10 Thread shahab
. But I will try your solution as well. @Neil : I think something is wrong with my fat jar file, I think I am missing some dependencies in my jar file ! Again thank you all /Shahab On Wed, Sep 9, 2015 at 11:28 PM, Dean Wampler wrote: > If you log into the cluster, do you see the file if you t

[Spark on Amazon EMR] : File does not exist: hdfs://ip-x-x-x-x:/.../spark-assembly-1.4.1-hadoop2.6.0-amzn-0.jar

2015-09-09 Thread shahab
t --deploy-mode cluster --class mypack.MyMainClass --master yarn-cluster s3://mybucket/MySparkApp.jar Is there any one who has similar problem with EMR? best, /Shahab

Zeppelin + Spark on EMR

2015-09-07 Thread shahab
pelin with EMR . best, /Shahab

Re: spark - redshift !!!

2015-07-08 Thread shahab
Sorry, I misunderstood. best, /Shahab On Wed, Jul 8, 2015 at 9:52 AM, spark user wrote: > Hi 'I am looking how to load data in redshift . > Thanks > > > > On Wednesday, July 8, 2015 12:47 AM, shahab > wrote: > > > Hi, > > I did some experiment with

Re: spark - redshift !!!

2015-07-08 Thread shahab
im)) // my data format is in CSV format, comma separated .map (r => MyIbject(r(3), r(4).toLong, r(5).toLong, r(6))) //just map it to the target object format hope this helps, best, /Shahab On Wed, Jul 8, 2015 at 12:57 AM, spark user wrote: > Hi > Can you help me how to load da

Performing sc.parallelize(..) in workers, not in the driver program

2015-06-25 Thread shahab
Hi, Apparently the sc.parallelize(..) operation is performed in the driver program, not in the workers! Is it possible to do this in the worker processes for the sake of scalability? best /Shahab

Re: Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-15 Thread shahab
Thanks Akhil, it solved the problem. best /Shahab On Fri, Jun 12, 2015 at 8:50 PM, Akhil Das wrote: > Looks like your spark is not able to pick up the HADOOP_CONF. To fix this, > you can actually add jets3t-0.9.0.jar to the classpath > (sc.addJar(/path/to/jets3t-0.9.0.jar). > >

Reading file from S3, facing java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException

2015-06-11 Thread shahab
Hi, I tried to read a CSV file from Amazon S3, but I get the following exception which I have no clue how to solve. I tried both Spark 1.3.1 and 1.2.1, but no success. Any idea how to solve this is appreciated. best, /Shahab the code: val hadoopConf=sc.hadoopConfiguration

Re: PostgreSQL JDBC Classpath Issue

2015-06-10 Thread shahab
Hi George, I have the same issue; did you manage to find a solution? best, /Shahab On Wed, May 13, 2015 at 9:21 PM, George Adams wrote: > Hey all, I seem to be having an issue with PostgreSQL JDBC jar on my > classpath. I’ve outlined the issue on Stack Overflow ( > http://stackove

Re: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-05 Thread shahab
Thanks Tristan for sharing this. Actually this happens when I am reading a csv file of 3.5 GB. best, /Shahab On Tue, May 5, 2015 at 9:15 AM, Tristan Blakers wrote: > Hi Shahab, > > I’ve seen exceptions very similar to this (it also manifests as negative > array size exception), a

"java.io.IOException: No space left on device" while doing repartitioning in Spark

2015-05-04 Thread shahab
Hi, I am getting "No space left on device" exception when doing repartitioning of approx. 285 MB of data while these is still 2 GB space left ?? does it mean that repartitioning needs more space (more than 2 GB) for repartitioning of 285 MB of data ?? best, /Shahab java.io.IOExc

how to make sure data is partitioned across all workers?

2015-05-04 Thread shahab
Hi, Is there any way to enforce Spark to partition cached data across all worker nodes, so all data is not cached only in one of the worker nodes? best, /Shahab

com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index:

2015-05-02 Thread shahab
Hi, I am using spark-1.2.0 and I used Kryo serialization, but I get the following exception. java.io.IOException: com.esotericsoftware.kryo.KryoException: java.lang.IndexOutOfBoundsException: Index: 3448, Size: 1 I would appreciate it if anyone could tell me how I can resolve this. best, /Shahab

Re: is there anyway to enforce Spark to cache data in all worker nodes(almost equally) ?

2015-04-30 Thread shahab
Thanks Alex, but 482 MB was just an example size, and I am looking for a generic approach to doing this without broadcasting. Any idea? best, /Shahab On Thu, Apr 30, 2015 at 4:21 PM, Alex wrote: > 482 MB should be small enough to be distributed as a set of broadcast > variables. Then you c

is there anyway to enforce Spark to cache data in all worker nodes (almost equally) ?

2015-04-30 Thread shahab
". I have small stand-alone cluster with two nodes A, B. Where node A accommodates Cassandra, Spark Master and Worker and node B contains the second spark worker. best, /Shahab

Re: why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
If you want to avoid disk write, you can mount a ramdisk and > configure "spark.local.dir" to this ram disk. So shuffle output will write > to memory based FS, and will not introduce disk IO. > > Thanks > Jerry > > 2015-03-30 17:15 GMT+08:00 shahab : > >> H
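A sketch of that ramdisk suggestion (it assumes a tmpfs is already mounted at /mnt/ramdisk; on YARN or standalone the worker-side directory settings can still take precedence):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("shuffle-in-ram")
    .set("spark.local.dir", "/mnt/ramdisk")   // shuffle files are written under spark.local.dir

  val sc = new SparkContext(conf)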

why "Shuffle Write" is not zero when everything is cached and there is enough memory?

2015-03-30 Thread shahab
be done in memory? best, /Shahab

Re: Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-13 Thread shahab
uld be true for *any* transformation which causes a shuffle. > It would not be true if you're combining RDDs with union, since that > doesn't cause a shuffle. > > On Thu, Mar 12, 2015 at 11:04 AM, shahab > wrote: > >> Hi, >> >> Probably this question is alr

Which is more efficient : first join three RDDs and then do filtering or vice versa?

2015-03-12 Thread shahab
RDDs and then do filtering on the resulting joined RDD, or 2- Apply filtering on each individual RDD and then join the resulting RDDs. Or perhaps there is no difference due to lazy evaluation and the underlying Spark optimisation? best, /Shahab

Re: Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-10 Thread shahab
Thanks Hao, but my question concerns UDAFs (user-defined aggregation functions), not UDTFs (user-defined table functions). I would appreciate it if you could point me to a starting point for UDAF development in Spark. Thanks Shahab On Tuesday, March 10, 2015, Cheng, Hao wrote: > Currently, Spark

Does any one know how to deploy a custom UDAF jar file in SparkSQL?

2015-03-10 Thread shahab
Hi, Does anyone know how to deploy a custom UDAF jar file in SparkSQL? Where should I put the jar file so SparkSQL can pick it up and make it accessible to SparkSQL applications? I do not use spark-shell; instead I want to use it in a Spark application. best, /Shahab

Registering custom UDAFs with HiveContext in SparkSQL, how?

2015-03-10 Thread shahab
then deploy the JAR file to SparkSQL . But is there any way to avoid deploying the jar file and register it programmatically? best, /Shahab

Re: Does SparkSQL support "..... having count (fieldname)" in SQL statement?

2015-03-04 Thread shahab
you using > Shahab? > > > > *From:* yana [mailto:yana.kadiy...@gmail.com] > *Sent:* Wednesday, March 4, 2015 8:47 PM > *To:* shahab; user@spark.apache.org > *Subject:* RE: Does SparkSQL support ". having count (fieldname)" in > SQL statement? > > > >

Does SparkSQL support "..... having count (fieldname)" in SQL statement?

2015-03-04 Thread shahab
AST(('cnt < 2), BooleanType), tree: I couldn't find anywhere in the documentation whether the "having" keyword is supported. If it is not, what would be the workaround? Using two nested select statements? best, /Shahab

Re: Can not query TempTable registered by SQL Context using HiveContext

2015-03-03 Thread shahab
Thanks Michael. I understand now. best, /Shahab On Tue, Mar 3, 2015 at 9:38 PM, Michael Armbrust wrote: > As it says in the API docs > <https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.sql.SchemaRDD>, > tables created with registerTempTable are local

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
@Yin: sorry for my mistake, you are right it was added in 1.2, not 0.12.0 , my bad! On Tue, Mar 3, 2015 at 6:47 PM, shahab wrote: > Thanks Rohit, yes my mistake, it does work with 1.1 ( I am actually > running it on spark 1.1) > > But do you mean that even HiveConext of spark (

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
Thanks Rohit, yes my mistake, it does work with 1.1 (I am actually running it on Spark 1.1). But do you mean that even the HiveContext of Spark (not Calliope's CassandraAwareHiveContext) does not support Hive 0.12?? best, /Shahab On Tue, Mar 3, 2015 at 5:55 PM, Rohit Rai wrote: > The H

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
supporting them? 2-It does not support Spark 1.1 and 1.2. Any plan for new release? best, /Shahab On Tue, Mar 3, 2015 at 5:41 PM, Rohit Rai wrote: > Hello Shahab, > > I think CassandraAwareHiveContext > <https://github.com/tuplejump/calliope/blob/develop/sql/hive/src/main/scala/o

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
HiveContext, but it seems I can not do this! @Yin: yes, it is added in Hive 0.12, but do you mean it is not supported by HiveContext in Spark? Thanks, /Shahab On Tue, Mar 3, 2015 at 5:23 PM, Yin Huai wrote: > Regarding current_date, I think it is not in either Hive 0.12.0 or 0.13.1 > (version

Can not query TempTable registered by SQL Context using HiveContext

2015-03-03 Thread shahab
a:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:103) best, /Shahab

[no subject]

2015-03-03 Thread shahab
a:606) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke( RetryingHMSHandler.java:103) best, /Shahab

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
ot see the registered table using the SQL context. Is this a normal case? best, /Shahab On Tue, Mar 3, 2015 at 1:35 PM, Cheng, Hao wrote: > Hive UDF are only applicable for HiveContext and its subclass instance, > is the CassandraAwareSQLContext a direct sub class of HiveContext or > SQLCon

Re: Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-03 Thread shahab
LDemo.main(SQLDemo.scala) //my code On Tue, Mar 3, 2015 at 8:57 AM, Cheng, Hao wrote: > Can you provide the detailed failure call stack? > > > > *From:* shahab [mailto:shahab.mok...@gmail.com] > *Sent:* Tuesday, March 3, 2015 3:52 PM > *To:* user@spark.apache.org > *Subje

Supporting Hive features in Spark SQL Thrift JDBC server

2015-03-02 Thread shahab
SQL query. There are a couple of other UDFs which cause a similar error. Am I missing something in my JDBC server? /Shahab

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-03-01 Thread shahab
Thanks Vijay, but the setup requirement for GML was not straightforward for me at all, so I put it aside for a while. best, /Shahab On Sun, Mar 1, 2015 at 9:34 AM, Vijay Saraswat wrote: > GML is a fast, distributed, in-memory sparse (and dense) matrix > libraries. > > It does not

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-03-01 Thread shahab
Thanks Joseph for the comments, I think I need to do some benchmarking. best, /Shahab On Sun, Mar 1, 2015 at 1:25 AM, Joseph Bradley wrote: > Hi Shahab, > > There are actually a few distributed Matrix types which support sparse > representations: RowMatrix, IndexedRo

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread shahab
Thanks a lot Vijay, let me see how it performs. Best Shahab On Friday, February 27, 2015, Vijay Saraswat wrote: > Available in GML -- > > http://x10-lang.org/x10-community/applications/global-matrix-library.html > > We are exploring how to make it available within Spark. Any

Re: Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread shahab
t's called CoordinateMatrix ( > http://spark.apache.org/docs/latest/mllib-data-types.html#coordinatematrix) > you need to fill it with elements of type MatrixEntry ((Long, Long, > Double)) > > > Thanks, > Peter Rudenko > On 2015-02-27 14:01, shahab wrote: > Hi, >

Is there any Sparse Matrix implementation in Spark/MLib?

2015-02-27 Thread shahab
Hi, I just wonder if there is any Sparse Matrix implementation available in Spark, so it can be used in spark application? best, /Shahab

Access time to an element in a cached RDD

2015-02-23 Thread shahab
difference if you plan for near real-time response from Spark ?! best, /Shahab

Re: what does "Submitting ... missing tasks from Stage" mean?

2015-02-23 Thread shahab
Thanks Imran, but I would appreciate it if you could explain what this means and what causes it. I do need it. If there is any documentation somewhere, you can simply direct me there so I can try to understand it myself. best, /Shahab On Sat, Feb 21, 2015 at 12:26 AM, Imran Rashid

what does "Submitting ... missing tasks from Stage" mean?

2015-02-20 Thread shahab
happen? Does it have any relation to lazy evaluation? best, /Shahab

Re: Why is RDD lookup slow?

2015-02-20 Thread shahab
Thank you all. Just changing the RDD to a Map structure saved me approx. 1 second. Yes, I will check out IndexedRDD to see if it has better performance. best, /Shahab On Thu, Feb 19, 2015 at 6:38 PM, Burak Yavuz wrote: > If your dataset is large, there is a Spark Package called Indexed
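A sketch of the Map-plus-broadcast change described above (toy data for illustration): one collect up front, after which a lookup is a local hash probe on each executor instead of a Spark job per key.

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("broadcast-lookup").setMaster("local[*]"))

  val pairs = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
  val lookupMap = sc.broadcast(pairs.collectAsMap())   // collected once, shipped to every executor

  val keys = sc.parallelize(Seq(1L, 3L))
  val resolved = keys.map(k => k -> lookupMap.value.get(k))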

Why is RDD lookup slow?

2015-02-19 Thread shahab
, like a HashMap to keep the data and look it up there, and use Broadcast to send a copy to all machines? best, /Shahab

Re: Why cached RDD is recomputed again?

2015-02-18 Thread shahab
where else! Do you have any other idea where I should look for the cause? best, /Shahab On Wed, Feb 18, 2015 at 4:22 PM, Sean Owen wrote: > The mostly likely explanation is that you wanted to put all the > partitions in memory and they don't all fit. Unless you asked to > persis

Re: Why groupBy is slow?

2015-02-18 Thread shahab
Thanks Francois for the comment and useful link. I understand the problem better now. best, /Shahab On Wed, Feb 18, 2015 at 10:36 AM, wrote: > In a nutshell : because it’s moving all of your data, compared to other > operations (e.g. reduce) that summarize it in one form or another
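A small sketch of the point in that link, using word counts: reduceByKey combines values on each partition before the shuffle, while groupByKey ships every record across the network.

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("groupby-vs-reduceby").setMaster("local[*]"))

  val words = sc.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

  val viaGroupBy  = words.groupByKey().mapValues(_.sum)   // moves all the 1s, then sums on the reduce side
  val viaReduceBy = words.reduceByKey(_ + _)              // sums within each partition first, then shuffles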

Why groupBy is slow?

2015-02-18 Thread shahab
Hi, Based on what I could see in the Spark UI, I noticed that "groupBy" transformation is quite slow (taking a lot of time) compared to other operations. Is there any reason that groupBy is slow? shahab

How to unregister/re-register a TempTable in Spark?

2015-01-28 Thread shahab
Hi, I just wonder if there is any way to unregister/re-register a TempTable in Spark? best, /Shahab

Re: Querying registered RDD (AsTable) using JDBC

2014-12-22 Thread shahab
Thanks Evert for detailing the solution, I do appreciate it. But I would first try Cheng's suggestion. And thanks Cheng for the help. I will let you know if I succeed. best, /Shahab On Sun, Dec 21, 2014 at 12:49 PM, Cheng Lian wrote: > Evert - Thanks for the instructions,

Querying registered RDD (AsTable) using JDBC

2014-12-19 Thread shahab
) ? best, /shahab

Querying Temp table using JDBC

2014-12-19 Thread shahab
nable? best, /Shahab

HiveQL support in Cassandra-Spark connector

2014-12-15 Thread shahab
Hi, I just wonder if Cassandra-Spark connector supports executing HiveQL on Cassandra tables? best, /Shahab

Increasing the number of retry in case of job failure

2014-12-05 Thread shahab
Hello, For some (unknown) reason some of my tasks that fetch data from Cassandra are failing quite often, and apparently the master removes a task which fails more than 4 times (in my case). Is there any way to increase the number of retries? best, /Shahab
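The usual knob for this is spark.task.maxFailures (default 4); a sketch, assuming raising the limit is actually the right fix for these Cassandra fetches:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("cassandra-fetch")
    .set("spark.task.maxFailures", "10")   // allow up to 10 attempts per task before the stage is failed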

Re: How to enforce RDD to be cached?

2014-12-03 Thread shahab
Daniel and Paolo, thanks for the comments. best, /Shahab On Wed, Dec 3, 2014 at 3:12 PM, Paolo Platter wrote: > Yes, > > otherwise you can try: > > rdd.cache().count() > > and then run your benchmark > > Paolo > > *Da:* Daniel Darabos > *Data invi
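A sketch of the cache-then-count pattern from this thread, used to separate caching time from processing time (the input path and the measured action are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("cache-benchmark").setMaster("local[*]"))
  val rdd = sc.textFile("data.txt")

  val t0 = System.nanoTime()
  rdd.cache().count()                                        // count() forces the RDD to be computed and cached
  val cachingMs = (System.nanoTime() - t0) / 1e6

  val t1 = System.nanoTime()
  val totalChars = rdd.map(_.length.toLong).reduce(_ + _)    // now measured against the cached partitions
  val processingMs = (System.nanoTime() - t1) / 1e6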

How to enforce RDD to be cached?

2014-12-03 Thread shahab
perform some benchmarking and I need to separate rdd caching and rdd transformation/action processing time. best, /Shahab

Kryo exception for CassandraSQLRow

2014-12-01 Thread shahab
; %% "spark-streaming" % "1.1.0" % "provided" exclude("com.google.guava", "guava"), "org.apache.spark" %% "spark-catalyst" % "1.1.0" % "provided" exclude("com.google.guava", "guava") exclude("org.apache.spark", "spark-core"), "org.apache.spark" %% "spark-sql" % "1.1.0" % "provided" exclude("com.google.guava", "guava") exclude("org.apache.spark", "spark-core"), "org.apache.spark" %% "spark-hive" % "1.1.0" % "provided" exclude("com.google.guava", "guava") exclude("org.apache.spark", "spark-core"), "org.apache.hadoop" % "hadoop-client" % "1.0.4" % "provided", best, /Shahab

Is there any Spark implementation for Item-based Collaborative Filtering?

2014-11-30 Thread shahab
Hi, I just wonder if there is any implementation for Item-based Collaborative Filtering in Spark? best, /Shahab

Re: How to assign consecutive numeric id to each row based on its content?

2014-11-25 Thread shahab
Thanks a lot, both solutions work. best, /Shahab On Tue, Nov 18, 2014 at 5:28 PM, Daniel Siegmann wrote: > I think zipWithIndex is zero-based, so if you want 1 to N, you'll need to > increment them like so: > > val r2 = r1.keys.distinct().zipWithIndex().mapValues(_ + 1) >
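A sketch of that zipWithIndex approach with toy data (keys repeat, and each distinct key gets one consecutive id starting at 1):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("consecutive-ids").setMaster("local[*]"))

  val rows = sc.parallelize(Seq(("apple", "row1"), ("pear", "row2"), ("apple", "row3")))

  // One id per distinct key (zipWithIndex is zero-based, hence the + 1), joined back onto the rows.
  val ids = rows.keys.distinct().zipWithIndex().mapValues(_ + 1)
  val numbered = rows.join(ids).map { case (_, (row, id)) => (id, row) }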

Re: Spark Cassandra Guava version issues

2014-11-24 Thread shahab
I faced the same problem, and a workaround solution is here: https://github.com/datastax/spark-cassandra-connector/issues/292 best, /Shahab On Mon, Nov 24, 2014 at 3:21 PM, Ashic Mahtab wrote: > I've got a Cassandra 2.1.1 + Spark 1.1.0 cluster running. I'm using > sbt-assembly

How to assign consecutive numeric id to each row based on its content?

2014-11-18 Thread shahab
rows can have the same string key. In a Spark context, how can I map each row into (Numeric_Key, OriginalRow) with map/reduce tasks such that rows with the same original string key get the same consecutive numeric key? Any hints? best, /Shahab

Re: Cassandra spark connector exception: "NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;"

2014-11-11 Thread shahab
Thanks Helena. I think I will wait for the new release and try it. Again thanks, /Shahab On Tue, Nov 11, 2014 at 3:41 PM, Helena Edelson wrote: > Hi, > It looks like you are building from master > (spark-cassandra-connector-assembly-1.2.0). > - Append this to your com.google.guava

Cassandra spark connector exception: "NoSuchMethodError: com.google.common.collect.Sets.newConcurrentHashSet()Ljava/util/Set;"

2014-11-11 Thread shahab
quot;hadoop-client" % "1.0.4" % "provided", "com.github.nscala-time" %% "nscala-time" % "1.0.0", "org.scalatest" %% "scalatest" % "1.9.1" % "test", "org.apache.spark" %% "spark

Re: How number of partitions effect the performance?

2014-11-03 Thread shahab
Thanks Sean for the very useful comments. I now understand better what the reasons could be for my evaluations being messed up. best, /Shahab On Mon, Nov 3, 2014 at 12:08 PM, Sean Owen wrote: > Yes partitions matter. Usually you can use the default, which will > make a partition per input

How number of partitions effect the performance?

2014-11-03 Thread shahab
mean that choosing the right number of partitions is the key factor in Spark performance? best, /Shahab

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
OK, I created an issue. Hopefully it will be resolved soon. Again thanks, best, /Shahab On Fri, Oct 31, 2014 at 7:05 PM, Helena Edelson wrote: > Hi Shahab, > The apache cassandra version looks great. > > I think that doing > cc.setKeyspace("mydb") > cc.sql("S

Re: Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
spark.cassandra.input.split.size : 1 spark.app.name : SomethingElse spark.fileserver.uri : http://192.168.1.111:51463 spark.driver.port : 51461 spark.master : local Does it have anything to do with the version of Apache Cassandra that I use?? I use "apache-cassandra-2.1

Accessing Cassandra with SparkSQL, Does not work?

2014-10-31 Thread shahab
rLoad(LocalCache.java:3938) in fact mydb3 is anothery keyspace which I did not tried even to connect to it ! Any idea? best, /Shahab Here is how my SBT looks like: libraryDependencies ++= Seq( "com.datastax.spark" %% "spark-cassandra-connector

Re: Best way to partition RDD

2014-10-30 Thread shahab
Thanks Helena, very useful comment. But is "spark.cassandra.input.split.size" only effective in a cluster, not on a single node? best, /Shahab On Thu, Oct 30, 2014 at 6:26 PM, Helena Edelson wrote: > Shahab, > > Regardless, WRT cassandra and spark when using the spark ca

Doing RDD."count" in parallel , at at least parallelize it as much as possible?

2014-10-30 Thread shahab
Hi, I noticed that the "count" (of RDD) in many of my queries is the most time consuming one as it runs in the "driver" process rather then done by parallel worker nodes, Is there any way to perform "count" in parallel , at at least parallelize it as much as possible? best, /Shahab

Re: Best way to partition RDD

2014-10-30 Thread shahab
Hi Helena, Well... I am just running a toy example. I have one Cassandra node co-located with the Spark master and one of the Spark workers, all on one machine. I have another node which runs the second Spark worker. /Shahab, On Thu, Oct 30, 2014 at 6:12 PM, Helena Edelson wrote: > Hi Sha

Best way to partition RDD

2014-10-30 Thread shahab
orm Cassandra. So I perform "repartition" on the RDD, and then I apply the map/reduce functions. But the main problem is that "repartition" takes so much time (almost 2 min), which is not acceptable in my use-case. Is there any better way to do repartitioning? best, /Shahab

Re: How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread shahab
Thanks for the useful comment. But I guess this setting applies only when I use SparkSQL, right? Is there any similar setting for Spark? best, /Shahab On Tue, Oct 28, 2014 at 2:38 PM, Wanda Hawk wrote: > Is this what are you looking for ? > > In Shark, default reducer number is
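A sketch of the two knobs that play that role in Spark itself (the values are examples; they are usually set per application rather than in spark-env.sh):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.default.parallelism", "64")      // default partition count for RDD shuffles (reduceByKey, join, ...)
    .set("spark.sql.shuffle.partitions", "64")   // partition count for Spark SQL shuffles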

How can number of partitions be set in "spark-env.sh"?

2014-10-28 Thread shahab
ge size of processing data in the partitions, and if I have understood correctly I should have smaller partitions (but many of them)?! Is there any way that I can set the number of partitions dynamically in "spark-env.sh" or in the submitted Spark application? best, /Shahab

How many executor process does an application receives?

2014-10-28 Thread shahab
,...) ?! best, /Shahab

Re: Why RDD is not cached?

2014-10-28 Thread shahab
l cache()? By itself it does nothing but once an action > requires it to be computed it should become cached. > On Oct 28, 2014 8:19 AM, "shahab" wrote: > >> Hi, >> >> I have a standalone spark , where the executor is set to have 6.3 G >> memory , as I am

Why RDD is not cached?

2014-10-28 Thread shahab
n results. Any idea what I am missing in my settings, or... ? thanks, /Shahab

What this exception means? ConnectionManager: key already cancelled ?

2014-10-27 Thread shahab
java.nio.channels.CancelledKeyException at org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:386) at org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:139) Any idea where I should look for the cause? best, /shahab The following is part of

Measuring execution time

2014-10-24 Thread shahab
underlying DAG and associated tasks it is hard to find what I am looking for. best, /Shahab

Does SparkSQL support Hive built-in functions?

2014-10-22 Thread shahab
Hi, I just wonder if SparkSQL supports Hive built-in functions (e.g. from_unixtime) or any of the functions pointed out here : ( https://cwiki.apache.org/confluence/display/Hive/Tutorial) best, /Shahab

What's wrong with my spark filter? I get "org.apache.spark.SparkException: Task not serializable"

2014-10-17 Thread shahab
.... best, /Shahab
