EventLoggingListener threw an exception when sparkContext.stop

2015-07-02 Thread Ayoub
Hello, I recently upgraded to spark 1.4 on Mesos 0.22.1 and now existing code throws the exception below. The interesting part is that this problem happens only when the spark.eventLog.enabled flag is set to true. So I guess it has something to do with the spark history server. The problem happens

[spark1.4] sparkContext.stop causes exception on Mesos

2015-07-03 Thread Ayoub
is problem ? Thanks, Ayoub.

[spark1.5.1] HiveQl.parse throws org.apache.spark.sql.AnalysisException: null

2015-10-20 Thread Ayoub
cala:277) at org.apache.spark.sql.hive.SQLChecker$$anonfun$1.apply(SQLChecker.scala:9) at org.apache.spark.sql.hive.SQLChecker$$anonfun$1.apply(SQLChecker.scala:9) Should that be done differently on spark 1.5.1 ? Thanks, Ayoub

RDD[Future[T]] => Future[RDD[T]]

2015-07-26 Thread Ayoub
RDD was a standard scala collection then calling "scala.concurrent.Future.sequence" would have resolved the issue but RDD is not a TraversableOnce (which is required by the method). Is there a way to do this kind of transformation with an RDD[Future[T]] ? Thanks, Ayoub.

Re: RDD[Future[T]] => Future[RDD[T]]

2015-07-27 Thread Ayoub
sequence of > futures in each partition and return the resulting iterator. > > On Sun, Jul 26, 2015 at 10:43 PM, Ayoub Benali < > benali.ayoub.i...@gmail.com> wrote: > >> It doesn't work

Re: SQL query over (Long, JSON string) tuples

2015-01-29 Thread Ayoub
Hello, SQLContext and hiveContext have a "jsonRDD" method which accepts an RDD[String], where each string is a JSON string, and returns a SchemaRDD; it extends RDD[Row], which is the type you want. Afterwards you should be able to do a join to keep your tuple. Best, Ayoub. 2015-01-29 10:12
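A minimal sketch of the approach described above, assuming a Spark 1.2-era HiveContext and illustrative RDD and table names:

    import org.apache.spark.sql.hive.HiveContext

    // assumes sc is an existing SparkContext and pairs is an RDD[(Long, String)]
    // whose second element is a JSON string
    val hiveContext = new HiveContext(sc)
    val json = pairs.map(_._2)                    // RDD[String] of raw JSON
    val schemaRDD = hiveContext.jsonRDD(json)     // infers a schema, returns a SchemaRDD (an RDD[Row])
    schemaRDD.registerTempTable("events")
    hiveContext.sql("SELECT * FROM events").collect()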

Fwd: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-29 Thread Ayoub
Hello, I had the same issue, then I found this JIRA ticket https://issues.apache.org/jira/browse/SPARK-4825 So I switched to Spark 1.2.1-snapshot, which solved the problem. 2015-01-30 8:40 GMT+01:00 Zhan Zhang : > I think it is expected. Refer to the comments in saveAsTable "Note that > this cu

Re: HiveContext created SchemaRDD's saveAsTable is not working on 1.2.0

2015-01-30 Thread Ayoub
I am not personally aware of a repo for snapshot builds. In my use case, I had to build spark 1.2.1-snapshot; see https://spark.apache.org/docs/latest/building-spark.html 2015-01-30 17:11 GMT+01:00 Debajyoti Roy : > Thanks Ayoub and Zhan, > I am new to spark and wanted to make sure i

[hive context] Unable to query array once saved as parquet

2015-01-30 Thread Ayoub
join.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) The full code leading to this issue is available here: gist <https://gist.github.com/ayoub-benali/54d6f3b8635530e4e936> Could the problem come from the way I ins

Re: [hive context] Unable to query array once saved as parquet

2015-01-30 Thread Ayoub
No it is not the case, here is the gist to reproduce the issue https://gist.github.com/ayoub-benali/54d6f3b8635530e4e936 On Jan 30, 2015 8:29 PM, "Michael Armbrust" wrote: > Is it possible that your schema contains duplicate columns or column with > spaces in the name? The parq

Re: [hive context] Unable to query array once saved as parquet

2015-01-31 Thread Ayoub
Hello, as asked, I just filed this JIRA issue <https://issues.apache.org/jira/browse/SPARK-5508>. I will add another similar code example which leads to a "GenericRow cannot be cast to org.apache.spark.sql.catalyst.expressions.SpecificMutableRow" exception. Best, Ayoub. 2015-0

Re: [hive context] Unable to query array once saved as parquet

2015-02-02 Thread Ayoub
"insertInto" on schema RDD does take only the name of the table. Thanks, Ayoub. 2015-01-31 22:30 GMT+01:00 Ayoub Benali : > Hello, > > as asked, I just filled this JIRA issue > <https://issues.apache.org/jira/browse/SPARK-5508>. > > I will add an other

Re: Parquet compression codecs not applied

2015-02-04 Thread Ayoub
quot;normal", it would be better if "spark.sql.parquet.compression.codec" would be "translated" to "parquet.compression" in case of hive context. Other wise the documentation should be updated to be more precise. 2015-02-04 19:13 GMT+01:00 sahanbull : > Hi

Re: How to get Hive table schema using Spark SQL or otherwise

2015-02-04 Thread Ayoub
Given a hive context you could execute: hiveContext.sql("describe TABLE_NAME") you would get the name of the fields and their types 2015-02-04 21:47 GMT+01:00 nitinkak001 : > I want to get a Hive table schema details into Spark. Specifically, I want > to > get column name and type information. Is
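A short sketch of that call, assuming an existing HiveContext and an illustrative table name:

    // each returned Row carries the column name, its type, and an optional comment
    val schema = hiveContext.sql("DESCRIBE my_table")
    schema.collect().foreach(println)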

Re: [hive context] Unable to query array once saved as parquet

2015-02-12 Thread Ayoub
the table as long as it is not partitioned. The queries work fine after that. Best, Ayoub. 2015-01-31 4:05 GMT+01:00 Cheng Lian : > According to the Gist Ayoub provided, the schema is fine. I reproduced > this issue locally, it should be a bug, but I don't think it's related to > SPAR

RDD to InputStream

2015-03-13 Thread Ayoub
Hello, I need to convert an RDD[String] to a java.io.InputStream but I didn't find an easy way to do it. Currently I am saving the RDD as a temporary file and then opening an InputStream on the file, but that is not really optimal. Does anybody know a better way to do that ? Thanks,

Re: RDD to InputStream

2015-03-13 Thread Ayoub
r if that's all really what you want to do though. > > > On Fri, Mar 13, 2015 at 9:54 AM, Ayoub > wrote: > > Hello, > > > > I need to convert an RDD[String] to a java.io.InputStream but I didn't > find > > an easy way to do it. > > Currentl

Re: RDD to InputStream

2015-03-18 Thread Ayoub
tream. > > On Fri, Mar 13, 2015 at 10:33 AM, Ayoub > wrote: > > Thanks Sean, > > > > I forgot to mention that the data is too big to be collected on the > driver. > > > > So yes your proposition would work in theory but in my case I cannot hold > >
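A hedged sketch of a streaming alternative along the lines hinted at in this thread: pull partitions to the driver lazily with toLocalIterator and wrap them in a java.io.SequenceInputStream, so the whole RDD never has to be collected at once (charset and newline handling are illustrative):

    import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
    import scala.collection.JavaConverters._
    import org.apache.spark.rdd.RDD

    def toInputStream(rdd: RDD[String]): InputStream = {
      // toLocalIterator fetches one partition at a time instead of collecting everything
      val streams = rdd.toLocalIterator.map { line =>
        new ByteArrayInputStream((line + "\n").getBytes("UTF-8")): InputStream
      }
      new SequenceInputStream(streams.asJavaEnumeration)
    }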

Parquet compression codecs not applied

2015-01-08 Thread Ayoub
om_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo") I tried that with spark 1.2 and 1.3 snapshot against hive 0.13 and I also tried that with Impala on the same cluster which correctly applied the compression codecs. Does anyone know what could be the problem ? T

Parquet compression codecs not applied

2015-01-09 Thread Ayoub
ear, month(from_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo") I tried that with spark 1.2 and 1.3 snapshot against hive 0.13 and I also tried that with Impala on the same cluster which correctly applied the compression codecs. Does anyone know what could be

Re: SQL JSON array operations

2015-01-15 Thread Ayoub
You could try to use hive context, which brings HiveQL; it would allow you to query nested structures using "LATERAL VIEW explode..." see doc here
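A small sketch of such a query, assuming an existing HiveContext; the "events" table and its "tags" array column are illustrative:

    val lines = sc.parallelize(Seq("""{"user": "baz", "tags": ["foo", "bar"]}"""))
    hiveContext.jsonRDD(lines).registerTempTable("events")

    hiveContext.sql(
      """SELECT user, tag
        |FROM events
        |LATERAL VIEW explode(tags) t AS tag""".stripMargin).collect()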

spark 2.0 readStream from a REST API

2016-07-31 Thread Ayoub Benali
I want to avoid having to save the data on s3 and then read it again. What would be the easiest way to hack around it ? Do I need to implement the Datasource API ? Are there examples on how to create a DataSource from a REST endpoint ? Best, Ayoub

Re: spark 2.0 readStream from a REST API

2016-07-31 Thread Ayoub Benali
urce: mysource. Please find packages at http://spark-packages.org Is there something I need to do in order to "load" the Stream source provider ? Thanks, Ayoub 2016-07-31 17:19 GMT+02:00 Jacek Laskowski : > On Sun, Jul 31, 2016 at 12:53 PM, Ayoub Benali > wrote: > > > I st
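The "Failed to find data source: mysource" error usually means the name passed to format() does not resolve to a provider class. A hedged sketch, assuming a hypothetical class com.example.MyRestSourceProvider that implements org.apache.spark.sql.sources.StreamSourceProvider, and that spark is an existing SparkSession:

    // either pass the fully qualified class name, or register a short name by listing the
    // class in META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
    val stream = spark.readStream
      .format("com.example.MyRestSourceProvider")
      .load()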

Re: spark 2.0 readStream from a REST API

2016-08-01 Thread Ayoub Benali
unt works in the example documentation? is there some other trait that needs to be implemented ? Thanks, Ayoub. org.apache.spark.sql.AnalysisException: Queries with streaming sources must > be executed with writeStream.start(); > at > org.apache.spark.sql.catalyst.analysis.UnsupportedOperati

Re: spark 2.0 readStream from a REST API

2016-08-02 Thread Ayoub Benali
Hello, here is the code I am trying to run: https://gist.github.com/ayoub-benali/a96163c711b4fce1bdddf16b911475f2 Thanks, Ayoub. 2016-08-01 13:44 GMT+02:00 Jacek Laskowski : > On Mon, Aug 1, 2016 at 11:01 AM, Ayoub Benali > wrote: > > > the problem now is that when I consum

Re: spark 2.0 readStream from a REST API

2016-08-02 Thread Ayoub Benali
GMT+02:00 Amit Sela : > I think you're missing: > > val query = wordCounts.writeStream > > .outputMode("complete") > .format("console") > .start() > > Did it help ? > > On Mon, Aug 1, 2016 at 2:44 PM Jacek Laskowski wrote: > >

Re: RDD[Future[T]] => Future[RDD[T]]

2015-07-26 Thread Ayoub Benali
It doesn't work because mapPartitions expects a function f:(Iterator[T]) ⇒ Iterator[U] while .sequence wraps the iterator in a Future 2015-07-26 22:25 GMT+02:00 Ignacio Blasco : > Maybe using mapPartitions and .sequence inside it? > El 26/7/2015 10:22 p. m., "Ayoub"
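One hedged way to reconcile the two signatures is to sequence the futures inside each partition and block there for the combined result; a sketch (the timeout is illustrative):

    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.reflect.ClassTag
    import org.apache.spark.rdd.RDD

    def awaitAll[T: ClassTag](rdd: RDD[Future[T]]): RDD[T] =
      rdd.mapPartitions { futures =>
        // sequence one partition's futures and wait once for all of them
        Await.result(Future.sequence(futures.toList), 10.minutes).iterator
      }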

Spark to eliminate full-table scan latency

2014-10-27 Thread Ron Ayoub
We have a table containing 25 features per item id along with feature weights. A correlation matrix can be constructed for every feature pair based on co-occurrence. If a user inputs a feature they can find out the features that are correlated with a self-join requiring a single full table scan.

RE: Spark to eliminate full-table scan latency

2014-10-27 Thread Ron Ayoub
om CC: user@spark.apache.org You can access cached data in spark through the JDBC server: http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server On Mon, Oct 27, 2014 at 1:47 PM, Ron Ayoub wrote: We have a table containing 25 features per item id along
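A hedged sketch of the cached-table self-join described in this thread, with illustrative table and column names (item_features(item_id, feature, weight)) and an existing SQLContext/HiveContext:

    sqlContext.cacheTable("item_features")

    val correlated = sqlContext.sql(
      """SELECT a.feature AS input_feature, b.feature AS other_feature, COUNT(*) AS co_occurrences
        |FROM item_features a
        |JOIN item_features b ON a.item_id = b.item_id AND a.feature <> b.feature
        |WHERE a.feature = 'some_feature'
        |GROUP BY a.feature, b.feature""".stripMargin)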

JdbcRDD in Java

2014-10-28 Thread Ron Ayoub
The following line of code indicates that the constructor is not defined. The only examples of JdbcRDD usage I can find are Scala examples. Does this work in Java? Are there any examples? Thanks. JdbcRDD rdd = new JdbcRDD(sp, () -> ods.getConnection(), sql, 1, 1783059, 1
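For reference, a hedged Scala sketch of the JdbcRDD constructor the Java snippet above is trying to call (URL, query and bounds are illustrative); from Java it is the function arguments that make the constructor awkward to match:

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:oracle:thin:@//host:1521/service", "user", "pass"),
      "SELECT id, payload FROM items WHERE id >= ? AND id <= ?",
      1L, 1783059L,      // lower and upper bounds used to split the query into partitions
      10,                // number of partitions
      rs => (rs.getLong(1), rs.getString(2)))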

Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I haven't learned Scala yet so as you might imagine I'm having challenges working with Spark from the Java API. For one thing, it seems very limited in comparison to Scala. I ran into a problem really quick. I need to hydrate an RDD from JDBC/Oracle and so I wanted to use the JdbcRDD. But that i

RE: Is Spark in Java a bad idea?

2014-10-28 Thread Ron Ayoub
I interpret this to mean you have to learn Scala in order to work with Spark in Scala (goes without saying) and also to work with Spark in Java (since you have to jump through some hoops for basic functionality). The best path here is to take this as a learning opportunity and sit down and learn

winutils

2014-10-29 Thread Ron Ayoub
Apparently Spark does require Hadoop even if you do not intend to use Hadoop. Is there a workaround for the below error I get when creating the SparkContext in Scala? I will note that I didn't have this problem yesterday when creating the Spark context in Java as part of the getting started App.
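A commonly used hedged workaround on Windows is to point hadoop.home.dir at a directory containing bin\winutils.exe before creating the context (the path is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    System.setProperty("hadoop.home.dir", "C:\\hadoop")   // must contain bin\winutils.exe

    val conf = new SparkConf().setAppName("local-test").setMaster("local[*]")
    val sc = new SparkContext(conf)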

RE: winutils

2014-10-29 Thread Ron Ayoub
he Hadoop jars? On Wed, Oct 29, 2014 at 11:31 AM, Ron Ayoub wrote: Apparently Spark does require Hadoop even if you do not intend to use Hadoop. Is there a workaround for the below error I get when creating the SparkContext in Scala? I will note that I didn't have this problem yesterd

Example of Fold

2014-10-31 Thread Ron Ayoub
I want to fold an RDD into a smaller RDD with the max elements. I have simple bean objects with 4 properties. I want to group by 3 of the properties and then select the max of the 4th, so I believe fold is the appropriate method for this. My question is: is there a good fold example out there? Add
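For "group by three properties, keep the max of the fourth", keying the RDD and reducing per key is usually simpler than fold; a hedged sketch with a hypothetical bean:

    import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions

    case class Item(a: String, b: String, c: String, score: Double)

    // assumes items is an RDD[Item]
    val maxPerGroup = items
      .keyBy(i => (i.a, i.b, i.c))                               // group by the first three properties
      .reduceByKey((x, y) => if (x.score >= y.score) x else y)   // keep the max of the fourth
      .values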

collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
The following code is failing on the collect. If I don't do the collect and go with a JavaRDD it works fine. Except I really would like to collect. At first I was getting an error regarding JDI threads and an index being 0. Then it just started locking up. I'm running the spark context locally o

RE: collecting fails - requirements for collecting (clone, hashCode etc?)

2014-12-03 Thread Ron Ayoub
I didn't realize I do get a nice stack trace when not running in debug mode. Basically, I believe Document has to be serializable. But since the question has already been asked, what are the other requirements for objects within an RDD that I should be aware of? Serializable is very understandable. Ho

how do you turn off info logging when running in local mode

2014-12-04 Thread Ron Ayoub
I have not yet gotten to the point of running standalone. In my case I'm still working on the initial product and I'm running directly in Eclipse and I've compiled using the Spark maven project since the downloadable spark binaries require Hadoop. With that said, I'm running fine and I have thin
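A hedged sketch of the usual way to quiet the INFO output when running from an IDE, using the log4j API that ships with Spark 1.x:

    import org.apache.log4j.{Level, Logger}

    // raise the level before (or right after) creating the SparkContext
    Logger.getLogger("org").setLevel(Level.WARN)
    Logger.getLogger("akka").setLevel(Level.WARN)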

Java RDD Union

2014-12-05 Thread Ron Ayoub
I'm a bit confused regarding expected behavior of unions. I'm running on 8 cores. I have an RDD that is used to collect cluster associations (cluster id, content id, distance) for internal clusters as well as leaf clusters since I'm doing hierarchical k-means and need all distances for sorting d

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
with the Java > objects inside an RDD when you get a reference to them in a method > like this. This is definitely a bad idea, as there is certainly no > guarantee that any other operations will see any, some or all of these > edits. > > On Fri, Dec 5, 2014 at 2:40 PM, Ron A

Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
This is from a separate thread with a differently named title. Why can't you modify the actual contents of an RDD using forEach? It appears to be working for me. What I'm doing is changing cluster assignments and distances per data item for each iteration of the clustering algorithm. The cluster

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
; inherently take up resource, unless you mark them to be persisted. > You're paying the cost of copying objects to create one RDD from next, > but that's mostly it. > > On Sat, Dec 6, 2014 at 6:28 AM, Ron Ayoub wrote: > > With that said, and the nature of iterative

RE: Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
ze the functions in that DAG to create minimal materialization on its way to final output. Regards Mayur On 06-Dec-2014 6:12 pm, "Ron Ayoub" wrote: This is from a separate thread with a differently named title. Why can't you modify the actual contents of an RDD using forEach?

RDD lineage and broadcast variables

2014-12-12 Thread Ron Ayoub
I'm still wrapping my head around the fact that the data backing an RDD is immutable since an RDD may need to be reconstructed from its lineage at any point. In the context of clustering there are many iterations where an RDD may need to change (for instance cluster assignments, etc) based on a
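A hedged sketch of the iterative pattern this implies: derive a new RDD each pass instead of mutating one in place, broadcasting the small mutable state (e.g. cluster centers); the types and the closest() helper are hypothetical:

    case class Point(id: Long, features: Array[Double])

    // assumes points is an RDD[Point], centers is a small local Array[Array[Double]],
    // and closest(features, centers) returns a cluster id
    var assignments = points.map(p => (p.id, closest(p.features, centers)))
    for (_ <- 1 to 10) {
      val bcCenters = sc.broadcast(centers)   // ship the current centers once per iteration
      assignments = points.map(p => (p.id, closest(p.features, bcCenters.value)))
      // ... recompute `centers` locally from `assignments`, then loop ...
    }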

Parquet compression codecs not applied

2015-01-08 Thread Ayoub Benali
rom_unixtime(ts)) as month, day(from_unixtime(ts)) as day from raw_foo") I tried that with spark 1.2 and 1.3 snapshot against hive 0.13 and I also tried that with Impala on the same cluster which correctly applied the compression codecs. Does anyone know what could be the problem ? Thanks, Ayoub.

Re: Parquet compression codecs not applied

2015-01-10 Thread Ayoub Benali
on that the compression codec is set differently with HiveContext. Ayoub. 2015-01-09 17:51 GMT+01:00 Michael Armbrust : > This is a little confusing, but that code path is actually going through > hive. So the spark sql configuration does not help. > > Perhaps, try: > set parque
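A hedged sketch of the workaround hinted at above (setting the Hive-side property directly instead of the Spark SQL one); the table names are illustrative and hiveContext is assumed to be an existing HiveContext:

    hiveContext.sql("SET parquet.compression=GZIP")   // the Hive/Parquet property, not spark.sql.parquet.compression.codec
    hiveContext.sql(
      """CREATE TABLE foo_parquet STORED AS PARQUET
        |AS SELECT * FROM raw_foo""".stripMargin)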

Re: SQL JSON array operations

2015-01-15 Thread Ayoub Benali
You could try to use hive context, which brings HiveQL; it would allow you to query nested structures using "LATERAL VIEW explode..." On Jan 15, 2015 4:03 PM, "jvuillermet" wrote: > let's say my json file lines look like this > > {"user": "baz", "tags" : ["foo", "bar"] } > > > sqlContext.json