Re: How to do broadcast join in SparkSQL

2014-10-07 Thread Jianshi Huang
Looks like https://issues.apache.org/jira/browse/SPARK-1800 is not merged into master? I cannot find spark.sql.hints.broadcastTables in the latest master, but it's in the following patch. https://github.com/apache/spark/commit/76ca4341036b95f71763f631049fdae033990ab5 Jianshi On Mon, Sep 29, 2014
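For reference, a minimal sketch of the mechanism that does exist in Spark SQL 1.1 -- the automatic broadcast-join size threshold -- rather than the unmerged per-table hint; the setting name is taken from SQLConf and the table names and value are illustrative:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // Tables smaller than this many bytes are broadcast to every executor during a join.
    sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)
    sqlContext.sql("SELECT * FROM big_table b JOIN small_table s ON b.key = s.key")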

Re: Parsing one big multiple line .xml loaded in RDD using Python

2014-10-07 Thread jan.zikes
Thank you, this seems to be the way to go, but unfortunately, when I try to use sc.wholeTextFiles() on a file stored on Amazon S3, I get the following error: 14/10/08 06:09:50 INFO input.FileInputFormat: Total input paths to process : 1 14/10/08 06:09:50 INFO input.FileInputFormat: To

Re: window every n elements instead of time based

2014-10-07 Thread Michael Allman
Yes, I meant batch interval. Thanks for clarifying. Cheers, Michael On Oct 7, 2014, at 11:14 PM, jayant [via Apache Spark User List] wrote: > Hi Michael, > > I think you are meaning batch interval instead of windowing. It can be > helpful for cases when you do not want to process very smal

Re: window every n elements instead of time based

2014-10-07 Thread Jayant Shekhar
Hi Michael, I think you mean the batch interval rather than windowing. It can be helpful for cases when you do not want to process very small batch sizes. The HDFS sink in Flume has the concept of rolling files based on time, number of events, or size. https://flume.apache.org/FlumeUserGuide.html#hd
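A minimal sketch of where the batch interval is set -- it is fixed when the StreamingContext is created, so "larger batches" means a larger interval here rather than a window (the 60-second value is illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("BatchIntervalExample")
    // One RDD is produced per 60 seconds of received data.
    val ssc = new StreamingContext(conf, Seconds(60))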

RE: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-07 Thread Haopu Wang
Liquan, yes, for a full outer join, a hash table on both sides is more efficient. For a left/right outer join, it looks like one hash table should be enough. From: Liquan Pei [mailto:liquan...@gmail.com] Sent: September 30, 2014 18:34 To: Haopu Wang Cc: d...@s

Re: Reading from HBase is too slow

2014-10-07 Thread Sean Owen
How did you run your program? I don't see from your earlier post that you ever asked for more executors. On Wed, Oct 8, 2014 at 4:29 AM, Tao Xiao wrote: > I found the reason why reading HBase is too slow. Although each > regionserver serves multiple regions for the table I'm reading, the number

Re: SparkStreaming program does not start

2014-10-07 Thread Sean Owen
Abraham is correct. spark-shell is for typing Spark code into. You can't give it a .scala file as an argument. spark-submit is for running a packaged Spark application. You also can't give it a .scala file. You need to compile a .jar file with your application. This is true for Spark Streaming too.
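A minimal sketch of that workflow, assuming an sbt project whose main object is called Try1 (the class name and jar path are illustrative):

    # Package the application into a jar, then hand the jar (not the .scala file) to spark-submit.
    sbt package
    ./bin/spark-submit --class Try1 --master "local[2]" target/scala-2.10/try1_2.10-0.1.jar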

Support for Parquet V2 in ParquetTableSupport?

2014-10-07 Thread Michael Allman
Hello, I was interested in testing Parquet V2 with Spark SQL, but noticed after some investigation that the parquet writer that Spark SQL uses is fixed at V1 here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala#L350. An

Re: Spark Streaming : Could not compute split, block not found

2014-10-07 Thread Tian Zhang
Hi, we are using spark 1.1.0 streaming and we are hitting this same issue. Basically from the job output I saw the following things happen in sequence. 948 14/10/07 18:09:59 INFO storage.BlockManagerInfo: Added input-0-1412705397200 in memory on ip-10-4-62-85.ec2.internal:59230 (size: 5.3 MB, fr

Re: window every n elements instead of time based

2014-10-07 Thread Michael Allman
Hi Andrew, The use case I have in mind is batch data serialization to HDFS, where sizing files to a certain HDFS block size is desired. In my particular use case, I want to process 10GB batches of data at a time. I'm not sure this is a sensible use case for spark streaming, and I was trying to

Re: Same code --works in spark 1.0.2-- but not in spark 1.1.0

2014-10-07 Thread Andrew Ash
Hi Meethu, I believe you may be hitting a regression in https://issues.apache.org/jira/browse/SPARK-3633 If you are able, could you please try running a patched version of Spark 1.1.0 that has commit 4fde28c reverted and see if the errors go away? Posting your results on that bug would be useful
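A sketch of how one might build such a patched version locally (the branch name and build profiles are illustrative):

    git checkout v1.1.0 -b spark-3633-check
    git revert 4fde28c
    mvn -Pyarn -Phadoop-2.4 -DskipTests clean package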

Re: Shuffle files

2014-10-07 Thread Andrew Ash
You will need to restart your Mesos workers to pick up the new limits as well. On Tue, Oct 7, 2014 at 4:02 PM, Sunny Khatri wrote: > @SK: > Make sure ulimit has taken effect as Todd mentioned. You can verify via > ulimit -a. Also make sure you have proper kernel parameters set in > /etc/sysctl.c

Re: Reading from HBase is too slow

2014-10-07 Thread Tao Xiao
I found the reason why reading HBase is too slow. Although each regionserver serves multiple regions for the table I'm reading, the number of Spark workers allocated by Yarn is too low. Actually, I could see that the table has dozens of regions spread over about 20 regionservers, but only two Spar
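For reference, a sketch of requesting more executors from YARN at submit time so the tasks can spread across the regionservers (all values and the application class/jar are illustrative):

    ./bin/spark-submit --master yarn-client \
      --num-executors 20 --executor-cores 2 --executor-memory 4g \
      --class com.example.HBaseScan hbase-scan.jar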

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-07 Thread Li HM
Here is the hive-site.xml hive.metastore.local false controls whether to connect to remove metastore server or open a new metastore server in Hive Client JVM hive.metastore.uris thrift://*:2513 Remote location of the metastore server hive.metastore.wareh

Re: dynamic sliding window duration

2014-10-07 Thread Tobias Pfeiffer
Hi, On Wed, Oct 8, 2014 at 4:50 AM, Josh J wrote: > > I have a source which fluctuates in the frequency of streaming tuples. I > would like to process certain batch counts, rather than batch window > durations. Is it possible to either > > 1) define batch window sizes > Cf. http://apache-spark-u

Re: Record-at-a-time model for Spark Streaming

2014-10-07 Thread Tobias Pfeiffer
Jianneng, On Wed, Oct 8, 2014 at 8:44 AM, Jianneng Li wrote: > > I understand that Spark Streaming uses micro-batches to implement > streaming, while traditional streaming systems use the record-at-a-time > processing model. The performance benefit of the former is throughput, and > the latter is

Fwd: SparkStreaming program does not start

2014-10-07 Thread Tobias Pfeiffer
Can we please set the Reply-To header to the mailing list address... -- Forwarded message -- From: Tobias Pfeiffer Date: Wed, Oct 8, 2014 at 11:16 AM Subject: Re: SparkStreaming program does not start To: spr Hi, On Wed, Oct 8, 2014 at 7:47 AM, spr wrote: > I'm probably doin

Spark-Shell: OOM: GC overhead limit exceeded

2014-10-07 Thread sranga
Hi I am new to Spark and trying to develop an application that loads data from Hive. Here is my setup: * Spark-1.1.0 (built using -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive) * Executing Spark-shell on a box with 16 GB RAM * 4 Cores Single Processor * OpenCSV library (SerDe) * Hive table has
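A minimal sketch of giving the shell more heap than the defaults (the values are illustrative; on a 16 GB box the small default driver heap is usually the first thing to raise):

    ./bin/spark-shell --driver-memory 8g --executor-memory 6g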

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-07 Thread Cheng Lian
So it seems that the classpath issue has been resolved. The instantiation failure should be related to your hive-site.xml. Would you mind creating a public gist of your hive-site.xml? On 10/8/14 4:34 AM, Li HM wrote: Thanks Cheng. Here is the error message after a fresh build. $ mvn -Pyarn

spark fold question

2014-10-07 Thread chinchu
Hi, I am using the fold(zeroValue)(t1, t2) on the RDD & I noticed that it runs in parallel on all the partitions & then aggregates the results from the partitions. My data object is not aggregate-able & I was wondering if there's any way to run the fold sequentially. [I am looking to do a foldLeft
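A sketch of two ways to keep the fold sequential, with stand-in values for the RDD, zero value, and (non-commutative) operator:

    val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 4)         // stand-in for the real RDD
    val zeroValue = ""                                           // stand-in zero value
    val op = (acc: String, x: String) => acc + x                 // stand-in operator

    // 1) Stream partitions back to the driver in order and fold there (no parallelism).
    val sequential = rdd.toLocalIterator.foldLeft(zeroValue)(op)

    // 2) Collapse to a single partition so one task performs the whole fold.
    val inOneTask = rdd.coalesce(1)
      .mapPartitions(it => Iterator(it.foldLeft(zeroValue)(op)))
      .first()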

Re: SparkStreaming program does not start

2014-10-07 Thread Abraham Jacob
I have not played around with spark-shell much (especially for Spark Streaming), but was just suggesting that the spark-submit logs could possibly tell you what's going on, and yes, for that you would need to create a jar. I am not even sure that you can give a .scala file to spark-shell. Usage: ./bin/sp

Record-at-a-time model for Spark Streaming

2014-10-07 Thread Jianneng Li
Hello, I understand that Spark Streaming uses micro-batches to implement streaming, while traditional streaming systems use the record-at-a-time processing model. The performance benefit of the former is throughput, and the latter is latency. I'm wondering what it would take to implement record-at

bug with IPython notebook?

2014-10-07 Thread Andy Davidson
Hi, I think I found a bug in the IPython notebook integration, but I am not sure how to report it. I am running spark-1.1.0-bin-hadoop2.4 on an AWS EC2 cluster. I start the cluster using the launch script provided by Spark. I start the IPython notebook on my cluster master as follows and use an ssh tunnel

Re: Spark / Kafka connector - CDH5 distribution

2014-10-07 Thread Abraham Jacob
Never mind... my bad... made a typo. looks good. Thanks, On Tue, Oct 7, 2014 at 3:57 PM, Abraham Jacob wrote: > Thanks Sean, > > Sorry in my earlier question I meant to type CDH5.1.3 not CDH5.1.2 > > I presume it's included in spark-streaming_2.10-1.0.0-cdh5.1.3 > > But for some reason eclipse

Re: SparkStreaming program does not start

2014-10-07 Thread spr
|| Try using spark-submit instead of spark-shell Two questions: - What does spark-submit do differently from spark-shell that makes you think that may be the cause of my difficulty? - When I try spark-submit it complains about "Error: Cannot load main class from JAR: file:/Users/spr/.../try1.scal

Re: Shuffle files

2014-10-07 Thread Sunny Khatri
@SK: Make sure ulimit has taken effect as Todd mentioned. You can verify via ulimit -a. Also make sure you have proper kernel parameters set in /etc/sysctl.conf (MacOSX) On Tue, Oct 7, 2014 at 3:57 PM, Lisonbee, Todd wrote: > > Are you sure the new ulimit has taken effect? > > How many cores are

Re: SparkStreaming program does not start

2014-10-07 Thread Abraham Jacob
Try using spark-submit instead of spark-shell On Tue, Oct 7, 2014 at 3:47 PM, spr wrote: > I'm probably doing something obviously wrong, but I'm not seeing it. > > I have the program below (in a file try1.scala), which is similar but not > identical to the examples. > > import org.apache.spark

RE: Shuffle files

2014-10-07 Thread Lisonbee, Todd
Are you sure the new ulimit has taken effect? How many cores are you using? How many reducers? "In general if a node in your cluster has C assigned cores and you run a job with X reducers then Spark will open C*X files in parallel and start writing. Shuffle consolidat
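A minimal sketch of turning on shuffle file consolidation in a Spark 1.x application, which reduces how many shuffle files are open at once:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ShuffleConsolidation")
      .set("spark.shuffle.consolidateFiles", "true")   // fewer shuffle files per core
    val sc = new SparkContext(conf)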

Re: Spark / Kafka connector - CDH5 distribution

2014-10-07 Thread Abraham Jacob
Thanks Sean, Sorry, in my earlier question I meant to type CDH5.1.3, not CDH5.1.2. I presume it's included in spark-streaming_2.10-1.0.0-cdh5.1.3, but for some reason Eclipse complains that the import org.apache.spark.streaming.kafka cannot be resolved, even though I have included the spark-streaming_2.

SparkStreaming program does not start

2014-10-07 Thread spr
I'm probably doing something obviously wrong, but I'm not seeing it. I have the program below (in a file try1.scala), which is similar but not identical to the examples. import org.apache.spark._ import org.apache.spark.streaming._ import org.apache.spark.streaming.StreamingContext._ println("P

Storing shuffle files on a Tachyon

2014-10-07 Thread Soumya Simanta
Is it possible to store spark shuffle files on Tachyon ?

Re: Spark / Kafka connector - CDH5 distribution

2014-10-07 Thread Sean Owen
Yes, it is the entire Spark distribution. On Oct 7, 2014 11:36 PM, "Abraham Jacob" wrote: > Hi All, > > Does anyone know if CDH5.1.2 packages the spark streaming kafka connector > under the spark externals project? > > > > -- > ~ >

Spark / Kafka connector - CDH5 distribution

2014-10-07 Thread Abraham Jacob
Hi All, Does anyone know if CDH5.1.2 packages the spark streaming kafka connector under the spark externals project? -- ~

Re: MLLib Linear regression

2014-10-07 Thread Xiangrui Meng
Did you test different regularization parameters and step sizes? In the combination that works, I don't see "A + D". Did you test that combination? Is there any linear dependency between A's columns and D's columns? -Xiangrui On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak wrote: > BTW, one detail:
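A sketch of sweeping step size and regularization strength with an L2-regularized (ridge) model; the tiny training RDD is only a placeholder:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}

    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(0.5, 1.5)),
      LabeledPoint(0.0, Vectors.dense(-0.5, -1.5))))   // placeholder data

    for (stepSize <- Seq(0.01, 0.1, 1.0); regParam <- Seq(0.0, 0.01, 0.1)) {
      val model = RidgeRegressionWithSGD.train(training, 200, stepSize, regParam)
      // compute MSE on a held-out set here and keep the best (stepSize, regParam) pair
    }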

Re: anyone else seeing something like https://issues.apache.org/jira/browse/SPARK-3637

2014-10-07 Thread Andrew Or
Hi Steve, what Spark version are you running? 2014-10-07 14:45 GMT-07:00 Steve Lewis : > java.lang.NullPointerException > at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) > at org.apache.spark.scheduler.Task.run(Task.scala:54

anyone else seeing something like https://issues.apache.org/jira/browse/SPARK-3637

2014-10-07 Thread Steve Lewis
java.lang.NullPointerException at java.nio.ByteBuffer.wrap(ByteBuffer.java:392) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurren

Re: Shuffle files

2014-10-07 Thread SK
- We set ulimit to 50. But I still get the same "too many open files" warning. - I tried setting consolidateFiles to True, but that did not help either. I am using a Mesos cluster. Does Mesos have any limit on the number of open files? thanks -- View this message in context: http:/

RE: MLLib Linear regression

2014-10-07 Thread Sameer Tilak
BTW, one detail: when the number of iterations is 100, all weights are zero or below and the indices are only from set A. When the number of iterations is 150, I see 30+ non-zero weights (when sorted by weight) and the indices are distributed across all sets. However, MSE is high (5.xxx) and the result does no

MLLib Linear regression

2014-10-07 Thread Sameer Tilak
Hi All, I have the following classes of features: class A: 15000 features; class B: 170 features; class C: 900 features; class D: 6000 features. I use linear regression (over sparse data). I get excellent results with low RMSE (~0.06) for the following combinations of classes: 1. A + B + C 2. B + C + D 3. A

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-07 Thread Li HM
Thanks Cheng. Here is the error message after a fresh build. $ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.0 -Phive -DskipTests clean package [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM ...

dynamic sliding window duration

2014-10-07 Thread Josh J
Hi, I have a source which fluctuates in the frequency of streaming tuples. I would like to process certain batch counts, rather than batch window durations. Is it possible to either 1) define batch window sizes or 2) dynamically adjust the duration of the sliding window? Thanks, Josh
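For reference, a minimal sketch of the time-based API that does exist; a count-based window would need custom buffering on top of it (the socket source and durations are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))
    val stream = ssc.socketTextStream("localhost", 9999)        // stand-in source
    // 30-second window, recomputed every 10 seconds (both multiples of the batch interval).
    stream.window(Seconds(30), Seconds(10))
      .foreachRDD(rdd => println(s"tuples in this window: ${rdd.count()}"))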

Re: Stupid Spark question

2014-10-07 Thread Sean Owen
You can create a new Configuration object in something like a mapPartitions method and use that. It will pick up local Hadoop configuration from the node, but presumably the Spark workers and HDFS data nodes are colocated in this case, so the machines have the correct Hadoop config locally. On Tue
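A minimal sketch of that pattern: the Configuration is built inside the task, so it comes from the worker's local Hadoop config instead of being serialized from the driver (the key-to-path layout is hypothetical):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val keys = sc.parallelize(Seq("part-00000", "part-00001"))  // stand-in keys
    val contents = keys.mapPartitions { iter =>
      val conf = new Configuration()          // picks up the node-local Hadoop config
      val fs = FileSystem.get(conf)
      iter.map { key =>
        val in = fs.open(new Path(s"/data/$key"))
        try scala.io.Source.fromInputStream(in).mkString finally in.close()
      }
    }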

Re: return probability \ confidence instead of actual class

2014-10-07 Thread Sean Owen
It looks like you are directly computing the SVM decision function in both cases: val predictions2 = m_users_double.map{point=> point.zip(weights).map(a=> a._1 * a._2).sum + intercept }.cache() clf.decision_function(T) This does not give you +1/-1 in SVMs (well... not for most points, which wi
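A minimal sketch of getting the raw margin out of MLlib directly: after clearThreshold(), predict() returns w·x + intercept rather than a 0/1 label, matching the manual computation above (the toy data is a placeholder):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val training = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 2.0)),
      LabeledPoint(0.0, Vectors.dense(-1.0, -2.0))))
    val model = SVMWithSGD.train(training, 100)
    model.clearThreshold()                          // predict() now returns the raw margin
    val margins = training.map(p => (model.predict(p.features), p.label))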

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Matei Zaharia
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.
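A sketch of the HiveContext route, assuming Spark 1.1 where HiveContext parses HiveQL by default; the table and column names are taken from the thread below:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("""
      SELECT *
      FROM sales s
      JOIN magasin m ON s.STO_KEY = m.STO_KEY
      JOIN eans e ON s.BARC_KEY = e.BARC_KEY AND m.FORM_KEY = e.FORM_KEY
    """)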

Re: return probability \ confidence instead of actual class

2014-10-07 Thread Sunny Khatri
Not familiar with scikit SVM implementation ( and I assume you are using linearSVC). To figure out an optimal decision boundary based on the scores obtained, you can use an ROC curve varying your thresholds. On Tue, Oct 7, 2014 at 12:08 AM, Adamantios Corais < adamantios.cor...@gmail.com> wrote:
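A minimal sketch of the ROC idea using MLlib's own evaluator; scoreAndLabels stands in for (raw score, true label) pairs produced by a model whose threshold has been cleared:

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (-0.3, 0.0), (0.4, 1.0)))  // placeholder
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    metrics.roc()            // (false positive rate, true positive rate) per threshold
    metrics.areaUnderROC()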

Re: Cannot read from s3 using "sc.textFile"

2014-10-07 Thread Daniil Osipov
Try using s3n:// instead of s3 (for the credential configuration as well). On Tue, Oct 7, 2014 at 9:51 AM, Sunny Khatri wrote: > Not sure if it's supposed to work. Can you try newAPIHadoopFile() passing > in the required configuration object. > > On Tue, Oct 7, 2014 at 4:20 AM, Tomer Benyamini
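A minimal sketch of the s3n:// form, with placeholder credentials and bucket:

    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
    val lines = sc.textFile("s3n://my-bucket/path/to/file.txt")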

Stupid Spark question

2014-10-07 Thread Steve Lewis
I am porting a Hadoop job to Spark - One issue is that the workers need to read files from hdfs reading a different file based on the key or in some cases reading an object that is expensive to serialize. This is easy if the worker has access to the JavaSparkContext (I am working in Java) but thi

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread Buntu Dev
Thanks for the input Soumitra. On Tue, Oct 7, 2014 at 10:24 AM, Soumitra Kumar wrote: > Currently I am not doing anything, if anything change start from scratch. > > In general I doubt there are many options to account for schema changes. > If you are reading files using impala, then it may allo

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread Soumitra Kumar
Currently I am not doing anything; if anything changes, I start from scratch. In general, I doubt there are many options to account for schema changes. If you are reading the files using Impala, it may allow this if the schema changes are append-only. Otherwise, existing Parquet files have to be migrated

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread Buntu Dev
Thanks for the info, Soumitra; it's a good start for me. I just wanted to know how you are managing schema changes/evolution, since parquetSchema is provided to setSchema in the above sample code. On Tue, Oct 7, 2014 at 10:09 AM, Soumitra Kumar wrote: > I have used it to write Parquet files as: > > va

Spark Streaming Fault Tolerance (?)

2014-10-07 Thread Massimiliano Tomassi
Reading the Spark Streaming Programming Guide I found a couple of interesting points. First of all, while talking about receivers, it says: *"If the number of cores allocated to the application is less than or equal to the number of input DStreams / receivers, then the system will receive data, bu

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread Soumitra Kumar
I have used it to write Parquet files as: val job = new Job val conf = job.getConfiguration conf.set (ParquetOutputFormat.COMPRESSION, CompressionCodecName.SNAPPY.name ()) ExampleOutputFormat.setSchema (job, MessageTypeParser.parseMessageType (parquetSchema)) rdd saveAsNewAPIHadoopFile (rddToFile
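A fuller sketch of that pattern, with an illustrative schema and the pre-rename parquet.* package names assumed; records are converted to (Void, Group) pairs to match ExampleOutputFormat:

    import org.apache.hadoop.mapreduce.Job
    import parquet.example.data.Group
    import parquet.example.data.simple.SimpleGroup
    import parquet.hadoop.ParquetOutputFormat
    import parquet.hadoop.example.ExampleOutputFormat
    import parquet.hadoop.metadata.CompressionCodecName
    import parquet.schema.MessageTypeParser

    val parquetSchema = "message event { required binary name (UTF8); required int32 count; }"

    val job = new Job
    ExampleOutputFormat.setSchema(job, MessageTypeParser.parseMessageType(parquetSchema))
    job.getConfiguration.set(ParquetOutputFormat.COMPRESSION, CompressionCodecName.SNAPPY.name())

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2))).map { case (name, count) =>
      // parse the schema inside the task so nothing non-serializable is captured
      val group = new SimpleGroup(MessageTypeParser.parseMessageType(parquetSchema))
        .append("name", name)
        .append("count", count)
      (null: Void, group)
    }
    pairs.saveAsNewAPIHadoopFile("hdfs:///tmp/events.parquet", classOf[Void], classOf[Group],
      classOf[ExampleOutputFormat], job.getConfiguration)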

Re: Cannot read from s3 using "sc.textFile"

2014-10-07 Thread Sunny Khatri
Not sure if it's supposed to work. Can you try newAPIHadoopFile() passing in the required configuration object. On Tue, Oct 7, 2014 at 4:20 AM, Tomer Benyamini wrote: > Hello, > > I'm trying to read from s3 using a simple spark java app: > > - > > SparkConf sparkConf = new Sp

Re: Kafka->HDFS to store as Parquet format

2014-10-07 Thread bdev
After a bit of looking around, I found saveAsNewAPIHadoopFile could be used to specify the ParquetOutputFormat. Has anyone used it to convert JSON to Parquet format or any pointers are welcome, thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-HD

Re: Schema change on Spark Hive (Parquet file format) table not working

2014-10-07 Thread barge.nilesh
To find the root cause, I installed Hive 0.12 separately and tried the exact same test through the Hive CLI, and it *passed*. So it looks like it is a problem with Spark SQL. Has anybody else faced this issue (Hive Parquet table schema change)? Should I create a JIRA ticket for this? -- View this message

Re: Parsing one big multiple line .xml loaded in RDD using Python

2014-10-07 Thread Davies Liu
Maybe sc.wholeTextFiles() is what you want; you can get the whole text and parse it yourself. On Tue, Oct 7, 2014 at 1:06 AM, wrote: > Hi, > > I have already unsucesfully asked quiet simmilar question at stackoverflow, > particularly here: > http://stackoverflow.com/questions/26202978/spark-an
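A sketch of that approach in Scala (the thread itself uses PySpark, where the call is the same); the page/title/text element names assume a MediaWiki-style dump and the path is a placeholder:

    val files = sc.wholeTextFiles("hdfs:///data/wiki/")     // each element is (path, fullContent)
    val titleAndText = files.flatMap { case (_, xml) =>
      (scala.xml.XML.loadString(xml) \\ "page").map { page =>
        ((page \ "title").text, (page \\ "text").text)
      }
    }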

akka.remote.transport.netty.NettyTransport

2014-10-07 Thread Jacob Chacko - Catalyst Consulting
Hi All, I have one master and one worker on AWS (Amazon Web Services) and am running the Spark word count code provided at https://spark.apache.org/examples.html. We are using Spark version 1.0.2. Word Count: val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split("

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Gen
Hi, in fact, the same problem happens when I try several joins together: SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY) py4j.protocol.Py4JJavaError: An error occurred while callin

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread TANG Gen
Hi, the same problem happens when I try several joins together, such as 'SELECT * FROM sales INNER JOIN magasin ON sales.STO_KEY = magasin.STO_KEY INNER JOIN eans ON (sales.BARC_KEY = eans.BARC_KEY and magasin.FORM_KEY = eans.FORM_KEY)' The error information is as follow: py4j.protocol.Py4JJavaEr

Re: Spark 1.1.0 with Hadoop 2.5.0

2014-10-07 Thread Cheng Lian
The build command should be correct. What exact error did you encounter when trying Spark 1.1 + Hive 0.12 + Hadoop 2.5.0? On 10/7/14 2:21 PM, Li HM wrote: Thanks for the replied. Please refer to my another post entitled "How to make ./bin/spark-sql work with hive". It has all the error/excepti

Re: Unable to ship external Python libraries in PYSPARK

2014-10-07 Thread yh18190
Hi David, Thanks for the reply and the effort you put into explaining the concepts. Thanks for the example; it worked. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-ship-external-Python-libraries-in-PYSPARK-tp14074p15844.html Sent from the Apache Spark User List

Re: GraphX: Types for the Nodes and Edges

2014-10-07 Thread Oshi
Hi again, Thank you for your suggestion :) I've tried to implement this method but I'm stuck trying to union the payload before creating the graph. Below is a really simplified snippet of what has worked so far. //Reading the articles given in json format val articles = sqlContext.jsonFile(pa
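One sketch of the union step: wrap the two payload types in a common trait so a single vertex RDD can hold both (the inputs below are placeholders for the JSON-derived RDDs):

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    sealed trait Payload
    case class Article(title: String) extends Payload
    case class Author(name: String) extends Payload

    val articleVerts: RDD[(VertexId, Payload)] =
      sc.parallelize(Seq((1L, Article("Some article"))))
    val authorVerts: RDD[(VertexId, Payload)] =
      sc.parallelize(Seq((100L, Author("Jane Doe"))))
    val edges: RDD[Edge[String]] =
      sc.parallelize(Seq(Edge(100L, 1L, "wrote")))

    // union the two vertex RDDs before building the graph
    val graph: Graph[Payload, String] = Graph(articleVerts.union(authorVerts), edges)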

Same code --works in spark 1.0.2-- but not in spark 1.1.0

2014-10-07 Thread MEETHU MATHEW
Hi all, My code was working fine in Spark 1.0.2, but after upgrading to 1.1.0, it throws exceptions and tasks are failing. The code contains some map and filter transformations followed by groupByKey (reduceByKey in another version of the code). What I could find out is that the code works fine un

Fwd: Cannot read from s3 using "sc.textFile"

2014-10-07 Thread Tomer Benyamini
Hello, I'm trying to read from s3 using a simple spark java app: - SparkConf sparkConf = new SparkConf().setAppName("TestApp"); sparkConf.setMaster("local"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");

Cannot read from s3 using "sc.textFile"

2014-10-07 Thread Tomer Benyamini
Hello, I'm trying to read from s3 using a simple spark java app: - SparkConf sparkConf = new SparkConf().setAppName("TestApp"); sparkConf.setMaster("local"); JavaSparkContext sc = new JavaSparkContext(sparkConf); sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "XX");

Re: Hive Parquet Serde from Spark

2014-10-07 Thread quintona
I have found related PRs in the parquet-mr project: https://github.com/Parquet/parquet-mr/issues/324; however, using that version of the bundle doesn't solve the issue. The issue seems to be related to package scope across separate class loaders. I am busy looking for a workaround. -- View this messa

Re: Relation between worker memory and executor memory in standalone mode

2014-10-07 Thread MEETHU MATHEW
Try to set --total-executor-cores to limit how many total cores it can use. Thanks & Regards, Meethu M On Thursday, 2 October 2014 2:39 AM, Akshat Aranya wrote: I guess one way to do so would be to run >1 worker per node, like say, instead of running 1 worker and giving it 8 cores, you c
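A minimal sketch of the flag on a standalone cluster (all values and the jar are illustrative):

    ./bin/spark-submit --master spark://master:7077 \
      --total-executor-cores 16 --executor-memory 4g \
      --class com.example.MyApp my-app.jar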

Parsing one big multiple line .xml loaded in RDD using Python

2014-10-07 Thread jan.zikes
Hi, I have already unsuccessfully asked a quite similar question on Stack Overflow, particularly here: http://stackoverflow.com/questions/26202978/spark-and-python-trying-to-parse-wikipedia-using-gensim. I've also tried some workarounds, but unsuccessfully; the workaround problem can be f

Re: Strategies for reading large numbers of files

2014-10-07 Thread deenar.toraskar
Hi Landon, I had a problem very similar to yours, where we had to process around 5 million relatively small files on NFS. After trying various options, we did something similar to what Matei suggested: 1) take the original path, find the subdirectories under that path, and then parallelize the
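A sketch of that strategy as described above: list the subdirectories on the driver, then parallelize the list so each task reads its own files (the paths are placeholders; the Configuration/FileSystem objects are recreated per task because they are not serializable):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val root = new Path("/mnt/nfs/input")
    val dirs = FileSystem.get(new Configuration())
      .listStatus(root).filter(_.isDirectory).map(_.getPath.toString)

    val contents = sc.parallelize(dirs, dirs.length).flatMap { dir =>
      val fs = FileSystem.get(new Configuration())
      fs.listStatus(new Path(dir)).filterNot(_.isDirectory).map { status =>
        val in = fs.open(status.getPath)
        try scala.io.Source.fromInputStream(in).mkString finally in.close()
      }
    }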

Re: Spark and Python using generator of data bigger than RAM as input to sc.parallelize()

2014-10-07 Thread jan.zikes
The file itself is for now just a Wikipedia dump, which can be downloaded from here: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. It's basically one big .xml file that I need to parse in a way that gives title + text on one line of the data. For this I currently use gens

Re: Spark Streaming saveAsNewAPIHadoopFiles

2014-10-07 Thread Abraham Jacob
Hi All, Continuing on this discussion... Is there a good reason why the def of "saveAsNewAPIHadoopFiles" in org/apache/spark/streaming/api/java/JavaPairDStream.scala is defined like this - def saveAsNewAPIHadoopFiles( prefix: String, suffix: String, keyClass: Class[_], val

HiveServer1 and SparkSQL

2014-10-07 Thread deenar.toraskar
Hi, Shark supported both the HiveServer1 and HiveServer2 thrift interfaces (using $ bin/shark -service sharkserver[1 or 2]). SparkSQL seems to support only HiveServer2. I was wondering what is involved in adding support for HiveServer1. Is this something straightforward to do that I can embark on mys

Re: return probability \ confidence instead of actual class

2014-10-07 Thread Adamantios Corais
Well, apparently, the above Python set-up is wrong. Please consider the following set-up which DOES use 'linear' kernel... And the question remains the same: how to interpret Spark results (or why Spark results are NOT bounded between -1 and 1)? On Mon, Oct 6, 2014 at 8:35 PM, Sunny Khatri wrote: