Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Bertrand
n stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 172.31.61.41): java.lang.IllegalArgumentException: Unknown codec: com.hadoop.compression.lzo.LzoCodec Could you please help me read the file with pyspark? Thank you for your help, Cheers, Bertrand -- View t

Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Bertrand
Thanks for your prompt reply. I will follow https://issues.apache.org/jira/browse/SPARK-2394 and will let you know if everything works. Cheers, Bertrand -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-about-Google-Books-Ngrams-with-pyspark-1-4-1

Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-01 Thread Bertrand
age 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 172.31.12.23): java.lang.IllegalArgumentException: Unknown codec: com.hadoop.compression.lzo.LzoCodec Thanks for your help, Cheers, Bertrand

Re: Question about Google Books Ngrams with pyspark (1.4.1)

2015-09-02 Thread Bertrand
/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile. : java.lang.ClassNotFoundException: com.hadoop.mapreduce.LzoTextInputFormat v Thanks for yo

LZO-compressed files

2015-09-03 Thread Bertrand
n get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile. : java.lang.ClassNotFoundException: com.hadoop.mapreduce.LzoTextInputFormat v Could you please help me read an LZO file with pyspark? Thank you for your he

SparkContext#cancelJobGroup : is it safe? Who got burn? Who is alive?

2016-06-14 Thread Bertrand Dechoux
2. Who is or was using *interruptOnCancel*? Did you get burned? Is it still working without any incident? Thanks in advance for any info, feedback and war stories. Bertrand Dechoux

Re: Replacing Esper with Spark Streaming?

2015-09-15 Thread Bertrand Dechoux
proven wrong (which would also imply EsperTech is dead, which I also doubt...) Bertrand On Mon, Sep 14, 2015 at 2:31 PM, Todd Nist wrote: > Stratio offers a CEP implementation based on Spark Streaming and the > Siddhi CEP engine. I have not used the below, but they may be of some >

Re: Belief propagation algorithm is open sourced

2016-12-15 Thread Bertrand Dechoux
? Which ones? LibDAI, which created the supported format, "supports parameter learning of conditional probability tables by Expectation Maximization" according to the documentation. Is it your reference tool? Bertrand On Thu, Dec 15, 2016 at 5:21 AM, Bryan Cutler wrote: > I'll ch

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Bertrand Dechoux
chael already explained. Bertrand On Mon, Jun 16, 2014 at 1:23 PM, Michael Cutler wrote: > Hello Wei, > > I speak from experience of writing many HPC distributed applications using > Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel > Virtual Machine (PVM) way

Re: Shark vs Impala

2014-06-22 Thread Bertrand Dechoux
not a good idea to compete against the optimizer, it is of course also true for 'BigData'. Bertrand On Sun, Jun 22, 2014 at 1:32 PM, Flavio Pompermaier wrote: > Hi folks, > I was looking at the benchmark provided by Cloudera at > http://blog.cloudera.com/blog/2014/05/new-sql

Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
d version of it. Regards Bertrand Dechoux

Re: Pig 0.13, Spark, Spork

2014-07-07 Thread Bertrand Dechoux
t the Pig 0.13 release? Is the pluggable execution engine flexible enough in order to avoid having Spork as a fork of Pig? Pig + Spark + Fork = Spork :D As a (for now) external observer, I am glad to see competition in that space. It can only be good for the community in the end. Bertrand Dechoux

Re: Re: Pig 0.13, Spark, Spork

2014-07-08 Thread Bertrand Dechoux
' for Spark. @Zhang: Could you elaborate on your reference to Twitter? Bertrand Dechoux On Tue, Jul 8, 2014 at 4:04 AM, 张包峰 wrote: > Hi guys, previously I checked out the old "spork" and updated it to Hadoop > 2.0, Scala 2.10.3 and Spark 0.9.1, see github project of mine &g

Re: KMeans code is rubbish

2014-07-10 Thread Bertrand Dechoux
A picture is worth a thousand... Well, a picture with this dataset, what you are expecting and what you get, would help answer your initial question. Bertrand On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wrote: > Can someone please run the standard kMeans code on this input wit

Re: Does MLlib Naive Bayes implementation incorporates Laplase smoothing?

2014-07-10 Thread Bertrand Dechoux
A patch proposal on the apache JIRA for Spark? https://issues.apache.org/jira/browse/SPARK/ Bertrand On Thu, Jul 10, 2014 at 2:37 PM, Rahul Bhojwani wrote: > And also that there is a small bug in implementation. As I mentioned this > earlier also. > > This is my first time I am re

Re: How does Spark speculation prevent duplicated work?

2014-07-15 Thread Bertrand Dechoux
concept. As long as you apply functions with no side effects (i.e. the only effect is the returned result), you just need to ignore the results from additional attempts of the same task/operator. Bertrand Dechoux On Tue, Jul 15, 2014 at 9:34 PM, Andrew Ash wrote: > Hi Nan, > >
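
The point about side-effect-free functions can be illustrated with a minimal sketch in plain Python. This is not Spark's scheduler code: `collect_results`, the `(task_id, attempt_id, result)` tuples, and the `square` function are hypothetical names chosen for illustration. The sketch assumes that every attempt of a task computes the same pure function, so whichever attempt finishes first can be kept and later ones dropped.

```python
def collect_results(completions):
    """Keep the first completed result per task and ignore later attempts.

    `completions` is an iterable of (task_id, attempt_id, result) tuples,
    in the order the attempts finished (a hypothetical representation).
    """
    accepted = {}
    for task_id, attempt_id, result in completions:
        # A result from another attempt of an already finished task is
        # discarded; with a pure function it would be identical anyway.
        if task_id not in accepted:
            accepted[task_id] = result
    return accepted


def square(x):
    """Pure function: its only effect is its return value."""
    return x * x


# Task 1 was attempted twice (a speculative copy, attempt 1, finished first).
completions = [
    (0, 0, square(2)),
    (1, 1, square(3)),
    (1, 0, square(3)),
    (2, 0, square(4)),
]
results = collect_results(completions)
print(results)
```

If the function wrote to an external system instead of only returning a value, dropping the duplicate result would not undo the duplicate write, which is why the no-side-effect condition matters.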

Re: Large scale ranked recommendation

2014-07-18 Thread Bertrand Dechoux
And you might want to apply clustering first. It is likely that not every user and not every item is unique. Bertrand Dechoux On Fri, Jul 18, 2014 at 9:13 AM, Nick Pentreath wrote: > It is very true that making predictions in batch for all 1 million users > against the 10k items will be

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Bertrand Dechoux
> Is there any documentation from cloudera on how to run Spark apps on CDH Manager deployed Spark ? Asking the Cloudera community would be a good idea. http://community.cloudera.com/ In the end, only Cloudera will quickly fix issues with CDH... Bertrand Dechoux On Wed, Jul 23, 2014 at 9:28

Re: EOFException when I list all files in hdfs directory

2014-07-25 Thread Bertrand Dechoux
Well, anyone can open an account on apache jira and post a new ticket/enhancement/issue/bug... Bertrand Dechoux On Fri, Jul 25, 2014 at 4:07 PM, Sparky wrote: > Thanks for the suggestion. I can confirm that my problem is I have files > with zero bytes. It's a known bug and is

Rename filter() into keep(), remove() or take() ?

2014-02-27 Thread Bertrand Dechoux
. Has another name already been discussed? It could be keep() or remove(). But take() could also be reused: instead of a number, it would accept the filter function. Regards Bertrand

Re: Rename filter() into keep(), remove() or take() ?

2014-02-27 Thread Bertrand Dechoux
I understand the explanation but I had to try. However, the change could be made without breaking anything but that's another story. Regards Bertrand Bertrand Dechoux On Thu, Feb 27, 2014 at 2:05 PM, Nick Pentreath wrote: > filter comes from the Scala collection method "filter&q

Re: Rename filter() into keep(), remove() or take() ?

2014-02-28 Thread Bertrand Dechoux
out. I understand that the ROI is really likely not worth it. Thanks for the feedback Bertrand On Thu, Feb 27, 2014 at 3:38 PM, Nick Pentreath wrote: > Agree that filter is perhaps unintuitive. Though the Scala collections API > has "filter" and "filterNot" which
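
The naming question in this thread can be made concrete with a small sketch in plain Python. `keep` and `remove` are the hypothetical names proposed in the thread, not methods of any Spark API. `keep` behaves like the existing filter(), and `remove` like the Scala collections' filterNot mentioned in the quoted reply.

```python
def keep(predicate, items):
    """Return the items for which predicate is true (what filter() does)."""
    return [x for x in items if predicate(x)]


def remove(predicate, items):
    """Return the items for which predicate is false (like Scala's filterNot)."""
    return [x for x in items if not predicate(x)]


def is_even(n):
    return n % 2 == 0


numbers = [1, 2, 3, 4, 5, 6]
kept = keep(is_even, numbers)
removed = remove(is_even, numbers)
print(kept)
print(removed)
```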

Re: What is the difference between map and flatMap

2014-03-12 Thread Bertrand Dechoux
In a single phrase: if you understand what map() does and what a flatten() might do, then flatMap() is like a map() followed by a flatten(). As previously said, the concepts in themselves are not Spark specific. Bertrand On Wed, Mar 12, 2014 at 1:19 PM, Xuefeng Wu wrote: > It is the s
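
The "map() followed by a flatten()" description can be sketched in plain Python, without Spark. The helper names `flatten` and `flat_map` are hypothetical and only illustrate the concept. On an RDD the corresponding operations are `map` and `flatMap`.

```python
from itertools import chain


def flatten(nested):
    """Concatenate an iterable of iterables into one flat list (one level)."""
    return list(chain.from_iterable(nested))


def flat_map(func, items):
    """Apply func to every item (map), then flatten the results by one level."""
    return flatten(map(func, items))


lines = ["hello world", "spark is fun"]

# map() keeps one output element per input element: a list of word lists.
mapped = list(map(str.split, lines))

# flat_map() merges the per-line word lists into a single list of words.
flat = flat_map(str.split, lines)

print(mapped)
print(flat)
```

Here `mapped` holds two elements (one list of words per line), while `flat` holds the five words directly.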

Re: best practices for pushing an RDD into a database

2014-03-14 Thread Bertrand Dechoux
But you might run into performance issues. I don't know the subject for Spark, but with Hadoop MapReduce, Sqoop might be a solution for handling the database with care. Bertrand Dechoux On Fri, Mar 14, 2014 at 4:47 AM, Christopher Nguyen wrote: > Nicholas, > > > (Can we

Re: Hadoop Input Format - newAPIHadoopFile

2014-03-19 Thread Bertrand Dechoux
I don't know the Spark issue but the Hadoop context is clear. Old API -> org.apache.hadoop.mapred; new API -> org.apache.hadoop.mapreduce. You might only need to change your import. Regards Bertrand On Wed, Mar 19, 2014 at 11:29 AM, Pariksheet Barapatre wrote: > Hi, > >
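
The package distinction above can be sketched as a small helper in plain Python. The function `reader_method_for` is hypothetical and only inspects the class name. It assumes the input format lives in one of the two stock Hadoop packages, and returns the name of the matching SparkContext method (`hadoopFile` for the old API, `newAPIHadoopFile` for the new one). Third-party formats in other packages are not covered by this check.

```python
OLD_API_PACKAGE = "org.apache.hadoop.mapred."
NEW_API_PACKAGE = "org.apache.hadoop.mapreduce."


def reader_method_for(input_format_class):
    """Return the SparkContext method name matching the Hadoop API generation.

    Only the package prefix of the fully qualified class name is inspected.
    """
    if input_format_class.startswith(NEW_API_PACKAGE):
        return "newAPIHadoopFile"
    if input_format_class.startswith(OLD_API_PACKAGE):
        return "hadoopFile"
    raise ValueError("not a stock Hadoop input format: " + input_format_class)


old_style = reader_method_for("org.apache.hadoop.mapred.TextInputFormat")
new_style = reader_method_for("org.apache.hadoop.mapreduce.lib.input.TextInputFormat")
print(old_style)
print(new_style)
```

The trailing dot in each prefix matters: without it, `org.apache.hadoop.mapred` would also match every class in `org.apache.hadoop.mapreduce`.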

PySpark still reading only text?

2014-04-16 Thread Bertrand Dechoux
his subject? Thanks in advance Bertrand

Re: PySpark still reading only text?

2014-04-17 Thread Bertrand Dechoux
Spark SQL as of now. Does it also imply the reverse is true? That I can write data as hive data with spark SQL using results from a random (python) Spark application? Bertrand Dechoux On Thu, Apr 17, 2014 at 7:23 AM, Matei Zaharia wrote: > Yes, this JIRA would enable that. The Hive support

Re: PySpark still reading only text?

2014-04-17 Thread Bertrand Dechoux
According to the Spark SQL documentation, indeed, this project allows Python to be used while reading/writing tables, i.e. data which is not necessarily in text format. Thanks a lot! Bertrand Dechoux On Thu, Apr 17, 2014 at 10:06 AM, Bertrand Dechoux wrote: > Thanks for the JIRA reference. I rea

Re: PySpark still reading only text?

2014-04-22 Thread Bertrand Dechoux
Cool, thanks for the link. Bertrand Dechoux On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath wrote: > Also see: https://github.com/apache/spark/pull/455 > > This will add support for reading sequencefile and other inputformat in > PySpark, as long as the Writables are either simple

Re: Hadoop 2.3 Centralized Cache vs RDD

2014-05-16 Thread Bertrand Dechoux
dfs.namenode.path.based.cache.refresh.interval.ms might be too large? You might want to ask a broader mailing list. This is not related to Spark. Bertrand On Fri, May 16, 2014 at 2:56 AM, hequn cheng wrote: > I tried centralized cache step by step following the apache hadoop official > website, but it seems centralized cache d

Re: Real world

2014-05-16 Thread Bertrand Dechoux
http://spark-summit.org ? Bertrand On Thu, May 8, 2014 at 2:05 AM, Ian Ferreira wrote: > Folks, > > I keep getting questioned on real world experience of Spark as in mission > critical production deployments. Does anyone have some war stories to share > or know of reso