Spark and Speech Recognition

2015-07-28 Thread Peter Wolf
Hello, I am writing a Spark application to use speech recognition to transcribe a very large number of recordings. I need some help configuring Spark. My app is basically a transformation with no side effects: recording URL --> transcript. The input is a huge file with one URL per line, and the…
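A minimal sketch of the job described above, assuming a hypothetical transcribe function standing in for the actual speech engine and placeholder HDFS paths:

    import org.apache.spark.{SparkConf, SparkContext}

    object Transcriber {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("Transcriber"))

        // Placeholder: stands in for the real speech-recognition call.
        val transcribe: String => String = url => s"<transcript of $url>"

        val urls = sc.textFile("hdfs:///audio/urls.txt")          // one URL per line
        val transcripts = urls.map(url => (url, transcribe(url))) // pure transformation, no side effects
        transcripts.saveAsTextFile("hdfs:///audio/transcripts")
        sc.stop()
      }
    }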

Re: Spark and Speech Recognition

2015-07-30 Thread Peter Wolf
> …this?
>
>     val data = sc.textFile("/sigmoid/audio/data/", 24)
>       .foreachPartition(urls => speachRecognizer(urls))
>
> Let 24 be the total number of cores that you have on all the workers.
>
> Thanks
> Best Regards
>
> On Wed, Jul 29, 2015 at 6:50 AM, Pe…
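One caveat with the snippet quoted above: foreachPartition returns Unit, so the val data binding discards any output. A mapPartitions variant keeps the transcripts as an RDD; a sketch, assuming a hypothetical speechRecognizer that turns an iterator of URLs into an iterator of transcripts:

    // Sketch only: speechRecognizer(urls: Iterator[String]): Iterator[String] is assumed.
    val transcripts = sc.textFile("/sigmoid/audio/data/", 24)
      .mapPartitions(urls => speechRecognizer(urls))
    transcripts.saveAsTextFile("/sigmoid/audio/transcripts")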

Spark as Relational Database

2014-10-25 Thread Peter Wolf
Hello all, We are considering Spark for our organization. It is obviously a superb platform for processing massive amounts of data... how about retrieving it? We are currently storing our data in a relational database in a star schema. Retrieving our data requires doing many complicated joins a…
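For what it's worth, joins of this kind are expressible directly in Spark SQL; a sketch with invented paths, table names, and columns:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Hypothetical fact and dimension tables; paths and schemas are illustrative.
    sqlContext.jsonFile("hdfs:///warehouse/sales.json").registerTempTable("sales")
    sqlContext.jsonFile("hdfs:///warehouse/customers.json").registerTempTable("customers")

    val report = sqlContext.sql(
      """SELECT c.region, SUM(s.amount) AS revenue
        |FROM sales s JOIN customers c ON s.customer_id = c.id
        |GROUP BY c.region""".stripMargin)
    report.collect().foreach(println)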

Re: Spark as Relational Database

2014-10-26 Thread Peter Wolf
> …process the data in Spark and then store it in the relational database of your choice.
>
> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf wrote:
>> Hello all,
>>
>> We are considering Spark for our organization. It is obviously a superb…
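At the time of this thread (around Spark 1.1) there was no built-in JDBC writer, so the usual pattern for the "store it back" step was foreachPartition with plain JDBC; a sketch, with connection string, credentials, and table all placeholders:

    import java.sql.DriverManager
    import org.apache.spark.rdd.RDD

    // Sketch only: pushes computed (name, value) pairs into a relational table.
    def saveToWarehouse(results: RDD[(String, Double)]): Unit = {
      results.foreachPartition { rows =>
        // One connection per partition, not per row.
        val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/warehouse", "user", "secret")
        val stmt = conn.prepareStatement("INSERT INTO metrics (name, value) VALUES (?, ?)")
        try {
          rows.foreach { case (name, value) =>
            stmt.setString(1, name)
            stmt.setDouble(2, value)
            stmt.executeUpdate()
          }
        } finally {
          stmt.close()
          conn.close()
        }
      }
    }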

Re: Spark as Relational Database

2014-10-26 Thread Peter Wolf
> …relational database to analyze. But in the long run, I would recommend using a more purpose-built, huge-storage database such as Cassandra. If your data is very static, you could also just store it in files.
> On Oct 26, 2014 9:19 AM, "Peter Wolf" wrote:
>> My under…

Re: Spark as Relational Database

2014-10-27 Thread Peter Wolf
> http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
>
> There are many other datastores that can do a better job at storing your events. You can process your data and then store them in a relational…

Re: Spark as Relational Database

2014-10-27 Thread Peter Wolf
> …perform it using the raw Spark RDD API
> <http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.rdd.RDD>,
> but it's often the case that the in-memory columnar caching of Spark SQL is faster and more space-efficient.
>
> On Mon, Oct 27, 2014 at 6:27 AM, Pete…
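The columnar caching mentioned above is a one-liner; continuing the hypothetical sqlContext and "sales" table from the earlier sketch:

    // cacheTable keeps the table in Spark SQL's compressed, columnar in-memory format;
    // subsequent queries against it are served from that cache.
    sqlContext.cacheTable("sales")
    sqlContext.sql("SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id").collect()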