Hello, I am writing a Spark application to use speech recognition to
transcribe a very large number of recordings.
I need some help configuring Spark.
My app is basically a transformation with no side effects: recording URL
--> transcript. The input is a huge file with one URL per line, and the
>
> val data = sc.textFile("/sigmoid/audio/data/", 24)
> data.foreachPartition(urls => speechRecognizer(urls))
>
> Set 24 to the total number of cores you have across all the workers.
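The suggestion above can be expanded into a minimal sketch. `speechRecognizer` is a placeholder from this thread, not a real API, and the output path is made up; `mapPartitions` is used instead of `foreachPartition` so the transcripts are kept rather than discarded (`foreachPartition` returns `Unit`).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Transcribe {
  // Placeholder from the thread: turns a recording URL into a transcript.
  def speechRecognizer(url: String): String = ???

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("transcribe"))

    // One URL per line; 24 ~ total cores across all workers.
    val urls = sc.textFile("/sigmoid/audio/data/", 24)

    // mapPartitions returns the results; foreachPartition would discard them.
    val transcripts = urls.mapPartitions(_.map(u => s"$u\t${speechRecognizer(u)}"))
    transcripts.saveAsTextFile("/sigmoid/audio/transcripts")
  }
}
```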
>
> Thanks
> Best Regards
>
> On Wed, Jul 29, 2015 at 6:50 AM, Pe
Hello all,
We are considering Spark for our organization. It is obviously a superb
platform for processing massive amounts of data... how about retrieving it?
We are currently storing our data in a relational database in a star
schema. Retrieving our data requires doing many complicated joins a
process the data in Spark and then store it in the
> relational database of your choice.
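One common way to do the "store it in the relational database of your choice" step, sketched under assumptions (the connection URL, credentials, and `counts` table are all made up), is to open one JDBC connection per partition rather than per record:

```scala
import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}

object ToJdbc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("to-jdbc"))

    // Hypothetical results computed in Spark: (kind, count) pairs.
    val results = sc.parallelize(Seq(("click", 10L), ("view", 25L)))

    // One connection per partition, not per record.
    results.foreachPartition { part =>
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://dbhost/analytics", "user", "pass")
      val stmt = conn.prepareStatement(
        "INSERT INTO counts(kind, n) VALUES (?, ?)")
      part.foreach { case (kind, n) =>
        stmt.setString(1, kind)
        stmt.setLong(2, n)
        stmt.executeUpdate()
      }
      stmt.close()
      conn.close()
    }
  }
}
```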
>
> On Sat, Oct 25, 2014 at 11:18 PM, Peter Wolf wrote:
>
>> Hello all,
>>
>> We are considering Spark for our organization. It is obviously a superb
>
nal
> database to analyze. But in the long run, I would recommend using a more
> purpose-built, large-scale storage database such as Cassandra. If your data
> is very static, you could also just store it in files.
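If the data really is static, the "just store it in files" option is straightforward; a minimal sketch (the paths and the comma-separated record layout are assumptions, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EventCounts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("event-counts"))

    // Assumed layout: "userId,eventType,timestamp" per line.
    val events = sc.textFile("/data/events")

    // Aggregate in Spark instead of via star-schema joins.
    val countsByType = events
      .map(_.split(",")(1))
      .map(t => (t, 1L))
      .reduceByKey(_ + _)

    // For static data, plain files are often enough.
    countsByType.saveAsTextFile("/data/event-counts")
  }
}
```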
> On Oct 26, 2014 9:19 AM, "Peter Wolf" wrote:
>
>> My under
http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
> >
> There are many other datastores that can do a better job of storing your
> events. You can process your data and then store it in a relational
>
perform it using the raw Spark RDD API
> <http://spark.apache.org/docs/1.1.0/api/scala/index.html#org.apache.spark.rdd.RDD>,
> but it's often the case that the in-memory columnar caching of Spark SQL is
> faster and more space-efficient.
>
> On Mon, Oct 27, 2014 at 6:27 AM, Pete
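The columnar-caching point above can be sketched against the Spark 1.1-era `SQLContext` API (the table name, path, and `Event` schema are assumptions for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical schema for the cached table.
case class Event(userId: Int, eventType: String)

object SqlCache {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sql-cache"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD

    val events = sc.textFile("/data/events")
      .map(_.split(","))
      .map(a => Event(a(0).toInt, a(1)))

    // Register and cache with Spark SQL's in-memory columnar store.
    events.registerTempTable("events")
    sqlContext.cacheTable("events")

    val counts = sqlContext.sql(
      "SELECT eventType, COUNT(*) FROM events GROUP BY eventType")
    counts.collect().foreach(println)
  }
}
```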