Can you not use sc.wholeTextFiles() with a custom parser or a regex to
extract the TransactionIDs?
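Something along these lines, as an untested sketch in spark-shell (the log
path and the TransactionID pattern are guesses, since the file format is not
shown in the original message):
// Each element of wholeTextFiles is (fileName, fileContent), so a regex can span lines.
val files = sc.wholeTextFiles("hdfs:///logs/*.log")
// Hypothetical pattern -- adjust it to the actual log format.
val txnPattern = """TransactionID[:=]\s*(\w+)""".r
val txnIds = files.flatMap { case (_, content) =>
  txnPattern.findAllMatchIn(content).map(_.group(1))
}
txnIds.distinct().collect().foreach(println)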
Thanks
Best Regards
On Sat, Jul 11, 2015 at 8:18 AM, ssbiox wrote:
> Hello,
>
> I have a very specific question on how to do a search between particular
> lines of log file. I did some resea
Did you try setting the HADOOP_CONF_DIR?
Thanks
Best Regards
On Sat, Jul 11, 2015 at 3:17 AM, maxdml wrote:
> Also, it's worth noting that I'm using the prebuilt version for hadoop 2.4
> and higher from the official website.
Yes, that is correct. You can use this boilerplate to avoid spark-submit.
//The configurations
val sconf = new SparkConf()
.setMaster("spark://spark-ak-master:7077")
.setAppName("SigmoidApp")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.s
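The snippet above is cut off; a complete minimal version of that boilerplate
might look like the following (the final settings, the jar path, and the
SparkContext creation are assumptions, since the original message is truncated):
import org.apache.spark.{SparkConf, SparkContext}
//The configurations
val sconf = new SparkConf()
  .setMaster("spark://spark-ak-master:7077")
  .setAppName("SigmoidApp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//Create the context directly instead of going through spark-submit
val sc = new SparkContext(sconf)
//Ship the application jar to the workers (path is a placeholder)
sc.addJar("/path/to/sigmoidapp.jar")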
I have to do the following tasks on a dataset using Apache Spark with Scala as
the programming language:
Read the dataset from HDFS. A few sample lines look like this:
deviceid,bytes,eventdate
15590657,246620,20150630
14066921,1907,20150621
14066921,1906,20150626
6522013,2349,20150626
6522013,252
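A rough sketch of the read-and-parse step in Scala (the HDFS path is a
placeholder, and since the remaining tasks are cut off above, the aggregation
at the end is only an illustration):
case class Record(deviceId: Long, bytes: Long, eventDate: String)
val lines = sc.textFile("hdfs:///data/device_bytes.csv")
val records = lines
  .filter(line => !line.startsWith("deviceid"))   // drop the header line
  .map { line =>
    val Array(deviceId, bytes, eventDate) = line.split(",")
    Record(deviceId.toLong, bytes.toLong, eventDate)
  }
// Illustration only: total bytes per device.
val bytesPerDevice = records.map(r => (r.deviceId, r.bytes)).reduceByKey(_ + _)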
Honestly, you are addressing this the wrong way: you do not seem to have a
business case for changing, so why do you want to switch?
On Sat, Jul 11, 2015 at 3:28 AM, Mohammed Guller wrote:
> Hi Ravi,
>
> First, Neither Spark nor Spark SQL is a database. Both are compute
> engines, which need to be pa
On Sat, Jul 11, 2015 at 2:53 PM, Roman Sokolov wrote:
> Hello. Had the same question. What if I need to store 4-6 Tb and do
> queries? Can't find any clue in documentation.
> On Jul 11, 2015 at 03:28, "Mohammed Guller" wrote:
>
>> Hi Ravi,
>>
>> First, Neither Spark nor Spark SQL is a database. Bo
Note that if you use multi-part upload, each part becomes one block, which
allows for multiple concurrent readers. One would typically use a fixed part
size that aligns with Spark's default HDFS block size (64 MB, I think) so
that reads line up with block boundaries.
On Sat, Jul 11, 2015 at 11:14 AM, Steve
seek() is very, very expensive on s3, even for short forward seeks. If your
code does a lot of them, it will kill performance. Forward seeks are better
in s3a (which, as of Hadoop 2.3, is now safe to use) and in the s3 client
that Amazon includes in EMR, but it's still sluggish.
The other killers
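For what it's worth, switching to s3a is mostly a matter of the URL scheme
plus credentials in the Hadoop configuration; a rough sketch (the bucket and
keys are placeholders, and the hadoop-aws jar must be on the classpath):
// Configure s3a credentials on the SparkContext's Hadoop configuration.
sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")
// Read through the s3a:// scheme instead of s3n:// or s3://.
val logs = sc.textFile("s3a://my-bucket/logs/*.gz")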
Looks like reduceByKey() should work here.
Cheers
On Sat, Jul 11, 2015 at 11:02 AM, leonida.gianfagna <
leonida.gianfa...@gmail.com> wrote:
> Thanks a lot oubrik,
>
> I got your point; my thinking is that sum() should already be a built-in
> function for iterators in Python.
> Anyway I trie
Thanks a lot oubrik,
I got your point; my thinking is that sum() should already be a built-in
function for iterators in Python.
Anyway I tried your approach
def mysum(iter):
    count = sum = 0
    for item in iter:
        count += 1
        sum += item
    return sum
wordCountsGrouped = wor
Hi All,
I am facing this issue in my production environment.
My worker dies by throwing this exception.
But I see that space is available on all the partitions of my disk.
I did NOT see any abrupt increase in disk IO, which might have choked the
executor writing to the stderr file.
But still m
Hi Roman,
Yes, Spark SQL will be a better solution than a standard RDBMS for querying
4-6 TB of data. You can pair Spark SQL with HDFS+Parquet to build a powerful
analytics solution.
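As a rough sketch of that pairing (Spark 1.4-era API; the path, table, and
query are placeholders):
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
// Store the data as Parquet on HDFS once, then query it with Spark SQL.
val df = sqlContext.read.parquet("hdfs:///warehouse/some_table.parquet")
df.registerTempTable("some_table")
sqlContext.sql(
  "SELECT some_column, COUNT(*) AS cnt FROM some_table GROUP BY some_column").show()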
Mohammed
From: David Mitchell [mailto:jdavidmitch...@gmail.com]
Sent: Saturday, July 11, 2015 7:10 AM
To: R
Hi,
I've finally fixed this. The problem was that I wasn't providing a type for
the DStream in ssc.actorStream
/* with this inputDStream : ReceiverInputDStream[Nothing] and we get
SparkDriverExecutionException: Execution error
* Caused by: java.lang.ArrayStoreException: [Ljava.lang.Object;
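The working call looks roughly like this (the actor class and receiver name
are made up; the point is the explicit element type on actorStream, assuming
ssc is the StreamingContext):
import akka.actor.{Actor, Props}
import org.apache.spark.streaming.receiver.ActorHelper
// Hypothetical receiver actor that pushes incoming strings into Spark.
class MyReceiverActor extends Actor with ActorHelper {
  def receive = {
    case s: String => store(s)
  }
}
// The explicit type gives ReceiverInputDStream[String] instead of
// ReceiverInputDStream[Nothing].
val lines = ssc.actorStream[String](Props[MyReceiverActor], "MyReceiver")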
You can certainly query over 4 TB of data with Spark. However, you will
get an answer in minutes or hours, not in milliseconds or seconds. OLTP
databases are used for web applications, and typically return responses in
milliseconds. Analytic databases tend to operate on large data sets, and
retu
Reducing the number of partitions may have an impact on memory consumption,
especially if there is an uneven distribution of the key used in groupBy.
It depends on your dataset.
On Sat, Jul 11, 2015 at 5:06 AM, Umesh Kacha wrote:
> Hi Sriknath thanks much it worked when I set spark.sql.shuffle.partitions=10
> I thin
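For reference, the setting can be applied in a few ways (sketch, assuming an
existing sqlContext; the value 10 is just the number mentioned above):
import org.apache.spark.SparkConf
// At startup, on the SparkConf:
val conf = new SparkConf().set("spark.sql.shuffle.partitions", "10")
// Or at runtime, on an existing SQLContext/HiveContext:
sqlContext.setConf("spark.sql.shuffle.partitions", "10")
// Or via SQL:
sqlContext.sql("SET spark.sql.shuffle.partitions=10")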
Hello. Had the same question. What if I need to store 4-6 Tb and do
queries? Can't find any clue in documentation.
On Jul 11, 2015 at 03:28, "Mohammed Guller" wrote:
> Hi Ravi,
>
> First, Neither Spark nor Spark SQL is a database. Both are compute
> engines, which need to be paired with a storage sy
What is your business case for the move?
On Fri, Jul 10, 2015 at 12:49 PM, Ravisankar Mani wrote:
> Hi everyone,
>
> I have planned to move from MS SQL Server to Spark. I am using around 50,000
> to 100,000 (1 lakh) records.
> Spark performance is slow when compared to MS SQL Server.
>
> What is the best da
Suppose I have an RDD with 10 tuples and a cluster with 100 cores (standalone
mode); by default, how will the partitioning be done?
I do not get how it will divide the tuple set (RDD) across 100 cores (by
default).
Mentioned in the documentation:
*spark.default.parallelism*
For distributed shuffle operati
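A quick way to see and control this in spark-shell, as a sketch (the counts
follow the numbers in the question; in standalone mode spark.default.parallelism
defaults to the total number of executor cores):
// With no explicit count, parallelize uses spark.default.parallelism,
// so with 100 cores most of the 100 partitions hold no data.
val rdd = sc.parallelize(1 to 10)
println(rdd.partitions.length)     // likely 100
// An explicit partition count overrides the default.
val rdd10 = sc.parallelize(1 to 10, 10)
println(rdd10.partitions.length)   // 10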
1. Spark Streaming 1.3 creates as many RDD partitions as there are Kafka
partitions in the topic. Say I have 300 partitions in the topic and 10
executors, each with 3 cores: does that mean only 10*3 = 30 partitions are
processed at a time, then the next 30, and so on, since executors launch one
task per RDD partition
On 10 Jul 2015, at 23:10, algermissen1971 wrote:
> Hi,
>
> initially today, when moving my streaming application to the cluster for the
> first time, I ran into the newbie errors of using a local file system for
> checkpointing and of the RDD partition count differences (see exception below).
>
> Having
Hi Srikanth, thanks very much, it worked when I set
spark.sql.shuffle.partitions=10.
Will reducing shuffle partitions slow down my hiveContext group-by query, or
will it not slow it down? Please guide.
On Sat, Jul 11, 2015 at 7:41 AM, Srikanth wrote:
> Is there a join involved in your sql?
> Have a lo