Last time I checked, Camus doesn't support storing data as parquet, which
is a deal breaker for me. Otherwise it works well for my Kafka topics with
low data volume.
I am currently using Spark Streaming to ingest data, generate semi-realtime
stats, publish them to a dashboard, and dump the full dataset.
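For context, a rough sketch of that kind of dump path, assuming the 1.0-era APIs (a receiver-based KafkaUtils stream plus Spark SQL's saveAsParquetFile); the Event case class, ZooKeeper address, topic, group, and output path are all hypothetical:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

case class Event(key: String, value: String)   // hypothetical record shape

object KafkaParquetDump {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("kafka-parquet-dump")
    val ssc = new StreamingContext(conf, Seconds(60))
    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.createSchemaRDD          // implicit RDD[Event] -> SchemaRDD

    // Receiver-based Kafka stream; one receiver for the hypothetical topic.
    val stream = KafkaUtils.createStream(ssc, "zk1:2181", "dump-group", Map("mytopic" -> 1))

    stream.foreachRDD { (rdd, time) =>
      // One parquet directory per batch; dashboard stats would hang off the same stream.
      rdd.map { case (k, v) => Event(k, v) }
         .saveAsParquetFile(s"hdfs://test/dumps/batch-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}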
I meant I'm not sure how to do variance in one shot :-)
With the mean in hand, you can obviously broadcast the variable and do
another map/reduce to calculate variance per key.
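For the record, a minimal sketch of the one-shot version, assuming the same hypothetical numeric .foo field as in the snippet quoted below: accumulate count, sum, and sum of squares per key in a single reduceByKey pass, then derive the mean and (population) variance from those three numbers.

// (count, sum, sum of squares) per key, in one pass
val stats = rdd
  .map(t => (t._1, (1L, t._2.foo.toDouble, t._2.foo.toDouble * t._2.foo.toDouble)))
  .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
  .mapValues { case (n, s, q) =>
    val mean = s / n
    (mean, q / n - mean * mean)   // (mean, population variance); not the most numerically stable form
  }

stats.collect()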
On Fri, Aug 1, 2014 at 4:39 PM, Xu (Simon) Chen wrote:
> val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKe
val res = rdd.map(t => (t._1, (t._2.foo, 1))).reduceByKey((x, y) =>
(x._1 + y._1, x._2 + y._2)).collect
This gives you a list of (key, (tot, count)), from which you can easily
calculate the mean. Not sure about variance.
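(As a tiny follow-up sketch, assuming foo is numeric: the per-key mean falls out of each (tot, count) pair directly.)

val means = res.map { case (k, (tot, cnt)) => (k, tot.toDouble / cnt) }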
On Fri, Aug 1, 2014 at 2:55 PM, kriskalish wrote:
> I have what seems like a relati
Hi Roberto,
Ultimately, the info you need is set here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69
Being a Spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as
HadoopRDDWithEnv, which takes in an additional parameter (varnam
I am pretty happy with using pyspark with the IPython notebook. The only
issue is that I need to look at the console output or the Spark UI to track
task progress. I wonder if anyone has thought of, or better yet written,
something to display progress bars on the same page when I evaluate a cell in ipynb?
I know
Hi all,
I implemented a transformation on HDFS files with Spark. I first tested it in
spark-shell (with YARN), then implemented essentially the same logic in a
Spark program (Scala), built a jar file, and used spark-submit to execute it
on my YARN cluster. The weird thing is that the spark-submit approach is
>> Not a stupid question! I would like to be able to do this. For now, you
>> might try writing the data to tachyon <http://tachyon-project.org/>
>> instead of HDFS. This is untested though, please report any issues you run
>> into.
>>
>> Mich
This might be a stupid question... but it seems that saveAsParquetFile()
writes everything back to HDFS. I am wondering if it is possible to cache
parquet-format intermediate results in memory, thereby making Spark SQL
queries faster.
Thanks.
-Simon
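One option for keeping such an intermediate result in memory instead of on HDFS, sketched against the 1.0/1.1-era SQL API (the table name and path are hypothetical; on 1.0 the call is registerAsTable rather than registerTempTable):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read the intermediate result once, register it, and pull it into
// Spark SQL's in-memory columnar cache instead of re-reading HDFS.
val intermediate = sqlContext.parquetFile("hdfs://test/intermediate.parquet")
intermediate.registerTempTable("intermediate")
sqlContext.cacheTable("intermediate")

// Later queries scan the cached columnar data.
sqlContext.sql("SELECT COUNT(*) FROM intermediate").collect()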
the
> cache space in the spark.
>
> -Sandy
>
> > On Jun 5, 2014, at 9:44 AM, "Xu (Simon) Chen" wrote:
> >
> > I am slightly confused about the "--executor-memory" setting. My yarn
> cluster has a maximum container memory of 8192MB.
> &g
tence level is MEMORY_ONLY so
> that setting will have no impact.
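In other words (a sketch, assuming the standard configuration keys): spark.rdd.compress only applies to serialized partitions, so the cache has to use a serialized storage level such as MEMORY_ONLY_SER before the setting can change anything.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("rdd-compress-test").set("spark.rdd.compress", "true")
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs://test/path/*")    // hypothetical input
rdd.persist(StorageLevel.MEMORY_ONLY_SER)      // serialized partitions, so compression can apply
rdd.count()                                    // materialize the cache, then check "Size in Memory" in the UI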
>
>
> On Thu, Jun 5, 2014 at 4:41 PM, Xu (Simon) Chen wrote:
>
>> I have a working set larger than available memory, thus I am hoping to
>> turn on rdd compression so that I can store more in-memory. Stran
I have a working set larger than available memory, thus I am hoping to turn
on RDD compression so that I can store more in memory. Strangely, it made no
difference. The number of cached partitions, fraction cached, and size in
memory remain the same. Any ideas?
I confirmed that RDD compression wasn
I am slightly confused about the "--executor-memory" setting. My yarn
cluster has a maximum container memory of 8192MB.
When I specify "--executor-memory 8G" in my spark-shell, no container can
be started at all. It only works when I lower the executor memory to 7G.
But then, on yarn, I see 2 cont
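For what it's worth, my understanding of the arithmetic (hedged, since the overhead handling has moved around between releases): YARN has to fit the executor heap plus a per-executor overhead (a fixed 384 MB in the 1.0-era YARN code, exposed in later releases as spark.yarn.executor.memoryOverhead) into one container, so 8 G plus overhead blows past the 8192 MB cap while 7 G fits. A hypothetical configuration sketch:

import org.apache.spark.{SparkConf, SparkContext}

// Heap + per-executor overhead must fit under the 8192 MB container cap.
val conf = new SparkConf().set("spark.executor.memory", "7g")
// Where the overhead is exposed as a config key, it can be tuned explicitly:
// conf.set("spark.yarn.executor.memoryOverhead", "512")   // in MB
val sc = new SparkContext(conf)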
N/M... I wrote a HadoopRDD subclass and appended an env field of the
HadoopPartition to the value in the compute function. It worked pretty well.
Thanks!
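An alternative sketch that avoids the subclass, assuming a Spark version that exposes HadoopRDD.mapPartitionsWithInputSplit (the input path is hypothetical): grab the FileSplit backing each partition and prepend its path to every record.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://test/path/*")

val withFileName = raw.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split, iter) =>
    val file = split.asInstanceOf[FileSplit].getPath.toString
    // .toString also copies out of the reused Text object
    iter.map { case (_, line) => (file, line.toString) }
  }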
On Jun 4, 2014 12:22 AM, "Xu (Simon) Chen" wrote:
> I don't quite get it..
>
> mapPartitionsWithIndex takes a function th
Maybe your two workers have different assembly jar files?
I just ran into a similar problem where my spark-shell was using a different
jar file than my workers and got really confusing results.
On Jun 4, 2014 8:33 AM, "Ajay Srivastava" wrote:
> Hi,
>
> I am doing join of two RDDs which giving differ
spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L456
>
>
> On Thu, May 29, 2014 at 7:49 PM, Xu (Simon) Chen
> wrote:
>
>> Hello,
>>
>> A quick question about using spark to parse text-format CSV files stored
>> on hdfs.
>>
>&
tribution.sh#L102
>
> Any luck if you use JDK 6 to compile?
>
>
> On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen
> wrote:
> > OK, my colleague found this:
> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
> >
> > And my jar file has
OK, my colleague found this:
https://mail.python.org/pipermail/python-list/2014-May/671353.html
And my jar file has 70011 files. Fantastic..
On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen wrote:
> I asked several people, no one seems to believe that we can do this:
> $ PYTHONPATH=/p
ou
> used when building? Also which CDH-5 version are you building against?
>
> On Mon, Jun 2, 2014 at 8:11 AM, Xu (Simon) Chen wrote:
> > OK, rebuilding the assembly jar file with cdh5 works now...
> > Thanks..
> >
> > -Simon
> >
> >
> > On Sun, J
n
> application on YARN in general. In my experience, the steps outlined there
> are quite useful.
>
> Let me know if you get it working (or not).
>
> Cheers,
> Andrew
>
>
>
> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen :
>
> Hi folks,
>>
>> I have
5.0.1 -DskipTests clean
package"
Anything that I might have missed?
Thanks.
-Simon
On Mon, Jun 2, 2014 at 12:02 PM, Xu (Simon) Chen wrote:
> 1) yes, that sc.parallelize(range(10)).count() has the same error.
>
> 2) the files seem to be correct
>
> 3) I have trouble at th
hich has more detailed information about how to debug running an
> application on YARN in general. In my experience, the steps outlined there
> are quite useful.
>
> Let me know if you get it working (or not).
>
> Cheers,
> Andrew
>
>
>
> 2014-06-02 17:24 GMT+02:00 Xu (
Hi folks,
I have a weird problem when using pyspark with yarn. I started ipython as
follows:
IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors
4 --executor-memory 4G
When I create a notebook, I can see workers being created and indeed I see
spark UI running on my client
OK, rebuilding the assembly jar file with cdh5 works now...
Thanks..
-Simon
On Sun, Jun 1, 2014 at 9:37 PM, Xu (Simon) Chen wrote:
> That helped a bit... Now I have a different failure: the startup process
> is stuck in an infinite loop, outputting the following message:
>
> 14/06
" instead of using two named
> resource managers? I wonder if somehow the YARN client can't detect
> this multi-master set-up.
>
> On Sun, Jun 1, 2014 at 12:49 PM, Xu (Simon) Chen
> wrote:
> > Note that everything works fine in spark 0.9, which is packaged in CDH5:
>
the classpath is being
> set-up correctly.
>
> - Patrick
>
> On Sat, May 31, 2014 at 5:51 PM, Xu (Simon) Chen
> wrote:
> > Hi all,
> >
> > I tried a couple ways, but couldn't get it to work..
> >
> > The following seems to be what the
Hi all,
I tried a couple of ways but couldn't get it to work.
The following seems to be what the online documentation (
http://spark.apache.org/docs/latest/running-on-yarn.html) suggests:
SPARK_JAR=hdfs://test/user/spark/share/lib/spark-assembly-1.0.0-hadoop2.2.0.jar
YARN_CONF_DIR=/opt/hadoop/conf
Hello,
A quick question about using spark to parse text-format CSV files stored on
hdfs.
I have something very simple:
sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p =>
(XXX, p(0), p(2)))
Here, I want to replace XXX with a string, which is the current csv
filename for the l