Probably a noob question, but I am trying to create a Hive table using Spark SQL.
Here is what I am trying to do:
hc = HiveContext(sc)
hdf = hc.parquetFile(output_path)
data_types = hdf.dtypes
schema = "(" + " ,".join(map(lambda x: x[0] + " " + x[1], data_types)) +")"
hc.sql(" CREATE TABLE IF
For a local machine, I don't think there is anything to install.. Just unzip,
run $SPARK_DIR/bin/spark-shell, and that will open up a REPL...
On Tue, Feb 10, 2015 at 3:25 PM, King sami wrote:
> Hi,
>
> I'm new to Spark. I want to install it on my local machine (Ubuntu 12.04).
> Could you help me plea
I would be interested too.
On Tue, Feb 10, 2015 at 9:41 AM, Kartik Mehta
wrote:
> Hi Sami and fellow Spark friends,
>
> I too am looking for joint learning, online.
>
> I have set up Spark but need to run it on multiple nodes on my home server.
> Could we form a group and do group learning?
>
> Thanks,
>
I think you have to run that using $SPARK_HOME/bin/pyspark /path/to/pi.py
instead of a plain "python pi.py".
On Mon, Feb 9, 2015 at 11:22 PM, Ashish Kumar
wrote:
> *Command:*
> sudo python ./examples/src/main/python/pi.py
>
> *Error:*
> Traceback (most recent call last):
> File "./examples/src/m
Hi,
Probably a naive question.. but I am creating a Spark cluster on EC2
using the ec2 scripts that ship with Spark..
Is there a master param I need to set, i.e.
./bin/pyspark --master [ ] ?
I don't yet fully understand the EC2 concepts, so I just wanted to confirm
this.
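To make the question concrete, this is roughly what I have in mind, either
on the command line via --master or in code (the hostname below is just a
placeholder):

# Sketch; "<ec2-master-hostname>" is a placeholder, 7077 is the default
# standalone master port.
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .setMaster("spark://<ec2-master-hostname>:7077")
        .setAppName("ec2-test"))
sc = SparkContext(conf=conf)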
Thanks
--
Mohit
"When you want succ
Hi,
I might be asking something very trivial, but what's the recommended way of
using third-party libraries?
I am using "tables" (PyTables) to read an HDF5-format file..
And here is the error trace:
print rdd.take(2)
File "/tmp/spark/python/pyspark/rdd.py", line , in take
res = self.context.runJob(s
Perhaps it's just me, but the "lag" function isn't familiar to me..
But have you tried configuring Spark appropriately?
http://spark.apache.org/docs/latest/configuration.html
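For example, the properties I would start with (just a sketch; the values
are arbitrary and need tuning for your cluster):

# Sketch; values are arbitrary placeholders.
from pyspark import SparkConf, SparkContext
conf = (SparkConf()
        .set("spark.executor.memory", "4g")
        .set("spark.default.parallelism", "100"))
sc = SparkContext(conf=conf)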
On Tue, Oct 14, 2014 at 5:37 PM, Manas Kar
wrote:
> Hi,
> I have an RDD containing Vehicle Number, timestamp, Position.
Hi,
I am using the pyspark shell and am trying to create an RDD from a numpy matrix:
rdd = sc.parallelize(matrix)
I am getting the following error:
JVMDUMP039I Processing dump event "systhrow", detail
"java/lang/OutOfMemoryError" at 2014/09/10 22:41:44 - please wait.
JVMDUMP032I JVM requested Heap dump
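In case it helps to reproduce, a sketch of what I am doing, plus the knobs
I plan to try next (the matrix size and partition count are placeholders):

# Sketch; sizes and the partition count are placeholders.
import numpy as np
matrix = np.random.rand(10000, 1000)

# Parallelizing row by row with an explicit number of slices, instead of
# handing the whole matrix over in one call:
rdd = sc.parallelize(matrix.tolist(), numSlices=200)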
Hi,
I was wondering if the Personalized PageRank algorithm is implemented in
GraphX. If the talks and presentations are to be believed (
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/02/graphx@strata2014_final.pdf)
it is.. but I can't find the algorithm code (
https://github.com/amplab/graphx/tree
Building on what Davies Liu said,
How about something like:
def indexing(splitIndex, iterator, offset_lists):
    count = 0
    offset = sum(offset_lists[:splitIndex]) if splitIndex else 0
    indexed = []
    for i, e in enumerate(iterator):
        index = count + offset + i
        for j, ele in enumerate(e
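For comparison, a minimal sketch of the same idea wired through
mapPartitionsWithIndex, assuming the per-partition element counts have
already been collected into a list called counts_per_partition:

# Sketch; counts_per_partition is assumed to be a Python list, e.g. from
# rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
def index_partition(split_index, iterator):
    offset = sum(counts_per_partition[:split_index])
    for i, e in enumerate(iterator):
        yield (offset + i, e)

indexed_rdd = rdd.mapPartitionsWithIndex(index_partition)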
p of
> that?
>
>
> Appreciate your help
>
> -Sathish
>
>
> On Wed, Aug 6, 2014 at 6:22 PM, Mohit Singh wrote:
>
>> My naive set up..
>> Adding
>> os.environ['SPARK_HOME'] = "/path/to/spark"
>> sys.path.append("/path/to/sp
One possible straightforward explanation might be that your solution(s) are
getting stuck in local minima, and depending on your weight initialization
you are getting different parameters?
Maybe use the same initial weights for both runs...
or
I would probably test the execution with a synthetic dataset.
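For the synthetic check, something as simple as this is what I had in mind
(purely a sketch; the shapes, seed, and weights are arbitrary):

# Purely a sketch: a dataset with a known linear relationship and a fixed
# seed, so that two runs start from identical data.
import numpy as np
np.random.seed(42)
X = np.random.rand(1000, 5)
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X.dot(true_w) + 0.01 * np.random.randn(1000)
# Fit the model twice on (X, y); with identical initial weights the two
# learned parameter vectors should come out (near-)identical.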
My naive set up..
Adding
os.environ['SPARK_HOME'] = "/path/to/spark"
sys.path.append("/path/to/spark/python")
on top of my script.
from pyspark import SparkContext
from pyspark import SparkConf
Execution works from within PyCharm...
Though my next step is to figure out autocompletion and I bet the
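A slightly fuller sketch of the same header; the paths and the py4j zip name
depend on the Spark version, so treat them all as placeholders:

# Sketch; paths and the py4j zip name are placeholders.
import os, sys

os.environ['SPARK_HOME'] = "/path/to/spark"
sys.path.append("/path/to/spark/python")
sys.path.append("/path/to/spark/python/lib/py4j-0.8.2.1-src.zip")  # version-specific

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[*]").setAppName("pycharm-test")
sc = SparkContext(conf=conf)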
Hi,
We have set up Spark on an HPC system and are trying to implement a
data pipeline and some algorithms on top of it.
The input data is in HDF5 (these are very high-resolution brain images) and
it can be read via the h5py library in Python. So my current approach (which
seems to be working) is writing a
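A sketch of the kind of reading I mean (the dataset name "/images" and the
file paths are made up):

# Sketch; file paths and the dataset name "/images" are made up.
import h5py

def load_slices(path):
    # Read one HDF5 file on a worker and emit its slices as numpy arrays.
    with h5py.File(path, "r") as f:
        data = f["/images"][:]
    for i in range(data.shape[0]):
        yield (path, i, data[i])

paths = ["/data/brain_001.h5", "/data/brain_002.h5"]   # placeholders
rdd = sc.parallelize(paths, len(paths)).flatMap(load_slices)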
Hi,
I guess Spark uses "streaming" in the context of streaming live data, but
what I mean is something more along the lines of Hadoop Streaming, where
one can code in any programming language.
Or is something along those lines on the cards?
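(The closest thing I can see in the API is RDD.pipe(), which streams each
partition's elements through an external command via stdin/stdout, e.g. the
sketch below, but I was curious whether anything more integrated is planned.)

# Sketch; "./mapper.py" is a placeholder for any executable that reads
# lines from stdin and writes lines to stdout.
lines = sc.textFile("hdfs:///path/to/input")   # placeholder path
piped = lines.pipe("./mapper.py")
print piped.take(5)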
Thanks
--
Mohit
"When you want success as badly as you wan
Hi,
I have a csv file... (say "n" columns).
I am trying to do a filter operation like:
query = rdd.filter(lambda x: x[1] == "1234")
query.take(20)
Basically, this should return the rows with that specific value?
This manipulation is taking quite some time to execute.. (if I can
compare.. maybe slo
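Spelled out a bit more, this is the shape of what I mean (a sketch; the
comma split and the cache() call are assumptions, not necessarily what is
in my actual script):

# Sketch; assumes a comma-separated file without quoting issues.
lines = sc.textFile("hdfs:///path/to/file.csv")         # placeholder path
rows = lines.map(lambda line: line.split(",")).cache()  # cache if queried repeatedly
query = rows.filter(lambda x: x[1] == "1234")
print query.take(20)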
Not sure whether I understand your question correctly or not.
If you are trying to use Hadoop (as in the MapReduce programming model), then
basically you would have to use the Hadoop APIs to solve your problem.
But if you have data stored in HDFS and you want to use Spark to process
that data, then jus
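For the second case, a minimal sketch of what that looks like (the HDFS URI
is a placeholder):

# Sketch; the HDFS URI is a placeholder.
data = sc.textFile("hdfs://namenode:8020/path/to/data")
counts = data.flatMap(lambda line: line.split()) \
             .map(lambda w: (w, 1)) \
             .reduceByKey(lambda a, b: a + b)
print counts.take(10)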
Hi,
Is there an equivalent of LazyOutputFormat in Spark
(PySpark)?
http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/LazyOutputFormat.html
Basically, something where I only save files which have some data in them,
rather than saving all the files as
ext(conf = conf)
>
>
> On Fri, Feb 28, 2014 at 9:37 AM, Mohit Singh wrote:
>
>> Hi Bryn,
>> Thanks for the suggestion.
>> I tried that..
>> conf = pyspark.SparkConf().set("spark.executor.memory","20G")
>> But.. got an error
arning("Using SPARK_MEM to set amount of memory to use per
> executor process is " +
> > "deprecated, instead use spark.executor.memory")
> > }
> >
> > Thanks,
> > Bryn
> >
> >
> > On Wed, Feb 26, 2014 at 6:28 PM, Mohit Singh
.setSparkHome("/your/path/to/spark")
>.set("spark.executor.memory", "20G")
>.set("spark.logConf", "true")
> sc = pyspark.SparkContext(conf = conf)
>
> Hope that helps,
Hi,
I have been experimenting with pyspark lately...
Every now and then, I see this error being streamed to the pyspark shell..
most of the time the computation/operation completes.. and sometimes it
just gets stuck...
My setup is an 8-node cluster.. with loads of RAM (256 GB) and space (TBs)
per nod