Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-16 Thread Ryan
I don't think you can do parallel inserts into a Hive table without dynamic partitioning; for Hive locking, please refer to https://cwiki.apache.org/confluence/display/Hive/Locking. Other than that, it should work.
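For what it's worth, a minimal sketch of the dynamic-partition write being described, assuming a Hive-enabled session; the table name "events", partition column "dataset_name", and input path are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (SparkSession.builder
         .appName("dynamic-partition-insert")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partitioning so concurrent jobs can write to different
# partitions of the same table instead of contending for a table-level lock.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

df = spark.read.csv("/landing/traffic/*.csv", header=True, inferSchema=True)

# Each dataset lands in its own partition (dataset_name='traffic', 'crime', ...).
(df.withColumn("dataset_name", lit("traffic"))
   .write
   .mode("append")
   .partitionBy("dataset_name")
   .saveAsTable("events"))

With each dataset keyed to its own partition value, parallel loads touch disjoint partitions rather than the same unpartitioned table.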

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-16 Thread Ryan
You can build a search tree using the IDs within each partition to act like an index, or create a Bloom filter to see whether the current partition would have any hits. What's your expected QPS and response time for the filter request?
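A rough sketch of the per-partition membership check, using a broadcast Python set (a Bloom filter could replace the set to cut memory at the cost of some false positives); the column names id/payload and the path are assumptions, not from the thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("id-filter-sketch").getOrCreate()
sc = spark.sparkContext

# The small lookup list (~10K IDs) is broadcast once to every executor.
wanted_bc = sc.broadcast(set(range(10000)))

records = spark.read.parquet("/data/records.parquet")  # columns: id, payload

def keep_wanted(rows):
    wanted = wanted_bc.value
    for row in rows:
        if row['id'] in wanted:
            yield row

# Each partition is scanned once and only matching rows are kept.
hits = records.rdd.mapPartitions(keep_wanted)
print(hits.count())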

How to store 10M records in HDFS to speed up further filtering?

2017-04-16 Thread MoTao
Hi all, I have 10M (ID, BINARY) records, and the size of each BINARY is 5MB on average. In my daily application, I need to filter out 10K BINARY values according to an ID list. How should I store the whole dataset to make the filtering faster? I'm using DataFrames in Spark 2.0.0 and I've tried row-based form
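A rough sketch of one possible layout, under assumptions about paths and column names: write the data once as Parquet sorted by id, then filter by joining against the small broadcast ID list so the large binary column is never shuffled.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# One-time conversion: columnar Parquet keeps id and the 5MB binary payload in
# separate columns, and sorting by id groups nearby ids into the same files.
raw = spark.read.parquet("/data/raw_records")            # columns: id, binary
raw.sort("id").write.mode("overwrite").parquet("/data/records_by_id")

# Daily filtering: broadcast the ~10K-row ID list and inner-join on id.
records = spark.read.parquet("/data/records_by_id")
id_list = spark.read.csv("/data/id_list.csv", header=True)   # column: id
wanted = records.join(broadcast(id_list), on="id", how="inner")
wanted.write.parquet("/data/filtered_output")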

Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-16 Thread Gaurav1809
Hi All, I am looking for a data visualization (and analytics) tool. My processing is done through Spark. There are many tools available, and I have received some suggestions for Apache Zeppelin too. Can anybody throw some light on its power and capabilities when it comes to data analytics and visualizat

Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-16 Thread Amol Patil
Hi All, I'm writing a generic pyspark program to process multiple datasets using Spark SQL, for example traffic data, crime data, and weather data. Each dataset will be in CSV format and its size may vary from *1 GB* to *10 GB*. Each dataset will be available in a different timeframe (weekly, monthly, quarterly). My
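In case a concrete shape helps, a rough sketch of one way such a generic loader could be structured: a small config list drives the same read/write path for every dataset (all names and paths below are illustrative only).

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("generic-dataset-loader")
         .enableHiveSupport()
         .getOrCreate())

datasets = [
    {"name": "traffic", "path": "/landing/traffic/*.csv"},
    {"name": "crime",   "path": "/landing/crime/*.csv"},
    {"name": "weather", "path": "/landing/weather/*.csv"},
]

for ds in datasets:
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(ds["path"]))
    # Register each dataset as its own staging table so downstream Spark SQL
    # can query it by name, regardless of how often it arrives.
    df.write.mode("append").saveAsTable("staging_" + ds["name"])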

Re: Memory problems with simple ETL in Pyspark

2017-04-16 Thread Patrick McCarthy
The partitions helped! I added repartition() and my function looks like this now:

feature_df = (alldat_idx
    .withColumn('label', alldat_idx['label_val'].cast('double'))
    .groupBy('id', 'label')
    .agg(sort_array(collect_list('indexedSegs')).alias('collect_list_is'))
    .repartitio