Re: Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-16 Thread Ryan
I don't think you can do parallel inserts into a Hive table without dynamic partitioning; for Hive locking, please refer to https://cwiki.apache.org/confluence/display/Hive/Locking. Other than that, it should work.
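For what it's worth, a minimal sketch of the dynamic-partition write being described, assuming a Hive-enabled session; the table name "events", partition column "dataset_name", and input path are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = (SparkSession.builder
         .appName("dynamic-partition-insert")
         .enableHiveSupport()
         .getOrCreate())

# Allow dynamic partitioning so concurrent jobs can write to different
# partitions of the same table instead of contending for a table-level lock.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

df = spark.read.csv("/landing/traffic/*.csv", header=True, inferSchema=True)

# Each dataset lands in its own partition (dataset_name='traffic', 'crime', ...).
(df.withColumn("dataset_name", lit("traffic"))
   .write
   .mode("append")
   .partitionBy("dataset_name")
   .saveAsTable("events"))

With each dataset keyed to its own partition value, parallel loads touch disjoint partitions rather than the same unpartitioned table.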

Re: How to store 10M records in HDFS to speed up further filtering?

2017-04-16 Thread Ryan
You can build a search tree using the IDs within each partition to act like an index, or create a Bloom filter to see whether the current partition would have any hits. What's your expected QPS and response time for the filter request?
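A rough sketch of the per-partition membership check, using a broadcast Python set (a Bloom filter could replace the set to cut memory at the cost of some false positives); the column names id/payload and the path are assumptions, not from the thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("id-filter-sketch").getOrCreate()
sc = spark.sparkContext

# The small lookup list (~10K IDs) is broadcast once to every executor.
wanted_bc = sc.broadcast(set(range(10000)))

records = spark.read.parquet("/data/records.parquet")  # columns: id, payload

def keep_wanted(rows):
    wanted = wanted_bc.value
    for row in rows:
        if row['id'] in wanted:
            yield row

# Each partition is scanned once and only matching rows are kept.
hits = records.rdd.mapPartitions(keep_wanted)
print(hits.count())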

How to store 10M records in HDFS to speed up further filtering?

2017-04-16 Thread MoTao
Hi all, I have 10M (ID, BINARY) records, and the size of each BINARY is 5MB on average. In my daily application, I need to filter out 10K BINARY values according to an ID list. How should I store the whole dataset to make the filtering faster? I'm using DataFrames in Spark 2.0.0 and I've tried row-based form
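A rough sketch of one possible layout, under assumptions about paths and column names: write the data once as Parquet sorted by id, then filter by joining against the small broadcast ID list so the large binary column is never shuffled.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# One-time conversion: columnar Parquet keeps id and the 5MB binary payload in
# separate columns, and sorting by id groups nearby ids into the same files.
raw = spark.read.parquet("/data/raw_records")            # columns: id, binary
raw.sort("id").write.mode("overwrite").parquet("/data/records_by_id")

# Daily filtering: broadcast the ~10K-row ID list and inner-join on id.
records = spark.read.parquet("/data/records_by_id")
id_list = spark.read.csv("/data/id_list.csv", header=True)   # column: id
wanted = records.join(broadcast(id_list), on="id", how="inner")
wanted.write.parquet("/data/filtered_output")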

Shall I use Apache Zeppelin for data analytics & visualization?

2017-04-16 Thread Gaurav1809
Hi All, I am looking for a data visualization (and analytics) tool. My processing is done through Spark. There are many tools available, and I have received some suggestions for Apache Zeppelin too. Can anybody throw some light on its power and capabilities when it comes to data analytics and visualizat

Spark SQL (Pyspark) - Parallel processing of multiple datasets

2017-04-16 Thread Amol Patil
Hi All, I'm writing a generic pyspark program to process multiple datasets using Spark SQL, for example traffic data, crime data, and weather data. Each dataset will be in CSV format and its size may vary from *1 GB* to *10 GB*. Each dataset will be available in a different timeframe (weekly, monthly, quarterly). My
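In case a concrete shape helps, a rough sketch of one way such a generic loader could be structured: a small config list drives the same read/write path for every dataset (all names and paths below are illustrative only).

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("generic-dataset-loader")
         .enableHiveSupport()
         .getOrCreate())

datasets = [
    {"name": "traffic", "path": "/landing/traffic/*.csv"},
    {"name": "crime",   "path": "/landing/crime/*.csv"},
    {"name": "weather", "path": "/landing/weather/*.csv"},
]

for ds in datasets:
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(ds["path"]))
    # Register each dataset as its own staging table so downstream Spark SQL
    # can query it by name, regardless of how often it arrives.
    df.write.mode("append").saveAsTable("staging_" + ds["name"])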

Re: Memory problems with simple ETL in Pyspark

2017-04-16 Thread Patrick McCarthy
The partitions helped! I added repartition() and my function looks like this now:

feature_df = (alldat_idx
    .withColumn('label', alldat_idx['label_val'].cast('double'))
    .groupBy('id', 'label')
    .agg(sort_array(collect_list('indexedSegs')).alias('collect_list_is'))
    .repartitio