If you are using EMR, please try their latest release. If you are using SQL, there will be very few reasons left to use Spark at all (particularly given that hiveContext relies heavily on Hive).
On plain CSV data alone I have seen Hive on Tez give performance gains of 100x (a query over 64 million rows x 570 columns in 2.5 minutes). With ORC it is faster still (a query over 64 million rows x 570 columns in 54 seconds), and with proper partitioning and indexing in ORC it is faster again (a query over 64 million rows x 570 columns in 19 seconds). There is perhaps a reason why Spark is slow when using ORC :)

Regards,
Gourav

On Thu, Jul 21, 2016 at 12:40 PM, Ashutosh Kumar <[email protected]> wrote:

> It works. Is it better to have Hive in this case for better performance?
>
> On Thu, Jul 21, 2016 at 12:30 PM, Simone <[email protected]> wrote:
>
>> If you have a folder and a bunch of JSON files inside that folder, yes, it
>> should work. Just set the path to something like "path/to/your/folder/*.json".
>> All files will be loaded into a dataframe, and the schema will be the union of
>> all the different schemas of your JSON files (only if you have different
>> schemas).
>> It should work - let me know.
>>
>> Simone Miraglia
>> ------------------------------
>> From: Ashutosh Kumar <[email protected]>
>> Sent: 21/07/2016 08:55
>> To: Simone <[email protected]>; user @spark <[email protected]>
>> Subject: Re: Reading multiple json files form nested folders for data frame
>>
>> That example points to a particular JSON file. Will it work the same way if I
>> point to the top-level folder containing all JSON files?
>>
>> On Thu, Jul 21, 2016 at 12:04 PM, Simone <[email protected]> wrote:
>>
>>> Yes you can - have a look here:
>>> http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets
>>>
>>> Hope it helps
>>>
>>> Simone Miraglia
>>> ------------------------------
>>> From: Ashutosh Kumar <[email protected]>
>>> Sent: 21/07/2016 08:19
>>> To: user @spark <[email protected]>
>>> Subject: Reading multiple json files form nested folders for data frame
>>>
>>> I need to read a bunch of JSON files kept in date-wise folders and perform
>>> SQL queries on them using a data frame. Is it possible to do so? Please
>>> provide some pointers.
>>>
>>> Thanks
>>> Ashutosh

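For anyone finding this thread later, here is a rough PySpark sketch of the folder-glob approach Simone describes in the quoted thread. It is only an illustration under some assumptions: the helper names (json_glob, query_json_folders), the view name "events", and the example path are all made up, and I am assuming the Spark 2.x SparkSession API (on 1.x, sqlContext.read.json should accept the same kind of path).

```python
def json_glob(base_dir, depth=1):
    """Build a path glob matching *.json files `depth` folder levels below base_dir.

    depth=0 gives "base_dir/*.json" (the pattern quoted in the thread);
    depth=1 gives "base_dir/*/*.json", e.g. one sub-folder per date.
    """
    parts = [base_dir.rstrip("/")] + ["*"] * depth + ["*.json"]
    return "/".join(parts)


def query_json_folders(spark, base_dir, sql, depth=1, view_name="events"):
    """Load all matching JSON files into one DataFrame and run a SQL query on it.

    `spark` is an existing SparkSession (assumed, not created here);
    `view_name` is a hypothetical temp view name that `sql` should refer to.
    """
    df = spark.read.json(json_glob(base_dir, depth))
    df.createOrReplaceTempView(view_name)
    return spark.sql(sql)
```

With a session from SparkSession.builder.getOrCreate(), something like query_json_folders(spark, "/data/events", "SELECT COUNT(*) FROM events") would then query every JSON file one folder level down. The depth argument is just a convenience for date-wise layouts; passing the glob string straight to spark.read.json works as well.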