ra/browse/SPARK-4502
>
> On Thu, Oct 29, 2015 at 6:00 PM, Sadhan Sood
> wrote:
>
>> I noticed when querying struct data in spark sql, we are requesting the
>> whole column from parquet files. Is this intended or is there some kind of
>> config to control this behaviour? W
I noticed when querying struct data in spark sql, we are requesting the
whole column from parquet files. Is this intended or is there some kind of
config to control this behaviour? Wouldn't it be better to request just the
struct field?
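As a minimal sketch of the access pattern being asked about (Spark 1.x shell
API; the table path and the struct field names below are assumptions, not
taken from this thread):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)
// `payload` is assumed to be a struct column with a leaf field `user_id`
val events = sqlContext.parquetFile("/path/to/events")
events.registerTempTable("events")
// The question above: does this read only payload.user_id from the parquet
// files, or does it request the whole `payload` column?
sqlContext.sql("SELECT payload.user_id FROM events").collect()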
/sql-programming-guide.html#scheduling
>
> You likely want to put each user in their own pool.
>
> On Tue, Oct 20, 2015 at 11:55 AM, Sadhan Sood
> wrote:
>
>> Hi All,
>>
>> Does anyone have fair scheduling working for them in a hive server? I
>> have one
Hi All,
Does anyone have fair scheduling working for them in a hive server? I have
one hive thriftserver running and multiple users trying to run queries at
the same time on that server using a beeline client. I see that a big query
is stopping all other queries from making any progress. Is this s
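As a minimal sketch of the per-user pool suggestion above (Spark 1.x API; the
pool name is an assumption, and the pools themselves are defined in the file
pointed to by spark.scheduler.allocation.file with spark.scheduler.mode=FAIR):

// run once per user session/thread before submitting that user's queries
sc.setLocalProperty("spark.scheduler.pool", "user_a_pool")
sqlContext.sql("SELECT count(*) FROM some_table").collect()
// clear it afterwards so later work falls back to the default pool
sc.setLocalProperty("spark.scheduler.pool", null)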
I am trying to run a query on a month of data. The volume of data is not
much, but we have a partition per hour and per day. The table schema is
heavily nested, with a total of 300 leaf fields. I am trying to run a simple
select count(*) query on the table and running into this exception:
SELECT
Hi Spark users,
We are running Spark on Yarn and often query table partitions as big as
100~200 GB from hdfs. Hdfs is co-located on the same cluster on which Spark
and Yarn run. I've noticed much higher I/O read rates when I increase the
number of executor cores from 2 to 8 (most tasks run in
Interestingly, if there is nothing running on the dev spark-shell, it recovers
successfully and regains the lost executors. Attaching the log for that.
Notice the "Registering block manager .." statements at the very end, after
all executors were lost.
On Wed, Aug 26, 2015 at 11:27 AM, S
Attaching the log for when the dev job gets stuck (once all its executors are
lost due to preemption). This is a spark-shell job running in yarn-client
mode.
On Wed, Aug 26, 2015 at 10:45 AM, Sadhan Sood wrote:
> Hi All,
>
> We've set up our spark cluster on aws running on yarn (run
Hi All,
We've set up our spark cluster on aws running on yarn (running on hadoop
2.3) with fair scheduling and preemption turned on. The cluster is shared
for prod and dev work where prod runs with a higher fair share and can
preempt dev jobs if there are not enough resources available for it.
It
Hi Xu-dong,
That's probably because your table's partition paths don't look like
hdfs://somepath/key=value/*.parquet. Spark tries to extract the partition
key's value from the path while caching, and the exception is thrown
because it can't find one.
On Mon, Jan 26, 2015 at 10:45 AM, Z
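To illustrate the reply above, a sketch of the layout the partition-value
extraction expects (paths here are hypothetical):

// Partition directories need key=value segments, e.g.
//   hdfs://nn/warehouse/events/date=20150126/part-00000.parquet
//   hdfs://nn/warehouse/events/date=20150127/part-00000.parquet
// A path like hdfs://nn/warehouse/events/20150126/part-00000.parquet has no
// key to parse, which matches the exception described above.
val jan26 = sqlContext.parquetFile("hdfs://nn/warehouse/events/date=20150126")
jan26.registerTempTable("events_jan26")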
t are not supported.
>
>
> On 12/20/14 6:17 AM, Sadhan Sood wrote:
>
> Hey Michael,
>
> Thank you for clarifying that. Is Tachyon the right way to get compressed
> data in memory or should we explore the option of adding compression to
> cached data. This is because our uncomp
date = 201412XX' - the way we are doing right now.
>>
>>
>> On Thu, Dec 18, 2014 at 6:46 PM, Michael Armbrust wrote:
>>>
>>> There is only column level encoding (run length encoding, delta
>>> encoding, dictionary encoding) and no gene
> There is only column level encoding (run length encoding, delta encoding,
> dictionary encoding) and no generic compression.
>
> On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood
> wrote:
>>
>> Hi All,
>>
>> Wondering if when caching a table backed by lzo compr
Hi All,
Wondering, when caching a table backed by lzo compressed parquet data, whether
spark also compresses it (using lzo/gzip/snappy) along with the column level
encoding or just does the column level encoding when
"spark.sql.inMemoryColumnarStorage.compressed"
is set to true. This is because when I
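A short sketch of the setup being asked about (Spark 1.x API; the table name
is an assumption). Per the reply quoted above, with this flag on the in-memory
store applies only column-level encodings (RLE, delta, dictionary), not
generic lzo/gzip/snappy compression:

sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.cacheTable("lzo_parquet_table")
// force materialization of the in-memory columnar buffers
sqlContext.sql("SELECT count(*) FROM lzo_parquet_table").collect()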
We create the table definition by reading the parquet file for its schema and
store it in the hive metastore. But if someone adds a new column to the schema,
and we rescan the schema from the new parquet files and update the table
definition, would it still work if we run queries on the table?
So, old
Thanks Michael, opened this https://issues.apache.org/jira/browse/SPARK-4520
On Thu, Nov 20, 2014 at 2:59 PM, Michael Armbrust
wrote:
> Can you open a JIRA?
>
> On Thu, Nov 20, 2014 at 10:39 AM, Sadhan Sood
> wrote:
>
>> I am running on master, pulled yesterday I believe b
Ah awesome, thanks!!
On Thu, Nov 20, 2014 at 3:01 PM, Michael Armbrust
wrote:
> In 1.2 by default we use Spark parquet support instead of Hive when the
> SerDe contains the word "Parquet". This should work with hive partitioning.
>
> On Thu, Nov 20, 2014 at 10:33 AM
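A sketch of the behaviour described above (table name assumed): in 1.2 the
native parquet path is used for metastore parquet tables by default and can
be toggled per session.

// defaults to true in 1.2; set to false to fall back to the Hive SerDe path
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
sqlContext.sql("SELECT count(*) FROM partitioned_parquet_table").collect()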
I am running on master (pulled yesterday, I believe) but saw the same issue
with 1.2.0.
On Thu, Nov 20, 2014 at 1:37 PM, Michael Armbrust
wrote:
> Which version are you running on again?
>
> On Thu, Nov 20, 2014 at 8:17 AM, Sadhan Sood
> wrote:
>
>> Also attaching the parquet
We are loading parquet data as temp tables but wondering if there is a way
to add a partition to the data without going through hive (we still want to
use spark's parquet serde rather than hive's). The data looks like ->
/date1/file1, /date1/file2 ... , /date2/file1,
/date2/file2,/daten/filem
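A hedged sketch of one way to do this without hive in the 1.x API (paths and
the table name are assumptions): load each date directory separately and
union the results before registering the temp table.

val dates = Seq("date1", "date2")  // e.g. from listing the parent directory
val perDate = dates.map(d => sqlContext.parquetFile(s"/data/$d"))
val all = perDate.reduce(_ unionAll _)
all.registerTempTable("events_all_dates")
// "adding a partition" then amounts to re-registering with the new date included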
Also attaching the parquet file if anyone wants to take a further look.
On Thu, Nov 20, 2014 at 8:54 AM, Sadhan Sood wrote:
> So, I am seeing this issue with spark sql throwing an exception when
> trying to read selective columns from a thrift parquet file and also when
> caching t
R:0 D:0 V:
value 2: R:0 D:0 V:
value 3: R:0 D:0 V:
value 4: R:0 D:0 V:
value 5: R:0 D:0 V:
value 6: R:0 D:0 V:
value 7: R:0 D:0 V:
value 8: R:0 D:0 V:
value 9: R:0 D:0 V:
I am happy to provide more information but any help is appreciated.
On Sun, Nov 16, 2014 at 7:40 PM, Sadhan Sood wrote:
ah makes sense - Thanks Michael!
On Mon, Nov 17, 2014 at 6:08 PM, Michael Armbrust
wrote:
> You are perhaps hitting an issue that was fixed by #3248
> <https://github.com/apache/spark/pull/3248>?
>
> On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood
> wrote:
>
>> W
While testing Spark SQL, we were running this group-by-with-expression query
and got an exception. The same query worked fine in Hive.
SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
'/MM/dd') as pst_date,
count(*) as num_xyzs
FROM
all_matched_abc
GROUP BY
In the meanwhile, would you mind helping
> to narrow down the problem by trying to scan exactly the same Parquet file
> with some other systems (e.g. Hive or Impala)? If other systems work, then
> there must be something wrong with Spark SQL.
>
> Cheng
>
> On Sun, Nov 16, 20
in master and branch-1.2 is 10,000
> rows per batch.
>
> On 11/14/14 1:27 AM, Sadhan Sood wrote:
>
> Thanks Cheng, just one more question - does that mean that we still
> need enough memory in the cluster to uncompress the data before it can be
> compressed again or does that j
While testing SparkSQL on a bunch of parquet files (basically what used to be
a partition of one of our hive tables), I encountered this error:
import org.apache.spark.sql.SchemaRDD
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
pressed to true. This property is
> already set to true by default in master branch and branch-1.2.
>
> On 11/13/14 7:16 AM, Sadhan Sood wrote:
>
> We noticed while caching data from our hive tables which contain data
> in compressed sequence file format that it gets uncompresse
We noticed that when caching data from our hive tables, which store data in
compressed sequence file format, it gets uncompressed in memory when cached.
Is there a way to turn this off and cache the compressed data as is?
output location for shuffle 0
The data is an lzo compressed sequence file with a compressed size of ~26G.
Is there a way to understand why the shuffle keeps failing for one partition?
I believe we have enough memory to store the uncompressed data in memory.
On Wed, Nov 12, 2014 at 2:50 PM, Sadhan Sood wrote
I think you can provide -Pbigtop-dist to build the tar.
On Wed, Nov 12, 2014 at 3:21 PM, Sean Owen wrote:
> mvn package doesn't make tarballs. It creates artifacts that will
> generally appear in target/ and subdirectories, and likewise within
> modules. Look at make-distribution.sh
>
> On Wed,
Just making sure, but are you looking for the tar in the assembly/target dir?
On Wed, Nov 12, 2014 at 3:14 PM, Ashwin Shankar
wrote:
> Hi,
> I just cloned spark from github and I'm trying to build it to generate a
> tarball.
> I'm doing : mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
> -Ds
lerBackend
(Logging.scala:logError(75)) - Asked to remove non-existent executor 372
2014-11-12 19:11:21,655 INFO scheduler.DAGScheduler
(Logging.scala:logInfo(59)) - Executor lost: 372 (epoch 3)
On Wed, Nov 12, 2014 at 12:31 PM, Sadhan Sood wrote:
> We are running spark on yarn with combined mem
We are running spark on yarn with combined memory > 1TB, and when trying to
cache a table partition (which is < 100G), we are seeing a lot of failed
collect stages in the UI and this never succeeds. Because of the failed collect, it
seems like the mapPartitions keep getting resubmitted. We have more than
en
port. I guess the Thrift server didn't start
> successfully because the HiveServer2 occupied the port, and your Beeline
> session was probably linked against HiveServer2.
>
> Cheng
>
>
> On 11/11/14 8:29 AM, Sadhan Sood wrote:
>
> I was testing out the spark thrift j
I was testing out the spark thrift jdbc server by running a simple query in
the beeline client. Spark itself is running on a yarn cluster.
However, when I run a query in beeline, I see no running jobs in the
spark UI (completely empty) and the yarn UI seems to indicate that the
submitted query
> On Fri, Oct 24, 2014 at 12:06 PM, Sadhan Sood
> wrote:
>
>> Is there a way to cache certain (or the most recent) partitions of certain
>> tables?
>>
>> On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust wrote:
>>
>>> It does have support for c
Is there a way to cache certain (or the most recent) partitions of certain
tables?
On Fri, Oct 24, 2014 at 2:35 PM, Michael Armbrust
wrote:
> It does have support for caching using either CACHE TABLE or
> CACHE TABLE AS SELECT
>
> On Fri, Oct 24, 2014 at 1:05 AM, ankits wrote:
>
>> I want t
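A small sketch of the CACHE TABLE AS SELECT suggestion quoted above, used to
cache only the latest partition (table, column, and date are assumptions):

sqlContext.sql(
  """CACHE TABLE events_latest AS
    |SELECT * FROM events WHERE dt = '2014-10-24'""".stripMargin)
// subsequent queries hit the cached projection
sqlContext.sql("SELECT count(*) FROM events_latest").collect()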
These seem like s3 connection errors for the table data. Wondering why, since
we don't see that many failures on hive. I also set spark.task.maxFailures
= 15.
On Fri, Oct 24, 2014 at 12:15 PM, Sadhan Sood wrote:
> Hi,
>
> Trying to run a query on spark-sql but it keeps failing w
Hi,
Trying to run a query on spark-sql but it keeps failing with this error on
the cli (we are running spark-sql on a yarn cluster):
org.apache.spark.SparkException: Job cancelled because SparkContext was
shut down
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$
Thanks Michael, you saved me a lot of time!
On Wed, Oct 22, 2014 at 6:04 PM, Michael Armbrust
wrote:
> The JDBC server is what you are looking for:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbc-server
>
> On Wed, Oct 22, 2014 at 11:10 AM,
We want to run multiple instances of the spark sql cli on our yarn cluster.
Each instance of the cli is to be used by a different user. This looks
non-optimal if each user brings up a different cli, given how spark works on
yarn by running executor processes (and hence consuming resources) on
worker nod
to pass a
> comma-delimited
> > list of paths.
> >
> > I've opened SPARK-3928: Support wildcard matches on Parquet files to
> request
> > this feature.
> >
> > Nick
> >
> > On Mon, Oct 13, 2014 at 12:21 PM, Sadhan Sood
> wrote:
>
How can we read all parquet files in a directory in spark-sql? We are
following this example, which shows a way to read one file:
// Read in the parquet file created above. Parquet files are self-describing
// so the schema is preserved.
// The result of loading a Parquet file is also a SchemaRDD.
val
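A hedged sketch (directory path and table name assumed) of reading a whole
directory of parquet part files rather than a single file; the reply above
also mentions passing a comma-delimited list of paths and SPARK-3928 for
wildcard support.

// point parquetFile at the directory; the part files inside are read and
// the schema is taken from the parquet metadata
val events = sqlContext.parquetFile("/data/events_parquet_dir")
events.registerTempTable("events")
sqlContext.sql("SELECT count(*) FROM events").collect()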
We want to persist the table schema of a parquet file so as to use the
spark-sql cli on that table later on. Is it possible, or is the spark-sql cli
only good for tables in the hive metastore? We are reading parquet data using
this example:
// Read in the parquet file created above. Parquet files are
// self-describi
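A heavily hedged sketch of one way to make the schema persistent (table name,
columns, and location are assumptions; this also assumes a Hive metastore is
configured and the Hive version in use understands STORED AS PARQUET):
register an external table over the parquet location so the spark-sql cli can
see it across sessions.

sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS events_parquet (id BIGINT, name STRING)
    |STORED AS PARQUET
    |LOCATION '/data/events_parquet_dir'""".stripMargin)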
-- Forwarded message --
From: Sadhan Sood
Date: Sat, Oct 11, 2014 at 10:26 AM
Subject: Re: how to find the sources for spark-project
To: Stephen Boesch
Thanks, I still didn't find it - is it under some particular branch? More
specifically, I am looking to modify the