Hi,
The method readAllFootersInParallel is implemented in Parquet's
ParquetFileReader, so the Spark config "spark.sql.files.ignoreCorruptFiles"
doesn't apply to it.
Reading all footers in parallel can speed up the task; however, we can't
control whether corrupt files are ignored or not.
Of course we ca
I forgot to say: another option is that we can replace readAllFootersInParallel
with our own parallel reading logic, so we can ignore corrupt files.
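Something along these lines (a rough Scala sketch, not actual Spark code; the
helper name, the IOException-only handling, and the use of parallel collections
are just assumptions for illustration):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileStatus
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.metadata.ParquetMetadata

// Read footers ourselves in parallel and optionally skip files whose footer
// cannot be read, instead of relying on readAllFootersInParallel.
def readFootersIgnoringCorrupt(
    conf: Configuration,
    files: Seq[FileStatus],
    ignoreCorruptFiles: Boolean): Seq[ParquetMetadata] = {
  files.par.flatMap { stat =>
    try {
      Some(ParquetFileReader.readFooter(
        conf, stat.getPath, ParquetMetadataConverter.NO_FILTER))
    } catch {
      case _: java.io.IOException if ignoreCorruptFiles =>
        // Corrupt footer: skip this file instead of failing the whole read.
        None
    }
  }.seq
}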
Liang-Chi Hsieh wrote
> Hi,
>
> The method readAllFootersInParallel is implemented in Parquet's
> ParquetFileReader. So the spark config
> "spark.sql.files.
Hi Imran,
Yes, you're right. I stand corrected! Thanks.
This is the part that opened my eyes:
> By the time that task has been assigned a location, and it's running on an
> executor, it doesn't matter anymore.
That's why a task does not have to have it after deserialization (!)
Thanks a lot.
O
Ryan,
I agree that Hive 1.2.1 works reliably with Spark 2.x, but I went ahead with
the current stable version of Hive, which is 2.0.1, and I am working with
that. It seems good, but I want to make sure which version of Hive is more
reliable with Spark 2.x, and I think @Ryan you replied the same, which I
Lars,
Thank you. I want to use DI for configuring all the properties (wiring) for
the architectural approach below:
Oracle -> Kafka Batch (Event Queuing) -> Spark Jobs (Incremental load from
HBase -> Hive with Transformation) -> Spark Transformation -> PostgreSQL
Thanks.
On Thu, Dec 29, 2016 at 3:2
Ted Yu,
You understood it wrong; I said incremental load from HBase to Hive.
Individually, you could call it incremental import from HBase.
On Wed, Dec 21, 2016 at 10:04 PM, Ted Yu wrote:
> Incremental load traditionally means generating hfiles and
> using org.apache.hadoop.hbase.mapreduce.LoadIncrementa
Hi,
another nice approach is to use the Reader monad instead, together with some
framework that supports this approach (e.g. Grafter -
https://github.com/zalando/grafter). It's lightweight and helps a bit with
dependency issues.
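To make the idea concrete, a minimal hand-rolled sketch of the Reader pattern
in Scala (this is not Grafter's actual API, and the JobEnv fields are made-up
placeholders):

// A tiny Reader monad: a computation that needs an environment Env to run.
case class Reader[Env, A](run: Env => A) {
  def map[B](f: A => B): Reader[Env, B] = Reader(env => f(run(env)))
  def flatMap[B](f: A => Reader[Env, B]): Reader[Env, B] =
    Reader(env => f(run(env)).run(env))
}

// The environment carries every dependency a component needs.
case class JobEnv(kafkaBrokers: String, jdbcUrl: String)

def readBrokers: Reader[JobEnv, String] = Reader(_.kafkaBrokers)
def readJdbcUrl: Reader[JobEnv, String] = Reader(_.jdbcUrl)

// Components compose without ever touching a concrete config...
val describeJob: Reader[JobEnv, String] =
  for {
    brokers <- readBrokers
    url     <- readJdbcUrl
  } yield s"consume from $brokers, write to $url"

// ...and the wiring happens once, at the edge of the program.
val description =
  describeJob.run(JobEnv("broker:9092", "jdbc:postgresql://db/warehouse"))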
2016-12-28 22:55 GMT+01:00 Lars Albertsson :
> Do you really need dependency
Just saw that there are many people with >= 8 open PRs. Some are
legitimately in flight but many are probably stale. To set a good example,
would (everyone) mind flicking through what they've got open and see if
some PRs are stale and should be closed?
https://spark-prs.appspot.com/users
Ok. I will go through and check my open PRs.
Sean Owen wrote
> Just saw that there are many people with >= 8 open PRs. Some are
> legitimately in flight but many are probably stale. To set a good example,
> would (everyone) mind flicking through what they've got open and see if
> some PRs are st
Hi, I would like to know more about typed aggregations in Spark.
http://stackoverflow.com/questions/40596638/inquiries-about-spark-2-0-dataset/40602882?noredirect=1#comment70139481_40602882
An example of these is
https://blog.codecentric.de/en/2016/07/spark-2-0-datasets-case-classes/
ds.groupByK
Hi all,
(cc-ing dev since I've hit some developer API corner)
What's the best way to convert an InternalRow to a Row if I've got an
InternalRow and the corresponding schema?
Code snippet:
@Test
public void foo() throws Exception {
Row row = RowFactory.create(1);
StructType
preliminary findings: seems to be transient, and affecting 4% of
builds from late december until now (which is as far back as we keep
build records for the PRB builds).
408 builds
16 builds.gc <--- failures
it's also happening across all workers at about the same rate.
and best of all, the
Your understanding is correct - it is indeed slower due to extra
serialization. In some cases we can get rid of the serialization if the
value is already deserialized.
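To illustrate the difference, a small sketch in Scala against Spark 2.x (the
Sale class and column names are made up for the example):

import org.apache.spark.sql.SparkSession

// Hypothetical record type, only for illustration.
case class Sale(shop: String, amount: Double)

val spark = SparkSession.builder()
  .appName("typed-agg-example").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Sale("a", 1.0), Sale("a", 2.0), Sale("b", 3.0)).toDS()

// Typed path: groupByKey/mapGroups hands deserialized Sale objects to the
// lambda, which is where the extra serialization cost comes from.
val typedTotals = ds.groupByKey(_.shop)
  .mapGroups { case (shop, sales) => (shop, sales.map(_.amount).sum) }

// Untyped equivalent: the aggregation can stay in the internal binary format.
val untypedTotals = ds.groupBy($"shop").sum("amount")

typedTotals.show()
untypedTotals.show()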
On Wed, Jan 4, 2017 at 7:19 AM, geoHeil wrote:
> Hi, I would like to know more about typed aggregations in Spark.
>
> http://
Thanks for the clarification.
rxin [via Apache Spark Developers List] <
ml-node+s1001551n20462...@n3.nabble.com> wrote on Wed, Jan 4, 2017 at 23:37:
> Your understanding is correct - it is indeed slower due to extra
> serialization. In some cases we can get rid of the serialization if the
> valu
Let me double-check mine too.
2017-01-04 21:57 GMT+09:00 Liang-Chi Hsieh :
>
> Ok. I will go through and check my open PRs.
>
>
> Sean Owen wrote
> > Just saw that there are many people with >= 8 open PRs. Some are
> > legitimately in flight but many are probably stale. To set a good
> example,
>
You need to resolve and bind the encoder.
ExpressionEncoder<Row> encoder = RowEncoder.apply(struct).resolveAndBind();
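For reference, a minimal sketch of the round trip in Scala (assuming Spark 2.x
APIs; the single int column is just an example schema):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val struct = StructType(Seq(StructField("value", IntegerType)))

// Resolve and bind the encoder to the schema before using it.
val encoder: ExpressionEncoder[Row] = RowEncoder(struct).resolveAndBind()

val internalRow: InternalRow = encoder.toRow(Row(1))   // Row -> InternalRow
val row: Row = encoder.fromRow(internalRow)            // InternalRow -> Row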
Andy Dang wrote
> Hi all,
> (cc-ing dev since I've hit some developer API corner)
>
> What's the best way to convert an InternalRow to a Row if I've got an
> InternalRow and the co
After checking the code, I think there are a few issues regarding this
ignoreCorruptFiles config, so you can't actually use it with Parquet files
now.
I opened a JIRA https://issues.apache.org/jira/browse/SPARK-19082 and also
submitted a PR for it.
khyati wrote
> Hi Reynold Xin,
>
> In spark 2.
Hi Chetan,
What do you mean by incremental load from HBase? There is a timestamp
marker for each cell, but not at the row level.
On Wed, Jan 4, 2017 at 10:37 PM, Chetan Khatri
wrote:
> Ted Yu,
>
> You understood wrong, i said Incremental load from HBase to Hive,
> individually you can say Incrementa
We've been able to use the iPOPO dependency injection framework in our PySpark
system and deploy .egg PySpark apps that resolve and wire up all the components
(like a kernel architecture; also similar to Spring) during an initial
bootstrap sequence, then invoke those components across Spark.
Just re
I believe that these two were indeed originally related. In the old
hash-based shuffle, we wrote objects out immediately to disk as they were
generated by an RDD's iterator. On the other hand, with the original
version of the new sort-based shuffle, Spark buffered a bunch of objects
before writing
I've noticed a bunch of the recent builds failing because of GC limits, for
seemingly unrelated changes (e.g. 70818, 70840, 70842). Shane, have there
been any recent changes in the build configuration that might be causing
this? Does anyone else have any ideas about what's going on here?
-Kay
I have spent a lot of time trying to figure out the following problem. I need
to consume messages from a topic on a remote Kafka queue using Scala and
Spark. The Kafka port on the remote machine is set to `7072`, not the default
`9092`. Also, on the remote machine there are the following versions install
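Without knowing the exact versions, here is a hedged sketch of how the
non-default port is usually passed, assuming the spark-streaming-kafka-0-10
integration (host, topic, and group id are placeholders):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("kafka-7072")
val ssc = new StreamingContext(conf, Seconds(10))

// The non-default port goes into bootstrap.servers; nothing else is port-specific.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "remote-host:7072",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group",
  "auto.offset.reset" -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("my-topic"), kafkaParams))

stream.map(record => record.value()).print()
ssc.start()
ssc.awaitTermination()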