flatmap?
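For example, a minimal, untested sketch (Scala API; `rdd` stands for your RDD of strings):

    // Drop empty strings with filter:
    val cleaned = rdd.filter(s => s != null && !s.isEmpty)

    // flatMap works too, if each record can expand to zero or more values:
    val cleaned2 = rdd.flatMap(s => if (s == null || s.isEmpty) Nil else Seq(s))

    // Java 8 equivalent for a JavaRDD<String>:
    //   JavaRDD<String> cleaned = rdd.filter(s -> !s.isEmpty());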
--
Chris Miller
On Thu, Apr 7, 2016 at 10:25 PM, greg huang wrote:
> Hi All,
>
> Can someone give me an example of code to get rid of the empty strings in
> JavaRDD? I know there is a filter method in JavaRDD:
> https://spark.apache.org/docs/1.6.0/api/java/org/apache/spa
With Avro you solve this by using a default value for the new field...
maybe Parquet is the same?
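If Parquet schema merging is what you're after, Spark SQL does support it directly. A minimal sketch against the 1.6 API (the path is a placeholder):

    val df = sqlContext.read
      .option("mergeSchema", "true")   // reconcile differing file schemas
      .parquet("s3n://bucket/table")   // placeholder path
    df.printSchema()                   // columns missing from one file come back as nullable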
--
Chris Miller
On Tue, Mar 22, 2016 at 9:34 PM, gtinside wrote:
> Hi ,
>
> I have a table sourced from *2 parquet files* with a few extra columns in one
> of the parquet file. Simp
If you have lots of small files, distcp should handle that well -- it's
supposed to distribute the transfer of files across the nodes in your
cluster. Conductor looks interesting if you're trying to distribute the
transfer of single, large file(s)...
right?
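For reference, a typical invocation looks something like this (both paths are placeholders):

    hadoop distcp hdfs://namenode:8020/data/small-files s3n://bucket/backup/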
--
Chris Miller
On Wed, Ma
Short answer: Nope
Less short answer: Spark is not designed to maintain sort order in this
case... it *may*, but there's no guarantee... generally, the result will not
come back in the same order unless you include a field to order by and then
explicitly sort on it.
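A minimal sketch of the explicit approach (untested; `timestamp` is a placeholder ordering field):

    // Impose an ordering explicitly rather than relying on Spark to keep one.
    val ordered = rdd.sortBy(record => record.timestamp)
    ordered.saveAsTextFile("s3n://bucket/output")   // placeholder path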
--
Chris Miller
On Wed, M
r you described.
--
Chris Miller
On Tue, Mar 15, 2016 at 11:22 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:
> There are many solutions to a problem.
>
> Also understand that sometimes your situation might be such. For example, what
> if you are accessing S3 from
Cool! Thanks for sharing.
--
Chris Miller
On Sun, Mar 13, 2016 at 12:53 AM, Todd Nist wrote:
> Below is a link to an example which Silvio Fiorito put together
> demonstrating how to link Zeppelin with Spark Stream for real-time charts.
> I think the original thread was back in early
for maintain.
>
> I'm just wondering, what's the best way to store the Stats table (a
> database or a parquet file?)
> What exactly are you trying to do? Zeppelin is for interactive analysis of
> a dataset. What do you mean "realtime analytics" -- do you mean build a
> report or dashboard that automatically updates as new data comes in?
removed.
Finally, if I add rdd.persist(), then it doesn't work. I guess I would need
to do .map(_._1.datum) again before the map that does the real work.
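For the archives, a sketch of the copy-before-cache pattern (untested; assumes `avroRdd` is a placeholder for an RDD[(AvroKey[GenericRecord], NullWritable)] loaded via newAPIHadoopFile):

    import org.apache.avro.generic.GenericData

    // The RecordReader re-uses one Writable, so deep-copy each datum
    // before caching; otherwise every cached row points at the same object.
    val records = avroRdd.map { case (key, _) =>
      GenericData.get().deepCopy(key.datum.getSchema, key.datum)
    }
    records.persist()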
--
Chris Miller
On Sat, Mar 12, 2016 at 4:15 PM, Chris Miller
wrote:
> Wow! That sure is buried in the documentation! But yeah, that
println(record.get("myValue"))
})
What am I doing wrong?
--
Chris Miller
On Sat, Mar 12, 2016 at 1:48 PM, Peyman Mohajerian
wrote:
> Here is the reason for the behavior:
> '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable
> object
What exactly are you trying to do? Zeppelin is for interactive analysis of
a dataset. What do you mean "realtime analytics" -- do you mean build a
report or dashboard that automatically updates as new data comes in?
--
Chris Miller
On Sat, Mar 12, 2016 at 3:13 PM, trung kien wrote:
one the datum?
Seems I'm not the only one who ran into this problem:
https://github.com/GoogleCloudPlatform/DataflowJavaSDK/issues/102. I can't
figure out how to fix it in my case without hacking away like the person in
the linked PR did.
Suggestions?
--
Chris Miller
For anyone running into this same issue, it looks like Avro deserialization
is just broken when used with SparkSQL and partitioned schemas. I created
a bug report with details and a simplified example of how to reproduce it:
https://issues.apache.org/jira/browse/SPARK-13709
--
Chris Miller
On Fri
Gut instinct is no, Spark is overkill for your needs... you should be able
to accomplish all of that with a relational database or a column-oriented
database (depending on the types of queries you most frequently run and the
performance requirements).
--
Chris Miller
On Mon, Mar 7, 2016 at 1:17
Guru: This is a really great response. Thanks for taking the time to explain
all of this. Helpful for me too.
--
Chris Miller
On Sun, Mar 6, 2016 at 1:54 PM, Guru Medasani wrote:
> Hi Lan,
>
> Streaming K-Means, Linear Regression and Logistic Regression support online
> machine lea
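For anyone curious, a minimal sketch of the streaming k-means API mentioned above (untested; `trainingStream` and `testStream` are placeholder DStream[Vector] inputs):

    import org.apache.spark.mllib.clustering.StreamingKMeans

    val model = new StreamingKMeans()
      .setK(3)                    // number of clusters
      .setDecayFactor(1.0)        // weight given to historical data
      .setRandomCenters(2, 0.0)   // dim = 2, initial weight = 0.0

    model.trainOn(trainingStream)           // update the model as batches arrive
    model.predictOn(testStream).print()     // emit cluster assignments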
instead of writing to the file from coalesce, sort that data structure,
then write your file.
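Something along these lines, perhaps (untested; `keyOf` and the output path are placeholders):

    // Sort into a single partition, then write one file per batch.
    rdd.sortBy(keyOf, ascending = true, numPartitions = 1)
       .saveAsTextFile("s3n://bucket/output/batch-0001")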
--
Chris Miller
On Sat, Mar 5, 2016 at 5:24 AM, jelez wrote:
> My streaming job is creating files on S3.
> The problem is that those files end up very small if I just write them to
> S3
> dir
oding the same files work fine with Hive, and I
imagine the same deserializer code is used there too.
Thoughts?
--
Chris Miller
On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman wrote:
> your field name is
> *enum1_values*
>
> but you have data
> { "foo1": "test123&q
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:99)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
********
Any other ideas?
--
Chris Miller
On Thu, Mar 3, 2016 at 9:38 PM, Igor Berman wrote:
> your field name is
> *enum1_values*
>
> but
at java.lang.Thread.run(Thread.java:745)
********
In addition to the above, I also tried putting the test Avro files on HDFS
instead of S3 -- the error is the same. I also tried querying from Scala
instead of using Zeppelin, and I get the same error.
Where should I begin with troubleshooting this?
add the partitions manually so that I can specify a location.
For what it's worth, "ActionEnum" is the first field in my schema. This
same table and query structure works fine with Hive. When I try to run this
with SparkSQL, however, I get the above error.
Anyone have any idea what the problem is here? Thanks!
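For context, this is roughly how the partitions were added (table, partition key, and path are placeholders):

    sqlContext.sql(
      """ALTER TABLE my_avro_table
        |ADD PARTITION (dt='2016-03-01')
        |LOCATION 's3n://bucket/events/dt=2016-03-01'""".stripMargin)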
--
Chris Miller