I've not done this in Scala yet, but in PySpark I've run into a similar
issue where having too many DataFrames cached does cause memory issues.
Unpersist by itself did not clear the memory usage, but setting the
variable equal to None allowed all the references to be cleared and the
memory to be freed.
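For anyone who wants the shape of that, a minimal PySpark sketch (the cached
DataFrame here is just a stand-in for whatever you actually cached):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000000).cache()    # stand-in for the real cached DataFrame
df.count()                           # materialize the cache
df.unpersist()                       # release the cached blocks
df = None                            # drop the reference so Python can let go of it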
Hi Mich -
Can you run a filter command on df1 prior to your map, keeping only the rows
where p(3).toString != '-', and then run your map command?
Thanks
Mike
On Tue, Sep 27, 2016 at 5:06 PM, Mich Talebzadeh wrote:
> Thanks guys
>
> Actually these are the 7 rogue rows. The column 0 is the Volume column
>
While the SparkListener method is likely better all around, if you just
need this quickly you should be able to do an SSH local port redirection
over PuTTY. In the PuTTY configuration:
- Go to Connection: SSH: Tunnels
- In the Source port field, enter 4040 (or another unused port on your
machine)
- In the Destination field, enter the host running the Spark UI and the port
(localhost:4040 if you are SSHing into the driver node itself), then click Add
- Open the session and browse to http://localhost:4040 (or whichever source
port you chose) on your local machine to reach the Spark UI
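If you'd rather skip the PuTTY GUI, the same tunnel can be opened with a plain
OpenSSH command (the user and host below are placeholders for your own):

ssh -L 4040:localhost:4040 user@driver-node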
My guess is there's some row that does not match up with the expected
data. While slower, I've found RDDs easier for troubleshooting this kind
of thing until you sort out exactly what's happening.
Something like:
raw_data = sc.textFile("")
rowcounts = raw_data.map(lambda x: (len(x.split(",")), 1)).reduceByKey(lambda a, b: a + b)
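Building on the raw_data RDD above, the rows with an unexpected field count can
then be pulled out for a closer look (expected_fields is a placeholder for the
real column count of your data):

expected_fields = 7    # placeholder: the number of columns the data should have
bad_rows = raw_data.filter(lambda x: len(x.split(",")) != expected_fields)
print(bad_rows.take(10))    # sample a few of the rogue rows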
Hi Kevin -
There's not really a race condition, as the 64 bit value is split into a
31 bit partition id (the upper portion) and a 33 bit incrementing id. In
other words, as long as each partition contains fewer than 8 billion
entries there should be no overlap, and no communication between partitions
is needed to generate the ids.
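As a quick illustration, here is a PySpark sketch (mine, not from the thread)
that splits the generated value back into those two pieces:

from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 10, 1, 4).withColumn("uid", monotonically_increasing_id())

for row in df.collect():
    partition = row["uid"] >> 33               # upper 31 bits: partition id
    offset = row["uid"] & ((1 << 33) - 1)      # lower 33 bits: record number within the partition
    print(row["id"], partition, offset)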
5
>
> 102 B 10
>
>
>
> I want to update the FCST DF data, based on Product=Item#
>
>
>
> So the resultant FCST DF should have data
>
> Product, fcst_qty
>
> A 85
>
> B 35
>
>
>
> Ho
I would also suggest building the container manually first and setting up
everything you specifically need. Once done, you can then grab the history
file, pull out the invalid commands, and build out the completed
Dockerfile. Trying to troubleshoot an installation via the Dockerfile alone
is often an exercise in frustration.
on the
> sales qty (coming from the sales order DF)
>
>
>
> Hope it helps
>
>
>
> Subhajit
>
>
>
> *From:* Mike Metzger [mailto:m...@flexiblecreations.com]
> *Sent:* Friday, August 26, 2016 1:13 PM
> *To:* Subhajit Purkayastha
> *Cc:* user @spark
>
Without seeing the makeup of the DataFrames or what your logic is for
updating them, I'd suggest doing a join of the Forecast DF with the
appropriate columns from the SalesOrder DF.
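For what it's worth, a rough PySpark sketch of that kind of join. The
DataFrames, column names, and quantities below are made up to mirror the
thread, so adjust them to the real schemas:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical stand-ins for the Forecast and SalesOrder DataFrames
fcst_df = spark.createDataFrame([("A", 100), ("B", 50)], ["Product", "fcst_qty"])
sales_df = spark.createDataFrame([("A", 15), ("B", 15)], ["Item", "sales_qty"])

updated = (fcst_df
           .join(sales_df, fcst_df["Product"] == sales_df["Item"], "left")
           .withColumn("fcst_qty", F.col("fcst_qty") - F.coalesce(F.col("sales_qty"), F.lit(0)))
           .select("Product", "fcst_qty"))
updated.show()   # with these example numbers: A -> 85, B -> 35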
Mike
On Fri, Aug 26, 2016 at 11:53 AM, Subhajit Purkayastha wrote:
> I am using spark 2.0, have 2 DataFrames, Sa
Are you trying to always add x number of digits / characters, or are you
trying to pad to a specific length? If the latter, try using format
strings:
// Pad to 10 characters with 0's
val c = 123
f"$c%010d"
// Result is 0000000123
// Pad to 10 total characters with 0's
val c = 123.87
f"$c%010.2f"
// Result is 0000123.87
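If anyone needs the PySpark side of this, plain Python format specifiers
behave the same way (a parallel example added here, not part of the
original reply):

c = 123
print(f"{c:010d}")      # 0000000123
c = 123.87
print(f"{c:010.2f}")    # 0000123.87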
Assuming you know the number of elements in the list, this should work:
df.withColumn('total', df["_1"].getItem(0) + df["_1"].getItem(1) +
df["_1"].getItem(2))
Mike
On Mon, Aug 15, 2016 at 12:02 PM, Javier Rey wrote:
> Hi everyone,
>
> I have one dataframe with one column this column is an arr
> On 5 August 2016 at 17:34, Mike Metzger
> wrote:
>
>> Tony -
>>
>
>>> What is the performance implication on a large dataset?
>>>
>>> @sonal - I can't have a collision in the values.
>>>
>>>> On Fri, Aug 5, 2016 at 9:15 PM, Mike Metzger
>>>> wrote:
>>>> You can use the monotonically_
You can use the monotonically_increasing_id method to generate guaranteed
unique (but not necessarily consecutive) IDs. Calling something like:
df.withColumn("id", monotonically_increasing_id())
You don't mention which language you're using, but you'll need to pull in the
sql.functions library.
This is a little ugly, but it may do what you're after -
from pyspark.sql.functions import expr
df.withColumn('total', expr("+".join([col for col in df.columns])))
Note that any null value will leave that row's total null (NULL + anything is
NULL in Spark SQL), and this will likely error if there are any string columns
present.
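To treat nulls as zero instead (assuming all of the columns are numeric, and
reusing the expr import above), one option is:

df.withColumn('total', expr("+".join(["coalesce({}, 0)".format(c) for c in df.columns])))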
Mike
On Thu, Aug 4, 2016 at 8:41 AM, Javier Rey wrote:
> Hi