Hello Andreas,

I understand your concern that without accurate timing and power models for
the MC and PHY, and therefore without being able to model DFS correctly for
these two components, I might draw incorrect conclusions about how memory
system performance and energy scale with frequency. However, I am hoping
that just scaling the DRAM module frequency down (in multiples of the spec
frequency), while keeping both the MC and PHY at their spec frequencies,
will give me an idea of how the DRAM core module itself impacts the overall
system (CPU + memory subsystem) behavior.
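
For concreteness, here is a rough sketch (plain Python, not gem5 code) of how
I intend to derive the scaled timing parameters. The DDR3-1600 numbers and
the split between tCK-relative and absolute parameters are only illustrative,
taken from my reading of the datasheet:

# Rough sketch: derive DRAM timing parameters when the DRAM module clock
# is scaled down from the spec frequency.
SPEC_FREQ_MHZ = 800.0                  # DDR3-1600: 800 MHz clock (illustrative)

def scaled_timings(freq_mhz):
    assert freq_mhz <= SPEC_FREQ_MHZ, "only scaling down, never up"
    tCK = 1000.0 / freq_mhz            # clock period in ns
    return {
        # parameters defined in clock cycles stretch with tCK
        'tCK':    tCK,
        'tBURST': (8 / 2) * tCK,       # BL8, DDR: 4 clock cycles per burst
        # parameters defined as absolute times stay fixed (datasheet values)
        'tRCD':   13.75,
        'tRP':    13.75,
        'tCL':    13.75,
    }

print(scaled_timings(800.0))           # spec frequency
print(scaled_timings(400.0))           # half the spec frequency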

For the implementation, I want to defer adding a dedicated clock domain for
the MC and a multiplier for the PHY until I can get good MC and PHY timing
models. Given that I have no better latency model than the static latencies
gem5 uses for the MC, PHY, and device round-trip latencies, I plan to keep
these static latencies unscaled while scaling the DRAM module frequency. I
think this is reasonable as long as I only scale the DRAM module frequency
down and not up: scaling down ensures that the MC and PHY can always sustain
the peak bandwidth delivered by the DRAM module. For now, I will restrict my
experiments to scaling the memory frequency down only (I would appreciate
any concerns or suggestions here).
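
In gem5 config terms, a minimal sketch of what I have in mind looks like the
following (assuming the DDR3_1600_x64 class and the static_frontend_latency
and static_backend_latency parameters present on the commit I am using; the
values and scale factor are only illustrative):

# Minimal gem5 config sketch: stretch DRAM-module timings, leave MC/PHY alone.
from m5.objects import DDR3_1600_x64

def make_scaled_dram(scale=2.0):
    assert scale >= 1.0, "only scale the DRAM module frequency down"
    dram = DDR3_1600_x64()
    # DRAM-module timings that are a function of tCK are stretched ...
    dram.tCK = '%.3fns' % (1.25 * scale)
    dram.tBURST = '%.3fns' % (5.0 * scale)
    # ... while the MC/PHY static latencies are deliberately left unscaled,
    # e.g. dram.static_frontend_latency and dram.static_backend_latency,
    # since I have no better model for them than gem5's static values.
    return dram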

Thanks for pointing me to the DRAMPower model and its publications. I don't
fully understand the difference between the Micron and DRAMPower power models
yet, but I am digging in and hope to have a full understanding of DRAMPower
soon (as I will be applying DFS to the power model as well).

I am pushing on memory DFS because I am aiming to "predict" memory
performance while changing a memory "performance knob". To me it looks easier
to predict memory performance under a change in frequency (it involves a
couple of equations for the timing parameters) than under the use of
different power-down modes (in fact, I have no idea whether I would be able
to predict how long things would take if I stop/start using any of these
modes).

Thank you,
-Rizwana

On Fri, Jan 23, 2015 at 4:17 AM, Andreas Hansson <andreas.hans...@arm.com>
wrote:

>  Hi Rizwana,
>
>  If you really want to do DRAM DFS using the ClockDomains of gem5, we
> will indeed need to have a dedicated clock domain for the controller, and
> a multiplier for the PHY/interface. It can be done; we would just have to
> enhance all the timings to be expressed in clocks where appropriate, and
> then implement all the equations for the timings that are the max of an
> absolute time and a number of clocks. It is quite a lot of work though,
> and it will add a big chunk of complexity to the code.
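>
> As an illustrative example (not gem5 code), several JEDEC timings are
> specified exactly like that, e.g. DDR3's tWTR = max(4 nCK, 7.5 ns), so a
> frequency-aware controller would end up computing something like:
>
> # Illustrative only: a timing that is the max of an absolute time and a
> # number of clocks (e.g. DDR3 tWTR = max(4 nCK, 7.5 ns)).
> def timing_ns(n_ck, abs_ns, tck_ns):
>     return max(n_ck * tck_ns, abs_ns)
>
> print(timing_ns(4, 7.5, 1.25))   # 800 MHz clock: the 7.5 ns floor dominates
> print(timing_ns(4, 7.5, 2.50))   # 400 MHz clock: 4 clocks = 10 ns dominates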
>
>  Also, note that the DFS work done by Qingyuan and Rizwana (till now)
> uses the Micron power calculator. This is very dangerous (in my opinion),
> as the calculator assumes best-case timings, and when you start using DFS I
> have little or no confidence in this model. See all the DRAMPower
> publications on this issue. This particular issue is solved when using
> DRAMPower together with the gem5 DRAM controller model, so going forward
> you should not have these problems, Rizwana. In addition, neither of these
> models (Micron or DRAMPower) includes the PHY power, which is particularly
> non-linear. In short, I would be very careful in drawing any conclusions
> from these results.
>
>  Don’t get me wrong, I’m not trying to discourage you from looking at
> DRAM DFS. Just be aware that it’s not as easy and straightforward as it
> seems.
>
>  Andreas
>
>   From: Tao Zhang <tao.zhang.0...@gmail.com>
> Date: Friday, 23 January 2015 01:24
> To: Rizwana Begum <rizwana....@gmail.com>
> Cc: Andreas Hansson <andreas.hans...@arm.com>, gem5 users mailing list <
> gem5-users@gem5.org>
> Subject: Re: DRAM controller write requests merge
>
>  >>>> MC I believe should have either its own clock domain, or might
> work in L1/L2/Core clock domain.
>
>  It is more reasonable to assume the MC works at the same frequency as the
> DRAM rather than at the high CPU clock frequency. In fact, the MC frequency
> is relatively flexible in a real chip. It can even run much slower than the
> DRAM frequency as long as its peak bandwidth >= the DRAM peak bandwidth.
> Two CDCs (clock domain crossings) may be needed in the MC: one between the
> system bus and the MC, and the other between the MC and the PHY/DRAM bus.
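>
> As a quick illustrative check of that condition (the numbers are just an
> example for a 64-bit DDR3-1600 channel):
>
> # Illustrative check: the MC may run slower than the DRAM clock as long as
> # its internal datapath still covers the DRAM peak bandwidth.
> dram_peak = 1600e6 * 8        # 1600 MT/s x 8 bytes/transfer = 12.8 GB/s
> mc_peak   = 400e6 * 32        # MC at 400 MHz with a 32-byte datapath
> assert mc_peak >= dram_peak   # 12.8 GB/s >= 12.8 GB/s, so this MC is fine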
>
>  When it comes to the pros/cons of DVFS vs. low-power modes, Qingyuan Deng
> has a series of papers discussing it at various granularities: MemScale,
> MultiScale, CoScale (http://paul.rutgers.edu/~qdeng/). I am not going to
> argue which one is better, but they can at least give you straightforward
> insight into this technique in the memory subsystem.
>
>  -Tao
>
> On Thu, Jan 22, 2015 at 3:59 PM, Rizwana Begum <rizwana....@gmail.com>
> wrote:
>
>> Hello Andreas,
>>
>>  I totally agree with you that low-power modes are the way to go for
>> getting better energy-performance tradeoffs for memory, rather than DFS.
>> However, in the past I have experimented with DFS for the memory bus only.
>> As I changed the memory frequency I scaled only tBURST linearly with
>> frequency, and observed the performance vs. frequency trend for
>> memory-intensive benchmarks under the open-page policy. Under the
>> closed-page policy, I saw no performance improvement with increasing
>> memory frequency, as the command-to-command static latency of the DRAM
>> dominates. I also saw an energy vs. frequency trade-off, since background
>> energy scales with memory frequency.
>>
>>  All of the above DFS exploration was done on an old gem5 commit (commit:
>> d2404e). I had a simplified Micron memory power model and the simple
>> frequency scaling mentioned above implemented on top of that old commit.
>> Recently we moved to a more recent gem5 commit (commit: 4a411f) that has a
>> more detailed power and performance model than the old one. I am trying to
>> put together a quick DFS implementation here and observe the trends of
>> energy and performance vs. memory frequency. Exploring low-power modes
>> will then be my next step.
>>
>>  I am able to express the timing parameters that are specific to the DRAM
>> module in terms of the memory frequency. Some of the DRAM-related timing
>> parameters are static latencies, and some are a function of tCK (I got the
>> details from the Micron datasheet). From the discussion in this thread so
>> far, I think the PHY also works in sync with the memory frequency, while
>> the MC, I believe, should have either its own clock domain or work in the
>> L1/L2/core clock domain. However, given that I don't have a good model for
>> the MC and PHY latencies, for now I am planning to only scale the
>> DRAM-related parameters and leave the PHY and MC static latencies as they
>> are.
>>
>>  I appreciate your and Tao's inputs so far. I would be happy to receive
>> any further ideas you have regarding my DFS implementation approach.
>>
>>  Thank you,
>> -Rizwana
>>
>>
>>
>> On Thu, Jan 22, 2015 at 5:10 PM, Andreas Hansson <andreas.hans...@arm.com
>> > wrote:
>>
>>> Hi Rizwana,
>>>
>>> All objects belong to a clock domain. That said, there is no clock
>>> domain specifically for the memory controller in the example scripts. At
>>> the moment all the timings in the controller are absolute, and are not
>>> expressed in cycles.
>>>
>>> In general the best strategy to modulate DRAM performance is to use the
>>> low power modes (rather than DVFS). The energy spent is far from
>>> proportional, and thus it is better to be completely off when possible. We
>>> have some patches that add low-power modes to the controller and they
>>> should hopefully be on RB soon.
>>>
>>> Andreas
>>>
>>> -----
>>> On Jan 22, 2015 at 9:59 PM, Rizwana Begum <rizwana....@gmail.com> wrote:
>>>
>>>
>>> Ah, I see. Thanks for pointing me to the static latencies; I missed
>>> that. As the controller latency is modeled as a static latency, am I
>>> right in saying that, as of the latest commit, the MC is not attached to
>>> any clock domain in gem5?
>>>
>>> Thanks,
>>> -Rizwana
>>>
>>> On Thursday, January 22, 2015, Andreas Hansson via gem5-users <
>>> gem5-users@gem5.org> wrote:
>>>
>>> > Hi Rizwana,
>>> >
>>> > The DRAM controller has two parameters to control the static latency in
>>> > the controller itself, and also the PHY and actual round trip to the
>>> > device. These parameters are called front end and back end latency,
>>> and you
>>> > can set them to match a given controller architecture and/or PHY
>>> > implementation. That should be enough for most cases I would think.
>>> >
>>> > Andreas
>>> >
>>> > -----
>>> > On Jan 22, 2015 at 9:22 PM, Rizwana Begum via gem5-users <
>>> > gem5-users@gem5.org> wrote:
>>> >
>>> >
>>> > Great, that was helpful. So, am I right in assuming that the gem5 DRAM
>>> > controller model doesn't account for signal propagation delay on the
>>> > command/data bus? I am coming to this conclusion because the read
>>> > response event from the MC is scheduled to the upper ports tCL+tBURST
>>> > after the read command is issued. In fact, I had a chance to use
>>> > DRAMSim2 in the past, and I don't remember signal propagation delay
>>> > being accounted for there either. Is it so small that it can safely be
>>> > ignored?
>>> >
>>> > Thanks,
>>> > -Rizwana
>>> >
>>> > On Thu, Jan 22, 2015 at 2:58 PM, Tao Zhang <tao.zhang.0...@gmail.com>
>>> > wrote:
>>> >
>>> > > The timing of RL (aka tCL) is dedicated to the DRAM module. It is the
>>> > > time from when the DRAM module receives the CAS command to when it
>>> > > puts the first data on the interface/bus. On the MC/PHY side, one
>>> > > should additionally account for the signal propagation delay on the
>>> > > command/data bus. In fact, the "DQS" signal is also used to assist
>>> > > the read data sampling.
>>> > >
>>> > > The bus protocol is defined by JEDEC. It is completely different from
>>> > > AMBA/AHB. The bus has only one master (the MC) and may have multiple
>>> > > slaves (DRAM ranks), so it looks a bit like AHB-Lite. But in general,
>>> > > they are two different stories.
>>> > >
>>> > > -Tao
>>> > >
>>> > > On Thu, Jan 22, 2015 at 11:50 AM, Rizwana Begum <rizwana....@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> Thanks Tao for your response. That clarifies a lot of my questions.
>>> > >> So here is what I understand:
>>> > >>
>>> > >> The DRAM module runs at a particular clock frequency. The bus
>>> > >> connecting the DRAM module and the PHY runs in sync with this clock,
>>> > >> and the PHY runs synchronously to the DRAM module clock as well.
>>> > >> Now, for a 64-bit bus and a burst length of 8 (64 bytes transferred
>>> > >> per burst), my understanding of a read operation is that, after the
>>> > >> read command is issued, the first data is available after the read
>>> > >> latency. At the first clock edge after the read latency, 8 bytes are
>>> > >> sampled and transferred over the bus. Then on every consecutive
>>> > >> rising and falling clock edge, 8 more bytes are sampled and
>>> > >> transferred over the bus, for four consecutive clock cycles.
>>> > >> Thereby, the PHY receives all 64 bytes of data at the end of read
>>> > >> latency + 4 clock cycles. Is this right?
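>>> > >>
>>> > >> As a quick sanity check on those numbers (illustrative DDR3-1600
>>> > >> values; the read latency is just an example):
>>> > >>
>>> > >> bus_bytes = 8                          # 64-bit data bus
>>> > >> burst_len = 8                          # BL8: one beat per clock edge
>>> > >> tCK_ns    = 1.25                       # 800 MHz clock (illustrative)
>>> > >> tCL_ns    = 13.75                      # read latency, example value
>>> > >> tBURST_ns = (burst_len / 2) * tCK_ns   # 8 beats / 2 edges per cycle
>>> > >> print(bus_bytes * burst_len)           # 64 bytes per burst
>>> > >> print(tCL_ns + tBURST_ns)              # 13.75 + 5.0 = 18.75 ns total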
>>> > >>
>>> > >> Also, any idea whether this bus (connecting DRAM and PHY) is the same
>>> > >> as the system bus? For example, is it AMBA/AHB on the latest ARM SoCs?
>>> > >>
>>> > >> Thanks again,
>>> > >> -Rizwana
>>> > >>
>>> > >> On Thu, Jan 22, 2015 at 12:49 PM, Tao Zhang <tao.zhang.0...@gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >>> Hi Rizwana,
>>> > >>>
>>> > >>> see my understanding inline. Thanks,
>>> > >>>
>>> > >>> -Tao
>>> > >>>
>>> > >>> On Thu, Jan 22, 2015 at 8:12 AM, Rizwana Begum via gem5-users <
>>> > >>>> gem5-users@gem5.org> wrote:
>>> > >>>
>>> > >>>> Hello Andreas,
>>> > >>>>
>>> > >>>> Thanks for the reply. Sure, I will try to get the patch up on
>>> review
>>> > >>>> board.
>>> > >>>> I have another question: Though this is related to DDR/MC
>>> architecture
>>> > >>>> and not directly related to Gem5 DDR model implementation, I am
>>> > hoping you
>>> > >>>> (or anyone else on the list) would have a good understanding to
>>> > clarify my
>>> > >>>> confusions:
>>> > >>>>
>>> > >>>> As far as I understand, 'busBusyUntil' represents the data bus.
>>> > >>>> This variable is used to keep track of data bus availability:
>>> > >>>>
>>> > >>>> 1. Is the data bus the bus used to transfer data from the core DRAM
>>> > >>>> module to the PHY?
>>> > >>>>
>>> > >>>
>>> > >>>    Yes, you are right. In addition, this is also the bus to
>>> transfer
>>> > >>> data from PHY to DRAM module.
>>> > >>>
>>> > >>>
>>> > >>>> 2. I believe the PHY is the DRAM physical interface IP. Where is it
>>> > >>>> physically located? Is it located on the core alongside the memory
>>> > >>>> controller (MC), or on the DIMMs? And what exactly does the physical
>>> > >>>> bus (the wires connecting the DIMMs to the MC) connect: DRAM and
>>> > >>>> PHY, or PHY and MC?
>>> > >>>>
>>> > >>>
>>> > >>>     It is on the core/MC side. The physical bus connects the DRAM
>>> > >>> and the PHY. Logically, you can treat the PHY as a part of the MC
>>> > >>> that just incurs some extra latency; in that view, the physical bus
>>> > >>> can be seen as connecting the DRAM and the MC.
>>> > >>>
>>> > >>>
>>> > >>>> 3. My confusion is that the actual physical bus on the SoC
>>> > >>>> connecting the DRAM module and the MC should be different from the
>>> > >>>> data bus that 'busBusyUntil' represents. It takes tBURST ns (a
>>> > >>>> function of memory cycles) to transfer one burst on the data bus,
>>> > >>>> and the actual physical bus speed shouldn't depend on the memory
>>> > >>>> frequency when transferring data from the DRAM to the MC. Am I
>>> > >>>> right?
>>> > >>>>
>>> > >>>
>>> > >>>     The "busBusyUntil" is still valid. The actual physical bus
>>> speed
>>> > >>> should be consistent with the SPEC (e.g., 800MHz, 933MHz,
>>> 1600MHz...).
>>> > >>> Remember, the JEDEC DRAM is a Synchronous DRAM. It means both PHY
>>> and
>>> > DRAM
>>> > >>> module should be in sync with the same clock frequency. As one end
>>> of
>>> > the
>>> > >>> connection is the DRAM module, PHY should run at the same
>>> frequency as
>>> > DRAM
>>> > >>> module runs.
>>> > >>>
>>> > >>>
>>> > >>>> I would appreciate if anyone can provide insight into these
>>> questions.
>>> > >>>>
>>> > >>>> Thank you,
>>> > >>>> -Rizwana
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> On Wed, Jan 21, 2015 at 4:45 PM, Andreas Hansson <
>>> > >>>> andreas.hans...@arm.com> wrote:
>>> > >>>>
>>> > >>>>>  Hi Rizwana,
>>> > >>>>>
>>> > >>>>>  It could very well be that you’ve hit a bug. I’d suggest posting
>>> > >>>>> a review on the reviewboard to make it clearer what changes need
>>> > >>>>> to be done. If you’re not familiar with the process, have a look
>>> > >>>>> at http://www.gem5.org/Commit_Access. The easiest is to use the
>>> > >>>>> reviewboard mercurial plugin.
>>> > >>>>>
>>> > >>>>>  I look forward to seeing the patch.
>>> > >>>>>
>>> > >>>>>  Thanks,
>>> > >>>>>
>>> > >>>>>  Andreas
>>> > >>>>>
>>> > >>>>>   From: Rizwana Begum via gem5-users <gem5-users@gem5.org>
>>> > >>>>> Reply-To: Rizwana Begum <rizwana....@gmail.com>, gem5 users mailing
>>> > >>>>> list <gem5-users@gem5.org>
>>> > >>>>> Date: Wednesday, 21 January 2015 16:24
>>> > >>>>> To: gem5 users mailing list <gem5-users@gem5.org>
>>> > >>>>> Subject: [gem5-users] DRAM controller write requests merge
>>> > >>>>>
>>> > >>>>>  Hello Users,
>>> > >>>>>
>>> > >>>>>  I am trying to understand write packet queuing in the DRAM
>>> > >>>>> controller model, and I am looking at the 'addToWriteQueue'
>>> > >>>>> function. From my understanding so far, it merges write requests
>>> > >>>>> across burst boundaries. Looking at the following if statement:
>>> > >>>>>
>>> > >>>>>  if ((addr + size) >= (*w)->addr &&
>>> > >>>>>      ((*w)->addr + (*w)->size - addr) <= burstSize) {
>>> > >>>>>      // the new one is just before or partially
>>> > >>>>>      // overlapping with the existing one, and together
>>> > >>>>>      // they fit within a burst
>>> > >>>>>      ....
>>> > >>>>>  }
>>> > >>>>>
>>> > >>>>>  Merging here may make a write request cross a burst boundary.
>>> > >>>>> The size computation at the beginning of the for loop of this
>>> > >>>>> function suggests that packets are split at burst boundaries. For
>>> > >>>>> example, if the packet addr is 16, the burst size is 32 bytes and
>>> > >>>>> the packet request size is 25 bytes (all in decimal for ease),
>>> > >>>>> then 2 write bursts should be added to the queue: 16-31 and 32-40.
>>> > >>>>> However, while merging, let's say there already existed a packet
>>> > >>>>> in the write queue covering 32-40; then a write covering 16-40 is
>>> > >>>>> added to the queue, which crosses a burst boundary. Is that
>>> > >>>>> physically possible? Shouldn't there be two write requests in the
>>> > >>>>> queue, 16-31 and 32-40, instead of one single merged request?
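>>> > >>>>>
>>> > >>>>>  To make the case concrete, here is how I read that condition with
>>> > >>>>> the numbers above (plain arithmetic, not the gem5 code itself):
>>> > >>>>>
>>> > >>>>> # New burst from splitting the 25-byte packet: bytes 16-31.
>>> > >>>>> addr, size = 16, 16
>>> > >>>>> # Existing write-queue entry: bytes 32-40.
>>> > >>>>> w_addr, w_size = 32, 9
>>> > >>>>> burstSize = 32
>>> > >>>>> merge = (addr + size) >= w_addr and (w_addr + w_size - addr) <= burstSize
>>> > >>>>> print(merge)   # True: merged into 16-40, straddling the boundary at 32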
>>> > >>>>>
>>> > >>>>>  Thank you,
>>> > >>>>> -Rizwana
>>> > >>>>>
>>> > >>>>>
>>> > >>>>>
>>> > >>>>> -- IMPORTANT NOTICE: The contents of this email and any
>>> attachments
>>> > >>>>> are confidential and may also be privileged. If you are not the
>>> > intended
>>> > >>>>> recipient, please notify the sender immediately and do not
>>> disclose
>>> > the
>>> > >>>>> contents to any other person, use it for any purpose, or store or
>>> > copy the
>>> > >>>>> information in any medium. Thank you.
>>> > >>>>>
>>> > >>>>> ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1
>>> 9NJ,
>>> > >>>>> Registered in England & Wales, Company No: 2557590
>>> > >>>>> ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge
>>> CB1
>>> > >>>>> 9NJ, Registered in England & Wales, Company No: 2548782
>>> > >>>>>
>>> > >>>>
>>> > >>>>
>>> > >>>> _______________________________________________
>>> > >>>> gem5-users mailing list
>>> > >>>> gem5-users@gem5.org
>>> > >>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>> > >>>>
>>> > >>>
>>> > >>>
>>> > >>
>>> > >
>>> >
>>> > _______________________________________________
>>> > gem5-users mailing list
>>> > gem5-users@gem5.org
>>> > http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>
>>>
>>
>
> -- IMPORTANT NOTICE: The contents of this email and any attachments are
> confidential and may also be privileged. If you are not the intended
> recipient, please notify the sender immediately and do not disclose the
> contents to any other person, use it for any purpose, or store or copy the
> information in any medium. Thank you.
>
> ARM Limited, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
> Registered in England & Wales, Company No: 2557590
> ARM Holdings plc, Registered office 110 Fulbourn Road, Cambridge CB1 9NJ,
> Registered in England & Wales, Company No: 2548782
>
_______________________________________________
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
