Hi Daniel,

On 16.06.20 19:27, Daniel Schwierzeck wrote:
Actually I wanted to do a more complete header sync with Linux and
submit some minimal refactorings of start.S so that you have a better
working base. Unfortuneately I hadn't time to do so in the last 3 weeks.

I understand. Is there something I can help you with, to speed this
up?

Again, can I assist with some of your tasks? I really would like to see
the base Octeon support appear in mainline U-Boot in the upcoming
merge window. Please see below.

my current work state is pushed to u-boot-mips in branches
header_sync_v2 and lowlevel_refactoring. That should be enough for the
first refactoring step for the next merge window. I'll try to send
patches soon. With this you can also drop patches 2/12 and 8/12.

Okay. Thanks for working on this.


Regarding the copying to L2 cache: could we do this later and add the
init stuff at first? So we could have functionality at first and then
boot time optimization later. This also would give me some more time to
refactor the start.S hooks which would give you a better base for
implementing such optimizations.

IIUYC, then your main reason that I should remove the L2 cache copy
is, that you would like to refactor start.S first and add some hooks
for e.g. this L2C cache copy, where it would fit better than currently
in mips_sram_init(). I definitely promise to re-work this Octeon early
boot code to fit into your refactored start.S, once its ready. This
way, you wouldn't be pressed in time to get this rework done. And we
have a chance to get this Octeon base support integrated in a "decent"
state (bootup time, please see DDR4 init below) soon.

There are other reasons, beside the boot speedup, for the L2 cache
copy:

Running from L2 cache also helps with re-mapping the CFI flash
bootmap, so that the 8 MiB flash can be fully accessed. Otherwise
the flash is too big for the initial bootmap size and things like
environement can't be accessed.

The DDR init code that I'm currently working on - I'm actually
preparing the first patchset right now - is only tested with this
configration right now.

I'm also working on other peripheral drivers for Octeon. Some have
already started (like USB) and others are on my list. All this work
is pretty painful without the boot speedup, as e.g. the DDR4 init
will take very long (many minutes compared to a few seconds, as I've
been told).

But of course this is your call. If you insist on removing the L2 cache
copy from the base Octeon port, then I will re-visit the patchet and
try to address it.

I had a deeper look into it, but I still can't understand why and how
this L2 copy even works. Especially after a cold-boot from flash without
any Octeon specific CPU initialization.

U-Boot is a position-dependent binary and all branches and jumps are
implemented with absolute addresses. You can't simply add an offset to
$pc (except when changing KSEG) and expect U-Boot to be running from the
new location. That's why we have the relocation step and the reloc table
where all absolute addresses are fixed relative to the relocation address.

Yes, I am aware of this. That's why TEXT_BASE is already set to the
"destination" address in L2 cache and not to the boot space (see below).

Assembly dump of the current patch series:

ffffffff80000000 <_start>:
ENTRY(_start)
        /* U-Boot entry point */
        b       reset
ffffffff80000000:       1000013f        b       ffffffff80000500 <reset>
         mtc0   zero, CP0_COUNT # clear cp0 count for most accurate boot timing
ffffffff80000004:       40804800        mtc0    zero,$9
        ...
ffffffff80000500 <reset>:
1:      mfc0    t0, CP0_EBASE
ffffffff80000500:       400c7801        mfc0    t0,$15,1
        and     t0, t0, EBASE_CPUNUM
ffffffff80000504:       318c03ff        andi    t0,t0,0x3ff

        /* Hang if this isn't the first CPU in the system */
2:      beqz    t0, 4f
ffffffff80000508:       11800004        beqz    t0,ffffffff8000051c <reset+0x1c>
         nop
ffffffff8000050c:       00000000        nop
3:      wait
ffffffff80000510:       42000020        wait
        b       3b
ffffffff80000514:       1000fffe        b       ffffffff80000510 <reset+0x10>
         nop
ffffffff80000518:       00000000        nop

        /* Init CP0 Status */
4:      mfc0    t0, CP0_STATUS
ffffffff8000051c:       400c6000        mfc0    t0,$12
...

If you boot off from flash at ffffffffbfc00000, the very first
instruction will jump to ffffffff80000500. This can only work if there
is already some magic memory/TLB mapping from ffffffffbfc00000 to
ffffffff80000000 or I-caches are already active.

I'm assuming that the first "b reset" will translate into an relative
machine opcode, so a difference in TEXT_BASE and current PC won't be a
problem. Its impossible to code an absolute 64bit address in this 32bit
opcode. Could this be the case (you know MIPS far better than me)?

The first "real" call is to mips_sram_init(), which I addresses by
re-calculating the address in this patch:

    mips: start.S: Add CONFIG_MIPS_INIT_JUMP_OFFSET

But I might have missed something. Still it works reliably on my
platform this way.

Could you describe the basic system design (or share some documents)
i.e. how CPUs, busses, cache coherency engine, memory controller, flash
are connected and at which addresses them are mapped? And how the boot
process works? Are the I-caches already active when booting from flash?
This would really help me for understanding the code ;)

I agree. Unfortunately I can't share the manuals. As for the caches,
I can only assume that the I-cache is disabled at boot as booting is
so slow without this L2 copy.

Also the complete copy of the U-Boot binary to L2 cache breaks the
U-Boot design philosophy.

Yes, I am aware of this. And I agree that its not perfect. But the
necessary changes to the MIPS common code are very minimal.

The task of start.S is to initialize the CPU,
initialize the external memory, prepare global_data/BSS and setup the
initial stack space. Originally the MIPS start.S required to implement
everything in assembly. Later we added the option to use global_data and
initial stack from some SRAM to be able to use a full C environment for
the lowlevel_init() code. That's why start.S is now cluttered and needs
some refactoring.

I agree. start.S has become quite complex with all the #ifdef's.

Wouldn't it possible for the first Octeon patch series to simply fit in
this scheme and use the normal relocation step to copy the U-Boot binary
to RAM or effectively L2 cache? Without the DDR init code there is not
much code which would run slowly from flash.

Yes. But relocation is done to L2 cache in this case. And with the
DDR init code, relocation needs to be done to DDR (after DDR init).
So this would need a 2nd "relocation", which is cumbersome and not
supported. I also "played" with SPL for a while to overcome this
2nd relocation. But this didn't work very good AFAIR.

Or do you see a way to do this in a "clean way"?

While writing this, it occurred to me that you maybe just work-around a
non-working or not properly initialized I-cache when booting from flash.

Frankly, I don't really know or remember. I'll take another look at the
MIPS common cache setup and will try to use it, as I-cache is enabled
there.

Or maybe fetching single instructions from flash to I-cache is slower
than copying the code at once? Can't that be fixed? Would something like
arch/mips/mach-mscc/cpu.c:vcoreiii_tlb_init() work with Octeon?

Maybe. I'll take a look at this code again to see, if it fits for Octeon
as well.

Or was
something like that already done in Aaron's custom start.S but not in a
much understandable way (when not knowing the Octeon internals)?

Its a combination of both AFAIR. TLB setup and copy to L2 cache.

My reasoning was to reduce the complexity and "only" keep this L2 cache
copy seemed to work quite good. I also didn't want to introduce some
TLB mapping(s), as this also increases the complexity again. But as
mentioned above, I'll check again, what can be done here.


Do you have some guidance for me on this? How should we continue here?
Can I help with the Linux header sync or start.S refactoring? Did you
start with this work already or do you have a rough schedule?

Sorry for being so persistent on this. I really don't want to rush
you, but as mentioned above, I really would like to see at least the
Octeon base port in mainline U-Boot in the next merge window. And if
there is something that I can do to help you, then please let me know.


I also want to merge the base port as soon as possible.

Great. ;)

Than we could
figure out how to properly accelerate the DDR init code respectively all
code running from flash. But I want to avoid adding a second and custom
relocation step if there are better ways. And the discussion about this
should not block the initial patch series.

Could you rebase your patch series to u-boot-mips/header_sync_v2 and
move the L2 copy code to the DDR init patch series?

Good idea.

Than I could apply
the base port to u-boot-mips/next and you could still develop based on
the current L2 copy code?

Yes, I'll rework the code and will send a new base port v3 patchset
very soon.

Thanks for all your input and your assistance to improve the quality
of the code.

Thanks,
Stefan

Reply via email to