Hello KR, No need to apologize! Thank you for all your help on this issue. I actually didn't know about these "trampoline" sections, I had to look them up after your email. If you think it might make the code clearer, I can update the comments about `_eronly` to reflect that it is no longer the end of .text and .rodata, and instead just the start of .data? I can also mention these trampoline sections in between.
What do you think? Matteo On Wed, May 28, 2025 at 8:11 PM <kr....@kerogit.eu> wrote: > Well, I think I managed to figure it out. > > First, an apology - I didn't read the patch in #16457 properly and > thought it just moves the _eronly assignment while it, in fact, changes > ABSOLUTE to LOADADDR too. Sorry for that. > > After spotting this, I went and tried to find out how the change makes > things different because from my understanding, it should do the same > thing - assignment of current address where .data section begins should > be the same as assignment of address of the .data section. > > Turns out there is a difference - the linker also generated a > .trampoline section and placed it between _eronly assignment and > beginning of .data section. My guess is that when this section is not > mentioned in the linker script, the linker has free reign as to where to > put it...? Anyway, that made _eronly value incorrect by .trampoline > section size and things went badly from there. > > The issue can therefore also be fixed by this patch: > > diff --git a/boards/avr/atmega/arduino-mega2560/scripts/flash.ld > b/boards/avr/atmega/arduino-mega2560/scripts/flash.ld > index f365646dfa..f9ceeb3a08 100644 > --- a/boards/avr/atmega/arduino-mega2560/scripts/flash.ld > +++ b/boards/avr/atmega/arduino-mega2560/scripts/flash.ld > @@ -122,6 +122,12 @@ SECTIONS > _etext = . ; > } > flash > > + .trampolines : > + { > + *(.trampolines) > + *(.trampolines.*) > + } > flash > + > _eronly = ABSOLUTE(.); > > .data : > > The reason why I wasn't able to reproduce this on mega1284p-xplained is > then simple - there is no .trampolines section for that chip in the > compiled binary. > > In my opinion, the proposed patch in #16457 is a better solution. While > it does not preserve what the code comments say about the _eronly value > ("this is where .text + .rodata ends"), it is a better match to what the > _eronly value is actually used for ("this is start of .data section") > > On 2025-05-28 11:59, kr....@kerogit.eu wrote: > > Alright, I looked into the possible bug mentioned in the last paragraph > > of my previous message and it seems the code is actually correct. What > > I thought was that if register pair Z overflows, subsequent reads would > > read from incorrect memory locations. However, that does not seem to be > > the case - according to AVR Instruction Set Manual DS40002198A, ELPM > > instruction post-increment "applies to the entire 24-bit concatenation > > of the RAMPZ and Z-Pointer Registers" > > > > As for the change of _eronly however, I still think that is incorrect. > > Will try to look into that more during the evening unless someone > > figures something else out. > > > > What compiler are you using anyway? From the kernel version mentioned > > in GitHub I assume gcc 14.2? Might be the reason why everything works > > fine for me and not for anyone else, I am still on 5.4 from Debian > > stable. (Can try that too, Debian testing also has 14.2.) > > > > On 2025-05-28 09:16, kr....@kerogit.eu wrote: > >> Hello, > >> > >> I don't that is a correct fix and in fact, from a quick look at things > >> it seems to me that it actually introduces another bug. As described > >> in src/atmega/atmega_head.S, that value is supposed to be "Start of > >> .data section in FLASH". You changed it to end of .data section in > >> FLASH. It is used after __do_copy_data label as an address where > >> initialized data start in flash memory. So if I read this correctly, > >> you are copying incorrect data to your initialized variables. > >> > >> If the program starts working this way, that might suggest that the > >> initialization data are placed incorrectly in the image? Anyway, the > >> linker script is pretty much identical to linker script of > >> mega1284p-xplained and that one works out of the box without need to > >> move _eronly. > >> > >> Hm, looking at this, I think I just spotted a bug in __do_copy_data > >> but that would only hit if your initialized data region crossed a 64kB > >> boundary. Is that the case in any chance? It would cause _some_ > >> initialized variables to contain random garbage... > >> > >> On 2025-05-28 07:17, Matteo Golin wrote: > >>> Managed to find a fix here: > >>> https://github.com/apache/nuttx/pull/16457 > >>> > >>> On Tue, May 27, 2025 at 11:24 PM Matteo Golin > >>> <matteo.go...@gmail.com> > >>> wrote: > >>> > >>>> Thank you for your help KR! > >>>> > >>>> I also noticed this CONFIG_RAM_START=0x800100. Unfortunately, > >>>> changing > >>>> that value to the proper 0x800200 did not really help the issue. > >>>> > >>>> I went ahead and generated the disassembly like you suggested, and > >>>> it does > >>>> appear that the .text section starts with jump instructions for all > >>>> the > >>>> interrupt vectors: > >>>> > >>>> 00000000 <_stext>: > >>>> 0: 0c 94 72 00 jmp 0xe4 ; 0xe4 <__start> > >>>> 4: 0c 94 95 00 jmp 0x12a ; 0x12a > >>>> <atmega_int0> > >>>> 8: 0c 94 98 00 jmp 0x130 ; 0x130 > >>>> <atmega_int1> > >>>> c: 0c 94 9b 00 jmp 0x136 ; 0x136 > >>>> <atmega_int2> > >>>> 10: 0c 94 9e 00 jmp 0x13c ; 0x13c > >>>> <atmega_int3> > >>>> 14: 0c 94 a1 00 jmp 0x142 ; 0x142 > >>>> <atmega_int4> > >>>> 18: 0c 94 a4 00 jmp 0x148 ; 0x148 > >>>> <atmega_int5> > >>>> 1c: 0c 94 a7 00 jmp 0x14e ; 0x14e > >>>> <atmega_int6> > >>>> 20: 0c 94 aa 00 jmp 0x154 ; 0x154 > >>>> <atmega_int7> > >>>> 24: 0c 94 ad 00 jmp 0x15a ; 0x15a > >>>> <atmega_pcint0> > >>>> 28: 0c 94 b0 00 jmp 0x160 ; 0x160 > >>>> <atmega_pcint1> > >>>> 2c: 0c 94 b3 00 jmp 0x166 ; 0x166 > >>>> <atmega_pcint2> > >>>> 30: 0c 94 b6 00 jmp 0x16c ; 0x16c <atmega_wdt> > >>>> 34: 0c 94 b9 00 jmp 0x172 ; 0x172 > >>>> <atmega_tim2_compa> > >>>> 38: 0c 94 bc 00 jmp 0x178 ; 0x178 > >>>> <atmega_tim2_compb> > >>>> 3c: 0c 94 bf 00 jmp 0x17e ; 0x17e > >>>> <atmega_tim2_ovf> > >>>> 40: 0c 94 c2 00 jmp 0x184 ; 0x184 > >>>> <atmega_tim1_capt> > >>>> 44: 0c 94 c5 00 jmp 0x18a ; 0x18a > >>>> <atmega_tim1_compa> > >>>> 48: 0c 94 c8 00 jmp 0x190 ; 0x190 > >>>> <atmega_tim1_compb> > >>>> ...and so on... > >>>> > >>>> `g_idle_topstack` is 0x800e2b in the disassembly. With RAM_START > >>>> being > >>>> 0x800200, and a size of 8192 bytes, that gives a `heap_size` of 5076 > >>>> bytes. > >>>> That all seems fine. > >>>> > >>>> However, printing out the value of `g_idle_topstack` (in binary > >>>> using > >>>> `up_putc`) I can see that the value does not match. The printed > >>>> value is > >>>> "1000011001110110" while the binary representation of 0x800e2b is > >>>> "100000000000111000101011". > >>>> Printing out the `heap_size` tells me it's "1111100111001101", which > >>>> is > >>>> not 5076. > >>>> > >>>> It seems your suspicion about a mismatch is probably correct, > >>>> although > >>>> it's not due to a misplaced vector table it seems. > >>>> > >>>> I did, for fun, try enabling `CONFIG_DEBUG_OPT_UNUSED_SECTIONS` and > >>>> it did > >>>> remove the vectors at the start of the assembly file. That image > >>>> didn't > >>>> work. > >>>> Then I kept `CONFIG_DEBUG_OPT_UNUSED_SECTIONS` enabled and put the > >>>> `KEEP(*(.vectors))` suggestion in the linker script. Vectors are > >>>> kept in > >>>> the image, and `g_idle_topstack` changes to 0x800be4. This image > >>>> didn't > >>>> work either. > >>>> > >>>> That got me thinking if there was another section getting > >>>> unexpectedly > >>>> tossed, so I put `KEEP(...)` around basically everything in the > >>>> linker > >>>> script. That didn't produce a different result, and still had the > >>>> `g_idle_topstack` of 0x800e2b. > >>>> > >>>> I will continue to debug this, but if you have any other suggestions > >>>> please let me know. Thank you very much for all your help so far! > >>>> > >>>> Matteo > >>>> > >>>> On Tue, May 27, 2025 at 7:33 PM <kr....@kerogit.eu> wrote: > >>>> > >>>>> Thanks to you as well. > >>>>> > >>>>> I spotted an error in the config posted both in the issue #16444 > >>>>> and PR > >>>>> #16443. It is the > >>>>> > >>>>> CONFIG_RAM_START=0x800100 > >>>>> > >>>>> Unless I overlooked something, the RAM starts at address 0x200 > >>>>> (0x800200) for AtMega 2560. > >>>>> > >>>>> Not sure if that's what causing the problem though - as far as I > >>>>> can > >>>>> see, that value is actually only used as an input to CONFIG_RAM_END > >>>>> calculation and incorrect value in this macro simply reduces heap > >>>>> size. > >>>>> > >>>>> The symptoms of the bug would match though - the program would be > >>>>> trying > >>>>> to use I/O registers as RAM. > >>>>> > >>>>> Thing is - seeing the bug was narrowed down to heap initialization > >>>>> in > >>>>> nx_start - I remember that I ran into pretty much the same issue > >>>>> when > >>>>> adding support for AVR DA/DB family... and I can't remember what > >>>>> was the > >>>>> problem. > >>>>> > >>>>> Maybe check the disassembly file? (RAW_DISASSEMBLY needs to be set > >>>>> for > >>>>> the file to be generated.) Verify that the .text section starts > >>>>> with jmp > >>>>> instructions for interrupt vectors. This is only a vague memory but > >>>>> I > >>>>> think that I ran into the issue because the linker removed .vectors > >>>>> section, thinking it was unreferenced. (But that should not happen > >>>>> here, > >>>>> the linked config shows CONFIG_DEBUG_OPT_UNUSED_SECTIONS is not > >>>>> set.) > >>>>> > >>>>> Anyway, when that happened, the whole program got "shifted" - > >>>>> except for > >>>>> program memory that stored values of initialized variables. The > >>>>> data > >>>>> copy loop in initialization code (atmega_head.S) then read the > >>>>> values > >>>>> from the shifted location leading to a mismatch - variables were > >>>>> populated with wrong data. One of those variables was > >>>>> g_idle_topstack > >>>>> which directly sets start of heap. I think it initialized it to > >>>>> zero...? > >>>>> > >>>>> Again, all of this is based on vague memory. TLDR version is: if > >>>>> able, > >>>>> check g_idle_topstack in up_allocate_heap and verify that it is > >>>>> correct. > >>>>> > >>>>> > >>>>> On 2025-05-27 15:28, Alan C. Assis wrote: > >>>>> > Thank you KR, > >>>>> > > >>>>> > I posted your message in that issue for reference. > >>>>> > > >>>>> > I have a BK-AVR128 board: > >>>>> > https://aliexpress.com/item/1005006234334573.html > >>>>> > and will test NuttX on it to confirm. > >>>>> > > >>>>> > If everything is working as expected I will submit the board > support to > >>>>> > the > >>>>> > mainline. > >>>>> > > >>>>> > BR, > >>>>> > > >>>>> > Alan > >>>>> > > >>>>> > On Tue, May 27, 2025 at 6:39 AM <kr....@kerogit.eu> wrote: > >>>>> > > >>>>> >> Hello, > >>>>> >> > >>>>> >> as mentioned in one of my previous e-mails to this list (RFC: > >>>>> >> decoupling > >>>>> >> ability to always panic from board_reset), I noticed #16444 and > #16443 > >>>>> >> on GitHub. > >>>>> >> > >>>>> >> I found some time to look into this and I believe I found the > reason > >>>>> >> for > >>>>> >> "If you add avr_lowputc calls in the board initialization code, > you'll > >>>>> >> see that the TX LED stays stuck on indefinitely." > >>>>> >> > >>>>> >> Provided the board initialization code that sentence is talking > about > >>>>> >> is > >>>>> >> atmega_boardinitialize() in avr_boot.c, then avr_lowputc() likely > does > >>>>> >> not work because of the configuration used. According to #16444, > the > >>>>> >> configuration has: > >>>>> >> > >>>>> >> CONFIG_DEV_CONSOLE=y > >>>>> >> CONFIG_CONSOLE_SYSLOG=y > >>>>> >> > >>>>> >> These two are processed in src/atmega/atmega_config.h > >>>>> >> > >>>>> >> #ifndef CONFIG_DEV_CONSOLE > >>>>> >> # undef USE_SERIALDRIVER > >>>>> >> # undef USE_EARLYSERIALINIT > >>>>> >> #else > >>>>> >> # if defined(CONFIG_CONSOLE_SYSLOG) > >>>>> >> # undef USE_SERIALDRIVER > >>>>> >> # undef USE_EARLYSERIALINIT > >>>>> >> # elif defined(HAVE_USART_DEVICE) > >>>>> >> # define USE_SERIALDRIVER 1 > >>>>> >> # define USE_EARLYSERIALINIT 1 > >>>>> >> # else > >>>>> >> # undef USE_SERIALDRIVER > >>>>> >> # undef USE_EARLYSERIALINIT > >>>>> >> # endif > >>>>> >> #endif > >>>>> >> > >>>>> >> With the configuration above, the outer ifndef is not true and > first > >>>>> >> if > >>>>> >> defined in else block is true, which results into: > >>>>> >> > >>>>> >> undef USE_SERIALDRIVER > >>>>> >> undef USE_EARLYSERIALINIT > >>>>> >> > >>>>> >> Since CONFIG_STANDARD_SERIAL is also set, the undef of > >>>>> >> USE_SERIALDRIVER > >>>>> >> is reverted by define USE_SERIALDRIVER 1 later. However, > >>>>> >> USE_EARLYSERIALINIT remains unset. > >>>>> >> > >>>>> >> This causes avr_earlyserialinit() to not be built nor called from > >>>>> >> avr_lowinit(). Serial port peripheral is therefore not > initialized yet > >>>>> >> when atmega_boardinitialize() is called. I don't know what exactly > >>>>> >> happens when you attempt to transmit data with the port not > enabled > >>>>> >> but > >>>>> >> my guess would be that "transmit data register empty" status flag > is > >>>>> >> just never cleared and the program ends up in a loop waiting for > that > >>>>> >> to > >>>>> >> happen. > >>>>> >> > >>>>> >> > >>>>> >> Other than that - I recently tested NSH on mega1284p-xplained > (well, a > >>>>> >> breadboard with the chip stuck in it actually) and it worked for > me. > >>>>> >> As > >>>>> >> far as I can see, all AtMega devices use the same code for > managing > >>>>> >> serial ports so it should work out of the box. > >>>>> >> > >>>>> >> Someone somewhere in some forum on the net has or had a footer in > his > >>>>> >> posts saying something along the lines of that non-functional > serial > >>>>> >> port is 99% mismatching baud rates, might be worth a re-check. > >>>>> >> > >>>>> >> As for the PR itself (copying from e-mail mentioned at the > beginning) > >>>>> >> - > >>>>> >> I would recommend trying to use KEEP(*(.vectors)) as seen in > >>>>> >> boards/avr/avrdx/breadxavr/scripts/breadxavr.ld - the default > config > >>>>> >> should then not need the "# CONFIG_DEBUG_OPT_UNUSED_SECTIONS is > not > >>>>> >> set" > >>>>> >> line. > >>>>> >> > >>>>> > >>>> >