Hi,
Yes, before we claim to have a power-cut-resilient FS, it has to be
tested on real hardware.
It will also be an opportunity to actually fire-proof littlefs, because I
have no idea whether that was ever done before.
We suggested the idea in a littlefs GitHub issue but it got nowhere IIRC.
Sebastien
On 13/09/2024 15:20, Alan C. Assis wrote:
Hi Tomek,
I think it is possible to modify the NAND Simulator to help test
partial writes and induced errors.
But I agree with Sebastien that it is better to test on real hardware.
Since MnemoFS already works in the simulator, we can get the
initial version working on any flash, even SPI NOR Flash.
BR,
Alan
On Fri, Sep 13, 2024 at 10:05 AM Tomek CEDRO <to...@cedro.info> wrote:
very nice discussion!
can problems with NAND create a semi-deterministic group of well-defined
issues with random characteristics?
if so then we could create a model nand for sim with controllable errors in
order to verify various nand drivers, filesystems, etc?
:-)
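
A minimal Python sketch of what such a model NAND with controllable errors could look like. It is not the API of the NuttX NAND simulator: the class `FaultyNand`, its parameters, and the tiny page geometry are all hypothetical, chosen only to illustrate seeded (repeatable) fault injection.

```python
import random

PAGE_SIZE = 16        # bytes per page (tiny, for illustration only)
PAGES_PER_BLOCK = 4   # assumed geometry


class FaultyNand:
    """Toy NAND model with seeded, controllable fault injection (hypothetical)."""

    def __init__(self, blocks, seed=0, read_flip_prob=0.0, bad_blocks=()):
        self.rng = random.Random(seed)        # same seed -> same fault sequence
        self.read_flip_prob = read_flip_prob  # chance per read of one flipped bit
        self.bad_blocks = set(bad_blocks)     # blocks whose erase/program fail
        self.cut_after_bytes = None           # if set, programs stop after N bytes
        self.pages = [bytearray(b"\xff" * PAGE_SIZE)
                      for _ in range(blocks * PAGES_PER_BLOCK)]

    def erase(self, block):
        if block in self.bad_blocks:
            raise IOError(f"erase failed on bad block {block}")
        first = block * PAGES_PER_BLOCK
        for page in range(first, first + PAGES_PER_BLOCK):
            self.pages[page][:] = b"\xff" * PAGE_SIZE

    def program(self, page, data):
        """Program a page; returns False if the write was cut short."""
        if len(data) != PAGE_SIZE:
            raise ValueError("data must be exactly one page")
        if page // PAGES_PER_BLOCK in self.bad_blocks:
            raise IOError(f"program failed on bad block {page // PAGES_PER_BLOCK}")
        limit = PAGE_SIZE
        if self.cut_after_bytes is not None:
            limit = min(self.cut_after_bytes, PAGE_SIZE)
        for i in range(limit):
            self.pages[page][i] &= data[i]    # programming can only clear bits
        return limit == PAGE_SIZE

    def read(self, page):
        data = bytearray(self.pages[page])
        if self.rng.random() < self.read_flip_prob:
            bit = self.rng.randrange(PAGE_SIZE * 8)
            data[bit // 8] ^= 1 << (bit % 8)  # inject a single-bit read error
        return bytes(data)
```

Because the fault sequence depends only on the seed, a failing filesystem test could be replayed exactly by reusing the same seed and fault settings.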
--
CeDeROM, SQ7MHZ, http://www.tomek.cedro.info
On Fri, Sep 13, 2024, 14:52 Alan C. Assis <acas...@gmail.com> wrote:
> Hi Sebastien,
>
> Thank you for your helpful considerations.
>
> As I explained before, he used the SIM NAND Simulator that he created and
> integrated on NuttX.
>
> Also, as I explained in my previous email, we need help to test on real
> hardware.
>
> Since you have previous experience with NAND Flash, maybe you could help
> here (of course, if you are interested in helping)!
>
> First we need to create a driver for a SPI NAND Flash (I bought this model:
> https://aliexpress.com/item/1005005307786079.html) and use it with
> MnemoFS.
>
> This model that I selected has internal error detection, etc., which means
> we don't need to worry about taking care of bad blocks ourselves.
>
> If you look inside nuttx/drivers/mtd/, many of the pieces we need are
> already there; we just need to understand how to use SPI NAND, FTL, MTD,
> etc.
>
> Xiang, since you and your team ported YAFFS to NuttX, maybe you guys could
> help us to get MnemoFS working on real flash on NuttX.
>
> BR,
>
> Alan
>
>
>
> On Fri, Sep 13, 2024 at 6:17 AM Sebastien Lorquet <sebast...@lorquet.fr>
> wrote:
>
> > Hello
> >
> >
> > This is quite a complete report with a lot of detail, which shows that
> > you have put a large amount of mental energy into this project, so
> > congratulations and thank you.
> >
> > What I'm about to write is not a criticism but a complement that may
> > interest you.
> >
> >
> > Since I've worked with critical flash systems for more than 10 years
> > now, I have read the part of your document that deals with power loss
> > with great interest.
> >
> > Resilience to power loss is *absolutely critical* to any embedded
> > filesystem.
> >
> >
> > Did you do power interruption tests on your code? Can you guarantee that
> > the device format stays consistent/recoverable when the power is cut at
> > any code location? Did you identify power-critical code sections (critical
> > with respect to power cuts, not CPU access)?
> >
> > Remember, if it's not tested, it doesn't work...
> >
> >
> > The most critical part of your work is the journal. Do you make sure
> > that the checksum is written (1) last and (2) completely? How do you make
> > sure that the journal entries are correctly applied to their final
> > storage locations?
> >
> > The largest problem in that area is flash metastability. The checksum
> > MIGHT appear correct on one read, but not correct at the next access.
> > The reason for this is the analog nature of flash writes (and erases),
> > which inject a number of electrons into a floating gate. 0 and 1 bits are
> > separated by thresholds, but these thresholds vary with temperature and
> > time (wear), so a bit that sits just at the threshold might appear
> > correct, but the next access will result in a flipped bit.
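
The threshold effect described above can be illustrated with a toy Python model. This is only a sketch: the normalized charge scale, the `THRESHOLD` value, and the uniform sensing noise are assumptions made for illustration, not measurements of any real device.

```python
import random

THRESHOLD = 0.5  # assumed normalized charge separating programmed from erased


def read_bit(charge, rng, noise=0.05):
    """Sense one cell: programmed cells (high charge) read 0, erased cells read 1."""
    sensed = charge + rng.uniform(-noise, noise)  # stand-in for temperature/wear drift
    return 0 if sensed > THRESHOLD else 1


def read_many(charge, count, seed=0):
    """Read the same cell repeatedly with a seeded noise source."""
    rng = random.Random(seed)
    return [read_bit(charge, rng) for _ in range(count)]


fully_programmed = read_many(0.9, 100)   # far above the threshold: stable 0
fully_erased = read_many(0.1, 100)       # far below the threshold: stable 1
interrupted = read_many(0.5, 1000)       # write cut short: sits at the threshold
```

In this model the interrupted cell returns a mix of 0s and 1s across reads, which is why a checksum that verifies once cannot be trusted to verify again.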
> >
> > These issues are NOT theoretical. They happen all the time in all flash
> > devices; you just have to tickle the devices often enough at the right
> > moment to begin to see them.
> >
> > These tests require the ability to fully cut the power to a test board
> > with microsecond precision. No need for pulses, just an adjustable
> > delay. The test is triggered by a command that also starts a countdown,
> > and the timeout is increased microsecond by microsecond until you reach
> > the point where the flash is actually written. Usually, there is a point
> > where timeouts result in partial writes. Then the board will start
> > acting funny and will start entering the error branches that are usually
> > never taken. Board capacitors are not a problem, they just increase the
> > delays. They always discharge the same way during all repeated tests, so
> > they have no influence on the process.
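
The sweep described above can be sketched in Python against a simulated device. Everything here is a stand-in: the byte-write `budget` replaces the microsecond delay, and `update`, `recover`, and the one-byte `COMMIT` marker are hypothetical placeholders for the operation under test and the filesystem's mount-time recovery.

```python
OLD, NEW = b"old-data", b"new-data"
COMMIT = 0xA5  # hypothetical commit marker written after the payload


def update(flash, payload, budget):
    """Write payload, then the commit marker; power is 'cut' after `budget` byte writes."""
    writes = 0
    for i, byte in enumerate(payload):
        if writes == budget:
            return False          # power cut in the middle of the payload
        flash[i] = byte
        writes += 1
    if writes == budget:
        return False              # power cut just before the commit marker
    flash[len(payload)] = COMMIT
    return True


def recover(flash, previous):
    """After 'reboot': accept the new record only if the commit marker is present."""
    size = len(previous)
    return bytes(flash[:size]) if flash[size] == COMMIT else previous


def sweep():
    """Increase the cut delay step by step and check consistency after every cut."""
    results = []
    for delay in range(len(NEW) + 2):
        flash = bytearray(b"\xff" * (len(NEW) + 1))   # freshly erased staging area
        completed = update(flash, NEW, delay)
        visible = recover(flash, OLD)
        if visible not in (OLD, NEW):
            raise RuntimeError(f"inconsistent state at delay {delay}")
        results.append((delay, completed, visible))
    return results
```

On real hardware the loop body would arm the power switch, send the trigger command, power the board back up, and run the consistency check; the structure of the sweep stays the same.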
> >
> > It is quite hard to make sure that everything is correct, but a
> > sufficient amount of dedication is required to be aware of the potential
> > problems.
> >
> > How do you know in your filesystem that the checksum has been written
> > only after all the previous data are written? How do you know the
> > checksum write is complete? There are software techniques for this. This
> > also requires the flash to support overwrites, so making this work with
> > ECC is harder (but possible).
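
One common software technique of this kind, sketched in Python under stated assumptions: the payload is programmed and read back, then the checksum, and finally a "done" flag is programmed on top of the already-written region. The entry layout, the names `commit_entry` and `entry_is_valid`, and the flag value are hypothetical, and this is not necessarily the technique meant above or the one mnemofs uses.

```python
import zlib

DONE = 0x00  # flag value programmed last; 0xFF -> 0x00 only clears bits


def program(flash, offset, data):
    """Flash-style program: bits can only go from 1 to 0 without an erase."""
    for i, byte in enumerate(data):
        flash[offset + i] &= byte


def commit_entry(flash, payload):
    """Write payload, then CRC, then the done flag, reading back after each step."""
    size = len(payload)
    crc = zlib.crc32(payload).to_bytes(4, "little")
    program(flash, 0, payload)
    if bytes(flash[:size]) != payload:
        return False
    program(flash, size, crc)                # checksum only after verified data
    if bytes(flash[size:size + 4]) != crc:
        return False
    program(flash, size + 4, bytes([DONE]))  # second program on a written region
    return flash[size + 4] == DONE


def entry_is_valid(flash, size):
    """An entry counts only if the done flag is set AND the CRC matches."""
    payload = bytes(flash[:size])
    stored = bytes(flash[size:size + 4])
    expected = zlib.crc32(payload).to_bytes(4, "little")
    return flash[size + 4] == DONE and stored == expected
```

The final flag write is a second program operation on an area that already holds data, which is where overwrite support comes in; if that area is covered by page-level ECC, the ECC bytes would no longer match after the overwrite, so the flag needs separate handling.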
> >
> > Fine details absolutely matter here.
> >
> > Thanks,
> >
> > Sebastien
> >
> >
> > On 12/09/2024 17:48, Saurav Pal wrote:
> > > Hi all,
> > >
> > > Here's my final report
> > > <https://resyfer.github.io/blogs/mnemofs/endeval/>
> > > on mnemofs, a NAND flash file system for NuttX, on which I worked during
> > > my tenure as a GSoC 2024 Contributor for ASF. I would be grateful for any
> > > suggestions and criticism.
> > >
> > > Best regards,
> > > Saurav Pal.
> > >
> >
>