Blog:
Flash Health Monitoring on Torizon

Tuesday, September 6, 2022
Torizon
Introduction

Hey there, Leo here! Some years ago, I worked on an innovation project - by now archived - named Flash Analytics Tool. One of the project goals was to research and apply methods to estimate flash memory lifespan on Embedded Linux - more specifically, using Torizon in its early days - thus helping our customers develop flash-friendly applications and get an estimate of how long they could expect their devices to last in the field.

The Flash Analytics Tool innovation project

Back then, I learned that the e.MMC 5.0 standard provided a high-level overview of the flash health status, in increments of 10%, out-of-the-box. While this didn’t seem to meet all our requirements for lifetime estimation, it did provide a great overview of the flash health status.

Such data allows you, for example, to do preventive maintenance on a product before it fails in the field.

eMMC on a Verdin iMX8M Plus

Since then, Torizon evolved and received many interesting new features, including the Torizon Platform Services integration, which in turn, enabled device monitoring.

As of TorizonCore 5.7.0, mmc-utils is part of the TorizonCore distribution, which reminded me of the work done in Flash Analytics. Running mmc-utils in a container would normally be the preferred approach on Torizon, but since the tool now ships with the base OS, nothing stops us from using it right away from there.

Having all tools available, I decided to give it a try and monitor things in a device fleet using the Torizon Platform Services.

The Torizon Platform Services dashboard

It all led me to write this article, in which you are going to:
  • Learn how to read and monitor standard eMMC health data on TorizonCore.
  • Learn how to visualize it on a time series chart in the Torizon Platform Services.
A Bit of eMMC

First and foremost, no pun intended on this section’s title! That said, let’s recap some important concepts - at a high level - to get the most out of this article. Feel free to skip it, or come back as needed.

Flash Memory, Raw NAND and eMMC

There are mainly two base technologies behind flash: NAND and NOR. They are named after their respective architecture at the transistor level. Even though NAND costs less and is available in many high-capacity ICs, the software stack needed to use it is much more complex, because bad blocks, wear leveling, and error correction all have to be handled.

The eMMC standard, which stands for Embedded MultiMedia Card, combines raw NAND flash with a controller in a single package. It delegates the complexity of NAND operations to eMMC manufacturers and makes eMMC look similar to a hard disk from a functional standpoint:
  • You can execute read, write and erase operations
  • It is split into blocks
Raw NAND and eMMC

Internally, though, raw NAND operates a bit differently. It has:
  • Cells: the smallest division of a raw NAND, it may contain:
    • SLC: 1 bit per cell - smallest density, highest cost, highest reliability and highest lifespan
    • MLC: 2 bits per cell
    • TLC: 3 bits per cell
    • QLC: 4 bits per cell - the opposite of SLC: highest density, lowest cost, lowest reliability and lowest lifespan
    • It is important to state that MLC, TLC, and QLC can be configured to operate in pseudo-SLC mode, which increases lifespan and reliability at the tradeoff of reduced storage.
  • Pages: the smallest array of cells that can be accessed in a single read or program (switch bits from 1 to 0) operation.
  • Eraseblocks: the smallest array of pages that can be erased (switch bits from 0 to 1) in a single operation, usually around 512kB to 4MB.
Raw NAND structure from a logic perspective

Raw NAND technologies comparison

Last but not least, eMMC manufacturers overprovision the ICs with extra eraseblocks known as reserved blocks. They are not seen as additional storage and replace eraseblocks as they become bad, giving extended life to the devices. Often, some blocks become bad very early, much before expected, so the reserved blocks are also there to ensure the nominal capacity is met in an eMMC’s early days.

Flash Wear and eMMC

With time, eraseblocks wear out. When that happens to an eraseblock, you may still be able to read it, but you lose the ability to program and erase it. Such an eraseblock is called a bad block.

Since the eMMC controller knows how many blocks there are inside the IC and is capable of identifying and marking bad blocks, it is also able to calculate the percentage of the flash that is still available, and the percentage that is worn out. The result of this calculation is a set of standardized values that can be easily read from Linux user space, as this article explains.

The first thing you must know is that there are 3 standard health status fields present in the eMMC Extended CSD register:
  • Device lifetime estimation type A: lifetime estimation for eraseblocks configured as MLC - which is the default for the user area partition (where the OS and data are stored) of MLC eMMCs. If you don’t configure your device as pSLC, you will most likely see this value increase over time. Data is provided in steps of 10%:
    • For example, 0x02 means 10%-20% device lifetime reached.
  • Device lifetime estimation type B: lifetime estimation for eraseblocks configured as SLC - usually the boot area partition (where the bootloader is stored) blocks and those configured by the user in pSLC mode. Usually, the bootloader area is barely touched and you most likely won’t see this indicator value change significantly over the product lifespan. Data is provided in steps of 10%:
    • For example, 0x02 means 10%-20% device lifetime reached.
  • Pre EOL information: overall status for reserved blocks. This indicator signals that the eMMC lifespan is near its end. Possible values are:
    • 0x00 - Not defined.
    • 0x01 - Normal: consumed less than 80% of the reserved blocks.
    • 0x02 - Warning: consumed 80% of the reserved blocks.
    • 0x03 - Urgent: consumed 90% of the reserved blocks.
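
To make these raw values easier to read, here is a minimal sketch of two shell helpers that translate them into text. The function names are made up for this example. The mapping for the lifetime fields follows the 10% steps described above; the value 0x0B ("exceeded maximum estimated device lifetime") is not mentioned above and comes from my reading of the JEDEC eMMC specification, so treat it as an assumption and check it against your eMMC datasheet.

```shell
#!/bin/sh
# Hypothetical helpers, not part of mmc-utils or TorizonCore.

# Decode a "Device lifetime estimation type A/B" value such as 0x02.
decode_life_time() {
    value=$(( $1 ))    # shell arithmetic accepts hex input like 0x0a
    if [ "$value" -eq 0 ]; then
        echo "not defined"
    elif [ "$value" -ge 1 ] && [ "$value" -le 10 ]; then
        # 0x01 -> 0%-10%, 0x02 -> 10%-20%, ... 0x0A -> 90%-100%
        echo "$(( ($value - 1) * 10 ))%-$(( $value * 10 ))% device lifetime reached"
    elif [ "$value" -eq 11 ]; then
        # Assumption taken from the JEDEC spec, not from this article.
        echo "exceeded maximum estimated device lifetime"
    else
        echo "reserved"
    fi
}

# Decode a "Pre EOL information" value such as 0x01.
decode_pre_eol() {
    case $(( $1 )) in
        0) echo "not defined" ;;
        1) echo "normal" ;;
        2) echo "warning" ;;
        3) echo "urgent" ;;
        *) echo "reserved" ;;
    esac
}

decode_life_time 0x02
decode_pre_eol 0x01
```

The two calls at the end print "10%-20% device lifetime reached" and "normal", matching the examples in the list above.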

If you want to learn more about eMMC, read the articles Flash Memory Overview on Toradex Products, eMMC (Linux), or watch the webinar Flash Memory in Embedded Linux Systems.

Read the eMMC health status on TorizonCore

In the Toradex BSP, the default eMMC device is symlinked to /dev/emmc. This is quite convenient as it standardizes access across our entire range of SoMs. To read each of the aforementioned properties individually:

sudo mmc extcsd read /dev/emmc | grep -i EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A
sudo mmc extcsd read /dev/emmc | grep -i EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B
sudo mmc extcsd read /dev/emmc | grep -i EXT_CSD_PRE_EOL_INFO
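
If you only want the hex value rather than the whole line, a small filter can help. This is a sketch under one assumption: that the value is the last whitespace-separated field on the matching line. The helper name extcsd_value and the sample line are made up for illustration, so compare them with what your device actually prints.

```shell
#!/bin/sh
# Hypothetical helper: read `mmc extcsd read` output on stdin and print
# the last field of the line(s) matching the given field name.
extcsd_value() {
    grep -i "$1" | awk '{ print $NF }'
}

# On a device you would run:
#   sudo mmc extcsd read /dev/emmc | extcsd_value EXT_CSD_PRE_EOL_INFO

# Off-device demo with a sample line (format assumed for illustration):
sample='eMMC Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01'
printf '%s\n' "$sample" | extcsd_value EXT_CSD_PRE_EOL_INFO
```

With the sample line above, the demo prints 0x01.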
Monitor the eMMC health status on TorizonCore
Device monitoring in TorizonCore is enabled through Fluent Bit. Quoting the project website home page, Fluent Bit is:
An End to End Observability Pipeline

Fluent Bit is a super fast, lightweight, and highly scalable logging and metrics processor and forwarder. 
It is the preferred choice for cloud and containerized environments.

Monitoring functionality in this framework is described through input, filter, and output plugins. I won’t dive deep into Fluent Bit itself, as there is great documentation available on docs.fluentbit.io.

Specific to Torizon, you need to add one input and one filter entry to send custom data and make it readily available in the Platform Services. I also won’t go in-depth here, as we have a dedicated article about device monitoring in TorizonCore.

To use a JSON parser on Fluent Bit, we must consolidate the raw data into a JSON formatted string. The command below does that for us. It is actually a chain of commands, where the output of one command is piped as input to the next. The backslash at the end of each line is only there to allow us to break the big command into multiple lines, making it easier to read:
sudo mmc extcsd read /dev/emmc | \
grep -e EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A \
     -e EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B \
     -e EXT_CSD_PRE_EOL_INFO | \
rev | \
cut -c 1 | \
jq -R -c -s \
     'split("\n") | { "emmc_life_time_est_typ_a": .[0], "emmc_life_time_est_typ_b": .[1], "emmc_pre_eol_info": .[2] }'

To better understand it, I suggest you run it adding one step at a time: first, run mmc alone, then run mmc | grep, mmc | grep | rev, and so on.
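
If you don't have a device at hand, you can also walk through the later stages on any Linux machine. The sketch below replaces mmc | grep with three sample lines; their wording is an assumption made for illustration, and the only thing that matters is that each line ends with the hex value. The jq stage is skipped when jq is not installed.

```shell
#!/bin/sh
# Sample lines standing in for the output of `mmc ... | grep ...`.
# The wording is assumed; only the trailing hex value matters here.
sample='Life Time Estimation A [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A]: 0x01
Life Time Estimation B [EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B]: 0x02
Pre EOL information [EXT_CSD_PRE_EOL_INFO]: 0x01'

# rev mirrors every line, so the last hex digit becomes the first character.
reversed=$(printf '%s\n' "$sample" | rev)

# cut keeps only that first character of each line.
digits=$(printf '%s\n' "$reversed" | cut -c 1)
printf '%s\n' "$digits"

# jq reads the three lines as one raw string, splits it on newlines and
# builds a single compact JSON object from the first three entries.
if command -v jq >/dev/null 2>&1; then
    printf '%s\n' "$digits" | jq -R -c -s \
        'split("\n") | { "emmc_life_time_est_typ_a": .[0], "emmc_life_time_est_typ_b": .[1], "emmc_pre_eol_info": .[2] }'
fi
```

One thing this makes visible: rev | cut -c 1 keeps only the last hex digit, as a string. A reading of 0x0a would therefore arrive as "a", which is worth keeping in mind when you configure the chart.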

Once you have verified that the command above outputs a JSON formatted string with the data we’re interested in, let’s write the input and filter, to be appended to /etc/fluent-bit/fluent-bit.conf:
[INPUT]
    Name          exec
    Tag           emmc_health
    Command       mmc extcsd read /dev/emmc | grep -e EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_A -e EXT_CSD_DEVICE_LIFE_TIME_EST_TYP_B -e EXT_CSD_PRE_EOL_INFO | rev | cut -c 1 | jq -R -c -s 'split("\n") | { "emmc_life_time_est_typ_a": .[0], "emmc_life_time_est_typ_b": .[1], "emmc_pre_eol_info": .[2] }'
    Parser        json
    Interval_Sec  300

[FILTER]
    Name       nest
    Match      emmc_health
    Operation  nest
    Wildcard   *
    Nest_under custom
Once the configuration is done, restart the service to apply the changes:
sudo systemctl restart fluent-bit
Some notes:
  • Because Fluent Bit is a daemon and runs as root, there is no need to use sudo in the configuration file.
  • To be shown in the Platform Services, data must necessarily be nested under custom. This is specific to Torizon and how we implement it on the server side.
  • The time interval of 300 seconds (5 minutes) is much shorter than needed for measuring flash health in increments of 10%, as provided by the eMMC standard. The ideal interval depends on how much your application writes. In a very simple test I ran to wear the flash as fast as possible, it took roughly half a day to see a 10% increment. Especially for devices on a cellular connection, it makes sense to increase the interval to save bandwidth.
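
To put the last note in numbers, here is a small sketch that computes how many readings a given Interval_Sec produces. The function name and the hourly interval are hypothetical, chosen only for comparison.

```shell
#!/bin/sh
# Hypothetical helper: readings per day for a given Interval_Sec.
samples_per_day() {
    echo $(( 86400 / $1 ))
}

# Readings within half a day, the time one 10% step took in my stress test.
samples_per_half_day() {
    echo $(( 43200 / $1 ))
}

samples_per_day 300        # the interval used in the configuration above
samples_per_half_day 300   # readings sent during a single 10% step
samples_per_day 3600       # an hourly interval, for comparison
```

At 300 seconds the device sends 288 readings a day, and even in the worst-case stress test 144 of them fall within a single 10% step. An hourly interval brings that down to 24 readings a day.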

You can also find extra inspiration in another example, the Disk usage custom metric.

Create a custom TorizonCore image with eMMC monitoring

You might be wondering how to replicate this configuration without doing it manually over and over again, for hundreds or maybe thousands of devices.

To answer that question, use the TorizonCore Builder tool to capture the Fluent Bit changes and create a custom TorizonCore image. This workflow allows you to install the custom TorizonCore during production programming with Toradex Easy Installer and send updates to devices already deployed in the field.

Display the eMMC health status in the Torizon Platform Services

Before switching our focus from TorizonCore to the Torizon Platform Services, it is a good time to go grab a cup of coffee or have a look at your social network feeds. It might take a few minutes until data is sent to and shows up on the platform.

Log in to your app.torizon.io account, provision your device - if not provisioned yet - and then:
  • Either: in the device management section → select a device → click the action View Detail → make sure you are in the device information tab:
Device information tab selected

Fleet overview tab selected

No matter which you choose, the option to customize metrics will be presented to you. It should be intuitive to add and configure a chart, so I won’t describe it step-by-step. Here is my eMMC Health chart configuration:

eMMC Health chart creation and customization

And it’s all set! All you have to do is wait for data points to arrive at the platform over time. In my setup, I’ve used a fleet with 8 devices:

A fleet with 8 devices provisioned to the Platform Services

The eMMC Health chart for each device in the fleet

In the chart above, we can see that the device BR-J-08 has an unusually high health degradation. Looking at it more closely:

A device with unusual eMMC health degradation

The affected device BR-J-08 ran a script over the weekend that wears out the flash. I don’t recommend you do it on your device, at all. In any case, you can find the script in Appendix I - wearing the flash on purpose.

To learn more about device monitoring, watch our webinar on-demand Secure Device Monitoring - Check Health, Resources and Performance and read the article Device monitoring in TorizonCore.

Conclusion
Here are some key takeaways from this little experiment:
  • TorizonCore now includes mmc-utils, so you can use it out-of-the-box for device monitoring (even though you could already do it in a container before).
  • Sending any metric you can read on the device to the Platform Services is easy. You are not constrained to the defaults.
  • Visualizing fleet and device time series data is very easy and powerful. You can identify anomalies in a timely manner and fix them before the device fails.
  • I can imagine a bright future when we implement device monitoring alarms. Specifically for eMMC, it would be so convenient to set an alarm on the Pre EOL information.

Given all of that, I'd love to hear about your experiences and learn what is important to you, what you like, and what you think is missing.
See you on my next blog, bye!

Appendix I - wearing the flash on purpose
Think carefully before wearing the flash on purpose, as it will reduce the life of your device!

Passively watching the flash wear takes a really long time. Here is the super simple script I left running over the weekend to wear the flash quickly:

while true; do 
	dd if=/dev/urandom of=/home/torizon/testfile bs=4096 count=250000
	sync
	rm /home/torizon/testfile
	sync
done
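
To get a feel for how much data each pass of the loop writes, here is the arithmetic as a short sketch. It only multiplies the dd parameters from the script above and doesn't touch the flash.

```shell
#!/bin/sh
# Same parameters as the dd call in the wear script.
bs=4096
count=250000

# Bytes written to the test file in one loop iteration.
bytes=$(( bs * count ))
echo "$bytes bytes per iteration"

# The same amount in MiB (integer division, rounded down).
echo "$(( bytes / 1048576 )) MiB per iteration"
```

That is roughly 1 GB per iteration, so make sure the partition holding /home/torizon has that much free space before trying anything like this.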
Author: Leonardo Veiga, TorizonCore Product Owner, Toradex
