.NET CF Initialization Error With VF61, WEC2013 and CF 3.9

We are seeing problems with devices that suddenly fail to start .NET applications.

The Error Message is:

.NET Initialization Error

The application failed to load required components. If the .NET Compact Framework is installed on a storage card, please ensure that this card is in place and launch the application again. If this fails, a re-installation of the .Net Compact Framework is recommended. Support Info: -2147483628 (80000014)

I tried re-installing but that failed as some of the files are in use and cannot be overwritten.

In my search I found this article. I tried flashing a fresh registry in case it had been damaged by a floating RX. It did not help. I am still waiting on an answer from our carrier board developer about that.

Reflashing the filesystem did fix the problem on this device, so I assume that the filesystem somehow got damaged. Flashing the damaged filesystem image back onto the device reproduced the failure, as expected.

We never saw this type of problem with V1.5 of the Vybrid image. Now, after switching to V1.7, I have had three defects within one month.

How can this type of error happen? What can I do to prevent it? What else can I check? I need urgent help on this topic as this is already running in production.

I have since been told that the RX lines of both UARTs are definitely NOT floating.

Hi @Troubadix75 ,

It really looks like some .NET files got corrupted or deleted. Could you try to compare all files from the “bad” backup to the files in the “good” backup? (by copying them out to a USB Stick and then comparing them on a PC with a diff tool).
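The comparison suggested above can also be scripted on the PC instead of using an interactive diff tool. The following is only a minimal sketch: it assumes both backups have been copied into two local folders, and the function names and folder names are made up for illustration.

```python
import filecmp
import os


def list_files(root):
    """Return the set of file paths below root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def compare_backups(good_root, bad_root):
    """Compare two backup trees byte by byte.

    Returns (changed, missing, extra) as sorted lists of relative paths:
    changed = present in both trees but with different content,
    missing = only in the good tree, extra = only in the bad tree.
    """
    good = list_files(good_root)
    bad = list_files(bad_root)
    changed = sorted(
        rel
        for rel in good & bad
        # shallow=False forces a content comparison instead of
        # trusting file size and modification time.
        if not filecmp.cmp(
            os.path.join(good_root, rel),
            os.path.join(bad_root, rel),
            shallow=False,
        )
    )
    return changed, sorted(good - bad), sorted(bad - good)
```

Calling `compare_backups("good_backup", "bad_backup")` (hypothetical folder names) returns the three lists, which can then be printed or attached to a support request.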

It’s very strange that you see these issues in 1.7 and not in 1.5, as we improved the flash error correction in 1.7. But maybe the reason is a different one…

Once we see exactly what the corruption looks like (which file, where in the file, and what values), I can maybe say more about where it could come from.

Do you write a lot on the FlashDisk during device operation?

Can you try to execute this bootloader command on one of the affected devices (and send me the output)?

dumpfshealth 0 4096

Hi @germano.tx ,

thank you for the reply.

I got the differences between the two filesystem states. All differences are in binary files. I think the critical file for the fault we see is “GAC_mscorlib_v3_9_0_0_cneutral_1.dll”. I’ll attach both versions of this file.


Translations
Binärdateien sind unterschiedlich = Binary files are different
Ordner sind unterschiedlich = Folders are different

The bitmap file is not used, but it does show pixel errors.

And here is the dump you asked for.

We are writing log files to an SD card. Depending on what is happening and how many messages we have to display, this can be quite a lot. But that should not affect the FlashDisk.

We are also writing some values to an XML file on the FlashDisk. This happens automatically every 10 minutes and on some user interactions. The machines that have failed tend to have a lot of these interactions, so it is possible that we write quite a lot, but it is impossible to tell how often. I would guess every few seconds over a period of maybe 15 minutes, and then only the automatic saving for a couple of hours.

But the file writing has not changed in years, and we have only now started seeing these errors, on quite new machines. This software (in variations) has been running on thousands of devices (PXA270 and VF61) for many years, many of them 24/7. And I do not know of a single FlashDisk failure up until this summer.
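To put rough numbers on the write pattern described above, here is a back-of-the-envelope sketch. The 10-minute automatic save is from the post; "every few seconds" is assumed to mean every 3 seconds and "a couple of hours" is assumed to mean 2 hours, so the result is only an order-of-magnitude estimate.

```python
def estimate_flash_writes(burst_minutes, burst_interval_s, quiet_hours,
                          auto_interval_min=10):
    """Estimate XML file writes for one burst phase plus one quiet phase.

    burst_minutes / burst_interval_s describe the user-interaction burst,
    quiet_hours the following phase with only the automatic save.
    Automatic saves that fall inside the burst are ignored for simplicity.
    """
    burst_writes = (burst_minutes * 60) // burst_interval_s
    auto_writes = (quiet_hours * 60) // auto_interval_min
    return burst_writes, auto_writes


# Assumed values: 15 min burst, one write every 3 s, then 2 quiet hours.
burst, auto = estimate_flash_writes(burst_minutes=15, burst_interval_s=3,
                                    quiet_hours=2)
print(burst, auto)
```

Under these assumptions a single interaction burst produces far more writes than the automatic saving over the following hours, so the bursts dominate the write load on the FlashDisk.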

Hi @Troubadix75 ,

Looking at the differences between the OK and BAD versions, I see that the bad parts of the file are always aligned to 2k (the page size of the NAND).

This is an indication that there is something wrong with the Logical to Physical mapping of Pages.
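The page-alignment check described above can be reproduced on a PC with a few lines of Python. This is a sketch under assumptions: the 2048-byte page size is taken from the post, the function name and file names are hypothetical, and the interpretation of the counts is a heuristic rather than a definitive diagnosis.

```python
PAGE_SIZE = 2048  # NAND page size mentioned in the thread ("2k")


def page_diff_summary(good, bad, page_size=PAGE_SIZE):
    """Return a list of (page_index, differing_byte_count) tuples.

    Only pages whose contents differ are listed. Pages are counted from
    the start of the file, which assumes the file data starts on a page
    boundary in flash.
    """
    summary = []
    for start in range(0, max(len(good), len(bad)), page_size):
        g = good[start:start + page_size]
        b = bad[start:start + page_size]
        if g != b:
            # Count mismatching bytes; a length difference counts as well.
            changed = sum(x != y for x, y in zip(g, b)) + abs(len(g) - len(b))
            summary.append((start // page_size, changed))
    return summary


# Example usage with two copies of the same file (hypothetical names):
# with open("good.dll", "rb") as f_good, open("bad.dll", "rb") as f_bad:
#     for page, changed in page_diff_summary(f_good.read(), f_bad.read()):
#         print(f"page {page}: {changed} differing bytes")
```

A page in which almost every byte differs suggests that the content of a different page was returned, which would fit a mapping problem. A page with only one or two differing bytes looks more like isolated bit flips in the data itself.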

We had this issue in the BSP before 1.7, which is why we added error correction to the spare area (which contains the logical page number associated with each physical page).

The only thing I can imagine is that the modules that exhibit the issue have been upgraded to BSP 1.7, but the filesystem has not been rewritten, leaving the spare area without ECC (it’s only added on rewrite, so if the .NET files have not been rewritten with BSP 1.7, the issue could still happen). Is this maybe the case?

Hi @germano.tx ,

these Boards have been updated from V1.5 to V1.7 and I did try to only update the OS Image and Bootloader. But that failed right away. So I switched to flashing them completely (OS, Bootloader, FS, Config). I use an automated script to do that.

echo Clear Registry...
"%tempdir%\update" /cu

echo Write Bootloader
"%tempdir%\update" /u bootloader,raw,%tempdir%\eboot.img

echo Write Operating System
"%tempdir%\update" /u os,bin,%tempdir%\os.bin

echo Write Registry
"%tempdir%\update" /u registry,raw,%tempdir%\reg.ivr

echo Write Filesystem
"%tempdir%\update" /u filesystem,raw,%tempdir%\fs.ivf

echo Write Splashscreen
"%tempdir%\update" /u splashscreen,raw,%tempdir%\ss.dat

echo Write configblock
"%tempdir%\update" /u configblock,raw,%tempdir%\configblock.cfg

I must confess that I am not 100% sure that I really did the complete flashing on ALL the devices I updated. As I said, I did try the shortcut. But I am 95% sure that it failed right on the first devices and that I had to recover them by reflashing everything.

Could the problem be caused by the source of the FS image? I am not 100% sure, but I think I updated one device the “critical” way (only FS and bootloader worked on my test device), added some files, and then took the filesystem image from that device. The device did not show any errors at that point, and many devices have been flashed with this filesystem image without problems so far.

Hi @Troubadix75 ,

Can you give more details on what failed when you tried to update only the bootloader and image?

As for the “complete” update: that also includes the bootloader and image, so I don’t understand why they should work in that procedure but not when only the two of them are updated.

There is something else that is very important to understand:
If you start with image 1.5 and execute the “complete” flashing as you describe above (I assume it’s a batch file that does everything you mention and then reboots, right?), then you end up with a filesystem that is still written without spare area ECC, because image 1.5 is still running while the script executes; 1.7 only runs after the reboot!

So to make sure everything is written with the spare area ECC, you should rewrite the filesystem after booting into 1.7.

Hi @germano.tx ,

I can’t recall what exactly failed after updating just the Bootloader & Image. I think our Application refused to start, but I am absolutely not sure.

You are right, above is a snippet from the batch file we use to set up all our devices. It flashes everything in the order you see above and then reboots. So yes, it all happens on the old (V1.5) Image. On the affected machines it was definitely this way.

Does it make a difference which UpdateTool we use? Currently our batch file brings its own UpdateTool (from image V1.5).

I am trying to find out which image comes preinstalled on our new devices, to see whether we have a general problem here or whether we are safe in production. What does Toradex put on the modules before delivering them?

I understand: if we reflash the filesystem while running image V1.5, the spare area ECC is not written. Thus we are still vulnerable to bit flips.
The question I have is: is the risk higher than, or the same as, running image V1.5 (not updating to V1.7)? That is important to know because I have to decide whether we need to recall devices/machines that may have been flashed the wrong way.

On the systems that are running right now but have no spare ECC: Can we fix it by making a backup and restore of the filesystem with the updatetool provided in the Image?

Hi @germano.tx ,

do you have answers to my questions above? It’s kinda urgent.

Thanks.

Hi @Troubadix75 ,

Sorry for the delay, I was quite busy…

UpdateTool Version to use: In general newer UpdateTool is always better as it might fix some issues, but if you did not see any issues with the one included in 1.5 that’s ok too.

Preinstalled Version: You should not rely on getting a specific version… we update the image preinstalled by the production tester from time to time (but only when a stable version is released). Most of the modules you got will probably still have 1.5 or 1.6 preinstalled, but at some point you will get 1.7 preinstalled.

Using 1.7 without rewriting filesystem: The risk of using 1.7 without rewriting the filesystem is lower than using 1.5, because the ECC is added at least whenever something new is written or something old is rewritten. Only the files that are only ever read remain at risk of bit flips over time.

Fixing current filesystem: Yes, you can fix it by doing a backup and restoring it, but if there are already corrupted files those will be restored wrong too. What you could also do is to overwrite all “read only” files (like .net binaries or other executables/dlls) with a known good version (from USB Stick) as this will rewrite them with ECC.

Hi @germano.tx ,

thank you for the answer and please forgive me if I am a little pushy. The holidays come as a surprise again this year.

I am still a little concerned why we got these failures just now as they could have happened on V1.5 as well if I understand you correctly. But since you seem to be sure that it was the bitflip I will accept that answer.

Our next step will be to improve the imaging in production to make sure the ECC is added.

Hi @germano.tx ,

while searching for ways to improve my production programming script, I looked into the Toradex production programming template (found here).
The Toradex script has the same flaw as my own script: it does not reboot between writing the OS image and writing the filesystem. I thought you should know so someone can fix it.

REM ################################################################################
REM Configuration Section

REM Path to UpdateTool.exe (relative to this batch file)
set PATH_UPDATETOOL=..\UpdateTool

REM Configuration Section end
REM ################################################################################


ECHO Toradex Production Programming
ECHO ******************************
REM First clear registry
"%PATH_UPDATETOOL%\updatetool.exe" /cu 
REM Update bootloader 
"%PATH_UPDATETOOL%\updatetool.exe" /u bootloader,raw,\SD Card\backup\eboot.img
REM Update image
"%PATH_UPDATETOOL%\updatetool.exe" /u os,bin,\SD Card\backup\nk.bin
REM Update Config block
"%PATH_UPDATETOOL%\updatetool.exe" /u configblock,raw,\SD Card\backup\ConfigBlock.cfg
REM Update Registry
"%PATH_UPDATETOOL%\updatetool.exe" /u registry,raw,\SD Card\backup\Registry.ivr
REM Update File System
"%PATH_UPDATETOOL%\updatetool.exe" /u filesystem,raw,\SD Card\backup\FileSystem.ivf
ECHO Production programming done.
ECHO *******************************

Hi @Troubadix75 ,

In general this script is fine; it can only be a problem in the transition from the old to the new ECC writing. Also, there is no universal way to write a script that reboots in between, so it’s impossible to provide a general script that will work for everybody.

One other option you have is to run it twice (with a reboot in between).

Hi @germano.tx ,

unfortunately I have to unaccept your answer in this case and ask you for some help again.

You explained to me that all files that have been overwritten while BSP V1.7 is running are safe from this kind of error. But now I have had two confirmed cases where a file that was uploaded to the device via FTP while BSP V1.7 was already running, and which worked perfectly for a while, has suddenly become corrupted. In one case it is an executable file (.exe) that has definitely not been touched otherwise (so this is definitely not a power-off-while-writing issue).

I cannot confirm whether the production flashing still had the issue where the filesystem was flashed while an older BSP was running. But this file was definitely overwritten later, as it is a customized version of the original.

Since the executable is a .Net assembly I cannot just upload it here for everyone to see.

Hi,

I really need some support here.

As stated in this comment, I now have corrupt files that were definitely written while the V1.7 OS was running. These files have also not been touched since they were copied to the device, so a power failure while writing is ruled out. Replacing the file brought the system back for now. But something must be wrong, and I need to find out what it is and how to correct it permanently. I keep getting failing devices all over the world.

According to what @germano.tx wrote in this thread ECC should have prevented these failures.

We have never had a single corrupted file on the older V1.5 OS. At the moment I am thinking about just switching back to V1.5, because to me it looks like a bug in the newer OS. Either that, or the quality of the flash got a lot worse last year.

Hi, @Troubadix75 .
Do you have any updates on your problem with OS V1.7 or V1.5?
We have the very same problem with corrupt data on the VF61 with OS V1.5 and are now thinking about trying the new V1.7.
For the statistics: we have shipped 530 devices, and 107 of them have come back from customers with corrupt data in flash (bootloader or OS).

Hi @danluck ,

we never saw this error while we were still using V1.5.
We only ran into trouble after switching to V1.7.
It seems to be solved now (via email support). Apparently the SMI size in use (Sector Metadata Information; a new, better one was introduced with V1.7) is detected incorrectly if the flash memory is not completely empty (all zero). To fix that, we received a new version of the UpdateTool (V8.2) to erase the flash, plus a registry entry that explicitly sets the SMI size.

Since you are seeing the problem on V1.5, it cannot be the same root cause. V1.5 only supports SMI8 and thus has no detection logic for switching to SMI16.