CAN problem on Verdin i.MX8M Mini

swisstinu · January 25, 2021, 3:58am

Hi

We detected a problem when using the CAN-FD bus on a Verdin module mounted on a Dahlia carrier board.
When running candump on the Verdin and receiving frames of different lengths, some bytes are sometimes “wrong” and look like a copy of a byte from another position in the frame.

Monitoring the traffic on the CAN bus using an IXXAT USB converter does not show wrong bytes.
There are no Error Frames generated on the bus.
We tried different bit rates, both with and without BRS, and all of them show this behavior.

Any help in fixing this issue is much appreciated.

Here is an example:

(005.300871) can0 RX - - 123 [2] C0 01

(005.404941) can0 RX - - 123 [3] C0 01 02

(005.511296) can0 RX - - 123 [4] C0 01 02 C0

(005.614543) can0 RX - - 123 [5] C0 01 02 03 04

(005.715990) can0 RX - - 123 [6] C0 01 02 03 04 04

(005.820872) can0 RX - - 123 [2] C0 01

(005.922946) can0 RX - - 123 [3] C0 01 02

(006.030066) can0 RX - - 123 [4] C0 01 02 C0

(006.134619) can0 RX - - 123 [5] C0 01 02 03 04

(006.238398) can0 RX - - 123 [6] C0 01 02 03 04 04

And the data monitored by IXXAT is this:

(005.301871) can0 RX - - 123 [2] C0 01

(005.405941) can0 RX - - 123 [3] C0 01 02

(005.512296) can0 RX - - 123 [4] C0 01 02 03

(005.615543) can0 RX - - 123 [5] C0 01 02 03 04

(005.716990) can0 RX - - 123 [6] C0 01 02 03 04 05

(005.821372) can0 RX - - 123 [2] C0 01

(005.924046) can0 RX - - 123 [3] C0 01 02

(006.031066) can0 RX - - 123 [4] C0 01 02 03

(006.135619) can0 RX - - 123 [5] C0 01 02 03 04

(006.239398) can0 RX - - 123 [6] C0 01 02 03 04 05
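
A minimal sketch of how the two captures above could be compared automatically, assuming both were saved as text files in the format shown (timestamp, interface, ID, `[len]`, payload bytes). The function name `compare_logs` and the file names in the usage line are hypothetical; the comparison ignores timestamps and blank lines and looks only at the `[len]` field and the payload.

```shell
#!/bin/sh
# Compare two frame logs frame by frame, ignoring timestamps.
# $1: log captured on the Verdin, $2: reference log from the bus monitor
# (hypothetical helper, not part of can-utils).
compare_logs() {
    awk '
        # Reduce a log line to "[len] payload" by cutting at the first "[".
        function payload(line,    i) {
            i = index(line, "[")
            return i ? substr(line, i) : line
        }
        FNR == 1 { n = 0 }            # restart the frame counter for each file
        /^[ \t]*$/ { next }           # skip blank lines
        { n++ }
        FILENAME == ARGV[1] { got[n] = payload($0); next }
        {
            want = payload($0)
            if (got[n] != want)
                printf "frame %d: got %s, expected %s\n", n, got[n], want
        }
    ' "$1" "$2"
}
```

With the captures saved as, say, `verdin.log` and `ixxat.log`, running `compare_logs verdin.log ixxat.log` would list only the frames whose length or payload differ.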

jaski.tx · January 25, 2021, 9:04am

Hi @swisstinu and welcome to the Toradex Community!

Thanks for contacting Toradex Support.

Could you provide the exact version of the software you are using (uname -a)?
Have you made any changes to the kernel or device tree? If so, please share them.

Regarding your test setup, could you provide the commands used on the host and target side?

Thanks and best regards,
Jaski

swisstinu · January 25, 2021, 10:16am

@jaski.tx
Thank you for getting in touch.

I cloned the kernel from the toradex_5.4-2.1.x-imx branch @ 1266d0110fce, applied the PREEMPT_RT patch 5.4.77-rt43 to it, and built a fully preemptible (RT) kernel.
No modifications were done to the device-tree.

ip link set can0 type can bitrate 1000000 sample-point 0.75 dbitrate 1000000 dsample-point 0.75 fd on

On Verdin (Rx): candump can0 -e -d -x

On PC (Tx):

a script which loops:

cansend can0 123#00

cansend can0 123#0001

…

Note: the same kernel, but with the device tree for Apalis and Ixora, works without issues on an iMX8QM.

jaski.tx · January 25, 2021, 10:40am

Hi @swisstinu

Thanks for the input. The Apalis iMX8QM uses the FlexCAN controller in the SoC, but the Verdin module has an external SPI CAN chip. So the issue might be the driver in combination with the real-time kernel. Let me try to reproduce this issue using an RT image.

Best regards,
Jaski

jaski.tx · January 25, 2021, 5:16pm

Hi @swisstinu

I tested on my side with the software version Linux verdin-imx8mm 5.4.77-rt43-5.2.0-devel+git.1266d0110fce and I don’t see any bytes which are wrong. I was using candump, and for different lengths the received frame was filled with 00 at the end.

Could you install the following image on your side and check if you still see the issue?
TDX Wayland with XWayland RT 5.2.0-devel-20210124+build.200

Thanks and best regards,
Jaski

swisstinu · January 26, 2021, 6:45am

Hi @jaski.tx

Thank you for your investigation.

As you recommended, I installed the TDX Wayland with XWayland RT 5.2.0-devel-20210124+build.200 image and ran candump via SSH.

But I can still see the issue; it can occur for some time and then hide and later occur again. The payload is not filled with 0; it looks more like a copy of the first byte.

can0 RX B - 123 [01] 00

can0 RX B - 123 [02] 00 01

can0 RX B - 123 [03] 00 01 00

can0 RX B - 123 [04] 00 01 02 00

can0 RX B - 123 [05] 00 01 02 03 04

can0 RX B - 123 [01] 00

can0 RX B - 123 [02] 00 01

can0 RX B - 123 [03] 00 01 02

can0 RX B - 123 [04] 00 01 02 03

can0 RX B - 123 [05] 00 01 02 03 04

Then I changed my script to use 0C instead of 00 as the first byte, to show that the payload is not zero-filled.

With that change I receive frames like these:

can0 RX B - 123 [02] 0C 01

can0 RX B - 123 [03] 0C 01 0C

can0 RX B - 123 [04] 0C 01 0C 01

can0 RX B - 123 [05] 0C 01 0C 01 02

can0 RX B - 123 [01] 0C

can0 RX B - 123 [02] 0C 01

can0 RX B - 123 [03] 0C 01 0C

can0 RX B - 123 [04] 0C 01 0C 01

can0 RX B - 123 [05] 0C 01 02 03 04

Here is my exact test script:

#!/bin/bash

while :
do
    cansend can0 123#0C
    cansend can0 123#0C01
    cansend can0 123#0C0102
    cansend can0 123#0C010203
    cansend can0 123#0C01020304
done

If I use a fixed frame size, no data is corrupted:

can0 RX - - 123 [5] 0C 01 02 03 04

can0 RX - - 123 [5] 0C 01 02 03 04

can0 RX - - 123 [5] 0C 01 02 03 04

Could you maybe test again on your side using this script?

Thank you and kind regards, swisstinu
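
A minimal sketch of a receive-side check for the sender script in this post, so that corrupted frames do not have to be spotted by eye. It assumes candump-style lines as shown in this thread (a `[len]` field followed by the payload bytes) and that every frame should carry the leading bytes of `0C 01 02 03 04`. The names `EXPECTED` and `check_frames` are hypothetical.

```shell
#!/bin/sh
# Payload bytes sent by the test script, in order.
EXPECTED="0C 01 02 03 04"

# Read candump-style lines on stdin and print every frame whose payload is
# not the first <len> bytes of EXPECTED (hypothetical helper).
check_frames() {
    awk -v expected="$EXPECTED" '
        BEGIN { nexp = split(expected, exp_bytes, " ") }
        {
            # Locate the "[len]" field; the payload bytes follow it.
            lenfield = 0
            for (i = 1; i <= NF; i++)
                if ($i ~ /^\[[0-9]+\]$/) { lenfield = i; break }
            if (!lenfield) next                  # blank or non-frame line

            len = substr($lenfield, 2, length($lenfield) - 2) + 0
            # A frame is bad if the byte count does not match the announced
            # length, or if any byte differs from the expected sequence.
            bad = (NF - lenfield != len || len > nexp)
            for (j = 1; !bad && j <= len; j++)
                if (toupper($(lenfield + j)) != exp_bytes[j]) bad = 1
            if (bad) print "corrupted: " $0
        }
    '
}
```

On the Verdin this could be used as `candump can0 -e -d -x | check_frames`, which stays silent as long as all received frames are intact.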

jaski.tx · January 26, 2021, 7:00am

Hi

Thanks for the test script. I will try this out later and let you know.

Best regards,
Jaski

Edward · January 26, 2021, 10:35am

Hi @swisstinu,

I thought you had issues with FlexCAN, but since you are seeing something similar to what I see, and @jaski.tx says it’s an SPI chip, I checked and found that it is indeed the same MCP2518FD. I see the same behavior with this chip, and as in your case “it can occur for some time and then hide and later occur again”. After several transfers with variable payload sizes, the driver enters a bad state and damages the received message payload. Usually the 3rd and 4th bytes of the payload get overwritten with the 1st and 2nd. Several similar transfers later, the driver returns to a good state and everything is fine, and then it keeps alternating between bad and good.

I’m going to integrate the MCP2518FD into our board, which uses a Colibri module. I have already verified with other tools that the messages on the bus are not damaged, so the MCP should receive them correctly.
I had VF61 M4 code working that talked to the CAN bus using the MCP2518FD, and I didn’t notice this issue there. Perhaps I missed it. It will take me some time to re-enable the MCP2518FD on the M4 so that I can recheck whether the issue happens in bare metal or not.
If someone confirms sooner than me whether or not this is an MCP2518FD hardware issue, please let the community know.

BTW, it doesn’t depend on whether messages are sent in FD format or whether BRS is on or off; the issue occurs even when receiving standard CAN messages.

There are two drivers for the MCP2518FD, one from Martin Sperl and another one from Marc Kleine-Budde. The latest version of the first one has this issue, and the second one didn’t work for me. Marc’s version sends up to 4 messages, then complains about a full buffer and doesn’t receive anything. Perhaps I missed something in the device tree or didn’t integrate it properly.

Regards,
Edward

jaski.tx · January 26, 2021, 12:25pm

Hi @swisstinu, Hi @Edward

I was able to reproduce the issue. I will raise this with the R&D team and let you know once we know more.

Thanks very much for bringing this issue up.

Best regards,
Jaski

Edward · January 26, 2021, 3:00pm

Hi @jaski.tx ,

Thanks for your support.
I checked the M4 code under the same conditions with the same message sequence. The issue clearly doesn’t appear.

Edward

jaski.tx · January 27, 2021, 6:58am

Hi @Edward

Thanks very much for your support.

Could you share your M4 code, please?

Thanks and best regards,
Jaski

Edward · January 27, 2021, 8:17am

Hi @jaski.tx,

I’ll try to reply non-publicly.

jaski.tx · January 27, 2021, 8:46am

Thanks, I set your comment to private.

Edward · January 27, 2021, 9:53am

@jaski.tx,

Did you see it with the code attached?

One more observation:

4-byte payload messages are corrupted depending on the payload length of the messages sent in between. Take correct 4-byte data like this:

C8 34 03 00

If I send only 1-byte messages between the 4-byte messages, the 4-byte message data is damaged like this:

C8 C8 34 03

The correct pattern seems to be shifted one byte to the right.

If 2-byte messages are sent between the 4-byte messages, it looks like this, shifted 2 bytes to the right:

C8 34 C8 34

It looks like some pointer or index is not always reset.
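
The two damaged patterns above can be reproduced with a small model, sketched here under the assumption that the first N bytes stay in place and the remaining positions repeat the payload from its start, where N is the payload length of the in-between messages. This only describes the observed symptom; it is not taken from the driver code, and the function name `shifted_payload` is hypothetical.

```shell
#!/bin/sh
# Model of the reported damage (an assumption based on the two observations,
# not the driver's actual logic).
# $1: shift N, remaining arguments: the correct payload bytes
shifted_payload() {
    shift_by=$1
    shift
    printf '%s\n' "$*" | awk -v n="$shift_by" '{
        n = n + 0
        out = ""
        for (i = 1; i <= NF; i++) {
            # Bytes 1..n are kept; byte i > n is replaced by byte i - n.
            src = (i <= n) ? i : i - n
            out = out (i > 1 ? " " : "") $src
        }
        print out
    }'
}
```

For example, `shifted_payload 1 C8 34 03 00` prints `C8 C8 34 03` and `shifted_payload 2 C8 34 03 00` prints `C8 34 C8 34`, matching the two cases described.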

Edward

jaski.tx · January 27, 2021, 10:13am

Hi @Edward

Yes, I can see the attached code. Thanks for the explanation of the bug.

Best regards,
Jaski

swisstinu · January 27, 2021, 12:23pm

Hi @Edward and @jaski.tx

FYI:
I have now additionally tested with the TDX Wayland with XWayland RT 5.2.0-devel image without the PREEMPT_RT patch, and could also observe the issue…

Regards
Swisstinu

swisstinu · February 1, 2021, 7:08pm

Hi @jaski.tx

I was able to get rid of the issue. With the mcp251xfd driver from Marc Kleine-Budde (version from 30 Sep 2020) backported to the Toradex kernel 5.4.77, I can no longer reproduce the misbehavior with my script.

Kind regards

swisstinu

Edward · February 1, 2021, 8:08pm

Hi @swisstinu,

Could you please point me to where you got the driver sources? The versions of mcp251xfd I tried were unable to receive any messages and sent only up to 4 messages, then complained about a buffer shortage and didn’t send anything until the interface was brought down and up again.

Regards,
Edward

swisstinu · February 1, 2021, 8:14pm

Hi @Edward

Here you are: https://github.com/varigit/linux-imx/tree/5.4-2.1.x-imx_var01/drivers/net/can/spi/mcp251xfd

This version works on my side on the Verdin. However, if I stress the bus with a lot of frames without delays, I get the warning “mcp251xfd spi2.0 can0: RX-0: MAB overflow detected”. Maybe this could be improved by using other SPI settings.

Regards, swisstinu
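
A minimal sketch for quantifying the warning mentioned above, assuming the message appears in the kernel log with the wording quoted in this post. The function name `count_mab_overflows` is hypothetical; it simply counts matching lines on stdin, so the count can be compared before and after a stress run or between different SPI settings.

```shell
#!/bin/sh
# Count kernel log lines that report an MAB overflow (hypothetical helper).
# Reads the log text on stdin and prints the number of matching lines.
count_mab_overflows() {
    awk '/MAB overflow detected/ { n++ } END { print n + 0 }'
}
```

On the target this could be run as `dmesg | count_mab_overflows`.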

jaski.tx · February 2, 2021, 6:38am

Hi @swisstinu

Great that you solved the issue. Thanks for your valuable input.

Best regards,
Jaski