Strange memory allocation failure (imx6)

Hi,

I am running into a strange problem where my Linux kernel crashes with memory allocation errors after my program has been running for a while. The failing allocation typically comes from the flexcan driver, but the crash log appears to show that memory is still available in various areas. Here is one such log:

[ 2061.066160] kswapd0: page allocation failure: order:0, mode:0x2280020(GFP_ATOMIC|__GFP_NOTRACK)
[ 2061.074920] CPU: 0 PID: 36 Comm: kswapd0 Tainted: G           O    4.9.84-2.8.2+gb2a7f2f #37
[ 2061.083362] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 2061.089892] Backtrace:
[ 2061.092375] [<8010ba5c>] (dump_backtrace) from [<8010bd34>] (show_stack+0x18/0x1c)
[ 2061.099953]  r7:80b03254 r6:60030193 r5:00000000 r4:80b1b030
[ 2061.105625] [<8010bd1c>] (show_stack) from [<803fe064>] (dump_stack+0x90/0xa4)
[ 2061.112859] [<803fdfd4>] (dump_stack) from [<801c7ef0>] (warn_alloc+0xf0/0x104)
[ 2061.120173]  r7:80b03254 r6:00000000 r5:00000000 r4:00000000
[ 2061.125843] [<801c7e04>] (warn_alloc) from [<801c8464>] (__alloc_pages_nodemask+0x4c0/0xc5c)
[ 2061.134285]  r3:00000000 r2:00000000 r1:8090fa80
[ 2061.138905]  r4:02280020
[ 2061.141449] [<801c7fa4>] (__alloc_pages_nodemask) from [<802010c4>] (new_slab+0x218/0x288)
[ 2061.149721]  r10:87c43b48 r9:00000015 r8:00000000 r7:00000000 r6:02080020 r5:00000000
[ 2061.157552]  r4:84001e00
[ 2061.160095] [<80200eac>] (new_slab) from [<802025cc>] (___slab_alloc.constprop.5+0x200/0x260)
[ 2061.168626]  r10:87c43b48 r9:00000000 r8:02080020 r7:84001e00 r6:87d7f410 r5:00000000
[ 2061.176457]  r4:00000000
[ 2061.179000] [<802023cc>] (___slab_alloc.constprop.5) from [<802029d0>] (kmem_cache_alloc+0xf0/0x120)
[ 2061.188140]  r10:87c43b48 r9:00000000 r8:60030113 r7:60030113 r6:00000000 r5:02080020
[ 2061.195971]  r4:84001e00
[ 2061.198519] [<802028e0>] (kmem_cache_alloc) from [<806dfebc>] (__build_skb+0x30/0x98)
[ 2061.206354]  r7:844f9840 r6:00000140 r5:8621a000 r4:87d7c184
[ 2061.212022] [<806dfe8c>] (__build_skb) from [<806e001c>] (__netdev_alloc_skb+0x8c/0x108)
[ 2061.220119]  r9:00000000 r8:60030113 r7:844f9840 r6:00000140 r5:8621a000 r4:87d7c184
[ 2061.227875] [<806dff90>] (__netdev_alloc_skb) from [<80577a48>] (alloc_can_skb+0x24/0xb0)
[ 2061.236058]  r9:00000020 r8:909f4030 r7:00040080 r6:8621a000 r5:87c43af4 r4:8621a000
[ 2061.243811] [<80577a24>] (alloc_can_skb) from [<8057a450>] (flexcan_poll+0xa0/0x3e8)
[ 2061.251558]  r7:00040080 r6:0000000a r5:00000000 r4:8621a000
[ 2061.257228] [<8057a3b0>] (flexcan_poll) from [<806edf4c>] (net_rx_action+0x120/0x2fc)
[ 2061.265064]  r10:87c43b48 r9:80b02d00 r8:0000000a r7:0000012c r6:0002afdd r5:8057a3b0
[ 2061.272896]  r4:8621a588
[ 2061.275441] [<806ede2c>] (net_rx_action) from [<8012a970>] (__do_softirq+0x100/0x260)
[ 2061.283278]  r10:00000003 r9:00000100 r8:80b02080 r7:ffffe000 r6:40000003 r5:80b0208c
[ 2061.291110]  r4:00000000
[ 2061.293651] [<8012a870>] (__do_softirq) from [<8012ae08>] (irq_exit+0xe0/0x148)
[ 2061.300966]  r10:00000004 r9:f4a01100 r8:86004000 r7:00000001 r6:00000000 r5:00000000
[ 2061.308798]  r4:80a77d30
[ 2061.311345] [<8012ad28>] (irq_exit) from [<8016e618>] (__handle_domain_irq+0x68/0xbc)
[ 2061.319183] [<8016e5b0>] (__handle_domain_irq) from [<801014bc>] (gic_handle_irq+0x50/0x94)
[ 2061.327541]  r9:f4a01100 r8:87c43c48 r7:f4a00100 r6:f4a0010c r5:80b1b200 r4:80b0344c
[ 2061.335290] [<8010146c>] (gic_handle_irq) from [<8010c88c>] (__irq_svc+0x6c/0x90)
[ 2061.342776] Exception stack(0x87c43c48 to 0x87c43c90)
[ 2061.347835] 3c40:                   80b9316c 80b93160 00000000 80b11940 00002d18 87dda0a0
[ 2061.356019] 3c60: 80b930dc 87c43d38 80b11954 80b11954 00000004 87c43cd4 87c43cd8 87c43c98
[ 2061.364201] 3c80: 801fd594 807fd050 60030113 ffffffff
[ 2061.369259]  r9:87c42000 r8:80b11954 r7:87c43c7c r6:ffffffff r5:60030113 r4:807fd050
[ 2061.377015] [<801fd520>] (get_swap_page) from [<801fb2f4>] (add_to_swap+0x14/0x64)
[ 2061.384592]  r10:00000004 r9:87c43de8 r8:00000000 r7:87c43d38 r6:87c43f00 r5:87dda0a0
[ 2061.392423]  r4:87dda0b4
[ 2061.394968] [<801fb2e0>] (add_to_swap) from [<801d2678>] (shrink_page_list+0x654/0xc38)
[ 2061.402975]  r5:87dda0a0 r4:87dda0b4
[ 2061.406558] [<801d2024>] (shrink_page_list) from [<801d340c>] (shrink_inactive_list+0x2ec/0x468)
[ 2061.415349]  r10:00000000 r9:80b45044 r8:80b443c4 r7:80b45040 r6:00000005 r5:80b443c0
[ 2061.423180]  r4:00000020
[ 2061.425721] [<801d3120>] (shrink_inactive_list) from [<801d3cf8>] (shrink_node+0x464/0x8a8)
[ 2061.434078]  r10:00000020 r9:87c43f00 r8:80b45044 r7:0000007e r6:00000008 r5:00000000
[ 2061.441910]  r4:00000000
[ 2061.444450] [<801d3894>] (shrink_node) from [<801d4960>] (kswapd+0x2a8/0x664)
[ 2061.451592]  r10:00000000 r9:80b443c0 r8:80b8b28c r7:80b94930 r6:ffffffff r5:00000000
[ 2061.459424]  r4:80b45044
[ 2061.461966] [<801d46b8>] (kswapd) from [<80143114>] (kthread+0x110/0x118)
[ 2061.468761]  r10:00000000 r9:00000000 r8:801d46b8 r7:80b443c0 r6:87c42000 r5:87c28140
[ 2061.476592]  r4:00000000
[ 2061.479135] [<80143004>] (kthread) from [<80107df0>] (ret_from_fork+0x14/0x24)
[ 2061.486364]  r8:00000000 r7:00000000 r6:00000000 r5:80143004 r4:87c28140
[ 2061.493067] Mem-Info:
[ 2061.495355] active_anon:844 inactive_anon:832 isolated_anon:54
[ 2061.495355]  active_file:1659 inactive_file:842 isolated_file:32
[ 2061.495355]  unevictable:0 dirty:1 writeback:367 unstable:0
[ 2061.495355]  slab_reclaimable:1017 slab_unreclaimable:2533
[ 2061.495355]  mapped:2182 shmem:1 pagetables:419 bounce:0
[ 2061.495355]  free:12049 free_pcp:36 free_cma:11933
[ 2061.528293] Node 0 active_anon:3376kB inactive_anon:3328kB active_file:6636kB inactive_file:3368kB unevictable:0kB isolated(anon):216kB isolated(file):128kB mapped:8728kB dirty:4kB writeback:1468kB shmem:4kB writeback_tmp:0kB unstable:0kB pages_scanned:32 all_unreclaimable? no
[ 2061.552804] Normal free:48196kB min:1368kB low:1708kB high:2048kB active_anon:3376kB inactive_anon:3328kB active_file:6636kB inactive_file:3368kB unevictable:0kB writepending:1472kB present:262144kB managed:249360kB mlocked:0kB slab_reclaimable:4068kB slab_unreclaimable:10132kB kernel_stack:1560kB pagetables:1676kB bounce:0kB free_pcp:144kB local_pcp:144kB free_cma:47732kB
[ 2061.585809] lowmem_reserve[]: 0 0 0
[ 2061.589353] Normal: 56*4kB (MC) 11*8kB (UC) 173*16kB (UC) 232*32kB (UC) 111*64kB (C) 43*128kB (C) 26*256kB (C) 8*512kB (C) 6*1024kB (C) 2*2048kB (C) 1*4096kB (C) 0*8192kB 0*16384kB 0*32768kB = 48200kB
3636 total pagecache pages
[ 2061.609981] 1100 pages in swap cache
[ 2061.613559] Swap cache stats: add 41388, delete 40288, find 17269/29049
[ 2061.620175] Free swap  = 464728kB
[ 2061.623491] Total swap = 524284kB
[ 2061.626807] 65536 pages RAM
[ 2061.629603] 0 pages HighMem/MovableOnly
[ 2061.633440] 3196 pages reserved
[ 2061.636582] 32768 pages cma reserved

This only happens if my program is changing the screen somewhat frequently (which appears to cause galcore to allocate more memory, judging by /sys/kernel/debug/gc/meminfo) AND the CAN driver is receiving data. If either condition is absent, the system runs indefinitely. As seen above, at the time of the crash there still appears to be enough free normal memory, free blocks, CMA, and cache, yet the allocation fails anyway.
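
One thing that may be worth checking in the Mem-Info above: almost all of the free memory in the Normal zone is reported as free_cma. As far as I understand, free CMA pages can only back movable allocations, so an atomic slab allocation like the one in the backtrace would not be able to use them. Here is a small Python sketch of the arithmetic, using values copied from the Normal zone line of the log (the line is abbreviated, and the helper name is made up for illustration):

```python
import re

# Abbreviated copy of the "Normal free:..." line from the log above.
ZONE_LINE = ("Normal free:48196kB min:1368kB low:1708kB high:2048kB "
             "free_pcp:144kB local_pcp:144kB free_cma:47732kB")

def zone_fields(line):
    """Return {name: kB} for every 'name:<number>kB' field in a zone line."""
    return {name: int(kb) for name, kb in re.findall(r"(\w+):(\d+)kB", line)}

fields = zone_fields(ZONE_LINE)

# Free memory that is NOT inside the CMA region.
non_cma_free_kb = fields["free"] - fields["free_cma"]

print(non_cma_free_kb, fields["min"])  # prints: 464 1368
```

If that reading is right, only about 464kB of the free memory sits outside CMA, which is below the 1368kB min watermark, even though the zone as a whole looks like it has plenty free.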

After reading this and some other things around the web I have tried these items to no avail:

  1. Increasing vm.min_free_kbytes - the more it is increased, the slower everything runs, and the allocation failures still occur
  2. Decreasing vm.min_free_kbytes to 50 - oddly, this makes the crash happen much less frequently, but it still happens
  3. Adjusting the CMA allocation in the command line, kernel config, and device tree - none of these seem to have any effect on the actual CMA allocation
  4. Reducing the galcore.contiguous size - everything runs longer but still eventually fails
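
For item 3, one way to confirm whether a CMA change actually took effect would be to compare the CmaTotal and CmaFree fields in /proc/meminfo before and after the change. This is a minimal Python sketch, assuming the kernel exposes those fields (they are present when CMA is enabled). The helper name is made up for illustration, and the sample text is constructed from the figures in the log above (32768 CMA pages of 4kB, 47732kB free_cma), not captured from the board:

```python
def cma_kb(meminfo_text):
    """Extract CmaTotal/CmaFree (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("CmaTotal", "CmaFree"):
            values[key] = int(rest.split()[0])
    return values

# Constructed sample, not real output from the target.
SAMPLE = """MemTotal:         249360 kB
CmaTotal:         131072 kB
CmaFree:           47732 kB
"""

print(cma_kb(SAMPLE))
```

On the target this would be called as cma_kb(open("/proc/meminfo").read()). If CmaTotal stays the same after changing the command line or device tree, the setting is presumably being overridden somewhere else.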

Curious if anyone has seen anything similar or has any thoughts.

Hi @tpeterson,

Could you provide the software version of your module?
Have you made any changes to the kernel and device tree? If yes, could you share these changes?
What is your Application?

Best regards, Jaski

Hi Jaski,

The kernel version is 4.9.84-2.8.2+gb2a7f2f, unless you were looking for some other software version?

The kernel and device tree both have many changes to adapt to our custom carrier board and connect to new hardware. I attached the diffs for the device tree, but I’m not sure of a good way to get the kernel config diff.

The main application is a GTK3-based HMI. Its memory usage doesn’t appear to change much over time, even after days of running, if the content of the screen is changed slowly (e.g. every 20 seconds). If it is changed quickly (e.g. every 4 seconds), the allocation failure eventually occurs, but the memory usage of the program itself still doesn’t change much. Galcore, though, seems to use more memory the more often the screen changes.

device tree diffs

Thanks for the devicetree file. Could you upload your kernel config ( .config ) too? Thanks.

Jaski,

Here is the config file: file

Thank you

Thanks for the config file. Could you update to BSP 2.8b5 and check whether you still have this issue?
Is it possible that you share a sample application? Thanks.
Best regards, Jaski

Hi Jaski,

I will recompile with the latest BSP, but it will take some time to get it all up and running. I’m afraid our application would be too large to share, though I can look into whether I can replicate the issue with a simpler GTK application.

As another note, during testing it seems that memory may be leaking in Xorg or the graphics driver (galcore). When our application starts, the memory allocation shown at /sys/kernel/debug/gc/vidmem for Xorg and the general gc memory usage at /sys/kernel/debug/gc/meminfo both start at roughly 30MB. At this point, all objects have been initialized on our software, but as new screens are shown and old ones hidden the memory allocated in both of the above spots continually creeps up and never goes down. The system crashes once the Xorg allocation hits around 100MB.
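
To put numbers on the creep, the two debugfs files could be sampled periodically and logged with a timestamp. Here is a minimal Python sketch. It treats the file contents as opaque text, since I don't want to assume their exact format, and the function names are made up for illustration:

```python
import time
from pathlib import Path

# The two debugfs files mentioned above (reading them typically needs root).
GC_FILES = ["/sys/kernel/debug/gc/meminfo", "/sys/kernel/debug/gc/vidmem"]

def read_or_none(path):
    """Return the file's text, or None if it is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None

def take_samples(paths, count, interval_s):
    """Collect `count` snapshots of each path, `interval_s` seconds apart.

    Returns a list of (elapsed_seconds, {path: text_or_None}) tuples.
    """
    samples = []
    start = time.monotonic()
    for i in range(count):
        if i:
            time.sleep(interval_s)
        snapshot = {p: read_or_none(p) for p in paths}
        samples.append((time.monotonic() - start, snapshot))
    return samples

# Example use on the target:
#   for elapsed, snap in take_samples(GC_FILES, count=60, interval_s=10):
#       print(elapsed, snap)
```

Comparing the first and last snapshots for a slow (20 second) and a fast (4 second) screen change rate should show whether the growth tracks the number of screen changes or just elapsed time.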

I don’t think you have to recompile everything. Maybe just try running your compiled application on the new, minimally adapted kernel.

I’m afraid our application would be too large to share, though I can look into whether I can replicate the issue with a simpler GTK application.

Yeah, it will be very helpful to have a sample application.

At this point, all objects have been initialized on our software, but as new screens are shown and old ones hidden the memory allocated in both of the above spots continually creeps up and never goes down.

How are the new screens that are shown created? Is there a limit on the number of old/hidden screens?

Best regards, Jaski

Jaski - The screens are created in GTK with the GtkBuilder class at app startup (some details on this interface here: link). Showing and hiding visible widgets is part of the GTK framework, and I’m not sure how it interfaces with X pixel buffers.

I compiled everything with the latest BSP, and the problem seems to be at least partially resolved. It appears that with the latest BSP the video memory allocation was changed so that the pool “gcvPOOL_SYSTEM” is properly used first (on the old kernel only about 1.6M would be used from this pool regardless of pool size). After that, the contiguous pool is used, then the virtual pool. Because of this change, the virtual pool seems to be needed far less, and somehow this extra headroom is enough to prevent the original issue. However, it is still a slight concern that the graphics driver seems to allocate memory indefinitely without freeing it while the app is running, and I’m still hoping to find out why. For now though at least the allocation crash isn’t happening as quickly.
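
To illustrate the ordering described above (system pool first, then contiguous, then virtual), here is a toy Python model of a first-fit fallback across pools. It is not the galcore implementation. The pool sizes and request sizes are invented purely for illustration:

```python
def allocate(pools, order, size):
    """Take `size` from the first pool in `order` with enough room.

    Returns the name of the pool that served the request.
    """
    for name in order:
        if pools[name] >= size:
            pools[name] -= size
            return name
    raise MemoryError(f"no pool can satisfy {size} units")

# Hypothetical free space per pool, in arbitrary units.
pools = {"system": 8, "contiguous": 4, "virtual": 16}
order = ["system", "contiguous", "virtual"]

# Five equal-sized requests, served in fallback order.
placements = [allocate(pools, order, 3) for _ in range(5)]
print(placements)
# prints: ['system', 'system', 'contiguous', 'virtual', 'virtual']
```

In this model, the virtual pool is only touched once the first two pools are too full for the request. That would be consistent with the virtual pool being used less once the system pool is actually consumed first.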

Hi
This is good news.

However, it is still a slight concern that the graphics driver seems to allocate memory indefinitely without freeing it while the app is running, and I’m still hoping to find out why. For now though at least the allocation crash isn’t happening as quickly.

So your application is still crashing? It would be nice to get sample code or a sample application so we can try this out.

Best regards,
Jaski