MCXN ethernet fixes and improvements #93533

decsny · 2025-07-22T18:00:00Z

Main commit is to fix the deadlock causing everything to lock up randomly. (Fixes #92652)

There was a deadlock occurring, exposed by http server sample because of
situations like this caused by tx done work being blocked in deadlock:
1) The TX would be started by some thread and the driver TX sem would be
   taken.
2) The http server socket would get scheduled on the system workqueue to
   send something, claim the network TX interface mutex,
   and be blocked taking the semaphore.
3) The RX traffic class handler would get blocked trying to claim the
   network interface TX mutex, while trying to send an ACK in the TCP
   callback. This means the RX packets would not be processed.
4) Lots of RX unable to allocate packets errors would happen, and all RX
   would be dropped. This was the main symptom of the deadlock, which
   made it look like a memory leak but actually had nothing to do with
   the RX code nor any memory leak.
5) The TX DMA would finish and schedule the TX DMA done work onto the
   system work queue, behind the http server socket which is blocked on
   the waiting for the driver TX semaphore.
6) If the TX DMA done work would have ran, that's what gives the TX
   driver semaphore. So this is the reason for the deadlock of all these
   different threads and work items, the misqueue in the system
   workqueue.

Fix by just calling the TX DMA done code directly from the ISR, it
should be ISR safe, and really not a lot of code to execute, just
freeing some net buffers and the packet and updating the stats.
An optimization can be made later if needed, but for now,
solving the deadlock is a more urgent priority.

Rest of the commits are refactoring the RX code path, and some optimizations in the RX code path, namely:

Process as many descriptors as possible during the RX work, by walking through the whole descriptor ring until a DMA owned descriptor is reached, meaning we can even process descriptors that are sent by DMA while we are in the RX work handler.
Don't drop packets just because we didn't receive all the frames by the time the RX work runs. We can receive the frames and finish the packet once we get the rest of the frames. I don't see any valid reason for dropping the packet in that case as it was.

dleach02 · 2025-07-22T20:38:54Z

drivers/ethernet/eth_nxp_enet_qos/eth_nxp_enet_qos_mac.c

 {
-	struct nxp_enet_qos_tx_data *tx_data =


This change is avoiding using a workq to do packet reference freeing in the ISR. It is unfortunate that we can't use the workq for something that is truly a background low priority activity.

An alternative which I considered if the optimization is important is to make another workqueue for the TX, but it seems a bit like overkill to have a whole thread just to do a bit of code that only does a few memory frees.

I think the http server is really more at fault here, as it was the thing blocking the system workqueue. It seems like it shouldn't be an issue to give a semaphore from a sysworkq item, but I can see the argument now for sure that it could lead to confusing deadlocks in case of other programming mistakes. Seeing as how I spent basically a whole week trying to debug a memory leak in the RX code until realizing there was no memory leak and it was actually a deadlock that had something to do with the TX side. So for the sake of avoiding more situations like this for people to debug, I think we might as well avoid it by not giving a semaphore from the wild west that could become of the system work queue.

laodzu · 2025-07-25T15:32:59Z

So I can confirm that this fixes the hangs on the frdm_mcxn947n board for me. Good work, thanks!

decsny · 2025-07-25T17:02:51Z

So I can confirm that this fixes the hangs on the frdm_mcxn947n board for me. Good work, thanks!

I have just found this morning that by plugging into a newer more intense/busy network that there is still some problems getting exposed, which I am working on fixing, so I put a DNM until then. I actually already found/fixed one of the problems which was a subtle typo in this PR. But I am also finding some other really subtle issues that are rarely appearing. So I would not rely on this PR just yet. I might add also a few other commits that add some new error handling code.

There was a deadlock occurring, exposed by http server sample because of situations like this caused by tx done work being blocked in deadlock: 1) The TX would be started by some thread and the driver TX sem would be taken. 2) The http server socket would get scheduled on the system workqueue to send something, claim the network TX interface mutex, and be blocked taking the semaphore. 3) The RX traffic class handler would get blocked trying to claim the network interface TX mutex, while trying to send an ACK in the TCP callback. This means the RX packets would not be processed. 4) Lots of RX unable to allocate packets errors would happen, and all RX would be dropped. This was the main symptom of the deadlock, which made it look like a memory leak but actually had nothing to do with the RX code nor any memory leak. 5) The TX DMA would finish and schedule the TX DMA done work onto the system work queue, behind the http server socket which is blocked on the waiting for the driver TX semaphore. 6) If the TX DMA done work would have ran, that's what gives the TX driver semaphore. So this is the reason for the deadlock of all these different threads and work items, the misqueue in the system workqueue. Fix by just calling the TX DMA done code directly from the ISR, it should be ISR safe, and really not a lot of code to execute, just freeing some net buffers and the packet and updating the stats. An optimization can be made later if needed, but for now, solving the deadlock is a more urgent priority. Signed-off-by: Declan Snyder <[email protected]>

Instead of blocking forever if TX is busy, return an error. This can avoid deadlock situations. Signed-off-by: Declan Snyder <[email protected]>

Don't enable interrupt until after init because there can be a interrupt mistakenly happen during the init process which can cause various problems. Along similar lines, avoid issue for sporadic TX interrupt with no packet in tx done handler. Signed-off-by: Declan Snyder <[email protected]>

Instead of looping through only the amount of descriptors there are, maybe it is possible to get some more things received in one work item than even the max number of descriptors if RX is processed fast enough, instead of waiting for work to be scheduled again. So change to go around the ring until we actually hit a DMA owned descriptor. Signed-off-by: Declan Snyder <[email protected]>

Extract the RX underrun handle code to a separate function. Named the function like "dma_rx_resume" to make it clear what the function is really supposed to do. Demote the error about not being a first descriptor to a warning. Because most likely we already got an error about something else which caused us to drop the packet in the first place. The rest of the frames are expected to be dropped. And make the string shorter. Also remove the debug message because the control bits do not tell us any more information than we don't know already. They only tell us that we own the descriptor (known since we are processing the frame), that it is not a first descriptor (known since that is the reason we would drop it at this point, as indicated by the warning). Signed-off-by: Declan Snyder <[email protected]>

decsny · 2025-07-25T19:45:43Z

I think it was one of the optional commits that were optimizing the RX path were either introducing problems or exposing them due to things going through more continuously, so I just dropped them for now

sonarqubecloud · 2025-07-25T19:47:55Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

dleach02 · 2025-07-25T21:21:31Z

drivers/ethernet/eth_nxp_enet_qos/eth_nxp_enet_qos_mac.c

@@ -714,8 +737,12 @@ static int eth_nxp_enet_qos_mac_init(const struct device *dev)
 	/* Work upon a reception of a packet to a buffer */
 	k_work_init(&data->rx.rx_work, eth_nxp_enet_qos_rx);

-	/* Work upon a complete transmission by a channel's TX DMA */
-	k_work_init(&data->tx.tx_done_work, tx_dma_done);
+	/* This driver cannot work without interrupts. */


Why is this test moved to the bottom?

see commit message

decsny changed the title ~~MCXN ethernet fixes~~ MCXN ethernet fixes and improvements Jul 22, 2025

zephyrbot added platform: NXP Drivers NXP Semiconductors, drivers area: Ethernet labels Jul 22, 2025

zephyrbot requested review from clamattia, dbaluta, dleach02, lmajewski, maass-hamburg, manuargue, mmahadevan108 and pdgendt July 22, 2025 18:00

zephyrbot assigned dleach02 and mmahadevan108 Jul 22, 2025

decsny force-pushed the mcxn_eth_fixes branch 2 times, most recently from a7c6a4c to 70d1107 Compare July 22, 2025 19:12

dleach02 reviewed Jul 22, 2025

View reviewed changes

mmahadevan108 previously approved these changes Jul 23, 2025

View reviewed changes

laodzu mentioned this pull request Jul 24, 2025

"No new RX buf available" error in sample Echo and HttpServer applications when running on FRDM_N947 boards #92652

Open

1 task

decsny added the DNM This PR should not be merged (Do Not Merge) label Jul 25, 2025

decsny added 2 commits July 25, 2025 13:36

drivers: eth_nxp_enet_qos: Don't block TX send

df42b12

Instead of blocking forever if TX is busy, return an error. This can avoid deadlock situations. Signed-off-by: Declan Snyder <[email protected]>

decsny dismissed mmahadevan108’s stale review via e6adfc6 July 25, 2025 18:38

decsny force-pushed the mcxn_eth_fixes branch from 779249b to e6adfc6 Compare July 25, 2025 18:38

decsny added 3 commits July 25, 2025 14:40

decsny force-pushed the mcxn_eth_fixes branch from e6adfc6 to 5841264 Compare July 25, 2025 19:44

decsny removed the DNM This PR should not be merged (Do Not Merge) label Jul 25, 2025

dleach02 reviewed Jul 25, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

MCXN ethernet fixes and improvements #93533

MCXN ethernet fixes and improvements #93533

decsny commented Jul 22, 2025 •

edited

Loading

Uh oh!

dleach02 Jul 22, 2025

Uh oh!

decsny Jul 22, 2025 •

edited

Loading

Uh oh!

laodzu commented Jul 25, 2025

Uh oh!

decsny commented Jul 25, 2025 •

edited

Loading

Uh oh!

decsny commented Jul 25, 2025 •

edited

Loading

Uh oh!

sonarqubecloud bot commented Jul 25, 2025

Uh oh!

dleach02 Jul 25, 2025

Uh oh!

decsny Jul 25, 2025

Uh oh!

Uh oh!

MCXN ethernet fixes and improvements #93533

Are you sure you want to change the base?

MCXN ethernet fixes and improvements #93533

Conversation

decsny commented Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dleach02 Jul 22, 2025

Choose a reason for hiding this comment

Uh oh!

decsny Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

laodzu commented Jul 25, 2025

Uh oh!

decsny commented Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

decsny commented Jul 25, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sonarqubecloud bot commented Jul 25, 2025

Quality Gate passed

Uh oh!

dleach02 Jul 25, 2025

Choose a reason for hiding this comment

Uh oh!

decsny Jul 25, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

decsny commented Jul 22, 2025 •

edited

Loading

decsny Jul 22, 2025 •

edited

Loading

decsny commented Jul 25, 2025 •

edited

Loading

decsny commented Jul 25, 2025 •

edited

Loading