-
Notifications
You must be signed in to change notification settings - Fork 7.7k
MCXN ethernet fixes and improvements #93533
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
a7c6a4c
to
70d1107
Compare
{ | ||
struct nxp_enet_qos_tx_data *tx_data = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change is avoiding using a workq to do packet reference freeing in the ISR. It is unfortunate that we can't use the workq for something that is truly a background low priority activity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative which I considered if the optimization is important is to make another workqueue for the TX, but it seems a bit like overkill to have a whole thread just to do a bit of code that only does a few memory frees.
I think the http server is really more at fault here, as it was the thing blocking the system workqueue. It seems like it shouldn't be an issue to give a semaphore from a sysworkq item, but I can see the argument now for sure that it could lead to confusing deadlocks in case of other programming mistakes. Seeing as how I spent basically a whole week trying to debug a memory leak in the RX code until realizing there was no memory leak and it was actually a deadlock that had something to do with the TX side. So for the sake of avoiding more situations like this for people to debug, I think we might as well avoid it by not giving a semaphore from the wild west that could become of the system work queue.
So I can confirm that this fixes the hangs on the frdm_mcxn947n board for me. Good work, thanks! |
I have just found this morning that by plugging into a newer more intense/busy network that there is still some problems getting exposed, which I am working on fixing, so I put a DNM until then. I actually already found/fixed one of the problems which was a subtle typo in this PR. But I am also finding some other really subtle issues that are rarely appearing. So I would not rely on this PR just yet. I might add also a few other commits that add some new error handling code. |
There was a deadlock occurring, exposed by http server sample because of situations like this caused by tx done work being blocked in deadlock: 1) The TX would be started by some thread and the driver TX sem would be taken. 2) The http server socket would get scheduled on the system workqueue to send something, claim the network TX interface mutex, and be blocked taking the semaphore. 3) The RX traffic class handler would get blocked trying to claim the network interface TX mutex, while trying to send an ACK in the TCP callback. This means the RX packets would not be processed. 4) Lots of RX unable to allocate packets errors would happen, and all RX would be dropped. This was the main symptom of the deadlock, which made it look like a memory leak but actually had nothing to do with the RX code nor any memory leak. 5) The TX DMA would finish and schedule the TX DMA done work onto the system work queue, behind the http server socket which is blocked on the waiting for the driver TX semaphore. 6) If the TX DMA done work would have ran, that's what gives the TX driver semaphore. So this is the reason for the deadlock of all these different threads and work items, the misqueue in the system workqueue. Fix by just calling the TX DMA done code directly from the ISR, it should be ISR safe, and really not a lot of code to execute, just freeing some net buffers and the packet and updating the stats. An optimization can be made later if needed, but for now, solving the deadlock is a more urgent priority. Signed-off-by: Declan Snyder <[email protected]>
Instead of blocking forever if TX is busy, return an error. This can avoid deadlock situations. Signed-off-by: Declan Snyder <[email protected]>
Don't enable interrupt until after init because there can be a interrupt mistakenly happen during the init process which can cause various problems. Along similar lines, avoid issue for sporadic TX interrupt with no packet in tx done handler. Signed-off-by: Declan Snyder <[email protected]>
Instead of looping through only the amount of descriptors there are, maybe it is possible to get some more things received in one work item than even the max number of descriptors if RX is processed fast enough, instead of waiting for work to be scheduled again. So change to go around the ring until we actually hit a DMA owned descriptor. Signed-off-by: Declan Snyder <[email protected]>
Extract the RX underrun handle code to a separate function. Named the function like "dma_rx_resume" to make it clear what the function is really supposed to do. Demote the error about not being a first descriptor to a warning. Because most likely we already got an error about something else which caused us to drop the packet in the first place. The rest of the frames are expected to be dropped. And make the string shorter. Also remove the debug message because the control bits do not tell us any more information than we don't know already. They only tell us that we own the descriptor (known since we are processing the frame), that it is not a first descriptor (known since that is the reason we would drop it at this point, as indicated by the warning). Signed-off-by: Declan Snyder <[email protected]>
I think it was one of the optional commits that were optimizing the RX path were either introducing problems or exposing them due to things going through more continuously, so I just dropped them for now |
|
@@ -714,8 +737,12 @@ static int eth_nxp_enet_qos_mac_init(const struct device *dev) | |||
/* Work upon a reception of a packet to a buffer */ | |||
k_work_init(&data->rx.rx_work, eth_nxp_enet_qos_rx); | |||
|
|||
/* Work upon a complete transmission by a channel's TX DMA */ | |||
k_work_init(&data->tx.tx_done_work, tx_dma_done); | |||
/* This driver cannot work without interrupts. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this test moved to the bottom?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see commit message
Main commit is to fix the deadlock causing everything to lock up randomly. (Fixes #92652)
Rest of the commits are refactoring the RX code path, and some optimizations in the RX code path, namely: