When I train Qwen2.5-32B with Megatron on H200 x 2, I found the throughput was around 420 TFLOP/s per GPU. The partition was tp=4, pp=2. According to the MFU calculation, GPU utilization is 420/1979 ≈ 21%, which is very low. Why is that? Is the logic of num_floating_point_operations in training.py wrong? Roughly what number do you get when you train a model like this?
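For reference, a minimal sketch of the MFU arithmetic in question, assuming the 420 figure is the per-GPU TFLOP/s that Megatron reports and using H200's published BF16 peaks (989 TFLOP/s dense, 1979 TFLOP/s with 2:4 structured sparsity). Note that dense training kernels cannot reach the sparse peak, so the denominator choice changes the answer a lot:

```python
# Hypothetical MFU arithmetic; the 420 figure and the choice of peak
# throughput are assumptions taken from the question above.

PEAK_TFLOPS_DENSE = 989.0    # H200 BF16 dense peak (TFLOP/s)
PEAK_TFLOPS_SPARSE = 1979.0  # H200 BF16 peak with 2:4 structured sparsity

achieved_tflops = 420.0      # per-GPU throughput reported during training

# MFU against each peak
mfu_dense = achieved_tflops / PEAK_TFLOPS_DENSE
mfu_sparse = achieved_tflops / PEAK_TFLOPS_SPARSE

print(f"MFU vs dense peak:  {mfu_dense:.1%}")   # ~42.5%
print(f"MFU vs sparse peak: {mfu_sparse:.1%}")  # ~21.2%
```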