Activity
stefandesouza commented on Jun 12, 2024
A quick look in cProfile surprisingly seems to show that the majority of time is actually spent in the list comprehensions in get_throughput_sum.
This includes the first comprehension, which creates a list of port pressures for the instructions whose throughput is nonzero:
port_pressures = [instr.port_pressure for instr in kernel if instr.throughput != 0.0]
and the final comprehension, which sums over the columns:
tp_sum = [round(sum(col), 2) for col in zip(*port_pressures)]
At first I thought using numpy could avoid the 'zip' transposition, but it does not seem to bring any improvement:
tp_sum = np.round(np.sum(port_pressures_np, axis=0), 2).tolist()
I think these comprehensions are already optimized enough, and since they account for 80% of the runtime, it's probably not worth looking into the other functions. The assembly kernel I tested on was the 'triad' benchmark repeated about a hundred times over.
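For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of profiling the two comprehensions with cProfile. The Instr class and the kernel data are made-up stand-ins (only the port_pressure and throughput attributes from the snippets above are modeled), and get_throughput_sum here is a simplified copy, not the real function, so the numbers it prints say nothing about the actual code base.

```python
import cProfile
import io
import pstats
from dataclasses import dataclass


@dataclass
class Instr:
    # Hypothetical stand-in for an instruction form; only the two
    # attributes used by the comprehensions are modeled.
    port_pressure: list
    throughput: float


def get_throughput_sum(kernel):
    # Simplified copy of the two comprehensions quoted above.
    port_pressures = [instr.port_pressure for instr in kernel if instr.throughput != 0.0]
    tp_sum = [round(sum(col), 2) for col in zip(*port_pressures)]
    return tp_sum


# Made-up kernel: three instruction forms over four ports, repeated 100 times.
kernel = [
    Instr([0.5, 0.5, 0.0, 0.0], 0.5),
    Instr([0.0, 0.0, 1.0, 0.0], 1.0),
    Instr([0.0, 0.0, 0.0, 0.0], 0.0),  # filtered out: zero throughput
] * 100

profiler = cProfile.Profile()
profiler.enable()
for _ in range(200):
    result = get_throughput_sum(kernel)
profiler.disable()

# Collect the five most expensive entries, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Note that on newer Python versions comprehensions may be inlined into the enclosing function, so the time can show up under get_throughput_sum itself rather than under separate listcomp entries.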
JanLJL commented on Jun 12, 2024
Interesting, thanks for this insight!
What we can do, and what's low-hanging fruit, is to reduce the calls of get_throughput_sum() in lines 62-64 to a single call: do the check in line 62 on the result and reuse that result in line 64 as well.
Furthermore, we could enhance the get_throughput_sum() function by adding a hint to only calculate (and therefore only run the comprehension for) a specific column/port if we know that only this one has changed.
I would assume the relative runtime should move more towards the graph computation when using a more complex kernel with 100 instructions and dependency chains inside of it, as the STREAM triad is only a handful of instructions long, e.g.:
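Going back to the first two suggestions in this comment (calling get_throughput_sum() once and adding a per-port hint), a rough sketch of how they could look is below. Everything in it is an assumption for illustration: the Instr class is a stand-in, the port parameter does not exist in the current function, and the LIMIT check is only a placeholder because the actual condition in lines 62-64 is not shown in this thread.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    # Hypothetical stand-in; only the attributes used here are modeled.
    port_pressure: list
    throughput: float


def get_throughput_sum(kernel, port=None):
    # `port` is a hypothetical hint, not part of the existing function.
    port_pressures = [i.port_pressure for i in kernel if i.throughput != 0.0]
    if port is not None:
        # Only one column is needed, so the zip(*...) transposition is skipped.
        return round(sum(pp[port] for pp in port_pressures), 2)
    return [round(sum(col), 2) for col in zip(*port_pressures)]


kernel = [
    Instr([0.5, 0.5, 0.0], 0.5),
    Instr([0.0, 1.0, 0.0], 1.0),
    Instr([0.0, 0.0, 0.0], 0.0),  # filtered out: zero throughput
]

# Suggestion 1: call once, then reuse the result for the check and afterwards.
tp_sum = get_throughput_sum(kernel)
LIMIT = 1.0  # placeholder threshold; the real check is not shown in this thread
if max(tp_sum) > LIMIT:
    busiest_port = tp_sum.index(max(tp_sum))
else:
    busiest_port = None

# Suggestion 2: after a change known to touch only port 2,
# recompute just that column instead of the whole sum.
kernel[0].port_pressure[2] = 0.25
tp_sum[2] = get_throughput_sum(kernel, port=2)
```

The single-column path still runs the filtering comprehension over the whole kernel, so it only saves the transposition and the sums for the other ports. Whether that is a measurable win would need to be checked with the profiler.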