Description
This issue acts as a point of reference for investigating and optimising memory consumption in our GitHub Actions log event handling processes.
Following the reconciliation of orphaned log lines in #278, we have observed the sending queue filling more quickly and excessive memory consumption, resulting in pods being OOM killed.
As an initial fix we have configured a larger sending queue in both the dev and ops environments; at the time of writing this is set to a queue size of 50k with 50 consumers.
This has resolved the queueing bottleneck and the `sending queue is full` errors in both environments; however, ops is still consuming a large amount of memory during high-traffic periods for GHA.
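For reference, a minimal sketch of how that tuning maps onto the OpenTelemetry Collector `exporterhelper` queue settings. This is illustrative only, not our actual configuration, and the constructor/field names follow the `exporterhelper` Go API, which has shifted between collector versions:

```go
package main

import (
	"fmt"

	"go.opentelemetry.io/collector/exporter/exporterhelper"
)

func main() {
	// Start from the exporterhelper defaults, then enlarge the queue to
	// absorb GHA traffic bursts: 50k queued items drained by 50 consumers,
	// matching the values currently set in dev and ops.
	qs := exporterhelper.NewDefaultQueueSettings()
	qs.QueueSize = 50_000
	qs.NumConsumers = 50

	fmt.Printf("sending_queue settings: %+v\n", qs)
}
```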
Actions taken:
- Memory has been increased initially for the cicd-o11y ops pods, with a further increase staged
- pprof enabled in ops to match dev (see the sketch after this list)
- Profiling data ingested into Pyroscope: alloc profile
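pprof in a Go service is exposed over HTTP; the collector does this through its pprof extension, but the underlying mechanism is equivalent to the sketch below. Port 1777 is the pprof extension's conventional default and an assumption about our setup:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Exposes profiles such as /debug/pprof/allocs and /debug/pprof/heap,
	// which can be pulled with `go tool pprof` or ingested into Pyroscope.
	log.Fatal(http.ListenAndServe("localhost:1777", nil))
}
```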
Investigation:
- The memory profiling data shows high allocation from repeated construction of log entry data structures, which we can potentially improve through pooling, batching and reducing redundant copies
- `plog.LogRecordSlice.AppendEmpty` and `pcommon.Value.SetStr` are flagged as the largest allocation sites.
- Our profiling review engine flags that each incoming log entry results in the creation of new map, string, and slice structures, i.e. allocations per log item (see the illustrative snippet after this list).
- See flame graph analysis for additional information
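To illustrate the pattern the profile points at (not a verbatim copy of our receiver code), building each record through the pdata API allocates a fresh log record, body string and attribute map per incoming GHA log line; the attribute key/value below are made up for the example:

```go
package main

import (
	"fmt"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
	"go.opentelemetry.io/collector/pdata/plog"
)

// buildLogs converts a batch of raw GHA log lines into a plog.Logs payload.
// Each AppendEmpty / SetStr / PutStr call allocates new backing structures,
// which is the per-log-item allocation the profile highlights.
func buildLogs(lines []string) plog.Logs {
	logs := plog.NewLogs()
	sl := logs.ResourceLogs().AppendEmpty().ScopeLogs().AppendEmpty()
	for _, line := range lines {
		lr := sl.LogRecords().AppendEmpty() // plog.LogRecordSlice.AppendEmpty
		lr.Body().SetStr(line)              // pcommon.Value.SetStr
		lr.Attributes().PutStr("source", "github-actions")
		lr.SetTimestamp(pcommon.NewTimestampFromTime(time.Now()))
	}
	return logs
}

func main() {
	logs := buildLogs([]string{"step started", "step finished"})
	fmt.Println("log records:", logs.LogRecordCount())
}
```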
As a first step we aim to introduce pooling in the log-writing path and then review memory consumption; a rough sketch of the direction is below.
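A minimal sketch of the pooling idea, assuming a `sync.Pool` of reusable scratch structures on the hot path; the type and function names here are hypothetical and do not exist in the repo:

```go
package main

import (
	"fmt"
	"sync"
)

// entryBuffer is a hypothetical reusable scratch structure used while building
// a log record, so we stop allocating a fresh one per incoming log line.
type entryBuffer struct {
	attrs map[string]string
	body  []byte
}

var entryPool = sync.Pool{
	New: func() any { return &entryBuffer{attrs: make(map[string]string, 8)} },
}

// writeLogLine sketches the pooled hot path: borrow a buffer, fill it, hand the
// data to the exporter, then reset and return the buffer for reuse.
func writeLogLine(line string) {
	buf := entryPool.Get().(*entryBuffer)
	defer func() {
		// Reset before returning so the next borrower starts clean.
		buf.body = buf.body[:0]
		for k := range buf.attrs {
			delete(buf.attrs, k)
		}
		entryPool.Put(buf)
	}()

	buf.body = append(buf.body, line...)
	buf.attrs["source"] = "github-actions"
	fmt.Printf("emitting %d bytes with %d attributes\n", len(buf.body), len(buf.attrs))
}

func main() {
	writeLogLine("step started")
	writeLogLine("step finished")
}
```

Re-running the alloc profile after this change should tell us whether the per-log allocations flagged above actually drop.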