Description
Hi developers,
We hit a critical issue when killing a Storm topology.
We killed the topology as follows:
Config conf = new Config();
conf.put(Config.NIMBUS_SEEDS, "SOME_NIMBUS_SEED_STRING");
KillOptions opt = new KillOptions();
opt.set_wait_secs_isSet(true);
opt.set_wait_secs(10);
Nimbus.Iface nimbusClient = NimbusClient.getConfiguredClient(conf).getClient();
nimbusClient.killTopologyWithOpts("TOPOLOGY_NAME", opt);
Topology workers were distributed across multiple supervisors.
On some supervisors, the workers exited normally.
On other supervisors, however, the workers never exited, and the supervisor logged the messages below:
2021-06-29 02:58:44.284 o.a.s.d.s.Container SLOT_6707 [INFO] SET worker-user baef41a4-b5f6-4ea3-8868-5537dfba82f8 root
2021-06-29 02:58:44.284 o.a.s.d.s.Container SLOT_6707 [INFO] Creating symlinks for worker-id: baef41a4-b5f6-4ea3-8868-5537dfba82f8 storm-id: TOPOLOGY_NAME for files(1): [resources]
2021-06-29 02:58:44.284 o.a.s.d.s.BasicContainer SLOT_6707 [INFO] Launching worker with assignment LocalAssignment(topology_id:TOPOLOGY_NAME, executors:[ExecutorInfo(task_start:17, task_end:17), ExecutorInfo(task_start:29, task_end:29), ExecutorInfo(task_start:5, task_end:5)], resources:WorkerResources(mem_on_heap:6272.0, mem_off_heap:0.0, cpu:30.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=6272.0, cpu.pcore.percent=30.0}, shared_resources:{}), owner:root) for this supervisor d2ee514a-e40e-40fb-b119-59763f3bb95d-10.233.112.14 on port 6707 with id baef41a4-b5f6-4ea3-8868-5537dfba82f8
2021-06-29 02:58:44.285 o.a.s.d.s.Slot SLOT_6708 [INFO] STATE kill-and-relaunch msInState: 6 topo:TOPOLOGY_NAME worker:d06bb5c5-25e2-4557-8996-4d40045022d1 -> waiting-for-worker-start msInState: 0 topo:TOPOLOGY_NAME worker:d06bb5c5-25e2-4557-8996-4d40045022d1
2021-06-29 02:58:44.286 o.a.s.d.s.Slot SLOT_6707 [INFO] STATE kill-and-relaunch msInState: 7 topo:TOPOLOGY_NAME worker:baef41a4-b5f6-4ea3-8868-5537dfba82f8 -> waiting-for-worker-start msInState: 0 topo:TOPOLOGY_NAME worker:baef41a4-b5f6-4ea3-8868-5537dfba82f8
2021-06-29 02:58:46.799 o.a.s.d.s.BasicContainer Thread-7269 [INFO] Worker Process d06bb5c5-25e2-4557-8996-4d40045022d1 exited with code: 254
2021-06-29 02:58:48.065 o.a.s.d.s.BasicContainer Thread-7270 [INFO] Worker Process baef41a4-b5f6-4ea3-8868-5537dfba82f8 exited with code: 254
2021-06-29 02:59:09.234 o.a.s.d.s.t.SupervisorHealthCheck timer [INFO] Running supervisor healthchecks...
2021-06-29 02:59:09.234 o.a.s.h.HealthChecker timer [INFO] The supervisor healthchecks succeeded.
2021-06-29 02:59:39.234 o.a.s.d.s.t.SupervisorHealthCheck timer [INFO] Running supervisor healthchecks...
2021-06-29 02:59:39.234 o.a.s.h.HealthChecker timer [INFO] The supervisor healthchecks succeeded.
2021-06-29 02:59:53.558 o.a.s.d.s.Supervisor pool-11-thread-9 [INFO] Got an assignments from master, will start to sync with assignments: SupervisorAssignments(...)
2021-06-29 02:59:53.936 o.a.s.d.s.Slot SLOT_6702 [INFO] SLOT 6702: Assignment Changed from LocalAssignment(topology_id:TOPOLOGY_NAME, executors:[ExecutorInfo(task_start:23, task_end:23), ExecutorInfo(task_start:11, task_end:11)], resources:WorkerResources(mem_on_heap:3200.0, mem_off_heap:0.0, cpu:20.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=3200.0, cpu.pcore.percent=20.0}, shared_resources:{}), owner:root) to null
2021-06-29 02:59:53.939 o.a.s.d.s.Container SLOT_6702 [INFO] Killing d2ee514a-e40e-40fb-b119-59763f3bb95d-10.233.112.14:25976cac-9170-44ec-b835-099377cda893
2021-06-29 02:59:54.293 o.a.s.d.s.Slot SLOT_6708 [INFO] SLOT 6708: Assignment Changed from LocalAssignment(topology_id:TOPOLOGY_NAME, executors:[ExecutorInfo(task_start:10, task_end:10), ExecutorInfo(task_start:22, task_end:22)], resources:WorkerResources(mem_on_heap:3200.0, mem_off_heap:0.0, cpu:20.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=3200.0, cpu.pcore.percent=20.0}, shared_resources:{}), owner:root) to null
2021-06-29 02:59:54.293 o.a.s.d.s.Slot SLOT_6707 [INFO] SLOT 6707: Assignment Changed from LocalAssignment(topology_id:TOPOLOGY_NAME, executors:[ExecutorInfo(task_start:17, task_end:17), ExecutorInfo(task_start:29, task_end:29), ExecutorInfo(task_start:5, task_end:5)], resources:WorkerResources(mem_on_heap:6272.0, mem_off_heap:0.0, cpu:30.0, shared_mem_on_heap:0.0, shared_mem_off_heap:0.0, resources:{offheap.memory.mb=0.0, onheap.memory.mb=6272.0, cpu.pcore.percent=30.0}, shared_resources:{}), owner:root) to null
2021-06-29 02:59:54.296 o.a.s.d.s.Slot SLOT_6708 [INFO] STATE waiting-for-worker-start msInState: 70011 topo:TOPOLOGY_NAME worker:d06bb5c5-25e2-4557-8996-4d40045022d1 -> kill msInState: 0 topo:TOPOLOGY_NAME worker:d06bb5c5-25e2-4557-8996-4d40045022d1
2021-06-29 02:59:54.296 o.a.s.d.s.Slot SLOT_6707 [INFO] STATE waiting-for-worker-start msInState: 70010 topo:TOPOLOGY_NAME worker:baef41a4-b5f6-4ea3-8868-5537dfba82f8 -> kill msInState: 0 topo:TOPOLOGY_NAME worker:baef41a4-b5f6-4ea3-8868-5537dfba82f8
2021-06-29 02:59:54.298 o.a.s.d.s.Slot SLOT_6708 [INFO] SLOT 6708 all processes are dead...
2021-06-29 02:59:54.298 o.a.s.d.s.Container SLOT_6708 [INFO] Cleaning up d2ee514a-e40e-40fb-b119-59763f3bb95d-10.233.112.14:d06bb5c5-25e2-4557-8996-4d40045022d1
2021-06-29 02:59:54.298 o.a.s.d.s.AdvancedFSOps SLOT_6708 [INFO] Deleting path /storm/workers/d06bb5c5-25e2-4557-8996-4d40045022d1/pids/141225
2021-06-29 02:59:54.298 o.a.s.d.s.AdvancedFSOps SLOT_6708 [INFO] Deleting path /storm/workers/d06bb5c5-25e2-4557-8996-4d40045022d1/heartbeats
2021-06-29 03:00:06.452 o.a.s.d.s.AdvancedFSOps AsyncLocalizer Task Executor - 1 [INFO] Deleting path /storm/supervisor/stormdist/TOPOLOGY_NAME/stormjar.jar
2021-06-29 03:00:06.472 o.a.s.d.s.AdvancedFSOps AsyncLocalizer Task Executor - 1 [INFO] Deleting path /storm/supervisor/stormdist/TOPOLOGY_NAME/stormjar.jar.version
2021-06-29 03:00:06.472 o.a.s.d.s.AdvancedFSOps AsyncLocalizer Task Executor - 1 [INFO] Deleting path /storm/supervisor/stormdist/TOPOLOGY_NAME/resources
2021-06-29 03:00:06.472 o.a.s.l.LocalizedResourceRetentionSet AsyncLocalizer Task Executor - 1 [INFO] Deleted blob: TOPOLOGY_NAME-stormjar.jar (REMOVED FROM CLUSTER).
2021-06-29 03:00:06.475 o.a.s.d.s.AdvancedFSOps AsyncLocalizer Task Executor - 1 [INFO] Deleting path /storm/supervisor/stormdist/TOPOLOGY_NAME/stormconf.ser
2021-06-29 03:00:06.475 o.a.s.d.s.AdvancedFSOps AsyncLocalizer Task Executor - 1 [INFO] Deleting path /storm/supervisor/stormdist/TOPOLOGY_NAME/stormconf.ser.version
2021-06-29 03:00:06.475 o.a.s.l.LocalizedResourceRetentionSet AsyncLocalizer Task Executor - 1 [INFO] Deleted blob: TOPOLOGY_NAME-stormconf.ser (REMOVED FROM CLUSTER).
2021-06-29 03:00:06.477 o.a.s.d.s.AdvancedFSOps AsyncLocalizer Task Executor - 1 [INFO] Deleting path /storm/supervisor/stormdist/TOPOLOGY_NAME/stormcode.ser
2021-06-29 03:00:06.477 o.a.s.d.s.AdvancedFSOps AsyncLocalizer Task Executor - 1 [INFO] Deleting path /storm/supervisor/stormdist/TOPOLOGY_NAME/stormcode.ser.version
2021-06-29 03:00:06.478 o.a.s.l.LocalizedResourceRetentionSet AsyncLocalizer Task Executor - 1 [INFO] Deleted blob: TOPOLOGY_NAME-stormcode.ser (REMOVED FROM CLUSTER).
2021-06-29 03:00:06.478 o.a.s.d.s.AdvancedFSOps AsyncLocalizer Task Executor - 1 [INFO] Deleting path /storm/supervisor/stormdist/TOPOLOGY_NAME
2021-06-29 03:00:07.062 o.a.s.d.s.Supervisor pool-11-thread-10 [WARN] Topology config is not localized yet...
2021-06-29 03:00:07.063 o.a.s.t.ProcessFunction pool-11-thread-10 [ERROR] Internal error processing sendSupervisorWorkerHeartbeat
org.apache.storm.utils.WrappedNotAliveException: TOPOLOGY_NAME does not appear to be alive, you should probably exit
    at org.apache.storm.daemon.supervisor.Supervisor$1.sendSupervisorWorkerHeartbeat(Supervisor.java:448) ~[storm-server-2.2.0.jar:2.2.0]
    at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:374) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:353) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:172) [storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
2021-06-29 03:00:07.064 o.a.s.t.ProcessFunction pool-11-thread-3 [ERROR] Internal error processing sendSupervisorWorkerHeartbeat
org.apache.storm.utils.WrappedNotAliveException: TOPOLOGY_NAME does not appear to be alive, you should probably exit
    at org.apache.storm.daemon.supervisor.Supervisor$1.sendSupervisorWorkerHeartbeat(Supervisor.java:448) ~[storm-server-2.2.0.jar:2.2.0]
    at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:374) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:353) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:172) [storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
2021-06-29 03:00:08.106 o.a.s.d.s.Supervisor pool-11-thread-9 [WARN] Topology config is not localized yet...
2021-06-29 03:00:08.107 o.a.s.t.ProcessFunction pool-11-thread-9 [ERROR] Internal error processing sendSupervisorWorkerHeartbeat
org.apache.storm.utils.WrappedNotAliveException: TOPOLOGY_NAME does not appear to be alive, you should probably exit
    at org.apache.storm.daemon.supervisor.Supervisor$1.sendSupervisorWorkerHeartbeat(Supervisor.java:448) ~[storm-server-2.2.0.jar:2.2.0]
    at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:374) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:353) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:172) [storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
2021-06-29 03:00:08.108 o.a.s.d.s.Supervisor pool-11-thread-16 [WARN] Topology config is not localized yet...
2021-06-29 03:00:08.108 o.a.s.t.ProcessFunction pool-11-thread-16 [ERROR] Internal error processing sendSupervisorWorkerHeartbeat
org.apache.storm.utils.WrappedNotAliveException: TOPOLOGY_NAME does not appear to be alive, you should probably exit
    at org.apache.storm.daemon.supervisor.Supervisor$1.sendSupervisorWorkerHeartbeat(Supervisor.java:448) ~[storm-server-2.2.0.jar:2.2.0]
    at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:374) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.generated.Supervisor$Processor$sendSupervisorWorkerHeartbeat.getResult(Supervisor.java:353) ~[storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.ProcessFunction.process(ProcessFunction.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.TBaseProcessor.process(TBaseProcessor.java:38) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.security.auth.SimpleTransportPlugin$SimpleWrapProcessor.process(SimpleTransportPlugin.java:172) [storm-client-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:524) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at org.apache.storm.thrift.server.Invocation.run(Invocation.java:18) [storm-shaded-deps-2.2.0.jar:2.2.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
The WARN and ERROR messages repeated indefinitely until we manually killed the worker process.
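To make the slot timeline in the supervisor log easier to follow, the `STATE ... -> ...` entries written by `o.a.s.d.s.Slot` can be pulled out with a small parser. The following is only a sketch: the class name `SlotTransitions` and its helper types are hypothetical, and the regular expression assumes the exact entry layout seen in the log pasted in this report (timestamp, logger, `SLOT_<port>`, `[INFO]`, `STATE <from> msInState: <ms> ... -> <to> msInState: <ms>`). Other Storm versions or log4j layouts may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SlotTransitions {

    /** One "STATE from -> to" transition logged by o.a.s.d.s.Slot. */
    static final class Transition {
        final String timestamp;
        final String slot;
        final String from;
        final long msInFrom; // time spent in the "from" state, in milliseconds
        final String to;

        Transition(String timestamp, String slot, String from, long msInFrom, String to) {
            this.timestamp = timestamp;
            this.slot = slot;
            this.from = from;
            this.msInFrom = msInFrom;
            this.to = to;
        }

        @Override
        public String toString() {
            return timestamp + " " + slot + " " + from + " (" + msInFrom + " ms) -> " + to;
        }
    }

    // Assumed layout, taken from the log in this report; not an official format.
    private static final Pattern STATE = Pattern.compile(
        "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) o\\.a\\.s\\.d\\.s\\.Slot (SLOT_\\d+) \\[INFO\\] "
        + "STATE (\\S+) msInState: (\\d+) .*? -> (\\S+) msInState: \\d+");

    /** Returns every slot state transition found in the given log lines, in order. */
    static List<Transition> parse(List<String> lines) {
        List<Transition> out = new ArrayList<>();
        for (String line : lines) {
            Matcher m = STATE.matcher(line);
            while (m.find()) { // a line may hold several entries if the log was flattened
                out.add(new Transition(m.group(1), m.group(2), m.group(3),
                        Long.parseLong(m.group(4)), m.group(5)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two entries copied from the supervisor log in this report.
        List<String> sample = List.of(
            "2021-06-29 02:58:44.285 o.a.s.d.s.Slot SLOT_6708 [INFO] STATE kill-and-relaunch msInState: 6 "
            + "topo:TOPOLOGY_NAME worker:d06bb5c5-25e2-4557-8996-4d40045022d1 -> waiting-for-worker-start "
            + "msInState: 0 topo:TOPOLOGY_NAME worker:d06bb5c5-25e2-4557-8996-4d40045022d1",
            "2021-06-29 02:59:54.296 o.a.s.d.s.Slot SLOT_6708 [INFO] STATE waiting-for-worker-start msInState: 70011 "
            + "topo:TOPOLOGY_NAME worker:d06bb5c5-25e2-4557-8996-4d40045022d1 -> kill "
            + "msInState: 0 topo:TOPOLOGY_NAME worker:d06bb5c5-25e2-4557-8996-4d40045022d1");
        parse(sample).forEach(System.out::println);
    }
}
```

Applied to the log in this report, the transitions show that slots 6707 and 6708 stayed in `waiting-for-worker-start` for roughly 70 seconds after both worker processes had already exited with code 254, and only then moved to `kill`.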
Originally reported by sangheee, imported from: killed topology worker does not removed with warn and error that "Topology config is not localized yet..."
- status: Open
- priority: Major
- resolution: Unresolved
- imported: 2025-01-24
Activity
jira-importer commented on Apr 23, 2022
radhikakv:
+1 to prioritize this bug fix
We recently migrated to v2.2.0, and this issue is disrupting our Storm topologies and affecting production runs.
Please also suggest any workarounds or clean-up scripts that we can run until the bug is fixed.
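As a possible stopgap (not a confirmed fix, and not something Storm ships), a clean-up script could look for leftover worker JVMs of a topology that has already been killed and terminate them, which is what the reporter did by hand. The sketch below is written under a loud assumption: that each worker JVM's command line carries a `-Dstorm.id=<topology id>` system property. Please verify this against `ps` output on your own supervisors before relying on it. The class name `StaleWorkerFinder`, its methods, and the `--kill` argument are hypothetical; only the JDK `ProcessHandle` API (Java 9+) is used.

```java
import java.util.List;
import java.util.stream.Collectors;

public class StaleWorkerFinder {

    /**
     * True if the command line contains "-Dstorm.id=<topologyId>" as a whole token.
     * ASSUMPTION: Storm worker JVMs are launched with this system property.
     */
    static boolean belongsToTopology(String commandLine, String topologyId) {
        if (commandLine == null) {
            return false;
        }
        String marker = "-Dstorm.id=" + topologyId;
        int idx = commandLine.indexOf(marker);
        while (idx >= 0) {
            int end = idx + marker.length();
            // Require a token boundary so "topo-1" does not match "topo-10".
            if (end == commandLine.length() || Character.isWhitespace(commandLine.charAt(end))) {
                return true;
            }
            idx = commandLine.indexOf(marker, end);
        }
        return false;
    }

    /** Lists visible processes whose command line matches the topology id. */
    static List<ProcessHandle> findWorkers(String topologyId) {
        return ProcessHandle.allProcesses()
                .filter(p -> belongsToTopology(p.info().commandLine().orElse(null), topologyId))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: StaleWorkerFinder <topology-id> [--kill]");
            return;
        }
        boolean kill = args.length > 1 && "--kill".equals(args[1]);
        for (ProcessHandle p : findWorkers(args[0])) {
            System.out.println("leftover worker candidate, pid=" + p.pid());
            if (kill) {
                p.destroy(); // requests normal termination; escalate manually if it is ignored
            }
        }
    }
}
```

Run it on each affected supervisor host only after the topology has been killed, first without `--kill` to review the candidate PIDs. The command line of processes owned by other users may not be visible, so it should run as the same user as the workers (root in the log above).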