User:Razzi/Debugging eventlogging to druid network flows internal hourly.service

From Wikitech
In IRC I saw this alert today:

PROBLEM - Check unit status of eventlogging_to_druid_network_flows_internal_hourly on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit eventlogging_to_druid_network_flows_internal_hourly https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers

After SSHing into an-launcher1002, the unit's logs showed this error (the first line is the relevant one; the rest is Spark shutting down):

Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 ERROR DataFrameToDruid: Druid ingestion task index_hadoop_network_flows_internal_lggaghgk_2022-02-18T22:00:35.639Z for network_flows_internal failed
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO HiveToDruid: Done.
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SparkContext: Invoking stop() from shutdown hook
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SparkUI: Stopped Spark web UI at http://an-launcher1002.eqiad.wmnet:4041
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnClientSchedulerBackend: Interrupting monitor thread
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnClientSchedulerBackend: Shutting down all executors
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: (serviceOption=None,
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]:  services=List(),
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]:  started=false)
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO YarnClientSchedulerBackend: Stopped
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO MemoryStore: MemoryStore cleared
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO BlockManager: BlockManager stopped
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO BlockManagerMaster: BlockManagerMaster stopped
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO SparkContext: Successfully stopped SparkContext
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO ShutdownHookManager: Shutdown hook called
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-fcccd681-efb8-4816-9a57-f8a66dc0b7db
Feb 18 22:20:39 an-launcher1002 eventlogging_to_druid_network_flows_internal_hourly[8443]: 22/02/18 22:20:39 INFO ShutdownHookManager: Deleting directory /tmp/spark-f21f647b-6de0-4473-94de-46767d4f8fc8
Feb 18 22:20:39 an-launcher1002 systemd[1]: eventlogging_to_druid_network_flows_internal_hourly.service: Main process exited, code=exited, status=1/FAILURE
Feb 18 22:20:39 an-launcher1002 systemd[1]: eventlogging_to_druid_network_flows_internal_hourly.service: Failed with result 'exit-code'.
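
Since the one ERROR line is easy to miss among the shutdown INFO lines, here is a minimal sketch of how it could be pulled out of saved log output, together with the Druid task ID needed to look the task up on the Druid side. The function names and regular expressions are my own and only assume the line layout shown above; this is not part of any existing tooling.

```python
import re

# Spark/log4j portion of a journal line: "yy/MM/dd HH:mm:ss LEVEL Logger: message".
# The layout is assumed from the log excerpt above.
LOG4J_RE = re.compile(
    r"(?P<ts>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<logger>[\w$.]+): (?P<msg>.*)$"
)

# Wording of the DataFrameToDruid failure message, as seen in the excerpt above.
TASK_RE = re.compile(
    r"Druid ingestion task (?P<task>\S+) for (?P<datasource>\S+) failed"
)


def find_errors(lines):
    """Return (timestamp, logger, message) for every ERROR-level log line."""
    errors = []
    for line in lines:
        match = LOG4J_RE.search(line)
        if match and match.group("level") == "ERROR":
            errors.append(
                (match.group("ts"), match.group("logger"), match.group("msg"))
            )
    return errors


def failed_tasks(lines):
    """Return (task_id, datasource) for every failed Druid ingestion task."""
    tasks = []
    for _ts, _logger, msg in find_errors(lines):
        match = TASK_RE.search(msg)
        if match:
            tasks.append((match.group("task"), match.group("datasource")))
    return tasks


# Two lines copied from the log above, used as sample input.
SAMPLE_LINES = [
    "Feb 18 22:20:39 an-launcher1002 "
    "eventlogging_to_druid_network_flows_internal_hourly[8443]: "
    "22/02/18 22:20:39 ERROR DataFrameToDruid: Druid ingestion task "
    "index_hadoop_network_flows_internal_lggaghgk_2022-02-18T22:00:35.639Z "
    "for network_flows_internal failed",
    "Feb 18 22:20:39 an-launcher1002 "
    "eventlogging_to_druid_network_flows_internal_hourly[8443]: "
    "22/02/18 22:20:39 INFO HiveToDruid: Done.",
]

for task_id, datasource in failed_tasks(SAMPLE_LINES):
    print(f"{datasource}: {task_id}")
```

The same functions should work on any list of lines, for example the output of `journalctl -u eventlogging_to_druid_network_flows_internal_hourly.service` saved to a file, as long as the lines follow the layout above.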

Unfortunately, the unit has kept alternating between recovering and failing. Root cause TBD.