Skip to content

bug: sync logging causes sluggish ot-robot-server performance and timeout errors under certain system load conditions #18034

Open
@strangemonad

Description

@strangemonad

Overview

We're investigating a handful of system load related issues that manifest in the following ways:

  • Slow/Poor Flex performance across the ot-robot-server and on-device app performace
  • Timeout errors
  • Stall crash errors

This becomes especially evident in protocols that have a large number of steps/actions (eg >10k) and many chained, short-lived actions eg multiple, short executing moves.

We believe there are several contributing factors to the degraded performance. We've been able to track down at least one of them to the use of synchronous logging in the ot-robot-server (and other services).

Steps to reproduce

Under load observer

  • CAN bus message summaries are written to logs (by ot-robot-server)
  • Logs synchronously write to journald see robot-server/robot_server/services/logging.py#_robot_log_config()
  • When journald flushes logs to disk (/var/log/.../), it will cause the ot-robot-server to block in the critical CAN bus message ack path.

There are several potential future improvements that would all be more involved to implement. eg better configuration of journald.

A very simple first improvement would be to first configure the logger to write to an async handler like a QueueHandler. This will prevent the python process from blocking during these momentary journald flushes to disc. Here's some sample code that incorporates async logging with systemd logging. This should probably be done at least to the ot-robot-server but all python services would benefit.

import logging
from logging.handlers import QueueHandler, QueueListener
from queue import Queue
from systemd.journal import JournalHandler

# Create a queue for log messages
log_queue = Queue()

# Set up the JournalHandler
journal_handler = JournalHandler()
journal_handler.setFormatter(logging.Formatter('[%(levelname)s] %(message)s'))

# Set up the QueueHandler and add it to the logger
queue_handler = QueueHandler(log_queue)
logger = logging.getLogger('async_logger')
logger.setLevel(logging.INFO)
logger.addHandler(queue_handler)

# Set up the QueueListener to process logs asynchronously
listener = QueueListener(log_queue, journal_handler)
listener.start()

# Example usage
logger.info("This is an async log message")

# Stop the listener when done
listener.stop()

Current behavior

No response

Expected behavior

No response

Operating system

None

System and robot setup or anything else?

Flex 8.3.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions