Open
Description
Haumea is currently exhibiting correctable memory errors, and it would be great to a) monitor these events and b) log them.
For logging, I found hardware.rasdaemon, which can listen for these kinds of events:
# rasdaemon -f
rasdaemon: Improper PAGE_CE_ACTION, set to default soft
rasdaemon: Page offline choice on Corrected Errors is soft
rasdaemon: Improper PAGE_CE_THRESHOLD, set to default 50.
rasdaemon: Improper PAGE_CE_REFRESH_CYCLE, set to default 24h.
rasdaemon: Threshold of memory Corrected Errors is 50 / 24h
rasdaemon: ras:mc_event event enabled
rasdaemon: Enabled event ras:mc_event
rasdaemon: ras:aer_event event enabled
rasdaemon: Enabled event ras:aer_event
rasdaemon: ras:non_standard_event event enabled
rasdaemon: Enabled event ras:non_standard_event
rasdaemon: ras:arm_event event enabled
rasdaemon: Enabled event ras:arm_event
rasdaemon: [open_sys_file]:open file: /sys/devices/system/cpu/cpu0/online failed
rasdaemon: Cpu fault isolation is disabled
rasdaemon: mce:mce_record event enabled
rasdaemon: Enabled event mce:mce_record
rasdaemon: ras:extlog_mem_event event enabled
rasdaemon: Enabled event ras:extlog_mem_event
rasdaemon: net:net_dev_xmit_timeout event enabled
rasdaemon: Enabled event net:net_dev_xmit_timeout
rasdaemon: devlink:devlink_health_report event enabled
rasdaemon: Enabled event devlink:devlink_health_report
rasdaemon: block:block_rq_error event enabled
rasdaemon: Enabled event block:block_rq_error
rasdaemon: ras:memory_failure_event event enabled
rasdaemon: Enabled event ras:memory_failure_event
rasdaemon: Listening to events for cpus 0 to 15
<...>-1268491 [000] ..... 0.026543 mce_record 2024-12-06 15:35:41 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error.
Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
<...>-1276690 [000] ..... 0.026608 mce_record 2024-12-06 15:46:37 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error.
Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
<...>-1285769 [000] ..... 0.026739 mce_record 2024-12-06 16:08:27 +0000 Unified Memory Controller (bank=17), status= 9c2040000000011b, Corrected error, no action required., mci=CECC, mca= DRAM ECC error.
Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic', memory_channel=0,csrow=1, cpu_type= AMD Scalable MCA, cpu= 0, socketid= 0, misc= d01b0fff01000000, addr= 319deb440, synd= b22c00100a800301, ipid= 9600050f00, mcgstatus=0, mcgcap= 11c, apicid= 0
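For the monitoring half, one option is to parse the `mce_record` lines and count corrected errors per physical address. The following is a minimal sketch, assuming the output format shown above (including the event continuing on a second `Memory Error` line); other rasdaemon versions or CPU vendors may print different fields. The function names `parse_mce_records` and `count_corrected_by_addr` are hypothetical and not part of rasdaemon.

```python
import re
from collections import Counter

# Header of an mce_record event: timestamp and memory controller bank.
# Assumption: layout matches the rasdaemon -f output pasted above.
HEADER_RE = re.compile(
    r"mce_record\s+(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4})"
    r".*?\(bank=(?P<bank>\d+)\)"
)
# A few "key= value" fields of interest; values are decimal or hex digits.
FIELD_RE = re.compile(r"\b(memory_channel|csrow|addr|socketid)=\s*([0-9a-fA-F]+)")


def parse_mce_records(text):
    """Yield one dict per mce_record event found in rasdaemon output."""
    # Split before each "mce_record" token so that the continuation line
    # ("Memory Error ...") stays in the same chunk as its event header.
    for chunk in re.split(r"(?=mce_record\s)", text):
        header = HEADER_RE.search(chunk)
        if header is None:
            # Startup lines such as "mce:mce_record event enabled" land here.
            continue
        event = header.groupdict()
        event["corrected"] = "Corrected error" in chunk
        event.update(FIELD_RE.findall(chunk))
        yield event


def count_corrected_by_addr(text):
    """Count corrected errors per reported address (hex string)."""
    return Counter(
        event["addr"]
        for event in parse_mce_records(text)
        if event["corrected"] and "addr" in event
    )
```

Fed the three events above, `count_corrected_by_addr` would report a single address, `319deb440`, with a count of 3, which suggests the errors come from one location rather than being spread across the DIMM.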
Status
Todo