The details of this depend on the facilities available. Generally the possible values we might want to generate alarms on fall into the following categories:
- DC values with hi/lo limits (e.g. voltages, temperatures)
- "Should never happen" error conditions (e.g. "L1A received while in Busy state" or "EvN mismatch")
- "Rate metered" conditions, e.g. "L1A received while in OFW state" may count at a low rate, and this is OK
- Things which should (approximately) match, e.g. STATUS.GENERAL.L1A_COUNT and STATUS.LSC.SFPx.BUILT_EVENT_COUNT.
The alarms currently being used are defined here:
https://svnweb.cern.ch/cern/wsvn/cmshcos/trunk/hcalAlarm/conf/
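For the last category (values which should approximately match), a minimal sketch of a tolerance check is given here. The function name and both tolerance values are hypothetical placeholders, not taken from the hcalAlarm configuration. Some slack is needed because two counters read at slightly different times will differ by the events in flight.

```python
def counts_match(a, b, abs_tol=10, rel_tol=1e-3):
    """Return True if two event counters agree within a tolerance.

    abs_tol and rel_tol are illustrative values only (assumptions).
    The larger of the absolute and the relative tolerance is applied,
    so small counts are not flagged for a difference of a few events.
    """
    return abs(a - b) <= max(abs_tol, rel_tol * max(a, b))


# Example with made-up readings of L1A_COUNT and BUILT_EVENT_COUNT
l1a_count = 1_000_000
built_event_count = 1_000_500
if not counts_match(l1a_count, built_event_count):
    print("ALARM: L1A count and built event count disagree")
```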
Hardware alarm conditions
These are unrelated to any run in progress and indicate possible hardware failures in the board.
We can establish high and low limits for warning and error.
STATUS.FPGA.DIE_TEMP V6 die temperature in units of 0.1 degree Celsius
STATUS.FPGA.MV_0V75_VREF 0.75V DDR3_Vref power voltage in millivolt
STATUS.FPGA.MV_0V75_VTT 0.75V DDR3_Vtt power voltage in millivolt
STATUS.FPGA.MV_12V0 12V power voltage in millivolt
STATUS.FPGA.MV_1V0 1.0V analog power voltage in millivolt
STATUS.FPGA.MV_1V0_BRAM 1.0V VccBRAM power voltage in millivolt
STATUS.FPGA.MV_1V0_INT 1.0V Vccint power voltage in millivolt
STATUS.FPGA.MV_1V2 1.2V analog power voltage in millivolt
STATUS.FPGA.MV_1V5 1.5V power voltage in millivolt
STATUS.FPGA.MV_1V8_AUX 1.8V VccAux power voltage in millivolt
STATUS.FPGA.MV_1V8_GTX 1.8V VccAuxGTX power voltage in millivolt
STATUS.FPGA.MV_2V0 2.0V VccAuxIO power voltage in millivolt
STATUS.FPGA.MV_2V5 2.5V power voltage in millivolt
STATUS.FPGA.MV_3V3 3.3V power voltage in millivolt
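A high/low limit check with separate warning and error bands could be sketched as follows. The class and function names are hypothetical, and the limits shown for the 3.3V rail (5% for warning, 10% for error) are made-up illustrations, not values from the board specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Warning and error bands for one monitored value (same units as the register)."""
    err_lo: float
    warn_lo: float
    warn_hi: float
    err_hi: float


def classify(value, lim):
    """Return "ERROR", "WARNING" or "OK" for a reading against its limits."""
    if value < lim.err_lo or value > lim.err_hi:
        return "ERROR"
    if value < lim.warn_lo or value > lim.warn_hi:
        return "WARNING"
    return "OK"


# Assumed limits for STATUS.FPGA.MV_3V3, in millivolts (illustrative only)
LIMITS_3V3 = Limits(err_lo=2970, warn_lo=3135, warn_hi=3465, err_hi=3630)

print(classify(3310, LIMITS_3V3))
```

Keeping the limits in a small table keyed by register name would let the same function serve the temperature and all of the voltages.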
State timers
These are 64-bit timers (they have _LO and _HI words) which count the total time the AMC13 has spent in the TTS BSY/OFW/SYN states. You have to decide what to do about these, but any time spent in SYN is probably an error.
STATUS.GENERAL.BUSY_TIME_LO busy time counter
STATUS.GENERAL.OF_WARN_TIME_LO L1A overflow warning time counter
STATUS.GENERAL.SYNC_LOST_TIME_LO L1A sync lost time counter
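Since each timer is split into _LO and _HI words, software has to combine the two before comparing values. The sketch below assumes each word is 32 bits wide and that the _HI register name is formed by replacing the _LO suffix; `read` is a hypothetical callable mapping a register name to an integer, not a real AMC13 API. It also assumes the hardware does not latch the pair, so _HI is read twice to guard against _LO rolling over between reads.

```python
def combine_64(lo, hi):
    """Combine two 32-bit words into one 64-bit value (assumes 32-bit words)."""
    return ((hi & 0xFFFFFFFF) << 32) | (lo & 0xFFFFFFFF)


def read_counter64(read, base_name):
    """Read base_name + "_LO" and "_HI" consistently.

    `read` is a hypothetical function: read(register_name) -> int.
    Retries if the high word changed while the low word was being read.
    """
    while True:
        hi = read(base_name + "_HI")
        lo = read(base_name + "_LO")
        if read(base_name + "_HI") == hi:
            return combine_64(lo, hi)


# Example with a fake register file standing in for the hardware
fake_regs = {
    "STATUS.GENERAL.SYNC_LOST_TIME_LO": 5,
    "STATUS.GENERAL.SYNC_LOST_TIME_HI": 0,
}
sync_lost = read_counter64(fake_regs.__getitem__, "STATUS.GENERAL.SYNC_LOST_TIME")
if sync_lost > 0:
    print("ALARM: time spent in SYN:", sync_lost, "counts")
```

The timer unit is not stated here, so the example reports raw counts rather than converting to seconds.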
L1A when there shouldn't be any
These count (as the names say) the number of L1As seen when there shouldn't be any. The problem is that the OFW and even the BSY counters are reported to count at a low rate due to excessive latency in the GT response to TTS. You probably need a software rate meter on these.
STATUS.GENERAL.L1A_WHEN_BSY_LO L1A received when in BSY state
STATUS.GENERAL.L1A_WHEN_OFW_LO L1A received when in OFW state
STATUS.GENERAL.L1A_WHEN_SYN_LO L1A received when in SYN state
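A software rate meter of the kind suggested above could look roughly like this. It is a sketch only: the class name, the threshold, and the polling scheme are assumptions. The meter compares successive counter readings and alarms only when the rate of increase exceeds a configured limit, so a slow trickle of counts does not raise an alarm.

```python
class RateMeter:
    """Alarm when a counter increases faster than max_rate_hz (hypothetical helper)."""

    def __init__(self, max_rate_hz):
        self.max_rate_hz = max_rate_hz
        self._last = None  # (count, timestamp) of the previous sample

    def update(self, count, now):
        """Feed one reading; return (rate_hz, alarm).

        Returns (None, False) when no rate can be computed: on the first
        sample, or if the counter went backwards (e.g. after a reset) or
        the timestamp did not advance.
        """
        prev = self._last
        self._last = (count, now)
        if prev is None:
            return None, False
        dcount = count - prev[0]
        dt = now - prev[1]
        if dt <= 0 or dcount < 0:
            return None, False
        rate = dcount / dt
        return rate, rate > self.max_rate_hz


# Example: allow up to 1 Hz of "L1A when OFW" (threshold is illustrative)
meter = RateMeter(max_rate_hz=1.0)
for count, t in [(0, 0.0), (5, 10.0), (105, 20.0)]:
    rate, alarm = meter.update(count, t)
    if alarm:
        print("ALARM: L1A_WHEN_OFW rate", rate, "Hz")
```

For the SYN counter a limit of zero is probably appropriate, since any count there is likely an error.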
Errors in the data
STATUS.AMC.AMC_CRC_ERR AMC event CRC error detected
STATUS.AMC01.BP_CRC_ERR Backplane link CRC error
AMC links
The counters below should really all be zero all the time, but we may need to put a rate-meter limit on them to let a few counts through without alarming.
These indicate problems with the data sent by the uHTR.
EVN_MISMATCH_COUNTER_LO AMC Evn mismatch counter
ORN_MISMATCH_COUNTER_LO AMC OrN mismatch counter
BCN_MISMATCH_COUNTER_LO AMC BcN mismatch counter
BAD_EVENTLENGTH_COUNTER_LO AMC bad event length counter
TRAILER_EVN_MISMATCH_COUNTER_LO AMC event trailer Evn mismatch error counter
BAD_AMC_CRC_COUNTER_LO Bad CRC on event from AMC
AMC_TTS_DISC_COUNTER_LO TTS state 'disconnected' from AMC
AMC_TTS_SYNC_COUNTER_LO TTS state 'sync lost' from AMC
AMC_TTS_ERR_COUNTER_LO TTS state 'error' from AMC
These indicate problems with data received at the AMC13 end of the backplane.
AMC13_EVN_MISMATCH_LO HTR event EVN mismatch counter
AMC13_BCN_MISMATCH_LO HTR event BCN mismatch counter
AMC13_ORN_MISMATCH_LO HTR event OrN mismatch counter
AMC13_BAD_LENGTH_LO AMC bad event length counter
-- EricHazen - 07 Oct 2015