Enable and disable hardware (FRUs) for Sun / Solaris Howto

CPU disable / enable (Solaris)

1. View /var/adm/messages to figure out which CPU is having problems.
2. Use psrinfo to view your CPU configuration
3. As root, take the CPU offline using the command 'psradm -f 1'
3a. The syntax is 'psradm -f <processor_id>'
4. Verify the CPU is offline with 'psrinfo'
5. To bring an offline CPU back online, use 'psradm -n 1'
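
Steps 2 and 4 can be scripted. The following is a minimal Python sketch that parses psrinfo-style output and reports which processors are not on-line. The sample text is illustrative and was not captured from a real machine. The line layout it assumes (processor ID first, state second) should be checked against the psrinfo output on your own system.

```python
def parse_psrinfo(text):
    """Map processor ID -> state from plain psrinfo-style output.

    Assumes each line looks like: "<id>  <state>  since <date> <time>".
    """
    states = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            states[int(fields[0])] = fields[1]
    return states


def not_online(states):
    """Return the sorted IDs of processors whose state is not on-line."""
    return sorted(cpu for cpu, state in states.items() if state != "on-line")


# Illustrative sample only; not taken from a real system.
SAMPLE = """\
0       on-line   since 03/24/2012 03:11:00
1       off-line  since 03/24/2012 03:15:00
2       on-line   since 03/24/2012 03:11:00
3       on-line   since 03/24/2012 03:11:00
"""

if __name__ == "__main__":
    print(not_online(parse_psrinfo(SAMPLE)))
```

In practice the text would come from running psrinfo on the Solaris host rather than from a hard-coded string.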

Note that disabling a processor with psradm still lets it act as the controller for the memory behind it. So if the OS asks it to access something in its memory banks, the system may still crash. Using the asr commands (described below) disables both the CPU and its memory controller.

Disable Hardware on SPARC Platforms from the OBP (System)

You can disable hardware directly from the OBP with “asr” commands. If a production-critical machine won’t boot because of a failed component, you can disable the hardware from the OBP and get the machine back up (although crippled) to minimize production downtime.

Rebooting with command: boot
Boot device: /pci@1e,600000/pci@0/pci@2/scsi@0/disk@0,0  File and args: -rsv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-V445/ufsboot
Loading: /platform/sun4u/ufsboot
ERROR: Last Trap: Corrected ECC Error

{3} ok

YIKES!@#$! We have a memory failure.

The OBP keyword “sifting” will search through all of the commands the OBP knows for a particular string. So to search for all of the commands that contain asr:

{3} ok sifting asr

In vocabulary  srassembler
(f001d858) rdasr        (f001d550) wrasr        (f001d53c) rdasr
In vocabulary  forth
(f008ee08) asr-list-keys        (f008ed2c) asr-enable
(f008ebd8) asr-disable          (f008d22c) .asr         (f008cb50) asr-clear
(f0052240) asr-policies

So the main commands here are asr-list-keys (show what we can disable), .asr (show what we already have disabled), asr-enable, asr-disable, and asr-clear.
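
If you capture the console session to a log, the sifting output can be split into vocabularies and words. This is a rough Python sketch, assuming every entry has the "(address) word" shape seen in the transcript; the function name parse_sifting is made up for illustration.

```python
import re

# One "(hex-address) word" entry, as printed by sifting in the transcript.
ENTRY = re.compile(r"\(([0-9a-f]+)\)\s+(\S+)")


def parse_sifting(text):
    """Return {vocabulary: [word, ...]} from OBP sifting output."""
    vocab = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("In vocabulary"):
            current = line.split()[-1]
            vocab[current] = []
        elif current is not None:
            vocab[current].extend(word for _addr, word in ENTRY.findall(line))
    return vocab


# Output copied from the transcript.
SAMPLE = """\
In vocabulary  srassembler
(f001d858) rdasr        (f001d550) wrasr        (f001d53c) rdasr
In vocabulary  forth
(f008ee08) asr-list-keys        (f008ed2c) asr-enable
(f008ebd8) asr-disable          (f008d22c) .asr         (f008cb50) asr-clear
(f0052240) asr-policies
"""

if __name__ == "__main__":
    print(parse_sifting(SAMPLE)["forth"])
```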

{3} ok asr-list-keys


key = net2&3                /pci@1f,700000/pci@0/pci@2/pci@0/@4
key = net0&1                /pci@1e,600000/pci@0/pci@1/pci@0/@4
key = ide                   /pci@1f,700000/pci@0/pci@1/pci@0/@1f
key = usb                   /pci@1f,700000/pci@0/pci@1/pci@0/@1c
key = pci7                  /pci@1f,700000/pci@0/@9
key = pci6                  /pci@1e,600000/pci@0/@9
key = pci5                  /pci@1f,700000/pci@0/pci@2/pci@0/@8
key = pci4                  /pci@1f,700000/pci@0/pci@2/pci@0/@8
key = pci3                  /pci@1e,600000/pci@0/pci@1/pci@0/@8
key = pci2                  /pci@1e,600000/pci@0/pci@1/pci@0/@8
key = pci1                  /pci@1f,700000/pci@0/@8
key = pci0                  /pci@1e,600000/pci@0/@8
key = cpu3-bank3
key = cpu3-bank2
key = cpu3-bank1
key = cpu3-bank0
key = cpu2-bank3
key = cpu2-bank2
key = cpu2-bank1
key = cpu2-bank0
key = cpu1-bank3
key = cpu1-bank2
key = cpu1-bank1
key = cpu1-bank0
key = cpu0-bank3
key = cpu0-bank2
key = cpu0-bank1
key = cpu0-bank0

Since we have an ECC memory error, we know the fault is in one of the memory banks listed above. By disabling the memory banks one CPU at a time, we can find the failed memory by trial and error.
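
To keep track of which banks to try, the key listing can be turned into a list of asr-disable commands per CPU. This is only a sketch: the helper names are hypothetical, the sample is a shortened excerpt of the listing, and the commands it prints still have to be typed at the ok prompt by hand.

```python
def parse_keys(text):
    """Return {key: device_path or None} from asr-list-keys output."""
    keys = {}
    for line in text.splitlines():
        if not line.startswith("key ="):
            continue
        fields = line.split("=", 1)[1].split()
        # Memory bank keys are printed without a device path.
        keys[fields[0]] = fields[1] if len(fields) > 1 else None
    return keys


def bank_disable_commands(keys, cpu):
    """Build the asr-disable commands for every memory bank of one CPU."""
    prefix = "cpu%d-bank" % cpu
    banks = sorted(k for k in keys if k.startswith(prefix))
    return ["asr-disable " + k for k in banks]


# Shortened excerpt of the listing in the transcript.
SAMPLE = """\
key = net0&1                /pci@1e,600000/pci@0/pci@1/pci@0/@4
key = pci0                  /pci@1e,600000/pci@0/@8
key = cpu3-bank3
key = cpu3-bank2
key = cpu3-bank1
key = cpu3-bank0
key = cpu2-bank3
"""

if __name__ == "__main__":
    for command in bank_disable_commands(parse_keys(SAMPLE), 3):
        print(command)
```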

{3} ok .asr
There are no devices disabled by ASR.

Disabling the banks on cpu0 through cpu2 in turn kept hitting the ECC memory error. Let's disable the banks on CPU3.

{3} ok asr-disable cpu3-bank0
{3} ok asr-disable cpu3-bank1
{3} ok asr-disable cpu3-bank2
{3} ok asr-disable cpu3-bank3

{3} ok .asr
cpu3-bank3              Disabled by USER
No reason given
cpu3-bank2              Disabled by USER
No reason given
cpu3-bank1              Disabled by USER
No reason given
cpu3-bank0              Disabled by USER
No reason given
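
When the console is logged, this .asr listing can be checked automatically. The Python sketch below assumes the two-line layout shown above (a "Disabled by" line followed by a reason line) and is not based on any documented format; parse_asr_status is a hypothetical name.

```python
def parse_asr_status(text):
    """Return a list of (device, disabled_by, reason) tuples from .asr output."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = []
    for i, line in enumerate(lines):
        if " Disabled by " not in line:
            continue
        device, _, who = line.partition("Disabled by")
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        # The line after a device is its reason, unless it is another device.
        reason = "" if " Disabled by " in nxt else nxt
        entries.append((device.strip(), who.strip(), reason))
    return entries


# Both samples are copied from the transcript.
DISABLED = """\
cpu3-bank3              Disabled by USER
No reason given
cpu3-bank2              Disabled by USER
No reason given
"""
EMPTY = "There are no devices disabled by ASR."

if __name__ == "__main__":
    print(parse_asr_status(DISABLED))
    print(parse_asr_status(EMPTY))
```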

And let's boot the machine:

Sun Fire V445, No Keyboard
Copyright 2006 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.22.19, 24576 MB memory installed, Serial xxxxxxxxx
Ethernet address 0:14:4f:xx:xx:xx, Host ID: xxxxxxx

NOTICE: CPU 3 has 8192/8192 MB of memory disabled

ERROR: The following devices are disabled:
cpu3-bank3
cpu3-bank2
cpu3-bank1
cpu3-bank0

Thanks for telling me!

Rebooting with command: boot -rsv
Boot device: /pci@1e,600000/pci@0/pci@2/scsi@0/disk@0,0  File and args: -rsv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-V445/ufsboot
Loading: /platform/sun4u/ufsboot
module /platform/sun4u/kernel/sparcv9/unix: text at [0x1000000, 0x107a767] data at 0x1800000
module misc/sparcv9/krtld: text at [0x107a768, 0x10933af] data at 0x184c760
module /platform/sun4u/kernel/sparcv9/genunix: text at [0x10933b0, 0x11f0f17] data at 0x1852040
module /platform/SUNW,Sun-Fire-V445/kernel/misc/sparcv9/platmod: text at [0x11f0f18, 0x11f1817] data at 0x18a45e0
module /platform/sun4u/kernel/cpu/sparcv9/SUNW,UltraSPARC-IIIi: text at [0x11f1880, 0x120278f] data at 0x18a4e80
SunOS Release 5.10 Version Generic_118833-33 64-bit
Copyright 1983-2006 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.
Ethernet address = 0:14:4f:2b:ea:aa
mem = 25165824K (0x600000000)
avail mem = 25226371072
root nexus = Sun Fire V445

YAY! Our gimpy machine is going back into production minus 8 GB of memory. There will be a performance impact from running on fewer system resources, but better something than nothing?
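
The memory figures in the boot log can be cross-checked with a few lines of arithmetic. In this Python sketch the three input lines are copied from the transcript and the helper names are made up. It shows that the banner figure and the kernel's "mem" figure are the same amount, which suggests (the log does not say so explicitly) that the banner counts only the memory still enabled.

```python
import re


def banner_mb(line):
    """Memory size in MB from the OpenBoot banner line."""
    return int(re.search(r"(\d+) MB memory installed", line).group(1))


def disabled_mb(line):
    """(cpu, disabled MB, total MB) from the NOTICE line."""
    m = re.search(r"CPU (\d+) has (\d+)/(\d+) MB of memory disabled", line)
    return tuple(int(g) for g in m.groups())


def kernel_mem(line):
    """(kilobytes, bytes) from the kernel's 'mem =' line."""
    m = re.search(r"mem = (\d+)K \((0x[0-9a-fA-F]+)\)", line)
    return int(m.group(1)), int(m.group(2), 16)


BANNER = "OpenBoot 4.22.19, 24576 MB memory installed, Serial xxxxxxxxx"
NOTICE = "NOTICE: CPU 3 has 8192/8192 MB of memory disabled"
KERNEL = "mem = 25165824K (0x600000000)"

if __name__ == "__main__":
    kb, nbytes = kernel_mem(KERNEL)
    print("banner MB:", banner_mb(BANNER))
    print("kernel MB:", kb // 1024)
    print("hex value matches K value:", nbytes == kb * 1024)
    print("disabled on CPU %d: %d of %d MB" % disabled_mb(NOTICE))
```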

V440

Using the ASR commands to manually enable or disable CPUs on V480/V880

The user level commands 'asr-enable' and 'asr-disable' can be used to manually enable or disable system devices. To view the full list of devices that can be enabled (or disabled) type 'asr-enable' at the ok prompt (the example output is for V480):

{2} ok asr-enable

Usage: asr-enable <dev-id>
Where <dev-id> is an absolute device path, a device alias, or a device
label.
Valid device labels include: 
    cpu3-bank3      cpu3-bank2      cpu3-bank1      cpu3-bank0 
    cpu2-bank3      cpu2-bank2      cpu2-bank1      cpu2-bank0 
    cpu1-bank3      cpu1-bank2      cpu1-bank1      cpu1-bank0 
    cpu0-bank3      cpu0-bank2      cpu0-bank1      cpu0-bank0 
    pci-slot5       pci-slot4       pci-slot3       pci-slot2 
    pci-slot1       pci-slot0       gptwo-slotc     gptwo-slotb 
    gptwo-slota     ob-ide          ob-net0         ob-net1 
    ob-fcal         io-bridge9      io-bridge8      io-bridge5 
    cpu3            cpu2            cpu1            cpu0 
    *               cpu3-bank*      cpu2-bank*      cpu1-bank* 
    cpu0-bank*      pci*            pci-slot*       gptwo-slot* 
    io-bridge*      cpu* 
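
The wildcard labels at the end of that list look like glob patterns over the concrete labels. The Python sketch below illustrates that reading using the standard fnmatch module. How OBP itself expands the wildcards is not described here, so treat the glob-style matching as an assumption.

```python
from fnmatch import fnmatchcase

# Concrete device labels copied from the asr-enable usage message.
LABELS = """
cpu3-bank3 cpu3-bank2 cpu3-bank1 cpu3-bank0
cpu2-bank3 cpu2-bank2 cpu2-bank1 cpu2-bank0
cpu1-bank3 cpu1-bank2 cpu1-bank1 cpu1-bank0
cpu0-bank3 cpu0-bank2 cpu0-bank1 cpu0-bank0
pci-slot5 pci-slot4 pci-slot3 pci-slot2
pci-slot1 pci-slot0 gptwo-slotc gptwo-slotb
gptwo-slota ob-ide ob-net0 ob-net1
ob-fcal io-bridge9 io-bridge8 io-bridge5
cpu3 cpu2 cpu1 cpu0
""".split()


def expand(pattern, labels=LABELS):
    """Return the labels that a wildcard label would cover (glob semantics assumed)."""
    return [label for label in labels if fnmatchcase(label, pattern)]


if __name__ == "__main__":
    print(expand("cpu3-bank*"))
    print(expand("gptwo-slot*"))
```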

The .asr command is another user-level command; it displays the current status (enabled or disabled) of devices that are supported by ASR (the example output is for V480):

{0} ok .asr
ASR Disablement Status
Component:     Status

CPU/Memory:    Enabled
IO-Bridge5:    Enabled
IO-Bridge8:    Enabled
IO-Bridge9:    Enabled
GPTwo Slots:   Enabled
Onboard FCAL:  Enabled
Onboard Net1:  Enabled
Onboard Net0:  Enabled
Onboard IDE:   Enabled
PCI Slots:     Enabled

Normally, disabling a CPU with 'asr-disable' effectively disables the entire CPU module, so disabling CPU1 will also take CPU3 out of the system. To bring a CPU back after it has been disabled, you must 'asr-enable' the CPU and then power-cycle the system.

Similarly, if you have CPU1 & CPU3 disabled, then enabling (asr-enable) only CPU1 will still leave CPU3 disabled, so CPU1 will still be [effectively] disabled as well. You must enable both CPUs (and power-cycle) before either CPU is available. Simply asr-enabling a CPU and resetting the system isn't good enough; you must power-cycle.
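
The module rule can be written down as a small model. This Python sketch is a simplification: the text confirms only that CPU1 and CPU3 share a module, so pairing CPU0 with CPU2 and the module names "A" and "B" are assumptions here. It also ignores the power-cycle requirement and looks only at which CPUs end up usable.

```python
# Assumed layout of a 4-way V480: two CPUs per module.
# CPU1 + CPU3 on one module is stated in the text; CPU0 + CPU2 is inferred.
MODULES = {"A": (0, 2), "B": (1, 3)}


def available_cpus(asr_disabled):
    """Return the CPUs that remain usable, given the set of asr-disabled CPUs.

    A module is lost as a whole if either of its CPUs is asr-disabled.
    """
    usable = []
    for cpus in MODULES.values():
        if not any(cpu in asr_disabled for cpu in cpus):
            usable.extend(cpus)
    return sorted(usable)


if __name__ == "__main__":
    print(available_cpus({1}))      # disabling CPU1 also removes CPU3
    print(available_cpus({3}))      # enabling only CPU1 leaves CPU3 disabled
    print(available_cpus(set()))    # both enabled: all four CPUs usable
```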

You need to use the .asr command at the ok prompt to check the status of each CPU. The OBP command 'reset-all' should be used immediately after 'asr-enable' or 'asr-disable', so that these commands can take effect.

Here are some examples (based on a 4-way V480 server) of the steps you need to follow in order to bring a CPU (or CPUs) back after it has been disabled:

1. Example procedure to asr-disable and asr-enable single CPU (4-way system) :

 The steps to "asr-enable" a previously "asr-disable'd" CPU (this is not needed if the CPU was failed by POST; it is only needed when the CPU has been manually "asr-disable'd"):

a) ok asr-disable cpu1
b) ok reset-all    --> CPU1 and CPU3 (the other CPU on the same module) are now disabled and unavailable, and the system will respond with:
    Resetting ...
    WARNING: Offlining/Disabling CPU1...and CPU3...Done.

c) At this point if 'reset-all' is performed (or 'reset-all' followed by power cycle) CPU1 will still be unavailable. This can be verified via .env command (at the ok prompt), which will show the status only for CPU0&2, or at the OS level by using the commands 'psrinfo -v' and 'prtdiag -v'.

d) To enable CPU1:
     ok asr-enable cpu1
     ok .asr (to check status)
     ok reset-all   --> cpu1 is still unavailable (can be verified by using .env, which will only show the status for CPU0 & CPU2)
     Power-cycle (power-off/power-on) --> cpu1 & cpu3 are now available.
     This can be verified via the .env command (OBP level), which will now show the status for all 4 CPUs, or at the OS level by using the commands 'psrinfo -v' and 'prtdiag -v'.
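
The sequence in steps a) to d) comes down to two separate pieces of state: what .asr reports, and whether the CPUs are actually usable. The Python sketch below models that behavior as described in this procedure. The class and method names are invented for illustration, and it describes the documented behavior, not how the firmware is implemented.

```python
class CpuModule:
    """Toy model of one CPU module under ASR, following the steps above."""

    def __init__(self):
        self.asr_disabled = False  # what .asr would report
        self.online = True         # whether the CPUs are actually usable

    def asr_disable(self):
        self.asr_disabled = True

    def asr_enable(self):
        self.asr_disabled = False

    def reset_all(self):
        # A reset applies a pending disable but never revives an offline module.
        if self.asr_disabled:
            self.online = False

    def power_cycle(self):
        # Only a power-cycle brings an asr-enabled module back.
        self.online = not self.asr_disabled


if __name__ == "__main__":
    module = CpuModule()
    module.asr_disable()
    module.reset_all()
    print("after disable + reset-all:", module.online)
    module.asr_enable()
    module.reset_all()
    print("after enable + reset-all:", module.online)
    module.power_cycle()
    print("after power-cycle:", module.online)
```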

2. Example procedure to asr-disable and asr-enable CPU1 & CPU3 (4-way system):

{3} ok asr-disable cpu1
{3} ok asr-disable cpu3
{3} ok .asr (to check ASR Disablement Status)
Component:     Status

CPU0/Memory:   Enabled
CPU1:          Disabled
Memory Bank0:  Enabled
Memory Bank1:  Enabled
Memory Bank2:  Enabled
Memory Bank3:  Enabled
CPU2/Memory:   Enabled
CPU3:          Disabled
Memory Bank0:  Enabled
Memory Bank1:  Enabled
Memory Bank2:  Enabled
Memory Bank3:  Enabled
IO-Bridge5:    Enabled
IO-Bridge8:    Enabled
IO-Bridge9:    Enabled
GPTwo Slots:   Enabled
Onboard FCAL:  Enabled
Onboard Net1:  Enabled
Onboard Net0:  Enabled
Onboard IDE:   Enabled
PCI Slots:     Enabled

{3} ok reset-all
Resetting ... WARNING: Offlining/Disabling CPU1...and CPU3...Done.

To bring back CPU1 and CPU3, both CPUs need to be asr-enabled (if only CPU1 is enabled, after 'reset-all' the system will again offline (effectively disable) both CPU1 and CPU3):

ok asr-enable cpu1
ok asr-enable cpu3
ok reset-all
ok .asr (to check ASR Disablement Status)
Component:     Status

CPU/Memory:    Enabled
IO-Bridge5:    Enabled
IO-Bridge8:    Enabled
IO-Bridge9:    Enabled
GPTwo Slots:   Enabled
Onboard FCAL:  Enabled
Onboard Net1:  Enabled
Onboard Net0:  Enabled
Onboard IDE:   Enabled
PCI Slots:     Enabled

ok .env (will still not display the status for CPU1 & CPU3)

After a power-cycle both CPUs will be back online.

3. To disable and then enable the entire CPU module in Slot B (both CPU1 & CPU3) the following commands can be used as well:

{3} ok asr-disable gptwo-slotb
{3} ok .asr
ASR Disablement Status
Component:     Status

CPU/Memory:    Enabled
IO-Bridge5:    Enabled
IO-Bridge8:    Enabled
IO-Bridge9:    Enabled
GPTwo Slot A:  Enabled
GPTwo Slot B:  Disabled
GPTwo Slot C:  Enabled
Onboard FCAL:  Enabled
Onboard Net1:  Enabled
Onboard Net0:  Enabled
Onboard IDE:   Enabled
PCI Slots:     Enabled

{3} ok reset-all
Resetting ...

WARNING: Offlining/Disabling CPU1...and CPU3...Done.
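
The "Component: Status" table printed by .asr on the V480 is easy to scan from a console log. This Python sketch assumes the layout shown in the transcripts and uses hypothetical function names. It returns a list of pairs and not a dictionary, because component names such as "Memory Bank0" can appear more than once in the longer form of the table.

```python
def parse_asr_table(text):
    """Return [(component, status), ...] from V480-style .asr output."""
    status = []
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        # Skip the title and the "Component: Status" header line.
        if sep and value in ("Enabled", "Disabled"):
            status.append((name.strip(), value))
    return status


def disabled(status):
    """Return the names of all components reported as Disabled."""
    return [name for name, value in status if value == "Disabled"]


# Output copied from the transcript.
SAMPLE = """\
ASR Disablement Status
Component:     Status

CPU/Memory:    Enabled
IO-Bridge5:    Enabled
IO-Bridge8:    Enabled
IO-Bridge9:    Enabled
GPTwo Slot A:  Enabled
GPTwo Slot B:  Disabled
GPTwo Slot C:  Enabled
Onboard FCAL:  Enabled
Onboard Net1:  Enabled
Onboard Net0:  Enabled
Onboard IDE:   Enabled
PCI Slots:     Enabled
"""

if __name__ == "__main__":
    print(disabled(parse_asr_table(SAMPLE)))
```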

To bring back the CPUs in slot B use the command:

{0} ok asr-enable gptwo-slotb
{0} ok .asr
ASR Disablement Status
Component:     Status

CPU/Memory:    Enabled
IO-Bridge5:    Enabled
IO-Bridge8:    Enabled
IO-Bridge9:    Enabled
GPTwo Slots:   Enabled
Onboard FCAL:  Enabled
Onboard Net1:  Enabled
Onboard Net0:  Enabled
Onboard IDE:   Enabled
PCI Slots:     Enabled

After a 'reset-all' and power-cycle of the system the CPUs in slot B (cpu1 and cpu3) will be back online.

 
enable_and_disable_hardware_on_sun_and_solaris.txt · Last modified: 2012/03/24 03:11 (external edit)