FAQ¶
1. How does the hardware watchdog work?¶
The hardware watchdog is essentially a timer circuit. It monitors the operating status of the CPU and resets the CPU when it enters an abnormal state, so that the system can resume working.
The Android 10.0 factory firmware does not enable the hardware watchdog. Customers who want to use it from their own software can operate it as described below.
Enabling Watchdog
echo e > "/dev/wdt_crl"
Set the timeout (feed the dog)
Four timeout values are supported: 0.64 s, 2.56 s, 10.24 s and 40.96 s, selected by writing 0, 1, 2 or 3 respectively to the device node. Choose the timeout that suits your needs. For example, to set the timeout to 2.56 s:
echo 1 > "/dev/wdt_crl"
The software must write the timeout value at regular intervals to clear the watchdog timer (commonly known as “feeding the dog”).
When the CPU fails, it can no longer feed the dog, so the watchdog timer keeps counting until it overflows. The overflow generates a reset signal that resets the CPU and restarts the system.
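The enable, set-timeout and feed steps above can be combined into a small feeding loop. The following is a minimal sketch, not a reference implementation: the function names `wdt_write` and `wdt_feed_loop`, the `WDT_NODE` override and the optional feed count are hypothetical, and it assumes that rewriting the timeout index is what clears the timer, as described above. The feed interval must be shorter than the selected timeout, and writing to the node will likely require root.

```shell
# Device node from this FAQ; overridable so the logic can be tried on a plain file.
WDT_NODE="${WDT_NODE:-/dev/wdt_crl}"

# Write one control value (e, d, or a timeout index) to the watchdog node.
wdt_write() {
    printf '%s\n' "$1" > "$WDT_NODE"
}

# wdt_feed_loop INDEX INTERVAL [COUNT]
#   INDEX:    timeout index 0..3 (0.64 s, 2.56 s, 10.24 s, 40.96 s)
#   INTERVAL: seconds between feeds; must be shorter than the chosen timeout
#   COUNT:    number of feeds; 0 or omitted means feed forever (hypothetical option)
wdt_feed_loop() {
    index="$1"; interval="$2"; count="${3:-0}"
    case "$index" in
        0|1|2|3) ;;
        *) echo "timeout index must be 0, 1, 2 or 3" >&2; return 1 ;;
    esac
    wdt_write e                 # enable the watchdog
    fed=0
    while [ "$count" -eq 0 ] || [ "$fed" -lt "$count" ]; do
        wdt_write "$index"      # rewriting the timeout clears the timer
        fed=$((fed + 1))
        sleep "$interval"
    done
}

# Example (not run here): 2.56 s timeout, fed once per second.
# wdt_feed_loop 1 1
```

If the feeding process hangs or is killed, the writes stop and the watchdog resets the CPU once the timeout expires, which is the intended behaviour.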
Disabling Watchdog
echo d > "/dev/wdt_crl"
2. BMC FAQ¶
Troubleshooting firmware flashing¶
Background knowledge¶
Firmware flashing involves two main components:
Web frontend, which submits the flashing request and displays the progress.
netrecovery-master backend, which flashes the firmware and updates the progress.
The firmware flashing procedure is:
Switch the daughter board to Loader mode.
Flash the auxiliary upgrading firmware (called netrecovery) and reboot to it.
Obtain an IP address via DHCP.
Run the update program to fetch the firmware data from the motherboard over the network and flash it to the eMMC storage.
Reset the daughter board so that it reboots into the new firmware.
The prerequisites for firmware flashing are:
Make sure the daughter board can obtain an IP address via DHCP and can reach the motherboard over the network.
The USB OTG port of the cluster server must be disconnected from the PC.
Firmware files must be placed in the directory /home/firefly/Firmware and have the extension “.img”.
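The last prerequisite can be checked before submitting a request. The sketch below lists the “.img” files in the firmware directory and fails if there are none. The function name `list_firmware_images` and the `FIRMWARE_DIR` override are hypothetical; only the directory path and the extension come from this FAQ.

```shell
# Firmware directory from this FAQ; overridable for testing.
FIRMWARE_DIR="${FIRMWARE_DIR:-/home/firefly/Firmware}"

# Print every *.img file in the directory; return non-zero if none is found.
list_firmware_images() {
    dir="${1:-$FIRMWARE_DIR}"
    [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
    rc=1
    for f in "$dir"/*.img; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$f" ] || continue
        echo "$f"
        rc=0
    done
    return "$rc"
}

# Example (not run here):
# list_firmware_images || echo "no firmware image found"
```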
The progress is always 0% after submitting a request¶
Please check the journal of the netrecovery-master service:
sudo journalctl -f -u netrecovery-master
To restart the service, please run:
sudo systemctl restart netrecovery-master
The error “switch to recovery failed” occurred¶
This error occurs when the daughter board cannot be switched to Loader mode, or cannot be flashed with the netrecovery auxiliary upgrading firmware.
Please try again after some time. If it keeps failing, you have to flash the firmware over a USB cable.
Check the flashing log in detail¶
The example below uses the daughter board sub1-01:
# Check log on daughter board
$ bmc sub1-01 shell
$ cd /tmp/log
$ cat history # Check command history, the first number is the process number.
$ cat *.err # Check error output.
# Check log on the motherboard
$ cd /var/netrecovery/sub1-01/state
$ cat master-*.out
$ cat history # Log files from daughter board will be downloaded if possible
$ cat *.err # Check error output
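When several error files exist on the motherboard, it can help to print only the non-empty ones, each with its file name. The helper below is a sketch under that assumption: `show_flash_errors` and the `NETRECOVERY_DIR` override are hypothetical names, and the directory layout is the one shown in the motherboard example above.

```shell
# Base directory from the motherboard example; overridable for testing.
NETRECOVERY_DIR="${NETRECOVERY_DIR:-/var/netrecovery}"

# Print each non-empty *.err file for one daughter board, with a header line.
show_flash_errors() {
    node="$1"
    state_dir="$NETRECOVERY_DIR/$node/state"
    [ -d "$state_dir" ] || { echo "no state directory: $state_dir" >&2; return 1; }
    for f in "$state_dir"/*.err; do
        # Skip missing or empty error files.
        [ -s "$f" ] || continue
        echo "== $f"
        cat "$f"
    done
    return 0
}

# Example (not run here):
# show_flash_errors sub1-01
```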
Prometheus¶
Check Prometheus status¶
Check service status:
sudo systemctl status prometheus
Check service journal:
sudo journalctl -u prometheus
Check size of Prometheus database¶
sudo du -hs /var/lib/prometheus/metrics2
Adjust policy of Prometheus database storage¶
Please edit /etc/default/prometheus. The default setting is:
ARGS="--storage.tsdb.retention.time=7d --storage.tsdb.retention.size=4GB"
This keeps at most 7 days or 4 GB of data in the database, whichever limit is reached first.
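As an illustration, the ARGS line can be rewritten with a small helper instead of by hand. This is a sketch only: `set_prometheus_retention` is a hypothetical name, the values 14d and 8GB are arbitrary examples, and the helper assumes the file holds a single ARGS=... line whose only options are the two retention flags shown above (any other options on that line would be overwritten).

```shell
# set_prometheus_retention FILE TIME SIZE
# Replace the whole ARGS=... line in FILE with new retention settings.
set_prometheus_retention() {
    file="$1"; time="$2"; size="$3"
    args="--storage.tsdb.retention.time=$time --storage.tsdb.retention.size=$size"
    tmp=$(mktemp) || return 1
    # Rewrite into a temporary file, then copy back so the original
    # file keeps its ownership and permissions.
    sed "s|^ARGS=.*|ARGS=\"$args\"|" "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Example (not run here; needs root):
# set_prometheus_retention /etc/default/prometheus 14d 8GB
# systemctl restart prometheus
```

The service has to be restarted for the new retention settings to take effect.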
Reset Prometheus database¶
The following commands delete the entire Prometheus database and restart the service:
```shell
sudo systemctl stop prometheus
sudo rm -rf /var/lib/prometheus/metrics2/*
sudo systemctl start prometheus
```
node_exporter collector¶
node_filefd_allocated¶
node_filefd_allocated is the number of allocated file descriptors. A larger number means more files are open.
Please reference:
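node_context_switches_total¶
node_context_switches_total is the cumulative number of context switches. It is a counter, so apply rate() to it (for example rate(node_context_switches_total[5m])) to get context switches per second. A higher rate means context switches are more frequent.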
Please reference:
3. How to set a static IP or DHCP on the daughter board (Android) with commands?¶
Static IP¶
Taking sub01 as an example, with static IP 192.168.1.240, netmask 255.255.0.0, gateway 192.168.1.1, primary DNS 202.96.128.86 and secondary DNS 202.96.128.166:
```shell
bmc_adb -s sub01 root
bmc_adb -s sub01 remount
bmc_adb -s sub01 shell "fireflyapi ethernet setIpAddress 1 192.168.1.240 255.255.0.0 192.168.1.1 202.96.128.86 202.96.128.166"
```
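When the same settings have to be applied to several daughter boards, the three commands above can be wrapped in a function. The following is a sketch: `set_static_ip` and the `BMC_ADB` override are hypothetical, and the argument order simply mirrors the `setIpAddress 1 ...` call shown above (IP, netmask, gateway, primary DNS, secondary DNS).

```shell
# Command used to reach the daughter board; overridable so the sequence
# can be tried against a stub instead of real hardware.
BMC_ADB="${BMC_ADB:-bmc_adb}"

# set_static_ip NODE IP NETMASK GATEWAY DNS1 DNS2
set_static_ip() {
    node="$1"
    [ "$#" -eq 6 ] || {
        echo "usage: set_static_ip NODE IP NETMASK GATEWAY DNS1 DNS2" >&2
        return 1
    }
    shift
    # Stop at the first failing step.
    "$BMC_ADB" -s "$node" root &&
    "$BMC_ADB" -s "$node" remount &&
    "$BMC_ADB" -s "$node" shell "fireflyapi ethernet setIpAddress 1 $*"
}

# Example (not run here):
# set_static_ip sub01 192.168.1.240 255.255.0.0 192.168.1.1 202.96.128.86 202.96.128.166
```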
DHCP¶
Take sub01 as an example:
```shell
bmc_adb -s sub01 root
bmc_adb -s sub01 remount
bmc_adb -s sub01 shell "fireflyapi ethernet setIpAddress 0"
```