FAQ¶

1. How does the hardware watchdog work?¶

The hardware watchdog is essentially a timer circuit. It monitors the running state of the CPU and resets the CPU when it hangs, so that the system can recover and resume work.

The Android 10.0 factory firmware does not enable the hardware watchdog. If you need to use the hardware watchdog from your own software, operate it as described below.

Enabling Watchdog

echo e > "/dev/wdt_crl"

Set the timeout (feed the dog)

Four timeout values are supported: 0.64s, 2.56s, 10.24s and 40.96s, selected by writing 0, 1, 2 or 3 respectively to the device node. Choose the timeout that suits your application. For example, to set the timeout to 2.56s:

echo 1 > "/dev/wdt_crl"

Your program must write the timeout value to the device node at regular intervals, each shorter than the timeout itself, to clear the watchdog timer (commonly known as “feeding the dog”).

If the CPU hangs, it can no longer feed the dog. The watchdog timer then keeps counting until it overflows and generates a reset signal, which restarts the system so that the CPU can work again.

Disabling Watchdog

echo d > "/dev/wdt_crl"
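
The enable, feed and disable commands above can be wrapped into a few helper functions. This is only a minimal sketch: the function names and the `WDT_NODE` variable are made up for illustration, and it assumes, as described above, that writing the timeout code again is what feeds the dog.

```shell
#!/bin/sh
# Hypothetical helpers around the watchdog node described above.
# WDT_NODE can be overridden, e.g. to try the script against a plain file.
WDT_NODE="${WDT_NODE:-/dev/wdt_crl}"

# Map a timeout in seconds to the code written to the device node.
wdt_code() {
    case "$1" in
        0.64)  echo 0 ;;
        2.56)  echo 1 ;;
        10.24) echo 2 ;;
        40.96) echo 3 ;;
        *)     echo "unsupported timeout: $1" >&2; return 1 ;;
    esac
}

wdt_enable()  { echo e > "$WDT_NODE"; }
wdt_disable() { echo d > "$WDT_NODE"; }

# Feed the dog once by writing the code for the given timeout.
wdt_feed() {
    code=$(wdt_code "$1") || return 1
    echo "$code" > "$WDT_NODE"
}

# Enable the watchdog, then feed it every $2 seconds until a write fails.
# $1 is the timeout in seconds; $2 should be clearly shorter than $1.
wdt_feed_loop() {
    wdt_enable || return 1
    while wdt_feed "$1"; do
        sleep "$2"
    done
}
```

For example, `wdt_feed_loop 10.24 5` would select the 10.24s timeout and feed every 5 seconds. A production feeder would normally live inside the application being supervised, so that a hang in that application actually stops the feeding.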

2. BMC FAQ¶

All devices are unavailable in BMC¶

This can have several causes. Please check the following in order:

Limit the viewing period to the last half hour.
Run adb devices on the motherboard.

If no devices are listed, first make sure that the USB OTG port is disconnected from any PC. If it is already disconnected, check the connection and power supply of the daughter boards to rule out hardware problems.

Run bmc query on the motherboard. If everything is normal, the output should look like this:

root@firefly:~# bmc query
node_cluster_up{instance="127.0.0.1:9100", job="node", nodename="main", subnode="main"} => 1 @[1603356844.328]
node_cluster_up{instance="127.0.0.1:9100", job="node", nodename="main", subnode="sub01"} => 1 @[1603356844.328]
node_cluster_up{instance="127.0.0.1:9100", job="node", nodename="main", subnode="sub02"} => 1 @[1603356844.328]
...

If there is no output, check whether node_exporter is running on the motherboard:

root@firefly:~# bmc main metrics | grep node_cluster_up
# HELP node_cluster_up Value is 1 if the cluster subnode is 'up', 0 otherwise.
# TYPE node_cluster_up gauge
node_cluster_up{state="android",subnode="sub02"} 1
node_cluster_up{state="android",subnode="sub03"} 1
...

If there is no output, node_exporter is failing to run on the motherboard.

If there is output, the Prometheus service may be failing to collect the monitoring data of the motherboard.

Run sudo systemctl status prometheus on the motherboard to check the service status.
Run sudo journalctl -u prometheus on the motherboard to check the service journal for the detailed cause of the failure.
- If the journal contains: “Handle Corrupt Prometheus Write-Ahead Log (WAL)”
  
  Please delete the corrupt file named in the journal and restart the service:
```
sudo systemctl restart prometheus
```
- If the journal contains: “Error on ingesting out-of-order samples”
  
  The time on the motherboard is not consistent with the time in the browser. For example, if it is July 1 on the motherboard and July 21 in the browser, the browser sends queries for data that, from the backend server's point of view, lies in the future, and it receives empty results.
  
  By default the system uses NTP to synchronize the system time. If network connectivity is unavailable for any reason, you need to set the correct system time manually:
```
sudo timedatectl set-ntp false                   # Disable NTP
sudo timedatectl set-timezone Asia/Shanghai      # Set timezone
sudo timedatectl set-time "2021-10-14 15:48:29"  # Set datetime
timedatectl status                               # Check result
```
  Then reset the Prometheus database:
```
sudo systemctl stop prometheus
sudo rm -rf /var/lib/prometheus/metrics2/*
sudo systemctl start prometheus
```
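
The first checks above can be sketched as small helper functions that work on the command output. This is an illustrative sketch only: the function names are made up, the `adb devices` parsing assumes its usual two-column listing, and the `bmc query` parsing assumes the line format shown in the sample output above, with any value other than 1 treated as "down".

```shell
#!/bin/sh
# Count devices in the "device" state from `adb devices` output on stdin.
count_adb_devices() {
    awk '$2 == "device" { n++ } END { print n + 0 }'
}

# Print the subnodes whose node_cluster_up value is not 1,
# reading `bmc query` output on stdin.
list_down_subnodes() {
    awk '/^node_cluster_up/ {
        if (match($0, /subnode="[^"]*"/)) {
            # Strip the leading subnode=" (9 chars) and the closing quote.
            name = substr($0, RSTART + 9, RLENGTH - 10)
            split($0, parts, " => ")
            split(parts[2], value, " ")
            if (value[1] != "1") print name
        }
    }'
}

# Suggest where to look next, given the output of `bmc query` ($1) and of
# `bmc main metrics | grep node_cluster_up` ($2).
bmc_diagnose() {
    if [ -n "$1" ]; then
        echo "query-ok"
    elif [ -z "$2" ]; then
        echo "node_exporter-failed"
    else
        echo "check-prometheus"
    fi
}
```

On the motherboard these could be used as `adb devices | count_adb_devices`, `bmc query | list_down_subnodes` and `bmc_diagnose "$(bmc query)" "$(bmc main metrics | grep node_cluster_up)"`.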

Troubleshooting on firmware flashing¶

Background knowledge¶

There are two main components in firmware flashing:

web frontend, which submits flashing requests and displays the progress.
netrecovery-master backend, which flashes the firmware and updates the progress.

The procedure of firmware flashing is:

Switch the daughter board to Loader mode.
Flash the auxiliary upgrading firmware (called netrecovery) and reboot to it.
Get DHCP IP address.
Run the update program to fetch the firmware data from the motherboard over the network and flash it to the eMMC storage.
Reset the daughter board to reboot into the new firmware.

The prerequisites to firmware flashing are:

Make sure the daughter board can obtain an IP address via DHCP and can reach the motherboard over the network.
The USB OTG port of the cluster server must be disconnected from the PC.
Firmware files must be placed in the directory /home/firefly/Firmware and have the extension “.img”.
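
The last prerequisite is easy to verify from a shell. The sketch below uses a hypothetical helper, `list_firmware`, that lists the “.img” files in the firmware directory and returns a non-zero status when none are found; the `FIRMWARE_DIR` variable is an assumption added so the directory can be overridden.

```shell
#!/bin/sh
# Directory named in the prerequisites above; overridable for testing.
FIRMWARE_DIR="${FIRMWARE_DIR:-/home/firefly/Firmware}"

# Print the names of all regular *.img files in the directory.
# Returns 0 if at least one file was found, 1 otherwise.
list_firmware() {
    dir="${1:-$FIRMWARE_DIR}"
    found=1
    for f in "$dir"/*.img; do
        [ -f "$f" ] || continue   # skips the unexpanded pattern when nothing matches
        basename "$f"
        found=0
    done
    return "$found"
}
```

Running `list_firmware || echo "no firmware found"` on the motherboard gives a quick check before a flashing request is submitted.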

The progress is always 0% after submitting a request¶

Please check the journal of netrecovery-master service:

sudo journalctl -f -u netrecovery-master

To restart the service, please run:

sudo systemctl restart netrecovery-master

Error occurred as “switch to recovery failed”¶

This error occurs when the daughter board cannot be switched to Loader mode, or cannot be flashed with the netrecovery auxiliary upgrading firmware.

Please try again after some time. If it keeps failing, you have to flash the firmware over the USB cable instead.

Check the flashing log in detail¶

The example below takes the daughter board sub1-01 as an example:

# Check log on daughter board
$ bmc sub1-01 shell
$ cd /tmp/log
$ cat history  # Check command history, the first number is the process number.
$ cat *.err    # Check error output.

# Check log on the motherboard
$ cd /var/netrecovery/sub1-01/state
$ cat master-*.out
$ cat history  # Log files from daughter board will be downloaded if possible
$ cat *.err    # Check error output

Prometheus¶

Check Prometheus status¶

Check service status:

sudo systemctl status prometheus

Check service journal:

sudo journalctl -u prometheus

Check size of Prometheus database¶

sudo du -hs /var/lib/prometheus/metrics2

Adjust policy of Prometheus database storage¶

Please edit /etc/default/prometheus. The default setting is:

ARGS="--storage.tsdb.retention.time=7d --storage.tsdb.retention.size=4GB"

which keeps at most 7 days or 4 GB of data in the database, whichever limit is reached first.
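
As an illustration, the retention values could be changed with a small helper like the one below. This is a sketch under assumptions: `set_retention` is a made-up name, and it rewrites the whole `ARGS=` line, so it only fits a file whose `ARGS` holds nothing but the two retention flags shown above.

```shell
#!/bin/sh
# Rewrite the ARGS line of a defaults file with new retention limits.
# $1: path to the file, $2: retention time (e.g. 3d), $3: retention size (e.g. 2GB)
set_retention() {
    conf="$1"
    tmp=$(mktemp) || return 1
    # Write to a temporary file first, then copy the content back so that
    # the ownership and permissions of the original file are preserved.
    sed "s|^ARGS=.*|ARGS=\"--storage.tsdb.retention.time=$2 --storage.tsdb.retention.size=$3\"|" "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}
```

Run as root, `set_retention /etc/default/prometheus 3d 2GB` would lower the limits to 3 days or 2 GB. The service presumably needs a restart with sudo systemctl restart prometheus before new values take effect.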

Reset Prometheus database¶

The following commands delete the entire Prometheus database and restart the service:

```shell
sudo systemctl stop prometheus
sudo rm -rf /var/lib/prometheus/metrics2/*
sudo systemctl start prometheus
```

node_exporter collector¶

`node_filefd_allocated`¶

node_filefd_allocated is the number of allocated file descriptors. A larger number means more files are open.

For details, see:

https://www.robustperception.io/kernel-file-descriptor-metrics-from-the-node-exporter
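
The same figure can be read directly on a Linux board from /proc/sys/fs/file-nr, whose three fields are the number of allocated file handles, the number of allocated but unused handles, and the system-wide maximum. The helper names below are made up for illustration.

```shell
#!/bin/sh
# Print the number of allocated file handles (first field of file-nr).
filefd_allocated() {
    awk '{ print $1 }' "${1:-/proc/sys/fs/file-nr}"
}

# Print the system-wide maximum number of file handles (third field).
filefd_maximum() {
    awk '{ print $3 }' "${1:-/proc/sys/fs/file-nr}"
}
```

Comparing the two values shows how close the system is to its file handle limit.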

`node_context_switches_total`¶

node_context_switches_total is the cumulative number of context switches since boot. It is a counter, so the per-second frequency is obtained from its rate of change: the faster the value grows, the more frequent the context switches are.
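
On Linux this counter corresponds to the `ctxt` line of /proc/stat, so a per-second rate can be estimated by sampling it twice. The function names below are hypothetical, and the sketch uses integer arithmetic only, so the interval must be a whole number of seconds.

```shell
#!/bin/sh
# Print the cumulative context switch count from a /proc/stat style file.
read_ctxt() {
    awk '$1 == "ctxt" { print $2 }' "${1:-/proc/stat}"
}

# Average context switches per second between two samples.
# $1: first sample, $2: second sample, $3: seconds between them
ctxt_rate() {
    echo $(( ($2 - $1) / $3 ))
}

# Sample /proc/stat twice, $1 seconds apart (default 1), and print the rate.
ctxt_rate_live() {
    secs="${1:-1}"
    first=$(read_ctxt)
    sleep "$secs"
    second=$(read_ctxt)
    ctxt_rate "$first" "$second" "$secs"
}
```

For example, `ctxt_rate_live 5` prints the average number of context switches per second over a five-second window.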