FAQ

1. How does the hardware watchdog work?

The core of the hardware watchdog is a timer circuit. Its main
function is to monitor the operating status of the CPU and to reset
the CPU when it enters an abnormal state, so that it can resume normal
operation.

The Android 10.0 factory firmware does not enable the hardware
watchdog. Customers who want to use the hardware watchdog from their
own software can operate it as follows.

Enabling Watchdog

   echo e > "/dev/wdt_crl"

Set the timeout (feed the dog)

Four timeout values are supported: 0.64 s, 2.56 s, 10.24 s and
40.96 s, selected by writing 0, 1, 2 or 3 respectively to the device
node. Customers can choose the timeout that suits their needs. For
example, to set the timeout to 2.56 s:

   echo 1 > "/dev/wdt_crl"

The software must write the timeout value again at regular intervals,
shorter than the timeout itself, to clear the watchdog timer (commonly
known as "feeding the dog").

If the CPU fails, it can no longer feed the dog, so the watchdog timer
keeps counting until it overflows. The overflow generates a reset
signal that resets the CPU and restarts the system, so that the CPU
can work again.
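The feeding scheme described above can be sketched in shell. This is a
minimal illustration, not part of the firmware: the helper names
wdt_timeout_seconds and wdt_feed are hypothetical, and the sketch
assumes, as stated above, that writing the timeout value to the device
node again clears the timer.

```shell
# Sketch only: helper names are hypothetical, not part of the firmware.

# Map a timeout index (0-3) to its timeout in seconds, per the list above.
wdt_timeout_seconds() {
    case "$1" in
        0) echo 0.64 ;;
        1) echo 2.56 ;;
        2) echo 10.24 ;;
        3) echo 40.96 ;;
        *) echo "invalid timeout index: $1" >&2; return 1 ;;
    esac
}

# Feed the dog once by writing the timeout index to the device node.
# Usage: wdt_feed NODE INDEX
wdt_feed() {
    wdt_timeout_seconds "$2" >/dev/null || return 1
    echo "$2" > "$1"
}

# Example loop (commented out so the sketch has no side effects):
#   echo e > /dev/wdt_crl                 # enable the watchdog
#   while wdt_feed /dev/wdt_crl 2; do     # index 2 = 10.24 s timeout
#       sleep 5                           # feed well within the timeout
#   done
```

The feeding interval should leave a comfortable margin below the
selected timeout, so that ordinary scheduling delays do not trigger a
reset.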

Disabling Watchdog

   echo d > "/dev/wdt_crl"

2. BMC FAQ

All devices are unavailable in BMC

There are several possible causes. Please check the following items in
order:

Limit the viewing period to the last half hour.

Run adb devices on the motherboard. If no devices are listed, make
sure that the USB OTG port is disconnected from any PC. If it is
already disconnected, check the connection and power of the daughter
boards to rule out hardware problems.

Run bmc query on the motherboard. If everything is normal, the output
should look like this:

   root@firefly:~# bmc query
   node_cluster_up{instance="127.0.0.1:9100", job="node", nodename="main", subnode="main"} => 1 @[1603356844.328]
   node_cluster_up{instance="127.0.0.1:9100", job="node", nodename="main", subnode="sub01"} => 1 @[1603356844.328]
   node_cluster_up{instance="127.0.0.1:9100", job="node", nodename="main", subnode="sub02"} => 1 @[1603356844.328]
   ...

If there is no output, check whether node_exporter is running on the
motherboard:

   root@firefly:~# bmc main metrics | grep node_cluster_up
   # HELP node_cluster_up Value is 1 if the cluster subnode is 'up', 0 otherwise.
   # TYPE node_cluster_up gauge
   node_cluster_up{state="android",subnode="sub02"} 1
   node_cluster_up{state="android",subnode="sub03"} 1
   ...

If this command prints nothing, node_exporter has failed to run on the
motherboard. If it prints some output, the Prometheus service may be
failing to collect the monitoring data from the motherboard.
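To spot failed subnodes quickly, the bmc query output can be filtered.
The sketch below is an illustration only: list_down_subnodes is a
hypothetical helper, and it assumes that bmc query prints one sample
per line in the format shown above.

```shell
# Sketch only: list_down_subnodes is a hypothetical helper.
# Reads `bmc query` output on stdin (one sample per line, assumed) and
# prints the name of every subnode whose node_cluster_up value is not 1.
list_down_subnodes() {
    sed -n 's/.*subnode="\([^"]*\)"} => \([0-9.]*\) @.*/\1 \2/p' |
        awk '$2 != 1 { print $1 }'
}

# Usage on the motherboard:
#   bmc query | list_down_subnodes
```

An empty result means every subnode that reported a sample is up. A
subnode that is missing from the bmc query output entirely is not
detected by this filter.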

Run sudo systemctl status prometheus on the motherboard to check the
service status.

Run sudo journalctl -u prometheus on the motherboard to check the
service journal for the detailed reason of the failure.

If the journal contains "Handle Corrupt Prometheus Write-Ahead Log
(WAL)", delete the corrupt file it names and restart the service:

   sudo systemctl restart prometheus

If the journal contains "Error on ingesting out-of-order samples", the
time on the motherboard is not consistent with the time in the
browser. For example, if it is July 1 on the motherboard and July 21
in the browser, the browser requests data that, from the backend
server's point of view, lies in the future, and it receives an empty
reply.

By default the system uses NTP to synchronize the system time. If the
network is unavailable for any reason, set the correct system time
manually:

   sudo timedatectl set-ntp false                   # Disable NTP
   sudo timedatectl set-timezone Asia/Shanghai      # Set timezone
   sudo timedatectl set-time "2021-10-14 15:48:29"  # Set datetime
   timedatectl status                               # Check result

Then reset the Prometheus database:

   sudo systemctl stop prometheus
   sudo rm -rf /var/lib/prometheus/metrics2/*
   sudo systemctl start prometheus
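The two journal checks above can be automated with a small filter.
This is a sketch under assumptions: classify_prometheus_journal is a
hypothetical helper, and it matches only fragments of the two messages
quoted above, so the real journal wording may need adjusting.

```shell
# Sketch only: classify_prometheus_journal is a hypothetical helper.
# Reads journal text on stdin and prints which known failure it contains.
classify_prometheus_journal() {
    journal=$(cat)
    case "$journal" in
        *'Write-Ahead Log (WAL)'*) echo "corrupt-wal" ;;
        *'out-of-order samples'*)  echo "clock-mismatch" ;;
        *)                         echo "unknown" ;;
    esac
}

# Usage on the motherboard:
#   sudo journalctl -u prometheus --no-pager | classify_prometheus_journal
```

A result of corrupt-wal points to the WAL cleanup step, and
clock-mismatch points to the time synchronization step.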

Troubleshooting on firmware flashing

Background knowledge

Firmware flashing involves two main components:

The web frontend, which submits the flashing request and displays the
progress.

The netrecovery-master backend, which flashes the firmware and updates
the progress.

The procedure of firmware flashing is:

Switch the daughter board to Loader mode.

Flash the auxiliary upgrading firmware (called netrecovery) and reboot
to it.

Obtain an IP address through DHCP.

Run the update program to fetch the firmware data from the motherboard
over the network and flash it to the eMMC storage.

Reset the daughter board to reboot into the new firmware.

The prerequisites for firmware flashing are:

The daughter board must be able to obtain an IP address through DHCP
and reach the motherboard over the network.

The USB OTG port of the cluster server must be disconnected from the
PC.

The firmware file must be placed in the directory
/home/firefly/Firmware and have the extension ".img".
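The firmware file prerequisite can be checked before submitting a
request. The sketch below is illustrative only; list_firmware_images
is a hypothetical helper, and the directory path is the one given
above.

```shell
# Sketch only: list_firmware_images is a hypothetical helper.
# Print every regular file ending in .img in the given directory
# (default: /home/firefly/Firmware); return non-zero if there is none.
list_firmware_images() {
    dir=${1:-/home/firefly/Firmware}
    found=1
    for f in "$dir"/*.img; do
        [ -f "$f" ] || continue
        echo "$f"
        found=0
    done
    return "$found"
}

# Usage on the motherboard:
#   list_firmware_images || echo "no .img firmware found" >&2
```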

The progress is always 0% after submitting a request

Please check the journal of "netrecovery-master" service:

   sudo journalctl -f -u netrecovery-master

To restart the service, please run:

   sudo systemctl restart netrecovery-master

Error "switch to recovery failed"

This error occurs when the daughter board cannot be switched to Loader
mode, or cannot be flashed with the "netrecovery" auxiliary upgrading
firmware.

Please try again after some time. If it keeps failing, flash the
firmware over the USB cable instead.

Check the flashing log in detail

The example below uses the daughter board "sub1-01":

   # Check log on daughter board
   $ bmc sub1-01 shell
   $ cd /tmp/log
   $ cat history  # Check command history, the first number is the process number.
   $ cat *.err    # Check error output.

   # Check log on the motherboard
   $ cd /var/netrecovery/sub1-01/state
   $ cat master-*.out
   $ cat history  # Log files from daughter board will be downloaded if possible
   $ cat *.err    # Check error output
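When a log directory holds many files, it can help to print only the
error files that actually contain something. The following is a
minimal sketch; show_flash_errors is a hypothetical helper, and the
directory in the usage comment is the motherboard path shown above.

```shell
# Sketch only: show_flash_errors is a hypothetical helper.
# Print each non-empty *.err file in a log directory, preceded by a
# header line naming the file.
show_flash_errors() {
    for f in "$1"/*.err; do
        [ -s "$f" ] || continue
        echo "== $f"
        cat "$f"
    done
}

# Usage on the motherboard:
#   show_flash_errors /var/netrecovery/sub1-01/state
```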

Prometheus

Check Prometheus status

Check service status:

   sudo systemctl status prometheus

Check service journal:

   sudo journalctl -u prometheus

Check size of Prometheus database

   sudo du -hs /var/lib/prometheus/metrics2

Adjust policy of Prometheus database storage

Please edit "/etc/default/prometheus". The default setting is:

   ARGS="--storage.tsdb.retention.time=7d --storage.tsdb.retention.size=4GB"

which means that data is kept for at most 7 days or until the database
reaches 4 GB, whichever limit is reached first.
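To confirm which retention limits are currently configured, the flags
can be extracted from the defaults file. This is a sketch only;
prometheus_retention is a hypothetical helper, and it relies on the
-o option of grep, which common implementations support but POSIX does
not require.

```shell
# Sketch only: prometheus_retention is a hypothetical helper.
# Print the retention flags found in a Prometheus defaults file
# (default: /etc/default/prometheus), one flag per line.
prometheus_retention() {
    file=${1:-/etc/default/prometheus}
    grep -o -- '--storage\.tsdb\.retention\.[a-z]*=[^ "]*' "$file"
}

# Usage on the motherboard:
#   prometheus_retention
```

After editing the file, restart the service with sudo systemctl
restart prometheus so that the new limits take effect.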

Reset Prometheus database

The following commands delete the entire Prometheus database and
restart the service:

   ```shell
   sudo systemctl stop prometheus
   sudo rm -rf /var/lib/prometheus/metrics2/*
   sudo systemctl start prometheus
   ```

node_exporter collector

node_filefd_allocated

"node_filefd_allocated" is the number of allocated file descriptors. A
larger number means that more files are open.

For reference, see:

https://www.robustperception.io/kernel-file-descriptor-metrics-from-the-node-exporter

node_context_switches_total

"node_context_switches_total" is the cumulative number of context
switches since boot. It is a counter, so apply rate() to it to obtain
the number of context switches per second. A higher rate means that
context switches are more frequent.

For reference, see:

https://stackoverflow.com/questions/56724508/what-is-rate-node-context-switches-total-ans-why-ratenode-context-switches-tota
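Because the metric is a counter, a per-second figure comes from the
difference between two samples divided by the time between them. The
sketch below shows that arithmetic only; counter_rate is a hypothetical
helper, and unlike the Prometheus rate() function it does not handle
counter resets.

```shell
# Sketch only: counter_rate is a hypothetical helper.
# Usage: counter_rate VALUE1 TIME1 VALUE2 TIME2   (times in seconds)
# Prints the average increase per second between the two samples.
counter_rate() {
    awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
        if (t2 <= t1) {
            print "second sample must be later than the first" > "/dev/stderr"
            exit 1
        }
        print (v2 - v1) / (t2 - t1)
    }'
}

# Example: 1000 switches at t=0 s and 1600 at t=60 s
#   counter_rate 1000 0 1600 60    # prints 10
```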

3. How to set a static IP or DHCP on the daughter board (Android) with
commands?

Static IP

Take sub01 as an example, static IP: 192.168.1.240, netmask:
255.255.0.0, gateway: 192.168.1.1, primary DNS: 202.96.128.86,
secondary DNS: 202.96.128.166

   ```shell
   bmc_adb -s sub01 root
   bmc_adb -s sub01 remount
   bmc_adb -s sub01 shell "fireflyapi ethernet setIpAddress 1 192.168.1.240 255.255.0.0 192.168.1.1 202.96.128.86 202.96.128.166"
   ```

DHCP

Take sub01 as an example:

   ```shell
   bmc_adb -s sub01 root
   bmc_adb -s sub01 remount
   bmc_adb -s sub01 shell "fireflyapi ethernet setIpAddress 0"
   ```