Script to wake up TNS backup server from another TNS to start replication task - I can't get it working

Hello,

I’m trying to automate my replication task from my TNS-Fileserver to my TNS-Backup.

Both Servers have IPMI and run TNS 25.10.0.

I created the following script with AI, which I want to run as a CRON job from the main fileserver.

Basically: every 2 weeks, wake up the backup server, replicate the whole storage pool to it, and then shut it down again.

Unfortunately, I get the following error message when I try to run the CRON job:

CallError

[EFAULT] CronTask “/mnt/Storage/_Apps/Scripts/replication_orchestrator.sh > /dev/null 2> /dev/null” exited with 137 (non-zero) exit status.

Error Name: EFAULT
Error Code: 14
Error Class: CallError
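
A note on reading the exit status: by shell convention, a status above 128 means the process was terminated by a signal, and the signal number is the status minus 128. The sketch below decodes such a status. The function name `describe_exit_status` is hypothetical and not part of the original script. For 137 it reports signal 9 (SIGKILL), which suggests the script was killed from outside rather than failing at one of its own `exit 1` lines. It does not say what sent the signal.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): explain an exit status.
describe_exit_status() {
    local status="$1"
    if [ "$status" -eq 0 ]; then
        echo "success"
    elif [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
        # Statuses above 128 conventionally mean "terminated by signal (status - 128)".
        local sig=$((status - 128))
        # 'kill -l N' prints the name of signal number N.
        echo "killed by signal ${sig} ($(kill -l "$sig"))"
    else
        echo "exited with error code ${status}"
    fi
}

describe_exit_status 137
```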

What am I doing wrong here? I would think this is a very common scenario. Are there other ways to achieve this automation?
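
Because the cron command redirects both stdout and stderr to /dev/null, every `echo "$(date): ..."` message from the script is discarded, so it is not visible how far the script got before it stopped. A minimal sketch of a logging wrapper follows. The function name `run_logged` and any log path passed to it are assumptions for illustration only.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command, append its stdout and stderr to a
# log file, and pass the command's exit status back to the caller.
run_logged() {
    local log_file="$1"
    shift
    echo "$(date): starting: $*" >> "$log_file"
    "$@" >> "$log_file" 2>&1
    local status=$?
    echo "$(date): finished with exit status ${status}" >> "$log_file"
    return "$status"
}
```

It could be called as `run_logged /path/to/replication.log /mnt/Storage/_Apps/Scripts/replication_orchestrator.sh`, where the log path is a placeholder. A simpler variant with the same intent is to replace `> /dev/null 2> /dev/null` in the cron command with `>> <logfile> 2>&1`.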

#!/bin/bash
# Script: replication_orchestrator.sh
# Runs on TrueNAS SCALE Server A (Source) via Cron Job

# --------------------------------------------------------
# 0. CONFIGURATION (Edit These Values)
# --------------------------------------------------------

# IPMI Configuration for Server B (Target)
IPMI_HOST="xxx.xxx.xxx.xxx"             # Server B's dedicated IPMI IP
IPMI_USER="zzz"
IPMI_PASS="zzz"

# Server B (Target) Network IP (for ping check and replication)
TARGET_NAS_IP="yyy.yyy.yyy.yyy"         
TARGET_NAS_USER="root"               

# SSH Key Path (Private Key for Server A accessing Server B)
SSH_KEY_PATH="/mnt/Storage/_Apps/Scripts/TNS-Backup Key_public_key_rsa" 

# Replication Parameters
SOURCE_DATASET="Storage/00_Storage"           # Dataset on Server A to send
TARGET_DATASET="Storage/00_Storage"         # Dataset on Server B to receive
WAIT_TIMEOUT="300"                    # Max time (seconds) to wait for Server B to boot

# TrueNAS Shutdown Command (Executed remotely on Server B)
# NOTE: Requires full path and mandatory 'reason' argument for SCALE 25.04+ [13-15].
REMOTE_SHUTDOWN_CMD="/usr/bin/midclt call system.shutdown \"Automated post-replication shutdown\""


# --------------------------------------------------------
# 1. WAKE SERVER B (TARGET) VIA IPMI
# --------------------------------------------------------
wake_server_b() {
    echo "$(date): Attempting to wake Server B (${IPMI_HOST}) via IPMI..."
    # Using the required ipmitool command structure [5]
    if ! /usr/bin/ipmitool -I lanplus -H "$IPMI_HOST" -U "$IPMI_USER" -P "$IPMI_PASS" chassis power on; then
        echo "$(date): ERROR: IPMI command failed. Check connectivity or credentials."
        exit 1
    fi
}


# --------------------------------------------------------
# 2. WAIT FOR SERVER B TO BOOT (PING LOOP)
# --------------------------------------------------------
wait_for_server_b() {
    echo "$(date): Waiting up to ${WAIT_TIMEOUT} seconds for Server B (${TARGET_NAS_IP}) to become reachable..."
    local end_time=$((SECONDS + WAIT_TIMEOUT))
    
    while [ $SECONDS -lt $end_time ]; do
        # Ping the target IP to confirm OS is booted and network is up [16, 17]
        if ping -c 1 -W 2 "$TARGET_NAS_IP" &> /dev/null; then
            echo "$(date): Server B is reachable. Continuing."
            return 0
        fi
        sleep 10
    done
    
    echo "$(date): ERROR: Server B did not become reachable. Aborting replication."
    exit 1
}


# --------------------------------------------------------
# 3. PERFORM ZFS REPLICATION (Example Placeholder)
# --------------------------------------------------------
perform_replication() {
    echo "$(date): Starting ZFS Replication..."
    
    # --- IMPORTANT: Custom snapshot management logic should be implemented here ---
    # This example assumes snapshots are managed separately and finds the latest local one.
    
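    # Find the most recently created snapshot on the source dataset
    # (-H drops the header line, -s creation sorts from oldest to newest)
    LATEST_SNAP=$(/usr/sbin/zfs list -H -t snapshot -o name -s creation -r -d 1 "$SOURCE_DATASET" | tail -n 1)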

    if [ -z "$LATEST_SNAP" ]; then
        echo "$(date): ERROR: No snapshots found on $SOURCE_DATASET."
        return 1
    fi
    echo "$(date): Found latest snapshot: $LATEST_SNAP. Sending..."

    # Execute ZFS send/receive over SSH using the private key [9, 10]
    # NOTE: The -I (incremental basis) argument is omitted for simplicity/initial send,
    # but highly recommended for subsequent runs.
    if /usr/sbin/zfs send -R "$LATEST_SNAP" | \
        /usr/bin/ssh -i "$SSH_KEY_PATH" "$TARGET_NAS_USER"@"$TARGET_NAS_IP" \
        "/usr/sbin/zfs receive -F -d $TARGET_DATASET" ; then
        
        echo "$(date): ZFS Replication completed successfully."
        return 0 # Success
    else
        echo "$(date): ZFS Replication FAILED (Exit Code: $?)."
        return 1 # Failure
    fi
}


# --------------------------------------------------------
# 4. SHUTDOWN SERVER B (TARGET)
# --------------------------------------------------------
shutdown_server_b() {
    echo "$(date): Replication complete. Initiating graceful shutdown of Server B (Target)."
    
    # Use SSH to remotely execute the middleware shutdown command on Server B [9, 10, 18]
    if /usr/bin/ssh -i "$SSH_KEY_PATH" "$TARGET_NAS_USER"@"$TARGET_NAS_IP" "$REMOTE_SHUTDOWN_CMD"; then
        echo "$(date): Server B shutdown sequence initiated successfully."
    else
        echo "$(date): WARNING: Failed to initiate graceful shutdown on Server B."
    fi
}


# --------------------------------------------------------
# MAIN EXECUTION FLOW
# --------------------------------------------------------

# Step 1: Power On Target Server
wake_server_b

# Step 2: Wait for Target Server to be ready
wait_for_server_b

# Step 3: Perform replication and check whether it succeeded [19]
if perform_replication; then
    # Step 4: If replication succeeded, shut down Server B
    shutdown_server_b
else
    echo "$(date): Orchestration finished with errors. Target server remains running for manual investigation."
    exit 1
fi
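
A note on step 2 of the script: a successful ping only shows that Server B's network stack is up, not that its SSH service is already accepting connections. A hedged alternative is to probe the SSH port itself. The sketch below assumes bash (for its `/dev/tcp` redirection) and the coreutils `timeout` command. The function name `wait_for_port` is hypothetical, and port 22 is an assumption about the target's SSH configuration.

```shell
#!/bin/bash
# Hypothetical helper: wait until a TCP port accepts connections.
# Arguments: host, port, maximum wait in seconds.
wait_for_port() {
    local host="$1" port="$2" timeout_s="$3"
    local deadline=$(( $(date +%s) + timeout_s ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
        # 'timeout 2' bounds each connection attempt to two seconds.
        if timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
            return 0
        fi
        sleep 1
    done
    return 1
}
```

In the script this could stand in for the ping test, for example `wait_for_port "$TARGET_NAS_IP" 22 "$WAIT_TIMEOUT"`.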

Thank you

I’m still working on the issue and it’s getting frustrating.

I did some analysis and it seems that I cannot SSH into the Backup server.

The replication task itself works over its SSH connection, but when I try to connect from the shell I get the following error:

ssh root@
root@: Permission denied (publickey).

I added the SSH public key from TNS-Fileserver as SSH key pair to TNS-Backup.
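
One thing that may be worth checking: `ssh -i` expects a private key (identity) file, while the file name in the script's `SSH_KEY_PATH` contains `public_key`. The sketch below classifies a key file by its first line. The function name `key_file_kind` is hypothetical, and the check is only a heuristic based on the usual OpenSSH and PEM file headers.

```shell
#!/bin/bash
# Hypothetical helper: guess whether a file holds a private or a public SSH key
# by looking at its first line only.
key_file_kind() {
    local first_line=""
    # Read the first line; a missing or unreadable file leaves it empty.
    { IFS= read -r first_line < "$1"; } 2>/dev/null || true
    case "$first_line" in
        -----BEGIN*PRIVATE\ KEY-----*) echo "private" ;;
        ssh-*|ecdsa-*)                 echo "public" ;;
        *)                             echo "unknown" ;;
    esac
}
```

If `key_file_kind "$SSH_KEY_PATH"` prints `public`, the script is handing ssh a file it cannot authenticate with. Note also that a plain `ssh root@...` from the shell, without `-i`, only tries the default identities of the current user, which may not be the key pair the replication task uses. Adding `-i <private key file> -o IdentitiesOnly=yes -v` to the manual test should show which key is actually offered.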