Solver Guide

QKAN supports multiple solver backends for computing quantum variational activation functions. Choose the right solver based on your hardware, model size, and deployment target.

Solver Overview

| Solver | Device    | Use Case                                                                                     | Install                                                  |
|--------|-----------|----------------------------------------------------------------------------------------------|----------------------------------------------------------|
| exact  | CPU / GPU | Default. Pure PyTorch reference implementation. Best for debugging and small models.           | Included with qkan                                       |
| flash  | GPU       | Triton fused kernels. Best speed/memory tradeoff for most GPU training.                        | Included with qkan[gpu]                                  |
| cutile | GPU       | cuTile (NVIDIA Tile Language) fused kernels. BF16/FP8 mixed precision with coalesced state layout. | pip install cuda-tile                                |
| cutn   | GPU / CPU | cuQuantum tensor network contraction. Best for extremely large layers near OOM.                | GPU: pip install cuquantum; CPU: pip install opt-einsum  |
| qml    | CPU       | PennyLane quantum circuits. For demonstration; not optimized.                                  | pip install pennylane                                    |
| qiskit | IBM QPU   | IBM Quantum backends via Qiskit Runtime. For real quantum device inference.                    | pip install qiskit qiskit-ibm-runtime                    |
| cudaq  | QPU / GPU | NVIDIA CUDA-Q. Supports AWS Braket QPUs, GPU simulators, and CPU simulators.                   | See CUDA-Q installation                                  |

Ansatz Choice

  • ``pz`` (default): Most reliable quality across tasks. Uses RZ-RY-RZ rotation layers with data re-uploading.

  • ``rpz``: Reduced pz encoding with trainable preactivation. Fewer parameters per layer.

  • ``real``: Real-valued ansatz (no complex arithmetic). Can be faster but may hurt accuracy on some tasks.

Recommendation: Start with ``pz``; switch to ``real`` only after validating it on your task.
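To make the ``pz`` structure concrete, here is a minimal single-qubit sketch of RZ-RY-RZ layers with data re-uploading. The exact way the input x enters each rotation angle is an assumption for illustration; consult the qkan source for the real parameterization.

```python
# Sketch of a pz-style ansatz on one qubit: RZ-RY-RZ rotation layers with
# data re-uploading (x added back into every layer's angles -- assumed form).
import numpy as np

def rz(phi):
    return np.array([[np.exp(-1j * phi / 2), 0],
                     [0, np.exp(1j * phi / 2)]])

def ry(phi):
    return np.array([[np.cos(phi / 2), -np.sin(phi / 2)],
                     [np.sin(phi / 2),  np.cos(phi / 2)]])

def pz_activation(x, thetas):
    """<Z> after L re-uploading layers; thetas has shape (L, 3)."""
    state = np.array([1.0 + 0j, 0.0 + 0j])                 # |0>
    for a, b, c in thetas:
        state = rz(c + x) @ ry(b + x) @ rz(a + x) @ state  # re-upload x
    alpha, beta = state
    return float(abs(alpha) ** 2 - abs(beta) ** 2)         # Born-rule <Z>

thetas = np.zeros((2, 3))              # 2 layers, all-zero parameters
print(pz_activation(0.0, thetas))      # identity circuit -> <Z> = 1.0
```

With zero input and zero parameters every rotation is the identity, so the qubit stays in |0> and the expectation is 1.0; nonzero angles trace out a smooth, trainable activation in x.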

Mixed Precision

The flash and cutile solvers support BF16 and FP8 compute via the c_dtype parameter:

qkan = QKAN([10, 10], solver="flash", c_dtype=torch.bfloat16, device="cuda")

  • c_dtype: Compute dtype for quantum simulation kernels (state vectors, trig ops).

  • p_dtype: Parameter storage dtype (theta, preacts). Keep at float32.

Performance (from #12):

  • BF16: 2.3–2.5x faster training, 45% less peak memory, identical convergence.

  • FP8 (torch.float8_e4m3fn): Additional memory savings for state checkpoints via prescaled storage.

  • All ansatzes (pz, rpz, real) are supported.

Note

p_dtype=torch.float8_e4m3fn is not supported — PyTorch has no FP8 arithmetic kernels. Use FP8 only for c_dtype.
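The reason parameters stay in float32 can be seen without any quantum machinery: BF16 keeps only 8 mantissa bits, so small gradient updates vanish when accumulated into a BF16 parameter. This standalone numpy demonstration emulates BF16 rounding by truncating a float32 to its top 16 bits (it illustrates the precision argument only, not qkan internals).

```python
# Why p_dtype stays float32 while c_dtype can be BF16: a small update added
# to a BF16-stored parameter is rounded away entirely.
import numpy as np

def to_bf16(x):
    """Round-toward-zero emulation of BF16: keep the high 16 bits of float32."""
    bits = np.float32(x).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

theta = 1.0
update = 1e-3                          # a typical small gradient step

fp32_new = np.float32(theta) + np.float32(update)      # update retained
bf16_new = to_bf16(to_bf16(theta) + to_bf16(update))   # update lost

print(float(fp32_new))   # ~1.001
print(float(bf16_new))   # 1.0 -- the step disappeared
```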

Real Quantum Device Deployment

QKAN can run inference on real quantum hardware. The workflow:

  1. Train locally with exact or flash solver on CPU/GPU.

  2. Transfer weights to a device-backed model using initialize_from_another_model.

  3. Run inference on the QPU.

IBM Quantum (Qiskit)

from qiskit_ibm_runtime import QiskitRuntimeService

service = QiskitRuntimeService(channel="ibm_quantum_platform")
backend = service.least_busy(operational=True, simulator=False)

ibm_model = QKAN(
    [1, 2, 1], solver="qiskit", fast_measure=False,
    solver_kwargs={
        "backend": backend,
        "shots": 1000,
        "optimization_level": 3,
        "parallel_qubits": backend.num_qubits,
    },
)
ibm_model.initialize_from_another_model(trained_model)

AWS Braket via CUDA-Q

qpu_model = QKAN(
    [1, 2, 1], solver="cudaq", fast_measure=False,
    solver_kwargs={
        "target": "braket",
        "machine": "arn:aws:braket:us-west-1::device/qpu/rigetti/Ankaa-3",
        "shots": 1000,
        "parallel_qubits": 20,
    },
)
qpu_model.initialize_from_another_model(trained_model)

Key parameters:

  • fast_measure=False: Required for real devices. Uses |alpha|^2 - |beta|^2 (Born rule) instead of the quantum-inspired |alpha| - |beta| shortcut.

  • parallel_qubits: Packs N independent single-qubit circuits onto N qubits of one multi-qubit job, reducing QPU submissions by ~Nx.

  • shots: Number of measurement samples per circuit. More shots = less statistical noise.
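The two readout modes differ only in how the single-qubit amplitudes are reduced to a scalar. A standalone numpy comparison (the amplitudes here are arbitrary illustrative values):

```python
# fast_measure=True uses the quantum-inspired shortcut |alpha| - |beta|,
# which is cheap in simulation but is not a physical observable. Real
# devices can only estimate the Born-rule expectation |alpha|^2 - |beta|^2
# from measurement counts, hence fast_measure=False on hardware.
import numpy as np

rng = np.random.default_rng(0)

def readout(alpha, beta, fast_measure, shots=None):
    if fast_measure:
        return abs(alpha) - abs(beta)        # simulator-only shortcut
    p0 = abs(alpha) ** 2                     # Born rule: P(measure 0)
    if shots is None:
        return 2 * p0 - 1                    # exact <Z> = |a|^2 - |b|^2
    counts0 = rng.binomial(shots, p0)        # finite-shot estimate
    return 2 * counts0 / shots - 1

alpha, beta = np.sqrt(0.8), np.sqrt(0.2)     # normalized amplitudes
print(readout(alpha, beta, fast_measure=True))    # |a| - |b| ~ 0.447
print(readout(alpha, beta, fast_measure=False))   # <Z> = 0.6 exactly
print(readout(alpha, beta, False, shots=1000))    # noisy estimate near 0.6
```

The last line shows why shots matters: the hardware estimate fluctuates around 0.6 with standard error ~1/sqrt(shots).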

Error Mitigation

Real quantum hardware introduces gate errors, readout noise, and shot noise. QKAN provides framework-level error mitigation via the mitigation key in solver_kwargs:

solver_kwargs={
    "backend": backend,
    "shots": 1000,
    "parallel_qubits": 127,
    "mitigation": {
        "zne": {"scale_factors": [1, 3, 5]},  # Zero-Noise Extrapolation
        "n_repeats": 3,                        # Multi-shot averaging
        "clip_expvals": True,                  # Clamp <Z> to [-1, 1]
    },
}

Zero-Noise Extrapolation (ZNE): Runs circuits at amplified noise levels (via gate folding) and Richardson-extrapolates to the zero-noise limit. For Qiskit, you can alternatively use resilience_level=2 for Qiskit-native ZNE.
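The extrapolation step itself is ordinary polynomial fitting. A minimal sketch, assuming a simple linear noise decay for illustration (real hardware noise need not be linear, which is why multiple scale factors are used):

```python
# Zero-Noise Extrapolation: fit the expectation values measured at
# amplified noise scales, then evaluate the fit at scale 0.
import numpy as np

def richardson_zne(scale_factors, expvals, degree=None):
    """Polynomial fit over (scale, expval) pairs, evaluated at scale 0."""
    if degree is None:
        degree = len(scale_factors) - 1      # exact Richardson fit
    coeffs = np.polyfit(scale_factors, expvals, degree)
    return float(np.polyval(coeffs, 0.0))

# Toy measurements at scales 1, 3, 5 under a linear decay model
# <Z>(s) = 0.9 - 0.05 * s, so the true zero-noise value is 0.9.
scales = [1, 3, 5]
measured = [0.9 - 0.05 * s for s in scales]
print(richardson_zne(scales, measured))      # ~0.9, recovered
```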

Multi-shot averaging (n_repeats): Runs the entire batch N times and averages results. Reduces variance from shot-to-shot fluctuations.
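The variance reduction follows the usual 1/sqrt(n) law for averaging independent estimates, which this standalone simulation of binomial shot noise confirms (the probabilities and counts here are illustrative, not qkan defaults):

```python
# Averaging n_repeats independent runs shrinks the standard error of a
# shot-noise-limited <Z> estimate by ~1/sqrt(n_repeats).
import numpy as np

rng = np.random.default_rng(1)
p0, shots, trials = 0.8, 1000, 2000          # true P(0), shots, Monte Carlo trials

def estimate(n_repeats):
    # one <Z> estimate: the mean of n_repeats independent shot runs
    runs = 2 * rng.binomial(shots, p0, size=n_repeats) / shots - 1
    return runs.mean()

single = np.std([estimate(1) for _ in range(trials)])
tripled = np.std([estimate(3) for _ in range(trials)])
print(round(single / tripled, 2))            # ~1.73, i.e. ~sqrt(3)
```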

Expectation clipping (clip_expvals): Clamps values to [-1, 1] after mitigation. Prevents catastrophic outliers from ZNE extrapolation.

Mitigation Cost

| Technique          | Circuit multiplier | When to use                      |
|--------------------|--------------------|----------------------------------|
| clip_expvals only  | 1x                 | Always (zero cost)               |
| n_repeats=3        | 3x                 | When shot noise dominates        |
| ZNE [1, 3, 5]      | 3x                 | When gate noise dominates        |
| ZNE + n_repeats=3  | 9x                 | Maximum accuracy; inference only |

IBM-specific options (passed directly, not under mitigation):

solver_kwargs={
    "resilience_level": 2,  # Qiskit-native ZNE
    "twirling": {"enable_gates": True, "enable_measure": True},
}