Prune, Quantize, Deploy
Efficient Edge AI from Model to Silicon
Pong Trairatvorakul · Director of Strategy · femtoAI
EDGE AI Taipei 2025

Prune, Quantize, Deploy Workshop

Visit workshop.femto.ai

EDGE AI Taipei 2025

Introduction
This hands-on workshop walks participants through the full edge AI deployment pipeline: from a Spoken Language Understanding model occupying a few MB of memory in PyTorch to real-time inference on production hardware. Attendees will learn to compress models using pruning and quantization techniques, then deploy them on the femtoAI SPU-001 AI accelerator, a hardware platform designed specifically to exploit unstructured sparsity for ultra-efficient inference.
By the end, you will have a working understanding of how to prepare, optimize, and deploy compressed neural networks on real hardware, and the confidence to apply these techniques in your own edge AI workflows.
This workshop is a condensed version of the E2E Example on the femtoAI Developer Portal; visit the portal for the full version of the tutorial.


Setup
Some of these steps may take a long time to download. To help with this, a USB drive with the files is available during the workshop; see "Installing from USB Drive" below.
Environment Setup
  1. Download the workshop code from here and unzip it
  2. Install Conda for Python environment management
  3. Install and run Docker if you don’t have it already
Setup
1. Open Terminal and change directory into downloaded zip folder
cd femtoAI_taipei_workshop
2. Create and activate Conda environment
conda create -n femtocrux_env python=3.10
conda activate femtocrux_env
3. Install femtocrux (femtoAI compiler API) from PyPI
pip install femtocrux
4. Download compiler docker image
python download_compiler.py
You can obtain the compiler key from: developer.femto.ai (an account and NDA is required)
5. Run setup inside the repo
bash setup.sh
This installs the pip requirements in requirements.txt, downloads the Google Speech Commands dataset, and splits it into train, test, and validation sets. This step may take a few minutes to run.
Note: You may need to install wget. On a Mac, we suggest using Homebrew (brew install wget).
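The dataset split performed by setup.sh can be sketched as follows. This is a minimal illustrative stand-in, not the script's actual logic: the `split_dataset` helper and the 80/10/10 ratios are assumptions for demonstration.

```python
import random

def split_dataset(items, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle and partition items into train/val/test splits.

    Hypothetical stand-in for the split performed by setup.sh.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 1000 placeholder "clips", split 80/10/10
train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))
```

A fixed seed keeps the split reproducible across runs, which matters when you later compare checkpoints trained at different stages.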

Installing from USB Drive
Loading the compiler docker image
1. Load the compiler image into Docker
docker load -i /Volumes/FEMTO_USB/femtocrux_2.5.1.tar
2. Verify image is loaded:
docker images
Now when you run pip install femtocrux, the code will automatically use the local image instead of trying to fetch it remotely.
Loading the workshop code
Copy the femtoAI_taipei_workshop.zip file to your computer and unzip it


Training, Pruning, Quantization
Since these steps can take a long time to run and are sequential, we will start each step from a saved checkpoint (like a cooking show)

1. Train From Scratch

Change directory into src:
cd src
Start the training run:
python3 train.py \
  globals.device=cpu \
  globals.batch_size=128 \
  globals.exp_tag=bce-max-fp \
  globals.epochs=2 \
  globals.prune_model=false \
  globals.quant_model=false
If you use a Mac with Metal support, you can set globals.device=mps. You may need to set PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 if MPS runs out of memory:
export PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0
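The exp_tag bce-max-fp suggests a binary cross-entropy objective. As a toy illustration of what each training step does, here is a NumPy sketch of one-layer logistic regression trained with BCE gradients. This is illustrative only, not the workshop's actual SLU model or train.py's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_step(w, x, y, lr=0.1):
    """One gradient step of binary cross-entropy for a linear model:
    a toy stand-in for what train.py does per batch."""
    p = sigmoid(x @ w)
    grad = x.T @ (p - y) / len(y)   # dBCE/dw for logistic regression
    return w - lr * grad

rng = np.random.default_rng(0)
x = rng.normal(size=(128, 8))          # 128 "clips", 8 features each
true_w = rng.normal(size=8)
y = (x @ true_w > 0).astype(float)     # linearly separable labels

w = np.zeros(8)
for _ in range(200):                   # "epochs" over the toy batch
    w = bce_step(w, x, y)

acc = np.mean((sigmoid(x @ w) > 0.5) == (y == 1))
print(f"train accuracy: {acc:.2f}")
```

The real run replaces this linear model with the SLU network and iterates over mini-batches of the Speech Commands dataset, but the loop structure is the same.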

2. Pruning Train Run

python3 train.py \
  globals.device=cpu \
  globals.batch_size=128 \
  globals.exp_tag=bce-max-fp-prune \
  globals.epochs=2 \
  model_load.checkpoint_path=./assets/bce-max-fp/checkpoints/model_final.pt \
  model_load.model_type=fp \
  globals.prune_model=true \
  prune_cfg.prune_init=0.0 \
  prune_cfg.prune_target=0.8
If you use a Mac with Metal support, you can set globals.device=mps.
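Unstructured magnitude pruning, ramped from prune_init=0.0 up to prune_target=0.8, can be sketched like this. The NumPy helper below is an illustrative assumption about the mechanism; the actual schedule and masking live in train.py.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude.

    Unstructured: individual weights are removed wherever they fall,
    with no block or channel pattern imposed.
    """
    k = int(round(sparsity * weights.size))
    if k == 0:
        return weights.copy()
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)

# Ramp sparsity from 0.0 to 0.8 across training, re-pruning after each
# (hypothetical) fine-tuning step so the network can adapt gradually.
for sparsity in np.linspace(0.0, 0.8, 5):
    w = magnitude_prune(w, sparsity)

achieved = float(np.mean(w == 0))
print(f"final sparsity: {achieved:.2f}")
```

The gradual ramp is why the run takes checkpoints and epochs: pruning everything at once typically costs far more accuracy than pruning a little and retraining between steps.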

3. Post Training Quantization

python3 train.py \
  globals.device=cpu \
  globals.batch_size=128 \
  globals.exp_tag=bce-max-prune-quantize \
  model_load.checkpoint_path=./assets/bce-max-fp-prune/checkpoints/model_final.pt \
  model_load.model_type=fp \
  globals.prune_model=true \
  prune_cfg.prune_init=0.8 \
  prune_cfg.prune_target=0.8 \
  globals.epochs=0 \
  globals.quant_model=true
If you use a Mac with Metal support, you can set globals.device=mps.
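Post-training quantization maps the trained float weights onto a fixed integer grid with no further training (note globals.epochs=0). Below is a minimal sketch of symmetric per-tensor int8 quantization; this is a common illustrative scheme, and the femtoAI toolchain's exact format may differ.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(64, 64)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = q.astype(np.float32) * scale     # dequantized reconstruction

err = float(np.abs(w - w_hat).max())
print(f"scale={scale:.5f}, max abs error={err:.5f}")
```

The reconstruction error is bounded by half a quantization step (scale / 2), which is why PTQ alone often works, and why the next QAT step exists for the cases where it doesn't.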

4. Quantization Aware Training Run (QAT)

python3 train.py \
  globals.device=cpu \
  globals.batch_size=128 \
  globals.exp_tag=bce-max-prune-QAT \
  model_load.checkpoint_path=./assets/bce-max-prune-quantize/checkpoints/model_final.pt \
  model_load.model_type=quant \
  globals.prune_model=true \
  prune_cfg.prune_init=0.8 \
  prune_cfg.prune_target=0.8 \
  globals.epochs=1 \
  globals.quant_model=true
This step is not compatible with globals.device=mps.
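QAT inserts "fake quantization" into the forward pass, so the loss sees quantized weights, while the backward pass uses the straight-through estimator (STE): gradients flow through the rounding as if it were the identity, and updates are applied to a float "shadow" copy of each weight. A one-weight NumPy sketch, illustrative only:

```python
import numpy as np

def fake_quant(w, scale):
    """Forward: snap to the int8 grid. Backward (straight-through
    estimator): treat d(fake_quant)/dw as 1, so gradients pass
    through the rounding unchanged."""
    return np.clip(np.round(w / scale), -127, 127) * scale

# A few hypothetical QAT steps on a single weight being pulled
# toward a target of 0.0 by a squared-error loss.
w, scale, lr, target = 0.30, 0.01, 0.1, 0.0
for _ in range(5):
    w_q = fake_quant(w, scale)
    grad = 2 * (w_q - target)   # d/dw_q of (w_q - target)**2
    w -= lr * grad              # STE: apply the gradient to the float weight
print(round(w, 4))
```

The float shadow weight keeps accumulating sub-step updates that plain rounding would discard, which is how QAT recovers accuracy that PTQ loses.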

Troubleshooting
If you run into an error about missing hid library:
brew install libusb hidapi
export DYLD_LIBRARY_PATH=/opt/homebrew/lib
If you run into an error about numpy version:
pip install "numpy<2.0"


Deploy
Load firmware onto the EVK
  1. Locate the firmware .hex file at /src/assets/speech_command_firmware.hex
  2. Download the Teensy 4.1 Firmware Loader tool for your OS from: https://www.pjrc.com/teensy/loader.html
  3. Open the Loader tool, select File > Open HEX File, and select the .hex file located in step 1
  4. Connect the EVK to your computer with the USB cable, turn it on, then press the PROGRAM FIRMWARE button located between the Audio In and Audio Out jacks
  5. Complete the programming by selecting Operation > Program, then Operation > Reboot
Load model onto the EVK
  1. Locate the model files (0PROG_A, 0PROG_D, 0PROG_P) in:
    /src/assets/bce-max-prune-QAT/model_datas/e2e_example/io_records/apb_records
  2. Copy all three files (0PROG_A, 0PROG_D, 0PROG_P) to the root directory of the SD card
  3. Insert the SD card into the EVK
Test the deployed model
  1. Turn on the EVK using the switch on the side
  2. Wait until the light flashes green
  3. Say “Yes” and the light should blink


Continue Your Edge AI Journey
Deep Dive into the Code
The workshop offered a foundational look into Edge AI. Now, take your time to explore the code, experiment with different parameters, and deepen your understanding of femtoAI's capabilities at your own pace.
Experience Sparsity in Action
Head over to the femtoAI demo table for a hands-on experience. See real-world applications of our cutting-edge sparsity technology and witness the efficiency of our models firsthand.
Join the Developer Community
Sign Up for Free
Unlock full access to femtoAI's tools, documentation, and resources. Create your free developer account today to start building and deploying your own efficient Edge AI solutions.
