Apex Compute is developing a next-generation inference chip for edge devices, capable of up to 20x greater energy efficiency than Nvidia's Jetson family when generating tokens.


Current GPU and CPU architectures face significant limitations with today's generative AI models, particularly transformer-based models: memory bottlenecks caused by reading all parameters (often billions) for every generated token, the lack of efficient arithmetic logic units for low-precision numbers, and the absence of custom hardware blocks for certain special operations.
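
As a rough back-of-the-envelope illustration of that memory bottleneck (all numbers below are assumptions for illustration, not measurements):

```python
# Back-of-the-envelope DRAM traffic for autoregressive decoding.
# All values are assumed example figures.
params = 7e9          # 7B-parameter model (assumption)
bytes_per_param = 2   # fp16/bf16 weights
tokens_per_s = 20     # target decode rate (assumption)

bytes_per_token = params * bytes_per_param              # every weight is read once per token
bandwidth_gb_s = bytes_per_token * tokens_per_s / 1e9

print(f"{bytes_per_token / 1e9:.1f} GB read per token")                 # ~14 GB
print(f"{bandwidth_gb_s:.0f} GB/s sustained DRAM bandwidth required")   # ~280 GB/s
```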


On the other hand, custom ASIC developers, such as Groq, predominantly target training workloads, which require at least fp16/bf16 precision or hybrid fp8 formats (especially E5M2); these formats do not allow the extensive use of the extremely low-precision arithmetic that benefits inference-only workloads. Recent research (arXiv:2402.17764) showed that transformer models with ternary weights (-1, 0, 1) work as well as high-precision models; this eliminates multiplications and reduces computation to additions and subtractions only, which significantly lowers both power consumption and latency. However, fully realizing these advantages requires specialized hardware architectures explicitly designed to handle ternary weights.
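
To make that arithmetic simplification concrete, here is a minimal NumPy sketch (illustrative only, not our kernel) of a matrix-vector product with ternary weights: every multiplication collapses into an add, a subtract, or a skip.

```python
import numpy as np

def ternary_matvec(W, x):
    """Matrix-vector product with W in {-1, 0, +1}: only adds and subtracts, no multiplies."""
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        row = W[i]
        out[i] = x[row == 1].sum() - x[row == -1].sum()   # zero weights are skipped entirely
    return out

# Quick check against a regular matmul
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(8, 16)).astype(np.float32)  # ternary weights
x = rng.standard_normal(16).astype(np.float32)
assert np.allclose(ternary_matvec(W, x), W @ x, atol=1e-5)
```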


Additionally, model parameters and KV caches typically have low entropy, making them highly compressible. Custom hardware that compresses and decompresses these tensors at runtime can substantially reduce DRAM traffic, which is another major source of power savings.
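
As a host-side illustration of that compressibility, using zlib as a stand-in for the on-chip codec (actual ratios depend on the model and quantization scheme; the tensor below is synthetic):

```python
import zlib
import numpy as np

# Quantized weights and KV-cache entries tend to follow narrow, zero-centered
# distributions, so even a generic entropy coder shrinks them noticeably.
rng = np.random.default_rng(0)
fake_kv = np.clip(rng.normal(0, 8, size=1_000_000), -127, 127).astype(np.int8)

raw = fake_kv.tobytes()
compressed = zlib.compress(raw, level=6)
print(f"raw: {len(raw)/1e6:.1f} MB, compressed: {len(compressed)/1e6:.1f} MB "
      f"({len(raw)/len(compressed):.1f}x less DRAM traffic)")
```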


At the software level, significant opportunities exist to optimize tensor movement: minimizing DRAM traffic and keeping data in local memory as much as possible to avoid memory overheads. In addition, current compilers are designed for fixed hardware architectures, whereas we expose the hardware architecture itself as an additional optimization hyperparameter in the compiler, helping the scheduler reach up to 100% hardware utilization.
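
A toy sketch of treating the hardware size as a compiler hyperparameter (hypothetical tile-size sweep with a simplified utilization model; the real scheduler and cost model are more involved):

```python
from itertools import product

# Layer shapes to schedule: (rows, cols) of each weight matrix (example values).
layers = [(4096, 4096), (4096, 11008), (11008, 4096)]

def utilization(tile_rows, tile_cols):
    """Fraction of compute-array slots doing useful work, averaged over layers."""
    total = 0.0
    for rows, cols in layers:
        # Ceil-division padding wastes slots whenever the tile doesn't divide the layer.
        tiles = -(-rows // tile_rows) * -(-cols // tile_cols)
        total += (rows * cols) / (tiles * tile_rows * tile_cols)
    return total / len(layers)

# Sweep candidate array sizes and pick the one the scheduler can fill best.
candidates = product([64, 128, 256, 512], repeat=2)
best = max(candidates, key=lambda tc: utilization(*tc))
print("best array size:", best, f"utilization: {utilization(*best):.1%}")
```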


Our product will have the following features:

  • Extensive utilization of quantized and low-precision arithmetic to enhance computational efficiency.

  • Custom KV cache and weight compression/decompression hardware block to minimize DRAM traffic.

  • Hardware size as a tunable parameter in the compiler to help scheduling.

  • Up to 8 GB of DRAM capacity, enough for up to 40B parameters at 1.58 bits/parameter (see the capacity check after this list).

  • No OS-related overheads.

  • Popular models will be supported from Hugging Face safetensors, GGUF, and PyTorch checkpoint formats.

  • Under 10 watts of compute power and 0.5 joules/token of energy consumption.
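
A quick check of the capacity figure above (straightforward arithmetic on the stated numbers):

```python
params = 40e9             # 40B parameters
bits_per_param = 1.58     # ternary weight encoding (log2(3) ≈ 1.58 bits)
weight_bytes = params * bits_per_param / 8
print(f"{weight_bytes / 1e9:.1f} GB of weights")   # ≈ 7.9 GB, within the 8 GB DRAM budget
```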


Initially, the design will be implemented on a low-power FPGA as a proof of concept supporting camera and microphone inputs. The hardware is designed specifically to handle models within predefined size constraints efficiently, so performance stays predictable within those limits.


What will this enable?

  • It will enable transformer-based models (ViT, VLA, LLM) to run on edge devices such as robots, drones, and wearables instead of relying on power-hungry GPUs.

  • Edge compute eliminates the cloud dependency, making real-time applications such as robot control possible with these models.

  • Keeping inference off the cloud protects privacy.

  • [Optional, as an alternative to ASIC manufacturing] An FPGA-level solution keeps the compute fabric flexible, so the hardware never becomes obsolete.


Reach out to me for more info: hunlu@apexcompute.com