Semidynamics has published Tensor Unit efficiency data for its “All-In-One” AI IP running a LlaMA-2 7B-parameter Large Language Model (LLM).
Roger Espasa, Semidynamics’ CEO, explained, “The traditional AI design uses three separate computing elements: a CPU, a GPU (Graphical Processor Unit) and an NPU (Neural Processor Unit) connected through a bus. This traditional architecture requires DMA-intensive programming, which is error-prone, slow, and energy-hungry plus the challenge of having to integrate three different software stacks and architectures. In addition, NPUs are fixed-function hardware that cannot adapt to future AI algorithms yet-to-be-invented.
“In contrast, Semidynamics has re-invented AI architecture and integrates the three elements into a single, scalable processing element. We combine a RISC-V core, a Tensor Unit that handles matrix multiplication (playing the role of the NPU) and a Vector Unit that handles activation-like computations (playing the role of the GPU) into a fully integrated, all-in-one compute element, as shown in Figure 1. Our new architecture is DMA-free, uses a single software stack based on ONNX and RISC-V and offers direct, zero-latency connectivity between the three elements. The result is higher performance, lower power, better area and a much easier-to-program environment, lowering overall development costs. In addition, because the Tensor and Vector Units are under the direct control of a flexible CPU, we can deploy any existing or future AI algorithm, providing great protection to our customer’s investments.”
LLMs have become an important component of AI applications. LLM computation is dominated by self-attention layers, which are shown in detail in Figure 2. These layers consist of five matrix multiplications (MatMul), a matrix Transpose, and a SoftMax activation function. In Semidynamics’ All-In-One solution, the Tensor Unit (TU) handles the matrix multiplications, while the Vector Unit (VU) efficiently handles Transpose and SoftMax. Because the Tensor and Vector Units share the vector registers, costly memory copies can be largely avoided.
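The layer structure described above can be sketched in plain Python. This is only an illustration, not Semidynamics' implementation: the article does not list which five MatMuls Figure 2 shows, so the sketch assumes a single attention head with three input projections, the score product, and the weighted sum of values. The function and parameter names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows (the Tensor Unit's role)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns (handled by the Vector Unit in the article)."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Numerically stable SoftMax applied to each row (also a Vector Unit task)."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def self_attention(x, w_q, w_k, w_v):
    """Single-head attention without masking; the choice of MatMuls is an assumption."""
    q = matmul(x, w_q)                 # MatMul 1: query projection
    k = matmul(x, w_k)                 # MatMul 2: key projection
    v = matmul(x, w_v)                 # MatMul 3: value projection
    scores = matmul(q, transpose(k))   # Transpose, then MatMul 4
    scale = math.sqrt(len(k[0]))       # standard scaling by sqrt(key width)
    weights = softmax_rows([[s / scale for s in row] for row in scores])
    return matmul(weights, v)          # MatMul 5: weighted sum of values
```

Each intermediate result of a MatMul feeds directly into a Transpose or SoftMax step and back, which is the hand-off the shared vector registers are meant to make cheap.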
With the vector registers shared, data passes from the MatMul layers to the activation layers and back with zero latency and no energy spent on transfers. To keep the TU and the VU continuously busy, weights and inputs must be fetched efficiently from memory into the vector registers. This is where Semidynamics’ Gazzillion Misses technology comes in, offering unprecedented data movement capabilities.
By supporting a large number of in-flight cache misses, Gazzillion allows data to be fetched well ahead of time, which yields high resource utilization. In addition, Semidynamics’ custom tensor extension includes new vector instructions optimized for fetching and transposing 2D tiles, which greatly improves tensor processing.
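Why the number of in-flight misses matters can be illustrated with Little's law: the bandwidth a core can sustain is roughly the number of bytes in flight divided by the memory latency. The sketch below is a back-of-the-envelope model only. The 64-byte line size, 100 ns latency, and miss counts are hypothetical values chosen for illustration, not figures published by Semidynamics.

```python
import math

def sustained_bandwidth_gb_s(in_flight_misses, line_bytes, latency_ns):
    """Little's law: bytes in flight divided by latency (bytes per ns equals GB/s)."""
    return in_flight_misses * line_bytes / latency_ns

def misses_needed(target_gb_s, line_bytes, latency_ns):
    """Smallest number of outstanding misses that sustains the target bandwidth."""
    return math.ceil(target_gb_s * latency_ns / line_bytes)

# Hypothetical example values: 64-byte cache lines, 100 ns memory latency.
few = sustained_bandwidth_gb_s(in_flight_misses=8, line_bytes=64, latency_ns=100)
many = sustained_bandwidth_gb_s(in_flight_misses=128, line_bytes=64, latency_ns=100)
```

Under these assumed numbers, raising the number of outstanding misses from 8 to 128 raises the sustainable streaming rate by the same factor of 16, which is the effect the paragraph above attributes to Gazzillion.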
Using its ONNX Runtime Execution Provider, Semidynamics ran the full LlaMA-2 7B-parameter model (BF16 weights) on its All-In-One element and measured the Tensor Unit utilization for each MatMul layer in the model. Figure 3 presents the results.
The results are aggregated and organized by A-tensor shape. As the x-axis labels in Figure 3 show, LlaMA-2 contains six distinct shapes. Utilization is above 80% for most shapes, which Semidynamics says is significantly higher than other architectures.
These results were collected under the most challenging conditions: a batch size of 1 and the first-token computation. To further illustrate the combined efficiency of the Tensor Unit and the Gazzillion technology, Figure 4 shows Tensor Unit efficiency for large matrix sizes.
Figure 4 is annotated with the combined size of the A and B matrices. As the N, M, and P dimensions of the matrices grow, the total size in MB quickly exceeds any feasible cache or scratchpad.
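The growth in operand size can be checked with simple arithmetic. The sketch below assumes that A is an N×M matrix, that B is an M×P matrix, and that elements are 2-byte BF16 values as in the LlaMA-2 run. The article does not state how the dimensions map to the matrices or which data type Figure 4 uses, so these are assumptions, and the function name is hypothetical.

```python
BF16_BYTES = 2  # assumed element size; BF16 is the weight format named for the LlaMA-2 run

def operand_footprint_mib(n, m, p, bytes_per_element=BF16_BYTES):
    """Combined size of A (n x m) and B (m x p) in MiB, under the assumed layout."""
    a_elements = n * m
    b_elements = m * p
    return (a_elements + b_elements) * bytes_per_element / 2**20

small = operand_footprint_mib(1024, 1024, 1024)  # 4.0 MiB
large = operand_footprint_mib(4096, 4096, 4096)  # 64.0 MiB
```

Under these assumptions, quadrupling each dimension multiplies the footprint by 16, so operands move from a size that might fit in a large on-chip memory to one that must be streamed from main memory.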
The most notable feature of the chart is that performance stays consistently a little above 70%, regardless of the total size of the matrices. This surprising result is due to the ability of the Gazzillion technology to sustain a high streaming data rate between main memory and the Tensor Unit.
Espasa concluded, “Our new All-In-One AI IP not only delivers outstanding AI performance but is also so much easier to program as there is now just one software stack instead of three. Developers can use the RISC-V stack they already know and they do not have to worry about software-managed local SRAMs, or DMAs. Furthermore, Semidynamics provides an ONNX runtime optimized for the All-In-One AI IP, which allows programmers to easily run their ML models. Therefore, our solution represents a big step forward in programmer friendliness and ease-of-integration into new SOC designs. Our customers using All-In-One will be able to pass on to their customers, developers, and users all these benefits in the form of better and easier-to-program silicon.
“Moreover, our All-In-One design is completely resilient to future changes in AI/ML algorithms and workloads. This is a huge risk protection for customers starting a silicon project that will not hit the market for several years. Knowing that your AI IP will still be relevant when your silicon enters volume production is a unique advantage of our technology.”