Taiwan Semiconductor Manufacturing Company Limited (TSMC), a Tesla supplier, reports that it is now producing Tesla's next-generation Dojo AI training tile.
As announced at the most recent TSMC North American Technology Symposium, Tesla's Dojo system-on-wafer processor for AI training, also known as the Dojo AI training tile, is now in mass production and will be deployed soon. The presentation also disclosed further details about this enormous processor.
TSMC's integrated fan-out (InFO) technology for wafer-scale interconnections (InFO_SoW) connects a 5-by-5 array of known-good processor dies, each at or near reticle size, arranged on a carrier wafer to form Tesla's Dojo system-on-wafer processor, also known as the Dojo Training Tile.
According to IEEE Spectrum, the goal of InFO_SoW is to provide interconnect performance high enough that the 25 dies of the Dojo AI training tile function as a single processor.
TSMC also places dummy dies in the gaps between the processor dies to make the wafer-scale assembly uniform.
Because the Tesla Dojo Training Tile essentially packs 25 high-performance processors onto one wafer, it is exceptionally power-hungry and requires a sophisticated cooling system.
To feed the system-on-wafer, Tesla uses a highly complex voltage-regulating module that delivers 18,000 amps of current to the compute plane. The compute plane dissipates as much as 15,000 W of heat and therefore requires liquid cooling.
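The two figures above can be combined into a rough sanity check. The Python sketch below is a back-of-the-envelope estimate, not Tesla data: it assumes the full 15,000 W is drawn at the quoted 18,000 A and that power is split evenly across the 25 dies. The constant and function names are hypothetical, chosen only for illustration.

```python
# Back-of-the-envelope figures for the Dojo Training Tile, using only the
# numbers quoted in the article (18,000 A, 15,000 W, 25 dies).
# Assumptions (not published by Tesla): all 15,000 W is drawn at the quoted
# current, and power is split evenly across the 25 dies.

SUPPLY_CURRENT_A = 18_000   # current delivered to the compute plane
HEAT_DISSIPATED_W = 15_000  # maximum heat dissipated by the compute plane
DIES_PER_TILE = 5 * 5       # 5-by-5 array of known-good dies


def implied_voltage(power_w: float, current_a: float) -> float:
    """Average supply voltage implied by P = V * I."""
    return power_w / current_a


def power_per_die(power_w: float, dies: int) -> float:
    """Power per die, assuming a hypothetical even split across dies."""
    return power_w / dies


if __name__ == "__main__":
    v = implied_voltage(HEAT_DISSIPATED_W, SUPPLY_CURRENT_A)
    p = power_per_die(HEAT_DISSIPATED_W, DIES_PER_TILE)
    print(f"implied supply voltage: {v:.2f} V")  # 0.83 V
    print(f"power per die: {p:.0f} W")           # 600 W
```

Under these assumptions, delivering roughly 15 kW at well under 1 V is what pushes the current to 18,000 A, which helps explain why the voltage-regulating module has to be so elaborate.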
Through the Dojo initiative, Tesla has been building its own AI training compute capacity in addition to purchasing NVIDIA hardware. The first version of its Dojo supercomputing platform went live this past summer.
Tesla has not yet disclosed the Dojo system-on-wafer's performance, but given the engineering effort involved in building it, it appears to be an extremely potent tool for AI training.
Wafer-scale processors, such as Tesla Dojo and Cerebras' Wafer Scale Engine (WSE), can be considerably more performance-efficient than conventional multi-processor machines.
Their main advantages include high-bandwidth, low-latency communication between cores, lower power delivery network impedance, and superior energy efficiency. They can also tolerate manufacturing defects, either by including redundant spare cores or, in Tesla's case, by assembling only known-good processor dies.
For now, however, these processors come with inherent limitations. System-on-wafers are currently restricted to on-chip memory alone, which is inflexible and may not be sufficient for many applications.