Human beings are capable of a great deal, but little of it comes as naturally as steady improvement. That commitment to becoming better in every situation has produced some major milestones, and technology is among the most significant of them. We hold technology in high regard largely because of its capabilities, which have led us to a reality few could have imagined. Look past the surface, though, and it becomes clear that its rise owed just as much to the way we applied those capabilities in real-world settings. That application gave technology a presence across nearly every field and set off a full-blown tech revolution. The revolution went on to scale up the human experience in unique ways, and even after a feat so notable, technology continues to deliver. This has grown more evident in recent times, and if one new discovery has the desired impact, it will only strengthen that trend.
Research teams at the Massachusetts Institute of Technology and NVIDIA have developed two techniques that accelerate the processing of sparse tensors, a type of data structure used for high-performance computing tasks. According to reports, the techniques could bring significant improvements to the performance and energy efficiency of systems such as the massive machine-learning models that drive generative artificial intelligence. They do so by exploiting sparsity, meaning the zero values in the tensors. Because zero values contribute nothing to the result, hardware can skip over them and save on both computation and memory. Skipping them also makes it possible to compress the tensor so that a larger portion can be stored in on-chip memory.
There are reasons this has been hard to achieve so far. For starters, finding the non-zero values in a large tensor is no easy task. Existing approaches often simplify the search by enforcing a sparsity pattern that limits where non-zero values may appear, but a pre-defined pattern narrows the variety of tensors that can be processed efficiently. Another challenge is that the number of non-zero values can vary across different regions of a tensor, which makes it difficult to determine how much space is needed to store each region in memory. To cope, more space than necessary is often allocated, which leaves the storage buffer underutilized. That in turn increases off-chip memory traffic, which requires extra computation. So how did the MIT and NVIDIA researchers solve this conundrum? The first of their two techniques efficiently finds the non-zero values for a wider variety of sparsity patterns.
For the second technique, they created a method that handles the case where the data does not fit in memory. This increases the utilization of the storage buffer while reducing off-chip memory traffic. Although the two methods work differently, both boost the performance and reduce the energy demands of hardware accelerators.
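To make the idea of skipping zeros concrete, here is a minimal Python sketch. It is not the researchers' method, only an illustration of the general principle described above. The function names and the sample data are hypothetical. It compresses a vector into (position, value) pairs, computes a dot product that touches only the stored non-zeros, and counts non-zeros per fixed-size tile to show why the space needed per region is hard to predict.

```python
def compress(values):
    """Keep only the non-zero entries as (position, value) pairs."""
    return [(i, v) for i, v in enumerate(values) if v != 0]


def sparse_dot(compressed, dense):
    """Dot product that visits only the stored non-zero entries."""
    return sum(v * dense[i] for i, v in compressed)


def nonzeros_per_tile(values, tile_size):
    """Count the non-zero entries in each fixed-size tile of the vector."""
    return [
        sum(1 for v in values[start:start + tile_size] if v != 0)
        for start in range(0, len(values), tile_size)
    ]


# Hypothetical sample data, chosen only for illustration.
weights = [0, 3, 0, 0, 5, 0, 0, 0, 1, 2, 4, 0]
activations = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]

packed = compress(weights)
print(len(packed), "of", len(weights), "entries stored")
print(sparse_dot(packed, activations))   # 104, same as the dense dot product
print(nonzeros_per_tile(weights, 4))     # [1, 1, 3]
```

In this example the three tiles hold one, one, and three non-zero values. A buffer sized for the worst case (three entries per tile) would sit mostly empty for the first two tiles, which is the underutilization the article describes.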
“Typically, when you use more specialized or domain-specific hardware accelerators, you lose the flexibility that you would get from a more general-purpose processor, like a CPU. What stands out with these two works is that we show that you can still maintain flexibility and adaptability while being specialized and efficient,” said Vivienne Sze, associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the Research Laboratory of Electronics (RLE), and co-senior author of papers on both advances.
The research teams have also built a hardware accelerator around the first technique. Named HighLight, the accelerator can handle a wide variety of sparsity patterns and still perform well when running models that have no zero values. It does so through “hierarchical structured sparsity”, which represents a wide variety of sparsity patterns as combinations of multiple simple sparsity patterns. The researchers divide the values in a tensor into smaller blocks, where each block has its own simple sparsity pattern (perhaps two zeros and two non-zeros in a block with four values). They then combine the blocks into a hierarchy. Blocks can keep being combined into larger levels, but the pattern remains simple at each step. This simplicity lets the hardware find and skip zeros efficiently, which eliminates the excess computation. In terms of numbers, the accelerator design is said to be six times more energy-efficient than other approaches.
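The block-and-hierarchy idea above can be sketched in a few lines of Python. This is an illustrative check under stated assumptions, not HighLight's actual design. Each level is written as a hypothetical (max_nonzero, block_size) pair, the rule is taken to mean "at most" that many non-zeros per block, and the second-level pattern (one occupied block out of every two) is invented for the example.

```python
from fractions import Fraction


def conforms(values, levels):
    """Check values against a hierarchy of (max_nonzero, block_size) patterns.

    levels[0] applies to individual values; each later level applies to
    the blocks formed at the level below it.
    """
    # A unit counts as occupied if it contains any non-zero value.
    occupied = [v != 0 for v in values]
    for max_nonzero, block_size in levels:
        if len(occupied) % block_size != 0:
            raise ValueError("length must divide evenly into blocks")
        blocks = [
            occupied[start:start + block_size]
            for start in range(0, len(occupied), block_size)
        ]
        # The pattern at this level stays simple: a cap per block.
        if any(sum(block) > max_nonzero for block in blocks):
            return False
        # Each block becomes one unit at the next level up.
        occupied = [any(block) for block in blocks]
    return True


def max_density(levels):
    """Largest fraction of non-zero values this hierarchy allows."""
    density = Fraction(1)
    for max_nonzero, block_size in levels:
        density *= Fraction(max_nonzero, block_size)
    return density


# Assumed example: 2 non-zeros per 4 values, then 1 occupied block per 2 blocks.
LEVELS = [(2, 4), (1, 2)]

print(conforms([0, 7, 0, 3, 0, 0, 0, 0], LEVELS))  # True
print(conforms([1, 2, 3, 0, 0, 0, 0, 0], LEVELS))  # False: inner block too full
print(conforms([1, 0, 2, 0, 0, 4, 0, 0], LEVELS))  # False: both blocks occupied
print(max_density(LEVELS))                          # 1/4
```

In this sketch, two simple rules combine into an overall pattern that keeps at most a quarter of the values, and adding or changing a level yields a different overall density without making any single level harder to check.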
“In the end, the HighLight accelerator is able to efficiently accelerate dense models because it does not introduce a lot of overhead, and at the same time it is able to exploit workloads with different amounts of zero values based on hierarchical structured sparsity,” said Yannan Nellie Wu, co-lead author on the study.
Going forward, the research teams at MIT and NVIDIA plan to apply hierarchical structured sparsity to more types of machine-learning models and to different types of tensors within those models.