- M. Sadrosadati, A. H. Mirhosseini, S. Roozkhosh, H. Bakhishi, H. Sarbazi-Azad, Effective Cache Bank Placement for GPUs, DATE, 2017.
ABSTRACT: The placement of the Last Level Cache (LLC) banks in the GPU on-chip network can significantly affect the performance of memory-intensive workloads. In this paper, we propose a placement methodology for the LLC banks that maximizes the performance of the on-chip network connecting the LLC banks to the streaming multiprocessors in GPUs. We argue that an efficient placement must be derived from a novel metric that accounts for the ability of GPUs to hide latency through thread-level parallelism. To this end, we propose a throughput-aware metric called Effective Latency Impact (ELI). Moreover, we formulate our placement approach mathematically as an optimization problem based on the ELI metric. Since this optimization problem is NP-hard, we solve it with a heuristic. Experimental results show that our placement approach improves performance by up to 15.7% compared to the state-of-the-art placement.
- A. H. Mirhosseini, M. Sadrosadati, M. Zare, H. Sarbazi-Azad, Quantifying the Difference in Resource Demand among Classic and Modern NoC Workloads, ICCD, 2016.
ABSTRACT: This paper quantifies the difference in resource demand between modern and classic NoC workloads. We show that modern workloads make better use of larger numbers of VCs and smaller C factors to attain performance and energy efficiency. This is due to the high throughput and possible local congestion in their traffic patterns. As a result, such workloads are better suited to concurrency and redundancy energy-reduction techniques, in which voltage and frequency are reduced simultaneously and the increased power budget is used to add resources to the network to improve performance.
- H. Aghilinasab, M. Sadrosadati, M. H. Samavatian, H. Sarbazi-Azad, Reducing Power Consumption of GPGPUs Through Instruction Reordering, ISLPED, 2016.
ABSTRACT: Execution units in GPGPUs consume a large amount of static power. However, reducing the static power of execution units is not straightforward, for two reasons. First, the very long total idle time of execution units in GPGPUs is fragmented into many short periods. Second, these units are critical to overall performance. In this paper, we propose a method to reduce static power without any performance overhead. We use out-of-order execution of instructions to make the idle periods of execution units much longer. Experimental results show that, on average, our proposal improves over the state-of-the-art by 25% in power and 8% in performance.
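The following example shows why fragmented idle time limits static-power savings. It is a minimal sketch under simplifying assumptions, not the paper's power model. It assumes a power-gating scheme in which an idle period pays off only for the cycles beyond a fixed break-even length. `BREAK_EVEN` is a hypothetical value of 10 cycles. The sketch compares a busy/idle trace where the same 8 busy cycles are either spread out or clustered together, which is the kind of clustering instruction reordering aims for.

```python
from itertools import groupby


def idle_periods(trace):
    """Lengths of consecutive idle runs in a 0/1 trace (1 = busy cycle)."""
    return [len(list(run)) for busy, run in groupby(trace) if busy == 0]


def net_gated_cycles(trace, break_even):
    """Cycles of net static-power savings under a simple model in which
    only the part of each idle run beyond `break_even` is beneficial."""
    return sum(max(0, n - break_even) for n in idle_periods(trace))


# The same 8 busy cycles in a 64-cycle window: spread out vs. clustered.
fragmented = ([1] + [0] * 7) * 8   # eight idle runs of 7 cycles each
clustered = [1] * 8 + [0] * 56     # one idle run of 56 cycles

BREAK_EVEN = 10  # hypothetical break-even idle length, in cycles
print(net_gated_cycles(fragmented, BREAK_EVEN))  # 0
print(net_gated_cycles(clustered, BREAK_EVEN))   # 46
```

In this model, each fragmented idle run is shorter than the break-even length, so none of it can be gated profitably. The clustered trace exposes one long idle run instead.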
- M. Sadrosadati, R. Bashizade, S. Roozkhosh, A. Shafiee, H. Sarbazi-Azad, A Method to Improve Adaptivity of Odd-even Routing Algorithm in Mesh NoCs, PDP, 2016.
ABSTRACT: Networks-on-chip (NoCs) play an important role in the performance of Chip-Multiprocessors (CMPs) containing tens to hundreds of cores. Improving resource utilization in NoCs improves their performance. Adaptive routing algorithms help balance resource utilization across different parts of the network and hence prevent one resource from becoming the performance bottleneck while other resources are still under-utilized. In this paper, we present a novel approach called Preemptive Waiting, applied to the Odd-Even routing algorithm (PWOE). PWOE increases the adaptivity degree of the original odd-even algorithm by 61%. Furthermore, PWOE improves on odd-even by 15.2% for synthetic traffic patterns and by 10.6% for real applications in terms of Probabilistic Relational Adaptivity Degree, a metric we suggest for comparing the usefulness of the adaptivity provided by adaptive routing algorithms. PWOE reduces average packet latency for real applications by 30% on average. Moreover, the saturation traffic rate of the NoC is delayed by 13.4% under synthetic traffic loads.
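For reference, here is a sketch of the baseline that PWOE extends: Chiu's minimal odd-even routing on a 2D mesh. It is written as a routing function that returns the set of admissible output directions. It covers only the original odd-even turn model, not Preemptive Waiting. It assumes x grows eastward and y grows northward, and the direction labels `"E"`, `"W"`, `"N"`, `"S"` and `"LOCAL"` are illustrative. The number of directions offered at each hop gives an intuition for how much adaptivity the algorithm leaves to the router.

```python
def odd_even_directions(src, cur, dst):
    """Admissible output directions under the minimal odd-even turn model.

    Coordinates are (x, y) on a 2D mesh; column parity is taken from x.
    Rule 1: no east-to-north / east-to-south turns in even columns.
    Rule 2: no north-to-west / south-to-west turns in odd columns.
    """
    (sx, _), (cx, cy), (dx, dy) = src, cur, dst
    e0, e1 = dx - cx, dy - cy
    vertical = "N" if e1 > 0 else "S"

    if e0 == 0 and e1 == 0:
        return {"LOCAL"}              # arrived: eject to the local core
    if e0 == 0:
        return {vertical}             # same column: only vertical moves remain

    dirs = set()
    if e0 > 0:                        # eastbound packet
        if e1 == 0:
            return {"E"}
        # Turning N/S here is an EN/ES turn unless we are still at the source.
        if cx % 2 == 1 or cx == sx:
            dirs.add(vertical)
        # Going east is unsafe only if the destination column is even and
        # adjacent, since the packet would then have to turn there.
        if dx % 2 == 1 or e0 != 1:
            dirs.add("E")
    else:                             # westbound packet
        dirs.add("W")
        # Moving N/S now implies a later NW/SW turn in this column.
        if e1 != 0 and cx % 2 == 0:
            dirs.add(vertical)
    return dirs


print(sorted(odd_even_directions((0, 0), (0, 0), (2, 2))))  # ['E', 'N']
print(sorted(odd_even_directions((0, 0), (1, 0), (2, 2))))  # ['N']
```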
- M. Sadrosadati, A. H. Mirhosseini, H. Aghilinasab, H. Sarbazi-Azad, An efficient DVS scheme for on-chip networks using reconfigurable Virtual Channel allocators, ISLPED, 2015.
ABSTRACT: The Network-on-Chip (NoC) is a key contributor to the total power consumption of a chip multiprocessor. Dynamic Voltage Scaling is a promising method for power saving in NoCs since it reduces both static and dynamic power consumption. In this paper, we propose a novel scheme to reduce on-chip network power consumption when the number of Virtual Channels (VCs) with active allocation requests per cycle is less than the total number of VCs. In our method, we introduce reconfigurable arbitration logic that can be configured to have multiple latencies and hence multiple slack times. The increased slack times are then used to lower the supply voltage of the routers, reducing their power consumption. Using this method, we save up to 45.7% power compared to a baseline architecture without any performance loss.
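The step from extra slack to a lower supply voltage can be illustrated with the alpha-power-law delay model (Sakurai and Newton). In this model, gate delay grows as the supply voltage approaches the threshold voltage. The sketch below is an illustration under stated assumptions, not the paper's circuit model.

- `v_th=0.3`, `alpha=1.3` and the nominal 1.0 V supply are hypothetical normalized values.
- `slack` is the factor by which a stage's timing budget is relaxed, for example 2.0 if an arbitration step may take two cycles instead of one.

The sketch bisects for the lowest voltage that still meets the relaxed budget. It then reports the dynamic-power ratio at a fixed clock frequency, using P_dyn ∝ V².

```python
def gate_delay(v, v_th=0.3, alpha=1.3):
    """Relative gate delay under the alpha-power law: d ~ V / (V - Vth)^alpha."""
    return v / (v - v_th) ** alpha


def min_voltage_for_slack(slack, v_nom=1.0, v_th=0.3, alpha=1.3, tol=1e-6):
    """Lowest supply voltage whose delay fits in `slack` x the nominal delay.

    For alpha >= 1 the delay decreases monotonically as V rises above Vth,
    so a simple bisection on [Vth, Vnom] finds the crossing point.
    """
    if slack < 1.0:
        raise ValueError("slack must be >= 1 (the budget can only be relaxed)")
    budget = slack * gate_delay(v_nom, v_th, alpha)
    lo, hi = v_th + tol, v_nom        # delay(lo) is huge; delay(hi) meets budget
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gate_delay(mid, v_th, alpha) <= budget:
            hi = mid                  # mid still meets timing: try lower
        else:
            lo = mid                  # too slow: need more voltage
    return hi


v_scaled = min_voltage_for_slack(2.0)
dyn_power_ratio = (v_scaled / 1.0) ** 2   # dynamic power at fixed frequency
print(f"V = {v_scaled:.3f}, P_dyn ratio = {dyn_power_ratio:.3f}")
```

The model ignores leakage's own voltage dependence and any voltage-domain constraints of a real router. It is meant only to show why added slack translates into a quadratic dynamic-power benefit.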
- A. H. Mirhosseini, M. Sadrosadati, A. Fakhrzadehgan, M. Modarressi, H. Sarbazi-Azad, An energy-efficient virtual channel power-gating mechanism for on-chip networks, DATE, 2015.
ABSTRACT: Power-gating is a promising method for reducing the leakage power of digital systems. In this paper, we propose a novel power-gating scheme for virtual channels in on-chip networks that uses an adaptive method to dynamically adjust the number of active VCs based on on-chip traffic characteristics. Since virtual channels are used to provide higher throughput under high traffic loads, our method selectively sets the number of virtual channels at each port based on workload demand, thereby avoiding any negative effect on performance. Evaluation results show that this scheme achieves an average reduction of about 40% in static power consumption with negligible performance overhead.
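Demand-driven VC gating can be illustrated with a per-port controller. The abstract does not specify the paper's adaptive policy, so the sketch below is a generic hysteresis controller with hypothetical names and thresholds.

- `PortVcGater` is a hypothetical per-port controller.
- `low` and `high` are occupancy thresholds for gating and waking a VC.
- `busy_vc_cycles` is the sum, over the currently active VCs, of cycles each was occupied during the epoch.

At the end of each epoch, the controller gates one VC when the active VCs are lightly used. It wakes one VC when they are heavily used, and it always keeps at least one VC powered.

```python
from dataclasses import dataclass


@dataclass
class PortVcGater:
    """Hypothetical demand-driven VC power-gating controller for one port."""
    total_vcs: int = 4
    active_vcs: int = 4
    low: float = 0.25    # hypothetical: gate one VC below this occupancy
    high: float = 0.75   # hypothetical: wake one VC above this occupancy

    def end_of_epoch(self, busy_vc_cycles: int, epoch_cycles: int) -> int:
        """Update and return the number of powered VCs for the next epoch."""
        # Fraction of available VC-cycles that were actually occupied.
        occupancy = busy_vc_cycles / (self.active_vcs * epoch_cycles)
        if occupancy > self.high and self.active_vcs < self.total_vcs:
            self.active_vcs += 1     # demand is high: wake a gated VC
        elif occupancy < self.low and self.active_vcs > 1:
            self.active_vcs -= 1     # demand is low: gate a VC to save leakage
        return self.active_vcs


gater = PortVcGater()
history = [gater.end_of_epoch(busy, epoch_cycles=256)
           for busy in (100, 100, 100, 240, 480)]
print(history)  # [3, 2, 1, 2, 3]
```

The gap between the two thresholds keeps the controller from toggling a VC on and off every epoch when occupancy hovers near a single cut-off.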