# Research on Topology and Policy for Low Power Consumption of Network-on-chip with Multicore Processors

Juan Fang College of Computer Science Beijing University of Technology Beijing, China fangjuan@bjut.edu.cn

Jiajia Lu, Chaojie She College of Computer Science Beijing University of Technology Beijing, China {lu\_jiajia, shechaojie}@emails.bjut.edu.cn

Abstract—On-chip multiprocessor technology has developed rapidly with progress in manufacturing processes, and the number of processor cores on a chip has increased continuously. However, a growing number of cores makes communication between cores more difficult. To overcome this problem, the network-on-chip (NoC) was proposed, which replaces traditional bus connections between cores with routers and makes communication smoother. Although the quality of communication service has improved, the NoC still faces several technical challenges, such as bounding execution time, controlling network delay effectively and reducing power consumption. In this study, we introduce a new topology and a new router policy for the NoC. In the topology, more components are connected to one router, which reduces the number of routers and therefore the power consumption. In the policy, adjusting the communication scheduling of routers and dynamically checking changes in the port sequence improves the communication infrastructure and increases network throughput. In addition, dynamically adjusting the position where the minimum unit of data is stored in the router significantly reduces network congestion and relieves pressure on the links. Compared with present techniques, power consumption is reduced by about 15%–20%. The reduction is larger when compared with the flattened butterfly topology, and it becomes more pronounced as the number of cores and tasks increases.

*Keywords—Multicore processors; topology; NoC; power; network latency* 

# I. INTRODUCTION

A network-on-chip (NoC) based system can adapt to the clock mechanism of a complex system on a chip, and is considered an inevitable development trend in multicore technology for future integration processes. Several factors still restrict its further development: data packing, buffering, synchronization and interfacing add delay, and buffering and the added logic increase power consumption. Nevertheless, a general improvement in performance over the traditional bus system can be achieved. A poor design of the topology or routing policy, however, degrades overall performance, for example by lowering communication quality and increasing network latency and power consumption. In this study, we introduce a scheme called reduce router counts and increase efficiency service (RRCIES). We first improve the topological structure by reducing the number of routers, on the basis of the proposed topology, to reduce power consumption. We then adjust the sequence in which virtual channels holding flits are checked for passage through the crossbar, and dynamically allocate virtual channels to the minimum units of data (flits), which effectively reduces network congestion and copes with the increasing number of cores. The second section of this paper analyzes current studies in this field. The third section introduces the performance metrics and the associated formulas. The fourth section elaborates on the detailed implementation of the scheme and its influence on the overall performance of the NoC. The fifth section tests the scheme and analyzes the data. Finally, the sixth section presents future research directions based on this scheme.

# II. RELATED WORK AND PROBLEM ANALYSIS

More and more research institutes and organizations are becoming aware that the NoC has great potential for development, and they are increasing investment in its research and promotion. Research on NoC is currently focused on NoC topologies, protocols, service quality, signal synchronization and low power consumption.

Regarding topological structure, Zarkesh-Ha, Payman, et al. [1] presented a number of NoC topologies and showed experimentally that communicating via a local bus was faster than other methods. Bourduas, Stephan, and Zeljko Zilic [2] proposed a more creative topology that divides the cores into two layers, a Local-mesh and a Sub-mesh, which reduces the communication distance and improves performance. Nandakumar, Vivek S., and Malgorzata Marek-Sadowska [3] designed a 3D network-on-chip architecture that alleviates the problem of long wires; however, because of the limited chip area, the cores are so close together that they are difficult to cool, which affects their performance. Zahavi, Eitan, Israel Cidon, and Avinoam Kolodny [4] presented GANA, a Global Arbiter NoC Architecture, in which end-to-end data transmission is timed by a global arbiter in a way that avoids any queuing in the network. Sayed, Mostafa S., et al. [5] introduced a router architecture called the Flexible Router, which improves overall network performance by using the same amount of available buffering more efficiently. There is therefore no need to enlarge the buffers or to add extra virtual channels (VCs), which cause high power consumption, area overhead and complex logic.



This research was supported by the National Natural Science Foundation of China (Grant No. 61202076, No.61202062).

Although physical communication lines are the fastest approach [1], chip space is limited, so connecting every core with dedicated physical lines is unrealistic. Mohandesi, E., and M. Mohandesi [6] proposed a method that gives packets generated by cores with large volumes of output network traffic higher priority than packets generated by cores with lower volumes. Other researchers [7-8] also made contributions based on packet switching. However, packet-based communication requires the transfer of considerable additional data, which increases the volume of traffic. To cope with increasing traffic, Liu, Shaoteng, Axel Jantsch, and Zhonghai Lu [9] proposed a circuit-switching network-on-chip that can search the entire network quickly, in a time that depends only on the network size and is independent of the network load. Some studies [10][11] have been very effective. However, the time needed to establish and tear down a link cannot be ignored, and it is difficult to select a criterion that sets a fixed threshold for establishing or tearing down a link. Lusala, Angelo Kuti, and J. Legat [12] combined the former two approaches, and other related research [13][14][15] on packet switching and circuit switching remains of great significance for NoC. Abousamra, Ahmed K., Alex K. Jones, and Rami G. Melhem [17] set different weights for different communication data, which determined the degree of importance of the data and improved communication efficiency. Tsai, Po-An, et al. [15] presented a latency prediction model that simultaneously considers path diversity and buffer occupancy information, and on that basis proposed Hybrid Path-Diversity-Aware (Hybrid PDA) adaptive routing to overcome the congestion problem in NoC.

To reduce NoC power, Akbari, Sara, et al. [18] presented a deadlock-free routing algorithm that improved deadlock avoidance for 3D topologies and addressed the disadvantages caused by the immaturity of the production process. Teimouri, Nasibeh, Mehdi Modarressi and Hamid Sarbazi-Azad [19] proposed a hybrid packet-circuit communication mode that divides the router and control paths into two parts and allocates a communication mode to each of them, gaining the advantages of both modes simultaneously and greatly reducing power consumption. Postman, Jacob, et al. [20] adjusted the policy by which a flit passes through the router, so that flits do not have to be stored in a buffer, and combined the movement of data on the link and in the switches to reduce the communication distance and power consumption.

# III. PERFORMANCE EVALUATION CRITERIA

As noted, the performance quality of the NoC is mainly reflected in the execution time of a program, network delay and, most importantly, power consumption. Because this study mainly aims to reduce power, we take power consumption as the criterion for measuring performance.

Power consumption: Power consumption is an important indicator in the performance evaluation of a NoC. Reducing it while maintaining performance is of great significance. The power consumption of a NoC is mainly composed of link power and router power, as in (1). Within a router, components such as the buffers, arbiter and crossbar consume power, as in (2).

$$Power\_Total = Power\_Link + \sum_{1}^{n} Power\_Router \tag{1}$$

$$Power\_Router = Power\_Buf + Power\_Arb + Power\_Xb \tag{2}$$
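As a minimal sketch of how (1) and (2) combine, the following Python snippet sums per-router power and adds link power. The class and function names are hypothetical, and the numeric values are illustrative placeholders rather than measured results. Using the same per-router values for every router is also an assumption made only to show the arithmetic; a router that serves more cores may draw more power than one that serves a single core.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RouterPower:
    """Per-router terms of Eq. (2); names and units are illustrative."""
    buf: float  # Power_Buf: buffer power
    arb: float  # Power_Arb: arbiter power
    xb: float   # Power_Xb: crossbar power

    def total(self) -> float:
        # Eq. (2): Power_Router = Power_Buf + Power_Arb + Power_Xb
        return self.buf + self.arb + self.xb


def total_power(link_power: float, routers: Iterable[RouterPower]) -> float:
    # Eq. (1): Power_Total = Power_Link + sum of Power_Router over all routers
    return link_power + sum(r.total() for r in routers)


if __name__ == "__main__":
    # Placeholder values, not taken from the experiments in this paper.
    router = RouterPower(buf=0.5, arb=0.1, xb=0.2)
    sixteen = total_power(link_power=1.0, routers=[router] * 16)
    four = total_power(link_power=1.0, routers=[router] * 4)
    print(f"16 routers: {sixteen:.2f}, 4 routers: {four:.2f}")
```

With all other terms held fixed, the router sum in (1) scales with the number of routers, which is why reducing the router count is the first lever considered in this study.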

# IV. USING SPECIFIC SCHEME AND ANALYSIS

The network on a chip is the central link among the cores of a multicore processor. Its performance is mainly reflected in data transmission delay and program execution time. A general mesh topology adapts poorly to the requirements of many cores, and a simple query strategy for virtual channels cannot satisfy the communication demand. Limitations in the network structure, the number of virtual channels and constraints on routing scheduling lead to long latency and low communication rates between cores, which affects the overall performance of the system. To improve this situation, the RRCIES scheme improves both the topology and the routing scheduling policy.

Major topologies are crossbar, Mesh, Folded Torus and Pt2Pt. As the number of cores increases, a simple Mesh topology lengthens the communication distance between cores. This increases the average number of hops in the overall structure and, as a result of path conflicts, the dynamic power consumption of the routers. Overall performance decreases as a result, and the effect is most apparent for large traffic volumes.

The topology of RRCIES is based on the Mesh topology. We propose a novel topology in which several cores connect to the network through one router, which controls the communication of all its cores with the others, and in which the routers are linked to one another as in a mesh. This reduces the average communication distance and the number of buffers. The topology of RRCIES is shown in Fig. 1.



A Mesh topology has as many routers as cores, so the router count grows with the core count, which occupies more chip area, increases chip density and consumes more power. In the RRCIES scheme, one router connects four cores, so the topology needs fewer routers for the same number of cores, which saves chip area and makes heat dissipation easier. The flattened butterfly topology, on the other hand, has more links between routers, which requires more components. These components consume static power simply to keep working. RRCIES does not need as many links, so it has fewer components and lower power consumption.
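The difference in link count can be illustrated with a short sketch. It assumes a two-dimensional flattened butterfly in which every router is directly linked to all routers in its row and in its column, which is consistent with the two-hop description in Table I but is not spelled out in the text. The function names are hypothetical, and each bidirectional link is counted once.

```python
def mesh_links(side: int) -> int:
    """Router-to-router links in a side x side 2D mesh."""
    # Each of the `side` rows has (side - 1) links, and likewise each column.
    return 2 * side * (side - 1)


def flattened_butterfly_links(side: int) -> int:
    """Links when each row and each column of routers is fully connected."""
    per_line = side * (side - 1) // 2  # links in one fully connected row/column
    return 2 * side * per_line         # `side` rows plus `side` columns


if __name__ == "__main__":
    for side in (2, 4):
        print(side, mesh_links(side), flattened_butterfly_links(side))
```

For a 4 x 4 arrangement of routers, this gives 24 links for the mesh-style interconnect and 48 for the flattened butterfly under the stated assumption, so the gap in link-related components widens as the network grows.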

With the improved topology, four cores are connected to the network through the same router, which reduces the number of routers in the network and shortens the communication distance between cores, reducing the number of hops and the time required for long-distance data transmission. The decrease in the number of routers also reduces the total number of router components, such as virtual channels and buffers, which saves power. The improvement therefore raises system performance while reducing power consumption.
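The effect on router count and hop distance can be sketched as follows. The snippet assumes routers are arranged in a square grid and measures the mean Manhattan distance between distinct routers, which corresponds to minimal routing under uniform traffic. These are simplifying assumptions for illustration, and the function names are hypothetical. Cores that share a router need no router-to-router hop at all, which the sketch does not count.

```python
import math
from itertools import product


def router_count(cores: int, cores_per_router: int) -> int:
    """Routers needed when each router serves `cores_per_router` cores."""
    return math.ceil(cores / cores_per_router)


def mean_hops(side: int) -> float:
    """Mean Manhattan distance between distinct routers in a side x side grid."""
    nodes = list(product(range(side), repeat=2))
    total = 0
    pairs = 0
    for (x1, y1) in nodes:
        for (x2, y2) in nodes:
            if (x1, y1) != (x2, y2):
                total += abs(x1 - x2) + abs(y1 - y2)
                pairs += 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    for cores in (16, 64):
        for per_router in (1, 4):
            routers = router_count(cores, per_router)
            side = math.isqrt(routers)  # assumes a square router grid
            print(cores, per_router, routers, round(mean_hops(side), 3))
```

Under these assumptions, 64 cores need 64 routers in an 8 x 8 grid with one core per router, but only 16 routers in a 4 x 4 grid with four cores per router, and the mean router-to-router distance halves from 16/3 to 8/3 hops.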

The second aspect of RRCIES is an improved router policy. Data transmitted between cores obtains the location of the next node by querying the NoC routing table. A data packet takes several steps to pass through a router: the data is received, its output port is computed, it obtains priority to access the switch, and it is switched to the output port. When output ports are allocated and data is transferred from a buffer to the switch, the system always searches from the first input unit to find the virtual-net that has data to be transferred. The first input unit therefore has the highest authority to obtain an output port to the next node. Other input units have to wait if no output port is available for them, which is particularly costly if an input unit at the back of the queue has a large amount of data to transfer. Another limitation is that when flits belonging to one message occupy a particular virtual channel, no other flit may be stored there unless it belongs to the same message. Only when all the flits of that message have moved on to other components and the state of the virtual channel has been set to idle may other flits be stored. Because a large amount of data may be unable to obtain the one virtual channel it requires, data accumulates at that channel. In summary, the defects described above cause network congestion and compromise system efficiency.

To address this, we improved the routing scheduling policy as follows. A counter is set for every input unit in the router to periodically monitor the volume of data it sends to the router, and the input units are sorted by these volumes. The input unit that transmits the largest volume of data is placed at the front of the query sequence, which gives it priority; the counters are then cleared and counting continues. In addition, a conflict arises when the required virtual channel is occupied by another message. To reduce the time a flit waits for a virtual channel, a search is made for a nearby virtual channel that is idle. If one is found, all the flits belonging to the message are transmitted to the next component through this virtual channel. To ensure that the feedback signal sent by the next node reaches the source node correctly, the index of the originally required virtual channel is stored.
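The counter-based ordering of input units can be sketched as below. This is an illustrative model rather than the simulator code used in this work: the class name, the `period` parameter and the tie-breaking rule (ties keep their previous relative order) are assumptions introduced for the example.

```python
from typing import List


class InputUnitScheduler:
    """Reorders the query sequence of input units by recent traffic volume."""

    def __init__(self, num_inputs: int, period: int) -> None:
        self.counters: List[int] = [0] * num_inputs    # flits seen per input unit
        self.order: List[int] = list(range(num_inputs))  # current query sequence
        self.period = period                           # cycles between re-sorts (assumed)
        self.cycle = 0

    def record(self, input_unit: int, flits: int = 1) -> None:
        """Count flits arriving at an input unit."""
        self.counters[input_unit] += flits

    def tick(self) -> List[int]:
        """Advance one cycle; at each period boundary, re-sort and clear counters."""
        self.cycle += 1
        if self.cycle % self.period == 0:
            # Stable sort: heaviest input unit first, ties keep previous order.
            self.order.sort(key=lambda i: -self.counters[i])
            self.counters = [0] * len(self.counters)
        return list(self.order)


if __name__ == "__main__":
    sched = InputUnitScheduler(num_inputs=4, period=2)
    sched.record(2, flits=5)
    sched.record(1, flits=3)
    sched.tick()         # cycle 1: order unchanged
    print(sched.tick())  # cycle 2: input units re-sorted by volume
```

Sorting the whole sequence, rather than only moving the busiest unit to the front, is one possible reading of the policy; either variant keeps a heavily loaded input unit from waiting behind lightly loaded ones.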

The proposed policy first ensures that every input unit has the opportunity to acquire priority and obtain an output port, which avoids input units waiting for long periods before being checked. Second, dynamically allocating a virtual channel to a flit means the flit does not have to wait for a particular virtual channel. This reduces the waiting time from when a message is generated to when it is sent into the network and reduces the overall communication time, so data passes through the router more quickly and communication performance improves.
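Dynamic virtual channel allocation can be sketched in the same spirit. The snippet assumes that "close" means nearest by channel index and that a message keeps its channel until all of its flits have left; both are interpretations made for illustration, and the class and method names are hypothetical.

```python
from typing import Dict, Hashable, List, Optional


class VirtualChannelAllocator:
    """Assigns an idle virtual channel near the requested one to each message."""

    def __init__(self, num_vcs: int) -> None:
        self.owner: List[Optional[Hashable]] = [None] * num_vcs  # None means idle
        self.requested: Dict[Hashable, int] = {}  # message -> originally required VC

    def allocate(self, msg: Hashable, wanted: int) -> Optional[int]:
        """Return the VC for `msg`, or None if every VC is occupied."""
        # Flits of a message that already holds a VC follow it into the same VC.
        for vc, holder in enumerate(self.owner):
            if holder == msg:
                return vc
        n = len(self.owner)
        # Search outward from the required VC for the nearest idle one.
        for dist in range(n):
            for vc in (wanted - dist, wanted + dist):
                if 0 <= vc < n and self.owner[vc] is None:
                    self.owner[vc] = msg
                    # Keep the required index so feedback can be routed back.
                    self.requested[msg] = wanted
                    return vc
        return None

    def release(self, msg: Hashable) -> Optional[int]:
        """Mark the message's VC idle and return the originally required index."""
        for vc, holder in enumerate(self.owner):
            if holder == msg:
                self.owner[vc] = None
        return self.requested.pop(msg, None)


if __name__ == "__main__":
    vcs = VirtualChannelAllocator(num_vcs=4)
    print(vcs.allocate("A", wanted=1))  # required VC is idle
    print(vcs.allocate("B", wanted=1))  # required VC busy, a neighbour is used
```

Storing the originally required index alongside the allocated one mirrors the stored subscript described above, so that credits or acknowledgements from the next node can still be matched to the source.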

With this dual approach to optimization, running time is significantly improved. At the same time, because of the changes in the scheduling strategy and in virtual channel selection, power consumption is also reduced. Moreover, the improvement becomes more apparent as the number of cores and tasks increases.

# V. RESULTS AND ANALYSIS

Our experiment used gem5 as the full-system simulation platform and the PARSEC benchmark suite as the test program, with the Orion model used to estimate power. The configuration of the three schemes is shown in Table I.

TABLE I. CONFIGURATION INFORMATION OF EXPERIMENT

| Configuration      | Mesh                     | Flattened butterfly          | RRCIES                              |
|--------------------|--------------------------|------------------------------|-------------------------------------|
| CPU model          | TimingSimple             | TimingSimple                 | TimingSimple                        |
| Number of cores    | 16 & 64                  | 16 & 64                      | 16 & 64                             |
| Cache protocol     | MOESI_hammer             | MOESI_hammer                 | MOESI_hammer                        |
| Topology           | one router with one core | two hops between all routers | one router links to four cores      |
| VC checking policy | statically allocate VC   | same as Mesh                 | allocate and check VC dynamically   |
| Workload           | PARSEC                   | PARSEC                       | PARSEC                              |



Fig. 2. Network power for the 16-core test



Fig. 3. Network power for the 64-core test

As Figs. 2 and 3 show, Mesh consumes more power because it has more routers and therefore more arbiters and crossbars, all of which must be powered to keep working. Although the flattened butterfly has the same number of routers as RRCIES, it has more links between routers, which requires more buffers; these buffers consume power to keep working, so its power consumption is higher.

From the above results we observe that the improvement is more significant in the 64-core test. That is, as the number of cores and tasks increases, the improvement in performance becomes increasingly obvious. If there are only a few cores and tasks, the demands on the topology and router policy are low, and the effect of the improvement is not very obvious. With more cores and tasks, the improved topology and router policy add communication paths and reduce the number of hops, with a notable impact on reducing network congestion. In particular, when faced with a large quantity of tasks, a strict path count, scheduling policy and space to store data give better results. Our proposed scheme is therefore significant for a system-on-chip that has many cores and a large number of tasks.

# VI. CONCLUSIONS

Our proposed scheme RRCIES changes the topology by letting four cores connect to the network through the same router, which shortens the communication distance and reduces the number of virtual channels, thereby reducing the total power consumption of the system. Meanwhile, the routing scheduling policy is improved by setting the priority of input units according to the volume of data they receive and by dynamically storing data in allocated positions, which saves data into a buffer quickly, reduces the waiting time in the buffer and eases network congestion.

Our next step is to continue research on the routing strategy, such as dynamically adjusting the number of virtual channels according to the use of existing buffers, which should have great significance in reducing power consumption, shortening network latency and improving system efficiency.

# ACKNOWLEDGMENT

This work was supported by the National Natural Science Foundation of China (Grant No. 61202076, 61202062). The authors would like to thank the reviewers for their efforts and for providing helpful suggestions that have led to several important improvements in our work. We would also like to thank all the teachers and students in our laboratory for useful discussions.

# REFERENCES

- [1] Zarkesh-Ha, Payman, et al. "Hybrid network on chip (HNoC): local buses with a global mesh architecture," Proceedings of the 12th ACM/IEEE international workshop on System level interconnect prediction. ACM, 2010, pp. 9-14.
- [2] Bourduas, Stephan, and Zeljko Zilic. "A hybrid ring/mesh interconnect for network-on-chip using hierarchical rings for global routing," Networks-on-Chip, 2007. NOCS 2007. First International Symposium on. IEEE, 2007, pp. 195-204.
- [3] Nandakumar, Vivek S., and Malgorzata Marek-Sadowska. "Low power, high throughput network-on-chip fabric for 3D multicore processors," Computer Design (ICCD), 2011 IEEE 29th International Conference on. IEEE, 2011, pp. 453-454.
- [4] Zahavi, Eitan, Israel Cidon, and Avinoam Kolodny. "Gana: A novel low-cost conflict-free NoC architecture," ACM Transactions on Embedded Computing Systems (TECS) 12.4 (2013):109.
- [5] Sayed, Mostafa S., et al. "Flexible router architecture for network-on-chip," Computers & Mathematics with Applications 64.5 (2012), pp. 1301-1310.
- [6] Mohandesi, E., and M. Mohandesi. "Improving performance of NoCs by packet prioritization," Signals, Circuits and Systems (ISSCS), 2011 10th International Symposium on. IEEE, 2011, pp. 1-4.
- [7] Castillo, Emilio, et al. "Advanced Switching Mechanisms for Forthcoming On-Chip Networks," Digital System Design (DSD), 2013 Euromicro Conference on. IEEE, 2013, pp. 598-605.
- [8] Fan, Hongbing, Yue-Ang Chen, and Yu-Liang Wu. "R-NoC: an efficient packet-switched reconfigurable networks-on-chip," Reconfigurable Computing: Architectures, Tools and Applications. Springer Berlin Heidelberg, 2012, pp. 365-371.
- [9] Liu, Shaoteng, Axel Jantsch, and Zhonghai Lu. "Parallel probing: Dynamic and constant time setup procedure in circuit switching NoC," Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2012, pp. 1289-1294.
- [10] Papamichael, Michael K., James C. Hoe, and Onur Mutlu. "Fist: A fast, lightweight, fpga-friendly packet latency estimator for noc modeling in full-system simulations," Networks on Chip (NoCS), 2011 Fifth IEEE/ACM International Symposium on. IEEE, 2011, pp. 137-144.
- [11] Marculescu, Radu, et al. "Outstanding research problems in NoC design: system, microarchitecture, and circuit perspectives," Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on 28.1 (2009), pp. 3-21.
- [12] Lusala, Angelo Kuti, and J. Legat. "A hybrid router combining circuit switching and packet switching with bus architecture for on-chip networks," NEWCAS Conference (NEWCAS), 2010 8th IEEE International. IEEE, 2010, pp. 237-240.
- [13] Lusala, Angelo Kuti, and J. Legat. "A hybrid NoC combining SDM-TDM based circuit-switching with packet-switching for real-time applications," New Circuits and Systems Conference (NEWCAS), 2012 IEEE 10th International. IEEE, 2012, pp. 17-20.
- [14] Li, Li, et al. "NoC retrograde-turn routing algorithm based on packet-circuit switching," Dianzi Yu Xinxi Xuebao (Journal of Electronics and Information Technology) 33.11 (2011), pp. 2759-2763.
- [15] Tsai, Po-An, et al. "Hybrid path-diversity-aware adaptive routing with latency prediction model in Network-on-Chip systems," VLSI Design, Automation, and Test (VLSI-DAT), 2013 International Symposium on. IEEE, 2013, pp. 1-4.
- [16] Li, Hui, et al. "A hybrid packet-circuit switched router for optical network on chip," Computers & Electrical Engineering 39.7 (2013), pp. 2197-2206.
- [17] Abousamra, Ahmed K., Alex K. Jones, and Rami G. Melhem. "NoC-aware cache design for multithreaded execution on tiled chip multiprocessors," Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers. ACM, 2011.
- [18] Akbari, Sara, et al. "AFRA: A low cost high performance reliable routing for 3D mesh NoCs," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012. IEEE, 2012, pp. 332-337.
- [19] Teimouri, Nasibeh, Mehdi Modarressi, and Hamid Sarbazi-Azad. "Power and Performance Efficient Partial Circuits in Packet-Switched Networks-on-Chip," Parallel, Distributed and Network-Based Processing (PDP), 2013 21st Euromicro International Conference on. IEEE, 2013, pp. 509-513.
- [20] Postman, Jacob, et al. "Swift: A low-power network-on-chip implementing the token flow control router architecture with swing-reduced interconnects," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 21.8 (2013), pp. 1432-1446.