# SynCron

## **Efficient Synchronization Support** for Near-Data-Processing Architectures

**Computer Architecture**
Lecture 24: Cutting-edge Research in Computer Architecture III
23.12.2021

Christina Giannoula

Nandita Vijaykumar, Nikela Papadopoulou, Vasileios Karakostas, Ivan Fernandez, Juan Gómez Luna, Lois Orosa, Nectarios Koziris, Georgios Goumas, Onur Mutlu

## Agenda

- How is synchronization implemented in commodity architectures?
- Would existing hardware synchronization mechanisms be efficient in Near-Data-Processing architectures?
- How to design an efficient synchronization mechanism for Near-Data-Processing architectures?

## Near-Data-Processing (NDP) Systems

## Synchronization is Necessary

- **Graph Analytics**
- **Databases**
- Concurrent Data Structures

### Challenge: Efficient Synchronization

## SynCron

The first end-to-end synchronization solution for NDP architectures

SynCron's benefits:

1. High system **performance**
2. Low hardware cost
3. Programming ease
4. **Generality** to cover a wide range of synchronization primitives

## Outline

- NDP Synchronization Solution Space
- Our Mechanism: SynCron
- **Evaluation**

## Baseline NDP Architecture

Synchronization challenges in NDP systems:

1. Lack of a shared level of cache memory
2. Lack of hardware cache coherence support
3. Expensive communication across NDP units

## NDP Synchronization Solution Space

Categories of prior schemes: **Shared Memory** and Message-passing.

Prior schemes named on the slide:

- **Shared memory locations** [EuroPar'06]
- Cohort Locks [TOPC'15]
- Ticket Locks [TOCS'91]
- ...
- MPPs: QOLB [ASPLOS'89]

Obstacles these schemes face in NDP systems:

- Lack of a shared level of cache memory
- Lack of hardware cache coherence support
- Expensive communication across NDP units

Prior schemes are **not suitable** or **efficient** for NDP systems

## SynCron: Overview

SynCron consists of four key techniques:

1. Hardware support for synchronization acceleration
2. Direct buffering of synchronization variables
3. Hierarchical message-passing communication
4. Integrated hardware-only overflow management

## 1. Hardware Synchronization Support

- req\_sync
- req\_async

(Figure: local lock acquire)

- ✓ No complex cache coherence protocols
- ✓ No expensive atomic operations
- ✓ Low hardware cost

## 2. Direct Buffering of Variables

- ✓ No costly memory accesses
- ✓ Low latency

## 4. Integrated Overflow Management

- ✓ Low performance degradation
- ✓ High programming ease

## SynCron's Supported Primitives

Lock primitive:

- lock\_acquire()
- lock\_release()

Barrier primitive:

- barrier\_wait\_within\_NDP\_unit()
- barrier\_wait\_across\_NDP\_units()

Semaphore primitive:

- sem\_wait()
- sem\_post()

Condition variable primitive:

- cond\_wait()
- cond\_signal()
- cond\_broadcast()

Synchronization metadata:

1. Queueing list
2. Condition to be satisfied

## Lock Operation

All NDP cores compete for the same lock variable

## Lock Operation - Overflow

## Evaluation Methodology

- Simulators:
  - Zsim [Sanchez+, ISCA'13]
  - Ramulator [Kim+, CAL'15]
- System configuration:
  - 4x NDP units of 16 in-order cores
  - 16KB L1 Data + Instr. Cache
  - 4GB HBM memory
- SynCron's default parameters:
  - Synchronization Processing Unit @1GHz
  - 12-cycle worst-case latency for a message to be served [Aladdin]
  - 64 entries in the Synchronization Table, 1-cycle latency [CACTI]
  - 256 entries in the indexing counters, 2-cycle latency [CACTI]

## Workloads

- 9x pointer-chasing data structures from ASCYLIB [David+, ASPLOS'15]
- 6x graph applications from Crono [Ahmad+, IISWC'15]
- Time series analysis from Matrix Profile [Yeh+, ICDM'16]

## Comparison Points for SynCron

1. SynCron
2. <u>Central</u> [Ahn+, ISCA'15]:
   - Synchronization server: one NDP core of the NDP system
   - Centralized hardware message-passing communication
3. <u>Hier</u> [Gao+, PACT'15 / Tang+, ASPLOS'19]:
   - Synchronization servers: one NDP core per NDP unit
   - Hierarchical hardware message-passing communication
4. Ideal
   - Zero overhead for synchronization

## Throughput of Pointer Chasing

SynCron achieves the highest throughput under all contention scenarios

## Speedup in Real Applications

SynCron performs best across all real applications

## System Energy in Real Applications

SynCron reduces system energy significantly

## Memory Technologies

SynCron is **orthogonal** to the memory technology used

## Area and Power Overheads

|                | Synchronization Engine | ARM Cortex A7 |
|----------------|------------------------|---------------|
| Technology     | 40 nm                  | 28 nm         |
| Area (9.78%)   | Total: 0.0461 mm²      | Total: 0.45 mm² |
| Power (2.70%)  | 2.7 mW                 | 100 mW        |

SynCron has low area and power overheads

## Sensitivity Studies

- Various data placement techniques
- Various transfer latencies on links across NDP units
- Overflow management cost
- Various sizes for the Synchronization Table

SynCron is **effective** for a **wide variety** of configurations

## Summary & Conclusion

- Synchronization is a major system challenge for NDP systems
- Prior schemes are not suitable or efficient for NDP systems
- SynCron is the first end-to-end synchronization solution for NDP architectures
- SynCron consists of four key techniques:
  1. Hardware support for synchronization acceleration
  2. Direct buffering of synchronization variables
  3. Hierarchical message-passing communication
  4. Integrated hardware-only overflow management
- SynCron's benefits: 90.5% and 93.8% of the performance and energy, respectively, of an Ideal zero-overhead scheme
- SynCron is highly efficient, low-cost, easy to use, and general enough to support many synchronization primitives
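
## Illustration: Buffering Lock Variables with Overflow

The slides describe direct buffering of synchronization variables in a 64-entry Synchronization Table, per-variable metadata that includes a queueing list, and a fallback when the table overflows. The Python sketch below is only a toy software model of that idea, not SynCron's hardware design. The class name `SyncTable`, its fields, and the spill-to-a-second-dictionary fallback are assumptions made for illustration. Only the names `lock_acquire` and `lock_release` and the 64-entry default come from the slides.

```python
from collections import deque


class SyncTable:
    """Toy software model of a fixed-capacity table that buffers lock variables."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = {}   # address -> lock state buffered in the table
        self.overflow = {}  # address -> lock state spilled outside the table (assumed fallback)

    def _find(self, addr):
        """Return the state for addr, or None if the lock is currently untracked."""
        if addr in self.entries:
            return self.entries[addr]
        return self.overflow.get(addr)

    def lock_acquire(self, addr, core):
        """Return True if core gets the lock now, False if it was queued."""
        state = self._find(addr)
        if state is None:
            # Allocate in the table while there is room, otherwise spill.
            state = {"owner": None, "waiters": deque()}
            target = self.entries if len(self.entries) < self.capacity else self.overflow
            target[addr] = state
        if state["owner"] is None:
            state["owner"] = core
            return True
        state["waiters"].append(core)  # queueing list of waiting cores
        return False

    def lock_release(self, addr, core):
        """Hand the lock to the oldest waiter; return the new owner or None."""
        state = self._find(addr)
        if state is None or state["owner"] != core:
            raise ValueError("lock_release by a core that does not hold the lock")
        if state["waiters"]:
            state["owner"] = state["waiters"].popleft()
            return state["owner"]
        # No waiters: free the entry so another variable can use it.
        self.entries.pop(addr, None)
        self.overflow.pop(addr, None)
        return None
```

In this model a waiting core never spins on a shared memory location: the table queues it and hands the lock over in arrival order on release. A capacity of 1 makes it easy to see a second lock variable land in the assumed overflow store.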