readings

Differences

This shows you the differences between two versions of the page.

readings [2018/12/04 09:28] – [Lecture 20 (29.11 Thu.)] juang
readings [2019/12/12 09:02] (current) – external edit 127.0.0.1
Line 479: Line 479:
   * {{d7ce51c62671d5ffc1506786b0b7861ce00a.pdf| Jose A. Joao, M. Aater Suleman, Onur Mutlu, and Yale N. Patt, "Utility-Based Acceleration of Multithreaded Applications on Asymmetric CMPs". ISCA'13}}
   * {{22310236.pdf| Ed Grochowski, Ronny Ronen, John Shen, and Hong Wang, "Best of Both Latency and Throughput". ICCD 2004}}
-  * {{amdahl.pdf|G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967}}
+  * {{lecture1-amdahl.pdf|G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967}}
   * {{05389044.pdf|J. M. Tendler, J. S. Dodson, J. S. Fields, Jr., H. Le, and B. Sinharoy, "POWER4 System Microarchitecture". IBM J R&D 2002}}
   * {{719990eaab63a6bfa2988b5fd57a03b13229.pdf| Ron Kalla, Balaram Sinharoy, and Joel M. Tendler, "IBM Power5 Chip: A Dual-Core Multithreaded Processor". IEEE Micro 2004}}
Line 496: Line 496:
   * {{https://people.inf.ethz.ch/omutlu/pub/tldram_hpca13.pdf|D. Lee, Y. Kim, V. Seshadri, J. Liu, L. Subramanian, O. Mutlu, "Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture," HPCA 2013}}
   * {{https://people.inf.ethz.ch/omutlu/pub/raidr-dram-refresh_isca12.pdf|J. Liu, B. Jaiyen, R. Veras, O. Mutlu, "RAIDR: Retention-Aware Intelligent DRAM Refresh," ISCA 2012}}
 +
 +===== Lecture 21 (05.12 Wed.) =====
 +=== Suggested (lecture 21): ===
 +  * {{https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html|NVIDIA, "CUDA C Programming Guide," Version 9.0, 2018}}
 +  * {{https://www.sciencedirect.com/science/book/9780128119860|D.B. Kirk and W.M. Hwu, "Programming Massively Parallel Processors. A Hands-on Approach," Third Edition, 2017}}
 +  * {{p140-fisher.pdf|J.A. Fisher, “Very Long Instruction Word architectures and the ELI-512,” ISCA 1983}}
 +  * {{Sung_2012.pdf|I.J. Sung, G.D. Liu, W.M. Hwu, "DL: A Data Layout Transformation System for Heterogeneous Computing," INPAR 2012}}
 +  * {{pseudo-randomly_interleaved_memory.pdf|B. R. Rau, "Pseudo-Randomly Interleaved Memory," ISCA 1991}}
 +  * {{Braak_2016.pdf|G.J.v.d. Braak, J. Gomez-Luna, J.M. Gonzalez-Linares, H. Corporaal, N. Guil, "Configurable XOR Hash Functions for Banked Scratchpad Memories in GPUs," IEEE TC, 2016}}
 +  * {{GomezLuna_2013.pdf|J. Gomez-Luna, J.M. Gonzalez-Linares, J.I. Benavides, N. Guil, "Performance Modeling of Atomic Additions on GPU Scratchpad Memory," IEEE TPDS, 2013}}
 +  * {{GomezLuna_2012.pdf|J. Gomez-Luna, J.M. Gonzalez-Linares, J.I. Benavides, N. Guil, "Performance Models for Asynchronous Data Transfers on Consumer Graphics Processing Units," JPDC, 2012}}
 +  * {{GomezLuna_2017.pdf|J. Gomez-Luna, I. El Hajj, L. Chang, V. Garcia-Flores, S. G. de Gonzalo, T. B. Jablin, A. J. Peña, W. Hwu, "Chai: Collaborative heterogeneous applications for integrated-architectures," ISPASS 2017}}
 +
 +===== Lecture 22 (06.12 Thu.) =====
 +=== Required (lecture 22): ===
 +  * {{lecture1-amdahl.pdf|G. M. Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities," AFIPS 1967}}
 +  * {{lamport.pdf|L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Transactions on Computers, 1979}}
 +  * {{a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|M. S. Papamarcos and J. H. Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA 1984}}
 +=== Described in detail during lecture 22: ===
 +  * {{using_cache_memory_to_reduce_processor-memory_traffic.pdf|J. R. Goodman, "Using cache memory to reduce processor-memory traffic," ISCA 1983}}
 +  * {{a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|M. S. Papamarcos and J. H. Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA 1984}}
 +  * {{a_new_solution_to_coherence_problems_in_multicache_systems.pdf|L. M. Censier and P. Feautrier, "A new solution to coherence problems in multicache systems," IEEE Trans. Computers, 1978}}
 +  * {{token_coherence_decoupling_performance_and_correctness.pdf|M. Martin, M. D. Hill, and D. A. Wood, "Token coherence: decoupling performance and correctness," ISCA 2003}}
 +=== Recommended (lecture 22): ===
 +  * {{flynn.pdf|M. J. Flynn, "Very High-Speed Computing Systems," Proc. of IEEE, 1966}}
 +  * {{multiprocessors-multicomputers.pdf|M. D. Hill, N. P. Jouppi, G. S. Sohi, "Multiprocessors and Multicomputers," pp. 551-560 in Readings in Computer Architecture.}}
 +  * {{memory_consistency_and_event_ordering_in_scalable_shared-memory_multiprocessors.pdf|K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors," ISCA 1990}}
 +  * {{two_techniques_to_enhance_the_performanc_of_memory_consistency_models.pdf|K. Gharachorloo, A. Gupta, and J. Hennessy, "Two Techniques to Enhance the Performance of Memory Consistency Models," ICPP 1991}}
 +  * {{bulksc_bulk_enforcement_of_sequential_consistency.pdf|L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas, "BulkSC: bulk enforcement of sequential consistency," ISCA 2007}}
 +  * {{https://people.inf.ethz.ch/omutlu/pub/ThyNVM-transparent-crash-consistency-for-persistent-memory_micro15.pdf|J. Ren, J. Zhao, S. Khan, J. Choi, Y. Wu, and O. Mutlu, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems," MICRO 2015}}
 +  * {{https://people.inf.ethz.ch/omutlu/pub/NVMove-byte-based-persistence-tool_inflow16.pdf|H. Chauhan, I. Calciu, V. Chidambaram, E. Schkufza, O. Mutlu, and P. Subrahmanyam, "NVMove: Helping Programmers Move to Byte-Based Persistence," INFLOW 2016}}
 +  * {{a_new_solution_to_coherence_problems_in_multicache_systems.pdf|L. M. Censier and P. Feautrier, "A new solution to coherence problems in multicache systems," IEEE Trans. Computers, 1978}}
 +  * {{using_cache_memory_to_reduce_processor-memory_traffic.pdf|J. R. Goodman, "Using cache memory to reduce processor-memory traffic," ISCA 1983}}
 +  * {{the_sgi_origin_a_ccnuma_highly_scalable_server.pdf|J. Laudon and D. Lenoski, "The SGI Origin: A ccNUMA Highly Scalable Server," ISCA 1997}}
 +  * {{token_coherence_decoupling_performance_and_correctness.pdf|M. Martin, M. D. Hill, and D. A. Wood, "Token coherence: decoupling performance and correctness," ISCA 2003}}
 +  * {{on_the_inclusion_properties_for_multi-level_cache_hierarchies.pdf|J. Baer and W. Wang, "On the inclusion properties for multi-level cache hierarchies," ISCA 1988}}
 +  * {{designofacomputer_cdc6600.pdf|J. E. Thornton, "CDC 6600: Design of a Computer," 1970}}
 +  * {{a_pipelined_shared_resource_mimd_computer.pdf | B. J. Smith, "A Pipelined, Shared Resource MIMD Computer", ICPP 1978}}
 +  * {{a_new_method_of_solving_numerical_equations_of_all_orders_by_continuous_.pdf|W. G. Horner, "A new method of solving numerical equations of all orders, by continuous approximation," Philosophical Transactions of the Royal Society, 1819}}
 +  * {{https://people.inf.ethz.ch/omutlu/pub/acs_asplos09.pdf|M. A. Suleman, O. Mutlu, M. K. Qureshi, and Y. N. Patt, "Accelerating critical section execution with asymmetric multi-core architectures," ASPLOS'09}}
 +  * {{co-operating_sequential_processes.pdf|E. W. Dijkstra, "Cooperating Sequential Processes," 1965}}
 +  * {{culler_parcomparch_5.1.pdf|Culler and Singh, Parallel Computer Architecture, Chapter 5.1 (pp 269–283)}}
 +  * {{culler_parcomparch_5.3.pdf|Culler and Singh, Parallel Computer Architecture, Chapter 5.3 (pp 291-305)}}
 +  * {{ph_computerorganizationanddesignthehardwaresoftwareinterface5th_5.10.pdf|P&H, Computer Organization and Design, Chapter 5.10 (pp 466-470)}}
 +
 +===== Lecture 23 (12.12 Wed.) =====
 +=== Described in detail during lecture 23: ===
 +  * {{bless_isca09.pdf|T. Moscibroda and O. Mutlu, "A Case for Bufferless Routing in On-Chip Networks", ISCA 2009}}
 +
 +=== Suggested (lecture 23): ===
 +  * {{app-aware-noc_micro09.pdf|R. Das, O. Mutlu, T. Moscibroda, and C. R. Das, "Application-Aware Prioritization Mechanisms for On-Chip Networks", MICRO 2009}}
 +  * {{ultrasparc.pdf|M. Shah, J. Barreh, J. Brooks, R. Golla, G. Grohoski, N. Gura, R. Hetherington, P. Jordan, M. Luttrell, C. Olson, B. Saha, D. Sheahan, L. Spracklen, and A. Wynn, "UltraSPARC T2: A Highly-Threaded, Power-Efficient, SPARC SOC", ASSCC 2007}}
 +  * {{7d2822e9b7fcd60f147823478b59fcf7569e.pdf|J. H. Patel, "Processor-memory interconnections for multiprocessors", ISCA 1979}}
 +  * {{Ultracomputer.pdf|A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer - Designing an MIMD Shared Memory Parallel Computer", IEEE Trans. on Comp. 1983}}
 +  * {{hierarchical-rings-with-deflection_sbacpad14.pdf|R. Ausavarungnirun, C. Fallin, X. Yu, K. Chang, G. Nazario, R. Das, G. Loh, and O. Mutlu, "Design and Evaluation of Hierarchical Rings with Deflection Routing", SBAC-PAD 2014}}
 +  * {{p272-leiserson.pdf|C.E. Leiserson, Z.S. Abuhamdeh, D.C. Douglas, C.R. Feynman, M.N. Ganmukhi, J.V. Hill, D. Hillis, B.C. Kuszmaul, M.A. St. Pierre, D.S. Wells, M.C. Wong, S.-W. Yang, R. Zak, "The Network Architecture of the Connection Machine CM-5", SPAA 1992}}
 +  * {{seitz_cacm_1985.pdf|C. L. Seitz, "The Cosmic Cube", CACM 1985}}
 +  * {{L8-TurnModel-ISCA92.pdf|C. J. Glass and L. M. Ni, "The Turn Model for Adaptive Routing", ISCA 1992}}
 +  * {{maze-routing_nocs15.pdf|M. Fattah, A. Airola, R. Ausavarungnirun, N. Mirzaei, P. Liljeberg, J. Plosila, S. Mohammadi, T. Pahikkala, O. Mutlu, and H. Tenhunen, "A Low-Overhead, Fully-Distributed, Guaranteed-Delivery Routing Algorithm for Faulty Network-on-Chips", NOCS 2015}}
 +  * {{Baran64.pdf|P. Baran, "On Distributed Communications Networks", IEEE Trans. Comm., 1964}}
 +  * {{bufferless_springer14.pdf|C. Fallin, G. Nazario, X. Yu, K. Chang, R. Ausavarungnirun, and O. Mutlu, "Bufferless and Minimally-Buffered Deflection Routing", Routing Algorithms in Networks-on-Chip (invited book chapter), 2014}}
 +  * {{virtual+channel.pdf|W. J. Dally, "Virtual Channel Flow Control", ISCA 1990}}
 +
 +
 +
 +===== Lecture 24 (13.12 Thu.) =====
 +=== Described in detail during lecture 24: ===
 +  * {{05749724.pdf|C. Fallin, C. Craik, and O. Mutlu, "CHIPPER: A Low-Complexity Bufferless Deflection Router", HPCA 2011}}
 +  * {{bufferless_springer14.pdf|C. Fallin, G. Nazario, X. Yu, K. Chang, R. Ausavarungnirun, and O. Mutlu, "Bufferless and Minimally-Buffered Deflection Routing", Routing Algorithms in Networks-on-Chip (invited book chapter), 2014}}
 +  * {{06209256.pdf|C. Fallin, G. Nazario, X. Yu, K. Chang, R. Ausavarungnirun, and O. Mutlu, "MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect", NOCS 2012}}
 +
 +=== Suggested (lecture 24): ===
 +  * {{app-aware-noc_micro09.pdf|R. Das, O. Mutlu, T. Moscibroda, and C. R. Das, "Application-Aware Prioritization Mechanisms for On-Chip Networks", MICRO 2009}}
 +  * {{https://people.inf.ethz.ch/omutlu/pub/hetero-adaptive-source-throttling_sbacpad12.pdf|K. Chang, R. Ausavarungnirun, C. Fallin, and O. Mutlu, "HAT: Heterogeneous Adaptive Throttling for On-Chip Networks," SBAC-PAD, 2012}}
 +  * {{bless_isca09.pdf|T. Moscibroda and O. Mutlu, "A Case for Bufferless Routing in On-Chip Networks", ISCA 2009}}
 +  * {{06970669.pdf|R. Ausavarungnirun, C. Fallin, X. Yu, K. Chang, G. Nazario, R. Das, G. H. Loh, and O. Mutlu, "Design and Evaluation of Hierarchical Rings with Deflection Routing", SBAC-PAD 2014}}
 +  * {{1-s2.0-s0167819116000399-main.pdf|R. Ausavarungnirun, C. Fallin, X. Yu, K. Chang, G. Nazario, R. Das, G. H. Loh, and O. Mutlu, "A Case for Hierarchical Rings with Deflection Routing: An Energy-Efficient On-Chip Communication Substrate", PARCO 2016}}
 +  * {{p106-das.pdf|R. Das, O. Mutlu, T. Moscibroda, and C.R. Das, "Aergia: Exploiting Packet Latency Slack in On-Chip Networks", ISCA 2010}}
 +  * {{https://people.inf.ethz.ch/omutlu/pub/pvc-qos_micro09.pdf|B. Grot, S.W. Keckler, O. Mutlu, "Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QoS Scheme for Networks-on-Chip", MICRO 2009}}
 +  * {{p401-grot.pdf|B. Grot, J. Hestness, S. W. Keckler, and O. Mutlu, "Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees", ISCA 2011}}
 +  * {{https://people.inf.ethz.ch/omutlu/pub/onchip-network-congestion-scalability_sigcomm2012.pdf|G. Nychis, C. Fallin, T. Moscibroda, O. Mutlu, and S. Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects," SIGCOMM, 2012}}
 +  * {{http://users.ece.cmu.edu/~omutlu/pub/noc-congestion_hotnets10.pdf|G. Nychis, C. Fallin, T. Moscibroda, O. Mutlu, "Next Generation On-chip Networks: What Kind of Congestion Control Do We Need?" HotNets 2010}}
readings.txt · Last modified: 2019/12/12 09:02 by 127.0.0.1