The result of Reynolds v US (1879) ultimately showed that the government can never restrict actions of individuals based on sincerely held religious beliefs.
Which of the Civil War Amendments prohibited slavery?
What does the Rule of Four refer to in the context of the U.S. Supreme Court?
The Supreme Court’s power of judicial review ______ and was established by the case ____________________.
Why is economics fun?
Problem 1 – Basic caches (9 pts) [10 mins]

1.1 Basic Cache & TLB (6 points)
Consider a 32-bit virtual address space and 40-bit physical address space with 4-KB pages. Assume a 64-KB virtually-indexed physically-tagged data cache with 64-byte lines and 2-way associativity.
What is the total number of tag bits stored in the tag array of the cache? You may write an expression without performing the arithmetic.
Assume that there are 192 entries in the TLB, which is 6-way associative. How many index bits does the TLB use?

1.2 Write Policy (3 points)
What is the key advantage of write-back caches over write-through caches?

Problem 2 – Advanced caches (21 pts)

2.1 Lockup-free caches (4 points)
Assume a 2-stage pipelined cache with miss rate m, hit time h, miss penalty l, and k MSHR entries. Write an expression for the maximum access rate in (a) a lock-up cache and (b) a lock-up free cache.
(a) Maximum access rate for lock-up cache:
(b) Maximum access rate for lock-up free cache:

2.2 Interleaved main memory (3 points)
Assume that the main memory is cache-block-interleaved (i.e., consecutive blocks map to consecutive banks in a round-robin manner) using 4 banks and 64-byte blocks. If a program accesses word (4-byte) addresses in a stride of 128 bytes (i.e., A, A+128, A+256, ...), what fraction of the total main memory bandwidth is used by the program?

2.3 Cache Bandwidth (4 points)
Assume that the processor can make 2 requests per cycle to a two-banked cache with a latency of 2 cycles and that the accesses may bank-conflict with a probability of 50%. Conflicting accesses are serviced sequentially while further accesses are stalled (gross simplification). Assume all the conflicting accesses go to one cache block. What is the number of cache requests per cycle (a) without and (b) with coalescing line buffers?
(a)
(b)

2.4 Victim Caches (3 points)
What is the main reason due to which (1) a small victim cache achieves (2) high speedups even for a large cache?
Just stating locality will not receive any points.

2.5 Multi-level caches (3 points)
What is the purpose of the L3 cache?

2.6 Software technique for cache (4 points)
Using software techniques, improve the cache performance for this code assuming 32-byte blocks.

    for (i = 0; i < N; i++) {   // N is large
        x += f(A[i]);           // each A[i] is an 8-byte floating-point double
    }
    // f() takes about 25 cycles and memory latency is 300 cycles

Write the modified code below.

Problem 3 - Virtual Memory & I/O (19 points)

3.1 Page Table (9 points)
Consider a virtual memory system that uses 16-bit addresses and 256-byte pages. The following virtual pages are mapped into main memory, in the order shown below from left to right. When there is a collision, the OS uses the next available slot, starting from the colliding slot going towards the bottom of the page table and wrapping around to the top if needed.

    virtual page number    0x00  0x01  0x10  0x11  0x21  0x22
    physical page number   0x01  0x07  0x03  0x02  0x06  0x04

(6 points) An inverted page table with 16 entries is used (you should know what the fields mean). The hash function is XOR of the two uppermost nibbles (= 4 bits) of the address (a XOR b = a.~b + ~a.b). Fill the following table:

(3 points) Now use a two-level physical page table. Use the first nibble (4 bits) of the address for the first level and the second nibble for the second level. How many second-level tables are required to map the pages given above?

3.2 Synonyms (4 points)
Consider a system with 32-KB, 2-way set-associative L1 data cache and 4-KB pages (block size is < 4 KB). The cache designer tells you that searching more than 2 sets in parallel would make the cache slow. What minimum restriction should the OS impose to have no synonym problems while allowing cache and TLB access to occur in parallel?

3.3 Virtual-physical hierarchy (3 points)
What is the key invariant satisfied by the L1 for synonyms in a virtual L1 and physical L2 with pointers to L1?
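The hashing and collision rule described in 3.1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the exam's reference solution: it assumes the "two uppermost nibbles of the address" are the two nibbles of the 8-bit virtual page number, and that each table entry simply stores a (VPN, PPN) pair. The names `hash_vpn`, `insert`, `build_table`, and `MAPPINGS` are illustrative and do not come from the exam.

```python
# Sketch of the hashed 16-entry page table from 3.1.
# Assumption: an entry holds a (vpn, ppn) pair; real designs store more fields.
NUM_SLOTS = 16

# Pages from 3.1, in the order they are mapped (left to right).
MAPPINGS = [
    (0x00, 0x01), (0x01, 0x07), (0x10, 0x03),
    (0x11, 0x02), (0x21, 0x06), (0x22, 0x04),
]


def hash_vpn(vpn):
    """XOR the upper and lower nibble of an 8-bit virtual page number."""
    return ((vpn >> 4) & 0xF) ^ (vpn & 0xF)


def insert(table, vpn, ppn):
    """Store (vpn, ppn) at the hashed slot, or at the next free slot below it.

    On a collision, probe towards the bottom of the table and wrap to the top.
    """
    start = hash_vpn(vpn)
    for step in range(NUM_SLOTS):
        slot = (start + step) % NUM_SLOTS
        if table[slot] is None:
            table[slot] = (vpn, ppn)
            return slot
    raise RuntimeError("page table is full")


def build_table(mappings):
    """Insert all mappings in order and return the resulting table."""
    table = [None] * NUM_SLOTS
    for vpn, ppn in mappings:
        insert(table, vpn, ppn)
    return table


if __name__ == "__main__":
    for slot, entry in enumerate(build_table(MAPPINGS)):
        if entry is not None:
            vpn, ppn = entry
            print(f"slot {slot:2d}: VPN 0x{vpn:02x} -> PPN 0x{ppn:02x}")
```

Because insertion order decides which page wins a contested slot, the same set of pages mapped in a different order would generally produce a different table.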
3.4 RAID (3 points)
How is the performance problem with parity in RAID level 4 solved in RAID level 5?

Problem 4 - Multicores (22 pts)

4.1 Coherence (11 points)
In class, we discussed a single-writer-multiple-reader MSI protocol. Here, design a single-writer-single-reader MSI protocol. In addition to BusRd, BusRdX, and BusWB, assume two new bus transactions:
BusInv, which is a request to invalidate other caches
BusXfer, which is a response to BusRd or BusRdX making a cache-to-cache transfer without updating memory.
Draw the new MSI protocol below. For each state, show all the relevant transitions. Keep your sketch as clean as possible. Don’t make a mess. Don’t forget to use BusInv and BusXfer wherever needed. Think about how the single-reader constraint changes the protocol. It may help to proceed from I to S to M, considering processor-side requests first and then bus-side requests. Your protocol must use all three states.

4.2 Synchronization (8 points)
Write the assembly code to implement an atomic compare&swap (CAS) using load-linked (ll r, x) and store-conditional (sc r, x).
Atomic compare&swap(r4,r5,x) works as follows. Typically r4 is 0 (for unlocked) and r5 is 1 (for locked).

    If (r4 == MEM) then {   // all atomic
        r5 = MEM
        MEM = r5
    }

FYI: atomic CAS is implemented in many systems (e.g., IBM POWER and Intel x86).
Complete the code below.

    CAS:                        // Atomic CAS (r4,r5,x)
            bne r4, r2, exit    // r2 has x’s value
            mov r1, r5          // copy incoming r5 to r1 to prevent r5 from being overwritten
    exit:   jr $31              // return with r2 as the return value

In terms of performance, what is the key advantage of compare&swap over the test&set implementation discussed in class?

4.3 Consistency (3 points)
A = B = 0 (initially)

    Thread 1    Thread 2
    A = 1       B = 1
    print B     print A

What pairs of values printed by threads 1 & 2 would be invalid in sequential consistency? List all such invalid pairs. Hint: Determine the ordering implied by the printed values.
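One way to reason about 4.3 is to enumerate every interleaving that sequential consistency permits, that is, every global order of the four operations that preserves each thread's program order, and collect the pairs that can be printed. The Python sketch below does this by brute force. It is an illustration under the assumption that each write and each print is a single atomic step; the names `THREADS`, `run`, `sc_outcomes`, and `non_sc_outcomes` are hypothetical and not part of the exam.

```python
from itertools import permutations

# Program order of each thread in 4.3: (operation, variable).
THREADS = {
    1: [("write", "A"), ("print", "B")],
    2: [("write", "B"), ("print", "A")],
}


def run(order):
    """Execute one interleaving.

    `order` is a sequence of thread ids; each id says which thread performs
    its next operation. Returns (value printed by T1, value printed by T2).
    """
    mem = {"A": 0, "B": 0}      # A = B = 0 initially
    pc = {1: 0, 2: 0}           # next operation index per thread
    printed = {}
    for tid in order:
        op, var = THREADS[tid][pc[tid]]
        pc[tid] += 1
        if op == "write":
            mem[var] = 1
        else:
            printed[tid] = mem[var]
    return (printed[1], printed[2])


def sc_outcomes():
    """All printed pairs reachable by some sequentially consistent order."""
    # Each thread id appears twice, so every distinct arrangement keeps
    # program order within a thread automatically.
    return {run(order) for order in set(permutations([1, 1, 2, 2]))}


def non_sc_outcomes():
    """Printed pairs that no sequentially consistent order can produce."""
    all_pairs = {(b, a) for b in (0, 1) for a in (0, 1)}
    return all_pairs - sc_outcomes()


if __name__ == "__main__":
    print("possible under SC:  ", sorted(sc_outcomes()))
    print("impossible under SC:", sorted(non_sc_outcomes()))
```

Brute-force enumeration only scales to tiny litmus tests like this one, but it makes the hint concrete: a printed pair is valid exactly when some total order of the operations explains it.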
Problem 5 - Open Question (29 pts)

Assume a technology where power dissipation is lower than that of CMOS but on-chip communication is even slower than off-chip misses in terms of latency and bandwidth; otherwise, the technology resembles CMOS in providing a vast number of small and fast transistors that can be clocked much faster than memory (e.g., GHz versus 400-cycle memory latency).

Further, an important workload processes multiple, concurrent queries, where each query’s processing involves multiple steps each of which reads and updates significant parts of a different multi-megabyte, irregular data structure (e.g., a few 10s MB, well beyond on-chip capacity across all the steps) with strong temporal and spatial locality. Each step produces a compact output (e.g., a few 100s KB) used by the next step which uses nothing else from the previous step (keep in mind that each step does update its own data structure). Also, each step’s processing code is ultra-compact (tens of KB).

We wish to match this software workload with the on-chip communication-limited technology.

5.1
(a) (3 points) What is an obvious scheme (hardware and software combination) to process multiple, concurrent queries involving irregular accesses? Be specific about (a) the hardware architecture, (b) key architecture feature, and (c) how the workload is mapped to the hardware (e.g., (a) GPU (b) SIMT where (c) different SMs run different matrix multiplications).
(b) (3 points) What is the key difficulty with the obvious scheme?
(c) (2 points) What well-known technique is a poor fit for this technology-workload combination?

5.2
(a) (3 points) What is a better scheme (hardware and software combination) than the obvious one? Like 5.1(a), state (i) the architecture, (ii) key feature, and (iii) workload mapping. Keep in mind that a mistake here means the rest of your answers would be wrong as well.
(b) (2 points) How does your scheme avoid the difficulty faced by the obvious scheme?
(c) (3 points) What is a simple way for your scheme to handle more steps than cores?
(d) (6 points) What two distinct performance issues are faced by your scheme? How would you optimize your scheme to alleviate these issues? Think of simple optimizations.
Issues:
Optimizations:

5.3
(a) (3 points) What would be your processor architecture to process multiple, concurrent queries in the presence of inevitable off-chip memory accesses?
(b) (2 points) What key implication/feature of your processor architecture continues to avoid the obvious scheme’s difficulty?
(c) (2 points) How would you combine your answers to 5.2(c) and 5.3(a)?

No discussions about the exam on Piazza or anywhere else until the online section finishes on Saturday 11/20 11:59 pm. Not a peep.

Congratulations, you are almost done with this exam. DO NOT end the Honorlock session until you have submitted your work to Brightspace.

When you have answered all questions:
Use your smartphone to scan your answer sheet and save the scan as a PDF. Make sure your scan is clear and legible.
Submit your work as follows:
Email your PDF to yourself or save it to the cloud (Google Drive, etc.).
Click this link to submit your work: Final Exam
Return to this window and click the button below to agree to the honor statement.
Click Submit Quiz to end the exam.
End the Honorlock session.
In 2006, what was the recent U.S. market growth rate for Italian sausage?
Which candidate positioning received the most first-place votes in the concept test?
Which of the markets in which Saxonville had product offerings was growing the fastest?
What was B&D’s 1990 market share in the Professional-Tradesman segment?