

Bart Iver van Blokland

## Parallel processors are everywhere...











| Product Name                           | Launch Date | Total Cores | Processor Base Frequency    | Cache                        | TDP     |
|----------------------------------------|-------------|-------------|-----------------------------|------------------------------|---------|
| Intel® Quark™ Microcontroller D2000    | Q3'15       | 1           | 32 MHz                      | 0 KB                         |         |
| Intel® Quark™ SE C1000 Microcontroller | Q4'15       | 1           | 32 MHz                      | 8 KB                         |         |
| Intel® Quark™ Microcontroller D1000    | Q3'15       | 1           | 33 MHz                      | 0 KB                         | 0.025 W |
| Intel® Quark™ SoC X1001                | Q2'14       | 1           | 400 MHz                     | 16 KB                        | 2.2 W   |
| Intel® Quark™ SoC X1011                | Q2'14       | 1           | 400 MHz                     | 16 KB                        | 2.2 W   |
| Intel® Quark™ SoC X1020                | Q2'14       | 1           | 400 MHz                     | 16 KB                        | 2.2 W   |
| Intel® Quark™ SoC X1021                | Q2'14       | 1           | 400 MHz                     | 16 KB                        | 2.2 W   |
| Intel® Quark™ SoC X1021D               | Q2'14       | 1           | 400 MHz                     | 16 KB                        | 2.2 W   |
| Intel® Quark™ SoC X1010                | Q1'14       | 1           | 400 MHz                     | 16 KB                        | 2.2 W   |
| Intel® Quark™ SoC X1020D               | Q1'14       | 1           | 400 MHz                     | 16 KB                        | 2.2 W   |
| Intel Atom® Processor E3815            | Q4'13       | 1           | 1.46 GHz                    | 512 KB L2 Cache              | 5 W     |
| Intel® Quark™ SoC X1000                | Q4'13       | 1           | 400 MHz                     | 16 KB                        | 2.2 W   |
| Intel® Celeron® Processor G470         | Q2'13       | 1           | 2.00 GHz                    | 1.5 MB Intel® Smart Cache    | 35 W    |
| Intel® Celeron® Processor 927UE        | Q1'13       | 1           | 1.50 GHz                    | 1 MB Intel® Smart Cache      | 17 W    |

































- Multicore processors are ubiquitous
- Needed to fully utilise all cores of a processor
- Only way for chip manufacturers to improve performance
- Some processors are practically useless without it
- Some problems are too large to fit on one machine
- Can even be useful on single-core machines
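The last point can be illustrated with a small sketch. Even on one core, threads let a program overlap the time it spends waiting. The helper `fake_io` and its 0.2 s delay are assumptions made for illustration; `time.sleep` stands in for a blocking disk or network call.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(task_id, delay=0.2):
    """Hypothetical stand-in for a blocking I/O call (disk, network, ...)."""
    time.sleep(delay)  # the thread is idle here, so another thread can run
    return task_id

def run_serial(n):
    """Perform n waits one after another and time the whole run."""
    start = time.perf_counter()
    results = [fake_io(i) for i in range(n)]
    return results, time.perf_counter() - start

def run_threaded(n):
    """Perform the same n waits, each in its own thread."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(fake_io, range(n)))
    return results, time.perf_counter() - start

serial_results, serial_time = run_serial(4)
threaded_results, threaded_time = run_threaded(4)
print(f"serial: {serial_time:.2f} s, threaded: {threaded_time:.2f} s")
```

The waits overlap in the threaded version, so it should finish in roughly the time of a single wait. No extra core is needed for this, because a sleeping thread does not occupy the processor.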















#### Intel Pentium 4 Northwood

[Figure: annotated die shot of the Intel Pentium 4 Northwood, labelling the trace cache, branch prediction tables, uOp schedulers, integer execution core, 8 kByte L1 data cache and two 256 kByte L2 cache blocks. Source: www.chip-architect.com, April 19, 2003.]

- Improving single-core performance linearly requires an exponential number of transistors
- At some point it becomes worth it to spend those transistors on multiple independent cores instead










Power consumption (load): 320 W

Theoretical performance: 29,770,000,000 FLOPS

If you did one calculation per second, the same number of calculations would take:

# 944 years
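The figure follows from dividing the quoted operation count by the number of seconds in a year. A minimal check, assuming a 365-day year (the variable names are illustrative):

```python
# Operation count taken from the theoretical performance figure quoted above.
OPERATIONS = 29_770_000_000

# Assumption: a 365-day year, ignoring leap years.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# At one calculation per second, the time needed in years.
years = OPERATIONS / SECONDS_PER_YEAR
print(f"{years:.0f} years")
```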



#### A more realistic workload...


















































# At a resolution of 2560x1440 pixels, the GPU does all of that 200 times per second.

(and even gets to take short breaks)
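To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes every pixel is produced exactly once per frame, which real renderers need not do, and the variable names are illustrative.

```python
WIDTH, HEIGHT = 2560, 1440      # resolution quoted above
FRAMES_PER_SECOND = 200         # frame rate quoted above

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FRAMES_PER_SECOND
frame_budget_ms = 1000 / FRAMES_PER_SECOND   # time available for one frame

print(f"{pixels_per_frame:,} pixels per frame")
print(f"{pixels_per_second:,} pixels per second")
print(f"{frame_budget_ms:.1f} ms per frame")
```

Under these assumptions each frame has to be finished within 5 ms.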













## But why an entire course?

- Communication overhead
- Race conditions
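A race condition can be sketched in a few lines. The first function replays, by hand and deterministically, one unlucky interleaving of two threads that both execute `counter += 1`. The second part shows one common remedy, a lock around the read-modify-write step. All names here (`lost_update_by_hand`, `SafeCounter`, `count_with_threads`) are hypothetical and chosen for illustration only.

```python
import threading

def lost_update_by_hand():
    """Replay one unlucky interleaving of two 'counter += 1' operations."""
    counter = 0
    seen_by_a = counter      # thread A reads 0
    seen_by_b = counter      # thread B reads 0, before A has written
    counter = seen_by_a + 1  # A writes 1
    counter = seen_by_b + 1  # B also writes 1: A's update is lost
    return counter

class SafeCounter:
    """Counter whose read-modify-write step is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:     # only one thread at a time may run this block
            self.value += 1

def count_with_threads(n_threads=4, n_increments=10_000):
    """Increment one shared SafeCounter from several threads."""
    counter = SafeCounter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

print(lost_update_by_hand())   # 1, although two increments were performed
print(count_with_threads())    # 40000
```

The lock also hints at the first bullet: every acquisition is coordination between threads, so correctness is bought with some overhead.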

## What can happen?

https://en.m.wikipedia.org/wiki/Therac-25

http://spritesmods.com/?art=hddhack&page=2

https://www.briandorey.com/docs/2020-01-15-sony-55xe9005-teardown/pcb-main.jpg

https://www.rcgeeks.co.uk/blogs/news/dji-mavic-mini-teardown-whats-inside

https://www.komplett.no/img/p/800/1219388.jpg

https://www.ebay.com/itm/402730890541

https://www.ifixit.com/Teardown/GoPro+HERO11+Black+Mini+Teardown/155069

https://news.satnews.com/2020/08/04/xiphos-reveals-their-new-space-processor-board-for-sdr-applications/

https://www.hpc.ntnu.no/idun/

https://cdn.benchmark.pl/uploads/backend_img/c/newsy/2020-09/PM/nvidia-rtx-3080_05.jpg

https://pbs.twimg.com/media/ELcxWo0U8AAAsiR.jpg

http://www.chip-architect.org/news/Northwood 130nm die text 1600x1200.jpg

https://www.zeiss.com/spectroscopy/applications-industries/oem-applications/semiconductor.html

https://wp.technologyreview.com/wp-content/uploads/2018/08/googlecbf009-11.jpg

https://tpucdn.com/cpu-specs/images/chips/2817-die-shot.jpg

https://www.techrepublic.com/wp-content/uploads/2011/11/22inteldieshot1997.jpg

http://brainstones.narod.ru/collection/intel/intel pentium d 925 sl9ka wo lid.jpg