BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:Europe/Stockholm
X-LIC-LOCATION:Europe/Stockholm
BEGIN:DAYLIGHT
TZOFFSETFROM:+0100
TZOFFSETTO:+0200
TZNAME:CEST
DTSTART:19700329T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=-1SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:+0200
TZOFFSETTO:+0100
TZNAME:CET
DTSTART:19701025T030000
RRULE:FREQ=YEARLY;BYMONTH=10;BYDAY=-1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20191206T142453Z
LOCATION:Substanz 1\, 2
DTSTART;TZID=Europe/Stockholm:20180626T083000
DTEND;TZID=Europe/Stockholm:20180626T100000
UID:isc_hpc_ISC High Performance 2018_sess113_post121@linklings.com
SUMMARY:(RP08) Using GPU's FP16 Tensor Cores Arithmetic to Accelerate Mixe
d-Precision Iterative Refinement Solvers and Reduce Energy Consumption
DESCRIPTION:Research Poster\n\n(RP08) Using GPU's FP16 Tensor Cores Arith
 metic to Accelerate Mixed-Precision Iterative Refinement Solvers and Re
 duce Energy Consumption\n\nHaidar\, Tomov\, Abdelfattah\, Zounon\, Dong
 arra\n\nThe use of low-precision arithmetic in mixed-precision computin
 g methods has been a powerful tool to accelerate numerous scientific co
 mputing applications. Artificial intelligence (AI) in particular has pu
 shed this to current extremes\, making use of half-precision floating-p
 oint arithmetic (FP16) in approaches based on neural networks. We prese
 nt an investigation showing that other HPC applications can harness thi
 s power too\, in particular the general HPC problem of solving Ax = b\, wh
 ere A is a large dense matrix and the solution is needed in FP64 accura
 cy. Our approach is based on the mixed-precision (FP16->FP64) iterative re
 finement technique: we generalize and extend prior advances into a fram
 ework for which we develop architecture-specific algorithms and highly t
 uned implementations\, showing that the use of FP16-TC (Tensor Cores) a
 rithmetic can provide up to a 4X speedup and reduce energy consumption b
 y a factor of 5. This is due to the performance boost that FP16-TC prov
 ides and to its better accuracy\, which outperforms classical FP16 beca
 use the GEMM accumulation occurs in FP32 arithmetic. In addition\, we h
 ighlight\, for the first time\, that a V100 GPU is able to deliver 74 G
 flops/Watt. Our results can be reproduced\, as the developments will be m
 ade available through the MAGMA library.\n\nPasses: Conference Pass\, E
 nergy consumption\, Heterogeneous Systems\, Parallel Algorithms\n\nTag: C
 onference Pass\, Energy consumption\, Heterogeneous Systems\, Parallel A
 lgorithms
URL:https://2018.isc-program.com/presentation/?id=post121&sess=sess113
END:VEVENT
END:VCALENDAR