Paper Note of gScale
Title: gScale: Scaling up GPU Virtualization with Dynamic Sharing of Graphics Memory Space
Topic: a state-of-the-art solution for improving the scalability of GPU virtualization
Idea: analyze the bottleneck of gVirt, namely that fixed-size resource partitioning limits the number of vGPU instances. Building on gVirt, gScale gives each vGPU a private shadow graphics translation table (GTT) so that vGPUs can share the high physical graphics memory, synchronizing only when an instance is rendering and using on-demand copying to reduce cost. It shares the low graphics memory by using the Extended Page Table (EPT) to map guest physical addresses to host physical addresses and linking them directly, so the CPU can access host memory without going through the global graphics memory. It reserves a part of the low graphics memory under dynamic management to serve fence registers, and divides the high graphics memory into slots that vGPUs can share, avoiding context switches for idle vGPUs in order to handle many vGPUs.
Contribution: an open-source GPU virtualization solution that reaches the state-of-the-art standard
Intellectual merit: a private shadow GTT to share the high graphics memory, a mechanism named ladder mapping to share the low graphics memory, a fence memory space pool to solve the invalid fence register problem caused by ladder mapping, and slot sharing to reduce cost when many vGPUs are launched
Strengths: clear logic; the paper analyzes the bottleneck of gVirt and presents its improvements step by step
Weakness: the ladder mapping idea is not clearly illustrated with an example (a toy sketch follows below)
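Since the missing example was the sticking point for me, here is a toy Python sketch of how I understand ladder mapping; the dictionaries standing in for the EPT and GTT are my own simplification, not gScale's actual data structures.

```python
# Toy model of ladder mapping (my simplification, not gScale's code).
# Normally a CPU access walks: guest PA --EPT--> global graphics memory
# address --GTT--> host PA. Ladder mapping performs the GTT translation
# once and rewrites the EPT entry so the guest physical page points at
# the host page directly, bypassing the global graphics memory space.

def ladder_map(ept, gtt, guest_pa):
    ggm_addr = ept[guest_pa]   # step 1: guest PA -> global graphics memory address
    host_pa = gtt[ggm_addr]    # step 2: graphics memory address -> host PA via GTT
    ept[guest_pa] = host_pa    # step 3: shortcut the EPT; CPU access now skips the GGM
    return host_pa

# Tiny worked example with made-up page numbers:
ept = {0x1000: 0x8000}         # guest page 0x1000 mapped into graphics memory
gtt = {0x8000: 0x42000}        # that graphics memory page is backed by host page 0x42000
ladder_map(ept, gtt, 0x1000)
assert ept[0x1000] == 0x42000  # the CPU now reaches host memory directly
```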
Paper Note of EvaluationDLonHPC
Title: Evaluation of Deep Learning Frameworks over Different HPC Architectures
Topic: the training-time performance of three DL frameworks on GPUs and CPUs, with and without the corresponding optimization technologies (NVLink, KNL)
Idea: train different deep neural networks under different combinations of GPU/CPU with and without optimization technologies, different DL frameworks, different training batch sizes, and scale-up/scale-out settings, controlling factors to evaluate the corresponding performance
Contribution: a set of training-time performance benchmarks across different framework-architecture settings
Intellectual merit: novel in its evaluation of NVLink & KNL, and of Caffe, TensorFlow, and SINGA
Strengths: thorough experiments; for example, TensorFlow is run in both scaling situations as a middle reference so that the performance of Caffe and SINGA can be compared indirectly
Weakness: hard to reproduce, given the difficulty of compiling the frameworks and implementing the DNNs
Paper Note of DeftNN
Title: DeftNN: Addressing Bottlenecks for DNN Execution on GPUs via Synapse Vector Elimination and Near-compute Data Fission
Topic: accelerating DNN applications on GPUs via synapse pruning and a data precision/storage trade-off
Idea: compute the correlation of DNN synapses to prune low-contribution ones and calibrate the remaining weights, based on the assumption that neuron parameters follow the same distribution; analyze the GPU memory bandwidth bottleneck and propose data fission methods to address it (see the sketch after this note)
Contribution: DeftNN, a framework that speeds up DL applications on GPUs; synapse vector elimination, a method that prunes synapses without leaving an inefficient sparse representation; and an improved data fission technique
Intellectual merit: avoids sparsity via whole-vector pruning rather than simple weight pruning; analyzes and addresses the GPU hardware bandwidth bottleneck
Strengths: detailed experimental records
Weakness: the six DNN applications are not well known, and no introduction to the related architectures is given
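To make the "dense pruning" point concrete, here is a minimal NumPy sketch of the synapse-vector-elimination idea as I understand it; the correlation threshold and the helper name are my own, and the calibration step the paper describes is omitted.

```python
import numpy as np

def eliminate_synapse_vectors(W, X, threshold=0.95):
    """Drop whole input dimensions whose activations are nearly duplicates
    of an already-kept dimension, then shrink the weight matrix.

    W: (out_dim, in_dim) weight matrix; X: (samples, in_dim) activations.
    Returns a smaller dense matrix plus the kept column indices, instead
    of a same-sized sparse matrix full of zeros.
    """
    corr = np.corrcoef(X, rowvar=False)  # correlation between input dimensions
    keep = []
    for j in range(W.shape[1]):
        if all(abs(corr[j, k]) < threshold for k in keep):
            keep.append(j)               # j adds information: keep its synapses
    return W[:, keep], keep

# Usage: prune once offline, then run the smaller dense layer.
W = np.random.randn(4, 6)
X = np.random.randn(100, 6)
X[:, 5] = X[:, 0]                        # make one dimension redundant
W2, keep = eliminate_synapse_vectors(W, X)
y = W2 @ np.random.randn(6)[keep]        # forward pass on the pruned layer
```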
Paper Note of DeepMon
Title: DeepMon: Mobile GPU-based Deep Learning Framework for Continuous Vision Applications
Topic: Reduce deep learning applications’ latency on mobile devices.
Idea: Utilize the mobile device's GPU, convert the deep learning model, store metadata in host memory and parameters in device memory, cache the first N conv layers' outputs and refresh them based on a histogram-similarity policy (sketched after this note), decompose the conv parameters from one tensor into three small tensors, optimize conv operations via unfolding & half floats, and provide multiple kernels to fit different mobile GPUs.
Contribution: A toolkit, DeepMon, that runs deep learning applications on commodity mobile devices with low latency, implemented with OpenCL and Vulkan
Intellectual merit: Proposed optimization methods that reduce latency without significant accuracy loss and save energy, without offloading to more powerful servers in a client-server mode.
Strengths: Detailed observations & experiments at every step. For example, when choosing the number of bins in the caching step, they ran many experiments and plotted how the performance changes.
Weakness: Not enough guidance for DL developers who want to build mobile apps; only training/testing of VGG and YOLO on different datasets is covered
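Here is a small Python sketch of the caching policy described above, reduced to whole-frame histograms; DeepMon compares histograms at a finer granularity, and the class name and threshold here are hypothetical.

```python
import numpy as np

def norm_hist(frame, bins=32):
    """Normalized intensity histogram of a frame (pixel values in 0..255)."""
    h, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return h / h.sum()

class ConvCache:
    """Reuse the first N conv layers' output while frames stay similar."""

    def __init__(self, run_first_layers, similarity=0.9):
        self.run_first_layers = run_first_layers  # computes the first N conv layers
        self.similarity = similarity              # histogram-intersection threshold
        self.hist = None
        self.output = None

    def first_layers(self, frame):
        h = norm_hist(frame)
        if self.hist is not None:
            # Histogram intersection is 1.0 for identical distributions.
            if np.minimum(self.hist, h).sum() >= self.similarity:
                return self.output                # similar frame: reuse the cache
        self.hist, self.output = h, self.run_first_layers(frame)
        return self.output

# Usage with a stand-in for the real conv layers:
cache = ConvCache(run_first_layers=lambda f: f.mean())
frame = np.random.randint(0, 256, (224, 224))
out1 = cache.first_layers(frame)
out2 = cache.first_layers(frame)  # identical frame: served from the cache
```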
Binary Search Framework
Use Binary Search to find the first/last element with a property in a sorted array.
l: the first position under consideration
h: the last position under consideration; it can even be larger than the last index of the array
The idea is that if the midpoint satisfies the property, pull the opposite bound in to mid (mid may be the answer); otherwise, move the corresponding bound one step past mid, as sketched below.
Finally, the loop always ends with l == h.
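A minimal Python sketch of the "first element" case (the function and predicate names are mine): `pred` must be monotone over the range, i.e. false for some prefix and true afterwards.

```python
def first_true(l, h, pred):
    """Return the first position in [l, h) where pred holds, or h if none.

    h may be one past the last array index, serving as a 'not found' value.
    """
    while l < h:
        mid = (l + h) // 2
        if pred(mid):
            h = mid        # mid satisfies the property: pull the opposite bound to mid
        else:
            l = mid + 1    # mid fails: move the corresponding bound past mid
    return l               # the loop ends with l == h

# Example: first index whose value is >= 5 in a sorted array.
a = [1, 3, 5, 7, 9]
assert first_true(0, len(a), lambda i: a[i] >= 5) == 2
```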
Install Kaldi on OS X
I installed Kaldi on my own Mac from scratch and encountered some problems.
Here is some guidance for anyone who may have trouble installing Kaldi.
- `git clone https://github.com/kaldi-asr/kaldi.git kaldi-trunk --origin golden`
- `./kaldi-trunk/tools/extras/check_dependencies.sh`
- Install all the dependencies. Notice that `libtoolize` is named `libtool` within `brew`.
- `cd kaldi-trunk/tools; make`
- `cd ../src; ./configure; make`