Parle: Parallelizing Stochastic Gradient Descent
Deep learning has been applied very successfully in many areas, such as image classification and natural language processing, and this has encouraged ever larger datasets to emerge in recent years. To tackle problems at this scale, distributed and parallel training of deep networks has been studied and shown to reach state-of-the-art performance.
Distributed and parallel training mainly consists of model parallelism and data parallelism. Neither is perfect, and both come with inevitable drawbacks such as communication overhead. Here I’ll introduce Parle, a new algorithm proposed by Chaudhari et al. in 2017.
Parle is motivated by how difficult training in parallel is. The paper trained 6 instances of the All-CNN architecture on CIFAR-10. First it tried averaging their predictions, which performs slightly better than any individual network but comes at the cost of a large test-time performance penalty. Then it took another approach: averaging the weights of the models. This turns out to perform poorly. The authors observed, however, that a model obtained by averaging after aligning the weights performs much better than a naively averaged model. This leads to the idea of forcing the replicas to align with each other during training, so that they eventually converge to a single averaged model that combines the copies.
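To see why alignment can matter, here is a minimal toy sketch in plain Python. It assumes one common explanation, namely that hidden units can be relabelled without changing the function a network computes; the tiny network, the `order` permutation, and all function names are hypothetical and not taken from the paper. Averaging a network with a relabelled copy of itself changes the function, while averaging after undoing the relabelling recovers it exactly.

```python
import math
import random

def forward(net, x):
    """One-hidden-layer network: tanh hidden units, linear scalar output."""
    w_in, w_out = net
    return sum(v * math.tanh(w * x) for w, v in zip(w_in, w_out))

def permute(net, order):
    """Relabel the hidden units; the computed function is unchanged."""
    w_in, w_out = net
    return [w_in[i] for i in order], [w_out[i] for i in order]

def average(net_a, net_b):
    """Average two networks weight by weight."""
    return tuple([(p + q) / 2 for p, q in zip(layer_a, layer_b)]
                 for layer_a, layer_b in zip(net_a, net_b))

rng = random.Random(0)
net = ([rng.gauss(0, 1) for _ in range(4)], [rng.gauss(0, 1) for _ in range(4)])

order = [2, 0, 3, 1]                       # hypothetical relabelling
twin = permute(net, order)                 # same function, different layout
inverse = [order.index(i) for i in range(len(order))]

naive = average(net, twin)                       # average without aligning
aligned = average(net, permute(twin, inverse))   # align first, then average

xs = [i / 10 for i in range(-20, 21)]

def gap(candidate):
    """Largest output difference from the original network on a test grid."""
    return max(abs(forward(candidate, x) - forward(net, x)) for x in xs)

naive_gap = gap(naive)
aligned_gap = gap(aligned)
print(f"naive average gap: {naive_gap:.4f}, aligned average gap: {aligned_gap:.4f}")
```

In a real network the right alignment is not known in advance, which is part of why coupling the replicas during training is attractive.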
Parle exploits the “flat minima” phenomenon, which has been shown to improve the generalization of deep networks, and improves on Elastic-SGD (Zhang et al., 2015). It converges 2–4x faster than a data-parallel implementation of SGD and achieves nearly state-of-the-art Top-1 error on several benchmark datasets, such as CIFAR-10 and CIFAR-100. It also inherits the advantages of Elastic-SGD.
Here x denotes the parameters of a deep network and f(x) is the average loss (cross-entropy) over the entire dataset. Training the network means finding the x that minimizes f(x).
Parle trains multiple copies of the same model in parallel. Denote the copies (replicas) by x^a, with a = 1, ..., n. It applies the Elastic-SGD loss function to couple the copies: each replica variable is tied to a reference variable x. Performing gradient descent on this objective requires communicating with all replicas x^a, a ≤ n, which introduces a large communication overhead. To reduce this overhead, Parle replaces f(x) with a loss function called “local entropy”.
Entropy-SGD (Chaudhari et al., 2016) is an algorithm that looks for flat minima by minimizing the local entropy f_gamma(x) in place of f(x).
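Parle then combines the two ideas: it minimizes the local entropy f_gamma(x^a) of every replica together with an elastic term, controlled by rho, that pulls each x^a toward the reference variable x.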
It can be thought of as running Entropy-SGD to minimize f_gamma while coupling the replicas x^a with Elastic-SGD. It also uses a technique called “scoping”: letting gamma go to infinity and rho go to 0 as training progresses. For a small rho, the replicas are constrained to have a large overlap while minimizing f_gamma. As rho goes to 0, the overlap goes to unity and the replicas eventually converge to one single configuration.
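For reference, the objectives discussed in this section can be written out as follows. This is a reconstruction from the description here rather than a quotation. The notation is an assumption: gamma multiplies the proximity term inside the local entropy (so a larger gamma means a tighter neighbourhood) and rho divides the elastic term (so a smaller rho means a stronger coupling), which matches the scoping description. The original papers may scale these terms differently.

```latex
\begin{align}
\text{SGD:} \quad & \min_{x} \; f(x) \\
\text{Entropy-SGD:} \quad & \min_{x} \; f_\gamma(x), \qquad
  f_\gamma(x) = -\log \int \exp\!\Big( -f(x') - \frac{\gamma}{2}\,\lVert x - x' \rVert^2 \Big)\, dx' \\
\text{Elastic-SGD:} \quad & \min_{x,\, x^1, \dots, x^n} \; \sum_{a=1}^{n}
  \Big( f(x^a) + \frac{1}{2\rho}\,\lVert x^a - x \rVert^2 \Big) \\
\text{Parle:} \quad & \min_{x,\, x^1, \dots, x^n} \; \sum_{a=1}^{n}
  \Big( f_\gamma(x^a) + \frac{1}{2\rho}\,\lVert x^a - x \rVert^2 \Big)
\end{align}
```

Written this way, the only difference between Elastic-SGD and Parle is that f is replaced by f_gamma inside the sum.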
Elastic-SGD has a large communication overhead, while Entropy-SGD does not involve any communication. Parle strikes a balance between the two. It benefits from both synchronous and asynchronous gradient aggregation, can run on all GPUs simultaneously, and requires only infrequent communication with the parameter server.
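The structure described above can be sketched in a few lines of plain Python. This is a toy, noise-free illustration under stated assumptions, not the paper's algorithm: it uses full gradients of a small quadratic loss, the names `prox`, `elastic`, and `growth` are my own stand-ins for the strengths of the proximity and elastic terms and for the scoping schedule, and the averaging constants are arbitrary. What it does show is the communication pattern: each replica runs several local steps on its own, and the replicas only exchange information once per round.

```python
import random

def parle_sketch(grad, init_replicas, rounds=40, inner_steps=10,
                 lr=0.1, prox=1.0, elastic=0.1, growth=1.02):
    """Toy loop: local Entropy-SGD-style steps plus an elastic pull to a reference."""
    replicas = [list(r) for r in init_replicas]
    dim = len(replicas[0])

    def mean(rs):
        return [sum(r[i] for r in rs) / len(rs) for i in range(dim)]

    reference = mean(replicas)
    for _ in range(rounds):
        for a, x_a in enumerate(replicas):
            # Inner loop: purely local, no communication with other replicas.
            y, z = list(x_a), list(x_a)
            for _ in range(inner_steps):
                g = grad(y)
                y = [yi - lr * (gi + prox * (yi - xi))
                     for yi, gi, xi in zip(y, g, x_a)]
                z = [0.75 * zi + 0.25 * yi for zi, yi in zip(z, y)]
            # Outer step: move toward the local average z and toward the reference.
            replicas[a] = [xi - lr * (prox * (xi - zi) + elastic * (xi - ri))
                           for xi, zi, ri in zip(x_a, z, reference)]
        reference = mean(replicas)   # the only communication in each round
        prox *= growth               # "scoping": tighten both couplings over time
        elastic *= growth
    return reference, replicas

# Hypothetical quadratic loss with a known minimum at `target`.
curvature = [0.5, 1.0, 2.0]
target = [1.0, -2.0, 0.5]

def loss(w):
    return 0.5 * sum(h * (wi - ti) ** 2 for h, wi, ti in zip(curvature, w, target))

def grad(w):
    return [h * (wi - ti) for h, wi, ti in zip(curvature, w, target)]

rng = random.Random(0)
start = [[rng.uniform(-3, 3) for _ in target] for _ in range(3)]
reference, replicas = parle_sketch(grad, start)
print("loss at reference:", loss(reference))
```

With `inner_steps=10`, the replicas communicate once for every ten local gradient evaluations, which is the trade-off between Elastic-SGD and Entropy-SGD that the paragraph above describes.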
The paper evaluates Parle, Entropy-SGD, Elastic-SGD, and SGD with Nesterov’s momentum on several benchmark datasets to compare their performance. Parle outperforms the others on MNIST, CIFAR-10, and CIFAR-100.
Parle performs better than the baseline on both CIFAR-10 and CIFAR-100. The comparison between Parle with 3 replicas and with 8 replicas suggests that adding more replicas doesn’t bring much benefit. In addition, all of these runs use 3 GPUs; with more GPUs, each replica could run on its own GPU, which would further reduce the training time.
In terms of training error, Elastic-SGD and SGD converge to near zero, while Parle and Entropy-SGD retain a larger training error and do not overfit as much, because they look for flat minima instead of global minima. A global minimum may sit in a steep, narrow cleft, and a model trained to reach it might not generalize well. A flat minimum lies in a wide valley; it may have a larger training error but generalizes better to unseen data.
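The difference between the two kinds of minima can be illustrated on a toy one-dimensional loss. The sketch below is an assumption-laden illustration, not code from the paper: the loss shape, the grid, and `gamma = 1.0` are made up, and the local entropy is written with gamma multiplying the quadratic term. The plain loss has its global minimum in a narrow well, while the local entropy, which averages over a neighbourhood, prefers the wide valley.

```python
import math

def loss(x):
    """Toy 1-D loss: deep narrow well at x = -2, shallower wide valley at x = 2."""
    sharp = -1.0 * math.exp(-((x + 2.0) ** 2) / (2 * 0.1 ** 2))
    wide = -0.8 * math.exp(-((x - 2.0) ** 2) / (2 * 1.0 ** 2))
    return sharp + wide

STEP = 0.02
GRID = [-8.0 + i * STEP for i in range(801)]       # integration grid on [-8, 8]
BOLTZMANN = [math.exp(-loss(xp)) for xp in GRID]   # exp(-loss), computed once

def local_entropy(x, gamma=1.0):
    """Riemann-sum estimate of -log of the integral of
    exp(-loss(x') - gamma / 2 * (x - x')**2) over x'."""
    total = sum(b * math.exp(-0.5 * gamma * (x - xp) ** 2)
                for b, xp in zip(BOLTZMANN, GRID))
    return -math.log(total * STEP)

candidates = GRID[::5]                             # evaluate every 0.1
sharp_x = min(candidates, key=loss)                # minimizer of the plain loss
flat_x = min(candidates, key=local_entropy)        # minimizer of the local entropy
print(f"argmin of loss: {sharp_x:.2f}, argmin of local entropy: {flat_x:.2f}")
```

A larger `gamma` shrinks the neighbourhood, so the local entropy gradually behaves more like the original loss.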
The replicas in the experiments above have access to the entire dataset. The paper further experiments with the setting where the data is split between replicas. It turns out that the elastic proximal term is strong enough to pull the replicas to a region of the parameter space that works for the entire dataset. This observation could be a very promising direction for future work.
Another interesting property of Parle is that different replicas can have different communication and computational capacities. For example, replicas on GPUs are better suited to running Entropy-SGD, while replicas on CPUs or mobile devices are better suited to running Elastic-SGD steps, since they communicate faster than they compute. This property can make Parle scale further.
UI Layout Optimization via Gradient Descent
User interface design is a very difficult process, and there are many factors to consider to ensure a UI is easy to navigate. Many tools and techniques have been developed to aid this process. In recent years, deep learning approaches have been introduced in this area as well, since they can find complex patterns in large datasets.
This paper, written by Duan et al. in 2020, extends prior work on neural network predictions for UIs and develops an algorithm that takes a UI layout and a task sequence and iteratively adjusts the layout elements via gradient descent.
The authors crowdsourced the completion times and error rates of a task sequence with 248 and 108 layout variations of one single layout. The data is used to train a task performance prediction model that predicts a completion time given a UI and a task sequence.
This prediction model extends the “Deep Menu” model by Li et al. It uses a metric consisting of completion time and error rate for task performance. It also supports UIs with a variety of element types (icons, buttons) and different interaction types (tapping, sliding, ...). The task encoding includes task-specific features such as interaction type, step, and total number of steps.
In addition, it considers features of the people who did the crowdsourced work, such as age and left-handedness. All of these are part of the input features of the predictor model. Because of the LSTM’s capacity to learn and remember information from the input task sequence, the model architecture continues to use an LSTM. There are additional layers that first encode the task and input features into vectors before they are passed on. These modifications are what distinguish the model from “Deep Menu”. The predictor eventually reaches an accuracy of 0.79.
The paper then turns to UI optimization, where the input is the location and size of each layout element and the objective function is the predicted task performance. The objective F(l) takes into account the task performance of all tasks in the sequence as well as different penalties.
As the UI elements are updated independently, they may overlap with each other or move out of bounds. Penalty functions can be used to punish these unwanted behaviors. Designers can also use penalty functions to specify constraints they want to impose on the layouts.
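The idea of descending on a layout objective with penalties can be sketched as follows. Everything in this block is a hypothetical stand-in: the learned predictor is replaced by a simple travel-distance cost, only element positions are optimized (not sizes), gradients are taken by finite differences rather than backpropagation, and the penalty weights are arbitrary.

```python
def travel_cost(centers, task):
    """Stand-in for the learned predictor: squared travel between consecutive targets."""
    return sum((centers[a][0] - centers[b][0]) ** 2 + (centers[a][1] - centers[b][1]) ** 2
               for a, b in zip(task, task[1:]))

def overlap_penalty(centers, sizes):
    """Sum of pairwise overlap areas of the element rectangles."""
    total = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            ox = (sizes[i][0] + sizes[j][0]) / 2 - abs(centers[i][0] - centers[j][0])
            oy = (sizes[i][1] + sizes[j][1]) / 2 - abs(centers[i][1] - centers[j][1])
            total += max(0.0, ox) * max(0.0, oy)
    return total

def bounds_penalty(centers, sizes):
    """Squared distance by which element edges leave the unit-square screen."""
    total = 0.0
    for (cx, cy), (w, h) in zip(centers, sizes):
        for lo, hi in ((cx - w / 2, cx + w / 2), (cy - h / 2, cy + h / 2)):
            total += max(0.0, -lo) ** 2 + max(0.0, hi - 1.0) ** 2
    return total

def objective(flat, sizes, task, w_overlap=5.0, w_bounds=5.0):
    """Predicted cost plus weighted penalties; flat = [x0, y0, x1, y1, ...]."""
    centers = [(flat[k], flat[k + 1]) for k in range(0, len(flat), 2)]
    return (travel_cost(centers, task)
            + w_overlap * overlap_penalty(centers, sizes)
            + w_bounds * bounds_penalty(centers, sizes))

def optimize(flat, sizes, task, lr=0.02, steps=300, eps=1e-5):
    """Plain gradient descent with central finite-difference gradients."""
    flat = list(flat)
    for _ in range(steps):
        grad = []
        for k in range(len(flat)):
            up = flat[:k] + [flat[k] + eps] + flat[k + 1:]
            down = flat[:k] + [flat[k] - eps] + flat[k + 1:]
            grad.append((objective(up, sizes, task) - objective(down, sizes, task)) / (2 * eps))
        flat = [v - lr * g for v, g in zip(flat, grad)]
    return flat

sizes = [(0.2, 0.1)] * 3                  # width, height of three elements
task = [0, 1, 2]                          # tap element 0, then 1, then 2
start = [0.1, 0.1, 0.9, 0.1, 0.5, 0.9]    # initial centers, far apart
result = optimize(start, sizes, task)
print("objective before:", objective(start, sizes, task))
print("objective after: ", objective(result, sizes, task))
```

A designer-specified constraint, such as keeping one element in the top half of the screen, would simply be one more penalty term added inside `objective`.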
To verify that the UI optimization works for unseen UIs, the paper tested it on a recipe UI that the model had not seen. The task performance of the initial and optimized recipe UIs was again crowdsourced, and the data confirms the improvement.
The observed improvement is 8.9% for layout 1 and 2.0% for layout 2. The smaller improvement for layout 2 is expected, since it was already a good layout initially.
Furthermore, the paper suggests that both humans and optimization algorithms can be misleading on their own, so the ideal solution is human-AI collaboration. The optimization model can be helpful in designers’ daily work: a designer can provide an initial layout, let the algorithm optimize it, and then fix remaining issues and add aesthetic touches to the layout. At the same time, the task performance predictor can be used as a good heuristic for UI layout design.
This paper extends “Deep Menu” to a more complex setting and develops a technique to improve mobile layouts via gradient descent. At the time the work was done, no other work applied gradient descent to deep learning models that predict human task performance. Li’s model did use gradient descent, but it was used to study memory effects. Therefore, this model is the first to use deep learning to predict task performance for a general 2D user interface in a large interaction space.
There are still some limitations: the tasks are quite simple, and UI aesthetics are not considered. However, the results suggest a promising future for this area.
PARLE: Parallelizing Stochastic Gradient Descent, Pratik Chaudhari et al., 2017, https://arxiv.org/pdf/1707.00424.pdf
Optimizing User Interface via Gradient Descent, Peitong Duan et al., 2020, https://arxiv.org/pdf/2002.10702.pdf
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys, Pratik Chaudhari et al., 2017, https://arxiv.org/pdf/1611.01838.pdf
Deep learning with Elastic Averaging SGD, Sixin Zhang et al., 2015, https://arxiv.org/abs/1412.6651