CS 544 Optimization methods in vision and learning
Instructor: D.A. Forsyth, 3310 Siebel, daf@uiuc.edu
TA: Shruti Bhargava, shrutib2@illinois.edu
Office hours:
 DAF: 14h00-15h00 F, or see if I'm busy
 Shruti: 14h00-15h00 W
MP's
Notes
 Week 1: Basic continuous optimization (descent directions, coordinate ascent, EM as coordinate ascent, Newton's method, stabilized Newton's method); My notes
 Week 2, Week 3: More basic continuous optimization (trust regions, dogleg method, subspace method); My notes; (Conjugate gradient: Old notes) (Conjugate gradient and Polak-Ribiere: Notes; use these for conjugate gradient as well); (Quasi-Newton methods for big problems, including DFP, BFGS, limited memory methods: My notes)
 Week 4, 5 and 6: Constrained optimization methods
 Lagrangians, quadratic penalty method, augmented Lagrangian method (My notes)
 Linear programs as an example (My notes)
 Inequality constraints in Lagrangians and KKT; relations between primal and dual (My notes)
 Active set methods and box constraints (my notes)
 Logistic regression as a classifier (my notes)
 The SVM as an example (My notes)
 Stochastic gradient descent (my notes)
 Semidefinite programs (My notes; these really do start at page 5!)
 Interior point methods (My notes); merit functions and the Maratos effect (my notes)
 L1 logistic regression, homotopy paths, subgradients and pursuit methods (my notes)
 Quadratic programming and sequential quadratic programming (my notes)
 Initial remarks on Combinatorial optimization (my notes); Flow and cuts (my notes)
 alpha-expansion, alpha-beta swaps; MRF's as cuts (my notes)
 Stochastic Average Gradient (my notes)
 Matchings (my notes); bipartite graph matching as a linear program (my notes)
 Dual decomposition methods (my notes); ADMM (my notes; there really are two page 13's, sorry)
 The graphical models block
 alpha-expansion, alpha-beta swaps; MRF's as cuts (my notes)
 Simple belief propagation (My notes)
 Simple mean field inference (My notes)
 Basic marginal polytope (My notes)
 Variational inference using a tree; the local polytope; exponential models (My notes)
 ADMM for discrete problems (my notes)
 Proximal Algorithms (my notes)
 Gradient Boost (my terse notes; more detailed notes with some stuff on xgboost)
 Striking behavior of gradient descent (My notes)
 The Linear Quadratic Regulator (my notes)
 Markov Decision Processes (my notes)
 Learning from an expert, I (my notes)
 Structure Learning (my notes)
Resources
Continuous optimization books
 Numerical Optimization (Springer Series in Operations Research and Financial Engineering) by Jorge Nocedal and Stephen Wright, 2006
 Convex Optimization by Stephen Boyd and Lieven Vandenberghe, Cambridge, 2004
 Understanding and Using Linear Programming, J. Matousek and B. Gartner, Springer, 2007
 Introduction to Optimization, P. Pedregal, Springer, 2004
 Optimization for Machine Learning, Sra, Nowozin and Wright, MIT Press, 2011
 Foundations of Optimization, Guler, Springer, 2010
 Nonlinear Programming by Dimitri P. Bertsekas, Athena, 1999
 Practical Methods of Optimization by R. Fletcher, Wiley, 2000
 Practical Optimization by Philip E. Gill, Walter Murray, Margaret H. Wright, Academic, 1982
 The EM Algorithm and Extensions, 2nd Edition, by Geoffrey McLachlan and Thriyambakam Krishnan
Useful papers
 Sean Borman, The Expectation Maximization Algorithm: A short tutorial
 Yihua Chen and Maya Gupta, EM Demystified: An Expectation-Maximization Tutorial
 Jonathan Shewchuk, An Introduction to the conjugate gradient method without the agonizing pain, 1994
 F.A. Potra, S.J. Wright, "Interior point methods", Journal of Computational and Applied Mathematics, Volume 124, Issues 1-2, 1 December 2000, Pages 281-302 (find it here from a computer inside UIUC VPN)
 S. Shalev-Shwartz and Y. Singer, "Logarithmic regret algorithms for strongly convex repeated games", TR, Hebrew University, 2007
Striking behavior of Gradient Descent
 The Implicit Bias of Gradient Descent on Separable Data
D Soudry, E Hoffer, MS Nacson, S Gunasekar, N Srebro
 Implicit regularization in matrix factorization
S Gunasekar, BE Woodworth, S Bhojanapalli, B Neyshabur, N Srebro
 Nonconvex Optimization Meets Low-Rank Matrix Factorization: An Overview
Yuejie Chi, Yue M. Lu, Yuxin Chen
 Implicit Regularization in Nonconvex Statistical Estimation: Gradient Descent Converges Linearly for Phase Retrieval, Matrix Completion and Blind Deconvolution
Cong Ma, Kaizheng Wang, Yuejie Chi, Yuxin Chen
Graphical Models

 Variational inference in graphical models: The view from the marginal polytope
Martin J. Wainwright, Michael I. Jordan
Allerton Conference on Control, Communication and Computing, October 1-3, 2003. Quick and to the point; takes some work
 An introduction to LP relaxations for MAP inference, slides by Adrian Weller, 2017
 Graphical Models, Exponential Families, and Variational Inference, M.J. Wainwright and M.I. Jordan, Foundations and Trends in Machine Learning, 2008. The mother lode; everything is here, done rigorously, but quite hard work.

 Understanding Belief Propagation and its Generalizations
Jonathan S. Yedidia, William T. Freeman, and Yair Weiss, IJCAI 2001. Another mother lode, for BP and loopy BP
Decomposition
 S. Boyd, L. Xiao, A. Mutapcic, J. Mattingley, Notes on Decomposition methods, 2008
 Alexander M. Rush and Michael Collins. A tutorial on Lagrangian relaxation and dual decomposition for NLP. Journal of Artificial Intelligence Research, 2012
 Introduction to Dual Decomposition for Inference, David Sontag, Amir Globerson and Tommi Jaakkola
Book chapter in Optimization for Machine Learning, editors S. Sra, S. Nowozin, and S. J. Wright, MIT Press, 2011.
Alternating Direction Method of Multipliers
 Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein
Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
 André F. T. Martins, Mário A. T. Figueiredo, Pedro M. Q. Aguiar, Noah A. Smith, and Eric P. Xing.
"An Augmented Lagrangian Approach to Constrained MAP Inference."
International Conference on Machine Learning (ICML'11), Bellevue, Washington, USA, June 2011.
 J. Duchi, S. Shalev-Shwartz, Y. Singer, T. Chandra, "Efficient projections onto the l1-ball for learning in high dimensions", ICML 08
 S. Shalev-Shwartz, Y. Singer, "Efficient Learning of Label Ranking by Soft Projections onto Polyhedra", JMLR, 2006
 André F. T. Martins, Noah A. Smith, Mário A. T. Figueiredo, Pedro M. Q. Aguiar.
"Dual Decomposition With Many Overlapping Components."
Empirical Methods in Natural Language Processing (EMNLP'11), Edinburgh, UK, July 2011.
L1 Logistic Regression and regularization paths
 J. Mairal, B. Yu. Complexity Analysis of the Lasso Regularization Path. International Conference on Machine Learning, 2012
 Osborne, M., Presnell, B., and Turlach, B. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3):389–403, 2000
 Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani, Least Angle Regression, Annals of Statistics (with discussion) (2004) 32(2), 407-499.
 Trevor Hastie, Saharon Rosset, Rob Tibshirani and Ji Zhu, The Entire Regularization Path for the Support Vector Machine, NIPS 2004
Gradient Boost
 Jerome H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29, No. 5 (Oct., 2001), pp. 1189-1232
Neat MRF algorithm cheatsheet
 Which Energy Minimization for my MRF/CRF? A Cheat-Sheet, Bogdan Alexe, Thomas Deselaers, Marcin Eichner, Vittorio Ferrari, Peter Gehler, Alain Lehmann, Stefano Pellegrini, Alessandro Prest
Submodular function applications
 Submodular function tutorial slides (Krause, Guestrin, 2008)
 Batch Mode Active Learning and Its Application to Medical Image Classification, Hoi, Jin, Zhu and Lyu, ICML 08
Bilevel Optimization notes
 An Introduction to Bilevel Programming (Fricke, ND)
 Classification model selection via bilevel programming, Kunapuli, Bennett, Hu and Pang, Optimization Methods and Software, 2008
Papers
Continuous optimization applications
Iterative scaling and the like
 The improved iterative scaling algorithm: A gentle introduction, A. Berger, unpublished manuscript, 1997
 Iain Bancarz, M. Osborne, Improved iterative scaling can yield multiple globally optimal models with radically differing performance levels, Proceedings of the 19th International Conference on Computational Linguistics, 1-7, 2002
 Robert Malouf, A comparison of algorithms for maximum entropy parameter estimation, Proceedings of the 6th Conference on Natural Language Learning, Volume 20, 1-7, 2002
 F. Sha and F. Pereira, Shallow parsing with conditional random fields, Proc HLT-NAACL, Main papers, pp 134-141, 2003
 Hanna Wallach, Efficient Training of Conditional Random Fields, University of Edinburgh, 2002.
Boosting
 Jerome H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, Vol. 29, No. 5 (Oct., 2001), pp. 1189-1232
Bundle adjustment
 Bill Triggs, Philip F. McLauchlan, Richard I. Hartley and Andrew W. Fitzgibbon, Bundle Adjustment - A Modern Synthesis, Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms, Corfu, Greece, September 1999, pp 153-177
Matrix factorization
 Buchanan, A.M.; Fitzgibbon, A.W., Damped Newton algorithms for matrix factorization with missing data, Computer Vision and Pattern Recognition, 2005, 316-322
 Fast Maximum Margin Matrix Factorization for Collaborative Prediction, Jason D. M. Rennie, Nati Srebro, in Luc De Raedt, Stefan Wrobel (Eds.) Proceedings of the 22nd International Machine Learning Conference, ACM Press, 2005
 Scene Discovery by Matrix Factorization, N. Loeff, A. Farhadi, ECCV 2008
 T. Finley and T. Joachims, Supervised clustering with support vector machines, Proceedings of the 22nd International Conference on Machine Learning, 217-224, 2005
Registration
 A.W. Fitzgibbon, Robust registration of 2D and 3D point sets, Image and Vision Computing, Volume 21, Issues 13-14, 1 December 2003, Pages 1145-1153
SVM's
 C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121-167 (1998)
 Sequential minimal optimization: A fast algorithm for training support vector machines, J. Platt, Advances in Kernel Methods - Support Vector Learning, 1999
 S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, K. R. K. Murthy, Improvements to Platt's SMO Algorithm for SVM Classifier Design, Neural Computation, 13:637-649, 2001
 Dynamic visual category learning, T. Yeh and T. Darrell, CVPR, 2008
 K. Zhang, I.W. Tsang, J.T. Kwok, Maximum margin clustering made practical, Proceedings of the 24th International Conference on Machine Learning, 1119-1126, 2007
 Pegasos: Primal Estimated subGrAdient SOlver for SVM, S. Shalev-Shwartz, Y. Singer, N. Srebro, ICML 2008
Structure learning
 I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, Journal of Machine Learning Research (JMLR), 6(Sep):1453-1484, 2005.
 Subgradient Methods for Maximum Margin Structured Learning, Nathan D. Ratliff, J. Andrew Bagnell, Martin A. Zinkevich
 Learning to localize objects with structured output regression, Blaschko and Lampert, ECCV 2008
Efficient use of subgradients
 Smooth minimization of nonsmooth functions (Nesterov; Math. Prog 2005).
Discrete Optimization
Resources
 M. Goemans and D.P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, Journal of the ACM, Volume 42, Issue 6, 1115-1145, 1995
 L. Vandenberghe and S. Boyd, Semidefinite programming, SIAM Review, 38, 1, 49-95, 1996
 A Tutorial on Dual Decomposition and Lagrangian Relaxation for Inference in Natural Language Processing, A.M. Rush and M. Collins, Journal of Artificial Intelligence Research, 2012
Applications
Matching
 Y Rubner, C Tomasi, LJ Guibas, The Earth Mover's Distance as a Metric for Image Retrieval, International Journal of Computer Vision, 2000
 Grauman, K. Darrell, T., Fast contour matching using approximate earth mover's distance. CVPR 2004
 E Levina, P Bickel, The earth mover’s distance is the Mallows distance: Some insights from statistics, Proc. ICCV, 2001
 S Belongie, J Malik, J Puzicha, Matching shapes, Proc. of ICCV, 2001
 Maciel, J.; Costeira, J.P.; A global solution to sparse correspondence problems, Pattern Analysis and Machine Intelligence, IEEE Transactions on, Volume 25, Issue 2, Feb. 2003 Page(s):187  19
 A.Berg, T. Berg and J. Malik, Shape matching and object recognition using low distortion correspondence, CVPR, 2005
Markov Random Fields
 Boykov, Y. and O. Veksler, Graph cuts in vision and graphics: theories and applications, Handbook of Math. Models of Computer Vision, Paragios, Chen, Faugeras (eds)
 Boykov, Y., Veksler, O., Zabih, R., Fast approximate energy minimization via graph cuts, PAMI, 23, 11, 1222-1239, 2001
 Boykov, Y. Veksler, O. Zabih, R., Markov random fields with efficient approximations, CVPR, 1998
 P. Kohli and M. Pawan Kumar and P.H.S. Torr, P^3 and Beyond: Solving Energies with Higher Order Cliques, CVPR 2007
 C. Olsson, A.P. Eriksson and F. Kahl, Solving Large Scale Binary Quadratic Problems: Spectral Methods vs Semidefinite Programming, CVPR 2007
 Accelerated dual decomposition for MAP inference (Jojic, Gould, Koller; ICML 2010).
 MRF Energy Minimization and Beyond via Dual Decomposition (Komodakis, Paragios; PAMI 2010)
 A Comparative Study of Energy Minimization Methods for Markov Random Fields, R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M.Tappen, and C. Rother, ECCV 2006
 Minimizing Nonsubmodular Functions with Graph Cuts—A Review, V. Kolmogorov and C. Rother, PAMI 2007
 Convergent Tree-Reweighted Message Passing for Energy Minimization, V. Kolmogorov, PAMI 2006
 Approximate labeling via graph cuts based on linear programming, N Komodakis…, Pattern Analysis and Machine Intelligence, 2007
 Dynamic graph cuts for efficient inference in Markov random fields, P Kohli…, Pattern Analysis and Machine Intelligence, 2007
Submodular functions
 Kang, Jin, Sukthankar, Correlated label propagation with application to multilabel learning, CVPR 2006