CS 544 Optimization methods in vision and learning


Instructor: D.A. Forsyth, 3310 Siebel, daf@uiuc.edu

TA: Shruti Bhargava, shrutib2@illinois.edu

Office hours:



  1. Week 1: Basic continuous optimization (descent directions, coordinate ascent, EM as coordinate ascent, Newton's method, stabilized Newton's method) (my notes)
  2. Week 2, Week 3: More basic continuous optimization (trust regions, dogleg method, subspace method) (my notes); conjugate gradient (old notes); conjugate gradient and Polak-Ribiere (notes; use these for conjugate gradient as well); quasi-Newton methods for big problems, including DFP, BFGS, and limited-memory methods (my notes)
  3. Week 4, 5 and 6: Constrained optimization methods
  4. Initial remarks on Combinatorial optimization (my notes); Flow and cuts (my notes)
  5. alpha-expansion, alpha-beta swaps; MRF's as cuts (my notes):
  6. Stochastic Average Gradient (my notes)
  7. Matchings (my notes); bipartite graph matching as a linear program (my notes)
  8. Dual decomposition methods (my notes); ADMM (my notes; there really are two page 13's, sorry)
  9. The graphical models block
  10. Proximal Algorithms (my notes)
  11. Gradient Boost (my terse notes; more detailed notes with some stuff on xgboost)
  12. Striking behavior of gradient descent (my notes)
  13. The Linear Quadratic Regulator (my notes)
  14. Markov Decision Processes (my notes)
  15. Learning from an expert, I (my notes)
  16. Structure Learning (my notes)


    Continuous optimization books
  1. Numerical Optimization (Springer Series in Operations Research and Financial Engineering) by Jorge Nocedal and Stephen Wright, 2006
  2. Convex Optimization by Stephen Boyd and Lieven Vandenberghe, Cambridge, 2004
  3. Understanding and Using Linear Programming, J. Matousek and B. Gartner, Springer, 2007
  4. Introduction to Optimization, P. Pedregal, Springer, 2004
  5. Optimization for Machine Learning, Sra, Nowozin and Wright, MIT Press, 2011
  6. Foundations of Optimization, Guler, Springer, 2010
  7. Nonlinear Programming by Dimitri P. Bertsekas, Athena, 1999
  8. Practical Methods of Optimization by R. Fletcher, Wiley, 2000
  9. Practical Optimization by Philip E. Gill, Walter Murray, Margaret H. Wright, Academic, 1982
  10. The EM Algorithm and Extensions, 2nd Edition, by Geoffrey McLachlan and Thriyambakam Krishnan

    Useful papers

  1. Sean Borman, The Expectation Maximization Algorithm: A short tutorial
  2. Yihua Chen and Maya Gupta, EM Demystified: An Expectation-Maximization Tutorial
  3. Jonathan Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain, 1994
  4. F.A. Potra, S.J. Wright, "Interior point methods", Journal of Computational and Applied Mathematics
    Volume 124, Issues 1-2, 1 December 2000, Pages 281-302 (find it here from a computer inside UIUC VPN)
  5. S. Shalev-Shwartz and Y. Singer, "Logarithmic regret algorithms for strongly convex repeated games", TR, Hebrew University, 2007

Striking behavior of Gradient Descent

Graphical Models


  1. S. Boyd, L. Xiao, A. Mutapcic, J. Mattingley, Notes on Decomposition methods, 2008
  2. Alexander M. Rush and Michael Collins. A tutorial on Lagrangian relaxation and dual decomposition for NLP.
    In Journal of Artificial Intelligence Research, 2012
  3. Introduction to Dual Decomposition for Inference, David Sontag, Amir Globerson and Tommi Jaakkola
    Book chapter in Optimization for Machine Learning, editors S. Sra, S. Nowozin, and S. J. Wright, MIT Press, 2011.

Alternating Direction Method of Multipliers

  1. Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers
    S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein
    Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
  2. André F. T. Martins, Mário A. T. Figueiredo, Pedro M. Q. Aguiar, Noah A. Smith, and Eric P. Xing.
    "An Augmented Lagrangian Approach to Constrained MAP Inference."
    International Conference on Machine Learning (ICML'11), Bellevue, Washington, USA, June 2011.
  3. J. Duchi, S. Shalev-Shwartz, Y. Singer, T. Chandra, "Efficient projections onto the l1-ball for learning in high dimensions", ICML 08
  4. S. Shalev-Shwartz, Y. Singer, "Efficient Learning of Label Ranking by Soft Projections onto Polyhedra", JMLR, 2006
  5. André F. T. Martins, Noah A. Smith, Mário A. T. Figueiredo, Pedro M. Q. Aguiar.
    "Dual Decomposition With Many Overlapping Components."
    Empirical Methods in Natural Language Processing (EMNLP'11), Edinburgh, UK, July 2011.

L1 Logistic Regression and regularization paths

  1. J. Mairal, B. Yu. Complexity Analysis of the Lasso Regularization Path. International Conference on Machine Learning, 2012
  2. Osborne, M., Presnell, B., and Turlach, B. A new approach to variable selection in least squares problems. IMA J. Numer. Anal., 20(3):389–403, 2000
  3. Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani, Least Angle Regression, Annals of Statistics (with discussion), 32(2), 407-499, 2004.
  4. Trevor Hastie, Saharon Rosset, Rob Tibshirani and Ji Zhu, The Entire Regularization Path for the Support Vector Machine, NIPS 2004

Gradient Boost

  1. Jerome H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine , The Annals of Statistics, Vol. 29, No. 5 (Oct., 2001), pp. 1189-1232

    Neat MRF algorithm cheat-sheet

  1. Which Energy Minimization for my MRF/CRF? A Cheat-Sheet, Bogdan Alexe, Thomas Deselaers, Marcin Eichner, Vittorio Ferrari, Peter Gehler, Alain Lehmann, Stefano Pellegrini, Alessandro Prest

Submodular function applications

  1. Submodular function tutorial slides (Krause, Guestrin, 2008)
  2. Batch Mode Active Learning and Its Application to Medical Image Classification, Hoi, Jin, Zhu and Lyu, ICML 08

Bilevel Optimization notes

  1. An Introduction to Bilevel Programming (Fricke, ND)
  2. Classification model selection via bilevel programming, Kunapuli, Bennett, Hu and Pang, Optimization Methods and Software, 2008



Continuous optimization applications

Iterative scaling and the like

  1. The improved iterative scaling algorithm: A gentle introduction
    A Berger - Unpublished manuscript, 1997
  2. Iain Bancarz, M. Osborne, Improved iterative scaling can yield multiple globally optimal models with radically differing performance levels, Proceedings of the 19th international conference on Computational linguistics, 1 - 7, 2002
  3. Robert Malouf, A comparison of algorithms for maximum entropy parameter estimation,
    Proceedings of the 6th Conference on Natural Language Learning - Volume 20, 1 - 7, 2002
  4. F. Sha and F. Pereira, Shallow parsing with conditional random fields, Proc HLT-NAACL, Main papers, pp 134-141, 2003
  5. Hanna Wallach, Efficient Training of Conditional Random Fields, University of Edinburgh, 2002.



Bundle adjustment

  1. Bill Triggs, Philip F. McLauchlan, Richard I. Hartley and Andrew W. Fitzgibbon, Bundle Adjustment -- A Modern Synthesis, Vision Algorithms: Theory and Practice: International Workshop on Vision Algorithms, Corfu, Greece, September 1999, pp. 153-177

Matrix factorization

  1. Buchanan, A.M.; Fitzgibbon, A.W., Damped Newton algorithms for matrix factorization with missing data, Computer Vision and Pattern Recognition, 2005, 316-322
  2. Fast Maximum Margin Matrix Factorization for Collaborative Prediction, Jason D. M. Rennie, Nati Srebro, in Luc De Raedt, Stefan Wrobel (Eds.) Proceedings of the 22nd International Machine Learning Conference, ACM Press, 2005
  3. Scene Discovery by Matrix Factorization, N. Loeff, A. Farhadi, ECCV 2008
  4. T. Finley and T. Joachims, Supervised clustering with support vector machines, Proceedings of the 22nd international conference on Machine learning, 217-224, 2005


  1. A.W. Fitzgibbon, Robust registration of 2D and 3D point sets, Image and Vision Computing, Volume 21, Issues 13-14, 1 December 2003, Pages 1145-1153


        1. C.J.C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, 2, 121-167 (1998)
        2. Sequential minimal optimization: A fast algorithm for training support vector machines, J. Platt, Advances in Kernel Methods - Support Vector Learning, 1999
        3. S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, K. R. K. Murthy, Improvements to Platt's SMO Algorithm for SVM Classifier Design, Neural Computation, 13:637-649, 2001
        4. Dynamic visual category learning, T. Yeh and T. Darrell, CVPR, 2008
        5. K. Zhang, I.W. Tsang, J.T. Kwok, Maximum margin clustering made practical, Proceedings of the 24th international conference on Machine learning, 1119-1126, 2007
        6. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM, S. Shalev-Shwartz, Y. Singer, N. Srebro, ICML 2008

        Structure learning

        1. I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun, Large Margin Methods for Structured and Interdependent Output Variables, Journal of Machine Learning Research (JMLR), 6(Sep):1453-1484, 2005.
        2. Subgradient Methods for Maximum Margin Structured Learning, Nathan D. Ratliff, J. Andrew Bagnell, Martin A. Zinkevich
        3. Learning to localize objects with structured output regression, Blaschko and Lampert, ECCV 2008

        Efficient use of subgradients

        1. Smooth minimization of non-smooth functions (Nesterov; Math. Prog 2005).



    Discrete Optimization