Tutorial

Tutorial I: BigData: Hashing Algorithms for Large-Scale Search, Learning, and Compressed Sensing

Ping Li
Associate Professor
Department of Statistics and Biostatistics
Department of Computer Science
Rutgers University
Piscataway, NJ 08854, USA
pingli@stat.rutgers.edu
www.stat.rutgers.edu/home/pingli

Abstract: Modern applications of internet search and machine learning routinely encounter datasets with billions of examples in a billion or even a billion squared dimensions (e.g., documents represented by high-order n-grams). Developing novel algorithms for efficient search and machine learning has become a highly active area of research. We will go over a series of state-of-the-art probabilistic hashing techniques, with applications in search, machine learning, and compressed sensing. Examples of applications include: (i) fitting logistic regression (or SVM) with extremely high-dimensional data; (ii) efficiently finding near neighbors (e.g., images or documents) in sublinear time without scanning all items in the repository; (iii) recovery of sparse signals (e.g., anomaly events from surveillance cameras) from nonadaptive linear measurements.

Bio-Sketch: Ping Li is an Associate Professor at Rutgers University, in the Department of Statistics and Biostatistics and the Department of Computer Science. He graduated from Stanford University with a Ph.D. in Statistics (plus Master's degrees in both CS and EE). Ping Li’s research interests include probabilistic hashing algorithms for big data, information retrieval, boosting, data streams, and compressed sensing. He has published extensively in premier venues in data mining, machine learning, and theory, including WWW, NIPS, UAI, ICML, KDD, SODA, and COLT. Ping Li’s research has been funded by the Department of Defense (DoD), Microsoft, Google, and the National Science Foundation (NSF). In particular, he was one of the PIs of the recent NSF-Bigdata program. Ping Li received the Young Investigator Award (YIP) from the Air Force Office of Scientific Research (AFOSR) and the YIP from the Office of Naval Research (ONR). He also won a prize in the 2010 Yahoo! Learning to Rank Grand Challenge using his own boosting/tree algorithms.

Table of Contents

The tutorial consists of four major components (roughly 45 minutes each), with a 10-minute break between consecutive components. The total length of the tutorial will be 3.5 hours. Depending on the interests of the audience, the content may be slightly adjusted during the tutorial. Some relevant references are provided below; the tutorial will select material from those papers.

1. Random projections, very sparse random projections, sign stable random projections

Ping Li, Trevor Hastie, and Kenneth Church, Very Sparse Random Projections, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2006
Ping Li, Trevor Hastie, and Kenneth Church, Improving Random Projections Using Marginal Information, Conference on Learning Theory (COLT), 2006
Ping Li, Very Sparse Stable Random Projections For Dimension Reduction, ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2007
Ping Li, Gennady Samorodnitsky, and John Hopcroft, Sign Cauchy Projections and Chi-Square Kernel, Neural Information Processing Systems (NIPS), 2013
Ping Li, Michael Mitzenmacher, and Anshumali Shrivastava, Coding for Random Projections, International Conference on Machine Learning (ICML), 2014
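
As a rough illustration of the first component, the sketch below builds a sparse random projection matrix whose entries are sqrt(s) times +1, 0, or -1 with probabilities 1/(2s), 1 - 1/s, and 1/(2s), and uses it to approximately preserve squared Euclidean distances. The function names, the pure-Python list representation, and the choice s = sqrt(D) in the usage lines are illustrative assumptions, not code from the tutorial or from the papers listed above.

```python
import math
import random


def sparse_projection_matrix(d, k, s, seed=0):
    """Return a d x k matrix with entries sqrt(s) * {+1, 0, -1}.

    Each entry is +sqrt(s) with probability 1/(2s), -sqrt(s) with
    probability 1/(2s), and 0 otherwise, so it has mean 0 and variance 1.
    """
    rng = random.Random(seed)
    scale = math.sqrt(s)
    matrix = []
    for _ in range(d):
        row = []
        for _ in range(k):
            u = rng.random()
            if u < 1.0 / (2 * s):
                row.append(scale)
            elif u < 1.0 / s:
                row.append(-scale)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix


def project(x, matrix):
    """Project a length-d vector to k dimensions, scaled by 1/sqrt(k)."""
    k = len(matrix[0])
    out = [0.0] * k
    for xi, row in zip(x, matrix):
        if xi == 0.0:
            continue  # zero coordinates contribute nothing
        for j, r in enumerate(row):
            if r != 0.0:
                out[j] += xi * r
    inv = 1.0 / math.sqrt(k)
    return [v * inv for v in out]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


if __name__ == "__main__":
    rng = random.Random(1)
    d, k = 400, 200
    s = math.sqrt(d)  # assumed "very sparse" setting: about 1 in sqrt(D) entries nonzero
    x = [rng.gauss(0, 1) for _ in range(d)]
    y = [rng.gauss(0, 1) for _ in range(d)]
    R = sparse_projection_matrix(d, k, s, seed=2)
    print("original squared distance: ", squared_distance(x, y))
    print("projected squared distance:", squared_distance(project(x, R), project(y, R)))
```

The projected squared distance is an unbiased estimate of the original one; larger k reduces its variance, while larger s makes the matrix sparser at the cost of somewhat higher variance.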

2. b-bit minwise hashing and applications in search and learning with bigdata

Ping Li and Christian Konig, Theory and Applications of b-Bit Minwise Hashing, Research Highlight Article in Communications of the ACM, 54 (August), 2011
Ping Li, Christian Konig, and Wenhao Gui, b-Bit Minwise Hashing for Estimating Three-Way Similarities, Neural Information Processing Systems (NIPS), 2010
Ping Li, Anshumali Shrivastava, Joshua Moore, and Christian Konig, Hashing Algorithms for Large-Scale Learning, Neural Information Processing Systems (NIPS), 2011
Anshumali Shrivastava and Ping Li, Fast Near Neighbor Search In High-Dimensional Binary Data, European Conference on Machine Learning (ECML), 2012
Anshumali Shrivastava and Ping Li, In Defense of Minhash over Simhash, International Conference on Artificial Intelligence and Statistics (AISTATS), 2014
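
To give a flavor of the second component, here is a minimal sketch of minwise hashing and of a b-bit variant that keeps only the lowest b bits of each minimum. It rests on several assumptions that are not from the tutorial itself: random linear hash functions modulo a Mersenne prime stand in for random permutations (they are only approximately min-wise independent), items are non-negative integers, and the b-bit estimator uses the simplified chance-collision term 1/2^b, which is reasonable only when the sets are small relative to the universe. All function names are hypothetical.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus (an assumption)


def make_hash_functions(k, seed=0):
    """Draw k random linear hash functions x -> (a*x + c) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def minhash_signature(items, hash_params):
    """For each hash function, keep the minimum hash value over the set."""
    return [min((a * x + c) % PRIME for x in items) for a, c in hash_params]


def estimate_resemblance(sig1, sig2):
    """Fraction of matching minima, an estimate of |A & B| / |A | B|."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


def estimate_resemblance_b_bit(sig1, sig2, b):
    """Estimate resemblance from only the lowest b bits of each minimum.

    Two different minima still agree on b bits with probability about
    1/2^b (sparse-data approximation), so that chance term is removed.
    """
    mask = (1 << b) - 1
    matches = sum(1 for u, v in zip(sig1, sig2) if (u & mask) == (v & mask))
    p_hat = matches / len(sig1)
    chance = 1.0 / (1 << b)
    return (p_hat - chance) / (1.0 - chance)


if __name__ == "__main__":
    rng = random.Random(7)
    ids = rng.sample(range(10**9), 1500)
    set_a = set(ids[:1000])
    set_b = set(ids[500:])  # shares 500 of 1500 distinct items with set_a
    params = make_hash_functions(500, seed=3)
    sig_a = minhash_signature(set_a, params)
    sig_b = minhash_signature(set_b, params)
    print("true resemblance:", len(set_a & set_b) / len(set_a | set_b))
    print("full minhash:    ", estimate_resemblance(sig_a, sig_b))
    print("2-bit minhash:   ", estimate_resemblance_b_bit(sig_a, sig_b, 2))
```

Storing b bits instead of a full hash value shrinks each signature considerably, at the price of needing more hash functions for the same accuracy.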

3. One permutation hashing, densified one permutation, and applications in search and learning

Ping Li, Art Owen, and Cun-Hui Zhang, One Permutation Hashing, Neural Information Processing Systems (NIPS), 2012
Anshumali Shrivastava and Ping Li, Densifying One Permutation Hashing via Rotation for Fast Near Neighbor Search, International Conference on Machine Learning (ICML), 2014

4. Compressed sensing (sparse signal recovery) with heavy-tailed design

Ping Li, Cun-Hui Zhang, and Tong Zhang, Compressed Counting Meets Compressed Sensing, Conference on Learning Theory (COLT), 2014
Ping Li, Cun-Hui Zhang, and Tong Zhang, Sparse Recovery with Very Sparse Compressed Counting, arXiv, 2014

Tutorial II: Databases and Algorithms in Computational Biology

M. Michael Gromiha, PhD
Associate Professor, Department of Biotechnology,
Indian Institute of Technology Madras, India
Associate Editor: BMC Bioinformatics
Website: http://www.biotech.iitm.ac.in/Gromiha
Email: gromiha@iitm.ac.in

Abstract: Computational biology is a fast-growing field, driven by advances in fast computers, efficient algorithms, massive data storage, and rapid exchange of information. Developments span diverse areas such as biological sequence analysis, structural insights, interaction networks, genome annotation, microarray data analysis, gene regulatory networks, pathways, and next-generation sequencing. These computational investigations address therapeutically important targets and pave the way for deep insights for experimental researchers. Owing to the functional importance of proteins and their interactions in living organisms, several computational analyses have been carried out to understand the mechanism of protein folding; the binding specificity of proteins with other molecules such as proteins, DNA, RNA, and ligands; interaction networks; pathways; and aggregation. It is therefore essential to know which bioinformatics databases are available and which algorithms have been developed in computational biology for proteins and their interactions.
This tutorial aims to cover various aspects of protein bioinformatics resources, from sequence to function. The first part focuses on the utility of databases, and the second part is devoted to the development of algorithms and web-based tools. The contents, usage, and construction of databases on protein sequences and structures, folding and stability, and the binding specificity of protein-protein and protein-nucleic acid complexes will be discussed with examples (2-4). The development of non-redundant datasets and the features used to develop computational models will be explained. On the algorithmic side, I will focus on predicting the secondary and tertiary structures of proteins, annotation of membrane proteins, stability of proteins upon mutation, protein folding rates, and identifying the binding sites in protein complexes (5-8). The methods used to assess the performance of prediction methods, validation procedures, and applications of specific methods will be discussed in detail. In essence, the tutorial will give protein researchers and students a deep understanding of the area and the basic information they need.

Bio-Sketch: M. Michael Gromiha received his Ph.D. in Physics from Bharathidasan University, Tiruchirappalli, India, in 1994. He pursued his postdoctoral research at the International Center for Genetic Engineering and Biotechnology (ICGEB), Trieste, Italy, and The Institute of Physical and Chemical Research (RIKEN), Tsukuba, Japan. He worked as a Research Scientist/Senior Research Scientist at the Computational Biology Research Center (CBRC) of the National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan, from 2002 to 2010. Currently, he is an Associate Professor at the Department of Biotechnology, Indian Institute of Technology Madras, India. His main research interests are structural analysis, prediction, folding and stability of globular and membrane proteins, protein interactions, and the development of databases and online tools in bioinformatics. He has published over 150 research articles and 30 reviews, and has written the book “Protein Bioinformatics: From Sequence to Function”. His papers have received more than 6000 citations. He is an Associate Editor of BMC Bioinformatics, Editor-in-Chief of Open Structural Biology Journal, and a member of the Nature Reader Panel. He is also a program committee member of ISMB, ECCB, and InCoB. He has received several awards, including the Oxford University Press Bioinformatics prize; the Okawa Science Foundation research grant award; Best paper award and Young scientist travel awards from JSPS, ISMB, ICTP, AMBO, etc.; the ICMR award for Senior Biomedical Scientists; and the INSA International fellowship for senior scientists.

References:

1. Gromiha MM. (2010) Protein Bioinformatics: From Sequence to Function. Elsevier/Academic Press.
2. Berman HM, Kleywegt GJ, Nakamura H, Markley JL. (2012) The Protein Data Bank at 40: reflecting on the past to prepare for the future. Structure 20: 391-6.
3. Gromiha MM, Sarai A. (2010) Thermodynamic database for proteins: features and applications. Methods Mol Biol. 609: 97-112.
4. Gromiha MM, Yabuki Y, Suresh MX, Thangakani AM, Suwa M, Fukui K. (2009) TMFunction: database for functional residues in membrane proteins. Nucleic Acids Res. 37(Database issue): D201-4.
5. Gromiha MM. (2007) Prediction of protein stability upon point mutations. Biochem Soc Trans. 35: 1569-73.
6. Gromiha MM, Huang LT. (2011) Machine learning algorithms for predicting protein folding rates and stability of mutant proteins: comparison with statistical methods. Curr Protein Pept Sci. 12: 490-502.
7. Gromiha MM, Nagarajan R. (2013) Computational approaches for predicting the binding sites and understanding the recognition mechanism of protein-DNA complexes. Adv. Protein Chem. Str. Biol. 91: 65-99.
8. Gromiha MM, Ou Y-Y. (2014) Bioinformatics approaches for functional annotation of membrane proteins. Brief. Bioinf. 15: 155-68.

Table of contents

Part I: Biological Databases
             Protein sequences and structures
             Protein folding and stability
             Protein interactions
             Development of non-redundant datasets

Part II: Algorithms in computational biology
             Prediction of secondary and tertiary structures of proteins
             Annotation of membrane proteins based on structure and function
             Stability of proteins upon mutations and protein folding rates
             Identifying the binding site residues in protein complexes
             Construction of features for prediction methods
             Assessment of prediction performance
             Validation of methods