Poster Abstracts. Workshop on Algorithms for Modern Massive Datasets
Evrim Acar
Department of Computer Science
Rensselaer Polytechnic Institute
COLLECTIVE SAMPLING AND ANALYSIS OF HIGH ORDER TENSORS FOR CHATROOM
COMMUNICATIONS
This work investigates the accuracy and efficiency tradeoffs between
centralized and collective algorithms for (i) sampling and (ii) n-way data
analysis techniques in multidimensional stream data, such as Internet
chatroom communications. Its contributions are threefold. First, we use
the Kolmogorov-Smirnov goodness-of-fit test to demonstrate that
statistical differences between real data obtained by collective sampling
in the time dimension from multiple servers and data obtained from a
single server are insignificant. Second, we show using the real data that
collective data analysis of 3-way data arrays (users x keywords x time) is
more efficient than centralized algorithms with respect to both space and
computational cost. Third, we examine the sensitivity of collective
constructions and analysis of high order data arrays to the choice of
server selection and sampling window size. We construct 4-way datasets
(users x keywords x time x servers) and analyze them to show the impact of
server and window size selections on the results.
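For concreteness, the kind of two-sample comparison involved can be
sketched as follows (a minimal illustration; synthetic data stands in for
the chatroom measurements, and all names here are hypothetical):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    # Hypothetical stand-ins: message inter-arrival times collected
    # collectively from multiple servers vs. from a single server.
    collective = rng.exponential(scale=2.0, size=500)
    single_server = rng.exponential(scale=2.0, size=500)

    # Two-sample Kolmogorov-Smirnov test: a large p-value indicates the
    # difference between the empirical distributions is insignificant.
    statistic, p_value = stats.ks_2samp(collective, single_server)
    print(statistic, p_value)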
Ioannis Antonellis
Computer Engineering and Informatics Department
University of Patras
TENSOR BASED TEXT REPRESENTATION: A NEW DIMENSION IN IR
We investigate the basics of and experiment with a tensor based document
representation in Information Retrieval. Most documents have an inherent
hierarchical structure that makes flexible multidimensional
representations, such as those offered by tensor objects, desirable.
We focus on the performance of special instances of a Tensor Model, in
which documents are represented using second-order (matrices) and
third-order tensors. We exploit the local structure encapsulated by the
proposed representation to approximate these tensors using higher-order
singular value and nonnegative tensor decompositions, and assemble the
results to form the final term-document matrix. We analyze the spectral
properties of this matrix and observe that topic identification is
enhanced by deploying k-plane clustering. Our results provide evidence
that tensor based models can be particularly effective for IR, offering an
excellent alternative to traditional VSM and LSI especially for text
collections of multi-topic documents.
Joint work with Efstratios Gallopoulos.
Onureena Banerjee
Department of Electrical Engineering and Computer Sciences
University of California at Berkeley
CONVEX OPTIMIZATION TECHNIQUES FOR LARGE-SCALE COVARIANCE SELECTION
We consider the problem of fitting a large-scale covariance matrix to
multivariate Gaussian data in such a way that the inverse is sparse, thus
providing model selection. Beginning with a dense empirical covariance
matrix, we solve a maximum likelihood problem with an $l_1$-norm penalty
term added to encourage sparsity in the inverse. For models with tens of
nodes, the resulting problem can be solved using standard interior-point
algorithms for convex optimization, but these methods scale poorly with
problem size. We present two new algorithms aimed at solving problems
with a thousand nodes. The first, based on Nesterov's first-order
algorithm, yields a rigorous complexity estimate for the problem, with a
much better dependence on problem size than interior-point methods. Our
second algorithm uses block coordinate descent, updating rows/columns of
the covariance matrix sequentially. Experiments with genomic data show
that our method is able to uncover biologically interpretable connections
among genes.
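In a standard formulation of this problem (the notation here is ours, not
taken verbatim from the abstract), one solves

$$\hat{X} = \arg\max_{X \succ 0} \; \log\det X - \mathrm{tr}(SX) - \rho \|X\|_{1},$$

where $S$ is the empirical covariance matrix, $\|X\|_{1} = \sum_{ij}
|X_{ij}|$, and $\rho > 0$ trades likelihood against sparsity of the
estimated inverse covariance $\hat{X}$.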
Christos Boutsidis
Computer Engineering and Informatics Department
University of Patras, Greece
PALSIR: A NEW APPROACH TO NONNEGATIVE TENSOR FACTORIZATION
PALSIR (Projected Alternating Least Squares with Initialization and
Regularization) is a new approach to Nonnegative Tensor Factorization
(NTF). PALSIR is designed to decompose a nonnegative (D1 x D2 x D3)
tensor T into a sum of 'k' nonnegative (D1 x D2 x D3) rank-1 tensors T_i
each of which can be written as the outer product of three nonnegative
vectors: x_i, y_i and z_i of dimensions (D1 x 1), (D2 x 1) and (D3 x 1)
respectively, i=1:k. PALSIR consists of the following phases: i)
initialization of 2 out of the 3 vector groups x_i's, y_i's, z_i's; ii) an
iterative tri-alternating procedure where, at each stage, two of the three
groups of vectors remain fixed and a nonnegative solution is computed with
respect to the third group. Each of these stages requires solving {D1,D2
or D3} constrained least squares (LS) problems. These are solved by first
solving a linear ill-posed inverse problem in the least squares sense
using Tikhonov regularization, and then projecting the solutions back onto
the feasible region, instead of solving nonnegative LS. We applied PALSIR to a
3d cube of images of the space shuttle Columbia taken by an Air Force
telescope system (in its last orbit before disintegration upon re-entry in
February 2003) in order to identify a parts-based representation of the
image collection. PALSIR appears to be competitive compared to other
available NTF algorithms both in terms of computational cost and
approximation accuracy.
Joint work with E. Gallopoulos, P. Zhang and R.J. Plemmons.
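A rough sketch of one such stage follows; this is not the authors'
implementation, and the regularization parameter lam and the
clip-to-zero projection are simplifying assumptions:

    import numpy as np

    def regularized_projected_ls(A, B, lam=1e-3):
        # Solve the Tikhonov-regularized LS problem
        #   min ||A W - B||_F^2 + lam ||W||_F^2
        # via the normal equations, then project the solution onto the
        # nonnegative orthant by clipping, in place of nonnegative LS.
        n = A.shape[1]
        W = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ B)
        return np.maximum(W, 0.0)

In the full tri-alternating procedure, an update of this kind would be
applied in turn to each of the three vector groups while the other two
remain fixed.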
Dongwei Cao
Department of Computer Science
University of Minnesota at Twin Cities
SUPPORT VECTOR MACHINE TRAINING WITH A FEW REPRESENTATIVES
We study algorithms that speed up the training process of support vector
machines by using only a relatively small number of representatives. We
show that kernel K-means can usually be expected to yield a good set of
representatives. The effectiveness is demonstrated with experiments on
real datasets and supported by a theoretical PAC-style generalization bound.
Maarten De Vos
Department of Electrical Engineering
Katholieke Universiteit Leuven
IMPOSING INDEPENDENCE CONSTRAINTS ON THE CP-MODEL
Data-driven decomposition techniques, such as Independent Component
Analysis (ICA), Canonical Correlation Analysis (CCA), and Canonical
Decomposition/PARAFAC (CP), have received increasing attention in recent
decades as exploratory tools in biomedical signal processing, as opposed
to model-driven techniques. Recently, "tensor-ICA", a
combination of ICA and the CP model was introduced as a new concept for
the decomposition of functional Magnetic Resonance data (fMRI) (Beckmann
et al., 2005). In that study, the trilinear structure was imposed after
the computation of the independent components. We propose another
algorithm that computes a trilinear decomposition with assumed independence in one
mode. In this algorithm, independent component and trilinear structure
constraints are imposed at the same time. We also show that this new
algorithm outperforms the previously proposed tensor-ICA.
Joint research with Lieven De Lathauwer and Sabine Van Huffel.
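For reference (standard CP notation, not taken verbatim from the
abstract), the CP model writes a third-order tensor $\mathcal{T}$ as a sum
of $R$ rank-1 terms,

$$\mathcal{T} \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r,$$

with the independence constraint imposed on the components of one mode,
e.g., the vectors $a_r$.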
Amit Deshpande
Department of Mathematics
Massachusetts Institute of Technology
FAST RELATIVE LOW-RANK MATRIX APPROXIMATION
Low-rank approximation using Singular Value Decomposition (SVD) is
computationally expensive for certain applications involving large
matrices.
Frieze-Kannan-Vempala (FKV) showed that from a small sample of rows of the
given matrix we can compute a low-rank approximation, which is (in
expectation) only an additive error worse than the "best" low-rank
approximation. This can be converted into a randomized algorithm to
compute this additive low-rank approximation in "linear" (in the number of
non-zero entries) time. But in general, their additive error can be
unbounded compared to the error of the "best" low-rank approximation.
Using some generalizations of the FKV sampling scheme, we strengthen
their results for low-rank approximation within multiplicative error.
Based on this we get a randomized algorithm to find such an approximation
in "linear" time.
Joint work with Luis Rademacher, Santosh Vempala and Grant Wang.
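The row-sampling primitive underlying FKV-style algorithms can be
sketched as follows (length-squared sampling with the standard rescaling;
the adaptive generalization is omitted):

    import numpy as np

    def length_squared_sample(A, s, seed=0):
        # Sample s rows of A with probability proportional to their
        # squared norms, rescaling each picked row so that S^T S is an
        # unbiased estimator of A^T A.
        rng = np.random.default_rng(seed)
        p = np.sum(A**2, axis=1) / np.sum(A**2)
        idx = rng.choice(A.shape[0], size=s, p=p)
        return A[idx] / np.sqrt(s * p[idx])[:, None]

The top right singular vectors of the small sampled matrix then define
the subspace used for the low-rank approximation of A.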
Mariya Ishteva
Department of Electrical Engineering
Katholieke Universiteit Leuven
RANK-(R1,R2,R3) REDUCTION OF TENSORS BASED ON THE RIEMANNIAN TRUST-REGION
SCHEME
We consider unstructured third-order tensors and look for the best
rank-(R1,R2,R3) approximation. In the matrix case, the best low-rank
approximation can be obtained from the truncated singular value
decomposition (SVD). However, in the tensor case, the truncated
Higher-order SVD (HOSVD) gives a suboptimal low-rank approximation of a
tensor, which can only be used as a starting value for iterative
algorithms. The algorithm we present is based on the Riemannian
trust-region method. We express the tensor approximation problem as
minimizing a cost function on a proper manifold. Making use of second
order information about the cost function, superlinear convergence is
achieved.
Joint work with Lieven De Lathauwer, Pierre-Antoine Absil, Rodolphe
Sepulchre, and Sabine Van Huffel.
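In standard notation, writing $\mathcal{A}$ for the given tensor, the
problem is equivalent to maximizing

$$g(U,V,W) = \left\|\mathcal{A} \times_1 U^T \times_2 V^T \times_3 W^T\right\|_F^2$$

over matrices $U$, $V$, $W$ with $R_1$, $R_2$ and $R_3$ orthonormal
columns, respectively. Since $g$ depends only on the column spaces of
$U$, $V$ and $W$, a natural domain for the optimization is a product of
Grassmann manifolds, and the cost function is then optimized on this
manifold.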
SungEun Jo
Department of Computer Science
Stanford University
SOLVING SECULAR EQUATIONS FOR LARGE TOTAL LEAST SQUARES/DATA LEAST SQUARES
PROBLEMS BY MEANS OF GAUSS QUADRATURE RULES
We approximate secular equations for total least squares (TLS) and data
least squares (DLS) problems by means of Lanczos tri-diagonalization
processes. Based on Gaussian Quadrature (GQ) rules, the best number of
Lanczos steps is determined with a given tolerance by investigating the
bounds of the secular equations. A numerical example shows the efficacy
of the GQ approach to solving the large TLS/DLS problems. We also discuss
some implementation issues such as bisection, stabilization, Q-less QR
preprocessing, and indefinite systems for TLS/DLS problems.
Joint work with Gene Golub and Zheng Su.
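The Lanczos process at the heart of this approach can be sketched as
follows (basic three-term recurrence, assuming no breakdown and skipping
reorthogonalization; the Gauss quadrature bounds are then obtained from
the eigenvalues of the tridiagonal matrix it produces):

    import numpy as np

    def lanczos(A, v, k):
        # k steps of Lanczos tri-diagonalization: builds an orthonormal
        # basis V and the entries (alpha, beta) of a tridiagonal T with
        # V^T A V = T. Assumes no breakdown (all beta > 0).
        n = len(v)
        V = np.zeros((n, k))
        alpha, beta = np.zeros(k), np.zeros(k)
        V[:, 0] = v / np.linalg.norm(v)
        for j in range(k):
            w = A @ V[:, j]
            alpha[j] = V[:, j] @ w
            w -= alpha[j] * V[:, j]
            if j > 0:
                w -= beta[j - 1] * V[:, j - 1]
            if j + 1 < k:
                beta[j] = np.linalg.norm(w)
                V[:, j + 1] = w / beta[j]
        return V, alpha, beta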
Andrew Knyazev
Department of Mathematical Sciences
University of Colorado at Denver
MULTISCALE SPECTRAL GRAPH PARTITIONING AND IMAGE SEGMENTATION
Spectral methods for graph partitioning, based on numerical solution of
eigenvalue problems with the graph Laplacian, are well known to produce
high quality partitioning, but are also considered to be expensive. We
discuss modern preconditioned eigensolvers for computing the Fiedler
vector via large-scale eigenvalue problems. The ultimate goal is to find a
method with a linear complexity, i.e. a method with computational costs
that scale linearly with the problem size. We advocate the locally optimal
block preconditioned conjugate gradient method (LOBPCG), suggested by the
presenter, as a promising candidate, if matched with a high quality
preconditioner. We provide preliminary numerical results, e.g., we show
that a Fiedler vector for a 24 megapixel image can be computed in seconds
on IBM's BlueGene/L using our BLOPEX software with Hypre algebraic
multigrid preconditioning.
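As a small-scale illustration, SciPy's implementation of LOBPCG can
compute a Fiedler vector as follows (toy path-graph Laplacian; the
multigrid preconditioner used in the large-scale experiments is omitted):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import lobpcg

    # Laplacian of a path graph, a toy stand-in for an image graph.
    n = 100
    A = sp.diags([np.ones(n - 1), np.ones(n - 1)], [-1, 1])
    L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A

    rng = np.random.default_rng(0)
    X = rng.standard_normal((n, 1))  # random initial block
    ones = np.ones((n, 1))           # deflate the known null vector
    vals, vecs = lobpcg(L.tocsc(), X, Y=ones, largest=False,
                        tol=1e-6, maxiter=500)
    fiedler = vecs[:, 0]  # its sign pattern bipartitions the graph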
Rajesh Kumar and Swaroop Jagadish
Department of Computer Science
University of California, Santa Barbara
RETRIEVAL OF BIOLOGICAL IMAGES BASED ON REGION SIMILARITY
The sub-regions of an image may be more 'interesting' than the entire
image. For example, an image of a retina has only a few regions that
depict detachment of proteins in a certain fashion. Detecting protein
detachments that are similar to a given image pattern in a large database
of retinal images is a non-trivial task. Optimal algorithms for such
pattern matching are practically infeasible and do not scale. The goal of
this project is to develop efficient heuristics for real-time detection
of such matching regions.
Pei Ling Lai
Department of Electronics Engineering
Southern Taiwan University of Technology
STOCHASTIC PROCESS METHODS FOR INFORMATION EXTRACTION
We consider several stochastic process methods for performing canonical
correlation analysis (CCA). The first uses a Gaussian Process formulation
of regression in which we use the current projection of one data set as
the target for the other and then repeat in the opposite direction. The
second uses a method which relies on probabilistically sphering the data,
concatenating the two streams and then performing a probabilistic PCA. The
third gets the canonical correlation projections directly without having
to calculate the filters first. We also investigate nonlinearity and
sparsification of these methods. Finally, we use a Dirichlet process of
Gaussian models in which the Gaussian models are determined by
Probabilistic CCA in order to model nonlinear relationships with a mixture
of linear correlations where the number of mixtures is not pre-determined.
Wanjun Mi
Institute for Computational and Mathematical Engineering
Stanford University
SHIFT SENSITIVITY OF EIGENFACE, EIGENPHASE, AND EIGENMAGNITUDE
The eigenface method applies Principal Component Analysis to a set of
learning images, from which eigenfaces are extracted. It is widely used
and studied in statistical image recognition. We demonstrate that this
method is sensitive to shifts in the learning images: the correlations of
images with the eigenvectors, also known as eigenfeatures, shift as the
learning images shift. We also compare the method with eigen-phase and
eigen-magnitude methods.
Niloy J. Mitra
Department of Computer Science
Stanford University
PROBABILISTIC FINGERPRINTS FOR SHAPES
We propose a new probabilistic framework for the efficient estimation of
similarity between 3D shapes. Our framework is based on local shape
signatures and is designed to allow for quick pruning of dissimilar
shapes, while guaranteeing not to miss any shape with significant
similarities to the query model in shape database retrieval applications.
Since directly evaluating 3D similarity for massive collections of
signatures on shapes is expensive and impractical, we propose a suitably
compact approximation based on probabilistic fingerprints, which are
computed from the shape signatures using Rabin's hashing scheme and a
small set of random permutations. We provide a probabilistic analysis that
shows that while the preprocessing time depends on the complexity of the
model, the fingerprint size, and hence the query time, depend only on the
desired confidence in our estimated similarity. Our method is robust to
noise, invariant to rigid transforms, handles articulated deformations,
and effectively detects partial matches. In addition, it provides
important hints about correspondences across shapes which can then
significantly benefit other algorithms that explicitly align the models.
We also demonstrate an extension of our algorithm to streaming data
applications and illustrate the utility of our method on a wide variety
of geometry processing applications.
A preliminary version of this work has been submitted to the Symposium on
Geometry Processing (2006).
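The general fingerprinting technique can be illustrated with a
min-hashing sketch over sets of signature hashes (an illustration of the
generic idea, not the authors' exact scheme; random linear hashes stand
in for true permutations):

    import random

    P = 2**61 - 1  # a large prime modulus

    def fingerprint(signature_hashes, num_perms=64, seed=1):
        # For each pseudo-random permutation of the hash universe, keep
        # the minimum permuted value over the signature set.
        rng = random.Random(seed)
        fp = []
        for _ in range(num_perms):
            a, b = rng.randrange(1, P), rng.randrange(P)
            fp.append(min((a * h + b) % P for h in signature_hashes))
        return fp

    def estimated_similarity(fp1, fp2):
        # The fraction of agreeing components estimates the Jaccard
        # similarity of the underlying signature sets.
        return sum(x == y for x, y in zip(fp1, fp2)) / len(fp1)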
Kourosh Modarresi
Institute for Computational and Mathematical Engineering
Stanford University
OUTLINES OF NON-CLASSICAL TIKHONOV METHOD
Today we know that ill-posed problems, for which the solution does not
depend continuously on the data, arise naturally from real physical
problems and are, in most cases, not the result of incorrect modeling but
a consequence of the inherent characteristics of the original physical
problems.
In solving the corresponding linear least squares problems that result
from discretization of the original models, the usual solution methods,
such as QR/SVD or the normal equations, yield solutions with very little
relevance to the exact solutions. The remedy for these ill-conditioned
least squares problems is the application of regularization methods, the
most widely used of which are TSVD and Tikhonov regularization. In the
Tikhonov regularization method, the most important steps are the choice
of the prior and of the regularization parameter, which controls the
level of regularization for the problem. The classical Tikhonov method is
based on a global prior and a global regularization parameter. This
approach underestimates the local features of the solution and may result
in oversmoothing. To address this difficulty, a more local approach is
needed. In this work, we consider such an approach in the solution of the
Tikhonov regularization method.
Joint work with Gene H. Golub.
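In its classical, global form the Tikhonov-regularized problem reads

$$\min_{x} \; \|Ax - b\|_{2}^{2} + \lambda^{2} \|L (x - x_{0})\|_{2}^{2},$$

where $x_0$ is the prior, $L$ is a regularization operator, and the
single parameter $\lambda$ sets one level of regularization for the whole
solution; the non-classical approach considered here relaxes this global
choice in favor of locally adapted regularization.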
Morten Mørup
Department of Signal Processing
Technical University of Denmark
EXTENSIONS OF NON-NEGATIVE MATRIX FACTORIZATION (NMF) TO HIGHER ORDER
DATA
Higher order matrix (tensor) decompositions are mainly used in
psychometrics, chemometrics, image analysis, graph analysis and signal
processing. For higher order data the two most commonly used
decompositions are the PARAFAC and the TUCKER model. If the data analyzed
is non-negative it may be relevant to consider additive non-negative
components. We here extend non-negative matrix factorization (NMF) to form
algorithms for non-negative TUCKER and PARAFAC decompositions.
Furthermore, we extend the PARAFAC model to account for shift and echo
effects in the data. To improve uniqueness of the decompositions we use
updates that can impose sparseness in any combination of modalities. The
algorithms developed are demonstrated on a range of datasets, spanning
from electroencephalography to audio and chemometric signals.
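For orientation, the matrix algorithm being generalized is classical
multiplicative-update NMF; a minimal sketch (Lee-Seung-style updates for
the Frobenius-norm error) follows:

    import numpy as np

    def nmf(V, k, iters=200, eps=1e-9, seed=0):
        # Factor a nonnegative matrix V into W @ H with W, H >= 0 using
        # multiplicative updates that decrease ||V - W H||_F^2.
        rng = np.random.default_rng(seed)
        W = rng.random((V.shape[0], k))
        H = rng.random((k, V.shape[1]))
        for _ in range(iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

The tensor algorithms apply analogous updates to the factor matrices of
the unfolded data array, with optional sparseness penalties in chosen
modes.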
Chaitanya Muralidhara
Department of Cellular and Molecular Biology
University of Texas at Austin
EXPLORING PHYLOGENETIC RELATIONSHIPS IN SEQUENCE ALIGNMENTS THROUGH
SINGULAR VALUE DECOMPOSITION
The 16S ribosomal RNA is a highly conserved molecule used by traditional
methods for phylogeny reconstruction to derive phylogenetic relationships
among organisms. We describe the SVD analysis of an alignment of
16S rRNA sequences from organisms belonging to different phylogenetic
domains. The dataset is transformed from a matrix of positions $\times$
organisms to a tensor of positions $\times$ code $\times$ organisms
through a binary encoding that takes into account the nucleotide at each
position. The tensor is flattened into a matrix of (positions $\times$
code) $\times$ organisms, and singular value decomposition is applied to
obtain a representation in the ``eigenpositions'' $\times$
``eigenorganisms'' space. These eigenpositions and eigenorganisms are
unique orthonormal superpositions of the positions and organisms
respectively. We show that the significant eigenpositions correlate with
the underlying phylogenetic relationships among the organisms examined.
The specific positions that contribute to each of these relationships,
identified from the eigenorganisms, correlate with known sequence and
structure motifs in the data, which are associated with functions like
RNA- or protein-binding. Among others, we identify unpaired adenosines
as significant contributors to phylogenetic distinctions. These adenosine
nucleotides, unpaired in the secondary (2D) structure, have been shown to
be involved in a variety of tertiary (3D) structural motifs, some of which
are believed to play a role in RNA folding [Gutell et al., \textit{RNA}
2000].
Joint work with Robin R. Gutell, Gene H. Golub, Orly Alter.
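A toy sketch of the encoding-and-decomposition pipeline described above
(the exact encoding conventions of the study may differ):

    import numpy as np

    # Toy alignment: three organisms, four positions each.
    alignment = ["ACGU", "ACGA", "UCGA"]
    code = "ACGU"
    n_pos, n_org = len(alignment[0]), len(alignment)

    # Binary encoding: positions x code x organisms indicator tensor.
    T = np.zeros((n_pos, len(code), n_org))
    for j, seq in enumerate(alignment):
        for i, nt in enumerate(seq):
            T[i, code.index(nt), j] = 1.0

    # Flatten to (positions x code) x organisms and decompose.
    flat = T.reshape(n_pos * len(code), n_org)
    U, s, Vt = np.linalg.svd(flat, full_matrices=False)
    # Following the eigengene/eigenarray naming convention, rows of Vt
    # are patterns across organisms and columns of U pick out the
    # (position, nucleotide) pairs contributing to each pattern.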
Larsson Omberg
Department of Physics
University of Texas at Austin
A TENSOR HIGHER-ORDER SINGULAR VALUE DECOMPOSITION FOR INTEGRATIVE
ANALYSIS OF DNA MICROARRAY DATA
The structure of DNA microarray data is often of an order higher than that
of a matrix, especially when integrating data from different studies.
When the data are flattened into a matrix format, much of their
information is lost. We describe the use of a higher-order singular value decomposition
(HOSVD) in transforming a tensor of genes $\times$ arrays $\times$
studies, which tabulates a series of DNA microarray datasets from
different studies, to a ``core tensor'' of ``eigengenes'' $\times$
``eigenarrays'' $\times$ ``eigenstudies,'' where the eigengenes,
eigenarrays and eigenstudies are unique orthonormal superpositions of the
genes, arrays and studies, respectively. This HOSVD, also known as N-mode
SVD, formulates the tensor as a linear superposition of all possible outer
products of an eigengene with an eigenarray with an eigenstudy, i.e.,
rank-1 ``subtensors,'' the superposition coefficients of which are
tabulated in the core tensor. Each coefficient indicates the significance
of the corresponding subtensor in terms of the overall information that
this subtensor captures in the data. We show that significant rank-1
subtensors can be associated with independent biological processes, which
are manifested in the data tensor. Filtering the insignificant
subtensors out of the data tensor simulates experimental observation of only
those processes associated with the significant subtensors. Sorting the
data according to the eigengenes, eigenarrays and eigenstudies appears to
classify the genes, arrays and studies, respectively, into groups of
similar underlying biology. We illustrate this HOSVD with an integration
of genome-scale mRNA expression data from yeast cell cycle time courses,
each of which is under a different oxidative stress condition. A novel
correlation between the DNA-binding of a transcription factor and the
difference in the effects of these oxidative stresses on the progress of
the cell cycle is predicted.
Joint work with Gene H. Golub, Orly Alter.
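A compact NumPy sketch of the (untruncated) HOSVD itself, assuming
nothing beyond standard NumPy:

    import numpy as np

    def unfold(T, mode):
        # Mode-n unfolding: mode-n fibers become columns of a matrix.
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def hosvd(T):
        # One orthogonal factor per mode from the SVD of each
        # unfolding, plus the core tensor from projecting all modes.
        factors = [np.linalg.svd(unfold(T, n), full_matrices=False)[0]
                   for n in range(T.ndim)]
        core = T
        for n, U in enumerate(factors):
            core = np.moveaxis(
                np.tensordot(U.T, np.moveaxis(core, n, 0), axes=1), 0, n)
        return core, factors

For a genes x arrays x studies tensor, the three factor matrices and the
core then correspond to the eigengenes, eigenarrays, eigenstudies and
superposition coefficients described above.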
Sri Priya Ponnapalli
Department of Electrical and Computer Engineering
University of Texas at Austin
A NOVEL HIGHER-ORDER GENERALIZED SINGULAR VALUE DECOMPOSITION FOR
COMPARATIVE ANALYSIS OF MULTIPLE GENOME-SCALE DATASETS
We define a higher-order generalized singular value decomposition (GSVD)
of two or more matrices $D_{i}$ with the same number of columns and, in
general, different numbers of rows. Each matrix is factored into a
product $U_i \Sigma_i X^{-1}$ of a matrix $U_i$ composed of the normalized
column basis vectors, a diagonal matrix $\Sigma_i$, and a nonsingular
matrix $X^{-1}$ composed of the normalized row basis vectors. The matrix
$X^{-1}$ is identical in all the matrix factorizations. The row basis
vectors are the eigenvectors of $C$, the arithmetic mean of all quotients
of the correlation matrices $D_{i}^{T}D_{i}$. The $n$th diagonal element
of $\Sigma_{i}$, $\Sigma_{i,n}$, indicates the significance of the $n$th
row basis vector in the $i$th matrix in terms of the overall information
that the $n$th row basis vector captures in the $i$th matrix. The ratio
$\Sigma_{i,n} / \Sigma_{j,n}$ indicates the relative significance of the
$n$th row basis vector in the $i$th matrix relative to the $j$th matrix.
We show that the eigenvalues of $C$ that correspond to row basis vectors
of equal significance in all matrices $D_{i}$, such that $\Sigma_{i,n} /
\Sigma_{j,n}=1$, are equal to 1; the eigenvalues that correspond to row
basis vectors which are approximately insignificant in one or more
matrices $D_{i}$ relative to all the other matrices $D_{j}$, such that
$\Sigma_{i,n} / \Sigma_{j,n} \approx 0$, are $\gg 1$. We show that the
column basis vector $U_{i,n}$ is orthogonal to all other column basis
vectors if the corresponding $n$th row basis vector is of equal
significance in all matrices, such that the corresponding eigenvalue of
$C$ is 1. These properties of the GSVD of two matrices [Golub \& Van Loan,
Johns Hopkins Univ.~Press 1996] are preserved in this higher-order GSVD of
two or more matrices.
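Spelled out under one natural reading of "the arithmetic mean of all
quotients" (pairwise quotients taken in both orders; $N$ denotes the
number of matrices, and this normalization is our assumption):

$$D_i = U_i \Sigma_i X^{-1}, \qquad C = \frac{1}{N(N-1)} \sum_{i \neq j} \left(D_i^T D_i\right)\left(D_j^T D_j\right)^{-1},$$

with the eigenvectors of $C$ supplying the shared row basis and each
eigenvalue indicating whether the corresponding vector is equally
significant in all matrices ($\lambda = 1$) or nearly insignificant in
some of them ($\lambda \gg 1$).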
Recently we showed that the mathematical row basis vectors uncovered in
the GSVD of two genome-scale mRNA expression datasets from two different
organisms, human and the yeast {\it Saccharomyces cerevisiae}, during
their cell cycle, correspond to the similar and dissimilar among the
biological programs that compose each of the two datasets [Alter, Brown \&
Botstein, {\it PNAS} 2003]. We now show that the mathematical row basis
vectors uncovered in this higher-order GSVD of five genome-scale mRNA
expression datasets from five different organisms, human, the yeast {\it
Saccharomyces cerevisiae}, the yeast {\it Schizosaccharomyces pombe},
bacteria and plant during their cell cycle, correspond to the similar and
dissimilar among the biological programs that compose each of the five
datasets. Row basis vectors of equal significance in all datasets
correspond to the cell cycle program which is common to all organisms; row
basis vectors which are approximately insignificant in one or more of the
datasets correspond to biological programs, such as synchronization
responses, that are exclusively manifested in all the other datasets and
might be exclusive to the corresponding organisms. Such comparative
analysis of genome-scale mRNA data among two or more model organisms,
which is not limited to orthologous or homologous genes across the different
organisms, promises to enhance fundamental understanding of the
universality as well as the specialization of molecular biological
mechanisms.
Joint work with Gene H. Golub, Orly Alter.
Luis Rademacher
Department of Mathematics
Massachusetts Institute of Technology
MATRIX APPROXIMATION AND PROJECTIVE CLUSTERING VIA ADAPTIVE SAMPLING
Frieze, Kannan and Vempala proved that the span of a small sample of rows
of a given matrix contains a low-rank approximation that is within a small
additive error (in Frobenius norm) of the best low-rank approximation, and
the sampling can be done efficiently using just two passes over the matrix.
We generalize this work by showing that the additive error drops
exponentially by iterating the sampling in an adaptive manner. This result
is one of the ingredients of the linear time algorithm for multiplicative
low-rank approximation by Deshpande and Vempala.
The existence of a small certificate for multiplicative low-rank
approximation leads to a PTAS for the following projective clustering
problem: Given a set of points in Euclidean space and integers k and j,
find j subspaces of dimension k that minimize the sum over the points of
squared distances of each point to the nearest subspace.
Joint work with Amit Deshpande, Santosh Vempala and Grant Wang.
Inam Ur Rahman
Institute for Computational and Mathematical Engineering
Stanford University
MANIFOLD-VALUED DATA MINING
New types of sensors and devices are being built every day. Not only are
we measuring huge amounts of data, but also new types of data that are
highly geometric in nature and inherently different from traditional
Euclidean-valued data. Typical examples include human motion data in
animation and diffusion tensor data in medical imaging. These data take
values on special Riemannian manifolds called symmetric spaces; we call
them 'manifold-valued' (M-valued) data. As the amount of M-valued data
reaches the terabyte mark, tools are needed that can efficiently mine
this type of data. For example, a radiologist might be interested in
searching many terabytes of diffusion tensor data to find images that
match, in a certain sense, a given DT image. In animation, one might be
interested in extracting from a huge motion capture database those motion
clips that match a given query clip or a description supplied by the
animator. Hence all the traditional supervised and unsupervised learning
issues, like clustering, classification, indexing, retrieval, and
searching, become important for M-valued data. Learning algorithms for
Euclidean-valued data may not be appropriate or directly applicable
because of the highly geometric and non-linear nature of M-valued data.
In this work, we discuss a new wavelet-like transform that we have
developed for M-valued data and its applicability in facilitating the
learning and data mining tasks cited above.
David Skillicorn
School of Computing
Queen's University
MATRIX DECOMPOSITIONS AND SECURITY PROBLEMS
Problems in counterterrorism, fraud, law enforcement, and organizational
attempts to watch for corporate malfeasance all require looking for traces
in large datasets. Sophisticated `bad guys' face two countervailing
pressures: the needs of the task or mission, and the need to remain
concealed. Because they are sophisticated, they do not show up as
outliers; because they are unusual, they don't show up as mainstream
either. Instead, they are likely to appear at the `edges' of structures in
the data.
Several properties of matrix decompositions make them superb tools to look
in the `edges' or `corners' of datasets. For example, SVD transforms data
into a space in which distance from the origin corresponds, in a useful
sense, to interestingness; data from innocent people provides a picture of
normal correlation, against which unusual correlation stands out; and the
symmetry between records and attributes makes it possible to investigate
how `edge' records differ from normal ones. The machinery of spectral
graph partitioning can also be used to look for unusual records or
values, using link prediction. Unresolved issues of normalization and removing
stationarity are critical to making these approaches work.
Other matrix decompositions also have a role to play. ICA, for example, is
powerfully able to discover small tightly-knit subgroups within a dataset.
It was able to discover cells within a dataset of al Qaeda links. The
importance of textual data suggests a role (as yet unfulfilled) for NNMF.
Jimeng Sun
Department of Computer Science
Carnegie Mellon University
BEYOND STREAMS AND GRAPHS: DYNAMIC TENSOR ANALYSIS
Time-evolving data models, such as time series, data streams, and graphs
over time, have been widely studied in the data mining field. We argue
that all these canonical examples can be covered and enriched by a
flexible model, the {\em tensor stream}: a sequence of tensors growing
over time.
Under this model, we propose two streaming algorithms for tensor PCA, a
generalization of PCA for a sequence of tensors instead of vectors. We
applied them in two real settings, namely, anomaly detection and multi-way
latent semantic indexing. We used two real, large datasets, one on network
flow data (100GB over 1 month) and one from DBLP (200MB over 25 years).
Our experiments show that our methods are fast, accurate and that they
find interesting patterns and outliers on the real datasets.
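One plausible core for such a streaming algorithm, sketched under our own
assumptions (per-mode covariance matrices of the tensor unfoldings are
maintained with exponential forgetting and re-diagonalized as each new
tensor arrives); this is a sketch, not necessarily the authors' exact
update:

    import numpy as np

    def unfold(T, mode):
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def update_mode(C, T_new, mode, r, forget=0.98):
        # Fold the new tensor's mode-d unfolding into a running
        # covariance, then keep the top-r eigenvectors as the current
        # projection matrix for that mode.
        X = unfold(T_new, mode)
        C = forget * C + X @ X.T
        w, V = np.linalg.eigh(C)
        return C, V[:, -r:]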
Grant Wang
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
EIGENCLUSTER: FINDING INNATE CLUSTERINGS ON THE FLY
We present a spectral algorithm for clustering massive data sets based on
pairwise similarities. The algorithm has guarantees on the quality of the
clustering found. The algorithm is especially well-suited for the common
case where data objects are encoded as sparse feature vectors and the
pairwise similarity between objects is the inner product between their
feature vectors; here, the algorithm runs in space linear in the number of
nonzeros in the object-feature matrix. The spectral algorithm outputs a
hierarchical clustering tree. We show how to use dynamic programming to
find the optimal tree-respecting clustering for many natural clustering
objective functions, such as k-means, k-median, min-diameter, and
correlation clustering. We evaluate the algorithm on a handful of
real-world datasets; the results show that our method compares favorably
with previously reported results. We also give an implementation of a
meta-search engine
that clusters results from web searches.
This is joint work with David Cheng, Ravi Kannan, and Santosh Vempala.
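A generic sketch of the divide step in such a spectral algorithm (sign
split on the second eigenvector of the normalized similarity matrix; the
paper's exact cut criterion and sparse, linear-space implementation may
differ):

    import numpy as np

    def spectral_bipartition(S):
        # One divide step: S is a symmetric nonnegative similarity
        # matrix with positive row sums. Split objects by the sign
        # pattern of the second eigenvector of D^{-1/2} S D^{-1/2},
        # mapped back through D^{-1/2}.
        d = S.sum(axis=1)
        Dinv = 1.0 / np.sqrt(d)
        N = Dinv[:, None] * S * Dinv[None, :]
        w, V = np.linalg.eigh(N)
        v2 = Dinv * V[:, -2]   # second-largest eigenvector
        return v2 >= 0         # boolean split into two clusters

Recursing on the two parts yields the hierarchical clustering tree on
which the dynamic program operates.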
Joab Winkler
Department of Computer Science
University of Sheffield
STRUCTURED MATRIX METHODS FOR THE COMPUTATION OF A RANK REDUCED SYLVESTER
MATRIX
The Sylvester resultant matrix S(p,q) is a structured matrix that can be
used to determine if two polynomials p=p(y) and q=q(y) are coprime, and if
they are not coprime, it allows their greatest common divisor (GCD) to be
computed. In particular, the rank loss of S(p,q) is equal to the degree of
the GCD of p(y) and q(y), and the GCD can be obtained by reducing S(p,q)
to row echelon form.
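The rank-loss property is easy to illustrate numerically; a minimal
sketch using the standard Sylvester-matrix construction (coefficients
listed from the highest degree down):

    import numpy as np

    def sylvester(p, q):
        # Sylvester resultant matrix S(p, q) for polynomials of degrees
        # m and n: n shifted copies of p stacked over m shifted copies
        # of q, giving an (m + n) x (m + n) matrix.
        m, n = len(p) - 1, len(q) - 1
        S = np.zeros((m + n, m + n))
        for i in range(n):
            S[i, i:i + m + 1] = p
        for i in range(m):
            S[n + i, i:i + n + 1] = q
        return S

    # p = (y - 1)(y - 2) and q = (y - 1)(y - 3) share the factor y - 1.
    p = np.array([1.0, -3.0, 2.0])
    q = np.array([1.0, -4.0, 3.0])
    S = sylvester(p, q)
    rank_loss = S.shape[0] - np.linalg.matrix_rank(S)
    # rank_loss equals 1, the degree of the GCD.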
The computation of the GCD of two polynomials arises in many applications,
including computer graphics, control theory and geometric modelling.
Experimental errors imply that the data consists of noisy realisations of
the exact polynomials p(y) and q(y), and thus even if p(y) and q(y) have a
non-constant GCD, their noisy realisations, f(y) and g(y) respectively,
are coprime. It is therefore only possible to compute an approximate GCD,
that is, a GCD of the polynomials f*(y) and g*(y) that are obtained by
small perturbations of f(y) and g(y). Different perturbations of f(y) and
g(y) yield different approximate GCDs, all of which are legitimate if the
magnitude of these perturbations is smaller than the noise in the
coefficients. It follows that f*(y) and g*(y) have a non-constant GCD, and
thus the Sylvester resultant matrix S(f*, g*) is a low rank approximation
of the Sylvester matrix S(f,g).
The method of structured total least norm (STLN) is used to compute the
rank reduced Sylvester resultant matrix S(f*, g*), given inexact
polynomials f(y) and g(y). Although this problem has been considered
previously, several issues have not been addressed, and these issues have
a considerable effect on the computed approximate GCD.
The GCD of f(y) and g(y) is equal (up to a scalar multiplier) to the GCD
of f(y) and ag(y), where a is an arbitrary constant, and it is shown that
a has a significant effect on the computed results. In particular,
although the GCD of f(y) and ag(y) is independent (up to an arbitrary
constant) of a, an incorrect value of a leads to unsatisfactory numerical
answers. This dependence on the value of a has not been considered
previously, and methods for the determination of its optimal value are
considered. It is shown that a termination criterion of the optimisation
algorithm that is based on a small normalised residual may lead to
incorrect results, and that it is also necessary to monitor the singular
values of S(f*,g*) in order to achieve good results. Several non-trivial
examples are used to illustrate the importance of a, and the effectiveness
of a termination criterion that is based on the normalised residual and
the singular values of S(f*,g*).
The dependence of the computed solution on the value of a has implications
for the method that is used for the solution of the least squares equality
(LSE) problem that arises from the method of STLN. In particular, this
problem is usually solved by the penalty method (method of weights), which
requires that the value of the weight be set, but its value is defined
heuristically, that is, it is independent of the data (the coefficients of
the polynomials). As noted above, the value of the parameter a is crucial
to the success or failure of the computed solution, and thus the presence
of a parameter that is defined heuristically is not satisfactory. The QR
decomposition, which does not suffer from this disadvantage, is therefore
used to solve the LSE problem.
Joint work with John D. Allan.