Graph Embedding Day, Lyon, France
Gaussian Embeddings for Relational Data

Benjamin Piwowarski
based on work by Y. Jacob (L. Denoyer), L. Dos Santos, and H. Titeux
LIP6, CNRS, UPMC

Motivation: Heterogeneous graphs

Social networks = heterogeneous nodes and relationships

Objectives

Many tasks exist
  • Node classification
  • Link prediction
  • Partitioning (community detection)
  • Collaborative Filtering
  • Graph classification
  • Information diffusion
  • Anomaly detection
  • Regression
  • ...

Graph representation

  • Graph patterns
  • Node statistics, e.g.
    • Incoming and outgoing edges
    • “Distance” to other nodes
    • ...
Limitations
  • Might not be adapted to the problem
  • Human expertise
Learning representations
Similarity with the problem of word representation

Objectives

  • Homogeneous representation space: each node $n$ corresponds to a vector $x_n \in \mathbb{R}^d$
  • No human expertise
  • Latent space geometry reflects properties of entities
  • Handle content (image, text, ...)

State of the art

Graphs and representations
  • Knowledge graphs [Jenatton 12, Bordes 13,...]
    $p(x_i \rightarrow x_j) \sim \|z_i - z_j + z_r\|^2$
  • Neighborhood prediction [Mikolov 13, Perozzi 14, Grover 16, ...]
    $p(x_j | x_i) \propto \sigma(z_i \cdot z_j)$
  • Task specific cost
    Classification
    Graph Neural Networks [Scarselli 09], Embeddings [Jacob 14]
    Regression
    Embeddings [Ziat 15, Smirnova 16], ...
    Recommendation
    Matrix Factorization [Koren 09], ...
  • Attributed graphs: Graph CNN [Defferrad 16, Kipf 17], ...

Why uncertain representations?

Representation models should cope with:
  • Lack of information (= isolated nodes)
  • Contradictory information (= neighbors with conflicting properties)
Using distributions instead of fixed points
  • Mean reflects usual embedding properties
  • Lack of information = prior variance
  • Contradictory information = increases variance
  • Relationships

Why not Bayesian approaches?

Bayesian approach = estimate the posterior
$p(\theta \mid \mathcal D)$
$\theta$ = learned representations
$\mathcal D$ = data
Limitations
  1. Model flexibility (interaction between linked nodes)
  2. Learning and inference complexity

Gaussian representations

Vilnis et al. (2015): Word Representations via Gaussian Embedding
  • Unsupervised word representation
  • Skip-Gram adaptation
  • Gaussian densities $Z_i$ for each word $i$
$-E(Z_i, Z_j) = D_{KL}(Z_j \| Z_i) = \int_{x \in \mathbb{R}^n} \mathcal{N}(x; \mu_j, \Sigma_j) \log \frac{ \mathcal{N}(x; \mu_j, \Sigma_j) }{ \mathcal{N}(x; \mu_i, \Sigma_i) }\, dx$
Gaussian embeddings (1 std. dev.) for selected words (Vilnis 2015)

Outline

  • Node classification
  • Node classification with Gaussian embeddings
  • Recommendation with Gaussian embeddings

Heterogeneous graphs: classification and regression

Context

Task: Node Classification
LastFM tag prediction task
Data
Heterogeneous graph
Node type specific labels
A subset of labelled nodes
Task
Transduction: predict missing labels
e.g. predict photo tags, user preferences, etc.

State of the art

Content-based classification
Iterative classification [Neville 00, Belmouhcine 15]
  • Standard classification extended to relational data
  • Local classifier = node attributes + neighbor labels statistics
Label propagation
  • Random walks [Zhu 02, Zhou 14, Nandanwar 16]
    • $\tilde y_i = \sum_{j} p_{ij} y_j$ (see the sketch after this list)
    • No objective function
    • No learning (except edge weights)
  • Label-based regularization [Zhou 04,Belkin 06]
    $\sum_i \Delta(\tilde y_i , y_i ) + \lambda \sum_{ij} \|\tilde y_i - \tilde y_j\|^2$
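
For illustration, the random-walk propagation update above can be written in a few lines of NumPy. This is a minimal sketch, not the cited methods' implementations: it assumes a row-stochastic transition matrix P and one-hot labels Y, with labelled nodes clamped at each step.

```python
import numpy as np

def label_propagation(P, Y, labelled, n_iter=50):
    """P: (n, n) row-stochastic transition matrix.
    Y: (n, c) label matrix (one-hot rows for labelled nodes, zeros elsewhere).
    labelled: boolean mask of labelled nodes."""
    Y_hat = Y.astype(float).copy()
    for _ in range(n_iter):
        Y_hat = P @ Y_hat                 # y~_i = sum_j p_ij y~_j
        Y_hat[labelled] = Y[labelled]     # clamp the known labels
    return Y_hat
```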

State of the arts - limitations

Limits
  • Heterogeneity: nodes, edges and labels
  • Structure: nodes of the same type might be far apart in the graph
Solutions
  1. Projection to homogeneous graphs
    Problem: no interaction between nodes of different types
  2. Graffiti: Multi-hop random walk [Angelova 12]
    • Handles different node/label types
    • Two-hop random walk

Deterministic model (LaHNet)

Classification loss + graph regularization
$L(z, \theta) = L_{\mathrm{classification}}(z, \theta) + L_{\mathrm{graph}}(z)$
  1. Classification term: $L_{\mathrm{classification}}(z,\theta) = \sum_{i\in \mathcal{N}_C} \sum_{\ell \in \mathcal{L}_{t_i}} \frac{\psi_i}{\#\mathcal{L}_{t_i}} \Delta_C(f(z_i; \theta_\ell), y_{i\ell})$
    where $\Delta_C$ is a hinge loss and $f(\cdot;\theta_\ell)$ a linear classifier
  2. Graph regularization term: $L_{\mathrm{graph}}(z) = \sum_{(i,j,r)} \Phi_{ij} w_r \|z_i - z_j\|^2$
Note: $\psi$ and $\Phi$ are hyperparameters
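
A minimal NumPy sketch of this objective, assuming a simple data layout (per-label linear classifiers theta[l] = (w, b), labels[i] as (label, ±1) pairs, typed edges (i, j, r)) and taking $\Phi_{ij} = 1$ for brevity; this is an illustration, not the authors' implementation.

```python
import numpy as np

def lahnet_loss(z, theta, labels, labelled_nodes, edges, w_rel, psi=1.0):
    # 1. Classification term: hinge loss of each per-label linear classifier
    cls = 0.0
    for i in labelled_nodes:                     # i in N_C
        for l, y in labels[i]:                   # y in {-1, +1}
            w, b = theta[l]
            cls += psi / len(labels[i]) * max(0.0, 1.0 - y * (w @ z[i] + b))
    # 2. Graph regularization term over typed edges (i, j, r)
    reg = sum(w_rel[r] * np.sum((z[i] - z[j]) ** 2) for i, j, r in edges)
    return cls + reg
```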

Properties

Connected nodes are close in the latent space
  1. (Indirectly) connected nodes of the same type will be classified similarly
  2. Exploits correlations between labels of (connected) nodes of different types

Learning relation-specific weights

[Figures: LastFM statistics, P(User|User) and P(User|Track)]

Learning relation-specific weights

  • Grid search: too many hyperparameters
  • Solution: continuous optimization of hyperparameters [Bengio 00, Luketina 16]
  • Coordinate-wise gradient descent
    • Update representations ($z$) and classifiers ($\theta$) using $\mathcal{N}_C$
    • Update relation weights using $\mathcal{N}_W$ s.t. $\mathcal{N}_C \cap \mathcal{N}_W = \emptyset$: $L_W(w) = \sum_{i\in \mathcal{N}_W} \sum_{\ell \in \mathcal{L}_{t_i}} \frac{\psi_i}{\#\mathcal{L}_{t_i}} \Delta_C(f(z_i(w); \theta_\ell(w)), y_{i\ell})$
      Express the representations w.r.t. the hyperparameters:
      1. $\nabla L_{\mathrm{graph}} = 0 \implies$ closed form for $z_i^\star(w)$
      2. Gradient descent on $w$
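
A hedged NumPy sketch of step 1: with only the graph term and taking $\Phi_{ij} = 1$, setting the gradient to zero makes each $z_i$ the relation-weighted mean of its neighbours, which can be reached by fixed-point sweeps. The data layout (adjacency lists, w_rel) is an assumption, not the authors' code, and every node is assumed to have at least one neighbour.

```python
import numpy as np

def z_star(z0, adjacency, w_rel, n_sweeps=20):
    """adjacency[i]: list of (j, r) pairs; w_rel[r]: weight of relation r."""
    z = z0.copy()
    for _ in range(n_sweeps):
        new_z = np.empty_like(z)
        for i, neigh in enumerate(adjacency):
            weights = np.array([w_rel[r] for _, r in neigh])
            neigh_z = np.stack([z[j] for j, _ in neigh])
            new_z[i] = weights @ neigh_z / weights.sum()   # weighted mean of neighbours
        z = new_z
    return z  # L_W(w) is then evaluated at z*(w) and differentiated w.r.t. w
```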

Heterogeneous Classification with Gaussian Embeddings (HCGE)

Each node $i$ is associated with
$Z_i \sim \mathcal N(\mu_i, \Sigma_i)$
where $\Sigma_i$ is diagonal (D) or spherical (S)

Classification loss

  1. Expected-value loss: $\Delta_{EV}(Z, y) = \max(0; 1 - y \times \mathbb{E}_Z(f_{\theta_\ell}(Z)))$
  2. Probabilistic loss: $\Delta_{Pr}(Z, y) = - \log P(y \times f_{\theta_\ell}(Z) > 0)$
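
Both losses have simple closed forms for a linear classifier $f(z) = w \cdot z + b$: since $Z \sim \mathcal N(\mu, \Sigma)$, $f(Z)$ is Gaussian with mean $w \cdot \mu + b$ and variance $w^\top \Sigma w$. A hedged NumPy sketch for diagonal $\Sigma$ (function and variable names are ours):

```python
import numpy as np
from scipy.stats import norm

def delta_ev(mu, sigma_diag, w, b, y):
    """Hinge loss on the expected classifier score, y in {-1, +1}."""
    return max(0.0, 1.0 - y * (w @ mu + b))

def delta_pr(mu, sigma_diag, w, b, y):
    """-log P(y * f(Z) > 0) under the Gaussian embedding Z."""
    mean = w @ mu + b
    std = np.sqrt(np.sum(w ** 2 * sigma_diag))   # std of f(Z) for diagonal Sigma
    return -norm.logcdf(y * mean / std)
```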

Heterogeneous Classification with Gaussian Embeddings (HCGE)

Graph Regularization loss

$D_{KL}(Z_j \| Z_i) = \int_{x\in \mathbb{R}^d} \mathcal{N}(x; \mu_j,\Sigma_j) \log\frac{\mathcal{N}(x; \mu_j,\Sigma_j)}{\mathcal{N}(x; \mu_i,\Sigma_i)}\, dx = \frac{1}{2}\left(\mathrm{tr}(\Sigma_i^{-1}\Sigma_j) + (\mu_i - \mu_j)^T\Sigma_i^{-1}(\mu_i-\mu_j) - d - \log\frac{\det(\Sigma_j)}{\det(\Sigma_i)}\right)$
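
A minimal NumPy version of this closed form for the diagonal covariances used here (function name is ours), passing variances as vectors:

```python
import numpy as np

def kl_diag_gaussians(mu_j, var_j, mu_i, var_i):
    """D_KL(N(mu_j, diag(var_j)) || N(mu_i, diag(var_i)))."""
    d = mu_j.shape[0]
    return 0.5 * (np.sum(var_j / var_i)                    # tr(Sigma_i^-1 Sigma_j)
                  + np.sum((mu_i - mu_j) ** 2 / var_i)     # Mahalanobis term
                  - d
                  - np.sum(np.log(var_j) - np.log(var_i)))  # -log det ratio
```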

Experimental protocol

  • 5 datasets (DBLP, Flickr, LastFM (×2), IMDB)
  • 3 representative baselines
    • Unsupervised (LINE)
    • Homogeneous (HLP)
    • Heterogeneous (Graffiti)
  • Evaluation = micro Precision@k
  • Varying size of the training set (in % of labelled nodes)

Results

Train ratio               10%                        50%
Model                     DBLP   Flickr  LastFM      DBLP   Flickr  LastFM
LINE                      19.5   20.7    20.4        22.3   21.8    20.5
HLP                       24.1   26.3    38.4        39.4   54.1    52.1
Graffiti                  30.9   24.5    40.1        41.2   54.0    53.2
LaHNet                    32.1   29.3    36.3        44.4   54.0    56.6
HCGE($\Delta_{EV}$, S)    30.9   32.7    44.0        44.6   55.8    60.4
HCGE($\Delta_{EV}$, D)    30.4   32.6    43.6        43.9   55.8    60.3
HCGE($\Delta_{Pr}$, S)    27.9   29.7    27.8        45.5   54.8    58.5
HCGE($\Delta_{Pr}$, D)    28.3   31.9    29.4        45.7   55.9    58.9

Learned weights

Learned weights (user, LastFM)

Classifiers interaction

Cosine similarity between linear classifier parameters (LastFM) for tracks and users (common labels)

HCGE: PageRank vs Variance

PageRank vs variance (LastFM)

Recommendation

Task

Collaborative filtering
  • Data = Ratings given by users on a (small) subset of items
  • Goal = recommend new items to users
  • Hypothesis = users that rated items similarly will rate new items similarly

Matrix factorization

Koren (2009): Matrix Factorization Techniques for Recommender Systems
Score = inner product of user/item representations + bias
$r_{ui} = \mu_i \cdot \mu_u + b_i$
Optimized by minimizing the least-squares error (with some regularization)
$\sum_{(u,i)} (\hat r_{ui} - r_{ui})^2$
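
A minimal SGD sketch of this factorization (dimension, learning rate and regularization weight are illustrative choices, not those of Koren 2009):

```python
import numpy as np

def train_mf(ratings, n_users, n_items, d=32, lr=0.01, reg=0.05, n_epochs=20):
    """ratings: list of (u, i, r) triples."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, d))   # user representations mu_u
    Q = rng.normal(scale=0.1, size=(n_items, d))   # item representations mu_i
    b = np.zeros(n_items)                          # item biases b_i
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = (Q[i] @ P[u] + b[i]) - r         # r_hat - r
            pu = P[u].copy()
            P[u] -= lr * (err * Q[i] + reg * P[u])
            Q[i] -= lr * (err * pu + reg * Q[i])
            b[i] -= lr * (err + reg * b[i])
    return P, Q, b
```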

Learning to rank approaches in recommendation

Pointwise
  • Regression: Matrix factorization (Koren 2009)
  • Classification: Each class is an item to recommend (Covington 2016)
Pairwise
  • BPR (Bayesian Personalized Ranking): likelihood of observed preferences (Rendle 2009)
  • Neural-network based: matching preferences (Sidana 2018)
Listwise
  • CliMF: reciprocal rank (Shi 2012)
  • CofiRank: lower bound of nDCG (Weimer 2007)

Limits

The main limits of these models
Cold-start
Usually dealt with using meta-information
Contradictions in ratings
No satisfying solution
Diversification
No direct way to estimate the covariance of results (Wang and Zhu, 2009)

Gaussian model for recommendation

$X_u \sim \mathcal N(\mu_u, \Sigma_u)$ and $X_i \sim \mathcal N(\mu_i, \Sigma_i)$
with
$\Sigma_\bullet = \mathrm{diag}(\sigma_{\bullet 1}, \ldots, \sigma_{\bullet d})$
Prior on (mean, precision) is a Normal-Gamma distribution (with mode: mean 0, variance 1)

Gaussian representations

[Figures: representation densities and inner product densities for three items having the same mean but different variances, and a user]

Pairwise learning to rank (GER-P)

Maximum A Posteriori criterion
$p\left(\mathcal D \mid \Theta\right)p(\Theta)=\prod_{\left(u,i,j\right)\in \mathcal D}p \left(i>_{u}j \mid \Theta\right)p(\Theta)$

BPR

$p\left(i>_{u}j \mid \Theta\right)=\sigma\left(\underbrace{\mu_{i}\cdot \mu_{u} +b_i }_{r_{ui}} - \underbrace{(\mu_{j}\cdot \mu_{u}+b_j)}_{r_{uj}}\right)$

GER-P

$p\left(i>_{u}j \mid \Theta\right) = p\left(X_{i}\cdot X_{u}+b_{i}>X_{j}\cdot X_{u}+b_{j} \mid \Theta\right) = p\Big(\underbrace{X_{u}\cdot\left(X_{j}-X_{i}\right)}_{Z_{uij}}<b_{i}-b_{j}\,\Big|\,\Theta\Big)$

Pairwise learning to rank (GER-P)

We approximate $Z_{uij}$ by a normal with matching moments:
$\mathbb{E}\left[ Z_{uij} \right] = \mu_u^{\top}(\mu_j - \mu_i)$
$\mathrm{Var}\left[Z_{uij}\right] = 2\mu_u^{\top} \left(\Sigma_i + \Sigma_j\right) \mu_u + \left(\mu_j-\mu_i\right)^{\top}\Sigma_u (\mu_j-\mu_i) + \mathrm{tr}\left(\Sigma_u \left(\Sigma_j + \Sigma_i\right) \right)$
Brown (1977) Means and Variances of Stochastic Vector Products with Applications to Random Linear Models
In practice, the approximation error is within ±0.05 for 99% of the samples
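
A hedged NumPy sketch of the resulting pairwise probability: with the moment-matched Gaussian, $p(i >_u j \mid \Theta) = \Phi\big((b_i - b_j - \mathbb{E}[Z_{uij}]) / \sqrt{\mathrm{Var}[Z_{uij}]}\big)$. The moments below follow the formulas above; diagonal covariances are passed as variance vectors, and all names are ours.

```python
import numpy as np
from scipy.stats import norm

def prob_i_over_j(mu_u, var_u, mu_i, var_i, b_i, mu_j, var_j, b_j):
    diff = mu_j - mu_i
    mean = mu_u @ diff                              # E[Z_uij]
    var = (2 * mu_u @ ((var_i + var_j) * mu_u)      # moments as written above
           + diff @ (var_u * diff)
           + np.sum(var_u * (var_i + var_j)))
    return norm.cdf((b_i - b_j - mean) / np.sqrt(var))
```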

Listwise learning to rank (GER-L)

SoftRank

Optimizes $\mathbb{E}(\mathrm{nDCG} \mid u, I_u)$, where $I_u$ is the set of items to rank for user $u$
$\mathrm{nDCG} = \frac{1}{CG_{\max}} \sum_{(i,r) \in \mathrm{list}} g(i)\, d(r)$

GER-L

$p(S \mid u, I_u) = \mathbb{E}(\mathrm{nDCG} \mid u, I_u) = \sum_{r} \sum_{i} p(S \mid r, u, i, I_u)\, p(R = r \mid u, i, I_u)\, P(i \mid u, I_u)$

Listwise learning to rank (GER-L)

with
$p(S \mid r, u, i, I_u) = \frac {g(i)\, d(r)}{CG_{\max}}$
Recursion
$P(R = r \mid i, I_u \cup \{j\}) = P(R = r \mid i, I_u)\, P(i >_u j) + P(R = r - 1 \mid i, I_u)\, P(i <_u j)$
We then use a maximum a posteriori framework (contrary to SoftRank)
$\theta^\star = \mathrm{argmax}_{\theta}\, \log p(\theta) + \log \underbrace{p(S \mid u, I_u, \theta)}_{\mathrm{observation}}$
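
The recursion can be implemented directly; a minimal NumPy sketch that builds $P(R = r \mid i, I_u)$ one competitor at a time from the pairwise probabilities $p(i >_u j)$ (zero-based ranks, names are ours):

```python
import numpy as np

def rank_distribution(p_i_beats):
    """p_i_beats[j] = p(i >_u j) for every other item j in I_u."""
    dist = np.array([1.0])                 # with no competitors, rank 0 w.p. 1
    for p in p_i_beats:
        new = np.zeros(len(dist) + 1)
        new[:-1] += dist * p               # i beats j: rank unchanged
        new[1:] += dist * (1 - p)          # j beats i: rank shifted by one
        dist = new
    return dist                            # dist[r] = P(R = r | i, I_u)
```

The expected nDCG is then obtained by combining these rank distributions with the gains $g(i)$ and discounts $d(r)$.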

Ranking

Direct
$i >_{u} j \iff p\left(i>_{u}j \mid \Theta\right) > 0.5$, which gives $s_{ui} = \mu_u \cdot \mu_i + b_i$
Positive
Exploits the variance of the score
$s_{ui} = p(X_u \cdot X_i + b_i > 0)$
Portfolio
Balances expected score against risk (controlled by $\alpha$)
$\mathbb{E}\left(\sum_{i\in L} S_{ui}\right) - \alpha\, \mathrm{Var}\left(\sum_{i\in L} S_{ui}\right)$
Optimized greedily: each step adds the item maximizing
$\mathbb{E}(S_{ui}) - \alpha\, \mathrm{Var}(S_{ui}) - 2 \alpha \sum_{j \in L_n} \mathrm{cov}\left(S_{ui}, S_{uj}\right)$
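
A hedged sketch of this greedy portfolio re-ranking, given the per-user score means, variances and covariances (how these are derived from the Gaussian representations is not shown here; k and alpha are illustrative):

```python
import numpy as np

def portfolio_rerank(means, cov, alpha=0.5, k=10):
    """means: (n,) expected scores E(S_ui); cov: (n, n) covariance of the scores."""
    selected, remaining = [], list(range(len(means)))
    for _ in range(min(k, len(means))):
        gains = [means[i] - alpha * cov[i, i]
                 - 2 * alpha * sum(cov[i, j] for j in selected)
                 for i in remaining]
        best = remaining[int(np.argmax(gains))]
        selected.append(best)
        remaining.remove(best)
    return selected
```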

Experimental protocol

  • 3 datasets (Movielens 100k, Yahoo and Yelp) for GER-P, Movielens 1M for GER-L
  • GER-P is evaluated with the "positive" strategy (GER-L with all strategies)
  • Metric = nDCG@1, @5 and @10 (reranking of test items)
  • Baselines = Most Popular (MP), Soft Margin (SM), BPR matrix factorization (BPRMF), CofiRank (CR)

Statistics

Dataset           Users                  Items                      Ratings
% Train           10     20     50       10     20     50           10    20    50
Yahoo!            3386   1645   286      1000   1000   995          159   100   33
MovieLens         743    643    448      1336   1330   1307         95    91    81
Yelp              13775  8828   3388     44877  39355  27878        980   791   467

Experimental results (GER-P / Yahoo)

Train size        10                     30                     50
Model             N@1   N@5   N@10       N@1   N@5   N@10       N@1   N@5   N@10
MP                53.0  59.1  67.3       52.5  58.3  66.4       53.6  57.8  64.0
BPRMF             52.8  59.0  67.2       52.2  58.3  66.4       52.2  57.7  63.5
SM                50.9  56.7  65.4       49.7  55.6  64.2       49.9  54.1  60.3
CR                53.5  60.3  68.2       57.8  61.7  68.9       56.0  60.0  65.6
GER-P             53.5  60.3  68.3       53.8  60.7  68.2       54.3  59.6  65.3

Experimental results (GER-P / MovieLens 100K)

Train size        10                     30                     50
Model             N@1   N@5   N@10       N@1   N@5   N@10       N@1   N@5   N@10
MP                66.0  64.7  65.8       68.4  65.3  66.3       69.1  67.4  67.5
BPRMF             66.1  64.6  65.7       66.3  64.3  65.8       66.9  65.0  66.2
SM                55.9  57.5  60.3       58.3  59.6  61.6       58.6  60.4  62.5
CR                69.0  67.3  68.6       69.7  68.5  69.5       71.4  69.4  70.6
GER-P             70.3  67.7  70.0       72.0  69.5  71.1       72.5  71.3  71.5

Experimental results (GER-P / Yelp)

Train size        10                     30                     50
Model             N@1   N@5   N@10       N@1   N@5   N@10       N@1   N@5   N@10
MP                40.7  41.5  46.9       39.5  39.9  44.7       37.4  37.6  41.4
BPRMF             40.8  41.3  46.8       39.6  39.8  44.6       37.3  37.2  40.9
SM                37.3  38.3  44.4       35.8  36.9  41.9       33.4  34.1  38.0
CR                47.1  46.9  51.1       46.5  46.6  50.4       46.2  45.8  48.6
GER-P             55.2  52.2  56.2       57.4  53.5  56.4       58.2  53.7  55.3

Experimental results (GER-L / MovieLens 1M)

Preliminary results
Train size        10                     30                     50
Model             N@1   N@5   N@10       N@1   N@5   N@10       N@1   N@5   N@10
BPRMF             56.9  54.0  54.9       57.5  54.2  54.6       55.9  53.1  53.2
BPR-L             65.5  59.7  56.8       66.0  60.5  57.0       65.9  59.4  55.6
SOFTRANK-L        64.5  59.2  57.5       66.0  60.2  56.9       65.3  59.3  55.6
GER-P             68.5  62.3  58.1       67.3  61.4  56.9       68.0  61.1  56.2
GER-L             67.2  61.5  57.8       67.3  61.7  57.3       68.4  61.7  56.5

Qualitative results (mean and variance)

[Figures: learned means and variances for users and items]

Qualitative results (variances)

[Figures: variance distributions for users and items, with 10% and 50% training data]

Conclusion

Conclusion and ongoing work

Main points
  • Importance of training for the task
Representation learning for graph node classification
  • Importance of learning hyperparameters
Gaussian representations
  • Successful approach in 3 tasks (recommendation, node classification and regression)
Ongoing works (Recommendation)
  • Full evaluation of GER-L and ranking strategies
  • Content-based recommendation: content $c \rightarrow \mathcal{N}(\mu(c), \sigma(c))$

Our papers

Multilabel Classification on Heterogeneous Graphs with Gaussian Embeddings (ECML 2016)
Modeling Relational Time Series using Gaussian Embeddings (NIPS Time Series workshop 2016)
Gaussian Embeddings for Collaborative Filtering (SIGIR 2017)
Représentations Gaussiennes pour le Filtrage Collaboratif (CORIA 2018)
Representation Learning for Classification in Heterogeneous Graphs with Application to Social Networks (TKDD 2018)