Mini hackathon: Sparse Local Embeddings for Extreme Multi-label Classification
what and why: mini hackathon
a mini hackathon is a personal coding event in which:
I spend one day coding up the method(s) from a research paper
why am I doing this?
because doing a PhD requires reading a lot of papers while leaving little time for coding, which is a lot of fun for me.
too much reading and too little coding makes me a dull boy (i’m not saying reading is not fun).
plus, coding up the methods from my favourite papers helps me understand the methodologies better.
project description
the goal is to implement the method in Sparse Local Embeddings for Extreme Multi-label Classification, NIPS 2015
main components of the method (a rough end-to-end sketch follows the list):
- clustering of the data points
- building a kNN graph from the label vectors
- embedding learning from the graph: essentially a low-rank matrix approximation problem
- linear regression from the input features to the learned local embeddings
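to make the flow concrete, here is a rough sketch of the training side, with TruncatedSVD standing in for the paper's l1-regularized embedding step (so this is an approximation of the pipeline, not the author's code; all parameter values are placeholders):

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import kneighbors_graph
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import Ridge

def train(X, Y, n_clusters=8, n_neighbors=10, embed_dim=50):
    # 1. cluster the data points on the input features
    clusters = KMeans(n_clusters=n_clusters).fit_predict(X)
    models = []
    for c in range(n_clusters):
        Xc, Yc = X[clusters == c], Y[clusters == c]
        # 2. kNN graph over the label vectors (cosine instead of dot product, see below)
        G = kneighbors_graph(Yc, n_neighbors, metric="cosine", mode="connectivity")
        # 3. low-rank embeddings that approximate the graph (SVD stand-in for the
        #    paper's l1-regularized low-rank step)
        Z = TruncatedSVD(n_components=min(embed_dim, Yc.shape[0] - 1)).fit_transform(G)
        # 4. linear regression from input features to the local embeddings
        V = Ridge(alpha=1.0).fit(Xc, Z)
        models.append((V, Z, Yc))
    return clusters, models
```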
technique choices
to avoid re-inventing the wheel, I tried to use existing libraries as much as possible
clustering
I used sklearn.cluster.KMeans. The paper's MATLAB code also uses k-means.
Neat and handy.
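for example (a minimal sketch; X_train and the cluster count are placeholders, not tuned values):

```python
from sklearn.cluster import KMeans

# partition the training points into local regions
# (X_train and n_clusters are placeholders, not tuned values)
kmeans = KMeans(n_clusters=8, random_state=0)
cluster_ids = kmeans.fit_predict(X_train)  # one cluster id per training point
```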
building kNN graph
sklearn.neighbors.kneighbors_graph is a handy choice for building such a graph.
it’s fast and allows parallelization. I learned that similarity search structures such as kd-trees can be used to speed up the process.
one minor detail: the paper uses dot-product similarity to build the graph, which is not supported in sklearn,
so I switched to cosine distance.
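something like this (a sketch, with Y_cluster standing for the label matrix of one cluster):

```python
from sklearn.neighbors import kneighbors_graph

# sparse n x n graph connecting each label vector to its k nearest neighbours,
# using cosine distance in place of the paper's dot-product similarity
knn_graph = kneighbors_graph(
    Y_cluster,            # label matrix of one cluster (placeholder name)
    n_neighbors=10,
    metric="cosine",
    mode="connectivity",
    include_self=False,
    n_jobs=-1,            # parallelize across cores
)
```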
low-rank matrix completion
implicit is a fast library for matrix factorization (although the library is intended for collaborative filtering, it suits our case).
one detail: the library only implements the l2 norm, whereas the paper uses l1 regularization to enforce sparsity, so I’m not able to apply l1 for now.
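for reference, a rough sketch of how the factorization could be run with implicit (the regularization here is l2, not the paper's l1, and the hyperparameters are placeholders):

```python
import implicit

# factorize the kNN graph into low-rank factors; the regularization here is l2,
# not the l1 penalty used in the paper
model = implicit.als.AlternatingLeastSquares(
    factors=50, regularization=0.01, iterations=15,
)
# which orientation (item-user vs user-item) fit expects depends on the
# implicit version; knn_graph is a square CSR matrix here, so it works either way
model.fit(knn_graph)
embeddings = model.item_factors  # one low-dimensional embedding per data point
```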
linear regression
As l1 is not enforced for the embeddings, I used Ridge, which only enforces l2.
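a minimal sketch, assuming X_cluster and Z_cluster hold the features and learned embeddings for one cluster:

```python
from sklearn.linear_model import Ridge

# learn a linear map from input features to the local embeddings,
# with an l2 penalty instead of the paper's l1
# (X_cluster and Z_cluster are placeholder names)
regressor = Ridge(alpha=1.0)
regressor.fit(X_cluster, Z_cluster)   # multi-output: one target per embedding dimension
V = regressor.coef_                   # shape (embed_dim, n_features)
```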
preliminary result
I tested on the bibtex dataset and evaluated using precision at k (p@k); a sketch of how p@k is computed follows the numbers:
- p@1: 0.5964 (0.6532)
- p@3: 0.3455 (0.3973)
- p@5: 0.2461 (0.2889)
scores in parentheses are from the paper
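p@k simply counts how many of the top-k predicted labels are actually relevant; a small sketch of the computation (assuming dense score and label matrices):

```python
import numpy as np

def precision_at_k(scores, Y_true, k):
    """scores: (n_samples, n_labels) predicted scores; Y_true: binary label matrix."""
    top_k = np.argsort(-scores, axis=1)[:, :k]        # indices of the k highest-scoring labels
    hits = np.take_along_axis(Y_true, top_k, axis=1)  # 1 where a predicted label is relevant
    return hits.mean(axis=1).mean()                   # per-sample p@k, averaged over samples
```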
update (2017-11-08)
- the embedding is replaced by VX instead of Z
- added a tensorflow version that also puts an l1 penalty on VX
- kNN prediction with neighbour weights (see the sketch after this list)
- parameter tuning on the elastic net and tensorflow versions
- for bibtex, achieved p@1 (0.6215), p@3 (0.3716) and p@5 (0.2697), a ~2% increase from the first round
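the weighted kNN prediction looks roughly like this (a sketch, not the exact code; Z_train/Z_test stand for the embeddings of training and test points, and dense arrays are assumed):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_predict(Z_test, Z_train, Y_train, k=10):
    """Score labels by a similarity-weighted vote over the k nearest
    training points in embedding space (sketch, not the exact code)."""
    nn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(Z_train)
    dist, idx = nn.kneighbors(Z_test)       # (n_test, k) distances and neighbour indices
    weights = 1.0 - dist                    # cosine similarity as the vote weight
    # weighted sum of the neighbours' label vectors
    return np.einsum("tk,tkl->tl", weights, Y_train[idx])
```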
however, I just noticed that the dataset does not include a validation set,
so the results are not very reliable.
source code
some statistics:
I managed to code up the method with far less code than the paper's authors:
- the paper's authors wrote 4861 lines of MATLAB code and 3924 lines of C++ code
- I wrote 316 lines of python code by using existing software packages