Commit a1cf695

former -a option standard, improve efficiency

Author: garrafao
1 parent c3def01 commit a1cf695

7 files changed: +69 -239 lines changed

README.md (+17 -14)
@@ -64,7 +64,7 @@ The scripts assume a corpus format of one sentence per line in UTF-8 encoded (op
 | Count | `representations/count.py` | VSM | |
 | PPMI | `representations/ppmi.py` | VSM | |
 | SVD | `representations/svd.py` | VSM | |
-| RI | `representations/ri.py` | VSM | - use `-a` for good performance |
+| RI | `representations/ri.py` | VSM | |
 | SGNS | `representations/sgns.py` | VSM | |
 | SCAN | [repository](https://github.com/ColiLea/scan) | TPM | - different corpus input format |
 
@@ -75,7 +75,7 @@ Table: VSM=Vector Space Model, TPM=Topic Model
 |Name | Code | Applicability | Comment |
 | --- | --- | --- | --- |
 | CI | `alignment/ci_align.py` | Count, PPMI | |
-| SRV | `alignment/srv_align.py` | RI | - use `-a` for good performance <br> - consider using the efficient and more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
+| SRV | `alignment/srv_align.py` | RI | - consider using more powerful [TRIPY](https://github.com/Garrafao/TRIPY) |
 | OP | `alignment/map_embeddings.py` | SVD, RI, SGNS | - drawn from [VecMap](https://github.com/artetxem/vecmap) <br> - for OP- and OP+ see `scripts/` |
 | VI | `alignment/sgns_vi.py` | SGNS | - bug fixes 27/12/19 (see script for details) |
 | WI | `alignment/wi.py` | Count, PPMI, SVD, RI, SGNS | - consider using the more advanced [Temporal Referencing](https://github.com/Garrafao/TemporalReferencing) |
@@ -99,11 +99,11 @@ Find detailed notes on model performances and optimal parameter settings in [the
 
 The evaluation framework of this repository is based on the comparison of a set of target words across two corpora. Hence, models can be evaluated on a triple (dataset, corpus1, corpus2), where the dataset provides gold values for the change of target words between corpus1 and corpus2.
 
-| Dataset | Corpus 1 | Corpus 2 | Download | Comment |
-| --- | --- | --- | --- | --- |
-| DURel | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/data/durel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/durel/` |
-| SURel | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/data/surel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/surel/` |
-| SemCor LSC | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | |
+| Dataset | Language | Corpus 1 | Corpus 2 | Download | Comment |
+| --- | --- | --- | --- | --- | --- |
+| DURel | German | DTA18 | DTA19 | [Dataset](https://www.ims.uni-stuttgart.de/data/durel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/durel/` |
+| SURel | German | SDEWAC | COOK | [Dataset](https://www.ims.uni-stuttgart.de/data/surel), [Corpora](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/wocc) | - version from Schlechtweg et al. (2019) at `testsets/surel/` |
+| SemCor LSC | English | SEMCOR1 | SEMCOR2 | [Dataset](https://www.ims.uni-stuttgart.de/data/lsc-simul), [Corpora](https://www.ims.uni-stuttgart.de/data/lsc-simul) | |
 
 We provide several evaluation pipelines, downloading the corpora and evaluating the models on the above-mentioned datasets, see [pipeline](#pipeline).
 
@@ -140,6 +140,7 @@ As is the scripts will reproduce the results from Schlechtweg et al. (2019) and
 
 - September 1, 2019: Python scripts were updated from Python 2 to Python 3.
 - December 27, 2019: bug fixes in `alignment/sgns_vi.py` (see script for details)
+- March 23, 2020: updates in `representations/ri.py` and `alignment/srv_align.py` (see scripts for details)
 
 ### Error Sources
 
@@ -153,19 +154,21 @@ BibTex
 title = {{A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains}},
 author = {Dominik Schlechtweg and Anna H\"{a}tty and Marco del Tredici and Sabine {Schulte im Walde}},
 booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
-year = {2019},
-address = {Florence, Italy},
-publisher = {Association for Computational Linguistics},
-pages = {732--746}
+year = {2019},
+address = {Florence, Italy},
+publisher = {Association for Computational Linguistics},
+pages = {732--746},
+doi = {10.18653/v1/P19-1072}
 }
 ```
 ```
 @inproceedings{SchlechtwegWalde20,
 title = {{Simulating Lexical Semantic Change from Sense-Annotated Data}},
 author = {Dominik Schlechtweg and Sabine {Schulte im Walde}},
 year = {2020}
-booktitle = {{The Evolution of Language: Proceedings of the 13th International Conference (EVOLANGXIII)}},
-editor = {C. Cuskley and M. Flaherty and H. Little and Luke McCrohon and A. Ravignani and T. Verhoef},
-publisher = {Online at {}},
+booktitle = {{The Evolution of Language: Proceedings of the 13th International Conference (EvoLang13)}},
+editor = {Ravignani, A. and Barbieri, C. and Martins, M. and Flaherty, M. and Jadoul, Y. and Lattenkamp, E. and Little, H. and Mudd, K. and Verhoef, T.},
+url = {http://brussels.evolang.org/proceedings/paper.html?nr=9},
+doi = {10.17617/2.3190925}
 }
 ```
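
The README text in the diff above describes evaluation on a triple (dataset, corpus1, corpus2), where the dataset supplies gold change values per target word. The following is a minimal sketch of such a comparison, assuming Spearman's rank correlation as the measure (the README text shown in this diff does not name one). The function names and the word scores are hypothetical and only illustrate the idea.

```python
from statistics import mean


def rank(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the ranks (undefined if either side is constant)."""
    rx, ry = rank(xs), rank(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def evaluate(gold, predicted):
    """Correlate predictions with gold values on the shared target words."""
    targets = sorted(set(gold) & set(predicted))
    return spearman([gold[w] for w in targets], [predicted[w] for w in targets])


# Hypothetical gold change values and model predictions.
gold = {"word_a": 0.1, "word_b": 0.5, "word_c": 0.9}
predicted = {"word_a": 0.2, "word_b": 0.4, "word_c": 0.7}
score = evaluate(gold, predicted)
```

Restricting the comparison to the shared target words keeps the sketch usable when a model produces no score for some gold entries.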

alignment/sgns_vi.py (+2 -2)
@@ -26,7 +26,7 @@ def main():
     Arguments:
 
         <modelPath> = model for initialization
-        <corpDir> = path to corpus directory with zipped files, each sentence in form 'year\tword1 word2 word3...'
+        <corpDir> = path to corpus directory with zipped files
         <outPath> = output path for vectors
 
     Options:
@@ -58,7 +58,7 @@ def main():
     # Load model
     model = Word2Vec.load(modelPath)
 
-    # Intersect vocabulary
+    # Build vocabulary
     vocab_sentences = PathLineSentences(corpDir)
     logging.getLogger('gensim').setLevel(logging.ERROR)
     model.build_vocab(vocab_sentences, update=True)
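
The `<corpDir>` argument above points to a directory of zipped corpus files, which the script hands to gensim's `PathLineSentences`. As a rough illustration of that input format only (not gensim's implementation), the stdlib sketch below yields one tokenized sentence per line from every `.gz` file in a directory. The function name `iter_sentences` and the file name `corpus1.txt.gz` are hypothetical.

```python
import gzip
import os
import tempfile


def iter_sentences(corp_dir):
    """Yield each non-empty line of every .gz file in corp_dir as a token list."""
    for name in sorted(os.listdir(corp_dir)):
        if not name.endswith(".gz"):
            continue
        path = os.path.join(corp_dir, name)
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:
                    yield tokens


# Build a tiny throwaway corpus directory and read it back.
with tempfile.TemporaryDirectory() as corp_dir:
    corpus_path = os.path.join(corp_dir, "corpus1.txt.gz")
    with gzip.open(corpus_path, "wt", encoding="utf-8") as handle:
        handle.write("word1 word2 word3\n\nword4 word5\n")
    sentences = list(iter_sentences(corp_dir))
```

Reading lazily with a generator mirrors the one-sentence-per-line corpus format without loading a whole corpus into memory.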

alignment/srv_align.py (+21 -121)
@@ -6,7 +6,7 @@
 import time
 import numpy as np
 from sklearn.random_projection import sparse_random_matrix
-from scipy.sparse import lil_matrix, csc_matrix, hstack, vstack
+from scipy.sparse import csr_matrix
 from utils_ import Space
 
 
@@ -20,21 +20,19 @@ def main():
     args = docopt('''Create two aligned low-dimensional vector spaces by sparse random indexing from two co-occurrence matrices.
 
     Usage:
-        srv_align.py [-l] (-s <seeds> | -a) <matrixPath1> <matrixPath2> <outPath1> <outPath2> <outPathElement> <dim> <t>
+        srv_align.py [-l] <matrixPath1> <matrixPath2> <outPath1> <outPath2> <dim>
 
-        <seeds> = number of non-zero values in each random vector
         <matrixPath1> = path to matrix1
         <matrixPath2> = path to matrix2
         <outPath1> = output path for aligned space 1
         <outPath2> = output path for aligned space 2
-        <outPathElement> = output path for elemental space (context vectors)
         <dim> = number of dimensions for random vectors
-        <t> = threshold for downsampling (if t=None, no subsampling is applied)
 
     Options:
         -l, --len  normalize final vectors to unit length
-        -s, --see  specify number of seeds manually
-        -a, --aut  calculate number of seeds automatically as proposed in [1,2]
+
+    Note:
+        Assumes intersected and ordered columns. Parameters -s, -a and <t> have been removed from an earlier version for efficiency. Also, columns are now intersected instead of unified.
 
     References:
         [1] Ping Li, T. Hastie and K. W. Church, 2006,
@@ -46,134 +44,37 @@ def main():
     ''')
 
     is_len = args['--len']
-    is_seeds = args['--see']
-    if is_seeds:
-        seeds = int(args['<seeds>'])
-    is_aut = args['--aut']
     matrixPath1 = args['<matrixPath1>']
     matrixPath2 = args['<matrixPath2>']
     outPath1 = args['<outPath1>']
     outPath2 = args['<outPath2>']
-    outPathElement = args['<outPathElement>']
     dim = int(args['<dim>'])
-    if args['<t>']=='None':
-        t = None
-    else:
-        t = float(args['<t>'])
 
 
     logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
     logging.info(__file__.upper())
     start_time = time.time()
 
     # Load input matrices
-    space1 = Space(matrixPath1)
-    matrix1 = space1.matrix
-    space2 = Space(matrixPath2)
-    matrix2 = space2.matrix
-
-    # Get mappings between rows/columns and words
-    rows1 = space1.rows
-    id2row1 = space1.id2row
-    row2id1 = space1.row2id
-    columns1 = space1.columns
-    column2id1 = space1.column2id
-    rows2 = space2.rows
-    id2row2 = space2.id2row
-    row2id2 = space2.row2id
-    columns2 = space2.columns
-    column2id2 = space2.column2id
+    countSpace1 = Space(matrixPath1)
+    countMatrix1 = countSpace1.matrix
+    rows1 = countSpace1.rows
+    columns1 = countSpace1.columns
+
+    countSpace2 = Space(matrixPath2)
+    countMatrix2 = countSpace2.matrix
+    rows2 = countSpace2.rows
+    columns2 = countSpace2.columns
 
-    # Get union of rows and columns in both spaces
-    unified_rows = sorted(list(set(rows1).union(rows2)))
-    unified_columns = sorted(list(set(columns1).union(columns2)))
-    columns_diff1 = sorted(list(set(unified_columns) - set(columns1)))
-    columns_diff2 = sorted(list(set(unified_columns) - set(columns2)))
+    # Generate random vectors
+    randomMatrix = csr_matrix(sparse_random_matrix(dim,len(columns1)).toarray().T)
 
-    # Get mappings of indices of columns in original spaces to indices of columns in unified space
-    c2i = {w: i for i, w in enumerate(unified_columns)}
-    cj2i1 = {j: c2i[w] for j, w in enumerate(columns1+columns_diff1)}
-    cj2i2 = {j: c2i[w] for j, w in enumerate(columns2+columns_diff2)}
-
-    if t!=None:
-        rows_diff1 = list(set(unified_rows) - set(rows1))
-        rows_diff2 = list(set(unified_rows) - set(rows2))
-
-        r2i = {w: i for i, w in enumerate(unified_rows)}
-        rj2i1 = {j: r2i[w] for j, w in enumerate(rows1+rows_diff1)}
-        rj2i2 = {j: r2i[w] for j, w in enumerate(rows2+rows_diff2)}
-
-        # Build spaces with unified COLUMNS
-        new_columns1 = csc_matrix((len(rows1),len(columns_diff1))) # Get empty columns for additional context words
-        unified_matrix1 = csc_matrix(hstack((matrix1,new_columns1)))[:,sorted(cj2i1, key=cj2i1.get)] # First concatenate matrix and empty columns and then order columns according to unified_columns
-
-        new_columns2 = csc_matrix((len(rows2),len(columns_diff2)))
-        unified_matrix2 = csc_matrix(hstack((matrix2,new_columns2)))[:,sorted(cj2i2, key=cj2i2.get)]
+    logging.info("Multiplying matrices")
+    reducedMatrix1 = np.dot(countMatrix1,randomMatrix)
+    reducedMatrix2 = np.dot(countMatrix2,randomMatrix)
 
-        # Build spaces with unified ROWS
-        new_rows1 = csc_matrix((len(rows_diff1),len(unified_columns)))
-        final_unified_matrix1 = csc_matrix(vstack((unified_matrix1,new_rows1)))[sorted(rj2i1, key=rj2i1.get)]
-
-        new_rows2 = csc_matrix((len(rows_diff2),len(unified_columns)))
-        final_unified_matrix2 = csc_matrix(vstack((unified_matrix2,new_rows2)))[sorted(rj2i2, key=rj2i2.get)]
-
-        # Add up final unified matrices
-        common_unified_matrix = np.add(final_unified_matrix1,final_unified_matrix2)
-
-        # Get number of total occurrences of any word
-        totalOcc = np.sum(common_unified_matrix)
-
-        # Define function for downsampling
-        downsample = lambda f: np.sqrt(float(t)/f) if f>t else 1.0
-        downsample = np.vectorize(downsample)
-
-        # Get total normalized co-occurrence frequency of all contexts in both spaces
-        context_freqs = np.array(common_unified_matrix.sum(axis=0)/totalOcc)[0]
-
-
-    ## Generate ternary random vectors
-    if is_seeds:
-        elementalMatrix = lil_matrix((len(unified_columns),dim))
-        # Generate base vector for random vectors
-        baseVector = np.zeros(dim) # Note: Make sure that number of seeds is not greater than dimensions
-        for i in range(0,int(seeds/2)):
-            baseVector[i] = 1.0
-        for i in range(int(seeds/2),seeds):
-            baseVector[i] = -1.0
-        for i in range(len(unified_columns)): # To-do: make this more efficient by generating random indices for a whole array
-            np.random.shuffle(baseVector)
-            elementalMatrix[i] = baseVector
-    if is_aut:
-        elementalMatrix = sparse_random_matrix(dim,len(unified_columns)).T
-
-    # Initialize target vectors
-    alignedMatrix1 = np.zeros((len(rows1),dim))
-    alignedMatrix2 = np.zeros((len(rows2),dim))
-
-
-    # Iterate over rows of space, find context words and update aligned matrix with low-dimensional random vectors of these context words
-    for (matrix,id2row,cj2i,alignedMatrix) in [(matrix1,id2row1,cj2i1,alignedMatrix1),(matrix2,id2row2,cj2i2,alignedMatrix2)]:
-        # Iterate over targets
-        for i in id2row:
-            # Get co-occurrence values as matrix
-            m = matrix[i]
-            # Get nonzero indexes
-            nonzeros = m.nonzero()
-            nonzeros = [cj2i[j] for j in nonzeros[1]]
-            data = m.data
-            pos_context_vectors = elementalMatrix[nonzeros]
-            if t!=None:
-                # Apply subsampling
-                rfs = context_freqs[nonzeros]
-                rfs = downsample(rfs)
-                data *= rfs
-            # Weight context vectors by occurrence frequency
-            pos_context_vectors = pos_context_vectors.multiply(data.reshape(-1,1))
-            # Add up context vectors and store as row for target
-            alignedMatrix[i] = np.sum(pos_context_vectors, axis=0)
-
-    outSpace1 = Space(matrix=alignedMatrix1, rows=rows1, columns=[])
-    outSpace2 = Space(matrix=alignedMatrix2, rows=rows2, columns=[])
+    outSpace1 = Space(matrix=reducedMatrix1, rows=rows1, columns=[])
+    outSpace2 = Space(matrix=reducedMatrix2, rows=rows2, columns=[])
 
     if is_len:
         # L2-normalize vectors
@@ -183,7 +84,6 @@ def main():
     # Save the matrices
     outSpace1.save(outPath1)
     outSpace2.save(outPath2)
-    Space(matrix=elementalMatrix, rows=unified_columns, columns=[]).save(outPathElement)
 
     logging.info("--- %s seconds ---" % (time.time() - start_time))
 
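
The `alignment/srv_align.py` change above replaces the per-row loop with one matrix product per space: both count matrices are multiplied by the same sparse random matrix, and the script's note says it assumes intersected, identically ordered columns. The sketch below illustrates that idea with the standard library only. It is a simplified stand-in, not the script's code: the dict-based sparse format, the function names, the toy counts, and the unscaled ±1 entries (with density `1/sqrt(number of columns)`) are all assumptions, and it does not reproduce the scaling of scikit-learn's `sparse_random_matrix`.

```python
import math
import random


def random_vectors(columns, dim, rng):
    """Draw one sparse ternary random vector per shared context word."""
    density = 1.0 / math.sqrt(len(columns))
    vectors = {}
    for column in columns:
        vector = {}
        for d in range(dim):
            if rng.random() < density:
                vector[d] = rng.choice((-1.0, 1.0))
        vectors[column] = vector
    return vectors


def project(space, vectors, dim):
    """Sparse matrix product: each target row is the sum of count * random vector."""
    reduced = {}
    for target, contexts in space.items():
        row = [0.0] * dim
        for column, count in contexts.items():
            # Context words outside the shared columns contribute nothing.
            for d, value in vectors.get(column, {}).items():
                row[d] += count * value
        reduced[target] = row
    return reduced


def l2_normalize(row):
    """Scale a row to unit length; all-zero rows are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm > 0 else row


# Hypothetical toy count spaces: target word -> {context word: count}.
space1 = {"walk": {"dog": 2, "park": 1, "old": 3}}
space2 = {"walk": {"dog": 1, "park": 4, "app": 2}}

# Intersect the columns and fix one shared order for both spaces.
columns1 = {c for contexts in space1.values() for c in contexts}
columns2 = {c for contexts in space2.values() for c in contexts}
shared = sorted(columns1 & columns2)

dim = 8
vectors = random_vectors(shared, dim, random.Random(0))
reduced1 = project(space1, vectors, dim)
reduced2 = project(space2, vectors, dim)
unit1 = l2_normalize(reduced1["walk"])
```

Because both spaces are projected with the same random vectors, a context word contributes to the same dimensions in both outputs, which is what makes the two reduced spaces comparable.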