README.md (+3 -2)
@@ -6,9 +6,9 @@ This cross-platform sample tool detects exact and near duplicates of code mainta
 
 To run the near-duplicate detection run:
 ```
-$ dotnet run /path/to/DuplicateCodeDetector.csproj detect path/to/dataFolder outputFile
+$ dotnet run /path/to/DuplicateCodeDetector.csproj [options] --dir=<folder> <output-file-prefix>
 ```
-This will use all the `.jsonl.gz` files in the `dataFolder` and output an `outputFile` with the duplicate pairs and an `outputFile.json` with the groups of detected duplicates.
+This will use all the `.gz` files in the `<folder>` and output an `<output-file-prefix>.json` with the groups of detected duplicates. Invoke `--help` for more options.
 
 ### Input Data
 
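The README touched by this diff expects gzip-compressed JSON Lines input, one object per source file, each carrying a `tokens` list. As a rough illustration of that shape, the sketch below writes and re-reads such a file in Python. The `filename` key and the helper names `write_jsonl_gz` and `read_jsonl_gz` are assumptions made for the example; only the `tokens` field is visible in this diff.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_jsonl_gz(path, records):
    """Write one JSON object per line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl_gz(path):
    """Read the records back, skipping any blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# "filename" is a hypothetical identifier field; "tokens" is the field
# shown in the README excerpt.
records = [
    {"filename": "src/a.py", "tokens": ["def", "f", "(", ")", ":", "pass"]},
    {"filename": "src/b.py", "tokens": ["def", "g", "(", ")", ":", "pass"]},
]

# Write into a temporary folder so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as data_folder:
    sample_path = Path(data_folder) / "sample.jsonl.gz"
    write_jsonl_gz(sample_path, records)
    round_tripped = read_jsonl_gz(sample_path)
```

A folder of files written this way is what `--dir=<folder>` would presumably point at. If the records use different key names, the `--tokens-field` and `--id-fields` options added in this change are the documented way to tell the tool about them.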
@@ -19,6 +19,7 @@ The input data should be one or more `.jsonl.gz` files. These are compressed [JS
 "tokens" : ["list", "of", "tokens", "in", "file"]
 }
 ```
+Alternative formats can be accepted by providing the `--tokens-field` and `--id-fields` options.
 
 The `tokenizers` folder in this repository contains tokenizers for
 C\#, Java, JavaScript and Python. Please, feel free to contribute tokenizers for other languages too.