Commit 98f569e

add_baostock_collector (#1641)
Squashed commit history:

* add_baostock_collector
* modify_comments
* fix_pylint_error
* solve_duplication_methods
* modified the logic of update_data_to_bin
* modified the logic of update_data_to_bin
* optimize code
* optimize pylint issue
* fix pylint error
* changes suggested by the review
* fix CI faild
* fix CI faild
* fix issue 1121
* format with black
* optimize code logic
* optimize code logic
* fix error code
* drop warning during code runs
* optimize code
* format with black
* fix bug
* format with black
* optimize code
* optimize code
* add comments
1 parent ceff886 commit 98f569e

File tree

17 files changed (+722, -318 lines)


.github/workflows/test_qlib_from_source.yml (+1)
```diff
@@ -102,6 +102,7 @@ jobs:
     - name: Check Qlib with pylint
       run: |
         pylint --disable=C0104,C0114,C0115,C0116,C0301,C0302,C0411,C0413,C1802,R0401,R0801,R0902,R0903,R0911,R0912,R0913,R0914,R0915,R1720,W0105,W0123,W0201,W0511,W0613,W1113,W1514,E0401,E1121,C0103,C0209,R0402,R1705,R1710,R1725,R1735,W0102,W0212,W0221,W0223,W0231,W0237,W0612,W0621,W0622,W0703,W1309,E1102,E1136 --const-rgx='[a-z_][a-z0-9_]{2,30}$' qlib --init-hook "import astroid; astroid.context.InferenceContext.max_inferred = 500; import sys; sys.setrecursionlimit(2000)"
+        pylint --disable=C0104,C0114,C0115,C0116,C0301,C0302,C0411,C0413,C1802,R0401,R0801,R0902,R0903,R0911,R0912,R0913,R0914,R0915,R1720,W0105,W0123,W0201,W0511,W0613,W1113,W1514,E0401,E1121,C0103,C0209,R0402,R1705,R1710,R1725,R1735,W0102,W0212,W0221,W0223,W0231,W0237,W0246,W0612,W0621,W0622,W0703,W1309,E1102,E1136 --const-rgx='[a-z_][a-z0-9_]{2,30}$' scripts --init-hook "import astroid; astroid.context.InferenceContext.max_inferred = 500; import sys; sys.setrecursionlimit(2000)"

     # The following flake8 error codes were ignored:
     # E501 line too long
```
New file (+81 lines):
## Collector Data

### Get Qlib data (`bin file`)

- get data: `python scripts/get_data.py qlib_data`
- parameters:
  - `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data_5min*
  - `version`: dataset version, value from [`v2`], by default `v2`
    - `v2` end date is *2022-12*
  - `interval`: `5min`
  - `region`: `hs300`
  - `delete_old`: delete existing data from `target_dir` (*features, calendars, instruments, dataset_cache, features_cache*), value from [`True`, `False`], by default `True`
  - `exists_skip`: if data already exists in `target_dir`, skip `get_data`, value from [`True`, `False`], by default `False`
- examples:
  ```bash
  # hs300 5min
  python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/hs300_data_5min --region hs300 --interval 5min
  ```
### Collect *Baostock high frequency* data to qlib
> Collect *Baostock high frequency* data and *dump* it into `qlib` format.
> If the ready-made data above cannot meet users' requirements, users can follow this section to crawl the latest data and convert it to qlib data.

1. download data to csv: `python scripts/data_collector/baostock_5min/collector.py download_data`

   This will download raw data such as date, symbol, open, high, low, close, volume, amount, and adjustflag from Baostock to a local directory, one file per symbol.
   - parameters:
     - `source_dir`: save directory
     - `interval`: `5min`
     - `region`: `HS300`
     - `start`: start datetime, by default *None*
     - `end`: end datetime, by default *None*
   - examples:
     ```bash
     # cn 5min data
     python collector.py download_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --start 2022-01-01 --end 2022-01-30 --interval 5min --region HS300
     ```
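The "one file per symbol" layout of the download step can be illustrated with a small pandas sketch. This is not the collector's actual code: the toy rows, the output file naming, and the dot-stripping of Baostock codes (e.g. `sh.600000`) are assumptions for illustration, using the field names listed above.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for downloaded Baostock rows (hypothetical values,
# field names taken from the list above).
rows = pd.DataFrame({
    "date": ["2022-01-04", "2022-01-04"],
    "symbol": ["sh.600000", "sz.000001"],
    "open": [8.60, 16.40], "high": [8.70, 16.60],
    "low": [8.50, 16.30], "close": [8.65, 16.50],
    "volume": [1000, 2000], "amount": [8650.0, 33000.0],
    "adjustflag": [3, 3],
})

source_dir = tempfile.mkdtemp()
# One csv per symbol, mirroring the "one file per symbol" layout.
for symbol, df in rows.groupby("symbol"):
    filename = f"{symbol.replace('.', '')}.csv"  # e.g. sh600000.csv (assumed naming)
    df.to_csv(os.path.join(source_dir, filename), index=False)
```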
2. normalize data: `python scripts/data_collector/baostock_5min/collector.py normalize_data`

   This will:
   1. Normalize the high, low, close, and open prices using the adjusted close.
   2. Rescale the high, low, close, and open prices so that the close price on the first valid trading date is 1.
   - parameters:
     - `source_dir`: csv directory
     - `normalize_dir`: result directory
     - `interval`: `5min`
       > if **`interval == 5min`**, `qlib_data_1d_dir` cannot be `None`
     - `region`: `HS300`
     - `date_field_name`: column *name* identifying time in csv files, by default `date`
     - `symbol_field_name`: column *name* identifying symbol in csv files, by default `symbol`
     - `end_date`: if not `None`, normalize data up to and including *end_date*; if `None`, this parameter is ignored; by default `None`
     - `qlib_data_1d_dir`: qlib directory (1d data)

       if `interval == 5min`, `qlib_data_1d_dir` cannot be `None`: normalizing 5min data requires 1d data;
       ```bash
       # qlib_data_1d can be obtained like this:
       python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --interval 1d --region cn --version v3
       ```
   - examples:
     ```bash
     # normalize 5min cn
     python collector.py normalize_data --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --normalize_dir ~/.qlib/stock_data/source/hs300_5min_nor --region HS300 --interval 5min
     ```
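The two normalization steps above can be sketched in pandas. This is a simplified illustration, not the collector's actual implementation; the `adjclose` column name is an assumption based on the adjusted-price field discussed above.

```python
import pandas as pd

OHLC = ["open", "high", "low", "close"]

def normalize_ohlc(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the two normalization steps: adjust by adjusted close,
    then rescale so the first valid close price equals 1."""
    df = df.copy()
    # Step 1: adjust raw prices using the adjusted close.
    factor = df["adjclose"] / df["close"]
    for col in OHLC:
        df[col] = df[col] * factor
    # Step 2: rescale so the close on the first valid trading date is 1.
    first_close = df["close"].loc[df["close"].first_valid_index()]
    for col in OHLC:
        df[col] = df[col] / first_close
    return df
```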
3. dump data: `python scripts/dump_bin.py dump_all`

   This will convert the normalized csv files into numpy arrays and store the data under the `features` directory of `qlib_dir`, one file per column and one directory per symbol.

   - parameters:
     - `csv_path`: stock data path or directory, **the normalize result (`normalize_dir`)**
     - `qlib_dir`: qlib (dump) data directory
     - `freq`: transaction frequency, by default `day`
       > `freq_map = {1d: day, 5min: 5min}`
     - `max_workers`: number of threads, by default *16*
     - `include_fields`: fields to dump, by default `""`
     - `exclude_fields`: fields not to dump, by default `""`
       > `dump_fields = include_fields if include_fields else set(symbol_df.columns) - set(exclude_fields) if exclude_fields else symbol_df.columns`
     - `symbol_field_name`: column *name* identifying symbol in csv files, by default `symbol`
     - `date_field_name`: column *name* identifying time in csv files, by default `date`
   - examples:
     ```bash
     # dump 5min cn
     python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/hs300_5min_nor --qlib_dir ~/.qlib/qlib_data/hs300_5min_bin --freq 5min --exclude_fields date,symbol
     ```
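To a first approximation, each dumped `.bin` file is a flat little-endian float32 array whose first value is the offset of the symbol's first bar in the calendar, followed by the column values. The sketch below illustrates that layout; it is an assumption based on reading `scripts/dump_bin.py`, not the script's actual code, and the file name is hypothetical.

```python
import os
import tempfile

import numpy as np

def write_bin(path: str, first_index: int, values: np.ndarray) -> None:
    # qlib-style .bin layout (assumed): [calendar offset, v0, v1, ...] as float32
    np.hstack([first_index, values]).astype("<f").tofile(path)

def read_bin(path: str):
    data = np.fromfile(path, dtype="<f")
    return int(data[0]), data[1:]

# Write one column for one symbol, then read it back.
path = os.path.join(tempfile.mkdtemp(), "close.5min.bin")  # hypothetical file name
write_bin(path, 3, np.array([1.0, 1.01, 0.99]))
start, values = read_bin(path)
```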
