## Collector Data

### Get Qlib data (`bin file`)

  - get data: `python scripts/get_data.py qlib_data`
  - parameters:
    - `target_dir`: save dir, by default *~/.qlib/qlib_data/cn_data_5min*
    - `version`: dataset version, value from [`v2`], by default `v2`
      - `v2` end date is *2022-12*
    - `interval`: `5min`
    - `region`: `hs300`
    - `delete_old`: delete existing data from `target_dir` (*features, calendars, instruments, dataset_cache, features_cache*), value from [`True`, `False`], by default `True`
    - `exists_skip`: if data already exists in `target_dir`, skip `get_data`, value from [`True`, `False`], by default `False`
  - examples:
    ```bash
    # hs300 5min
    python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/hs300_data_5min --region hs300 --interval 5min
    ```
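
Once downloaded, the 5min data can be used like any other qlib dataset. A minimal loading sketch, assuming the `hs300_data_5min` target dir from the example above:

```python
# Sketch: initialize qlib with the downloaded 5min data and query a few fields.
import qlib
from qlib.config import REG_CN
from qlib.data import D

qlib.init(provider_uri="~/.qlib/qlib_data/hs300_data_5min", region=REG_CN)
df = D.features(D.instruments("all"), ["$close", "$volume"], freq="5min")
print(df.head())
```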

### Collect *Baostock high frequency* data to qlib
> collect *Baostock high frequency* data and *dump* it into `qlib` format.
> If the above ready-made data can't meet users' requirements, users can follow this section to crawl the latest data and convert it to qlib-data.
 1. download data to csv: `python scripts/data_collector/baostock_5min/collector.py download_data`

    This will download the raw data, such as date, symbol, open, high, low, close, volume, amount and adjustflag, from Baostock to a local directory, one file per symbol.
    - parameters:
      - `source_dir`: csv save directory
      - `interval`: `5min`
      - `region`: `HS300`
      - `start`: start datetime, by default *None*
      - `end`: end datetime, by default *None*
    - examples:
      ```bash
      # cn 5min data
      python collector.py download_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --start 2022-01-01 --end 2022-01-30 --interval 5min --region HS300
      ```
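
    To get a quick look at what was downloaded, the per-symbol csv files in `source_dir` can be inspected with pandas. A sketch; the exact file names inside `source_dir` are not specified here, so an arbitrary `*.csv` is picked:

    ```python
    # Sketch: inspect one downloaded per-symbol csv from source_dir.
    from pathlib import Path

    import pandas as pd

    source_dir = Path("~/.qlib/stock_data/source/hs300_5min_original").expanduser()
    csv_file = next(source_dir.glob("*.csv"))  # any downloaded symbol file
    df = pd.read_csv(csv_file)
    # expected columns: date, symbol, open, high, low, close, volume, amount, adjustflag
    print(csv_file.name, df.shape)
    print(df.head())
    ```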
 2. normalize data: `python scripts/data_collector/baostock_5min/collector.py normalize_data`

    This will (a pandas sketch of the idea follows the example below):
    1. Normalize the high, low, close and open prices using adjclose.
    2. Normalize the high, low, close and open prices so that the close price of the first valid trading date is 1.
    - parameters:
      - `source_dir`: csv directory
      - `normalize_dir`: result directory
      - `interval`: `5min`
        > if **`interval == 5min`**, `qlib_data_1d_dir` cannot be `None`
      - `region`: `HS300`
      - `date_field_name`: column *name* identifying time in the csv files, by default `date`
      - `symbol_field_name`: column *name* identifying symbol in the csv files, by default `symbol`
      - `end_date`: if not `None`, data is saved up to *end_date* (*including end_date*); if `None`, this parameter is ignored; by default `None`
      - `qlib_data_1d_dir`: qlib directory (1d data)
        if `interval == 5min`, `qlib_data_1d_dir` cannot be `None`, because normalizing 5min data requires the 1d data;
        ```bash
        # qlib_data_1d can be obtained like this:
        python scripts/get_data.py qlib_data --target_dir ~/.qlib/qlib_data/cn_data --interval 1d --region cn --version v3
        ```
    - examples:
      ```bash
      # normalize 5min cn
      python collector.py normalize_data --qlib_data_1d_dir ~/.qlib/qlib_data/cn_data --source_dir ~/.qlib/stock_data/source/hs300_5min_original --normalize_dir ~/.qlib/stock_data/source/hs300_5min_nor --region HS300 --interval 5min
      ```
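
    The sketch below illustrates the two normalization steps described above on a single symbol's dataframe; it assumes an `adjclose` column is available and omits the alignment against the 1d qlib data that the real collector performs:

    ```python
    # Sketch of the normalization idea (not the collector's actual implementation).
    import pandas as pd

    def normalize_prices(df: pd.DataFrame) -> pd.DataFrame:
        """df: one symbol, sorted by time, with open/high/low/close/adjclose columns."""
        df = df.copy()
        price_cols = ["open", "high", "low", "close"]
        # 1. adjust prices by the ratio of the adjusted close to the raw close
        factor = df["adjclose"] / df["close"]
        for col in price_cols:
            df[col] = df[col] * factor
        # 2. rescale so that the close of the first valid row equals 1
        first_close = df["close"].dropna().iloc[0]
        df[price_cols] = df[price_cols] / first_close
        return df
    ```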
 3. dump data: `python scripts/dump_bin.py dump_all`

    This will convert the normalized csv files into numpy arrays and store them under the `features` directory of `qlib_dir`, one file per column and one directory per symbol.

    - parameters:
      - `csv_path`: stock data path or directory, i.e. the **normalize result (normalize_dir)**
      - `qlib_dir`: qlib (dump) data directory
      - `freq`: transaction frequency, by default `day`
        > `freq_map = {1d: day, 5min: 5min}`
      - `max_workers`: number of threads, by default *16*
      - `include_fields`: dump fields, by default `""`
      - `exclude_fields`: fields not dumped, by default `""`
        > `dump_fields = include_fields if include_fields else set(symbol_df.columns) - set(exclude_fields) if exclude_fields else symbol_df.columns`
      - `symbol_field_name`: column *name* identifying symbol in the csv files, by default `symbol`
      - `date_field_name`: column *name* identifying time in the csv files, by default `date`
    - examples:
      ```bash
      # dump 5min cn
      python dump_bin.py dump_all --csv_path ~/.qlib/stock_data/source/hs300_5min_nor --qlib_dir ~/.qlib/qlib_data/hs300_5min_bin --freq 5min --exclude_fields date,symbol
      ```
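
    After dumping, the result can be sanity-checked by pointing qlib at the new directory, in the same way as the loading sketch earlier in this document:

    ```python
    # Sketch: verify the dumped 5min bin data loads through qlib.
    import qlib
    from qlib.data import D

    qlib.init(provider_uri="~/.qlib/qlib_data/hs300_5min_bin", region="cn")
    cal = D.calendar(freq="5min")
    print("calendar:", cal[0], "->", cal[-1])
    ```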